Survivor Bias

I’ve been so busy lately I haven’t had a chance to write anything.

Of course part of the problem isn’t having ideas to write about, but time to write about them.

I think perhaps I should focus more on writing SOMETHING, even if it’s just a short post, than trying to write the Great American Blog post.

In this case I’m going to actually post a link to a great article on survivorship bias.  This is the sort of article I wish I had written myself.  As I’ve mentioned part of my point here is to get one to think about HOW we think.

The story of the bomber survivors was first related to me by a good friend in college, but without a source. Now at least I have a source for it.

In a similar vein, and the article touches upon it, people will talk about how well the older homes in weather-prone areas were built because they’re still standing decades after they were built, despite hurricanes or floods or blizzards. These folks completely miss the other 90% of homes from those eras that didn’t survive.

Years ago, my father bought and rehabilitated what we believe to have been the oldest house in town (in fact technically it was older than the town and probably where the town charter was signed.)

There really wasn’t anything about the construction that stood out that made it survive. Just luck at this point.  A single fire at any point in time could have made the second oldest house in town the oldest.

In closing, this article doesn’t represent most of my thoughts over the past 6 months, only the ones that survived to the publishing stage.

Newspapers and paradigm shifts

When I was fairly young, I learned a detail about newspaper advertising.  The space on the lower-outside right-hand page was worth more than the lower-left inside page (i.e. along the fold).

If you think about how folks read and flip through newspapers, this makes sense.  It’s an area more likely to be seen than others.

With news, there are the terms “above the fold” and “below the fold”.  Obviously, you want the big news article on the front page, above the fold where it’s most likely to be seen.

When laying out a newspaper, there is over a century of experience in how to do things.  You don’t jump a front page news article to a page in the middle of the sports section; for the most part, you don’t run box-scores on the front page (unless perhaps it’s an upset at the Super Bowl or something else that will garner eyeballs); you don’t scatter sections of your newspaper across different pages.

Years ago, I was proud to be part of one of the first newspaper web application service providers, “PowerAdz” (which later became PowerOne Media, and then later most of it was bought by TownNews.)

Even back then, I realized much of what was known about newspaper layout was going to have to change. There was no longer a physical fold in the newspaper.  There was a bottom edge to a browser window, and that still meant you needed the important news at the top.  But how long should it run down the “page”?  How many pixels did the viewer have before the bottom edge of the window?  What was the width of your front page?

You also weren’t limited by the physical size of a page.  Articles could run on as long as readers were willing to scroll.  Or was having a reasonably sized page with links to following pages better?

Much of this is still in flux, and I suspect it will continue to be for years to come.  Heck, just the fact that articles can have hyperlinks to other articles, or to background information, makes news on web pages very different from the traditional print medium.

What reminded me of this today was seeing yet another comment on a CNN fluff piece that was linked off of the front page.  The commenter was complaining that “this is news?”

Someone replied it was under the Entertainment section. Another rebutted “yeah, but it’s on the front news page.”

That reminded me of these thoughts. What is the front page any more? Even though you can click to different sections of CNN, it’s not like a traditional newspaper where you have physically separate sections, each with its own front page.  Now it’s all virtual and a front page is simply as you define it.

I think ultimately we have to let go of our definition of the front page of a news site and accept that links to news, fluff pieces and the like will all end up there.  Sure, there will be sections within the page, but to complain there’s sports, or entertainment, or other non-traditional news links off the front page will be like complaining you don’t have to unscroll the papyrus in the correct direction to read it: a sign of an older time.

Times change, but more importantly the medium changes, even if the message doesn’t.

 

Git ‘r Done (part 2)

Someone recently forwarded the following article to me: “Get Shit Done: The Worst Startup Culture Ever”.  Before reading it I was a bit ready to disagree. (see my previous post on getting stuff done.)

But after reading this article, I have to agree with its premise, and point out that I think there are two different ways of looking at what “Get Stuff Done” can mean.

At my current assignment, a coworker and I were joking about how some people had so many letters after their names, like PMP, CAPM, PMI-SP and the like.

So we joked we needed some letters and we settled on GSD – Get stuff done.  At times on this particular project we seemed to be the only ones accomplishing much or caring about accomplishing much. We had one person who was more concerned with the agenda of the meeting every day (yes, daily meetings to see why the project wasn’t getting done.  With 5-6 people in that room, that’s 25 or more person-hours per week of discussing why things weren’t getting done.)

So in that context (“decide what your goal is, and actually GET IT DONE”), I think “Get ‘r Done” is an important concept.

On the other hand, I have seen (and fallen prey to myself, both as a manager and as an employee) the “Get ‘r Done” attitude described in the above article.

The project above I was working on never got done.  It wasn’t for lack of effort on the part of myself and several others that it didn’t get done. It was, though, for the lack of effort on the part of management that it never got done.  At one point they asked me what could be done to make sure the project could be completed on time. I gave them several examples of areas where they could put some pressure on another group to streamline some procedures.

I was basically told that wasn’t going to happen, and that I had to work harder and “get ‘r done”.  At this phase of the project, I needed 4-5 items from another group and the other group had a policy that each item needed a separate ticket.  Each ticket had to be done sequentially and could only be submitted when the previous ticket was closed out.  Oh, and their policy was 2 weeks per ticket.  Period.

So, by my math, that’s 8-10 weeks. That assumes every ticket goes smoothly, which had not been our experience with this other group.

The project due date was in 6 weeks.

So, I was being told to get things done, in an impossible fashion.  Talk about demotivating.
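For anyone who wants the math spelled out, here is a tiny sketch of it. The function names are mine and purely illustrative; the only real inputs are the numbers above (4-5 tickets, 2 weeks each, strictly one after another, 6 weeks to the deadline).

```python
def weeks_needed(num_tickets: int, weeks_per_ticket: int) -> int:
    """Tickets are strictly sequential, so their durations simply add up."""
    return num_tickets * weeks_per_ticket


def tickets_that_fit(deadline_weeks: int, weeks_per_ticket: int) -> int:
    """How many sequential tickets can close before the deadline."""
    return deadline_weeks // weeks_per_ticket


if __name__ == "__main__":
    deadline_weeks = 6
    for tickets in (4, 5):
        needed = weeks_needed(tickets, weeks_per_ticket=2)
        overrun = needed - deadline_weeks
        print(f"{tickets} tickets: {needed} weeks needed, {overrun} past the deadline")
    print("Tickets that fit:", tickets_that_fit(deadline_weeks, weeks_per_ticket=2))
```

Even in the best case, only three of the tickets could close before the due date, and that assumes nothing goes wrong.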

In general, I’ve been my best as a manager when I’ve been given the tools to let my team get the job done. It may be buying them dinner one night as a morale boost. It may be making sure no extra work gets thrust upon them, or keeping certain other managers from trying to add to their work queue. In one case, it was buying a new NAS so we had enough storage space that we weren’t getting paged every night about disk space issues. When properly motivated, people can move mountains and better yet, can often do it in a normal work week.

So, if you want to get it done, make sure your team members have the tools to do their jobs, aren’t being distracted, and aren’t being given reasons to have low morale.  They’ll move mountains for you. But ask them to work harder without any of the above, and sooner or later you’ll find yourself without a team, and your boss simply asking you to work harder!

By the way, on that NAS, I think that $5K investment probably helped keep a key employee of mine from jumping ship for greener pastures.  That NAS was probably a better investment than if we had tried to offer him another $5K to keep him happy despite the lack of sleep from all the pages and other issues.

Moral: You want them to “get ‘r done”, give them the tools they need, remove barriers and keep morale up.  They’ll get it done.

Link

As a middle manager in several start-ups I’ve had to deal with being short of resources of all kinds.  But, at the height of the first dot-com bubble, I had a great team.  No, not all of them were equals, but each pulled their weight and each could be relied on to perform well, in their area of expertise.

One guy was a great troubleshooter.  He’d leave no stone unturned and you could tell if a problem was bothering him since he’d fixate on it until he understood it AND had solved the root cause.  It wasn’t good enough for him to fix the current problem. He wanted to make sure it couldn’t happen again.  However, what he wasn’t good at was the rote, boring procedures. “Install this package in exactly this way, following these steps.” He’d tend to go off script and sometimes that caused problems.

On the other hand, I had another guy who was about 2 decades older and not from an IT background.  Troubleshooting wasn’t his forte and he honestly didn’t have the skill set to do a great job at it.

However, he excelled at the mundane, routine, rote tasks. Now this may sound like a slight, but far from it. The truth is in most cases in IT, you’re dealing with the routine, rote tasks. In an ideal world, you might never have emergencies.

Now, this wasn’t to say he couldn’t solve most problems as they came up.  Simply if it was overly complex or rather obscure, it wasn’t his forte.

I learned that when I wanted to get stuff done, assigning the routine stuff to him worked far better than assigning it to the first guy.  And just the reverse: if I had some weird problem I needed debugged that wasn’t easy to solve, the first guy was the guy to throw at the problem.

Each excelled in their own way and the team did best when I remembered how to best utilize their talents.

Impossible Things

“I daresay you haven’t had much practice,” said the Queen. “When I was your age, I always did it for half-an-hour a day. Why, sometimes I’ve believed as many as six impossible things before breakfast.”

I wanted to relate a story that happened to a colleague of mine.  She maintains servers in two separate datacenters. All the servers are on the same domain.  So, in theory updating the password in one datacenter does so in both datacenters.  And this is the way it has been for the past year. Things worked as expected.

Recently she noticed however that after updating her password, the one datacenter was still using the old password and the second datacenter was using the new one.

In a proper domain this shouldn’t be possible, but apparently it was.  She spent some time confirming it before calling the IT department.

When she explained the problem to IT, their response was, “That’s not possible.  We sync the servers every hour and run an exception report every night that would show that.”

IT managed to fix the problem (after finally acknowledging it.)

However, the troubling part to me is not that the passwords got out of sync. Sometimes the impossible does happen. What’s more troubling is that the sync apparently never fixed the problem, and the exception report never showed an issue in the month-plus that the problem existed.

The moral here?

It’s not that the impossible sometimes happens (though that would be good for a future blog post).  It’s when your alerts and warnings fail, perhaps it’s time you look at your alerts and warnings since they’re obviously not doing you much good.
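One way to find out whether a report like that is actually doing its job is to feed it a mismatch you planted yourself and see if it notices. Below is a minimal sketch of that idea. Everything in it is hypothetical: I have no idea how that IT department's report works, and the dictionaries of account names and "password versions" are just stand-ins for whatever each datacenter really stores.

```python
def exception_report(dc_a: dict, dc_b: dict) -> list:
    """Return the accounts whose credential version differs between two sites."""
    accounts = sorted(set(dc_a) | set(dc_b))
    return [name for name in accounts if dc_a.get(name) != dc_b.get(name)]


def report_catches_known_mismatch(dc_a: dict, dc_b: dict) -> bool:
    """Plant a deliberate mismatch in copies of the data and check it is flagged."""
    canary = "__canary__"
    test_a = {**dc_a, canary: 1}   # copies, so the real data is untouched
    test_b = {**dc_b, canary: 2}
    return canary in exception_report(test_a, test_b)


if __name__ == "__main__":
    site_one = {"alice": 7, "bob": 3}
    site_two = {"alice": 6, "bob": 3}   # alice's change never arrived here
    print("Out of sync:", exception_report(site_one, site_two))
    print("Report self-test passed:", report_catches_known_mismatch(site_one, site_two))
```

If the self-test ever comes back false, you know the report is silent because it is broken, not because everything is fine.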

Link

Ok, perhaps this isn’t Buckland, but alerts can be important.

I’ve been wanting to write something on alerts for quite a while, and this isn’t quite it. Rather, I’ll reference another URL on alerting.  This sums up quite well much of what I’ve wanted to say for a while.

To single out one important rule: if you’re going to get an alert, be prepared to act upon it!

Knowing that you pegged your server CPU at 100% every once in a while might be useful, but it’s probably not something to wake people up about.  And if it is hitting 100% infrequently, there’s probably nothing worth doing. On the other hand, if it’s routinely hitting 100% CPU, perhaps your action plan is to spin up another web server, or move load to a different database. Or, perhaps your plan is even to do nothing. But planning to do nothing and accepting that is very different from not planning and simply doing nothing because you have no idea of what to do.

Note, alerting is very different from monitoring and logging.  If my CPU is hitting 100% once a week for 5 seconds, and then twice a week for 6 seconds, and then 4-5 times a day, for 10 seconds, I want to start making plans. But again, I probably don’t want to wake someone up.

Monitor, yes. Alert: maybe.
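To make the distinction a bit more concrete, here is a rough sketch of how that decision could be written down. The thresholds (page only after five solid minutes at 100%, start planning when the weekly count has risen three weeks running) are numbers I made up for illustration, as are all the names; pick your own.

```python
from dataclasses import dataclass


@dataclass
class WeekSummary:
    pegged_events: int       # times the CPU hit 100% during the week
    longest_seconds: float   # longest single stretch at 100%


def classify(history: list, page_after_seconds: float = 300.0) -> str:
    """Return 'page', 'plan', or 'log' based on the most recent weeks."""
    if not history:
        return "log"
    latest = history[-1]
    # Only wake someone up for a sustained problem happening right now.
    if latest.longest_seconds >= page_after_seconds:
        return "page"
    # A steadily rising count is a trend worth planning for, not an emergency.
    if len(history) >= 3:
        first, second, third = (w.pegged_events for w in history[-3:])
        if first < second < third:
            return "plan"
    return "log"


if __name__ == "__main__":
    # Roughly the progression described above: once a week for 5 seconds,
    # twice a week for 6 seconds, then 4-5 times a day for 10 seconds.
    weeks = [WeekSummary(1, 5), WeekSummary(2, 6), WeekSummary(30, 10)]
    print(classify(weeks))
```

Everything gets logged either way; the function only decides whether a human needs to think about it, and how urgently.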

That’s it for tonight.

 

The obvious isn’t so obvious

Ok, apologies to my loyal reader or two for not blogging in a LONG time.  Either life has been too busy, or when things have calmed down, I haven’t had anything to blog about.

Normally I don’t want to blog weird technical stuff, but this time I will.

I’m currently working for a client doing some DBA work.  Lots of fun actually.  

Said client has an issue. They want to have a standby datacenter.  Great idea.  Log-shipping will work well here for them.  Just one catch: corporate won’t allow anything through the firewall unless it’s SSH.  Hmm.

No real problem, I figure I can do “log-shipping in the blind”.  Basically take the transaction logs from the primary server, use rsync to get through the firewall and apply them to the secondary server.  If need be, I’ll custom write the jobs.  It should be easy.

Here’s one example:

http://sqlblog.com/blogs/merrill_aldrich/archive/2011/05/19/case-study-secure-log-shipping-via-ssl-ftp.aspx

The key part (for me) is the section there titled “Half a Log-Shipping Config”.

Pretty straightforward, no DBA would balk at that.
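For the copy half of the plan (getting the log backups through an SSH-only firewall), a rough sketch might look like the following. This is not the job I actually ran; the directory and host names are made up, and it assumes rsync and ssh are available on the box doing the copying.

```python
import subprocess


def build_rsync_command(local_dir: str, remote: str) -> list:
    """Build an rsync call that tunnels over SSH and copies only .trn files."""
    return [
        "rsync",
        "-az",               # archive mode, compress in transit
        "-e", "ssh",         # use SSH as the transport
        "--include=*.trn",   # transaction log backups
        "--exclude=*",       # skip everything else in the directory
        local_dir.rstrip("/") + "/",   # trailing slash: copy the contents
        remote,
    ]


def ship_logs(local_dir: str, remote: str) -> None:
    """Run the copy; raises CalledProcessError if rsync reports a failure."""
    subprocess.run(build_rsync_command(local_dir, remote), check=True)


if __name__ == "__main__":
    # Hypothetical paths and host, shown only to illustrate the shape of the call.
    print(" ".join(build_rsync_command("/backups/logshipping", "standby:/logshipping/")))
```

Building the command separately from running it makes it easy to eyeball (or log) exactly what the scheduled task is going to do.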

So, I set it all up. Set up some scheduled tasks.  Make a backup, copy it to the secondary server and restore it.  So far so good.  Take logs created by the existing log backup job, and watch as they automatically get shipped to the datacenter.

“Hey, this is going great. I’ll be done before lunch.”

Manually apply a log or two to make sure there are no issues.

“Works perfect.  Lunch is starting to look real good. Maybe I’ll break early and go out.”

Set up a scheduled task to run the job created above.

Fails. “Oh fudge. (Hmm, maybe some fudge after lunch?)”

Oh wait, just a typo in the directory.  

Rebuild job. Run it. Success!

Great, let’s check things out.  

“Hmm, the .TUF (transaction undo file) isn’t there.”

Let’s look at the jobs.

Now here, you’ll have to bear with me. Again, the corporate rules will NOT permit me to cut-paste even simple text error messages from a server in the datacenter.

But basically I get a bunch of messages along the lines of:

2013-06-26 05:00:01.92 Skipped log backup file. Secondary DB: ‘MyDemo’, File: ‘\\NAS\Logshipping\MyDemo_201306261156.trn’

A bunch of these.

Well, a lot of Googling suggested that since the log files were empty (this database doesn’t get much traffic) SQL Server was smart enough to know there was nothing to apply.

Sure enough, manually applying them showed there was nothing for them to do. I needed a real transaction log with a transaction in it. No problem. Go to original database. Do a quick update.  Create transaction log and wait for the automated log copier to get it to the right place.

“Hmm, maybe I’ll just do a late lunch today.”

Get it to the secondary server. Run the job.

“Hmm. Skipped the files I expected it to skip.  No problem.”

Gets to the file I expect it to apply:

(now I’m retyping rather than cutting/pasting)

2013-06-26 13:14:05.43 Found first log backup to restore. Secondary DB: ‘MyDemo’; File: ‘\\NAS\Logshipping\MyDemo_201306261605.trn’

2013-06-26 13:14:05.44 The restore operation was successful. Secondary Database ‘MyDemo’ Number of log backup files restored: 0

“Huh?”

Ayup. You read it right. It found the right file. It said the restore was successful. Then it said 0 files were restored.

Now, I eventually broke for lunch.  And dinner. And bed. And breakfast. And another lunch and dinner and more sleep. Also a couple of bike rides and a bunch of meetings and other stuff.

But truth is, I was stumped. I tried everything. Rebuilt the job. Tried manually updating the tables on the secondary to tell it I had already applied a log (that got me a weird datetime error I never quite figured out).

I could manually apply the logs just fine.

Log-shipping between two servers within the datacenter worked just fine.

Why wasn’t this working? I mean a log file is a log file right?

Well, apparently yes and no.

The ONE difference I noticed was that the transaction logs from the working test had their filenames using UTC time stamps, including seconds.

The ones I was shipping were from a standard maintenance plan I had set up, and their times were in local time without the seconds.

BTW, this http://www.sqlservercentral.com/Forums/Topic351549-146-1.aspx#bm438143 is what also helped me wonder about this.

“It couldn’t be THAT simple could it?”

So I set up a database in the primary datacenter, full logging, and set up log-shipping on that side. Now, I still couldn’t provide any data about the secondary, but I could at least script the primary side that does the backups.

Set that up.  Shipped over a database, some logs. Now I’d like to say it worked on the first try, but I ran into some issues (completely of my own doing, so not relevant here).

But 3rd time is the charm as they say.  Got a backup over to the secondary.  Let the autosync job move over some transaction logs (including several with transactions).

Then, on the secondary, I ran the job I had previously handcrafted and... SUCCESS.

So, yes, you can do “blind” log-shipping (which I knew).

You can set up the secondary by hand.

But apparently you can’t use your existing transaction log job. You’re better off setting up the log-backups on the primary using the normal tools and then shipping those.
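If you do want to sanity-check file names before blaming anything else, something like the sketch below would have saved me a couple of days. To be clear about the assumption: the pattern (database name, underscore, a 14-digit timestamp including seconds, .trn) is only what I observed on the working log-shipping files, not something I'm quoting from documentation, and the function name is my own.

```python
import re
from datetime import datetime

# Assumed pattern, based on the working files: <db>_YYYYMMDDHHMMSS.trn
LOG_NAME = re.compile(r"^(?P<db>.+)_(?P<stamp>\d{14})\.trn$", re.IGNORECASE)


def parse_log_shipping_name(filename: str):
    """Return (database, timestamp) if the name fits the pattern, else None."""
    match = LOG_NAME.match(filename)
    if match is None:
        return None
    try:
        # The timestamp is treated as UTC by convention; no timezone is attached.
        stamp = datetime.strptime(match.group("stamp"), "%Y%m%d%H%M%S")
    except ValueError:
        return None   # 14 digits, but not a real date and time
    return match.group("db"), stamp


if __name__ == "__main__":
    # The first name is the style my maintenance plan produced (no seconds);
    # the second is a made-up name in the style of the working files.
    for name in ("MyDemo_201306261156.trn", "MyDemo_20130626160512.trn"):
        print(name, "->", parse_log_shipping_name(name))
```

A file that comes back as None is one I would now expect the restore job to quietly ignore.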

So lesson learned.

Sometimes, it’s enough to know there’s a right answer to keep you driving towards it.

And now time for dinner.

 

Eliminating the impossible

“How often have I said to you that when you have eliminated the impossible, whatever remains, however improbable, must be the truth?”  Sherlock Holmes

Since I haven’t found this issue elsewhere, and since it’s been a while since I’ve blogged, I figured I’d post.

So, the scenario is:

A Windows 2008 R2 Cluster that existed before I arrived on the scene.  2 nodes.  Set up to run SQL Server.  SQL Server would run fine on Node A, but a failover to Node B would fail.

There was some back history to the setup, but it was neither complete nor detailed.  The problem was suspected to be with DNS or Active Directory.

I arrive on the scene and one of my jobs is to set up additional clustered SQL instances on this Windows Cluster. I do so, expecting to have the exact same issue. Nope. Things work fine once I figure out the rights my user needed but didn’t have in order to ADD the second node (Log on as a service, btw). (For the time being I built a 1-node cluster, yes, you can do that, and then once I had the rights, simply added the 2nd node.)

So, now I’m in the situation with a 2-node cluster and 3 SQL instances.  Two fail over as expected.  One (the original) does not.

Time to put my debugging hat on.

I won’t bore you with the details.  Suffice to say I tried a lot.

Compared ipconfig /all results – Everything the same (what wasn’t the same, I made the same where it made sense to.  Still no joy.)

Pinged the WINS and DNS servers from both boxes. OK, here was a difference.  Node A could ping both its primary and its secondary WINS server.  Node B could NOT ping its secondary WINS server.  Interesting. But, didn’t really seem like the issue since it couldn’t explain why the other 2 instances would fail over just fine.

Checked out the registry.  Same in both cases.

Start to look at error logs.  At first nothing.  Then I realize that, according to the timestamps, a SQL error log IS being created on Node B.  I look even more closely. The service is actually STARTING!  But then it’s stopping. And in between there’s a bunch of errors about not being able to log in.  Very strange.

So now I try to tackle the problem from a different angle. I fail over the disk and IP resources but don’t tell the cluster service to start up SQL Server.

Then, I go to the command line and start the service manually.

Works fine. Connections can be made, etc.  Of course the cluster service doesn’t think it’s up, but that’s to be expected and ok at this point.

But, this is only a partial test.  Since maybe it’s my user that can do this, but not the service account.

So, go to the services screen, change SQL Server to start up using my account and confirm that works.  Great.

Change it back to the designated service account and start it manually from there.  Starts just fine.

BUT, no login errors.

Finally that part clicks.  The thing trying to login and do a query is the CLUSTER Service itself. It’s simply the heartbeat Cluster Service uses to make sure the node started. No wonder, it is attempting to start the node and then failing. It never hears the heartbeat.

Since it takes about a minute for the startup to actually fail, I confirm that I can connect to SQL Server in that minute window.  Sure enough, no problem, at least until the Cluster Service fails it.

So basically SQL Server is in fact running properly and starting up properly. It’s simply that the Cluster Service can’t confirm it is running so it shuts SQL Server down.

I started to try several things that all ended up in a blind alley.

Then as I was poking around the SQL Server Configuration Manager on Node B it dawned on me to look at the SQL Native Client and compare it to Node A. The one critical difference was that Node B had some aliases setup.  They looked correct, but following a troubleshooting axiom of mine “One of these things is not like the other” I decided to rename them (not delete them, since another axiom is “don’t break anything you can’t fix”) so they wouldn’t be used.

I then tested the failover, fully not expecting this to solve the problem. The failover worked just fine. Wow. That surprised me. Of course I never trust anything I can’t replicate.  Changed the aliases back to their original form. Test failover. It fails. Change them back to the updated names and things work again.

I had my solution.
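The “one of these things is not like the other” step is really just a diff between what each node reports. I did it by eye, but if you can get each node's settings into some sort of key/value form, a sketch like this does the comparing for you. The function and the sample settings are hypothetical (including the alias name); how you collect the values is up to you.

```python
def diff_settings(node_a: dict, node_b: dict) -> dict:
    """Return {setting: (value_on_a, value_on_b)} for every setting that differs."""
    missing = "<absent>"
    keys = sorted(set(node_a) | set(node_b))
    return {
        key: (node_a.get(key, missing), node_b.get(key, missing))
        for key in keys
        if node_a.get(key, missing) != node_b.get(key, missing)
    }


if __name__ == "__main__":
    # Made-up values, loosely modeled on the differences I ran into.
    node_a = {"secondary_wins_reachable": True, "native_client_aliases": []}
    node_b = {"secondary_wins_reachable": False, "native_client_aliases": ["SQLPROD"]}
    for setting, (a_value, b_value) in diff_settings(node_a, node_b).items():
        print(f"{setting}: Node A = {a_value!r}, Node B = {b_value!r}")
```

It won't tell you which difference matters (the WINS one was a red herring for me), but it does give you a short list to change one at a time and re-test.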

Now, my blog is intended to be more about thinking than actual technical issues, but for this I’ll make an exception.  So for future reference, Google and more:

The error I received in the SQL Error logs was:

2013-02-20 08:36:47.74 Logon Login failed for user ‘’. The user is not associated with a trusted SQL Server connection. [CLIENT: 192.168.3.44]
2013-02-20 08:36:47.74 Logon Error: 17806, Severity: 20, State: 2.

No Googling for this helped. (It’s a common error in other contexts, but none of the results I found were helpful here.)

But otherwise, this was your basic troubleshooting.

  • Eliminate possibilities
  • Try variations
  • When you think you’ve solved it, replicate it.

And, no matter how improbable it is (I never would have guessed Aliases) if you’ve eliminated everything else, it must be that.

Interesting post from someone I know

Well so much for getting back on the bandwagon of posting more.

But in my defense, I’ve been busy this past month.

Someone I worked with briefly has written a post that I really liked and that I think is on-topic to some of the stuff here, namely “how we think.”

http://www.ratha.com/monotropism#comment-244

I’ve always been fascinated by not only intelligence, but HOW we think. Why can some folks solve a problem faster than others? Why can some folks walk into a room and “see” everything there while others miss many details?

Years ago I developed the idea that folks had varying amounts of I/O “bandwidth” and more and more research seems to be bearing this out at a very general level. Ratha’s post explores a specific concept here of monotropism vs. polytropism and how some folks tend to focus on one thing to the extreme.

Following Directions

Not much to say here other than this link may have saved me a lot of work.

I would have saved myself even more work if I had paid close attention to the last step:

Don’t miss this step, it’s very important: Select the new document; Press Ctrl + A; Press F9.

Just to add, what this does is make sure the correct images get merged in.