Who’s Flying the Plane

I mentioned in an earlier post my interest in plane crashes. I had been toying with a presentation based on this concept for quite a while.

A little over a month ago, at the local SQL Server user group here in Albany, I offered to present at the February meeting. I gave them a choice of topics: a talk on Entity Framework and how its defaults can be bad for performance, or a talk on plane crashes and what IT can learn from them. They chose the latter. I guess plane crashes are more exciting than a dry talk on EF.

In any event, the core of the presentation is based on the two plane crashes mentioned in the earlier post: Eastern Airlines Flight 401, the L-1011 that crashed in the Florida Everglades in 1972, and US Airways Flight 1549, the Miracle on the Hudson, in 2009.

I don’t want to reproduce the entire talk here (in part because I’m hoping to present it elsewhere) but I want to highlight one slide:

Flight 401 vs 1549

  • Flight 401 – Perfectly good aircraft
  • Flight 1549 – About as bad as it gets
  • Flight 401 – 101 Fatalities/75 Survivors
  • Flight 1549 – 0 Fatalities

Flight 401 had a burned-out nose gear indicator light and crashed.

Flight 1549 had two non-functional engines and everyone got off safely.

The difference was good communication, planning, and a focus at all times on who was actually flying the airplane.

Think about this the next time you’re in a crisis. Are you communicating well? How is your planning? And is someone actually focused on making sure things don’t get worse because you’re focusing on the wrong problem? I touch upon that here when I talk about driving.

The moral: always make sure someone is “flying the plane”.

American Cave Accidents

I’ve mentioned in the past that I teach cave rescue and work with some great people. This week I received the biennial American Cave Accidents report that the NSS puts out. (I’ll mention that I’m friends with the editor and several of the folks who helped advise on this issue.)

This report in theory covers all reported cave accidents in the US over the last two years (and sometimes Canada, Mexico, and other parts of the Americas). In reality, anyone who is familiar with caving knows it misses some.

As I was reading today, I noticed in several of the reports the names of people I recognized; most are fellow instructors, or at the very least students.

Now, one could assume correlation is causation and determine that perhaps the presence of NCRC-trained people causes caving accidents. Thankfully, I’m pretty sure that’s not true. But it got me thinking: “Why does it seem such a high proportion of the accident reports involve people I know?” I certainly do NOT know that many cavers.

Then it dawned on me: these people are perhaps far more likely to actually submit accident reports. I know of at least two accidents where the people involved did NOT want to report them. Why? Either because they thought it was a bit embarrassing, or because they didn’t think there was anything to learn from their experience.

I posit that those who go through NCRC training realize, more than many others, the value of sharing their experiences, and that it is worth risking a little embarrassment (and to be honest, in my opinion there never really is any) if others can learn from what happened.

I happened to be talking to another friend of mine tonight who is also in the EMS field, and in a separate discussion he brought up how much value there is in shared knowledge like this, and how some folks don’t like to share it.

The more you broaden your horizons, and the more experiences you hear about and share in your field of expertise, the better you will be in that field.

Take the time to share your mistakes and your successes. Share your unusual stories; you never know when someone else can benefit from them.

Eliminating the impossible

“How often have I said to you that when you have eliminated the impossible, whatever remains, however improbable, must be the truth?” – Sherlock Holmes

Since I haven’t found this issue written up elsewhere, and since it’s been a while since I’ve blogged, I figured I’d post.

So, the scenario is:

A two-node Windows 2008 R2 cluster that pre-existed my arrival on the scene, set up to run SQL Server. SQL Server would run fine on Node A, but a failover to Node B would fail.

The back history on the setup was neither complete nor detailed, but the problem was suspected to be with DNS or Active Directory.

I arrive on the scene, and one of my jobs is to set up additional clustered SQL instances on this Windows cluster. I do so, expecting to hit the exact same issue. Nope. Things work fine once I figure out the right my account needed but didn’t have in order to ADD the second node (“Log on as a service,” by the way). (For the time being I built a one-node cluster – yes, you can do that – and then, once I had the right, simply added the second node.)
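
For future reference, the add-node step can also be scripted. Here’s a minimal sketch using SQL Server 2008 R2’s command-line setup, run from the installation media on the node being joined; the instance name and service accounts below are hypothetical placeholders, not my actual values:

    # Join this machine to an existing SQL Server failover cluster instance.
    # INST1 and the DOMAIN\... accounts are placeholders.
    .\setup.exe /q /ACTION=AddNode /INSTANCENAME="INST1" `
        /SQLSVCACCOUNT="DOMAIN\sqlsvc" /SQLSVCPASSWORD="<password>" `
        /AGTSVCACCOUNT="DOMAIN\sqlagent" /AGTSVCPASSWORD="<password>" `
        /IACCEPTSQLSERVERLICENSETERMS /INDICATEPROGRESS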

So now I’m in a situation with a two-node cluster and three SQL instances. Two fail over as expected. One (the original) does not.

Time to put my debugging hat on.

I won’t bore you with the details.  Suffice to say I tried a lot.

Compared ipconfig /all results – everything the same. (What wasn’t the same, I made the same where it made sense to. Still no joy.)

Pinged the WINS and DNS servers from both boxes. OK, here was a difference: Node A could ping both its primary and its secondary WINS server. Node B could NOT ping its secondary WINS server. Interesting. But it didn’t really seem like the issue, since it couldn’t explain why the other two instances would fail over just fine.
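
Checks like these are easy to script, for what it’s worth. A rough PowerShell sketch, assuming each node’s ipconfig /all output has been saved to a file first; the file paths and server addresses are hypothetical:

    # Dump the network config (run once on each node).
    ipconfig /all | Out-File C:\temp\ipconfig-nodeA.txt

    # With both files copied to one machine, show only the lines that differ.
    Compare-Object (Get-Content C:\temp\ipconfig-nodeA.txt) `
                   (Get-Content C:\temp\ipconfig-nodeB.txt)

    # Confirm the WINS and DNS servers answer pings from this node.
    foreach ($srv in '192.168.3.10', '192.168.3.11') {
        "$srv reachable: $(Test-Connection -ComputerName $srv -Count 2 -Quiet)"
    }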

Checked out the registry.  Same in both cases.

Start to look at error logs. At first, nothing. Then I realize that, according to the timestamps, a SQL error log IS being created on Node B. I look even more closely. The service is actually STARTING! But then it’s stopping, and in between there’s a bunch of errors about not being able to log in. Very strange.

So now I try to tackle the problem from a different angle. I fail over the disk and IP resources but don’t tell the cluster service to start up SQL Server.

Then, I go to the command line and start the service manually.

Works fine. Connections can be made, etc.  Of course the cluster service doesn’t think it’s up, but that’s to be expected and ok at this point.
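
If you want to replicate that trick, the rough shape of it on Windows 2008 R2 is below; the resource, group, and instance names are hypothetical placeholders for whatever your cluster actually uses:

    # Take the SQL Server resource offline so the cluster won't try to start it,
    # then move its group (disks, IP, network name) over to Node B.
    cluster.exe resource "SQL Server (INST1)" /offline
    cluster.exe group "SQL Server (INST1) Group" /move:NodeB

    # Start the instance by hand, outside the cluster service's control.
    # The quotes keep PowerShell from expanding $INST1 as a variable.
    net start 'MSSQL$INST1'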

But this is only a partial test, since maybe it’s my account that can do this, but not the service account.

So, go to the Services screen, change SQL Server to start up using my account, and confirm that works. Great.

Change it back to the designated service account and start it manually from there.  Starts just fine.

BUT, no login errors.

Finally, that part clicks. The thing trying to log in and run a query is the CLUSTER service itself. It’s simply the heartbeat the Cluster Service uses to make sure the instance started. No wonder it keeps attempting to start the resource and then failing it: it never hears the heartbeat.

Since it takes about a minute for the startup to actually fail, I confirm that I can connect to SQL Server in that minute window.  Sure enough, no problem, at least until the Cluster Service fails it.
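
A quick way to run that test during such a window is sqlcmd against the instance’s virtual network name (“SQLVS1\INST1” here is a hypothetical placeholder):

    # Connect with Windows authentication and ask the instance to identify itself.
    sqlcmd -S "SQLVS1\INST1" -E -Q "SELECT @@SERVERNAME, SERVERPROPERTY('IsClustered')"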

So basically SQL Server is in fact running properly and starting up properly. It’s simply that the Cluster Service can’t confirm it is running, so it shuts SQL Server down.

I then tried several more things that all ended up in blind alleys.

Then, as I was poking around SQL Server Configuration Manager on Node B, it dawned on me to look at the SQL Native Client settings and compare them to Node A’s. The one critical difference was that Node B had some aliases set up. They looked correct, but following a troubleshooting axiom of mine, “One of these things is not like the other,” I decided to rename them (not delete them, since another axiom is “don’t break anything you can’t fix”) so they wouldn’t be used.
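
As an aside, you can inspect aliases outside the Configuration Manager GUI: the SQL Native Client keeps them in the registry, so a sketch like this lists them (this is the 64-bit client’s path; on a 64-bit OS the 32-bit client keeps a parallel set under SOFTWARE\Wow6432Node):

    # List any SQL Native Client aliases defined on this node.
    # Each value maps an alias name to a "protocol,server,port" string.
    Get-ItemProperty 'HKLM:\SOFTWARE\Microsoft\MSSQLServer\Client\ConnectTo' -ErrorAction SilentlyContinue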

I then tested the failover, fully expecting this NOT to solve the problem. The failover worked just fine. Wow. That surprised me. Of course, I never trust anything I can’t replicate. I changed the aliases back to their original names. Tested the failover. It failed. Changed them back to the renamed versions and things worked again.

I had my solution.

Now, my blog is intended to be more about thinking than actual technical issues, but for this I’ll make an exception. So, for future reference, for Google, and for anyone else who hits this:

The error I received in the SQL Error logs was:

2013-02-20 08:36:47.74 Logon Login failed for user ”. The user is not associated with a trusted SQL Server connection. [CLIENT: 192.168.3.44]
2013-02-20 08:36:47.74 Logon Error: 17806, Severity: 20, State: 2.

No Googling for this helped. (It’s a common error in other contexts, but none of the results I found were helpful here.)

But otherwise, this was your basic troubleshooting.

  • Eliminate possibilities
  • Try variations
  • When you think you’ve solved it, replicate it.

And, no matter how improbable it is (I never would have guessed aliases), if you’ve eliminated everything else, it must be the truth.