On Call

I want to pass on a video I’ve finally gotten around to watching:

Dave O’Conner speaks

He makes several good points, most of which I fully agree with, and many of which I've adopted at various jobs over the years.

I've managed a number of on-call teams with varying levels of success. One point I'd add that makes a difference is good buy-in from above.

One thing he mentions is availability. Too often folks claim they need 99.999% uptime. My question has often been "why?", followed by, "Are you willing to pay for that?" Often the why boils down to "umm.. because…" and the answer to paying for it was "no", at least once they realized the true cost.

I also had a rule that I sometimes used: "If there's no possible response, or no response is necessary, don't bother alerting!"

An example might be traffic flow. I've seen setups where, if traffic exceeds a certain threshold even once in, say, a one-hour period (with monitoring every 5 seconds), a page goes out. Why? By the time you respond it's gone and there's nothing to do.

A far better response is to automate it such that if it happens more than X times in Y minutes, THEN send an alert.
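One way to sketch that X-times-in-Y-minutes rule (the class and names here are my own illustration, not from the talk) is a sliding window of recent threshold breaches:

```python
import time
from collections import deque


class ThresholdAlerter:
    """Page only if a threshold is breached more than `max_events`
    times within a sliding window of `window_seconds`."""

    def __init__(self, max_events, window_seconds):
        self.max_events = max_events
        self.window_seconds = window_seconds
        self.events = deque()

    def record_event(self, now=None):
        """Record one breach; return True only if it's time to page."""
        now = time.monotonic() if now is None else now
        self.events.append(now)
        # Drop breaches that have aged out of the window.
        while self.events and now - self.events[0] > self.window_seconds:
            self.events.popleft()
        return len(self.events) > self.max_events
```

A one-off spike records an event and is forgotten; only a sustained run of breaches inside the window actually wakes anyone up.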

In some cases, simply retrying works.  In the SQL world I’ve seen re-index jobs fail due to locking or other issues.  I like my sleep.  So I set up most of my jobs to retry at least once on failure.

Then, later, I'll review the logs. If I see a constant pattern of retries, I'll schedule time to fix it.
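A minimal sketch of that retry-before-paging pattern (the function name and defaults are my own, not from any particular job scheduler): failures that clear themselves on retry get logged for later review instead of waking someone up.

```python
import logging
import time

logger = logging.getLogger("jobs")


def run_with_retry(job, retries=1, delay_seconds=300):
    """Run `job`; on failure, wait and retry before giving up.
    Only exhausting all retries should escalate to an alert."""
    for attempt in range(retries + 1):
        try:
            return job()
        except Exception as exc:
            logger.warning("attempt %d failed: %s", attempt + 1, exc)
            if attempt == retries:
                raise  # out of retries -- now it's worth paging someone
            time.sleep(delay_seconds)
```

Wrapping a flaky re-index job this way turns a transient lock timeout into a log line instead of a 3 a.m. page.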

At one client, we had an issue where a job would randomly fail maybe once a month.  They would page someone about it, who would rerun the job and it would succeed.

I looked at the history and realized simply by putting a delay in of about 5 minutes on a failure and retrying would reduce the number of times someone had to be called from about once a month to once every 3 years or so.  Fifteen minutes of reviewing the problem during a normal 9-5 timeframe and 5 minutes of checking the math and implementing the fix meant the on-call person could get more sleep every month. A real win.
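The math here is easy to sanity-check. This is my own back-of-envelope version, assuming the job ran daily and that failures were independent transient events (the post doesn't state either; adjust the numbers for your own schedule):

```python
# If a daily job fails about once a month, and a delayed retry is an
# independent second roll of the dice, then a page requires the run
# AND its retry to both fail.
runs_per_month = 30
p_fail = 1 / runs_per_month            # ~once-a-month failure rate
p_double_fail = p_fail ** 2            # run and retry both fail
runs_between_pages = 1 / p_double_fail # ~900 daily runs
years_between_pages = runs_between_pages / 365
print(round(years_between_pages, 1))   # -> 2.5
```

Roughly two and a half years between pages under these assumptions, which is the same ballpark as the once-every-3-years figure above.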

Moral of the story: not everything is critical, and if something is, handle it as if it is, not as an afterthought.

When things aren’t the same

A recent thread on the Nanog mailing list regarding the outages this past weekend reminded me of one of my war stories that I thought I’d relate.

About 7 years ago, I was involved in a project to help a different division of the company upgrade and replace their cluster and SAN with new hardware.  I was in a fairly hands-off mode, but towards the end was more involved.  We had outsourced the hardware and software installation to another company, but we were about to take over and move the production system to it.

I tend to be a bit paranoid when it comes to changes. Changes can introduce new, unknown points of failure. There's a reason for the saying "better the devil you know."

Anyway, before we were about to go live, I asked, “Has anyone tried a remote reboot of the box, just to make sure it’ll failover correctly and come up?”

“Oh sure, we did that when we were in the datacenter with the vendor a couple of weeks ago.”

Now, there had been several changes to the system since then, including us adding more components of our software. So I did a quick poll and asked if everyone was comfortable moving forward without testing it. The answer was yes.

Still, as I said, I’m a bit paranoid.  So I basically asked, “Ok, that’s nice, but do me a favor, reboot the active node anyway, just to see what happens.”

So they did. And we waited. And we waited. And we waited. As I recall, the failover DID work perfectly. But the rebooted node never came up. So we waited. And waited. Unfortunately, due to budgetary reasons, we didn't have an IP-KVM set up, so we had to wait until we could get to the remote datacenter to see what was going on.

Meanwhile, the vendor that did the install was called and they assured us it was nothing they had done. It had to be a hardware issue, since they had installed things perfectly.

That’s your typical vendor response by the way.

Finally we got someone out to the datacenter. What did they see but the machine sitting there, waiting to mount a network share to get some files. This puzzled us, since we had no network shares for this purpose. We pondered it a bit. Then it dawned on me that the only thing different between our reboot and the earlier one was that the vendor had been in the datacenter.

So, we made the necessary changes to not try to mount network files, tested the reboot a few times and all was well.

One of my team called the vendor and asked, “hey, when you do an install, does your tech mount stuff off of their laptop?”  “Oh sure, that’s how we get some of the critical files where we want them, why?”

So, one little change, having the vendor there (with their laptop on the network) versus not having the vendor there, made all the difference.

If you’re going to test, test in real-world scenarios as much as you can.  Make sure nothing extra is plugged in, even if you think it won’t make a difference.

If you can, test it just the way you would expect it to fail in the real world. If that means you have redundant power supplies, test by tripping the breaker on one leg. If you find yourself saying, "Oh no, that's too risky," guess what: you're not nearly as protected as you think you are. If you can suggest doing the test with confidence, you're halfway to success. If you actually DO the test and things work, you're all the way there.

In conclusion, one seemingly innocuous change can make a huge difference.