A recent thread on the NANOG mailing list about the outages this past weekend reminded me of one of my war stories that I thought I’d relate.
About 7 years ago, I was involved in a project to help a different division of the company upgrade and replace their cluster and SAN with new hardware. I was in a fairly hands-off role, but became more involved towards the end. We had outsourced the hardware and software installation to another company, and we were about to take over and move the production system onto the new setup.
I tend to be a bit paranoid when it comes to changes. Changes can introduce new, unknown points of failure. There’s a reason for the saying “better the devil you know.”
Anyway, just before we were set to go live, I asked, “Has anyone tried a remote reboot of the box, just to make sure it’ll fail over correctly and come back up?”
“Oh sure, we did that when we were in the datacenter with the vendor a couple of weeks ago.”
Now, there had been several changes to the system since then, including our adding more components of our software. So I did a quick poll and asked if everyone was comfortable with moving forward without testing it. The answer was yes.
Still, as I said, I’m a bit paranoid. So I basically asked, “Ok, that’s nice, but do me a favor, reboot the active node anyway, just to see what happens.”
So they did. And we waited. And we waited. And we waited. As I recall, the failover DID work perfectly. But the rebooted node never came back up. So we waited. And waited. Now, unfortunately, due to budgetary reasons, we didn’t have an IP-KVM set up. We had to wait until we could get someone to the remote datacenter to see what was going on.
Meanwhile, the vendor that did the install was called and they assured us it was nothing they had done. It had to be a hardware issue, since they had installed things perfectly.
That’s your typical vendor response by the way.
Finally we got someone out to the datacenter. What did they see but the machine sitting there, waiting to mount a network share to get some files. This puzzled us, since we had no network shares for that purpose. We pondered it a bit. Then it dawned on me that the only thing different between our reboot and the earlier one was that the vendor had been in the datacenter.
So we made the necessary changes so the box no longer tried to mount anything over the network, tested the reboot a few times, and all was well.
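For anyone who hasn’t hit this failure mode before, here’s a minimal sketch of what it can look like on a modern Linux box. Everything below is hypothetical (the server address, paths, and even the OS are my invention for illustration, not what we actually had); the point is simply that a remote mount left in /etc/fstab without something like nofail can stall the boot indefinitely when the server isn’t reachable.

    # /etc/fstab (hypothetical example)
    #
    # A remote mount like this can hang the boot if the server is gone:
    #   10.0.0.5:/installfiles  /mnt/install  nfs  defaults             0 0
    #
    # Marking it non-blocking lets the boot continue without it:
    10.0.0.5:/installfiles  /mnt/install  nfs  nofail,_netdev,soft  0 0

In our case, of course, the right fix was simply to not try mounting it at all, since the share only ever existed on the vendor’s laptop.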
One of my team called the vendor and asked, “Hey, when you do an install, does your tech mount stuff off of their laptop?” “Oh sure, that’s how we get some of the critical files where we want them. Why?”
So, one little change, having the vendor there with their laptop on the network versus not having them there, made all the difference.
If you’re going to test, test in real-world scenarios as much as you can. Make sure nothing extra is plugged in, even if you think it won’t make a difference.
If you can, test things just the way you would expect them to fail in the real world. If you have redundant power supplies, test them by tripping the breaker on one leg. If you find yourself saying, “Oh no, that’s too risky,” guess what: you’re not nearly as protected as you think you are. If you can suggest doing the test with confidence, you’re halfway to success. If you actually DO the test and things work, you’re all the way there.
In conclusion, one seemingly innocuous change can make a huge difference.