I wrote once before about a day being a “Monday” and a week later about it not being a “Monday”. Well, yesterday was another Monday. And it reminded me of the value of DR planning and how important it is to scale it to your actual needs and budget.
There’s an old saying that the cobbler’s children have no shoes, and there’s some truth to that. Well, my kids have shoes, but yesterday reminded me I still want to improve my home DR strategy.
I had actually planned on sleeping in late, since it’s the week between Christmas and New Year’s, my largest client is basically doing nothing, and everyone else in the house is sleeping in this week. That said, old habits die hard, and after one of the cats woke me up to get fed, I decided to check my email. That’s when I noticed some of the tabs open in Chrome were dead. I’m not sure what I looked at next, but it caused me to ping my home server: nothing.
While that’s very unusual, it wouldn’t be the first time it did a BSOD. I figured I’d go to the basement, reboot, grab the paper, some breakfast and be all set. Well, I was partly right. Sure enough, when I looked at the screen there was an error on it: not a BSOD, but a black and white text screen with a bunch of characters and a line with an error on it. I rebooted, waited for the Server 2012 logo and then went out to get the newspaper. When I came back it was still booting, so I decided to wait for it to complete. Instead, it threw another BSOD (a real BSOD this time). I did a reboot and seconds later up came a BIOS message: “PARITY ERROR”.
I figured it must be a bad RAM chip and while 16 GB wouldn’t be great, I could live with that if I had to cut down. But, things only got worse. Now the server wouldn’t even boot. I don’t mean as in I kept getting parity errors or a BSOD but as in, nothing would happen, no BIOS, nothing. Best as I can tell my server had succumbed to a known issue with the motherboard.
The technical term for this is “I was hosed”. But, in true DR spirit, I had backup plans and other ideas. The biggest issue is, I had always assumed my issue would be drive failure, hence backups, RAID, etc. I did not expect a full motherboard failure.
On one hand, this is almost the best time of the year for such an event. Work is slow, I could work around this, and it wouldn’t normally be a big issue. However, there were some complicating factors. For one, my daughter is in the midst of applying to colleges and needs to submit portfolio items. These are of course saved on the server. Normally I’d move the server data drive to another machine and say “just go here”, but she’s already stressed enough and I didn’t want to add another concern. And then, much to my surprise, when I called ASRock customer service, I found they’re closed until January! Yes, they apparently have no one available for a week. So much for arguing for an RMA. And finally, of course, even if I could do an RMA, with the current situation with shipping packages, who knows when I would get it.
So, backup Plan A was to dig out an old desktop I had in house and move the drives over. This actually worked out pretty well except for one issue. The old desktop only has 2 GB of RAM in it! My server will boot, but my VMs aren’t available. Fortunately for this week that’s not an issue.
And Plan B was to find a cheap desktop at Best Buy, have my wife pick it up and, when she got home, move the server disks to that and have a reasonably powered machine as a temporary server. That plan was great, but for various reasons I haven’t overcome yet, the new machine won’t boot from the server drive (it acts like it doesn’t even see it). So, for now I’m stuck with Plan A.
I’ve since moved on to Plan C and ordered a new Mobo (ironically another ASRock, because despite this issue, it’s been rock solid for 4+ years) and expect to get it by the 5th. If all goes well I’ll be up and running with a real server by then, just in time for the New Year.
Now, Plan D is still to get ASRock to warranty the old one (some people have successfully argued for this because it appears to be a known defect). If that works, then I’ll order another case, more RAM and another OS license and end up with a backup server.
Should I have had a backup server all along? Probably. If nothing else, having a backup domain controller really is a best practice. But the reality is, this type of failure is VERY rare, and the intersection of circumstances that really requires me to have one is rarer still. So I don’t feel too bad about not having a fully functional backup server until now. At the most, I lost a few hours of sleep yesterday. I didn’t lose any client time, business or real money. So, the tradeoff was arguably worth it.
The truth is, a DR plan needs to scale with your needs and budget. If downtime simply costs you a few hours of your time coming up with a workaround (like mine did), then perhaps sticking with the workaround is acceptable if you can’t afford more. Later you can upgrade as your needs require it and your budget allows for it. For example, I don’t run a production 24×7 SQL Server, so I’m not worried about clustering, even after I obtain my backup server.
If you can work in a degraded fashion for some time and can’t afford a top-notch DR solution, that might be enough. But consider that closely before going down that route.
On the other hand, if, like my largest client, downtime can cost you thousands or even millions of dollars, then you had darn well better invest in a robust DR solution. I recently worked with them on testing the DR plan for one of their critical systems. As I mentioned, it probably cost them tens of thousands of dollars just for the test itself. But they now have VERY high confidence that if something happened, their downtime would be under 4 hours and they would lose very little data. And for the volume of business, it’s worth it. For mine, a few hours of downtime and a few days of degraded availability is OK and cost effective. But, given I have a bit of extra money, I figure it’s now worth mitigating even that.
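To put rough numbers on that tradeoff, here is a minimal Python sketch. Everything in it is an assumption for illustration: the function names are hypothetical, the model (outages per year times hours per outage times cost per hour) is a deliberate oversimplification, and all of the figures are made up rather than my or my client’s real numbers.

```python
def expected_annual_downtime_cost(outages_per_year, hours_per_outage, cost_per_hour):
    """Rough expected yearly cost of downtime under a very simple model."""
    return outages_per_year * hours_per_outage * cost_per_hour


def dr_worth_it(cost_without_dr, cost_with_dr, annual_dr_spend):
    """True if the downtime cost avoided per year exceeds the yearly DR spend."""
    return (cost_without_dr - cost_with_dr) > annual_dr_spend


# Hypothetical home office: one outage every four years, a day of scrambling.
home_without = expected_annual_downtime_cost(0.25, 8, 50)   # 100.0
home_with = expected_annual_downtime_cost(0.25, 1, 50)      # 12.5
print(dr_worth_it(home_without, home_with, annual_dr_spend=300))

# Hypothetical large client: same outage rate, but each hour is far more costly.
client_without = expected_annual_downtime_cost(0.25, 48, 50_000)  # 600000.0
client_with = expected_annual_downtime_cost(0.25, 4, 50_000)      # 50000.0
print(dr_worth_it(client_without, client_with, annual_dr_spend=100_000))
```

With these made-up inputs the home office case comes out as not worth it on dollars alone, while the client case clearly is. The model ignores things like stress, convenience and having spare money on hand, which is why I may still buy the backup server anyway.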
In closing because this IS the Internet… a couple of cat pictures.