How often have any of us resorted to fixing a server issue by simply rebooting it? Yes, we’re all friends here, you can raise your hands. Don’t be shy. We all know we’ve done it at some point.
I ask the question because of a recent tweet I saw with the hashtag #sqlhelp where Allan Hirt made a great comment:
Finding root cause is nice, but my goal first and foremost is to get back up and running quickly. Uptime > root cause more often than not.
This got me thinking: when is this true versus when is it not? And I think the answer ends up being the classic DBA answer, “it depends”.
I’m going to pick two well-studied disasters that we’re probably all familiar with. But first we need some criteria. In my book IT Disaster Response: Lessons Learned in the Field I used the definition:
Disaster: An unplanned interruption in business that has an adverse impact on finances or other resources.
Let’s go with that. It’s pretty broad, but it’s a starting point. Now let’s ignore minor disasters like the ones I mention in the book, such as the check printer running out of toner or paper on payroll day. Let’s stick with the big ones: the ones that bring production to a halt and cost us real money. And we’re not going to restrict ourselves to IT or databases, but we’ll come back to that.
The first example I’m going to use is the Challenger Disaster. I would highly recommend folks read Diane Vaughan’s seminal work: The Challenger Launch Decision: Risky Technology, Culture, and Deviance at NASA. That said, we all know that when this occurred, NASA did a complete stand-down of all shuttle flights until a full root cause analysis (RCA) was complete and many changes were made to the program.
On the other hand, in the famous Miracle on the Hudson, airlines did not stop flying after the water landing. But this doesn’t mean an RCA wasn’t done. In fact, one was; just well after the incident.
So, back to making that decision. For NASA, it was an easy one. Shuttle flights were occurring every few months, and other than delaying some satellite launches (which ironically may have led to issues with the Galileo probe’s antenna) there wasn’t much reason to fly again immediately. Also, while the broad strokes were known, i.e. something had caused a burn-through of the solid rocket booster (SRB), it took months to determine all the details. So, in this case, NASA could and did stand down for as long as it took to rectify the issues.
In the case of the Miracle on the Hudson, the cause was known immediately. That said, even then an RCA was done to determine the extent of the damage, whether Sullenberger and Skiles had done the right thing, and what procedural changes needed to be made. For example, one item that came out of the post-landing analysis was that the engine restart checklist wasn’t really designed for low-altitude failures such as they experienced.
Doing a full RCA of the bird strike on US Airways 1549 and stopping all other flights would have been an economic catastrophe. But it was more than simply that. It was clear, based on the millions of flights per year, that this was a very isolated incident. The exact scenario was unlikely to happen again. With Challenger, there had only been 24 previous flights, and ALL of them had experienced various issues, including blow-bys of the primary O-ring and other issues with the SRBs.
So back to our servers. When can we just “get it running” again, and when should we take the downtime to do a complete RCA or pursue other options?
I’d suggest one criterion is, “how often has this happened compared to our uptime?”
If we’ve just brought a database online and within the first week it has crashed, I’m probably going to want to do more of an immediate RCA. If it’s been running for years and this is the first time the issue has come up, I’m probably just going to get it running again and not be as adamant about an immediate RCA. I will most likely try to do an RCA afterwards, but again, I may not push for it as hard.
If the problem starts to repeat itself, I’m more likely to push for some sort of immediate RCA the next time the problem occurs.
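If you want to make that rule of thumb concrete, a toy sketch might look something like the Python below. The failure-rate threshold is purely a made-up placeholder, not a rule from anywhere, so tune it to your own tolerance:

```python
from datetime import timedelta

def push_for_immediate_rca(uptime: timedelta, failure_count: int,
                           tolerated_failures_per_year: float = 1.0) -> bool:
    """Toy heuristic: if failures are arriving faster than we tolerate
    per year of uptime, argue for an immediate RCA."""
    years_up = uptime.total_seconds() / (365 * 24 * 3600)
    failure_rate = failure_count / max(years_up, 1 / 365)  # floor at one day
    return failure_rate > tolerated_failures_per_year

# A crash in the first week of service trips the heuristic;
# one crash after three quiet years does not.
print(push_for_immediate_rca(timedelta(days=7), 1))     # True
print(push_for_immediate_rca(timedelta(days=1095), 1))  # False
```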
What about the seriousness of the problem? If I have a server that’s consistently running at 20% CPU, and every once in a while it leaps up to 100% CPU for a few seconds and then goes back to 20%, will I respond the same way as if it crashes and takes me 10 minutes to get back up? Maybe. Is it a web server for cat videos that I make a few hundred dollars off of every month? Probably not. Is it a stock-trading server where those few seconds cost me thousands of dollars? Yes, then I almost certainly will be attempting an RCA of some sort.
Another factor would be: what’s involved in an RCA? Is it just a matter of copying some logs somewhere for later analysis, which will take only a few seconds or minutes, or am I going to have to run a bunch of queries, collect data, and do other tasks that may keep the server offline for 30 minutes or more?
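If it’s the former, it pays to have that scripted before the disaster. Here’s a minimal sketch; the paths are hypothetical stand-ins, so point them at wherever your server actually writes its logs:

```python
import gzip
import shutil
from datetime import datetime
from pathlib import Path

# Hypothetical locations; adjust for your environment.
LOG_DIR = Path("/var/opt/mssql/log")
ARCHIVE_DIR = Path("/var/dba/incident-archive")

def snapshot_logs() -> Path:
    """Copy the current logs to a timestamped folder so the RCA
    can happen later, after the server is back up."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    dest = ARCHIVE_DIR / f"incident-{stamp}"
    dest.mkdir(parents=True, exist_ok=True)
    for log in LOG_DIR.glob("errorlog*"):
        # Compress on the way out; logs get big and time is short.
        with log.open("rb") as src, gzip.open(dest / (log.name + ".gz"), "wb") as dst:
            shutil.copyfileobj(src, dst)
    return dest

if __name__ == "__main__":
    print(f"Logs preserved in {snapshot_logs()}")
```

A few seconds spent running something like this before the reboot keeps the option of a later RCA open without materially delaying the recovery.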
Ultimately, in most cases, it’s going to come down to balancing money and, in the most extreme cases, lives. Doing the RCA now may save money later, but it costs money now. On the other hand, skipping the RCA now saves money now, but might cost money later. Some of it is a judgement call; some of it can be weighed up front using the factors above.
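Leaving lives out of it, the money side can at least be framed as a back-of-envelope expected-value comparison. Every number below is a made-up estimate, which is really the point; the calculation is only as honest as your guesses:

```python
def rca_now_pays_off(downtime_cost_per_min: float,
                     rca_minutes_now: float,
                     recurrence_probability: float,
                     recurrence_cost: float) -> bool:
    """Compare the known cost of doing the RCA now against the
    expected cost of the problem coming back later."""
    cost_now = downtime_cost_per_min * rca_minutes_now
    expected_cost_later = recurrence_probability * recurrence_cost
    return cost_now < expected_cost_later

# The stock-trading server from above: 30 extra minutes down is painful,
# but a coin-flip chance of a $100,000 repeat incident is worse.
print(rca_now_pays_off(downtime_cost_per_min=500, rca_minutes_now=30,
                       recurrence_probability=0.5, recurrence_cost=100_000))  # True
```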
And yes, before anyone objects: often an RCA can still be done after getting things working again. My focus here is on the cases where it has to be done immediately or evidence may be lost.
So, what are your criteria for when you do an RCA immediately versus getting things running as soon as you can? I’d love to hear them.