The Thai Cave Rescue

“When does a cave rescue become a recovery?” That was the question a friend of mine asked me online about a week ago. This was before the boys and their coach had been found in the Thai cave.

Before I continue, let me add a huge caveat: this is an ongoing dynamic situation and many of the details I mention here may already be based on inaccurate or outdated information. But that’s also part of the point I ultimately hope to make: plans have to evolve as more data is gathered.

My somewhat flippant answer was “when they’re dead.” It was a bit of a dark-humor answer, but there was actually some reasoning behind it. Before I go on, let me say that at that point I actually still had a lot of hope and reason to believe they were still alive. I’m very glad they were in fact found alive and relatively safe.

There’s a truth about cave rescue: caves are literally a black hole of information. Until you find the people you’re searching for, you have very little information. Sometimes it may be as little as, “They went into this cave and haven’t come out yet.” (Actually, sometimes it can be even less than that: “We think they went into one of these caves, but we’re not even sure about that.”)

So when it comes to rescue, two of the things we try to teach students in cave rescue are to look for clues and to try to establish communications. A clue might be a footprint or a food wrapper. It might be the smell of a sweaty caver wafting in a certain direction. A clue might be the sound of someone calling for help. And the ultimate clue, of course, is the caver themselves. But there are other clues we might look for: what equipment do we think they have? What experience do they have? What are the characteristics of the cave? These can all drive how we search and what decisions we make.

Going back to the Thai cave situation, based on the media reports (which should always be taken with a huge grain of salt) it appeared that the coach and boys probably knew enough to get above the flood level and that the cave temps were in the 80s (Fahrenheit).  These are two reasons I was hopeful. Honestly, had they not gotten above the flood zone, almost certainly we’d be talking about a tragedy instead. Had the cave been a typical northeast cave where the temps are in the 40s (F) I would have had a lot less hope.

Given the above details then, it was reasonable to believe the boys were still alive and to continue treating the situation as a search, and eventually a rescue, operation. And fortunately, that’s the way it has turned out. What happens next is still open for speculation, but I’ll say don’t be surprised if they bring in gear and people and bivouac in place for weeks or even months until the water levels come down.

During the search process, apparently a lot of phone lines were laid into parts of the cave so that communications with the surface would be easier. Now that they have found the cavers, I’d be shocked if some sort of realtime communications is not set up in short order. This will allow the incident commander to make better informed decisions and to get the most accurate and up-to-date data.

So, let me relate this to IT and disasters. Typically a disaster will start with, “the server has crashed” or something similar. We have an idea of the problem, but again, we’re really in a black hole of information at that moment. Did the server crash because a hard drive failed, because someone kicked the power cord, or because of something else?

The first thing we need to do is to get more information. And we may need to establish communications. We often take that for granted, but the truth is, when a major disaster occurs, often the first thing to go is good communications. Imagine that the crashed server is in a datacenter across the country. How can you find out what’s going on? Perhaps you call for hands-on support. But what if the reason the server has crashed is that the datacenter is on fire? You may not be able to reach anyone! You might need to call a friend in the same city and have them go over there. Or you might even turn on the news to see if there’s anything worth noting.

But the point is, you can’t react until you have more information. Once you start to have information, you can start to develop a reaction plan. But let’s take the above situation and imagine that you find your datacenter has in fact burned down. You might start to panic and think you need to order a new server. You start to call up your CFO to ask her to let you buy some new hardware when suddenly you get a call from your tech at the remote site. They tell you, “Yeah, the building burned down, but we got real lucky and our server was in an area that was undamaged and I’ve got it in the trunk of my car, what do you want me to do with it?”

Now your previous data has been invalidated and you have new information and have to develop a new plan.

This is the situation in Thailand right now. They’re continually getting new information and updating their plans as they go. And this is the way you need to handle your disasters: establish communications, gather data, create a plan, and update that plan as the data changes. And don’t give up hope until you absolutely have to.

A Lost Sked

Not much time to write this week. I’m off in Alabama crawling around in the bowels of the Earth teaching cave rescue to a bunch of enthusiastic students. The level I teach focuses on teamwork. And sometimes you find teams forming in the most interesting ways.

Yesterday our focus was on some activities in a cave (this one known as Pettyjohn’s) that included a type of litter known as a Sked. When packaged it’s about 9″ in diameter and 4′ tall, and it comes in a bright orange carrier. It’s hard to miss.

And yet, at dinner, the students were a bit frantic; they could not account for the Sked. After some discussion they determined it was most likely left in the cave.

As an instructor, I wasn’t overly concerned. I figured it would be found, and if not, that’s part of the reason our organization has a budget for lost or broken equipment, even the expensive pieces.

That said, what was quite reassuring was that the students completely gelled as a team. There was no finger pointing, no casting blame. Instead, they figured out a plan, determined who would go back to look for it and when. In the end, the Sked was found and everyone was happy.

The moral is, an incident like this can leave you with a group of individuals blaming everyone else, or it can turn that group into a team where everyone shares responsibility. In this case it was the latter, and I’m quite pleased.

RCA or “get it running!”

How often have any of us resorted to fixing a server issue by simply rebooting the server?  Yes, we’re all friends here, you can raise your hands. Don’t be shy. We all know we’ve done it at some point.

I ask the question because of a recent tweet I saw with the hashtag #sqlhelp where Allan Hirt made a great comment:

Finding root cause is nice, but my goal first and foremost is to get back up and running quickly. Uptime > root cause more often than not.

This got me thinking, when is this true versus when is it not? And I think the answer ends up being the classic DBA answer, “it depends”.

I’m going to pick two well-studied disasters that we’re probably all familiar with. But we need some criteria. In my book IT Disaster Response: Lessons Learned in the Field I used the definition:

Disaster: An unplanned interruption in business that has an adverse impact on finances or other resources.

Let’s go with that. It’s pretty broad, but it’s a starting point. Now let’s ignore the minor disasters I mention in the book, like the check printer running out of toner or paper on payroll day. Let’s stick with the big ones: the ones that bring production to a halt and cost us real money. And we’re not going to restrict ourselves to IT or databases, but we’ll come back to that.

The first example I’m going to use is the Challenger Disaster. I would highly recommend folks read Diane Vaughan’s seminal work: The Challenger Launch Decision: Risky Technology, Culture, and Deviance at NASA. That said, we all know that when this occurred, NASA did a complete stand-down of all shuttle flights until a full RCA was complete and many changes were made to the program.

On the other hand, in the famous Miracle on the Hudson, airlines did not stop flying after the water landing. But that doesn’t mean an RCA wasn’t done. It was, in fact, just well after the incident.

So, back to making that decision. For Challenger, it was an easy one. Shuttle flights were occurring every few months, and other than delaying some satellite launches (which ironically may have led to issues with the Galileo probe’s antenna) there wasn’t much reason to fly immediately afterwards. Also, while the broad strokes were known, i.e. something caused a burn-through of the SRB, it took months to determine all the details. So, in this case, NASA could and did stand down for as long as it took to rectify the issues.

In the case of the Miracle on the Hudson, the cause was known immediately. That said, even then an RCA was done to determine the degree of the damage, whether Sullenberger and Skiles had done the right thing, and what procedural changes needed to be made. For example, one item that came out of the post-landing analysis was that the engine restart checklist wasn’t really designed for low-altitude failures such as the one they experienced.

Doing a full RCA of the bird strike on US Airways 1549 and stopping all other flights would have been an economic catastrophe. But it was more than simply that. It was clear, based on the millions of flights per year, that this was a very isolated incident. The exact scenario was unlikely to happen again. With Challenger, there had only been 24 previous flights, and ALL of them had experienced various issues, including blow-bys of the primary O-ring and other problems with the SRBs.

So back to our servers. When can we just “get it running” versus taking downtime to do a complete RCA, versus other options?

I’d suggest one criterion is, “how often has this happened compared to our uptime?”

If we’ve just brought a database online and within the first week it has crashed, I’m probably going to want to do more of an immediate RCA. If it’s been running for years and this is the first time this issue has come up, I’m probably going to just get it running again and not be as adamant about an immediate RCA. I will most likely try to do an RCA afterwards, but again, I may not push for it as hard.

If the problem starts to repeat itself, I’m more likely to push for some sort of immediate RCA the next time the problem occurs.

What about the seriousness of the problem? If I have a server that’s consistently running at 20% CPU and every once in a while it leaps up to 100% CPU for a few seconds and then goes back to 20%, will I respond the same way as if it crashes and it takes me 10 minutes to get it back up? Maybe. Is it a web server for cat videos that makes me a few hundred dollars a month? Probably not. Is it a stock-trading server where those few seconds cost me thousands of dollars? Yes, then I almost certainly will be attempting an RCA of some sort.

Another factor would be, what’s involved in an RCA? Is it just a matter of copying some logs someplace for later analysis, which will take a few seconds or minutes, or am I going to have to run a bunch of queries, collect data, and perform other steps that may keep the server offline for 30 minutes or more?
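If it really is just a matter of grabbing the logs, that step can be scripted ahead of time so whoever is doing the 3 a.m. restart doesn’t have to think about it. Here’s a minimal sketch of the idea in Python; the paths and the *.log pattern are invented for illustration, not any particular environment:

```python
# Hypothetical sketch: preserve whatever evidence you can in the seconds before
# a restart, so a root cause analysis is still possible later.
# The paths and the *.log pattern are assumptions, not a real environment.
import shutil
import time
from pathlib import Path

LOG_DIR = Path("/var/log/myapp")       # assumed application log location
ARCHIVE_ROOT = Path("/rca-archive")    # assumed location that survives the restart

def snapshot_logs() -> Path:
    """Copy recent logs to a timestamped folder and return its path."""
    dest = ARCHIVE_ROOT / time.strftime("rca-%Y%m%d-%H%M%S")
    dest.mkdir(parents=True, exist_ok=True)
    for log_file in LOG_DIR.glob("*.log"):
        shutil.copy2(log_file, dest)   # copy2 keeps timestamps for the timeline
    return dest

if __name__ == "__main__":
    print(f"Evidence preserved in {snapshot_logs()}; safe to restart now.")
```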

Ultimately, in most cases, it’s going to come down to balancing money and, in the most extreme cases, lives. Determining the root cause now may save money later, but it costs money now. On the other hand, skipping the RCA now might save money now, but might cost money later. Some of it is a judgment call; some of it depends on the factors you weigh in making your decision.

And yes, before anyone objects, I’m only very briefly touching on the fact that often an RCA can still be done after getting things working again. My focus here is on the cases where it has to be done immediately or evidence may be lost.

So, what are your criteria for when you do an RCA immediately versus getting things running as soon as you can? I’d love to hear them.

Photo credit: j zamora on Unsplash.

Crane Operators

The other day, while I was talking online with friends, someone brought up that crane operators in NYC can make $400K–$500K a year. Yes, a year. I figured I’d confirm that before writing this post, and it appears to be accurate.

At first glance one may think this is outrageous, or perhaps that they chose the wrong field. I mean, I enjoy being a DBA and a disaster geek, but I can’t say I’ve ever made $400K in one year! And for what? I mean, you lift things up and put them down. Right?

Let me come back to that.

So, last night, I got paid quite a tidy bundle (but not nearly that much) for literally logging into a client computer, opening up VisualCron, clicking on a task, and saying, “disable task”. On one hand, it seemed ridiculous; not just because of what they were paying me, but because this process was the result of several meetings, more than one email, and a review process. All to say, “stop copying this file.”

But this file was part of a key backup process for a core part of the client’s business. I had initially set up an entire process to ensure that a backup was being copied from an AIX server in one datacenter to a local NAS and then to the remote datacenter. It is a bit more complex than it sounds. But it worked. And the loss of a timely backup would impact their ability to recover by hours if not days. This could potentially cost them hundreds of thousands of dollars, if not millions.

So the meetings and phone calls and emails weren’t just “which button should Greg click” but covered questions like, “do we have the backups we think we have?” “Are they getting to the right place(s)?” “Are they getting there in a timely fashion?” And even, “when we uncheck this, we need to make sure the process for the day is complete and we don’t break it.”
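For what it’s worth, the “are the backups where we think they are, and are they fresh?” part of those questions is easy to automate. This isn’t the client’s actual process, just a minimal sketch of that kind of check, with made-up paths and an assumed 24-hour freshness window:

```python
# Hypothetical sketch of "do we have the backups we think we have, and did
# they get there in time?" Paths and the 24-hour window are assumptions.
import time
from pathlib import Path

MAX_AGE_HOURS = 24
BACKUP_COPIES = [
    Path("/mnt/local-nas/backups/core-system.bak"),    # copy on the local NAS
    Path("/mnt/remote-dc/backups/core-system.bak"),    # copy at the remote datacenter
]

def check_backups() -> bool:
    ok = True
    for copy in BACKUP_COPIES:
        if not copy.exists():
            print(f"MISSING: {copy}")
            ok = False
            continue
        age_hours = (time.time() - copy.stat().st_mtime) / 3600
        fresh = age_hours <= MAX_AGE_HOURS
        print(f"{'OK' if fresh else 'STALE'}: {copy} is {age_hours:.1f} hours old")
        ok = ok and fresh
    return ok

if __name__ == "__main__":
    raise SystemExit(0 if check_backups() else 1)
```

A check like that, run on a schedule, is cheap insurance that a copy job hasn’t quietly stopped working.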

So, me unchecking that button after hours, as much as it cost the company, was really the end of a complex chain of events designed to make sure that they didn’t risk losing a LOT of money if things went wrong. Call it an insurance payment if you will.

Those crane operators in NYC? They’re not simply lifting up a beam here and there and randomly placing it someplace. They’re maneuvering complex systems in tight spaces with heavy loads, where sudden gusts can set things swaying or spinning and a single mistake can do thousands of dollars in damage or even kill people.

It’s not so much what they’re being paid to do, as much as how much they’re being paid to avoid the cost of a mistake. I wasn’t paid just to unclick a button. I was paid (as were the others in the meetings) to make sure it was the right button, at the right time, and that it wouldn’t cost even more.

Sometimes we’re not paid for what we do, as much as we’re paid for what we’re not doing.


Mistakes were made

I generally avoid Reddit; I figure I have enough things in my life sucking up my time. But from time to time a link comes across my screen that I find interesting. This is one of them.

The user accidentally deleted a production database. Now, I think we can all agree that deleting a database in production is a “bad thing”. But, whose fault is this really?

Yes, one could argue the employee should have been more careful, but let’s back up.

The respondents in the thread raise several good points.

  • Why were the values in the documentation ones that pointed to a LIVE, production database? Why not point to a dev copy or, even better, one that doesn’t really exist (see the sketch after this list)? They expect the person to update the parameters anyway, so the worst case, if someone leaves the fake values in, is that nothing happens.
  • Why didn’t Production have backups? This is a completely separate question, but a very important one!
  • Why fire him? As many pointed out, he had just learned a VERY valuable lesson, and taught the company a very valuable lesson too!
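On that first point, the fix is nearly free. Here’s a hypothetical example of what sample values in documentation could look like instead; the names are invented, and the .invalid TLD is reserved (RFC 2606) so it can never resolve to a real host:

```python
# Hypothetical sample config for documentation. Copying it verbatim fails
# safely: .invalid is a reserved TLD and can never resolve, so nothing real
# gets touched until the reader supplies their own values.
EXAMPLE_DB_CONFIG = {
    "host": "db.example.invalid",
    "port": 5432,
    "database": "changeme_dev",
    "user": "changeme",
    "password": "changeme",
}
```

Anyone who runs the example verbatim gets a connection error, not a deleted production database.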

I’ll admit, I did something similar in my career at one of my employers. Except I wasn’t an intern; I was the Director of IT, and my goal in fact WAS to do something on the live database. The mistake I made was a minor one in execution (I reversed the direction of an arrow in the GUI before hitting the Execute button) but disastrous in terms of impact. And of course there wasn’t a recent enough backup.

I wasn’t fired for that. I did develop and enforce our change control documents after that and always ensured, as much as possible, that we had adequate backups. (Later in my career, a bigger muckup did get me… “given the opportunities to apply my skills elsewhere”, but there were other factors involved and I arguably wasn’t fired JUST for the initial issue.)

As the Director of IT, I made a point of telling my employees that story. And I explained to them that I expected them to make mistakes. If they didn’t, they probably weren’t trying hard enough. But I told them the two things I wouldn’t accept were lying about a mistake (trying to cover it up, blame others, etc.) and repeatedly making the same mistake.

I wrote in an earlier post that mistakes were avoidable. But as I pointed out, it’s actually more complex than that. Some mistakes are avoidable. Or, at least, they can be managed. For example, it is unfortunately likely that at some point, someone, somewhere, will munge production data. Perhaps they won’t delete it all, or perhaps they’ll make a White Ford Taurus type of mistake, but it will happen. So you have safeguards in place. First, limit the number of people in a position to make such a mistake. Second, have adequate backups. There are probably other steps you can take to reduce the chance of error and mitigate it when it does eventually happen. Work on those.

But don’t fire the person who just learned a valuable lesson. They’re the one that is least likely to make that mistake again.  Me, I’d probably fire the CTO for not having backups and for having production values in documentation like that AND for firing the new guy.

Small Disasters

Today was an interesting confluence of events. I was exchanging emails with an associate who is in the middle of getting a Master’s in Disaster Management and we were talking about scale and scope of disasters.

At about the same time I was monitoring email from one of my clients. The thread started out with a fairly minor report: Viewpoint Drive – Water Main Break. Not a huge, earth-shattering disaster. It was simply a notice that there was a water-main break in a nearby road, asking people to let management know if they noticed issues.

Within an hour there was a follow-up email stating that there was no longer adequate water pressure in the building and that folks should go home and finish their workday there. Furthermore, employees were told that for the next day the company was securing water bottles for drinking water and would be bringing in portable toilets.

Now, when people think about disasters, often they think about fires and other things that might destroy a building. But, that’s pretty rare.  It’s the other things that companies don’t necessarily plan for. Your company may have adequate backups of all its servers (but are you sure?) but does it have a plan for not having water?

I’ve worked with managers who have basically said, “eh, we can work around that.” Truth is, legally in most cases they can’t. If the building doesn’t have potable water and working sanitation facilities many municipalities won’t allow it to be occupied.

So does your company have a plan? Are the people who can authorize expenditures in the loop? Who is going to declare a disaster and put the plan into motion? Who will sign for the porta-potties when they show up? These are some of the things you have to think about.

So disasters are about more than just a good set of backups. Sometimes it’s about the toilets. Think about that.


Who’s Flying the Plane

I mentioned in an earlier post my interest in plane crashes. I had been toying with a presentation based on this concept for quite a while.

A little over a month ago, at the local SQL Server user group here in Albany, I offered to present at the February meeting. I gave them a choice of topics: a talk on Entity Framework and how its defaults can be bad for performance, or a talk on plane crashes and what IT can learn from them. They chose the latter. I guess plane crashes are more exciting than a dry talk on EF.

In any event, the core of the presentation is based on the two plane crashes mentioned in the earlier post: Eastern Airlines Flight 401, the L-1011 crash in Florida in 1972, and US Airways Flight 1549, the Miracle on the Hudson in 2009.

I don’t want to reproduce the entire talk here (in part because I’m hoping to present it elsewhere) but I want to highlight one slide:

Flight 401 vs 1549

  • Flight 401 – Perfectly good aircraft
  • Flight 1549 – About as bad as it gets
  • Flight 401 – 101 Fatalities/75 Survivors
  • Flight 1549 – 0 Fatalities

Flight 401 had a burned-out nose gear indicator light and crashed.

Flight 1549 had two non-functional engines and everyone got off safely.

The difference was good communications, planning, and a focus at all times on who was actually flying the airplane.

Think about this the next time you’re in a crisis. Are you communicating well? How is your planning? And is someone actually focused on making sure things don’t get worse because you’re focusing on the wrong problem? I touch upon that here when I talk about driving.

The moral: always make sure someone is “flying the plane”.

On Call

I want to pass on a video I’ve finally gotten around to watching:

Dave O’Conner speaks

I’ve managed a number of on-call teams to various levels of success. One point I’d add that makes a difference is good buy-in from above.

He addresses several good points, most of which I fully agree with and have, at various times, adopted at my various jobs.

One thing he mentions is availability. Too often folks claim they need 99.999% uptime. My question has often been “why?”, followed by, “Are you willing to pay for that?” Often the why boils down to “umm… because…” and the answer on paying for it was “no”, at least once they realized the true cost.
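The arithmetic behind that true cost is worth doing out loud, since each extra nine cuts the allowed downtime by a factor of ten. A quick back-of-the-envelope sketch, not tied to any particular monitoring tool:

```python
# Allowed downtime per year for each level of "nines."
MINUTES_PER_YEAR = 365 * 24 * 60

for uptime in (0.99, 0.999, 0.9999, 0.99999):
    allowed = MINUTES_PER_YEAR * (1 - uptime)
    print(f"{uptime:.3%} uptime allows {allowed:.1f} minutes of downtime per year")
```

Five nines works out to a little over five minutes of downtime a year, which is usually where the conversation ends.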

I also had a rule that I sometimes used: “If there was no possible response or no response necessary, don’t bother alerting!”.

An example might be traffic flow. I’ve seen setups where, if the traffic exceeds a certain threshold once in, say, a one-hour period (assume monitoring every 5 seconds), a page would go out. Why? By the time you respond, it’s gone and there’s nothing to do.

A far better response is to automate it such that if it happens more than X times in Y minutes, THEN send an alert.
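That “more than X times in Y minutes” logic is only a few lines to implement. Here’s a minimal sketch; the window, the threshold, and what “send an alert” actually does are all placeholders for whatever your monitoring system provides:

```python
# Minimal sketch of "alert only if it happens more than X times in Y minutes."
# The window and threshold below are placeholder values.
import time
from collections import deque

WINDOW_SECONDS = 10 * 60   # Y: look at the last 10 minutes
MAX_EVENTS = 5             # X: tolerate up to 5 spikes in that window

events = deque()

def record_event(now: float | None = None) -> bool:
    """Record one spike; return True only when an alert should go out."""
    now = time.time() if now is None else now
    events.append(now)
    while events and now - events[0] > WINDOW_SECONDS:
        events.popleft()               # drop spikes that have aged out of the window
    return len(events) > MAX_EVENTS
```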

In some cases, simply retrying works.  In the SQL world I’ve seen re-index jobs fail due to locking or other issues.  I like my sleep.  So I set up most of my jobs to retry at least once on failure.

Then, later, I’ll review the logs. If I see a constant stream of retries, I’ll schedule time to fix it.

At one client, we had an issue where a job would randomly fail maybe once a month.  They would page someone about it, who would rerun the job and it would succeed.

I looked at the history and realized that simply putting in a delay of about 5 minutes on a failure and then retrying would reduce the number of times someone had to be called from about once a month to once every 3 years or so. Fifteen minutes of reviewing the problem during a normal 9-5 timeframe and 5 minutes of checking the math and implementing the fix meant the on-call person could get more sleep every month. A real win.
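In SQL Server Agent, that kind of fix is essentially what the retry attempts and retry interval settings on a job step give you. Outside of Agent, the same idea is a small wrapper; this is just a sketch, with the job itself left as a placeholder:

```python
# Sketch of "wait a few minutes, retry once, and only then wake someone up."
# The 5-minute delay matches the fix described above; the job is a placeholder.
import time

RETRY_DELAY_SECONDS = 5 * 60

def run_with_retry(job, retries: int = 1) -> None:
    for attempt in range(retries + 1):
        try:
            job()
            return                      # success: nobody gets paged
        except Exception as exc:        # in practice, catch the job's real error type
            if attempt == retries:
                raise                   # still failing after the retry: page someone
            print(f"Job failed ({exc}); retrying in {RETRY_DELAY_SECONDS // 60} minutes")
            time.sleep(RETRY_DELAY_SECONDS)
```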

Moral of the story: not everything is critical, and if it is, handle it as if it is, not as an afterthought.

Git ‘r Done (part 2)

Someone recently forwarded the following article to me: “Get Shit Done: The Worst Startup Culture Ever”. Before reading it, I was inclined to disagree (see my previous post on getting stuff done).

But after reading it, I have to agree with the premise of the article, and point out that I think there are two different ways of looking at what “Get Stuff Done” can mean.

At my current assignment, a coworker and I were joking about how some people had so many letters after their names, like PMP or CAPM, PMI-SP and the like.

So we joked we needed some letters, and we settled on GSD – Get Stuff Done. At times on this particular project we seemed to be the only ones accomplishing much, or caring about accomplishing much. We had one person who was more concerned with the agenda of the meeting every day (yes, daily meetings to see why the project wasn’t getting done; with 5-6 people in that room, that’s 25 or more person-hours per week of discussing why things weren’t getting done).

So in the context of deciding what your goal is and actually GETTING IT DONE, I think “Get ‘r Done” is an important concept.

On the other hand, I have seen (and fallen prey to myself, both as a manager and as an employee) the “Get ‘r Done” attitude described in the above article.

That project I mentioned above never got done. It wasn’t for lack of effort on the part of myself and several others. It was, though, for lack of effort on the part of management. At one point they asked me what could be done to make sure the project could be completed on time. I gave them several examples of areas where they could put some pressure on another group to streamline some procedures.

I was basically told that wasn’t going to happen, and that I had to work harder and “get ‘r done”.  At this phase of the project, I needed 4-5 items from another group and the other group had a policy that each item needed a separate ticket.  Each ticket had to be done sequentially and could only be submitted when the previous ticket was closed out.  Oh, and their policy was 2 weeks per ticket.  Period.

So, by my math, that’s 8-10 weeks. That assumes every ticket goes smoothly, which had not been our experience with this other group.

The project due date was in 6 weeks.

So, I was being told to get things done in an impossible fashion. Talk about demotivating.

In general, I’ve been at my best as a manager when I’ve been given the tools to let my team get the job done. It may be buying them dinner one night as a morale boost. It may be making sure no extra work gets thrust upon them, or keeping certain other managers from trying to add to their work queue. In one case, it was buying a new NAS so we had enough storage space that we weren’t getting paged every night about disk-space issues. When properly motivated, people can move mountains and, better yet, can often do it in a normal work week.

So, if you want it done, make sure your team has the tools to do the job, isn’t being distracted, and isn’t being given reasons to have low morale. They’ll move mountains for you. But ask them to work harder without any of the above, and sooner or later you’ll find yourself without a team, and your boss simply asking you to work harder!

By the way, on that NAS, I think that $5K investment probably helped keep a key employee of mine from jumping ship for greener pastures.  That NAS was probably a better investment than if we had tried to offer him another $5K to keep him happy despite the lack of sleep from all the pages and other issues.

Moral: if you want them to “get ‘r done”, give them the tools they need, remove barriers, and keep morale up. They’ll get it done.