Failure is Required

Last week one of my readers, Derek Lyons, correctly called me out on some details in my post about Lock outs. Derek and I go back a long way, with a mutual interest in the space program. His background is in nuclear submarines, and some of the details of operations and procedures he’s shared with me over the years have been of great interest. The US nuclear submarine program is built around “procedures,” and since the adoption of its SUBSAFE program, the Navy has suffered only one hull loss, and that was the non-SUBSAFE-certified USS Scorpion.

The space program is also well known for its heavy reliance on procedures and attention to detail and safety. Out of the Apollo 13 incident we have the famous quote, “Failure is not an option,” attributed to Gene Kranz in the movie (though there’s no record of him saying it at the time).

Anyway, his comments got me thinking about failures in general.

And I’d argue that with certain activities and at a certain level, this is true. When it comes to bringing a crew home from the Moon, or launching nuclear missiles, or performing critical surgeries, failure is not an option.

But sometimes, not only is it an option, I’d say it’s almost a requirement. I was reminded of this at a small event I was asked to be a panelist at last week. It turned out there were three of us panelists and just two students from a local program that helps folks learn to code: AlbanyCanCode. The concept of agile development came up, and the fact that agile development basically relies on failing fast and early. For software development, failing fast really only costs you time. And agile proponents will argue that it actually saves you time and money, since you find your failures much earlier, meaning you spend less time going down the wrong path.
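To make the “fail fast” idea concrete in code terms, here’s a tiny sketch of my own (hypothetical names, not anything from the panel): check the cheap, decisive conditions up front so a doomed run fails in seconds instead of hours down the wrong path.

```python
# A minimal fail-fast sketch (hypothetical names, purely illustrative).
def load_settings(settings: dict) -> dict:
    """Validate up front and stop immediately if the run can't possibly succeed."""
    required = ("input_path", "output_path", "batch_size")
    missing = [key for key in required if key not in settings]
    if missing:
        # Failing here costs seconds; failing at the end of a long run costs hours.
        raise ValueError(f"Missing settings: {', '.join(missing)}")
    return settings
```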

But I’m going to shift gears here to an area that’s even more near and dear to my heart: cave rescue. At an overarching, one might say strategic, level, failure is not an option. We teach in the NCRC that our goal is to get the patient(s) out in as good or better shape than we found them, as quickly and safely as possible. In other words, if we end up killing a patient but get them out really quickly, that’s considered a failure; whereas if we take twice as long but get them out alive, that’s considered a success.

But how do we do that?  Where does failure come into play?

One of the first lessons I was taught by one of my mentors was to avoid “the mother of all discussions.” This lesson hit home during an incident in my Level 1 training here in New York. We had a mock patient in a Sked. Up to this point it had been walking passage through a stream with about 1″ of water. But we had hit a choke point where the main part of the ceiling came down to about 12″ above the floor of the passage. There was an alternative route that would involve lifting the patient up several feet, then over some boulders and through some narrow and low (but not 12″ low) passage, and then we’d be back to walking passage. Two others and I were near the head of the litter. At this point we had placed the litter on the ground (out of the water). We scouted ahead to see how far the low passage went and noticed it only went about a body length. A very short distance.

Meanwhile the rest of our party were back in the larger passage having the mother of all discussions. They were discussing whether we could drag the litter along the floor, lift it up to go high, or perhaps, for this part, even remove the patient from the litter and have them drag themselves a bit. There may have been other ideas too.

My two partners and I looked at each other, looked at the low passage, looked at the patient, shrugged our shoulders and dragged the patient through the low passage to the other side.

About 10 seconds later someone from the group having the mother of all discussions exclaimed, “Where’s the patient?”

“Over here, we got him through, now can we move on?”

They crawled through and we completed the exercise.

So, our decision was a success. But what if it had been a failure? What if we had realized that the patient’s nose was really 13″ above the floor in the 12″ passage? Simple: we’d have pulled the patient back out. Then we could have shut down the mother of all discussions and said, “We have to go high; we know for a fact the low passage won’t work.”

Failure here WAS an option, and by actually TRYING something, we were able to quickly succeed or fail and move on to the next option.

Now obviously one has to use judgment here. What if the water-filled passage had been 14″ deep? Then no, my partners and I certainly would NOT have tried to move the patient with just the three of us. But perhaps we might have convinced the group to try.

The point is, it can often be faster and easier to actually attempt a concept than it is to discuss it to death and consider every possibility.

Time and time again I’ve seen students in our classes fall into the mother of all discussions rather than actually attempt something. If they actually attempt something, they can learn very quickly whether it will work. If it works, great: the discussion can end and they can move on to the next challenge. If it doesn’t work, also great: they’ve narrowed down their options and can discuss the remaining options more intelligently (and then perhaps quickly iterate through those too).

So today’s takeaway is: don’t be afraid of failure. Embrace it. Enjoy it. Experience it. It will lead to learning. Just make sure you understand the price of failure. Failure may be an option, and is sometimes mandatory, but in other cases the old saw is true: failure is not an option, especially if failure means the loss of life.


Less than our Best

I’ve mentioned in the past that I participate a lot in SQL Saturday events and also teach cave rescue. These are ways I try to give back to at least two communities I’m a member of. I take this engagement very seriously, for two reasons.

The first, which is especially true when I teach cave rescue, is that I’m teaching critical skills that may someday put a life on the line. I can’t go into teaching these activities unprepared, or someone may get injured or even killed.

The second is that the audience deserves my best. In some cases they’ve paid good money to attend events I’m speaking or teaching at. In all cases, they’re taking some of their valuable time and giving it to me.

All the best SQL Saturday speakers and NCRC instructors I know generally feel the same way about their presentations. They want to give their best.

But here’s the ugly truth: Sometimes we’re not on our A game. There could be a variety of reasons:

  • We might be jet-lagged
  • We may have partied a bit too much last night (though for me this is not an issue; I was never much of a party animal, even when I was younger)
  • You might have lost your power and Internet the day before during the time you were going to practice and found yourself busy cutting up trees
  • A dozen other reasons

You’ll notice one of those became singular. Ayup, that was my excuse. At the SQL Saturday Albany event, due to unforeseen circumstances the day before, the time I had allocated to run through my presentation was spent removing trees from the road, clearing my phone line and trying to track down the cable company.

So, one of my presentations on Saturday was not up to the standard I would have liked it to be. And for that, to my audience, I apologize (and did so during the presentation).

But here’s the thing: the feedback I received was still extremely positive. In fact, the only non-positive feedback was very constructive criticism that would have been valid even had I been as prepared as I would have liked!

I guess the truth is, sometimes we hold ourselves to a higher standard than the audience does. And I think we should.

PS: a little teaser, if all goes as planned, tomorrow look for something new on Red-Gate’s Simple Talk page.

A Lost Sked

Not much time to write this week. I’m off in Alabama crawling around in the bowels of the Earth teaching cave rescue to a bunch of enthusiastic students. The level I teach focuses on teamwork. And sometimes you find teams forming in the most interesting ways.

Yesterday our focus was on some activities in a cave (this one known as Pettyjohn’s) that included a type of litter known as a Sked. When packaged, it’s about 9″ in diameter and 4′ tall, and it comes in a bright orange carrier. It’s hard to miss.

And yet, at dinner, the students were a bit frantic; they could not account for the Sked. After some discussion they determined it was most likely left in the cave.

As an instructor, I wasn’t overly concerned. I figured it would be found, and if not, that’s part of the reason our organization has a budget for lost or broken equipment, even if it’s expensive.

That said, what was quite reassuring was that the students completely gelled as a team. There was no finger pointing, no casting blame. Instead, they figured out a plan, determined who would go back to look for it and when. In the end, the Sked was found and everyone was happy.

The moral is, sometimes an incident like this can turn into a group of individuals blaming everyone else, or it can turn a group into a team where everyone shares responsibility. In this case it was the latter, and I’m quite pleased.

Mistakes were made

I generally avoid Reddit; I figure I have enough things in my life sucking up my time. But from time to time a link comes across my screen that I find interesting. This was one of them.

The user accidentally deleted a production database. Now, I think we can all agree that deleting a database in production is a “bad thing.” But whose fault is this, really?

Yes, one could argue the employee should have been more careful, but let’s back up.

The respondents in the thread raise several good points.

  • Why did the values in the documentation point to a LIVE, production database? Why not point to a dev copy or, better yet, one that doesn’t really exist? They expect the person to update the parameters anyway, so worst case, if the fake values get left in, nothing happens. (See the sketch after this list.)
  • Why didn’t Production have backups? This is a completely separate question, but a very important one!
  • Why fire him? As many pointed out, he had just learned a VERY valuable lesson, and taught the company a very valuable lesson too!
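On that first point, here’s a minimal sketch of what documentation defaults could look like (my own hypothetical values, not anything from the thread), so that a copy-and-paste mistake fails harmlessly instead of touching production:

```python
# Hypothetical example config for onboarding docs; every value is fake on purpose.
EXAMPLE_DB_CONFIG = {
    "host": "db.example.invalid",   # the .invalid TLD is reserved and never resolves
    "port": 5432,
    "database": "onboarding_sandbox",
    "user": "REPLACE_ME",
    "password": "REPLACE_ME",
}
```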

I’ll admit, I did something similar in my career at one of my employers. Except I wasn’t an intern, I was the Director of IT, and my goal in fact WAS to do something on the live database. The mistake I made was a minor one in execution (I reversed the direction of an arrow in the GUI before hitting the Execute button) but disastrous in terms of impact. And of course there wasn’t a recent enough backup.

I wasn’t fired for that. I did develop and enforce our change control documents after that and always ensured, as much as possible, that we had adequate backups. (Later in my career, a larger muckup did get me… “given the opportunities to apply my skills elsewhere,” but there were other factors involved, and I arguably wasn’t fired JUST for the initial issue.)

As the Director of IT, I made a point of telling my employees that story. And I explained to them that I expected them to make mistakes; if they didn’t, they probably weren’t trying hard enough. But I told them the two things I wouldn’t accept were lying about a mistake (trying to cover it up, blaming others, etc.) and repeatedly making the same mistake.

I wrote in an earlier post that mistakes were avoidable. But as I pointed out, it’s actually more complex than that. Some mistakes are avoidable. Or at least they can be managed. For example, it is unfortunately likely that at some point, someone, somewhere, will munge production data. Perhaps they won’t delete it all, or perhaps they’ll make a White Ford Taurus type of mistake, but it will happen. So you have safeguards in place. First, limit the number of people in a position to make such a mistake. Second, have adequate backups. There are probably other steps you can take to reduce the chance of error and mitigate it when it does eventually happen. Work on those.
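As one example of such a safeguard, here’s a rough sketch (hypothetical names, my own illustration, not any specific tool) of a guard that refuses a destructive operation against production unless the caller explicitly confirms it and a sufficiently recent backup is on record:

```python
from datetime import datetime, timedelta, timezone

MAX_BACKUP_AGE = timedelta(hours=24)

def guard_destructive_op(target_env: str, last_backup_at: datetime | None, confirmed: bool) -> None:
    """Raise instead of proceeding when the safety conditions aren't met."""
    if target_env != "production":
        return  # dev/test targets: let people fail fast and learn cheaply
    if not confirmed:
        raise PermissionError("Destructive operations on production require explicit confirmation.")
    if last_backup_at is None or datetime.now(timezone.utc) - last_backup_at > MAX_BACKUP_AGE:
        raise RuntimeError("No sufficiently recent backup on record; refusing to proceed.")

# With no backup on record, this call raises rather than letting the operation run:
# guard_destructive_op("production", last_backup_at=None, confirmed=True)
```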

But don’t fire the person who just learned a valuable lesson. They’re the one least likely to make that mistake again. Me, I’d probably fire the CTO for not having backups, for having production values in documentation like that, AND for firing the new guy.