“Today is D-Day”

As I’m writing this, word has rocketed around the world that the 12 soccer players and their coach have been safely rescued from Tham Luang cave. We are awaiting word that all the rescuers themselves, including one of the doctors that had spent time with the boys since they were found, are still on their way out.

Unfortunately, one former Thai SEAL diver, Saman Kunan, who had rejoined his former teammates to help in the rescue, lost his life. This tragic outcome should not be forgotten, nor should it cast too large of a shadow on the amazing success.

What I want to talk about though is not the cave or the rescue operations, but the decision making progress. The title for this post comes from Narongsak Osottanakorn’s statement several days ago when they began the evacuation operations.

 

The term D-Day actually predates the famous Normandy landings that everyone associates it with. However, success of the Normandy landings and their importance in the ultimate outcome of WWII has forever cemented that phrase in history.

One of the hardest parts of any large scale operation like this is making the decision on whether to act. During the Apollo Program, they called them GO/NO GO decisions. Famously you can see this in the movie Apollo 13 where Gene Kranz goes around the room asking for a Go/No Go for launch. (it was pointed in a Tindellgram out before the Apollo 11 landing, that the call after the Eagle landed should be changed to Stay/No Stay – so there was no confusion on if they were “go to stay” or “go to leave”.)

While I’ve never been Flight Commander for a lunar mission, nor a Supreme Allied Commander for a European invasion, I have had to make life or death decisions on much smaller operations. A huge issue is not knowing the outcome. It’s like walking into a casino. If you knew you were always going to win, it would be an easy decision on how to bet. But obviously that’s not possible. The best you can do is gather as much information as you can, gather the best people you can around you, trust them and then make the decision.

What compounds the decision making progress in many cases, and especially in cave rescue is the lack of communication and lack of information. It can be very frustrating to send rescuers into the cave and not know, sometimes for hours, what is going on. Compound this with what is sometimes intense media scrutiny (which was certainly present here with the entire world watching), and one can feel compelled to rush the decision making progress. It is hard, but generally necessary to resist this. In an incident I’m familiar with, I recall a photograph of the cave rescue expert advising rescue operations, standing in the rain, near the cave entrance waiting for the waters to come down so they could send search teams in.  Social media was blowing up with comments like, “they need to get divers in there now!” “Why aren’t the authorities doing anything?”  The fact is, the authorities were doing exactly what the cave rescue expert recommended; waiting for it to be safe enough to act. Once the waters came down, they could send people and find the trapped cavers.

The incident in Thailand is a perfect example of the confluence of these factors:

  • There was media pressure from around the world with people were asking why they were taking so long to begin rescuing the boys and once they did start to rescue them, why it took them three days. Offers and suggestions flowed in from around the world and varied from the absurd (one suggestion we received at the NCRC was the use of dolphins) to the unfortunately impractical (let’s just say Mr. Musk wasn’t the only one, nor the first, to suggest some sort of submarine or sealed bag).
  • There was always a lack of enough information. Even after the boys had been found, it could take hours to get information to the surface, or from the surface back to the players. This hinders the decision making process.
  • Finally of course are the unknowns:
    • When is the rain coming?
    • How much rain?
    • How will the boys react to being submerged?
    • What can they eat in their condition?

And finally, there is, in the back of the minds of folks making the decisions the fact that if the outcome turned tragic, everyone will second guess them.

Narongsak Osottanakorn and others had to weigh all the above with all the facts that they had, and the knowledge that they couldn’t have as much information as they might want and make life-impacting decisions. For this I have a great deal of respect for them and don’t envy them.

Fortunately, in this case, the decisions led to a successful outcome which is a huge relief to the families and the world.

For any operation, especially complex ones, such as this rescue, a moon landing or an invasion of the beaches of Normandy, the planning and decision making process is critically important and often over shadowed by the folks executing the operation. As important as Neil Armstrong, Buzz Aldrin and Michael Collins (who all to often gets overlooked, despite writing one of the better autobiographies of the Apollo program) were to Apollo 11, without the support of Gene Kranz, Steve Bales, and hundreds of others on the ground, they would have very likely had to abort their landing.

So, let’s not forget the people behind the scenes making the decisions.

 

The Thai Cave Rescue

“When does a cave rescue become a recovery?’ That was the question a friend of mine asked me online about a week ago. This was before the boys and their coach had been found in the Thai cave.

Before I continue, let me add a huge caveat: this is an ongoing dynamic situation and many of the details I mention here may already be based on inaccurate or outdated information. But that’s also part of the point I ultimately hope to make: plans have to evolve as more data is gathered.

My somewhat flippant answer was “when they’re dead.” This is a bit of dark humor answer but there was actually some reasoning behind it. Before I go on, let me say that at that point I actually still had a lot of hope and reason to believe they were still alive. I’m very glad to find that they were in fact found alive and relatively safe.

There’s a truth about cave rescue: caves are literally a black-hole of information. Until you find the people you’re searching for, you have very little information.  Sometimes it may be as little as, “They went into this cave and haven’t come out yet.” (Actually sometimes it can be even less than that, “We think they went into one of these caves but we’re not even sure about that.”)

So when it comes to rescue, two of the items we try to teach students when teaching cave rescue is to look for clues, and to try to establish communications. A clue might be a footprint or a food wrapper. It might be the smell of a sweaty caver wafting in a certain direction. A clue might be the sound of someone calling for help. And the ultimate clue of course is the caver themselves. But there are other clues we might look for: what equipment do we think they have? What experience do they have? What is the characteristics of the cave? These can all drive how we search and what decisions we make.

Going back to the Thai cave situation, based on the media reports (which should always be taken with a huge grain of salt) it appeared that the coach and boys probably knew enough to get above the flood level and that the cave temps were in the 80s (Fahrenheit).  These are two reasons I was hopeful. Honestly, had they not gotten above the flood zone, almost certainly we’d be talking about a tragedy instead. Had the cave been a typical northeast cave where the temps are in the 40s (F) I would have had a lot less hope.

Given the above details then, it was reasonable to believe the boys were still alive and to continue to treat the situation as a search and eventually rescue situation.  And fortunately, that’s the way it has turned out. What happens next is still open for speculation, but I’ll say don’t be surprised if they bring in gear and people and bivouac in place for weeks or even months until the water levels come down.

During the search process, apparently a lot of phone lines were laid into parts of the cave so that easier communications could be made with the surface. Now that they have found the cavers, I’d be shocked if some sort of realtime communications is not setup in short order. This will allow he incident commander to make better informed decisions and to be able to get the most accurate and up to date data.

So, let me relate this to IT and disasters. Typically a disaster will start with, “the server has crashed” or something similar. We have an idea of the problem, but again, we’re really in a black-hole of information at that moment. Did the server crash because a hard drive failed, or because someone kicked the power cord or something else?

The first thing we need to do is to get more information. And we may need to establish communications. We often take that for granted, but the truth is, often when a major disaster occurs, the first thing to go is good communications. Imagine that the crashed server is in a datacenter across the country. How can you find out what’s going on? Perhaps you call for hands on support. But what if the reason the server has crashed is because the datacenter is on fire? You may not be able to reach anyone!  You might need to call a friend in the same city and have them go over there.  Or you might even turn on the news to see if there’s anything on worth noting.

But the point is, you can’t react until you have more information. Once you start to have information, you can start to develop a reaction plan. But let’s take the above situation and imagine that you find your datacenter has in fact burned down. You might start to panic and think you need to order a new server.  You start to call up your CFO to ask her to let you buy some new hardware when suddenly you get a call from your tech in the remote. They tell you, “Yeah, the building burned down, but we got real lucky and our server was in an area that was undamaged and I’ve got it in the trunk of my car, what do you want me to do with it?”

Now your previous data has been invalidated and you have new information and have to develop a new plan.

This is the situation in Thailand right now. They’re continually getting new information and updating their plans as they go. And this is the way you need to handle you disasters, establish communications, gather data and create a plan and update your plan as the data changes. And don’t give up hope until you absolutely have to.

Swiss Cheese

This blog post will try to tie together several of my favorite things: Cheese, caving, and accidents.

I was making lunch the other day and I was looking at the stick of sliced Swiss cheese I had. I should note, I love Swiss cheese, especially with a good roast beef sandwich.

But first, an existential question.  “What is a cave?”

Oh, that’s easy, it’s a passage through rock in the ground.  In other words it’s the area where there’s no rock.  Great. Let’s start simple. I think we can agree if it’s dark and I can walk through it, it’s a cave. What if I have to crawl? Yeah, that’s still a cave. What if I have to shimmy through and can barely fit? Yeah, that’s still a cave. What if I can’t fit, but one of my much smaller friends can fit through? Yeah, that’s a cave. But what if the entire thing is too small for anyone to crawl through but small animals can? What if two rooms that are large enough for humans to be in are connected by a passage too tight for a human, but say you can shine a light through, or can make a “voice connection” and hear people at the other end? Is that still part of the cave? As an aside, humans have mapped over 190 miles of Jewel Cave (and more all the time, big shout out to my friends who are mapping it!) But airflow studies estimate that we’ve only mapped about 3-5% of it. Let that sink in. But, what if the other 95% is too small for a human to fit in. I don’t think anyone would not call that part of the cave.

But here’s the real question. So we’ve mapped the cave. We know where the passages (i.e. lack of rock) are.  We find a plug of mud and remove that.  We’ve made more cave! Yeah! But what if we remove ALL the rock around the existing passage. When does the cave disappear? I mean now we just have a lot more “absence of rock”.  But I think we’d agree at some point we no longer have a cave!

So back to Swiss cheese.  One of the distinguishing details of such cheese are the holes, or more properly named the eyes. Did you know there’s actual Federal guidelines on what can be called Swiss cheese. Ayup, you can’t simply have a cheese with eyes in it. So I guess Swiss cheese is sort of like a cave. We actually have to think about it to give it some definition we can agree on.  Take away all the cheese, eyes and all, and you have no more cheese and I’m quite sad.

But what about accidents? Well, there’s a model of risk analysis called the Swiss cheese model. Basically, very few accidents occur out of the blue or entirely without a relation to other factors. The idea is you have multiple slices of Swiss cheese and all the holes have to line up for the accident to occur. For example, in my own personal experience, years ago I came close to all the “pieces” of the cheese lining up; while driving through New Jersey, I came fairly close to hydroplaning off an exit ramp into the woods.  Let’s look at some of the slices of cheese that came into play.

  • I was tired. Had I been more awake I’d have been paying a bit more attention.
  • It was dark. I might have noticed exactly how wet the exit ramp was during daylight.
  • I was travelling too fast.
  • I had nearly missed the ramp, I might have been travelling slower (see above) had I noticed the ramp sooner.

The instant I hit the ramp, I knew I was in trouble. I think the ONE slice that didn’t line up was, experience. Had I been 20 years younger with less experience driving, I suspect I’d have ended up off the road. I was at the very edge of being able to brake and maneuver and I called upon all my years of experience to stay on the correct side of that edge. One thin slice of “cheese” saved me that night.

When one looks through accident reports, of almost any industry or activity, one can start to look for where the slices lined up and how any one could be changed. One reason I read the American Cave Accidents report when I receive it is to learn where the slices could have been moved so I can make sure I don’t line up my slices of cheese.

So, the question for you is where do your slices of cheese line up?

And other question is, what sort of cheese do you put on YOUR roast beef sandwich? And do you make sure your Swiss cheese eyes don’t line up so every bite is ensured a bit of cheese?

 

 

 

 

Teamwork

I spent the majority of last week with some of the greatest people I know: my fellow NCRC (National Cave Rescue Commission) instructors and students.  Let’s start by the obvious, it takes a strange twist of mind to be the sort of person that actually enjoys crawling through dark holes in the ground. Now take that same group of people and add  a sense of altruism and you’ve got folks willing to rescue others trapped in caves.

Let me interject in here by the way, that caving is actually a fantastically safe sport. When an accident in a cave makes the news, it’s because it’s so rare and generally so unique as to attract attention. The NSS puts out a report every 2 years detailing the accidents across the US. It’s well worth the read if you can find a copy. Now back to the rest of my blog post.

I have been one of the many instructors who teach the “Level 2” class. I love this level because students have gotten beyond the basics and are starting to learn the “why” we do things more than simply the “how”.  And we start to really work on their team leadership and teamwork skills (as I mentioned last week in my post about the lost Sked).

Good teamwork isn’t just a bunch of people working to solve a task. It’s them working together and often anticipating the needs.  Two events last week illustrated a failure and a success.  At one point, I was a mock patient and the class had been broken into two teams. The first one had to get me out of a tight crawl-way (packaged in a Sked) and up to an open area. From there another team had set up a haul system to get me to the top of a 60′ or so (I’ve never really measured it) block of rock.  This was towards the end of the week when the students (and instructors) are a bit tired and overwhelmed with everything they had learned.  We had allocated 90 minutes for the exercise.

During the first part, I just felt like something was off. Nothing serious, nothing I could put my finger on. But the magic the students had shown all week just wasn’t quite there.

And then there I was, 2 hours into the exercise, laying on top of the tight area, waiting to be hauled up. Someone on the lower team said that I was ready to go.  Someone at the top of the block said they were ready to go.  And then I sat for 2-3 minutes until finally things started happening again.  But for a few minutes the days of teamwork had fallen apart. They weren’t trying to anticipate needs or even work cooperatively.  It wasn’t a huge issue, but it did get a reaction from our oldest, most irascible Level 2 instructor.  I think the word ‘disappointed’ was used at least twice.  It might have been a bit harsh, but… it worked.

As the final exercise of the day, the instructors decided to have a bit of fun with the students. We hid the previously lost (and then found) Sked.  The instructor in charge then informed the students that their lost “patient” for this particular exercise was about 4′ tall, last seen wearing bright orange, and was afraid of the dark. As details were added, it suddenly dawned on them what we had in mind.

But they took the task seriously. They did an excellent search and when they were then told the “patient” wasn’t responsive to stimuli, and they couldn’t rule out a c-spine injury, therefore the Sked had to be packaged in another litter, they did so with gusto and honestly, one of the best packaging jobs I had seen all week. All this while laughing.

They took an absolutely silly scenario, laughed while doing it, yet exhibited amazing teamwork. They were back on their game!

Sometimes teams can start to fall apart, but reminding them of how good they can be and providing a bit of levity can help elevate them back to their best!

 

A Lost Sked

Not much time to write this week. I’m off in Alabama crawling around in the bowels of the Earth teaching cave rescue to a bunch of enthusiastic students. The level I teach focuses on teamwork. And sometimes you find teams forming in the most interesting ways.

Yesterday our focus was on some activities in a cave (this one known as Pettyjohn’s) that included a type of a litter known as a Sked. When packaged it’s about 9″ in diameter and 4′ tall. It’s packaged in a bright orange carrier. It’s hard to miss.

And yet, at dinner, the students were a bit frantic; they could not account for the Sked. After some discussion they determined it was most likely left in the cave.

As an instructor, I wasn’t overly concerned, I figured it would be found and if not, it’s part of the reason our organization has a budget for lost or broken equipment, even if it’s expensive.

That said, what was quite reassuring was that the students completely gelled as a team. There was no finger pointing, no casting blame. Instead, they figured out a plan, determined who would go back to look for it and when. In the end, the Sked was found and everyone was happy.

The moral is, sometimes an incident like this can turn into a group of individuals who are blaming everyone else, or it can turn a group into a team where everyone is sharing responsibility. In this case it was it was the latter and I’m quite pleased.

Sharking

The title refers to a term I had not given much thought to in years, if not perhaps decades. But first let me mention what prompted the memory.

This weekend my daughter was competing at the State Odyssey of the Mind competition in Binghamton, NY. While waiting for her team to compete, I noticed a member of one of the other teams walking around with a stuffed, cloth sharkfin pinned to the back of a sport jacket.

This reminded me of a t-shirt my mom made for me years back with a similar design.

So, you may be asking yourself, “why?” and perhaps asking “what’s the point of this particular blog post”.  I’ll endeavor to answer both. But first we have to jump back into the time machine and again go back to my days at RPI. The year is 1989 and I’m now helping out with the Student Orientation (SO) staff. We were a bunch of students who would return to RPI over the summer and help the incoming Freshman class get oriented while they visited RPI in prep coming in as students in the fall.

Back then, the ratio at RPI was pretty lopsided, it was 5 men for every one 1 women. This among other things lead to some women using the phrase, “The odds are good, but the goods are odd.” In a strictly mathematical sense this was in a way accurate, if a woman wanted to date, she had 5 men vying for her attention. The reality of course was much different. It meant that if a woman didn’t want to date, she still had 5 men vying for her attention. (Of course it was far more than that since things didn’t divvy up nearly as cleanly.)

This was a tough social environment and combine that with fairly geeky students who often didn’t develop good social skills in high school and you often ended up with a lot of awkward situations and honestly, some pretty bad behavior all around; hence the goods being odd.

And unfortunately, some SO staff weren’t immune from being problematic. We tried to self-police, but there were always the 1-2 men who would be extra friendly to the incoming women and like a shark swimming the waters, look for their easy prey. We called this sharking. We would look out for it among ourselves and try to stop anyone SO advisor we thought was doing it and if they were particular egregious, make sure they weren’t invited back the next year. But the problem definitely existed.

My mom, bless her heart made me a shirt with a shark fin on the back, not because I personally was a shark, or to mock the problem, but more to highlight the problem and help us be more self-aware.

So, this weekend I was reminded of sharking.

So why bring it up? Because, being a member of several communities, including IT savvy communities, caving, and others, I still see this as an ongoing problem; someone in a position of power or influence, preying upon the newcomers; often young women. Now it often can start out with the best of intentions and without the person meaning to. You see someone new, they ask for help. You decide to mentor them. You’re just being helpful, right? But then it becomes the extra friendly touch, the slight innuendo in a comment, the off-color joke or even the outright blatant consent violations.

Watch out for it. Don’t do it and if you catch others doing it, say something. Nip it in the bud. If you’re mentoring, mentor. Provide them with professional guidance and advice. Don’t use it as an opportunity to prey upon their naivete and lack of knowledge or experience. Remember, as a mentor, you are in a position of power and influence and so you should be like Spiderman and only use that power and influence for the greater good and to help them, not to help yourself.

And if you do for some reason find yourself slipping beyond the role of a mentor and your mentee also appears to be comfortable with this (hey, it does happen, we’re all human), then STOP BEING THEIR MENTOR.  Make it clear that you can’t do both. A mentor, by definition and nature, is a position of influence. Don’t mix that with relationships in a professional setting. Just don’t.

As many of you know, I love teaching, it’s a reason I’m a cave rescue instructor and a reason I teach at SQL Saturdays and at other events.  I encourage folks to teach and help mentor others.  But please, be aware of boundaries and keep it professional.

Oh and a final note, I’m not immune to my own follies and mistakes and if you ever catch me crossing a line, by all means call me out on it. I don’t want to be “that guy”.

 

Fail-safes

Dam it Jim, I’m a Doctor, not a civil engineer

I grew up near a small hydro-electric dam in CT. I was fascinated by it (and still am in many ways). One of the details I found interesting was that on top of this concrete structure they had what I later found are often called flashboards. These were 2x8s (perhaps a bit wider) running the length of the top of the dam, held in place by wooden supports.  The general idea was they increased the pooling depth by 8″ or so, but in the advent of a very heavy water flow or flood, they could be easily removed (in many cases removed simply by the force of the water itself).  They safely provided more water, but were designed in fact to fail (i.e. give away) in a safe and predictable manner.

This is an important detail that some designers of systems often don’t think about; how to fail. They spend so much time trying to PREVENT a failure, they don’t think about how the system will react in the EVENT of a failure. Properly designed systems assume that at some point failure IS an not only an option, it’s inevitable.

When I was first taught rigging for cave rescue, we were always taught “Have a mainline and a belay”.  The assumption is, that the system may fail. So, we spent a lot of time learning how to design a good belay system. The thinking has changed a bit these days, often we’re as likely to have TWO “mainlines” and switch between them, but the general concept is still the same, in the event of a failure EITHER line should be able to catch the load safely and be able to recover. (i.e. simply catching the fall but not being able to resume operations is insufficient.)

So, your systems. Do you think about failures and recovery?

Let me tell you about the one that prompted this post.  Years ago, for a client I built a log-shipping backup system for them. It uses SSH and other tools to get the files from their office to the corporate datacenter.  Because of the network setup, I can’t use the built-in SQL Server log-shipping copy commands.

But that’s not the real point. The real point is… “stuff happens”. Sometimes the network connection dies. Sometimes the copy hangs, or they reboot the server in the office in the middle of a copy, etc. Basically “things break”.

And, there’s another problem I have NOT been able to fix, that only started about 2 years ago (so for about 5 years it was not a problem.) Basically the SQL Server in the datacenter starts to have a memory leak and applying the log-files fails and I start to get errors.

Now, I HATE error emails. When this system fails, I can easily get like 60 an hour (every database, 4 times an hour plus a few other error emails). That’s annoying.

AND it was costing the customer every time I had to go in and fix things.

So, on the receiving side I setup a job to restart SQL Server and Agent every 12 hours (if we ever go into production we’ll have to solve the memory leak, but at this time we’ve decided it’s such a low priority as to not bother, and since it’s related to the log-shipping and if we failed over we’d be turning off log-shipping, it’s considered even less of an issue). This job comes in handy a bit later in the story.

Now, on the SENDING side, as I’ve said, sometimes the network would fail, they’d reboot in the middle of a copy or something random would make the copy job get “stuck”. This meant rather than simply failing, it would keep running, but not doing anything.

So, I eventually enabled a “deadman’s switch” in this job. If it runs for more than 12 hours, it will kill itself so that it can run normally again at the next scheduled time.

Now, here’s what often happens. The job will get stuck. I’ll start to get email alerts from the datacenter that it has been too long since logfiles have been applied. I’ll go in to the office server, kill the job and then manually run it. Then I’ll go into the datacenter, and make sure the jobs there are running.  It works and doesn’t take long. But, it takes time and I have to charge the customer.

So, this weekend…

the job on the office server got stuck. So I decided to test my failsafes/deadman switches.

I turned off SQL Agent in the datacenter, knowing that later that night my “cycle” job would turn it back on. This was simply so I wouldn’t get flooded with emails.

And, I left the stuck job in the office as is. I wanted to confirm the deadman’s switch would kick in and kill it and then restart it.

Sure enough later that day, the log files started flowing to the datacenter as expected.

Then a few hours later the SQL Agent in the datacenter started up again and log-shipping picked up where it left off.

So, basically I had an end to end test that when something breaks, on either end, the system can recover without human intervention. That’s pretty reassuring. I like knowing it’s that robust.

Failures Happen

And in this case… I’ve tested the system and it can handle them. That lets me sleep better at night.

Can your systems handle failure robustly?