Teamwork

I spent the majority of last week with some of the greatest people I know: my fellow NCRC (National Cave Rescue Commission) instructors and students. Let’s start with the obvious: it takes a strange twist of mind to be the sort of person who actually enjoys crawling through dark holes in the ground. Now take that same group of people, add a sense of altruism, and you’ve got folks willing to rescue others trapped in caves.

Let me interject here, by the way, that caving is actually a fantastically safe sport. When an accident in a cave makes the news, it’s because it’s so rare and generally so unique as to attract attention. The NSS puts out a report every 2 years detailing caving accidents across the US. It’s well worth the read if you can find a copy. Now back to the rest of my blog post.

I was one of the many instructors teaching the “Level 2” class. I love this level because the students have gotten beyond the basics and are starting to learn the “why” of what we do, not simply the “how.” And we start to really work on their team leadership and teamwork skills (as I mentioned last week in my post about the lost Sked).

Good teamwork isn’t just a bunch of people working to solve a task. It’s them working together, often anticipating each other’s needs. Two events last week illustrated a failure and a success. At one point, I was a mock patient and the class had been broken into two teams. The first had to get me out of a tight crawl-way (packaged in a Sked) and up to an open area. From there, the other team had set up a haul system to get me to the top of a 60′ or so (I’ve never really measured it) block of rock. This was towards the end of the week, when the students (and instructors) were a bit tired and overwhelmed with everything they had learned. We had allocated 90 minutes for the exercise.

During the first part, I just felt like something was off. Nothing serious, nothing I could put my finger on. But the magic the students had shown all week just wasn’t quite there.

And then there I was, 2 hours into the exercise, lying on top of the tight area, waiting to be hauled up. Someone on the lower team said that I was ready to go. Someone at the top of the block said they were ready to go. And then I sat for 2-3 minutes until finally things started happening again. For those few minutes, the week’s worth of teamwork had fallen apart. They weren’t anticipating needs or even working cooperatively. It wasn’t a huge issue, but it did get a reaction from our oldest, most irascible Level 2 instructor. I think the word ‘disappointed’ was used at least twice. It might have been a bit harsh, but… it worked.

As the final exercise of the day, the instructors decided to have a bit of fun with the students. We hid the previously lost (and then found) Sked.  The instructor in charge then informed the students that their lost “patient” for this particular exercise was about 4′ tall, last seen wearing bright orange, and was afraid of the dark. As details were added, it suddenly dawned on them what we had in mind.

But they took the task seriously. They did an excellent search, and when they were then told that the “patient” wasn’t responsive to stimuli and that a c-spine injury couldn’t be ruled out, so the Sked had to be packaged in another litter, they did so with gusto. Honestly, it was one of the best packaging jobs I had seen all week. All this while laughing.

They took an absolutely silly scenario, laughed while doing it, yet exhibited amazing teamwork. They were back on their game!

Sometimes teams can start to fall apart, but reminding them of how good they can be and providing a bit of levity can help elevate them back to their best!


A Lost Sked

Not much time to write this week. I’m off in Alabama crawling around in the bowels of the Earth teaching cave rescue to a bunch of enthusiastic students. The level I teach focuses on teamwork. And sometimes you find teams forming in the most interesting ways.

Yesterday our focus was on some activities in a cave (this one known as Pettyjohn’s) that included a type of litter known as a Sked. When packaged, it’s about 9″ in diameter and 4′ tall, and it rides in a bright orange carrier. It’s hard to miss.

And yet, at dinner, the students were a bit frantic; they could not account for the Sked. After some discussion they determined it was most likely left in the cave.

As an instructor, I wasn’t overly concerned. I figured it would be found, and if not, that’s part of the reason our organization has a budget for lost or broken equipment, even the expensive pieces.

That said, what was quite reassuring was that the students completely gelled as a team. There was no finger-pointing, no casting of blame. Instead, they figured out a plan and determined who would go back to look for it and when. In the end, the Sked was found and everyone was happy.

The moral is that an incident like this can turn a group into individuals who blame everyone else, or it can turn a group into a team where everyone shares responsibility. In this case it was the latter, and I’m quite pleased.

Legacy

“…the good is oft interred with their bones.”

Or, these days, lives on in the Internet. I never quite agreed with Shakespeare on this line. I think the good lives on beyond the grave.

In the book Lies My Teacher Told Me, author James Loewen talks about how certain African tribes divide people into three categories: the living, the sasha or living-dead, and the zamani or dead.

This was brought to mind yesterday while I was debugging an issue that turned out to be a bug in SQL Server 2016 SP2. Along the way, I needed to add a user to an SSIS setup. This had been a problem in the past, but I recalled I had used #SQLHELP on Twitter to ask the question and had gotten a great answer. So, a quick search later, I found the response I was looking for. The fully correct answer (since MSFT’s page leaves out a step) was available at: http://sqlsoldier.net/wp/sqlserver/howdoigrantaccesspermissionsforssistousers
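For anyone who lands here with the same problem, the linked post is the authoritative answer. If memory serves, part of the fix lives in DCOM configuration on the Windows side, which can’t be done from T-SQL at all, and part of it is membership in the SSIS roles in msdb. As a rough sketch of that second half only, with DOMAIN\ssis_user as a made-up placeholder:

    -- Sketch only: map the login into msdb and grant a built-in SSIS role.
    -- DOMAIN\ssis_user is a placeholder; substitute the real login.
    USE msdb;
    GO
    CREATE USER [DOMAIN\ssis_user] FOR LOGIN [DOMAIN\ssis_user];
    GO
    -- db_ssisoperator members can view and execute packages stored in msdb;
    -- db_ssisltduser and db_ssisadmin grant progressively broader rights.
    ALTER ROLE [db_ssisoperator] ADD MEMBER [DOMAIN\ssis_user];
    GO

Again, that’s only a sketch, not the full answer; the step MSFT’s page omits is exactly why the linked post is worth keeping around.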

Now, many of my readers won’t recognize the name, but some will: @SQLSoldier, a member of the #SQLFamily who passed away recently. At the time of his passing, I had forgotten that he had reached out to help me last year. The search yesterday brought it back to me. I never had the honor of meeting Mr. Davis in person, but I know many others spoke highly of him. It was comforting to know that even months later his legacy was still helping me (and presumably others).

After thinking about that, I got to thinking about my dad. Soon after he passed in 2015, I picked up the hefty Milwaukee right-angle drill that had been his and was now mine. I was working on the addition (which he had helped design before his death) that has since become my office. I had always liked that particular tool; it has a certain heft and power to it. At the time, with his death so close at hand, using it was a form of grief therapy for me. I had often used it in my youth, helping him out with various construction projects. To this day, I’ll pick up one of the tools I inherited, or start a house project using the skills he taught me, and realize: he’s sasha, living-dead. He still lives on in me.

SQLSoldier is also living-dead; he’s very real in the hearts and minds of those who knew him, and his legacy lives on, still helping others such as myself.

As I enter my second half-century, I wonder more and more what my legacy will be. I hope that when I’m sasha, my legacy can still help others.

And with that, I will conclude with a scene from one of my favorite actors in one of my favorite movies:  “What will your verse be?”


RCA or “get it running!”

How often have any of us resorted to fixing a server issue by simply rebooting the server? Yes, we’re all friends here; you can raise your hands. Don’t be shy. We all know we’ve done it at some point.

I ask the question because of a recent tweet I saw with the hashtag #sqlhelp where Allan Hirt made a great comment:

Finding root cause is nice, but my goal first and foremost is to get back up and running quickly. Uptime > root cause more often than not.

This got me thinking: when is this true, and when is it not? And I think the answer ends up being the classic DBA answer, “it depends.”

I’m going to pick two well-studied disasters that we’re probably all familiar with. But first we need some criteria. In my book IT Disaster Response: Lessons Learned in the Field, I used the definition:

Disaster: An unplanned interruption in business that has an adverse impact on finances or other resources.

Let’s go with that. It’s pretty broad, but it’s a starting point. Now let’s ignore minor disasters like the ones I mention in the book, such as the check printer running out of toner or paper on payroll day. Let’s stick with the big ones: the ones that bring production to a halt and cost us real money. And we’re not going to restrict ourselves to IT or databases, but we’ll come back to that.

The first example I’m going to use is the Challenger disaster. I would highly recommend folks read Diane Vaughan’s seminal work, The Challenger Launch Decision: Risky Technology, Culture, and Deviance at NASA. That said, we all know that when this occurred, NASA did a complete stand-down of all shuttle flights until a full RCA was complete and many changes were made to the program.

On the other hand, after the famous Miracle on the Hudson, airlines did not stop flying. But this doesn’t mean an RCA wasn’t done. It in fact was, just well after the incident.

So, back to making that decision. For NASA, it was an easy one. Shuttle flights were occurring only every few months, and other than delaying some satellite launches (which ironically may have led to issues with the Galileo probe’s antenna), there wasn’t much reason to fly immediately afterwards. Also, while the broad strokes were known, i.e. something had caused a burn-through of an SRB, it took months to determine all the details. So, in this case, NASA could and did stand down for as long as it took to rectify the issues.

In the case of the Miracle on the Hudson, the cause was known immediately. That said, even then an RCA was done to determine the degree of the damage, whether Sullenberger and Skiles had done the right thing, and what procedural changes needed to be made. For example, one item that came out of the post-landing analysis was that the engine-restart checklist wasn’t really designed for low-altitude failures such as the one they experienced.

Doing a full RCA of the bird strike on US Airways 1549 while stopping all other flights would have been an economic catastrophe. But it was more than simply that. It was clear, based on the millions of flights per year, that this was a very isolated incident; the exact scenario was unlikely to happen again. With Challenger, there had been only 24 previous flights, and ALL of them had experienced various issues, including blow-by of the primary O-rings and other issues with the SRBs.

So back to our servers. When can we just “get it running,” versus taking the downtime to do a complete RCA, versus other options?

I’d suggest one criterion is: “How often has this happened, compared to our uptime?”

If we’ve just brought a database online and within the first week it has crashed, I’m probably going to want to do more of an immediate RCA. If it’s been running for years and this is the first time the issue has come up, I’m probably going to just get it running again and not be as adamant about an immediate RCA. I will most likely try to do an RCA afterwards, but again, I may not push for it as hard.

If the problem starts to repeat itself, I’m more likely to push for some sort of immediate RCA the next time the problem occurs.

What about the seriousness of the problem? If I have a server that’s consistently running at 20% CPU, and every once in a while it leaps up to 100% CPU for a few seconds and then drops back to 20%, will I respond the same way as if it crashes and takes me 10 minutes to get back up? Maybe. Is it a web server for cat videos that I make a few hundred off of every month? Probably not. Is it a stock-trading server where those few seconds cost me thousands of dollars? Yes, then I almost certainly will be attempting an RCA of some sort.

Another factor would be: what’s involved in an RCA? Is it just a matter of copying some logs somewhere for later analysis, which will take a few seconds or minutes, or am I going to have to run a bunch of queries, collect data, and do other things that may keep the server offline for 30 minutes or more?
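To make the quick end of that spectrum concrete, here’s a minimal sketch of what “copying some logs” can look like on a SQL Server instance that’s unhealthy but still answering queries. The DBA_Evidence database and both table names are placeholders I’ve made up for illustration:

    -- Sketch only: snapshot the current error log and the running requests
    -- into tables before restarting, so there's evidence to dig through later.
    USE DBA_Evidence;  -- placeholder database for captured diagnostics
    GO
    CREATE TABLE dbo.ErrorLogSnapshot
    (
        LogDate     datetime,
        ProcessInfo nvarchar(64),
        LogText     nvarchar(max)
    );
    GO
    -- sp_readerrorlog 0 returns the current SQL Server error log.
    INSERT INTO dbo.ErrorLogSnapshot (LogDate, ProcessInfo, LogText)
    EXEC sp_readerrorlog 0;
    GO
    -- Capture what was executing at the moment of the problem.
    SELECT GETDATE() AS captured_at, r.*
    INTO dbo.RequestsSnapshot
    FROM sys.dm_exec_requests AS r;
    GO

That’s seconds of work, not the 30-minute variety, and it means a fuller RCA is still possible offline after you’ve rebooted and gotten everyone back to work.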

Ultimately, in most cases, it’s going to come down to balancing money and, in the most extreme cases, lives. Determining the root cause now may save money later, but it costs money now. Not doing an RCA now might save money now, but might cost money later. Some of it is a judgment call; some of it depends on factors like the ones above.

And yes, before anyone objects: often an RCA can still be done after getting things working again. Here I’m only touching upon the cases where it has to be done immediately, or evidence may be lost.

So, what are your criteria for when you do an RCA immediately versus getting things running as soon as you can? I’d love to hear them.

Photo by j zamora on Unsplash.