On Call

I want to pass on a video I’ve finally gotten around to watching:

Dave O’Conner speaks

I’ve managed a number of on-call teams with varying levels of success. One point I’d add that makes a difference is good buy-in from above.

He makes several good points, most of which I fully agree with and have even adopted at one time or another at my various jobs.

One thing he mentions is availability.  Too often folks claim they need 99.999% uptime. My question has often been “why?” and then followed by, “Are you willing to pay for that?”  Often the why boils down to “umm.. because…” and the paying for it was “no”, at least once they realized the true cost.
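To put a number on that "why?", here is a minimal sketch that converts an availability target into the downtime it allows per year. The function name and the 365.25-day year are my own assumptions for illustration; the arithmetic is simply the unavailable fraction multiplied by the length of a year.

```python
# Assumption: a year is taken as 365.25 days for this back-of-the-envelope math.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * SECONDS_PER_YEAR / 60


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

At 99.999% the budget works out to a little over five minutes for the entire year, which is usually where the "are you willing to pay for that?" conversation gets interesting.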

I also had a rule that I sometimes used: “If there is no possible response, or no response is necessary, don’t bother alerting!”

An example might be traffic flow.  I’ve seen setups where if the traffic exceeds a certain threshold once in say a one hour period (assume monitoring every 5 seconds) a page would go out.  Why? By the time you respond it’s gone and there’s nothing to do.

A far better response is to automate it such that if it happens more than X times in Y minutes, THEN send an alert.
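The "more than X times in Y minutes" rule can be sketched as a small sliding-window counter. This is only an illustration, not any particular monitoring tool's API: the class name, its parameters, and the use of plain numeric timestamps in seconds are all hypothetical.

```python
from collections import deque


class ThresholdAlerter:
    """Alert only when breaches exceed max_events within window_seconds."""

    def __init__(self, max_events: int, window_seconds: float):
        self.max_events = max_events          # the "X" in "X times in Y minutes"
        self.window_seconds = window_seconds  # the "Y", expressed in seconds
        self._events = deque()

    def record(self, timestamp: float) -> bool:
        """Record one threshold breach; return True if an alert should go out."""
        self._events.append(timestamp)
        # Forget breaches that have aged out of the window.
        while self._events and timestamp - self._events[0] > self.window_seconds:
            self._events.popleft()
        return len(self._events) > self.max_events


if __name__ == "__main__":
    alerter = ThresholdAlerter(max_events=3, window_seconds=10 * 60)
    for t in (0, 100, 200, 300):
        print(t, alerter.record(t))
```

A single spike never pages anyone; only a sustained pattern does, which is the case where a human might actually be able to do something.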

In some cases, simply retrying works.  In the SQL world I’ve seen re-index jobs fail due to locking or other issues.  I like my sleep.  So I set up most of my jobs to retry at least once on failure.
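As a rough sketch of the idea (my real jobs were SQL jobs, so this Python wrapper is purely illustrative), a retry-on-failure helper might look like the following. The function name, the `retries` and `delay_seconds` parameters, and the injectable `sleep` are all assumptions made for the example.

```python
import time


def run_with_retry(job, retries=1, delay_seconds=0.0, sleep=time.sleep):
    """Run job(); on failure, wait and retry up to `retries` more times."""
    attempts = retries + 1
    for attempt in range(1, attempts + 1):
        try:
            return job()
        except Exception as exc:
            if attempt == attempts:
                # Out of retries: let the failure surface so someone is alerted.
                raise
            # Log the failure so it can be reviewed later during normal hours.
            print(f"attempt {attempt} failed ({exc}); retrying in {delay_seconds}s")
            sleep(delay_seconds)
```

The logging line matters as much as the retry: a retry that leaves no trace hides problems instead of deferring them.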

Then, later, I’ll review the logs. If I see a constant pattern of retries, I’ll schedule time to fix the underlying issue.

At one client, we had an issue where a job would randomly fail maybe once a month.  They would page someone about it, who would rerun the job and it would succeed.

I looked at the history and realized that simply adding a delay of about 5 minutes after a failure and then retrying would reduce the number of times someone had to be called from about once a month to once every 3 years or so.  Fifteen minutes of reviewing the problem during a normal 9-5 timeframe and 5 minutes of checking the math and implementing the fix meant the on-call person could get more sleep every month. A real win.
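For the curious, one way to reproduce that kind of estimate is sketched below. The assumptions are mine and are not spelled out above: the job runs once a day, it fails on roughly 1 run in 30, and the retry fails independently of the first attempt (plausible if the 5-minute delay is long enough for the lock or other transient condition to clear). Under those assumptions a page requires two failures in a row, so the page probability is the failure probability squared.

```python
def days_between_pages(failure_probability: float, retries: int,
                       runs_per_day: float = 1.0) -> float:
    """Expected days between pages if every attempt fails independently."""
    # A page goes out only when the first attempt AND every retry fail.
    page_probability = failure_probability ** (retries + 1)
    return 1 / (page_probability * runs_per_day)


if __name__ == "__main__":
    p = 1 / 30  # assumed: about one failure a month for a daily job
    print("no retry: a page every", round(days_between_pages(p, retries=0)), "days")
    years = days_between_pages(p, retries=1) / 365.25
    print("one retry: a page every", round(years, 1), "years")
```

With those inputs the interval between pages stretches from about a month to roughly two and a half years, the same ballpark as the "once every 3 years or so" figure. If the failures were correlated (say, a long-running lock), the real improvement would be smaller.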

Moral of the story: Not everything is critical, and if something is, treat it that way, not as an afterthought.

Git ‘r Done

It’s rare that I post items in such quick succession, but I’m trying to post a bit more often and these topics work together.

In my previous post I mentioned the group of people I work with on the NCRC Educational Committee.  I wanted to follow that up with a comment about a goal.

In cave rescue, our goal is to get the patient to the surface as quickly and safely as possible in as good or better shape than we found them.

Ultimately that goal should drive pretty much everything we do on a rescue.

Sometimes though, students fail to see it that way. On one hand it is amusing when we watch students take a simple problem and over-complicate it. Sometimes two instructors will look at each other and ask, “why are they doing it THIS way and not THAT way?” During training it’s easy to refocus them and remind them what the goal is. At the end of the week of training we have a mock rescue where the students are on their own. At this point, if they lose focus of the goal, they may take longer than they expect or wish to.

During one practice I took part in as a student, a discussion began about how to move the litter with the patient in it under a tight, low roof along a stream passage. After a minute or two of discussion, two of the other students and I looked at each other and realized the rest of the group was too focused on convincing one another that their way was the best way to move the patient. The entire time they were trying to “win” the conversation, which had apparently become their current goal, the patient wasn’t moving.

So, the three of us simply moved the patient quickly and safely to the other side of the obstruction. After about a minute, the conversation stopped and the folks on the other side of the discussion realized the patient they were arguing about moving had already been moved. Things improved from there.

Now, by no means should it sound like I’ve never lost focus (see my post The Hunger Games for an example of a potentially more dangerous situation where I definitely lost sight of the correct goal).

But this leads to a question: “How should the goal be accomplished?”

To give an example, perhaps I can build and operate a beautiful 4:1 haul system where every leg collapses the optimal amount and I can operate it with just 2 people.  Or, I can put up a 1:1 haul system that’s inelegant and requires 6-7 people to operate it.  Both will move the patient, but which one is “better?”

Well, honestly, “it depends”.  If I have plenty of extra people and I’m short on time, I’ll go with the 1:1 almost every time.  It’s simple and it works.  It can be set up in just a few minutes and requires very little equipment.

But what if I’m tight on people and I have the time?  Perhaps then the 4:1 is the proper solution.

This is where experience and judgement come into play.  Both systems “Git ‘r Done” and both can help me with my goal of getting the patient to daylight. And that is my goal in a rescue.  My goal in a rescue is NOT to build a beautiful 4:1.  My goal is to build a system that gets the patient out safely and quickly.  If a 4:1 will work best to accomplish the goal, I’ll do that.  If it won’t, I’ll forgo it, no matter how sweet and sexy it may seem to me.

We sometimes teach students a handy metric of two questions to ask themselves:

  1. Does it work?
  2. Is it safe?

A 4:1 that isn’t fully rigged when the patient arrives fails the first question, no matter how elegant it is or how well it may operate once it’s finally rigged.

On the other hand, if the 1:1 is fully rigged, but I don’t have enough folks to operate it, it also fails the first question.  However, if I have enough folks to operate it, I’m not going to start discussing how there might be a better way to rig it with fewer people.  Note that “is it the BEST or OPTIMAL solution” isn’t part of the metric.  In this case, it really doesn’t matter.

Keeping these questions in mind can often head off extra conversation (such as the example above of the patient not moving while folks were discussing the BEST way to move him).

So, when you face the task of solving a problem, especially one with time pressure and that is most likely a one-off, ask yourself if the solution you currently have is safe and if it works.  If you can answer yes to both, then Git ‘r done.