The Thai Cave Rescue

“When does a cave rescue become a recovery?” That was the question a friend of mine asked me online about a week ago. This was before the boys and their coach had been found in the Thai cave.

Before I continue, let me add a huge caveat: this is an ongoing dynamic situation and many of the details I mention here may already be based on inaccurate or outdated information. But that’s also part of the point I ultimately hope to make: plans have to evolve as more data is gathered.

My somewhat flippant answer was “when they’re dead.” That’s a bit of dark humor, but there was actually some reasoning behind it. Before I go on, let me say that at that point I actually still had a lot of hope and reason to believe they were still alive. I’m very glad that they were in fact found alive and relatively safe.

There’s a truth about cave rescue: caves are a black hole of information. Until you find the people you’re searching for, you have very little to go on. Sometimes it may be as little as, “They went into this cave and haven’t come out yet.” (Actually, sometimes it can be even less than that: “We think they went into one of these caves, but we’re not even sure about that.”)

So when it comes to rescue, two of the things we try to teach cave rescue students are to look for clues and to try to establish communications. A clue might be a footprint or a food wrapper. It might be the smell of a sweaty caver wafting in a certain direction. A clue might be the sound of someone calling for help. And the ultimate clue, of course, is the caver themselves. But there are other clues we might look for: What equipment do we think they have? What experience do they have? What are the characteristics of the cave? These can all drive how we search and what decisions we make.

Going back to the Thai cave situation, based on the media reports (which should always be taken with a huge grain of salt) it appeared that the coach and boys probably knew enough to get above the flood level and that the cave temps were in the 80s (Fahrenheit).  These are two reasons I was hopeful. Honestly, had they not gotten above the flood zone, almost certainly we’d be talking about a tragedy instead. Had the cave been a typical northeast cave where the temps are in the 40s (F) I would have had a lot less hope.

Given the above details then, it was reasonable to believe the boys were still alive and to continue to treat this as a search, and eventually a rescue, operation. And fortunately, that’s the way it has turned out. What happens next is still open for speculation, but don’t be surprised if they bring in gear and people and bivouac in place for weeks or even months until the water levels come down.

During the search process, apparently a lot of phone line was laid into parts of the cave so that communications with the surface would be easier. Now that they have found the cavers, I’d be shocked if some sort of real-time communications isn’t set up in short order. This will allow the incident commander to make better-informed decisions based on the most accurate and up-to-date data.

So, let me relate this to IT and disasters. Typically a disaster will start with “the server has crashed” or something similar. We have an idea of the problem, but again, we’re really in a black hole of information at that moment. Did the server crash because a hard drive failed, because someone kicked the power cord, or because of something else?

The first thing we need to do is get more information. And we may need to establish communications. We often take that for granted, but the truth is, when a major disaster occurs, often the first thing to go is good communications. Imagine that the crashed server is in a datacenter across the country. How can you find out what’s going on? Perhaps you call for hands-on support. But what if the reason the server has crashed is that the datacenter is on fire? You may not be able to reach anyone! You might need to call a friend in the same city and have them go over there. Or you might even turn on the news to see if there’s anything worth noting.

But the point is, you can’t react until you have more information. Once you start to have information, you can start to develop a reaction plan. But let’s take the above situation and imagine you find your datacenter has in fact burned down. You might start to panic and think you need to order a new server. You start to call up your CFO to ask her to let you buy some new hardware when suddenly you get a call from your tech at the remote site. They tell you, “Yeah, the building burned down, but we got real lucky and our server was in an area that was undamaged and I’ve got it in the trunk of my car. What do you want me to do with it?”

Now your previous data has been invalidated and you have new information and have to develop a new plan.

This is the situation in Thailand right now. They’re continually getting new information and updating their plans as they go. And this is the way you need to handle your disasters: establish communications, gather data, create a plan, and update the plan as the data changes. And don’t give up hope until you absolutely have to.

Swiss Cheese

This blog post will try to tie together several of my favorite things: cheese, caving, and accidents.

I was making lunch the other day and found myself looking at the stack of sliced Swiss cheese I had. I should note, I love Swiss cheese, especially with a good roast beef sandwich.

But first, an existential question.  “What is a cave?”

Oh, that’s easy: it’s a passage through rock in the ground. In other words, it’s the area where there’s no rock. Great. Let’s start simple. I think we can agree that if it’s dark and I can walk through it, it’s a cave. What if I have to crawl? Yeah, that’s still a cave. What if I have to shimmy through and can barely fit? Yeah, that’s still a cave. What if I can’t fit, but one of my much smaller friends can fit through? Yeah, that’s a cave. But what if the entire thing is too small for anyone to crawl through, but small animals can? What if two rooms that are large enough for humans to be in are connected by a passage too tight for a human, but you can shine a light through it, or make a “voice connection” and hear people at the other end? Is that still part of the cave? As an aside, humans have mapped over 190 miles of Jewel Cave (and more all the time; big shout-out to my friends who are mapping it!). But airflow studies estimate that we’ve only mapped about 3-5% of it. Let that sink in. But what if the other 95% is too small for a human to fit in? I don’t think anyone would say that’s not part of the cave.

But here’s the real question. So we’ve mapped the cave. We know where the passages (i.e., the lack of rock) are. We find a plug of mud and remove it. We’ve made more cave! Yeah! But what if we remove ALL the rock around the existing passage? When does the cave disappear? I mean, now we just have a lot more “absence of rock.” But I think we’d agree at some point we no longer have a cave!

So back to Swiss cheese. One of the distinguishing features of such cheese is the holes, or more properly, the eyes. Did you know there are actual federal guidelines on what can be called Swiss cheese? Ayup, you can’t simply have a cheese with eyes in it. So I guess Swiss cheese is sort of like a cave: we actually have to think about it to give it a definition we can agree on. Take away all the cheese, eyes and all, and you have no more cheese, and I’m quite sad.

But what about accidents? Well, there’s a model of risk analysis called the Swiss cheese model. Basically, very few accidents occur out of the blue or entirely unrelated to other factors. The idea is that you have multiple slices of Swiss cheese, and all the holes have to line up for the accident to occur. For example, in my own experience, years ago I came close to all the slices lining up: while driving through New Jersey, I came fairly close to hydroplaning off an exit ramp into the woods. Let’s look at some of the slices of cheese that came into play.

  • I was tired. Had I been more awake I’d have been paying a bit more attention.
  • It was dark. I might have noticed exactly how wet the exit ramp was during daylight.
  • I was travelling too fast.
  • I had nearly missed the ramp. Had I noticed it sooner, I might have been travelling slower (see above).

The instant I hit the ramp, I knew I was in trouble. I think the ONE slice that didn’t line up was experience. Had I been 20 years younger with less driving experience, I suspect I’d have ended up off the road. I was at the very edge of being able to brake and maneuver, and I called upon all my years of experience to stay on the correct side of that edge. One thin slice of “cheese” saved me that night.
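To make the model a bit more concrete, here is a minimal sketch of the arithmetic behind it, in Python. The per-slice probabilities are entirely made up for illustration; the point is simply that independent layers multiply, so every slice you keep intact drives the odds of all the holes lining up way down.

```python
# A minimal sketch of the arithmetic behind the Swiss cheese model.
# Each "slice" is a layer of defense with a made-up chance of having a
# hole in the wrong place; an accident needs every slice to fail at once.
from math import prod

slices = {
    "well rested":       0.10,  # chance fatigue goes unnoticed
    "daylight":          0.20,  # chance the wet ramp isn't spotted
    "reasonable speed":  0.05,  # chance of entering the ramp too fast
    "driver experience": 0.02,  # chance skill can't recover the skid
}

# With independent slices, the accident probability is the product of the
# individual hole probabilities -- tiny while every slice is in place.
print(f"All four slices in place: {prod(slices.values()):.5%}")

# Strip away three slices (tired, dark, too fast) and only experience is left.
print(f"Only experience left:     {slices['driver experience']:.2%}")
```

With all four slices in place the odds come out vanishingly small; knock three of them out, as I did that night, and you’re betting everything on that last thin slice.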

When one looks through accident reports, of almost any industry or activity, one can start to look for where the slices lined up and how any one could be changed. One reason I read the American Cave Accidents report when I receive it is to learn where the slices could have been moved so I can make sure I don’t line up my slices of cheese.

So, the question for you is where do your slices of cheese line up?

And the other question is, what sort of cheese do you put on YOUR roast beef sandwich? And do you make sure your Swiss cheese eyes don’t line up, so every bite is ensured a bit of cheese?


Alarming

So a recent trip to the ER (no, nothing serious, wasn’t me, thanks for asking) reminded me of a topic near and dear to my heart: Alarms and Alerts. What prompted this thought was the number of beeps, boops, and chirps I heard while there that no one responded to.  This leads to the question: Why have them, if no one responds to them?

I have a simple rule for alarms: “Don’t put an alert on something unless you have a response pre-planned for it.”

This is actually more complex than it sounds. And it can sometimes lead to seemingly illogical conclusions if you follow it in a reductio ad absurdum fashion.

Let’s start with an example of one alert I heard while sitting and waiting. It was a constant beep, about 90 times a minute. I soon tracked it down to a portable monitor attached to a patient who was soon to be moved upstairs. It was the person’s pulse. Besides being a possible HIPAA violation (I was now, in theory, privy to private medical information), it really served no purpose other than to annoy the patient and those around them. “But Greg, perhaps they were afraid the patient would suddenly go into cardiac arrest or something else would happen.” And I agree, but then let’s alert on a sudden change in condition, not on what was, at the time, a stable pulse. This beeping went on for over 10 minutes. And no one was monitoring it, other than the patient and us annoyed strangers.

So, there was an alert that apparently needed no response.
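To show what I mean by alerting on the change rather than the steady reading, here’s a small illustrative sketch in Python. The readings, thresholds, and function name are invented for the example, not taken from any real monitor.

```python
# Illustrative sketch only: alert on a sudden change in a reading, or on
# leaving a safe band, rather than beeping constantly for a stable value.
# The readings and thresholds are invented for the example.

def should_alert(readings, max_delta=20, low=40, high=130):
    """Return True if the latest reading jumped sharply or left the safe band."""
    if len(readings) < 2:
        return False
    latest, previous = readings[-1], readings[-2]
    sudden_change = abs(latest - previous) > max_delta
    out_of_band = not (low <= latest <= high)
    return sudden_change or out_of_band

stable_pulse = [88, 90, 89, 91, 90]   # steady reading: no alert, no beeping
sudden_drop  = [88, 90, 89, 91, 45]   # sharp change: this is worth an alert
print(should_alert(stable_pulse))     # False
print(should_alert(sudden_drop))      # True
```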

But let’s go to the other extreme. What about when an alert isn’t needed? Let’s say you’re driving your car and it throws a rod. (Yes, this happened to me once. Well, I wasn’t driving, my father was. It was his sister’s Volkswagen camper van.) I can tell you there is NO alert when such an event happens. But there’s no need for one. The vehicle stops. It won’t go. So an alert in that case is pretty superfluous.

But let’s tie this to IT. I’m going to give you an absurd example of when not to have an alert: When you run out of disk space.  Again, you might disagree. You’d think this would be the perfect time to have an alert. But go back to my rule. What if you have no plan for this? You’ve never gamed out the possibility.  Now, you’re out of disk space. You don’t have a plan. Does it really matter if you had an alert or not? If you can’t respond, the alert really hasn’t added anything.

The main lesson to take away from that example is, if you’re setting up an alert, make sure you do have a plan. (The other lesson, of course, is perhaps to have an alert BEFORE you run out of disk space!) The plan may be as simple as “delete as many files as I can.” But of course that only works if you have files to delete. Or it might be “add another filegroup to the database for now and then figure out the long-term solution during our next planned outage.” Or, in the worst case, it might be “update my resume.” But the point is, if you have an alert, have SOME plan for it.
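Here’s a small sketch, in Python, of what that rule can look like in practice: warn while there’s still free space left, and carry the pre-planned response along with the alert. The threshold, the plan text, and the send_alert stub are all hypothetical placeholders, not anyone’s actual monitoring setup.

```python
# A sketch of the "no alert without a pre-planned response" rule: warn
# BEFORE the disk is full, and carry the response plan along with the alert.
# The threshold, plan text, and send_alert stub are hypothetical placeholders.
import shutil

def check_disk(path="/", warn_at_pct_free=15):
    usage = shutil.disk_usage(path)           # point at the real data drive in practice
    free_pct = usage.free / usage.total * 100
    if free_pct < warn_at_pct_free:
        plan = (
            "1. Delete old backups/exports if any are safe to remove.\n"
            "2. If that's not enough, add a file on another volume to the filegroup.\n"
            "3. Schedule the long-term fix for the next planned outage."
        )
        send_alert(f"{path} is down to {free_pct:.1f}% free.\nResponse plan:\n{plan}")

def send_alert(message):
    # Stand-in for whatever email/pager integration is actually in use.
    print("ALERT:", message)

check_disk()
```

The specifics will differ everywhere; the point is that the check fires while there’s still room to act, and the plan travels with the alert.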

On the flip side, how many times do you have an alert that you look at and say, “oh yeah, we can ignore that, that always happens.”  Sure, that’s a plan, but honestly, ask yourself, do you need an alert in that case? Probably not. I hate getting woken up at 2:00 AM for an alert I don’t need to respond to.  So in this case if there is no plan because you don’t need a plan, eliminate the alert.

I could go on (and perhaps this will be a good topic for my next book), but I’ll add one last real-world case where people all too often ignore alerts: smoke and CO detectors, especially CO detectors. If you have a CO detector and it alerts, do NOT assume it’s faulty and unplug it. Respond. Somehow. Don’t automatically assume it’s a faulty battery, especially in the winter. If you have any doubt, please call the fire department. Trust me, they’d much rather respond to a call where you’re all alive and it’s a false CO alarm than show up and find the alarm going off, but everyone is now dead.

So the takeaway is, alerts are only useful if they generate a useful response.

Oh and because the inner child in me can’t resist: be a lert because the world needs more lerts! 🙂


A Lost Sked

Not much time to write this week. I’m off in Alabama crawling around in the bowels of the Earth teaching cave rescue to a bunch of enthusiastic students. The level I teach focuses on teamwork. And sometimes you find teams forming in the most interesting ways.

Yesterday our focus was on some activities in a cave (this one known as Pettyjohn’s) that included a type of litter known as a Sked. When packaged, it’s about 9″ in diameter and 4′ tall, and it comes in a bright orange carrier. It’s hard to miss.

And yet, at dinner, the students were a bit frantic; they could not account for the Sked. After some discussion they determined it was most likely left in the cave.

As an instructor, I wasn’t overly concerned; I figured it would be found, and if not, that’s part of the reason our organization has a budget for lost or broken equipment, even the expensive pieces.

That said, what was quite reassuring was that the students completely gelled as a team. There was no finger pointing, no casting blame. Instead, they figured out a plan, determined who would go back to look for it and when. In the end, the Sked was found and everyone was happy.

The moral is, sometimes an incident like this can turn a group into individuals who blame everyone else, or it can turn a group into a team where everyone shares responsibility. In this case it was the latter, and I’m quite pleased.

RCA or “get it running!”

How often have any of us resorted to fixing a server issue by simply rebooting the server?  Yes, we’re all friends here, you can raise your hands. Don’t be shy. We all know we’ve done it at some point.

I ask the question because of a recent tweet I saw with the hashtag #sqlhelp where Allan Hirt made a great comment:

Finding root cause is nice, but my goal first and foremost is to get back up and running quickly. Uptime > root cause more often than not.

This got me thinking, when is this true versus when is it not? And I think the answer ends up being the classic DBA answer, “it depends”.

I’m going to pick two well-studied disasters that we’re probably all familiar with. But we need some criteria. In my book IT Disaster Response: Lessons Learned in the Field I used the definition:

Disaster: An unplanned interruption in business that has an adverse impact on finances or other resources.

Let’s go with that.  It’s pretty broad, but it’s a starting point. Now let’s ignore minor disasters like I mention in the book, like the check printer running out of toner or paper on payroll day. Let’s stick with the big ones; the ones that bring production to a halt and cost us real money.  And we’re not going to restrict ourselves to IT or databases, but we’ll come back to that.

The first example I’m going to use is the Challenger disaster. I would highly recommend folks read Diane Vaughan’s seminal work, The Challenger Launch Decision: Risky Technology, Culture, and Deviance at NASA. That said, we all know that when this occurred, NASA did a complete stand-down of all shuttle flights until a full RCA was complete and many changes were made to the program.

On the other hand, after the famous Miracle on the Hudson, airlines did not stop flying. But this doesn’t mean an RCA wasn’t done. It in fact was; just well after the incident.

So, back to making that decision. For NASA, it was an easy one. Shuttle flights were occurring every few months, and other than delaying some satellite launches (which ironically may have led to issues with the Galileo probe’s antenna), there wasn’t much reason to fly immediately afterwards. Also, while the broad strokes were known, i.e., something caused a burn-through of the SRB, it took months to determine all the details. So, in this case, NASA could and did stand down for as long as it took to rectify the issues.

In the case of the Miracle on the Hudson, the cause was known immediately. That said, even then an RCA was done to determine the degree of the damage, whether Sullenberger and Skiles had done the right thing, and what procedural changes needed to be made. For example, one item that came out of the post-landing analysis was that the engine restart checklist wasn’t really designed for low-altitude failures such as the one they experienced.

Doing a full RCA of the bird strike on US Airways 1549 and stopping all other flights would have been an economic catastrophe. But it was more than simply that. It was clear, based on the millions of flights per year, that this was a very isolated incident. The exact scenario was unlikely to happen again. With Challenger, there had only been 24 previous flights, and ALL of them had experienced various issues, including blow-by of the primary O-ring and other issues with the SRBs.

So back to our servers. When can we just “get it running,” versus taking downtime to do a complete RCA, versus other options?

I’d suggest one criterion is, “How often has this happened compared to our uptime?”

If we’ve just brought a database online and within the first week it has crashed, I’m probably going to want to do more of an immediate RCA. If it’s been running for years and this is the first time the issue has come up, I’m probably going to just get it running again and not be as adamant about an immediate RCA. I will most likely try to do an RCA afterwards, but again, I may not push for it as hard.

If the problem starts to repeat itself, I’m more likely to push for some sort of immediate RCA the next time the problem occurs.

What about the seriousness of the problem? If I have a server that’s consistently running at 20% CPU, and every once in a while it leaps up to 100% CPU for a few seconds and then goes back to 20%, will I respond the same way as if it crashes and takes me 10 minutes to get back up? Maybe. Is it a web server for cat videos that I make a few hundred off of every month? Probably not. Is it a stock-trading server where those few seconds cost me thousands of dollars? Yes, then I almost certainly will be attempting an RCA of some sort.

Another factor would be, what’s involved in an RCA? Is it just a matter of copying some logs someplace for later analysis, which will take a few seconds or minutes, or am I going to have to run a bunch of queries, collect data, and do other tasks that may keep the server offline for 30 minutes or more?

Ultimately, in most cases, it’s going to come down to balancing money and, in the most extreme cases, lives. Determining the root cause now may save money later, but costs money now. On the other hand, not doing an RCA now might save money now, but could cost money later. Some of it is a judgment call; some of it depends on the factors you use to make your decision.

And yes, before anyone objects, I’m only very briefly touching upon the fact that often an RCA can still be done after getting things working again. I’m just touching upon the cases where it has to be done immediately or evidence may be lost.
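To make those trade-offs a little more concrete, here’s a rough sketch in Python of how the criteria above might be encoded. Every threshold, weight, and dollar figure in it is invented for illustration; it’s a thinking aid, not a real policy.

```python
# A rough thinking aid encoding the criteria above. Every threshold and
# dollar figure is invented for illustration; this is not a real policy.

def immediate_rca(uptime_days, failures, downtime_cost_per_min, rca_extra_min):
    """Lean toward an immediate RCA when failures are frequent relative to
    uptime, or when capturing evidence is cheap relative to downtime cost."""
    failure_rate = failures / max(uptime_days, 1)
    rca_cost = rca_extra_min * downtime_cost_per_min

    if failure_rate > 1 / 30:   # failing more than roughly once a month
        return True
    if rca_cost < 1000:         # grabbing logs is cheap; just capture them now
        return True
    return False                # get it running; chase the root cause afterwards

# A years-old server with its first crash, where the RCA adds 30 min of downtime:
print(immediate_rca(uptime_days=900, failures=1,
                    downtime_cost_per_min=50, rca_extra_min=30))   # False
# The same crash on a week-old deployment:
print(immediate_rca(uptime_days=7, failures=1,
                    downtime_cost_per_min=50, rca_extra_min=30))   # True
```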

So, what are your criteria for when you do an RCA immediately versus getting things running as soon as you can? I’d love to hear them.

Photo credit: j zamora on Unsplash.

Choices

“If you choose not to decide, you still have made a choice.” – Rush, “Freewill”

One of the things we believe makes us uniquely human is the concept of free will: that we can rise above our base instincts and make choices based on things other than pure instinct. While there’s some question whether that’s unique to humans, let’s stick with it for now.

Overall, we think choice is good. I can choose to eat cake for breakfast, or I can choose to eat a healthy breakfast. I can choose whether I want to get up early and exercise, or sleep in.

Sometimes we may think it’s hard to decide between two such things as in the examples above, but the truth is, it’s not that hard.

But what happens when the choices aren’t nearly as simple? What happens when we sit down with a menu of 3 items versus 30 or even 100? We can become paralyzed. With 3 options, our odds of making a “wrong” decision are only 66%. I say “wrong” because it’s often purely subjective and may not necessarily have much impact. But when we have 100 different things to choose from, the odds of a “wrong” decision go up to 99%. In other words, we’re faced with the fact that no matter what we do, we’re virtually guaranteed to make a “wrong” decision.

The Jam Experiment

One example of this effect was seen in what is often called the jam experiment. Simply put, when given a choice of 6 varieties of jam, consumers showed a bit less interest, but sales were higher. When a choice of 24 jams was presented, there was more interest, but sales actually dropped, significantly. People were apparently paralyzed by having too many choices.

Locally there’s an outdoor hamburger/hot-dog stand I like to frequent called Jack’s Drive In. People will stand in long lines, in all sorts of weather (especially on opening day, like this year, when the line was 20 people deep and with the windchill it was probably about 20°F!). One can quibble over the quality of the burgers and fries, but there’s no doubt they do a booming business. And part of the reason is that they have few choices and keep the line moving. This makes it far faster for people to order and faster for the staff to cook. With only a few choices, patrons don’t spend 5 minutes dithering over a menu.

Hint: If you’re ever in the area, simply tell them you want “Two burgers and a small french.” Second hint: No matter how hungry you are, don’t, as a former co-worker once did, try “6 Burgers and a large french.” You will regret that particular choice.

Choices to Europe

What brought this particular post on was all the choices I’m facing in trying to plan our family vacation. It’s rather simple really: “we want to visit Europe.” But I’m also hoping to speak at SQL Saturday in Manchester, UK. And we want to visit London (where my cousin lives) and Paris. And we can fly out of the NYC area. Or Boston. Or possibly other airports if the price were enough cheaper. So suddenly what one would hope is a simple thing becomes very complicated. And of course every airline has its own website design, which complicates things further.

Of course the simplest choice would be not to fly. The second simplest would be not to care about cost. Of course neither of those works. So, I’m stuck deciding between 24 types of jam. Wish me luck!

Getting Unlost

There’s a concept I teach people when I teach outdoor skills: if you’re going to be wrong, be confidently wrong. There are two reasons for this. For one, people are more likely to follow a leader who appears to be confident and knows what they’re doing. This can lead to better group dynamics and a better outcome.

The second reason, if you’re lost and for whatever reason you choose NOT to stay in one place (which, by the way, is often the best choice, especially for children), is that if you make a plan and stick to it, you’re far more likely to get unlost. This isn’t just wishful thinking.

Imagine you’re lost and you decide, “I’m going to hike north!” And you start to hike north, and after 15 minutes you decide, “Eh, maybe that was the wrong decision. I should hike east!” And you do this for another 15 minutes, and then you decide, “Nah, now that I think about it, south is much better.” 15 minutes later you decide you’re going the wrong way and west was the right way all along. An hour later, you’re back where you started. But if you had stuck with north the entire time, an hour later, depending on your pace, terrain, and other factors, you could be 2-4 miles further north. “So what?” you might ask. Well, take a look at a map of almost any part of the country. In most cases you’re less than 10 miles from some sort of road. If you’ve spent 3 hours hiking in a single direction, you’ve probably hit a road, or a powerline, or some other sign of civilization. (Note: this is NOT advice to wander in the woods if you get lost, or a promise this will work anyplace. There are definitely places in the US where this advice is bad advice.) Also, obviously, if you hit a gorge or other impassable geologic feature, you may have to change directions. Or you might get another clue (like hearing a chainsaw or engine or something human-caused in a specific direction).

Final Thoughts

So, if you’re going to make a choice, make it confidently. And don’t second-guess yourself until new, solid reasons come along.

So, keep your choices simple and stick to them.

And with that, I choose to stop typing now.


Fail-safes

Dam it Jim, I’m a Doctor, not a civil engineer

I grew up near a small hydroelectric dam in CT. I was fascinated by it (and still am in many ways). One of the details I found interesting was that on top of this concrete structure they had what I later learned are often called flashboards. These were 2x8s (perhaps a bit wider) running the length of the top of the dam, held in place by wooden supports. The general idea was that they increased the pooling depth by 8″ or so, but in the event of very heavy water flow or a flood, they could be easily removed (in many cases removed simply by the force of the water itself). They safely provided more water, but were designed, in fact, to fail (i.e., give way) in a safe and predictable manner.

This is an important detail that some designers of systems often don’t think about: how to fail. They spend so much time trying to PREVENT a failure that they don’t think about how the system will react in the EVENT of a failure. Properly designed systems assume that at some point failure is not only an option, it’s inevitable.

When I was first taught rigging for cave rescue, we were always taught, “Have a mainline and a belay.” The assumption is that the system may fail. So we spent a lot of time learning how to design a good belay system. The thinking has changed a bit these days; often we’re as likely to have TWO “mainlines” and switch between them. But the general concept is still the same: in the event of a failure, EITHER line should be able to catch the load safely and allow the operation to recover (i.e., simply catching the fall but not being able to resume operations is insufficient).

So, your systems. Do you think about failures and recovery?

Let me tell you about the one that prompted this post. Years ago I built a log-shipping backup system for a client. It uses SSH and other tools to get the files from their office to the corporate datacenter. Because of the network setup, I can’t use the built-in SQL Server log-shipping copy commands.

But that’s not the real point. The real point is… “stuff happens”. Sometimes the network connection dies. Sometimes the copy hangs, or they reboot the server in the office in the middle of a copy, etc. Basically “things break”.

And there’s another problem I have NOT been able to fix, one that only started about 2 years ago (so for about 5 years it was not a problem). Basically, the SQL Server instance in the datacenter starts to have a memory leak, applying the log files fails, and I start to get errors.

Now, I HATE error emails. When this system fails, I can easily get like 60 an hour (every database, 4 times an hour plus a few other error emails). That’s annoying.

AND it was costing the customer every time I had to go in and fix things.

So, on the receiving side I set up a job to restart SQL Server and the Agent every 12 hours. (If we ever go into production we’ll have to solve the memory leak, but at this time we’ve decided it’s such a low priority as to not be worth the bother; and since it’s related to the log-shipping, which we’d be turning off if we ever failed over, it’s considered even less of an issue.) This job comes in handy a bit later in the story.

Now, on the SENDING side, as I’ve said, sometimes the network would fail, they’d reboot in the middle of a copy, or something random would make the copy job get “stuck.” This meant that rather than simply failing, the job would keep running, but not doing anything.

So, I eventually enabled a “deadman’s switch” in this job. If it runs for more than 12 hours, it will kill itself so that it can run normally again at the next scheduled time.
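As a rough illustration of the idea (not the actual job, which runs under a scheduler against the client’s servers), here’s a sketch in Python of a copy step with a 12-hour deadman’s switch. The scp command, user, and paths are placeholders I made up for the example.

```python
# A sketch of the deadman's switch idea: give the copy step a hard time limit
# so a stuck run kills itself and the next scheduled run starts clean.
# The scp command and paths are placeholders, not the actual log-shipping job.
import subprocess

MAX_RUNTIME_SECONDS = 12 * 60 * 60   # 12 hours, matching the job described above

def run_copy_step():
    cmd = ["scp", "-r", "backups/", "user@datacenter:/log_shipping/incoming/"]
    try:
        subprocess.run(cmd, check=True, timeout=MAX_RUNTIME_SECONDS)
        print("Copy completed normally.")
    except subprocess.TimeoutExpired:
        # The deadman's switch fired: the hung copy was killed, so nothing
        # blocks the job from running normally at the next scheduled time.
        print("Copy exceeded 12 hours; killed it so the next run can proceed.")
    except subprocess.CalledProcessError as err:
        print(f"Copy failed outright (exit code {err.returncode}); will retry next run.")

if __name__ == "__main__":
    run_copy_step()
```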

Now, here’s what often happens. The job will get stuck. I’ll start to get email alerts from the datacenter that it has been too long since log files have been applied. I’ll go into the office server, kill the job, and then manually run it. Then I’ll go into the datacenter and make sure the jobs there are running. It works and doesn’t take long. But it takes time, and I have to charge the customer.

So, this weekend…

the job on the office server got stuck. So I decided to test my fail-safes/deadman’s switches.

I turned off SQL Agent in the datacenter, knowing that later that night my “cycle” job would turn it back on. This was simply so I wouldn’t get flooded with emails.

And, I left the stuck job in the office as is. I wanted to confirm the deadman’s switch would kick in and kill it and then restart it.

Sure enough later that day, the log files started flowing to the datacenter as expected.

Then a few hours later the SQL Agent in the datacenter started up again and log-shipping picked up where it left off.

So, basically I had an end-to-end test showing that when something breaks, on either end, the system can recover without human intervention. That’s pretty reassuring. I like knowing it’s that robust.

Failures Happen

And in this case… I’ve tested the system and it can handle them. That lets me sleep better at night.

Can your systems handle failure robustly?


SQL Data Partners Podcast

I’ve been keeping mum about this for a few weeks, but I’ve been excited about it. A couple of months ago, Carlos L Chacon from SQL Data Partners reached out to me about the possibility of being interviewed for their podcast. I immediately said yes. I mean, hey, it’s free marketing, right?  More seriously, I said yes because when a member of my #SQLFamily asks for help or to help, my immediate response is to say yes.  And of course it sounded like fun.  And boy was I right!

What had apparently caught Carlos’s attention was my book, IT Disaster Response: Lessons Learned in the Field. (Quick, go order a copy now; that’s what Amazon Prime is for, right? I’ll wait.)

Ok, back? Great. Anyway, the book is sort of a mash-up (to use the common lingo these days) of my interests in IT and cave rescue and plane crashes. I try to combine the skills, lessons learned, and tools from one area and apply them to other areas. I’ve been told it’s a good read. I like to think so, but I’ll let you judge for yourself. Anyway, back to the podcast.

So we recorded the podcast back in January, Carlos and his partner Steve Stedman on their end and I on mine. And I can tell you, it was a LOT of fun. You can (and should) listen to it here. I just re-listened to it myself to remind myself of what we covered. What I found remarkable was that as much as I was trying to tie it back to databases, Carlos and Steve seemed as interested, if not more, in cave rescue itself. I was OK with that. I personally think we covered a lot of ground in the 30 or so minutes we talked. And it was great, because this is exactly the sort of presentation, combined with my airplane-crash one and others, that I’m looking to build into a full-day onsite consult.

One detail I had forgotten about in the podcast was the #SQLFamily questions at the end. I still think I’d love to fly because it’s cool, but teleportation would be useful too.

So, Carlos and Steve, a huge thank you for asking me to participate and for letting me ramble on about one of my interests. As I understand it, my friend Ray Kim has a similar podcast with them coming up in the near future as well.

So the thought for the day is: think about how skills you learn elsewhere can be applied to your current responsibilities. It might surprise you, and you might do a better job.


What a Lucky Man He Was….

Being a child of the 60s, my musical tastes run the gamut from The Doors through Rachel Platten. In this case, the title of course comes from the ELP song.

Anyway, today’s post is a bit more reflective than some. Since yesterday I’ve been fighting what should be simple code. Years back I wrote a simple website to handle student information for the National Cave Rescue Commission (NCRC). The previous database manager had started with a database designed back in the 80s, and it was certainly NOT web friendly. So after some work I decided it was time to make it a bit more accessible to other folks. Fortunately, ASP.NET made much of the work fairly easy. It did what I wanted it to do. But now I’m struggling to figure out how to get and save profile information along with membership info. Long story short, due to a design decision years back, this isn’t as automatic and easy as I’d like. So I’ve been banging my head against the keyboard quite a bit over the last 24 hours. It’s quite frustrating, actually.

So, why do I consider myself lucky? Because I can take the time to work on this. Through years of hard work, education and honestly a bit of luck, I’m at the point where my wife and I can provide for our family to live a comfortable life and I can get away with working less than a full 40 hours a week. This is important to me as I get older. Quality of life becomes important.

I’ve talked about my involvement in cave rescue in the past and part of that is wearing of multiple hats. Some of which take more work than others.

I am for example:

  • Co-captain of the Albany-Schoharie Cave Rescue Team – This is VERY sporadic and really sort of unofficial and some years we will have no rescues at all locally.
  • I’m an Instructor with the NCRC – This means that generally a week plus a few days every year I take time out to travel, at my own expense, to a different part of the country and teach students the skills required to be effective in a cave rescue. For this, I get satisfaction; I don’t get paid, and like I say, I travel at my own expense. Locally I generally take a weekend or two a year to teach a weekend course.
  • I’m a Regional Coordinator with the NCRC – Among other things this means again I travel at my own expense once a year, generally to Georgia, to meet with my fellow coordinators so we can conduct the business of the NCRC. This may include approving curriculum created by others, reviewing budgets and other business.
  • Finally, I’m the Database Coordinator – It’s really a bit more of an IT Coordinator role, but the title is what it is. This means that not only do I develop the database and the front end, I’m also responsible for inputting data and running reports.

As you can see, this time adds up quickly. I’d say easily, in terms of total time, I dedicate a minimum of two weeks a year to the NCRC. But it’s worth it. I can literally point at rescues and say, “Those people are alive because of the work I and others do.” Sometimes it’s direct, like when I’m actually on a rescue; sometimes it’s indirect, when I know it’s a result of the training I and others have provided. But it’s worth it. I honestly can claim I work with some of the best people in the world. There are people here that I would literally put my life on the line for, in part because I know they’d do the same.

So, I’m lucky. I’m lucky that I can invest so much of my time in something I enjoy and love so much.  I’ll figure out this code and I’ll keep contributing, because it’s worth it, and because I’m lucky enough that I can.

How are you lucky?


Crane Operators

Talking online with friends the other day, someone brought up that crane operators in NYC can make $400-$500K a year. Yes, a year. I figured I’d confirm that before writing this post and it appears to be accurate.

At first glance one may think this is outrageous, or perhaps that they themselves chose the wrong field. I mean, I enjoy being a DBA and a disaster geek, but I can’t say I’ve ever made $400K in one year! And for what? I mean, you lift things up and put them down. Right?

Let me come back to that.

So, last night, I got paid quite a tidy bundle (though not nearly that much) for literally logging into a client computer, opening up VisualCron, clicking on a task, and saying, “disable task.” On one hand, it seemed ridiculous, not just because of what they were paying me, but because this process was the result of several meetings, more than one email, and a review process. All to say, “stop copying this file.”

But this file was part of a key backup process for a core part of the client’s business. I had initially set up an entire process to ensure that a backup was being copied from an AIX server in one datacenter to a local NAS and then to the remote datacenter. It’s a bit more complex than it sounds, but it worked. And the loss of a timely backup would set back their ability to recover by hours if not days. That could potentially cost them hundreds of thousands of dollars, if not millions.

So the meetings and phone calls and emails weren’t just “which button should Greg click,” but covered questions like, “Do we have the backups we think we have?” “Are they getting to the right place(s)?” “Are they getting there in a timely fashion?” And even, “When we uncheck this, we need to make sure the process for the day is complete and we don’t break it.”
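Those first few questions are also the kind of thing you can check continuously. Here’s a small hedged sketch in Python of what such a check might look like; the destination directory, file pattern, and 24-hour window are hypothetical placeholders, not the client’s actual setup.

```python
# A sketch of checking those questions automatically: is a backup present at
# the destination, and is it recent enough? The directory, file pattern, and
# 24-hour window are hypothetical placeholders, not the client's actual setup.
import time
from pathlib import Path

def newest_backup_age_hours(dest_dir="/mnt/nas/backups", pattern="*.bak"):
    files = list(Path(dest_dir).glob(pattern))
    if not files:
        return None                        # "Do we have the backups at all?"
    newest = max(f.stat().st_mtime for f in files)
    return (time.time() - newest) / 3600   # "Are they getting there in time?"

age = newest_backup_age_hours()
if age is None:
    print("No backup found at the destination; escalate now.")
elif age > 24:
    print(f"Newest backup is {age:.1f} hours old; not timely, investigate.")
else:
    print(f"Newest backup is {age:.1f} hours old; within the expected window.")
```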

So, me unchecking that button after hours, as much as it cost the company, was really the end of a complex chain of events designed to make sure they didn’t risk losing a LOT of money if things went wrong. Call it an insurance payment if you will.

Those crane operators in NYC? They’re not simply lifting up a beam here and there and randomly placing it someplace. They’re maneuvering complex systems in tight spaces with heavy loads, where sudden gusts can set things swaying or spinning and a single mistake can do thousands of dollars in damage or even kill people.

It’s not so much what they’re being paid to do as how much they’re being paid to avoid the cost of a mistake. I wasn’t paid just to unclick a button. I was paid (as were the others in the meetings) to make sure it was the right button, at the right time, and that clicking it wouldn’t cost even more.

Sometimes we’re not paid for what we do, as much as we’re paid for what we’re not doing.