The Value of DR Testing

Posted on April 4, 2023 by Greg Moore

Just a short blog post today since I’m actually in the middle of a call with a client as we test our failover scenario.

Right now I’m calling it a success even though the SQL Server hasn’t come up yet.

Why am I calling it a success? Because we learned that our current plan has a serious gaping hole concerning how the iSCSI drives failover. Yes, technically we failed to failover as quickly as we expected.

But, we’ve learned that before this system went into production. So that’s a success. This raises our confidence level for next time.

In all honesty, we often learn more from our failures from our successes. For example, before NASA would allow SpaceX to fly a crew on Crew Dragon, they required several abort tests, one of which involved launching a Falcon 9 and then in mid-flight firing the Crew Dragon abort engines. This resulted in the destruction of the Falcon 9 (which was expected) but proved the abort plans worked. Note however that for Orion on Artemis, NASA has decided such a test is not necessary. The decision making process behind this particular decision is worthy of a blog of its own.

In any case, with the current DR test, we expect to have things finally failed over in the next hour or two. Then we’ll update our playbook and have a lot more confidence.

Moral of your story: test your DR. Assume things will go wrong the first time because they will, but far better to have that before you go to production. This is not the first time I’ve had a failover not go as planned, but prior to production.

“It’s Just a Simple Change”

Posted on April 27, 2021 by Greg Moore

How often have we heard those words? Or used them ourselves?

“Oh this is just a simple change, it won’t break a thing.” And then all hell breaks lose.

Yet, we also hear the reverse at times. “This is pretty complex, I’ll be surprised if it works the first time, or if it doesn’t break something.” And yet then nothing bad seems to happen.

We may observe this, but we don’t necessarily stop to think about the why. I’ve seen this happen a lot in IT, but honestly, I’ve seen this happen elsewhere and often when we read about accidents in areas such as caving, this also holds true.

I argue that in this case the perception is often true. Let me put in one caveat. There’s definitely a bias in our memory where we don’t recall all the times where simple things don’t break things, but the times it does, it really stands out.

The truth is, whenever we deal with complex systems, even simple changes aren’t so simple. But we assume they are and then are surprised when they have side effects. “Oh updating that path here won’t break anything. I only call it one place, and I’ll update that.” And you’re good. But what you didn’t realize was another developer liked your script, so made a copy and is using it for their own purposes and now their code breaks because of the new path. So your simple change isn’t so simple.

Contrast that to the complex change. I’m in the middle up refactoring a stored procedure. It’s complex. I suspect it’ll break something in production. But, honestly, it probably won’t. Not because I’m am awesome T-SQL developer, but, because of our paranoia, we’ll be testing this in UAT quite a bit. In other words, our paranoia drives our testing to a higher level.

I think it behooves us to treat even simple changes with more respect than we do and test them.

In the world of caving we use something called SRT – Single Rope Technique. This is the method we use for ascending and descending a rope. When ascending, if you put your gear on wrong at the bottom, generally there’s no real risk other than possible embarrassment. After all, you’re standing on the ground. But obviously a the top, it’s critical to put your equipment on correctly, lest your first step be your last. Similarly, we practice something known as a change-over; changing from ascending to descending, or descending to ascending while on rope. When changing from climbing to descending you want to make sure you do it correctly lest you find yourself descending at 9.8m/s^2. To prevent accidents, we ingrain in students “load and test your descent device before removing your other attachment point.” Basically, while you’re still secured to something at the top, or to your ascending devices if you’re partway up the rope, put your entire weight on your descent device and lower yourself 1-2″. If you succeed, great, then you can detach yourself from whatever you are attached to at the top, or remove your ascending devices. If somehow you’ve screwed something up and the descent device comes off the top or otherwise fails, you’ve got a backup.

Now, I will interject, getting on rope at the top of a pit, or a changeover is something an experienced caver will have done possibly 100s if not 1000s of times. It’s “a simple change”. Yet we still do the test because a single failure can be fatal. And I have in fact seen a person fail to properly test their descent device. And moreover, this wasn’t in a cave, or other dark or cramped space. It was in broad daylight on the edge of the RPI Student Union! This was about as simple as it could get! Fortunately he heard it start to fail and grabbed the concrete railing for dear life. In this particular case a failure most likely would not have been fatal, but would have caused serious injury.

So, despite having gotten on rope 100s of times myself, I ALWAYS test. It’s a simple change. But the test is also simple and there’s no reason to skip it.

The morale of the story, even your simple changes should be tested, lest you find they’re not so simple, or their failures aren’t so minor.

“Houston, we’re venting something into Space…”

Posted on January 5, 2021 by Greg Moore

This post is the result of several different thoughts running through my head combined with a couple of items I’ve seen on social media in the past few days. The first was a question posted to #SQLHelp on Twitter in regards to if a DBA came into a situation with a SQL Server in an unknown configuration what one would do. The second was a comment a friend made about how “it can’t get any worse” and several of us cheekily corrected him saying it can always get worse. And of course I’m still dealing with my server that died last week.

To the question of what to do with an unknown SQL Server, there were some good answers, but I chimed in saying my absolute first thing would be to make backups. Several folks had made good suggestions in regards to looking at system settings and possibly changing them, possibly re-indexing, etc. My point though was, all that could wait. If the server had been running up until now, while fixing those might be very helpful, the lack of fixing things would not make things worse. On the other hand, if there were no up to date backups and the server failed, the owner would be in a world of hurt. Now, for full disclosure, I was “one-upped” when someone pointed out that assuming they did have backups, what one really wanted to do was a restore. I had to agree. The truth is, no one needs backups, what they really need are restores. But the ultimate point is really the same, without a tested backup, your server can only get much worse if something goes wrong.

I’ve had to apply this thinking to my own dead server. Right now it’s running in a Frankenbeast mode on an old desktop with 2GB of RAM. Suffice to say, this is far from ideal. New hardware is on order, but in the meantime, most things work well enough.

I actually have a newer desktop in the house I could in theory move my server to. It would be a vast improvement over the current Frankenbeast; 8GB of RAM and a far faster CPU. But, I can’t. It doesn’t see the hard drive. Or more accurately, it won’t see an OS on it. After researching, I believe the reason comes down to a technical detail about how the hard drive is setup (namely the boot partition is what’s known as a MBR and it needs to be GPT). I’ll come back to this in a minute.

In the meantime, let’s take a little detour to mid April, 1970. NASA has launched two successful Lunar landings and the third, Apollo 13 is on its way to the Moon. They had survived their launch anomaly that came within a hair’s breadth of aborting their mission before they even made orbit. Hopes were high. Granted, Ken Mattingly was back in Houston, a bit disappointed he had been bumped from the flight due to his exposure to rubella. (The vaccine had just been released in 1969 and as such, he had never been vaccinated, and had not had it as a child. Vaccines work folks. Get vaccinated lest you lose your chance to fly to the Moon!)

Stack of Swiss cheese slices showing holes lined up.

A routine mission operation was to stir the oxygen tanks during the flight. Unfortunately, due to a Swiss Cheese effect of issues, this nearly proved disastrous when it caused a spark which caused an “explosion” which blew out the tank and ruptured a panel on the Service Module and did further damage. Very quickly the crew found themselves in a craft quickly losing oxygen but more importantly, losing electrical power. Contrary to what some might think, the loss of oxygen wasn’t an immediate concern in terms of breathing or astronaut health. But, without oxygen to run through the fuel cells, it meant there was no electricity. Without electricity, they would soon lose their radio communication to Earth, the onboard computer used for navigation and control of the spacecraft and their ability to fire the engines. Things were quickly getting worse.

I won’t continue to go into details, but through a lot of quick thinking as well as a lot of prior planning, the astronauts made it home safely. The movie Apollo 13, while a somewhat fictionalized account of the mission (for example James Lovell said the argument among the crew never happened, and Ken Mattingly wasn’t at KSC for the launch), it’s actually fairly accurate.

As you may be aware, part of the solution was to use the engine on the Lunar Module to change the trajectory of the combined spacecraft. This was a huge key in saving the mission.

But this leads to two questions that I’ve seen multiple times. The first is why they didn’t try to use the Service Module (SM) engine, since it was far more powerful and had far more fuel and they in theory could have turned around without having to loop around the Moon. This would have saved some days off the mission and gotten the astronauts home sooner.

NASA quickly rejected this idea for a variety of reasons, one was a fairly direct reason: there didn’t appear to be enough electrical power left in the CSM (Command/Service Module) stack to do so. The other though was somewhat indirect. They had no knowledge of the state of the SM engine. There was a fear that any attempt to use it would result in an explosion, destroying the SM and very likely the CM, or at the very least, damaging the heatshield on the CM and with a bad heatshield that would mean a dead crew. So, NASA decided to loop around the Moon using the LM descent engine, a longer, but far less risky maneuver.

Another question that has come up was why they didn’t eject the now dead and deadweight, SM. This would have meant less mass, and arguably been easier for the LM to handle. Again, the answer is because of the heatshield. NASA had no data on how the heatshield on the CM would hold up after being exposed to the cold of space for days and feared it could develop cracks. It had been designed to be protected by the SM on the flight to and from the Moon. So, it stayed.

The overriding argument here was “don’t risk making things worse.” Personally, my guess is given the way things were, firing the main engine on the SM probably would have worked. And exposing the heatshield to space probably would have been fine (since it was so overspecced to begin with). BUT, why take the risk when they had known safer options? Convenience is generally a poor argument against potentially catastrophic outcomes.

So, in theory, these days it’s trivial to upgrade a MBR disk to a GPT one. But, if something goes wrong, or that’s not really the root cause of my issues, I end up going from a crippled, but working server to a dead server I have to rebuild from scratch. Fortunately, I have options (including now a new disk so I can essentially mirror the one disk, have an exact copy and try the MBR->GPT solution on that one) but they may take another day or two to implement.

And in the same vein, if it’s a known SQL Server, or an unknown one, you’re working on, PLEASE make backups before you make changes, especially anything dramatic that risks data loss. (and I’ll add a side note, if you can, avoid restarting SQL Server when diagnosing issues, you lose a LOT of valuable information in the DMV tables.

So things CAN get worse. But that doesn’t mean there’s any need to take steps that will. Be cautious. Have a backout plan.

“Monday Monday”

Posted on December 29, 2020 by Greg Moore

I wrote once before about a day being a “Monday” and a week later about it not being a “Monday”. Well, yesterday was another Monday. And it reminded me of the value of DR planning and how scaling to your actual needs and budget are important.

There’s an old saying a Cobbler’s Children has no shoes and there’s some truth to that. And, well my kids have shoes, but yesterday reminded me I still want to improve my home DR strategy.

I had actually planned on sleeping in late since it’s the week between Christmas and New Years and my largest client is basically doing nothing and everyone else in the house is sleeping in this week. But that said, old habits die hard and after one of the cats woke me up to get fed, I decided to check my email. That’s when I noticed some of the tabs open in Chrome were dead. I’m not sure what I looked at next, but it caused me to ping my home server: nothing.

While that’s very unusual, it wouldn’t be the first time it did a BSOD. I figured I’d go to the basement, reboot, grab the paper, some breakfast and be all set. Well, I was partly right. Sure enough when I looked at the screen there was an error on it, but not a BSOD, but a black and white text screen with a bunch of characters and a line with an error on it. I rebooted, waited for the Server 2012 logo and then went out to get the newspaper. I came back, it was still booting, but I decided to wait for it to complete. Instead, it threw another BSOD (a real BSOD this time). I did a reboot and seconds later up came a BIOS message “PARITY ERROR”.

I figured it must be a bad RAM chip and while 16 GB wouldn’t be great, I could live with that if I had to cut down. But, things only got worse. Now the server wouldn’t even boot. I don’t mean as in I kept getting parity errors or a BSOD but as in, nothing would happen, no BIOS, nothing. Best as I can tell my server had succumbed to a known issue with the motherboard.

The technical term for this is “I was hosed”. But, in true DR spirit, I had backup plans and other ideas. The biggest issue is, I had always assumed my issue would be drive failure, hence backups, RAID, etc. I did not expect a full motherboard failure.

On one hand, this is almost the best time of the year for such an event. Work is slow, I could work around this, it wouldn’t normally be a big issue. However, there were some confounding issues. For one, my daughter is in the midst of applying to colleges and needs to submit portfolio items. These are of course saved on the server. Normally I’d move the server data drive to another machine and say “just go here” but she’s already stressed enough, I didn’t want to add another concern. And then much to my surprise, when I called ASRock customer service, they’re apparently closed until January! Yes, they apparently have no one available for a week. So much for arguing for an RMA. And finally of course, even if I could do an RMA, with the current situation with shipping packages, who knew when I would get it.

So, backup Plan A was to dig out an old desktop I had in house and move the drives over. This actually worked out pretty well except for one issue. The old desktop only has 2 GB of RAM in it! My server will boot, but my VMs aren’t available. Fortunately for this week that’s not an issue.

And Plan B was to find a cheap desktop at Best Buy, have my wife pick it up and when she got home, move the server disks to that and have a reasonably powered machine as a temporary server. That plan was great, but, for various reasons I haven’t overcome yet, the new machine won’t boot from the server drive (it acts like it doesn’t even see it.) So, for now I’m stuck with Plan A for now.

I’ve since moved on to Plan C and ordered a new Mobo (ironically another ASRock, because despite this issue, it’s been rock solid for 4+ years) and expect to get it by the 5th. If all goes well I’ll be up and running with a real server by then, just in time for the New Year.

Now, Plan D is still get ASRock to warranty the old one (some people have successfully argued for this because it appears to be a known defect). If that works, then I’ll order another case, more RAM and another OS license and end up with a backup server.

Should I have had a backup server all along? Probably. If nothing else, having a backup domain controller really is a best practice. But the reality is, this type of failure is VERY rare, and the intersection of circumstances that really requires me to have one is more rare. So I don’t feel too bad about not having a fully functional backup server until now. At the most, I lost a few hours of sleep yesterday. I didn’t lose any client time, business or real money. So, the tradeoff was arguably worth it.

The truth is, a DR plan needs to scale with your needs and budget. If downtime simply costs you a few hours of your time coming up with a workaround (like mine did), then perhaps sticking with the work around if you can’t afford more is acceptable. Later you can upgrade as you needs require it and your budget allows for it. For example, I don’t run a production 24×7 SQL Server, so I’m not worried about clustering, even after I obtain my backup server.

If you can work in a degraded fashion for some time and can’t afford a top-notch DR solution, that might be enough. But consider that closely before going down that route.

On the other hand, if like my largest client downtime can cost you thousands or even millions of dollars, than you had darn well invest in a robust DR solution. I recently worked with them on testing the DR plan for one of their critical systems. As I mentioned, it probably cost them tens of thousands of dollars just for the test itself. But, they now have a VERY high confidence that if something happened, their downtime is under 4 hours and they would lose very little data. And for the volume of business, it’s worth it. For mine, a few hours of downtime and a few days of degraded availability is ok and cost effective. But, given I have a bit of extra money, I figure it’s now worth mitigating even that.

In closing because this IS the Internet… a couple of cat pictures.

Quiet Time and Errors

Posted on May 12, 2020 by Greg Moore

I wrote last week about finally taking apart our dryer to solve the loud thumping issue it had. The dryer noise had become an example of what is often called normalization of deviance. I’ve written about this before more than once. This is a very common occurrence and one I would argue is at times acceptable. If we reacted to every change in our lives, we’d be overwhelmed.

That said, like the dryer, some things shouldn’t be allowed to deviate too far from the norm and some things are more important than others. If I get a low gas warning, I can probably drive another 50 miles in my car. If I get an overheated engine warning, I probably shouldn’t try to drive another 50 miles. The trick is knowing that’s acceptable and what’s not.

Yesterday I wrote about some scripting I had done. This was in response to an issue that came up at a customer site. Nightly a script runs to restore a database from one server to a second server. Every morning we’d get an email saying it was successful. But, there was a separate email about a separate task that was designed to run against that database that indicated a failure. And that particular failure was actually pretty innocuous. In theory.

You can see where I’m going here. Because we were trusting the email of the restore job over the email from the second job, we assumed the restore was fine. It wasn’t. The restore was failing every night but sending us an email indicating a success.

We had unwittingly accepted a deviance from the norm. Fortunately the production need for this database hadn’t started yet. But it will soon. This is what lead my drive to rewrite and redeploy the scripts on Friday and on Monday.

And here’s the kicker. With the new script, we discovered the restore had also been failing on a second server (for a completely different reason!)

Going back to our dryer here, it really is amazing how much we had come to expect the thunking sound and how much quieter it is. I’ve done nearly a half-dozen loads since I finally put in the new rollers and every time I push the start button I still cringe, waiting to hear the first thunk. I had lived with the sound so long I had internalized the sounds as normal. It’s going to take awhile to overcome that reaction.

And it’s going to take a few days or even weeks before I fully trust the restore scripts and don’t cringe a bit every morning when I open my email for that client and check for the status of overnight jobs.

But I’m happy now. I have a very quiet dryer and I have a better set of scripts and setup for deploying them. So the world is better. On to the next problems!

Brake (sic) it to Fix it!

Posted on February 4, 2020 by Greg Moore

Last week I wrote about getting my brakes fixed. Turns out I made the right choice, besides shoes and pads, I needed new calipers. Replacing them was a bigger job than I would have wanted to deal with. Of course it cost a bit more than I preferred, but I figure being able to make my car stop is a good thing. I had actually suspected I’d need new calipers because one of the signs of my brakes needing to be replaced was the terrible grinding and shrieking sound my left front brake was making. That’s never a good sign.

As a result I started to try to brake less if I could help it. And I cringed a bit every time I did brake. It became very much a Pavlovian response.

About two weeks ago, a colleague of mine said, “Hey, did you notice the number of errors for that process went way up?” I had to admit, nope, I had not. I had stopped looking at the emails in detail. They were for an ETL process that I had written well over a year ago. About 6 months ago, due to new, and bad, data being put into the source system, the ETL started to have about a half dozen rows it couldn’t process. As designed, it sent out an email to the critical parties and I received copies. We talked about the errors and decided that they weren’t worth tracking down at the time. I objected because I figure if you have an error, you really should fix it. But I got outvoted and figured it wasn’t my concern at that point. As a result, we simply accepted that we’d get an email every morning with a list of rows in error.

But, as my colleague pointed out, about 3 months ago, the number of errors had gone up. This time it wasn’t about a half dozen, it was close to 300. And no one had noticed. We had become so used to the error emails, our Pavlovian response was to ignore them.

But, this number was too large to ignore. I ended up doing two things. The first, and one I could deploy without jumping through hoops was to update the error email. Instead of simply showing the rows of errors, it now included a query that placed a table at the top that showed how many errors and in which tables. This was much more effective because now a single glance can easily show if the number of errors has increased or gone down (if we get no email that means we’ve eliminated all the errors, the ultimate goal in my mind.)

I was able to track down the bulk of the 300 new errors to a data dictionary disagreement (everyone raise their hands who has had a customer tell you one thing about data only to discover that really the details are different) that popped up when a large amount of new data was added to the source system.

I’ve since deployed that change to the DEV environment and now that we’re out of the end of the month code freeze for this particular product, will be deploying to production this week.

Hopefully though the parties that really care about the data will then start paying attention to the new email and squawking when they see a change in the number of bad rows.

In the meantime, it’s going to take me awhile to stop cringing every time I press my brakes. They no longer make any bad sounds and I like that, but I’m not used to to absence of grinding noises again. Yet. In both cases, I and my client had accepted the normalization of deviance and internalized it.

I wrote most of this post in my head last week while remembering some other past events that are in part related to the same concept. As a result this post is dedicated to the 17 American Astronauts who have perished directly in the service of the space program (not to diminish the loss of the others who died in other ways).

Apollo 1 – January 27th, 1967
Challenger – STS-51L – January 28th, 1986
Columbia – STS-107 – Launch January 16th, 2003, break-up, February 1st, 2003.

Two Minds are Better Than One

Posted on August 27, 2019 by Greg Moore

I’m going to do something often not seen in social media. I’m going to talk about a mistake that I made. It’s all to common on various certain media sites to talk about how perfect our lives are and how great things are. You rarely hear about mistakes. I decided, in the theme of this blog of talking about how we approach and solve (or don’t solve) problems, I’d be up front and admit a mistake.

This all started with a leak in the downstairs shower. It had been growing over the years and I frankly had been ignoring it. Why put off to tomorrow what you can put off to next month or even year? But finally, in December of last year it became obvious that it was time to fix the leak. It had continued to grow, and now that my son was home from college for extended break, he had setup a work area in the basement, below the bathroom. I figured he didn’t really need to suffer from water dripping onto his desk.

So, he and I went into full demolition mode and ripped out the old tile and backer board to get to the plumbing.

New plumbing

You can see some of the work here. I also took the opportunity to run wiring to finally put in a bathroom fan. That’s a whole other story.

Anyway, the demolition and plumbing went well. Then we put up the backer board and sealed it. And left it like that. It didn’t look good, but it was waterproof and usable. It was “good enough”. So for about 8 months it sat like that. But with an upcoming pool party, I decided it was time to finally finish it off. One of the hold ups had been deciding on tile. Fortunately, on a shopping trip about a month earlier, my son, my wife and I found tile we liked (my daughter, who ironically still lives at home and will be using the shower more in the next few years than her brother, was in LA on vacation, so she ended up not really having much say in the matter).

So, earlier this month, while the rest of my family took a weekend to go to Six Flags New Jersey, I figured I’d surprise them with finally tiling the shower. I went to the big box store whose favorite color is orange and bought the required materials. Since tile we had selected is approximately 6″ wide and 24″ long, I had to make sure I got the right mastic. This was a bit different from stuff I’ve worked with in the past with a bit more synthetic materials in it and it mixed differently, and had slightly different drying characteristics. That, combined with growing darkness lead me to move quickly. The darkness was a factor since any tile cutting I was doing was outside.

And, all that lead to a simple mistake. On the end wall, there’s a window and as a result part of that wall needed tiles just less than 24″ long. On one hand, this is a huge plus since it means less seams and less places to grout. On the other, it meant in one spot having to work around the trim of the window. And that’s where I made my mistake. The trim had been put over the original tile, so there was in theory room behind the stool of the window to fit in a piece of tile. That had been my plan.

But, when it came to sliding in the nearly 24″ long piece of tile, it wouldn’t fit. It wouldn’t bend (obviously) to let me get it tucked behind the stool and due to the stickiness of the mastic, I couldn’t slid it in from the top.

So, I cut out a notch. In the back of my mind I somehow was thinking, that it wouldn’t look that bad and tile would cover it. Well, I was obviously wrong.

In hindsight, I realized I should have cut the tile in two pieces, created an extra seam (like the row below it that had to cover more than 24″ wide in any event) and then I could have slide in the smaller piece and then put in the remaining piece. It would have been perfect, looked great and more likely to be waterproof.

So, this gets me to the title. Had I been doing the work with someone else, I’m sure I’d have said, “Damn, this is gonna suck, any ideas?”

My mistake

And I’m sure someone else would have suggested, “Hey, cut the piece and slide it in.”

I like working alone, but sometimes, you need a second person to help. Or more than one brain as I’ve mentioned in the past. If nothing else, sometimes a duck can help.

A bunch of rubber ducks, including two from Idera!

So, the moral of the story is sometimes two heads are better than one. Oh well, I won’t be the one using that shower, so I won’t see my mistake all the time!

Oh and check out my latest Red-Gate Article on Secure Strings in PowerShell.

Shouldn’t that be plugged in?

Posted on April 2, 2019 by Greg Moore

That was the question a friend of mine in 6th grade asked. As a result I developed what I call the Charlie M. rule after my friend. It was sort of Show and Tell day in 6th grade and we were supposed to talk about our hobbies. I brought in a circle of HO scale track (18″ radius for those interested) and my locomotive (a model GP-38) and some cars and of course the transformer to power it all.

I set it all up in front of the class and dutifully tried to demonstrate it. Nothing moved. I checked to make sure the engine was properly on the tracks: check. I made sure the wires were connected to the transformer: check. I made sure the wires were connected to the track: check. I was stumped: check. Finally Charlie raised his hand and asked, “Shouldn’t that be plugged in?” Ayup, in all my nervousness and being hurried, I had forgot the most basic step, of plugging in the transformer.

I try to keep this in mind when troubleshooting: check the obvious. I ran into this again over the weekend when trying to get my BMW Z3 running again. (Side note: no, consulting does not pay that well. This is one of the few tangible items I have left from my dad’s estate). It had stopped running late last fall and at the time I spent a little time trying to make it run, without much success. Finally, with the family’s help I pushed and pulled it into a shelter for the winter and then left it for the winter.

I wasn’t planning on worrying about it until later this month, but then… well let’s just say when I put the large box with metal corners into the rear of the Subaru, I forgot to check the obvious and slammed the rear hatch down on the box. Well, the box, realizing it didn’t have enough room, decided to take advantage of the metal corner and proceeded to make more room by punching out the rear window of the Subaru. Oops. Such a simple mistake, but a large one.

So, while waiting for the Subaru to get fixed, I decided it was time to get the BMW on the road.

Now due to the symptoms, I knew it wasn’t a dead battery or bad gas. So taking advantage of what I call my extended brain, I asked others for help. We had narrowed the problem down to either the clutch interlock switch or the starter. Neither looked like it would be an easy self-service and I was getting frustrated. I finally decided that perhaps checking the ODB-II codes might yield more information. Strangely though, the reader didn’t power up; there were no codes to read. That struck me as a strange. So here I did check the obvious: I took the reader to the Subaru and made sure the reader worked. And it worked fine on the Subaru. I went back to my extended brain and mentioned that.

“Oh, have you checked the fuses?”

“Nah I thought about it, but everything seems to have power.”

“You sure, sounds like the onboard computer fuse might be blown.”

So, I trudged out and took off the fuse cover. Now, I don’t really believe in fate or signs from God, but it was weird, in the list of about 40 fuses, the first one my eyes fell on was Computer. “Nah, can’t be.”

I pulled it, and sure enough, it was burned out. I pulled it and replaced it. Got in the car and thought, “it can’t be that easy, can it?” A turn of the key and the next thing I knew, the 6 cylinders were purring.

All that work and frustration because I had overlooked the basics.

This is far from the first time I’ve overlooked the basics. And I bet you’ve done the same thing. I have a theory about why we do this, and it is in part because the basics ARE so fundamental that we assume it has to be something else. In my model train example, dirty track and loose wires, especially in an ad-hoc setup are arguably a more common issue than forgetting to plug in the transformer. In my BMW case, because literally everything else worked, I assumed the power was getting to the computer. And honestly, even now, thinking about it, I’m surprised the dash light startup didn’t change at all because of a lack of computer.

I’ve seen this in databases and elsewhere. I was recently trying to do a quick restore of a database from one machine to another and the obvious wasn’t working. It took me a bit to remember the client’s new security setup prevented this specific case for these two machines. Once I remembered that, the problem and subsequent solution were obvious.

This in part goes back to why I like using a rubber-duck at times. It can force you to review your assumptions and check the basics.

Having a problem? Employ the Charlie M. rule and check the basics.

Failure is Required

Posted on September 4, 2018 by Greg Moore

Last week one of my readers, Derek Lyons correctly called me out on some details on my post about Lock outs. Derek and I go back a long ways with a mutual interest in the space program. His background is in nuclear submarines and some of the details of operations and procedures he’s shared with me over the years have been of interest. The US nuclear submarine program is built around “procedures” and since the adoption of their SUBSAFE program, has only suffered one hull-loss and that was with the non-SUBSAFE-certified USS Scorpion.

The space program is also well known for its heavy reliance on procedures and attention to detail and safety. Out of the Apollo 13 incident, we have the famous quote, “Failure is not an option” attributed to Gene Kranz in the movie (but there’s no record of him saying it at the time.)

Anyway, his comments got me thinking about failures in general.

And I’d argue that with certain activities and at a certain level, this is true. When it comes to bringing a crew home from the Moon, or launching nuclear missiles, or performing critical surgeries, failure is not an option.

But sometimes, not only is it an option I’d say it’s almost a requirement. I was reminded of this at a small event I was asked to help be a panelist at last week. It turned out there were 3 of us panelists and just 2 students from a local program to help folks learn to code: AlbanyCanCode. The concept of agile development was brought up and the fact that agile development basically relies on failing fast and early. For software development, the concept of failing fast really only costs you time. And agile proponents will argue that in fact it saves you time and money since you find your failures much earlier meaning you spend less time going down the wrong path.

But I’m going to shift gears here to an area that’s even more near and dear to my heart: cave rescue. At an overarching, one might say strategic level, failure is not an option. We teach in the NCRC that our goal is to get the patient(s) out in as good or better shape than we found them as quickly and safely as possible. In other words, if we end up killing a patient, but get them out really quickly, that’s considered a failure; whereas if we take twice as long, but get them out alive, that’s considered a success.

But how do we do that? Where does failure come into play?

One of the first lessons I was taught by one of my mentors was to avoid “the mother of all discussions.” This lesson hit home during an incident in my Level 1 training here in New York. We had a mock patient in a Sked. Up to this point it had been walking passage through a stream with about 1″ of water. But we had hit a choke point where the main part of the ceiling came down to about 12″ above the floor passage. There was alternative route that would involve lifting the patient up several feet and then over some boulders and through some narrow and low (but not 12″ low passage) and then we’d be back to walking passage. I and two others were near the head of the litter. At this point we had placed the litter on the ground (out of the water). We scouted ahead to see how far the low passage went and noticed it went about a body length. A very short distance.

Meanwhile the rest of our party were back in the larger passage having the mother of all discussions. They were discussing whether we should could drag the litter along the floor, lift it up to go high, or perhaps even for this part, remove the patient from the litter and have them drag themselves a bit. There may have been other ideas too.

My two partners and I looked at each other, looked at the low passage, looked at the patient, shrugged our shoulders and dragged the patient through the low passage to the other side.

About 10 seconds later someone from the group having the mother of all discussions exclaimed, “where’s the patient?”

“Over here, we got him through, now can we move on?”

They crawled through and we completed the exercise.

So, our decision was a success. But what if it had been a failure. What if we realized that the patient’s nose was really 13″ higher than the floor in the 12″ passage. Simple, we’d have pulled the patient back out. Then we could have shut down the mother of all discussions and said, “we have to go high, we know for a fact the low passage won’t work.”

Failure here WAS an option and by actually TRYING something, we were able to quickly succeed or fail and move on to the next option.

Now obviously one has to use judgement here. What if the water filled passage was 14″ deep. Then no, my partners and I certainly would NOT have tried to move the patient with just the three of us. But perhaps we might have convinced the group to try.

The point is, sometimes it can often be faster and easier to actually attempt a concept than it is to discuss it to death and consider every possibility.

Time and time again I’ve seen students in our classes fall into the mother of all discussions rather than actually attempt something. If they actually attempt something they can learn very quickly if it will work or not. If it works, great, the discussion can now end and they can move on to the next challenge. If it doesn’t work, great, they’ve narrowed down their options and can discuss more intelligently about the remaining options (and then perhaps quickly iterate through those too.)

So today’s take away, is don’t be afraid of failure. Embrace it. Enjoy it. Experience it. It will lead to learning. Just make sure you understand the price of failure. Failure may be an option and is sometimes mandatory, but in other cases, the old saw is true, failure is not an option, especially if failure means the loss of life.

Locked Out

Posted on August 28, 2018 by Greg Moore

As I’ve mentioned, not only am I fascinated by disasters and their root causes and how we react, I’m also fascinated by how we take steps to prevent them. In my book IT Disaster Response: Lessons Learned in the Field I discuss the idea of blue-flagging on railroads. The important concepts were two-fold: 1) a method of indicating that the train should not be moved and 2) controls on who could remove that indication.

During my recent power outage, I came across something similar. It should be the featured image for this article. It’s basically an orange flag locked to a utility pole. Note the key word there, locked.

The photo doesn’t show the fact that this utility pole contained circuit breakers (I believe that’s the proper term in this context) for the overhead power lines. They had been tripped as a result of a tree further down the road taking out all three supply lines.

Close up of orange flag on utility pole, along with tag with lock-out information

Close-up with tag and flag.

So let’s analyze this a bit:

The orange flag itself was VERY visible. This ensures any other crews that might be in the area that there is something they need to notice.

There is a tag with detailed information. It’s hard to see in the above photo, but it includes who tagged it, the location, and date and some other information.

What’s not clear, is it’s padlocked to the pole.

Now, to be clear, this is NOT a physical lock-out like you see on some power panels (i.e. where the padlock physically prevents the circuit-breaker from being opened or closed).

In this case, a physical lock-out would most likely have to be placed 30′ in the air at the top of the pole where it wouldn’t be easily noticed.

But that said, this served its purpose. It alerted other crews to a danger in the area and presumably can only be removed by the person who put it there. And it contains information on that person so they can be reached if there are questions.

Since power was restored within 1 hour and I didn’t hear of any reports of line worker getting electrocuted, this appears to have worked.

Today’s take-away: when you have a change from the normal state of operation, what steps can you take to ensure that others don’t try to return items to a normal state of operations without confirming things first? By the way, for a good read-up on how bad things can go when intentions about a non-standard mode of operation don’t get properly communicated, I recommend reading up on the events leading up to the Chernobyl disaster.

Procedures are important. Deviating from them can have serious consequences. Do what you can to minimize the possibility of deviations.

Green Mountain Software

Formerly thoughts on SQL, Disasters and thinking. Now musings on PA School, and more.

Category Archives: checklists