“Houston, we’re venting something into Space…”

This post is the result of several different thoughts running through my head combined with a couple of items I’ve seen on social media in the past few days. The first was a question posted to #SQLHelp on Twitter in regards to if a DBA came into a situation with a SQL Server in an unknown configuration what one would do. The second was a comment a friend made about how “it can’t get any worse” and several of us cheekily corrected him saying it can always get worse. And of course I’m still dealing with my server that died last week.

To the question of what to do with an unknown SQL Server, there were some good answers, but I chimed in saying my absolute first thing would be to make backups. Several folks had made good suggestions in regards to looking at system settings and possibly changing them, possibly re-indexing, etc. My point though was, all that could wait. If the server had been running up until now, while fixing those might be very helpful, the lack of fixing things would not make things worse. On the other hand, if there were no up to date backups and the server failed, the owner would be in a world of hurt. Now, for full disclosure, I was “one-upped” when someone pointed out that assuming they did have backups, what one really wanted to do was a restore. I had to agree. The truth is, no one needs backups, what they really need are restores. But the ultimate point is really the same, without a tested backup, your server can only get much worse if something goes wrong.

I’ve had to apply this thinking to my own dead server. Right now it’s running in a Frankenbeast mode on an old desktop with 2GB of RAM. Suffice to say, this is far from ideal. New hardware is on order, but in the meantime, most things work well enough.

I actually have a newer desktop in the house I could in theory move my server to. It would be a vast improvement over the current Frankenbeast; 8GB of RAM and a far faster CPU. But, I can’t. It doesn’t see the hard drive. Or more accurately, it won’t see an OS on it. After researching, I believe the reason comes down to a technical detail about how the hard drive is setup (namely the boot partition is what’s known as a MBR and it needs to be GPT). I’ll come back to this in a minute.

In the meantime, let’s take a little detour to mid April, 1970. NASA has launched two successful Lunar landings and the third, Apollo 13 is on its way to the Moon. They had survived their launch anomaly that came within a hair’s breadth of aborting their mission before they even made orbit. Hopes were high. Granted, Ken Mattingly was back in Houston, a bit disappointed he had been bumped from the flight due to his exposure to rubella. (The vaccine had just been released in 1969 and as such, he had never been vaccinated, and had not had it as a child. Vaccines work folks. Get vaccinated lest you lose your chance to fly to the Moon!)

Stack of Swiss cheese slices showing holes lined up.

A routine mission operation was to stir the oxygen tanks during the flight. Unfortunately, due to a Swiss Cheese effect of issues, this nearly proved disastrous when it caused a spark which caused an “explosion” which blew out the tank and ruptured a panel on the Service Module and did further damage. Very quickly the crew found themselves in a craft quickly losing oxygen but more importantly, losing electrical power. Contrary to what some might think, the loss of oxygen wasn’t an immediate concern in terms of breathing or astronaut health. But, without oxygen to run through the fuel cells, it meant there was no electricity. Without electricity, they would soon lose their radio communication to Earth, the onboard computer used for navigation and control of the spacecraft and their ability to fire the engines. Things were quickly getting worse.

I won’t continue to go into details, but through a lot of quick thinking as well as a lot of prior planning, the astronauts made it home safely. The movie Apollo 13, while a somewhat fictionalized account of the mission (for example James Lovell said the argument among the crew never happened, and Ken Mattingly wasn’t at KSC for the launch), it’s actually fairly accurate.

As you may be aware, part of the solution was to use the engine on the Lunar Module to change the trajectory of the combined spacecraft. This was a huge key in saving the mission.

But this leads to two questions that I’ve seen multiple times. The first is why they didn’t try to use the Service Module (SM) engine, since it was far more powerful and had far more fuel and they in theory could have turned around without having to loop around the Moon. This would have saved some days off the mission and gotten the astronauts home sooner.

NASA quickly rejected this idea for a variety of reasons, one was a fairly direct reason: there didn’t appear to be enough electrical power left in the CSM (Command/Service Module) stack to do so. The other though was somewhat indirect. They had no knowledge of the state of the SM engine. There was a fear that any attempt to use it would result in an explosion, destroying the SM and very likely the CM, or at the very least, damaging the heatshield on the CM and with a bad heatshield that would mean a dead crew. So, NASA decided to loop around the Moon using the LM descent engine, a longer, but far less risky maneuver.

Another question that has come up was why they didn’t eject the now dead and deadweight, SM. This would have meant less mass, and arguably been easier for the LM to handle. Again, the answer is because of the heatshield. NASA had no data on how the heatshield on the CM would hold up after being exposed to the cold of space for days and feared it could develop cracks. It had been designed to be protected by the SM on the flight to and from the Moon. So, it stayed.

The overriding argument here was “don’t risk making things worse.” Personally, my guess is given the way things were, firing the main engine on the SM probably would have worked. And exposing the heatshield to space probably would have been fine (since it was so overspecced to begin with). BUT, why take the risk when they had known safer options? Convenience is generally a poor argument against potentially catastrophic outcomes.

So, in theory, these days it’s trivial to upgrade a MBR disk to a GPT one. But, if something goes wrong, or that’s not really the root cause of my issues, I end up going from a crippled, but working server to a dead server I have to rebuild from scratch. Fortunately, I have options (including now a new disk so I can essentially mirror the one disk, have an exact copy and try the MBR->GPT solution on that one) but they may take another day or two to implement.

And in the same vein, if it’s a known SQL Server, or an unknown one, you’re working on, PLEASE make backups before you make changes, especially anything dramatic that risks data loss. (and I’ll add a side note, if you can, avoid restarting SQL Server when diagnosing issues, you lose a LOT of valuable information in the DMV tables.

So things CAN get worse. But that doesn’t mean there’s any need to take steps that will. Be cautious. Have a backout plan.

Green Mountain Software

Formerly thoughts on SQL, Disasters and thinking. Now musings on PA School, and more.

“Houston, we’re venting something into Space…”

Leave a comment Cancel reply

Share this:

Related

Leave a comment Cancel reply