“We’re up to plan F”

I managed to skip two weeks of writing, which is unusual for me, but I was busy with other business, primarily leading an NCRC weeklong cave rescue class for Level 1 students last week. I had previously led such a class over three weekends last year, and have helped teach the Level 2 class multiple times. Originally this past week was supposed to be our National weeklong class, but back in February we had agreed to postpone it due to the unknown status of the ongoing Covid pandemic. However, due to huge demand and the success of vaccinations, we decided to run a “Regional” class limited to Level 1 students. This would help handle the pent-up demand, create students for the Level 2 class that would be offered at National, and serve as a test run of our facilities before the much larger National.

There’s an old saying that no plan survives the first contact with the enemy. In cave rescue this is particularly true. It also appears to be true in cave rescue training classes!

The first hitch was the drive up to the camp we were using. The road had been stripped down to the base dirt and was under construction. Not a huge issue, just a dusty one; for cavers, dust is just mud without the water. But this would come into play later in the week.

Once at the camp, as I was settling in and confirming the facilities, the first thing I noticed was that the scissor lift we had used to rig ropes in the gym last time was gone. A few texts later I learned it had only been on loan to the camp for the past two years and was no longer available. This presented our first real challenge: how to get ropes up over the beams 20-30′ in the air.

But shortly after, I realized I had a far greater issue. The custom-made rigging plates we use to tie off the ends of the ropes to the posts were still sitting in my garage at home. I had completely forgotten them. This was resolved by a well-timed call to an instructor heading towards the camp, who, via a longer detour than he expected, was able to get them. Had that call waited another 5 minutes, his detour would probably have doubled, so the timing was decent.

I figured the week was off to a good start at that point! Honestly though, we solved the problems and moved on. I went to bed fairly relaxed.

All went well until Monday. This was the day we were supposed to do activities on the cliffs. Several weeks ago, my son and I, along with two others, had gone to the cliffs to clear away debris and do other work to make them usable. The cliffs are on the same property as the camp, but accessible only by leaving the camp and approaching from a public road. I was excited to show them off. Unfortunately, with the forecast calling for thunderstorms all day, we made the decision to revise our schedule and move cliff day to the next day. There went Plan A. Plan B became “go the next day.”

On Tuesday, a couple of other instructors and I got in my car to head to the cliffs in advance of the students so we could scope things out and plan the activities. We literally got to the bottom of the road from the main entrance to the camp, where we were going to turn onto the road under construction, only to find the road closed there with a gaping ditch dug across it. So much for Plan B. We went back to the camp, told the students to hang on, and then I headed out again, hoping to basically take a loop around and approach the access road to the cliffs from the opposite direction. After about a 3 mile detour we came to the other end of the road and found it closed there too. Despite trying to sweet-talk the flag person, we couldn’t get past (we could have lied and said we lived on the road, but after 8-10 other cars arrived in a caravan saying the same thing, we figured that might look suspicious). There went Plan C. We called an instructor back at the camp and headed back.

We got there and it turned out an instructor had already come up with Plan D, which was to see if we could access the cliffs by crossing a field the camp owned and going through the woods. It might involve some hiking, but it might be doable. While there were dirt-bike paths, nothing there worked for us. So that plan fell apart. We were up to Plan E now, which proposed swapping around some more of the training, but we realized that would impact our schedule too much. Now on to Plan F. For Plan F, we decided to head to a local cave which we thought would have some suitable cliffs outside.

That worked. It worked out quite well, actually. We lost maybe an hour to 90 minutes with all the plan changes, but we ultimately came upon a plan that worked. We were able to teach the skills we wanted and accomplish our educational objectives.

Often we wake up with a plan in our heads for what we will do that day. Most days those plans work out. But then there are the days where we have to adapt. Things go sideways. Something breaks, or something doesn’t go as planned. In the NCRC we have an unofficial motto, Semper Gumby – “Always Be Flexible”. Sometimes you have to completely change plans (cancelling due to the threat of thunderstorms), other times you may have to adapt (finding other possible routes to the cliffs), and finally you may need to reconsider how to meet your objectives in a new way (finding different cliffs).

My advice: don’t lock yourself into only one solution. It’s a recipe for failure.

“Man down!”

Last week I wrote about how in many crisis situations you should actually stop and take 5 minutes to assess the situation, take a deep breath, and maybe even make a cup of tea. The point was, in many cases, we’re not talking life or death, and by taking a bit longer to respond we can have a better response.

I pointed out that you don’t always have that luxury. That happened to my mom’s partner within days of me writing last week’s post. While at work at a local supermarket chain, he heard someone shout “Man down”. Next thing he knew, a young man was lying on the floor having seizures. He jumped into action and provided the appropriate, immediate first aid. This included telling someone to call 911. Apparently no one else, including his manager, responded at first. But he had learned how to respond in his basic training in the Army decades ago. That training stuck.

My mom called me to talk about this and wondered why no one else had responded (she knows of my interest in emergency response and the like). I pointed out it’s a variety of factors, but it often comes down to people not knowing how to respond, or assuming someone else has already responded. This discussion prompted a quick Facebook post by me that I’m expanding upon here.

Let me ask you this: if someone collapsed in front of you at the mall, would you know what to do? What would you do? Would you do it?

The reality is, unfortunately many would not respond. So here’s my advice.

Get some training

You do not need to become an EMT to respond. In fact most training can be done in just a few hours.

Take a First Aid and a CPR course. Make sure the CPR course includes a segment on how to use an AED (Automated External Defibrillator). I’ve taken several such courses over the years and try to remain certified.

Take a Stop the Bleed class. This is a bit different from your standard First Aid class. I haven’t taken it yet, but plan to when I can find one near me (I may even look into getting one set up when I have a bit more free time).

“911, what’s your emergency?”

Call 911. Anyone can do this. I would recommend teaching even your young children to do this if they find you or someone else unconscious. Even if they can’t communicate many details, 911 operators are trained to gather what information they can, and have ways (usually electronic) of determining the address and dispatching help. (Please note, if your child or someone else calls 911 by accident, please do NOT hang up. Simply let them know it was a mistake. It happens, they understand. But if they aren’t made aware, they WILL dispatch resources.)

TELL someone specific to call 911. If you’re about to render aid, do NOT assume someone has already called 911 or will. In a crowd, groupthink happens and everyone starts to freeze and/or assume someone else has it handled. My advice: don’t just say “someone call 911”. Point to a specific person and tell them to call 911. Odds are, they will do it. In many cases in an emergency, folks are simply looking for someone to take charge and give them direction. Now, someone else may have already called 911, or multiple people may end up calling 911. THAT IS OK. That’s far better than no one calling if it’s an emergency. In the event of a heart attack, minutes count: the sooner 911 is called, the better.

Respond

This may sound obvious, but be prepared to act. Again, it’s a common trope that in large crowds, people tend NOT to act, in part because they expect someone else already has it covered. Be the person who does act.

Years ago in the northern Virginia area, I witnessed a car get t-boned on the far side of an intersection from me. There were 3 lanes of traffic in either direction. NO ONE stopped to check on the drivers. I had to wait for the light to change before I could cross the intersection and check on them. Fortunately, the driver of the car I checked on was fine, other than some very minor injuries from their air bag deploying. And by this time, another witness had finally stopped to check on the 2nd car. They too were fine. But several dozen people had witnessed the accident and only the two of us had responded. If the drivers had been seriously injured and no one had responded, things could have been much worse for them.

Carry gloves, maybe more

Carry nitrile gloves with you. It sounds perhaps a bit silly or trite, but they don’t take up room and you can toss them in your backpack, glove compartment (yes, really, you can put gloves in there), your purse, etc. If you do come across someone who is injured, especially if blood or other bodily fluids are present, don them. I even carry a tiny disposable breathing mask for CPR in my work backpack. It takes up no room but it’s there if I need it.

When you enter public buildings, look to see if they have a sign about AED availability. Note that it’s there and, if possible, where it is. In addition to telling someone to call 911, be prepared to tell someone, “Get the AED, I think there’s one next to the desk in reception.”

Get your employer involved

Get your work to sponsor training. And honestly, while many companies might offer video tutorials with a quick online quiz at the end, I think those are a bare minimum. Hands-on training is FAR more effective. There are a number of reasons for this as I understand it, including the fact that you’re often engaging multiple pathways to the brain (tactile as well as visual and auditory), and a certain level of stress can actually improve memorization.

Seeing a video about how to use an AED is very different from holding a training unit in your hands, feeling its weight, and hearing it give you instructions directly. Applying a bandage is far more realistic when your mock patient is lying there groaning in pain. Even getting into the action of telling someone “Call 911” is far more impactful when you do it in a hands-on manner and not simply by checking a box in an online quiz.

Find out what resources are available in the office. Is there a first aid kit? What’s in it? For larger offices, I would argue they should have an AED and perhaps a Stop the Bleed kit. When’s the last time the AED batteries were tested? Who is responsible for that?

This works

In the case of the “man down” that prompted this post, the young man is reportedly doing fine and suffered no injuries.

I know of a local case at a school where a student collapsed. A coach and the school nurse responded. And while the nurse especially had more training, what saved the student’s life was having an AED on site and available. Even if the school nurse or coach had not been there, in theory any bystander could have responded in a similar fashion.

As I said above, you don’t have to be a highly trained EMT or the like to make an impact and save someone from further injury or even save a life. You simply need to have some basic training and be willing to respond.

Take 5 Minutes

This weekend I had the pleasure of moderating Brandon Leach’s session at Data Saturday Southwest. The topic was “A DBA’s Guide to the Proper Handling of Corruption”. There were some great takeaways and if you get a chance, I recommend you catch it the next time he presents it.

But one thing he mentioned stood out, and I wanted to write about it: taking 5 minutes in an emergency. The idea is that sometimes the best thing you can do in an emergency is take 5 minutes. Doing this can save a lot of time and effort down the road.

Now, obviously, there are times when you can’t take 5 minutes. If you’re in an airplane and you lose both engines on takeoff departing LaGuardia, you don’t have 5 minutes. If your office is on fire, I would not suggest taking 5 minutes before deciding to leave the building. But other than immediate life-threatening emergencies, I’m a huge fan of taking 5 minutes. Or as I’ve put it, “make yourself a cup of tea.” (Note: I don’t drink tea!) Or have a cookie!

Years ago, when the web was young (and I was younger), I wrote sort of a first-aid quiz web page. Nothing fancy or formal, just a bunch of questions with hyperlinks to the bottom. It was self-graded. I don’t recall the exact wording of one of the questions, but it was something along the lines of “You’re hiking and someone stumbles and breaks their leg; how long should you wait before you run off to get help?” The answer was basically “after you make some tea.”

This came about after hearing Dr. Frank Hubbell, the founder of SOLO, talk about an incident in the White Mountains of New Hampshire where the leader of a Boy Scout troop passed out during breakfast. Immediately two scouts started to run down the trail to get help. While doing so, one slipped, fell off a bridge, and broke his leg. It turns out the leader had simply passed out from low blood sugar and once he woke up and had some breakfast was fine. The poor scout with the broken leg, though, wasn’t quite so fine. If they had waited 5 minutes, the outcome would have been very different.

The above is an example of what some call “Go Fever”. Our adrenaline starts pumping and we feel like we have to do something. Sitting still can feel very unnatural. This can happen even when we know rationally it’s NOT an emergency. Years ago during a mock cave rescue training exercise, a student was so pumped up that while backing up he ran his car into another student’s motorcycle. There was zero reason to rush, and yet he had let Go Fever get to him.

Taking the extra 5 minutes has a number of benefits. It gives you the opportunity to catch your breath and organize the thoughts in your head. It gives you time to collect more data. It also sometimes gives the situation itself time to resolve.

But, and Brandon touched upon this a bit, and I’ve talked about it in my own talk “Who’s Flying the Plane”, you often need strong support from management for this. Management obviously wants problems fixed as quickly as possible. This often means management puts pressure on us IT folks to jump into action, which can lead to bad outcomes. I once had a manager who told my team (without me realizing it at the time) to reboot a SQL Server because it was acting very slowly. This was while I was in the middle of remotely trying to diagnose it. Not only did this not solve the problem, it made things worse: a rebooting server is completely unresponsive, and even once it comes back up, it has to load a lot of pages into cache, so it responds slowly for a while after the reboot. And in this case, as I was pretty sure would happen, the reboot didn’t solve the problem (we were hitting a flaw in our code that was resulting in huge table scans). While the incident was non-fatal, taking an extra 5 minutes would have eliminated that outage and gotten us that much closer to solving the problem.
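For what it’s worth, those 5 minutes don’t have to be idle. A minimal sketch of what I mean (just one of many things you could look at, with the column list trimmed): a quick pass over sys.dm_exec_requests to see what’s actually running and what it’s waiting on, before anyone reaches for the power button.

```sql
-- A quick look at what's running right now and what it's waiting on.
-- Losing this view is exactly what a reboot costs you.
SELECT r.session_id,
       r.status,
       r.wait_type,
       r.blocking_session_id,
       r.cpu_time,
       r.logical_reads,
       t.text AS running_sql
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.session_id <> @@SPID
ORDER BY r.logical_reads DESC;   -- big table scans tend to float to the top
```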

Brandon also gave a great example of a corrupted index and how easy it can be to solve. If your boss is pressuring you for a solution NOW and you don’t have the opportunity to take those 5 minutes, you might make a poor decision that leads to a larger issue.

My takeaway for today is threefold:

  1. Be prepared to take 5 minutes in an emergency
  2. Take 5 minutes today to talk to your manager about taking 5 minutes in an emergency. Let them know NOW that you plan on taking those 5 minutes to calm down, regroup, maybe discuss with others what’s going on, and THEN you will respond. This isn’t you being a slacker or ignoring the impact on the business; it’s you being proactive to ensure you don’t make a hasty decision that has a larger impact. It’s far easier to have this conversation today than in the middle of a crisis.
  3. If you’re a manager, tell your reports that you expect them to take 5 minutes in an emergency.

Your Boss Doesn’t Care About Backups!

It’s true. Even if they don’t realize it. Or even if they claim they do. They really don’t.

I’ve made this point before. Of course this is hyperbole. But a recent post by Taryn Pratt reminded me of this. I would highly recommend you go read Taryn’s post. Seriously. Do it. It’s great. It’s better than my post. It actually has code and examples and the like. That makes it good.

That said, why the title here? Because again, I want to emphasize that what your boss really cares about is business continuity. At the end of the day they want to know, “if our server crashes, can we recover?” And the answer had better be “Yes.” This means you need to be able to restore those backups, or have another form of recovery.

Log-Shipping

It seems to me that over the years log-shipping has sort of fallen out of favor. “Oh we have SAN snapshots.” “We have Availability Groups!” “We have X.” “No one uses log-shipping any more, it’s old school.”

In fact this recently came up in a DR discussion I had with a client and their IT group. They use SAN replication software to replicate data from one data center to another. “Oh you don’t need to worry about shipping logs or anything, this is better.”

So I asked questions like: is it block-level, file-level, byte-level, or what? I asked how much latency there was. I asked how we could be sure that data was hardened on the receiving side. I never actually got really clear answers to any of that other than, “It’s never failed in testing.”

So I asked the follow-up question, “How was it tested?” I’m sure their answer was supposed to reassure me. “Well during a test, we’d stop writing to the primary, shut it down and then redirect the clients to the secondary.” And yes, that’s a good test, but it’s far from a complete test. Here’s the thing: many disasters don’t allow the luxury of cleanly stopping writes to the primary. They can occur for many reasons, but in many cases the failure is basically instantaneous. This means that data was in flight. Where was it in flight? Was it hardened to the log? Had it made it to the secondary? Inquiring minds want to know.

Now this is not to say these many methods of disk-based replication (as opposed to SQL-based, which is a different beast) aren’t effective or don’t have their place. It’s simply to say they’re not perfect and one has to understand their limitations.

So back to log-shipping. I LOVE log-shipping. Let me start with a huge caveat. In an unplanned outage, your secondary will only be as up to date as the most recent log backup that has been copied and restored. This could be an issue. But the upside is, you should have a very good idea of what’s in the database, and your chances of a corrupted block of data or the like are very low.
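If you want to know how far behind a secondary is at any given moment, the log shipping monitor tables in msdb can tell you. A rough sketch, run on the secondary (column names are from memory, so verify them against your version of msdb before relying on it):

```sql
-- How stale is the secondary? Compare when the last log was copied vs. restored.
-- Column names are from memory; check your msdb before relying on this.
SELECT secondary_database,
       last_copied_file,
       last_copied_date,
       last_restored_file,
       last_restored_date
FROM msdb.dbo.log_shipping_monitor_secondary;
```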

But there are two things I love about it.

  1. Every time I restore a log file, I’ve tested the backup of that log file. This may seem obvious, but it does give me a constant check on my backups. If my backups fail for any reason (lack of space, a bad block gets written and not noticed, etc.), I’ll know as soon as my next restore fails. Granted, my FULL backups aren’t being restored all the time, but I’ve got at least some more evidence that my backup scheme in general is working. (And honestly, if I really needed to, I could back up my copy and use that in a DR situation.)
  2. It can make me look like a miracle worker. I have, in the past, in a shop where developers had direct access to prod and had been known to mess up data, used log-shipping to save the day. Either on my DR box, or on a separate box I kept around that was too slow CPU-wise for DR but had plenty of disk space, I’d set log shipping to delay applying logs for 3-4 hours. In most DR events, it was fairly simple to catch up on the log restores and bring the DR box online. But more often than not, I used it (or my CPU-weak but disk-heavy box) in a different way. I’d get a report from a developer, “Greg, umm, well, not sure how to say this, but I just updated the automobile table so that everyone has a White Ford Taurus.” I’d simply reply, “Give me an hour or so, I’ll see what I can do.” Now the reality is, it never took me an hour. I’d simply go to the log-shipped copy I had, apply any logs I needed to catch up to just before their error, then script out the data and fix it in production. They always assumed I was restoring the entire backup or something like that. This wasn’t the case, in part because doing so would have taken far more than an hour, and would have caused a complete production outage.

There was another advantage to my second use of log shipping: I got practice at manually applying logs, WITH NORECOVERY and the like. I’m a firm believer in Train as you Fight.
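Purely as illustration, a catch-up on that delayed copy looks roughly like the sketch below. The database name, file names, and STOPAT time are all invented; the shape is what matters: keep restoring logs WITH NORECOVERY, then stop just shy of the bad statement.

```sql
-- Hypothetical catch-up on the delayed log-shipped copy (names and times invented).
RESTORE LOG SalesCopy
    FROM DISK = N'D:\LogShip\Sales_20210601_1200.trn'
    WITH NORECOVERY;   -- keep rolling logs forward; the database stays in a restoring state

RESTORE LOG SalesCopy
    FROM DISK = N'D:\LogShip\Sales_20210601_1215.trn'
    WITH STOPAT = '2021-06-01T12:27:00',  -- just before the "White Ford Taurus" update
         RECOVERY;                        -- bring the copy online so the good rows can be scripted out
```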

Yes, in an ideal world, a developer will never have such unrestricted access to production (and honestly it’s gotten better, I rarely see that these days) and you should never need to deal with an actual DR, but we don’t live in an ideal world.

So, at the end of the day, I don’t care if you do log-shipping, Taryn Pratt’s automated restores, or something else, but do restores; both automated and manual. Automated because they’ll test your backups. Manual because they’ll hone your skills for when your primary is down and your CEO is breathing down your neck as you huddle over the keyboard trying to bring things back.
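To be clear, the sketch below isn’t Taryn’s approach (go read her post for that); it’s just the bare-bones shape of a manual restore test, with invented database names, logical file names, and paths:

```sql
-- Bare-bones restore test: restore the latest full backup under a throwaway name,
-- then make sure the pages are actually readable. All names and paths are invented.
RESTORE DATABASE Sales_RestoreTest
    FROM DISK = N'D:\Backups\Sales_Full.bak'
    WITH MOVE 'Sales'     TO N'D:\RestoreTest\Sales_RestoreTest.mdf',
         MOVE 'Sales_log' TO N'D:\RestoreTest\Sales_RestoreTest.ldf',
         RECOVERY,
         STATS = 10;

-- A backup that restores but is corrupt still fails you when you need it most.
DBCC CHECKDB (Sales_RestoreTest) WITH NO_INFOMSGS;
```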

Reminder

As a consultant, I’m always looking for new clients. My primary focus is helping to outsource your on-prem DBA needs. If you need help, let me know!

Stuck, with Responsibility

So, by now, you may have all heard about the vehicle that got stuck trying to go through a somewhat narrow passage. No, I’m not talking about the container ship known as the Ever Given. Rather, I’m talking about my car and the entrance to my garage!

Yes, due to circumstances I’ll elucidate, for a few minutes the driver’s side of my car and the left side of my garage door opening attempted to occupy the same spot in space and time. It did not end well. The one consolation is that this mishap was not visible from space!

Now I could argue, “but it wasn’t my fault! My daughter was driving.” But that’s not really accurate or fair. Yes, she was driving, but it was my fault. She’s still on her learner’s permit. This requires, among other things, a licensed driver (that would be me) in the vehicle observing what she is doing. She did great on the 8-mile drive home from high school. So great in fact that when she paused and asked about pulling into my garage, I said “go for it.”

To understand her hesitation, I have to explain that the garage is perpendicular to the driveway and it’s a fairly tight turn. It’s certainly NOT a straight shot to get in. I’ve done it hundreds of times in the 5 years since the garage was added to the house, so I’ve got it down. Generally my biggest concern is the passenger side front bumper “sweeping” into the garage door opening or the wall as I enter. I don’t actually give much thought to the driver’s side.

So, I gave her the guidance I thought necessary: “Ok, stay to the far right on the driveway, this gives you more room to turn.” “Ok good, start turning. Great. Ok. Ayup, you’ve cleared the door there, start to straighten out.” “Ok you’re doing…” Here the rest of the cockpit voice recorder transcript will be redacted other than for the two sounds, a “thunk” and then a “crunch”. The rest of the transcript is decidedly not family friendly.

The investigator, upon reviewing the scene and endlessly replaying the sounds in his head, came to the following conclusions:

  • The “thunk” was the sound of the fold-away mirror impacting the door frame and doing what it was designed to do: folding away.
  • The “crunch” was the sound of the doors (yes, both driver’s side doors) impacting said door frame.
  • Both the driver and the adult in charge were more focused on the front passenger bumper than on the distance between the driver’s side and the door frame. Remedial training needs to be done here.

Anyway, I write all this because, despite what I said earlier, in a way this is a bit about the Ever Given and other incidents. Yes, my daughter was driving, but ultimately, it was my responsibility to ensure the safe movement of the vehicle. Now, if she had had her license, then I might feel differently. But the fact is, I failed. So, as bad as she felt, I felt worse.

In the case of the Ever Given, it’s a bit more complex: the captain of a ship is ultimately responsible for the safe operation of their vessel. But in areas such as the Suez Canal, ships also take on pilots who are, in theory, more familiar with the currents, winds, and other local factors than the captain may be. I suspect there will be a bit of finger-pointing. Ultimately though, someone was in charge and had ultimate responsibility. That said, their situation was different and I’m not about to claim it was simply oversight like mine. My car wasn’t being blown about by the wind, subject to currents, or what’s known as the bank effect.

What’s the takeaway? At the end of the day, in my opinion and experience, the best leaders are the ones that give the credit and take the blame. As a former manager, that was always my policy. There were times when things went great and I made sure my team got the credit. And when things went sideways, I stood up and took the blame. When a datacenter move at a previous job went sideways, I stepped up and took the blame. I was the guy in charge. And honestly, I think doing that helped me get my next job. I recall in the interview when the interviewer asked me about the previous job and I explained what happened and my responsibility for it. I think my forthrightness impressed him and helped lead to the hiring decision. The funny part is, when I was let go from the previous job, my boss also took responsibility for his failures in the operation. It’s one reason I still maintain a lot of respect for him.

So yes, my car doors have dents in them that can be repaired. The trim on my garage door needs some work. And next time BOTH my daughter and I will be more careful. But at the end of the day, no one was injured or killed and this mistake wasn’t visible from space.

Stuff happens. Take responsibility and move on.

“Houston, we’re venting something into Space…”

This post is the result of several different thoughts running through my head combined with a couple of items I’ve seen on social media in the past few days. The first was a question posted to #SQLHelp on Twitter asking what a DBA should do when coming into a situation with a SQL Server in an unknown configuration. The second was a comment a friend made about how “it can’t get any worse” and several of us cheekily corrected him, saying it can always get worse. And of course I’m still dealing with my server that died last week.

To the question of what to do with an unknown SQL Server, there were some good answers, but I chimed in saying my absolute first step would be to make backups. Several folks had made good suggestions regarding looking at system settings and possibly changing them, possibly re-indexing, etc. My point though was that all of that could wait. If the server had been running up until now, fixing those might be very helpful, but not fixing them would not make things worse. On the other hand, if there were no up-to-date backups and the server failed, the owner would be in a world of hurt. Now, for full disclosure, I was “one-upped” when someone pointed out that, assuming they did have backups, what one really wanted to do was a restore. I had to agree. The truth is, no one needs backups; what they really need are restores. But the ultimate point is really the same: without a tested backup, your server can only get much worse if something goes wrong.
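If I walked into such a server today, my first pass would look something like the rough sketch below: copy-only full backups of every online user database before touching a single setting. (The backup path is invented, and COPY_ONLY is there so I don’t disturb whatever backup chain may already exist.)

```sql
-- First move on an unknown server: copy-only full backups of every online
-- user database, before changing anything. The target path is invented.
DECLARE @name sysname, @sql nvarchar(max);

DECLARE db_cur CURSOR LOCAL FAST_FORWARD FOR
    SELECT name
    FROM sys.databases
    WHERE database_id > 4            -- skip the system databases for this pass
      AND state_desc = 'ONLINE';

OPEN db_cur;
FETCH NEXT FROM db_cur INTO @name;
WHILE @@FETCH_STATUS = 0
BEGIN
    SET @sql = N'BACKUP DATABASE ' + QUOTENAME(@name) +
               N' TO DISK = N''D:\SafetyNet\' + @name + N'.bak'' ' +
               N'WITH COPY_ONLY, CHECKSUM, INIT;';
    EXEC sys.sp_executesql @sql;
    FETCH NEXT FROM db_cur INTO @name;
END;

CLOSE db_cur;
DEALLOCATE db_cur;
```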

I’ve had to apply this thinking to my own dead server. Right now it’s running in Frankenbeast mode on an old desktop with 2GB of RAM. Suffice it to say, this is far from ideal. New hardware is on order, but in the meantime, most things work well enough.

I actually have a newer desktop in the house I could in theory move my server to. It would be a vast improvement over the current Frankenbeast: 8GB of RAM and a far faster CPU. But I can’t. It doesn’t see the hard drive. Or more accurately, it won’t see an OS on it. After researching, I believe the reason comes down to a technical detail about how the hard drive is set up (namely, the boot disk uses an MBR partition table and the newer machine wants GPT). I’ll come back to this in a minute.

In the meantime, let’s take a little detour to mid-April, 1970. NASA had flown two successful Lunar landings, and the third mission, Apollo 13, was on its way to the Moon. The crew had survived a launch anomaly that came within a hair’s breadth of aborting their mission before they even made orbit. Hopes were high. Granted, Ken Mattingly was back in Houston, a bit disappointed he had been bumped from the flight due to his exposure to rubella. (The vaccine had just been released in 1969 and as such, he had never been vaccinated, and had not had the disease as a child. Vaccines work folks. Get vaccinated lest you lose your chance to fly to the Moon!)

(Image: a stack of Swiss cheese slices with the holes lined up.)

A routine mission operation was to stir the oxygen tanks during the flight. Unfortunately, due to a Swiss cheese stack-up of issues, this nearly proved disastrous: the stir caused a spark, which caused an “explosion,” which blew out the tank, ruptured a panel on the Service Module, and did further damage. Very quickly the crew found themselves in a craft losing oxygen but, more importantly, losing electrical power. Contrary to what some might think, the loss of oxygen wasn’t an immediate concern in terms of breathing or astronaut health. But without oxygen to run through the fuel cells, there was no electricity. Without electricity, they would soon lose their radio communication to Earth, the onboard computer used for navigation and control of the spacecraft, and their ability to fire the engines. Things were quickly getting worse.

I won’t continue to go into details, but through a lot of quick thinking as well as a lot of prior planning, the astronauts made it home safely. The movie Apollo 13, while a somewhat fictionalized account of the mission (for example, James Lovell said the argument among the crew never happened, and Ken Mattingly wasn’t at KSC for the launch), is actually fairly accurate.

As you may be aware, part of the solution was to use the engine on the Lunar Module to change the trajectory of the combined spacecraft. This was a huge key in saving the mission.

But this leads to two questions that I’ve seen multiple times. The first is why they didn’t try to use the Service Module (SM) engine, since it was far more powerful, had far more fuel, and in theory could have turned them around without having to loop around the Moon. This would have shaved some days off the mission and gotten the astronauts home sooner.

NASA quickly rejected this idea for a variety of reasons. One was fairly direct: there didn’t appear to be enough electrical power left in the CSM (Command/Service Module) stack to do so. The other was somewhat indirect. They had no knowledge of the state of the SM engine. There was a fear that any attempt to use it would result in an explosion, destroying the SM and very likely the CM, or at the very least damaging the heatshield on the CM; a bad heatshield would mean a dead crew. So NASA decided to loop around the Moon using the LM descent engine, a longer, but far less risky maneuver.

Another question that has come up was why they didn’t jettison the now dead, deadweight SM. This would have meant less mass, and arguably been easier for the LM to handle. Again, the answer is the heatshield. NASA had no data on how the heatshield on the CM would hold up after being exposed to the cold of space for days and feared it could develop cracks. It had been designed to be protected by the SM on the flight to and from the Moon. So, it stayed.

The overriding argument here was “don’t risk making things worse.” Personally, my guess is given the way things were, firing the main engine on the SM probably would have worked. And exposing the heatshield to space probably would have been fine (since it was so overspecced to begin with). BUT, why take the risk when they had known safer options? Convenience is generally a poor argument against potentially catastrophic outcomes.

So, in theory, these days it’s trivial to upgrade an MBR disk to a GPT one. But if something goes wrong, or that’s not really the root cause of my issues, I end up going from a crippled but working server to a dead server I have to rebuild from scratch. Fortunately, I have options (including, now, a new disk so I can essentially mirror the existing one, have an exact copy, and try the MBR->GPT conversion on that) but they may take another day or two to implement.

And in the same vein, whether it’s a known SQL Server or an unknown one you’re working on, PLEASE make backups before you make changes, especially anything dramatic that risks data loss. (And I’ll add a side note: if you can, avoid restarting SQL Server when diagnosing issues; you lose a LOT of valuable information in the DMVs.)
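If a restart really is unavoidable, one small mitigation is to park a copy of the more useful DMV data somewhere durable first, since it all resets when the instance restarts. A minimal sketch, with an invented table name, using the wait stats as the example:

```sql
-- Wait stats are cumulative since the last restart and vanish when it restarts,
-- so snapshot them into a real table first. Table name is invented; it lands in
-- whatever database you run this from.
SELECT SYSDATETIME() AS captured_at,
       ws.*
INTO dbo.wait_stats_before_restart
FROM sys.dm_os_wait_stats AS ws;
```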

So things CAN get worse. But that doesn’t mean there’s any need to take steps that will. Be cautious. Have a backout plan.

“Monday Monday”

I wrote once before about a day being a “Monday” and a week later about it not being a “Monday”. Well, yesterday was another Monday. And it reminded me of the value of DR planning and how important it is to scale it to your actual needs and budget.

There’s an old saying that the cobbler’s children have no shoes, and there’s some truth to that. Well, my kids have shoes, but yesterday reminded me I still want to improve my home DR strategy.

I had actually planned on sleeping in late since it’s the week between Christmas and New Year’s, my largest client is basically doing nothing, and everyone else in the house is sleeping in this week. But old habits die hard, and after one of the cats woke me up to get fed, I decided to check my email. That’s when I noticed some of the tabs open in Chrome were dead. I’m not sure what I looked at next, but it caused me to ping my home server: nothing.

While that’s very unusual, it wouldn’t be the first time it did a BSOD. I figured I’d go to the basement, reboot, grab the paper, have some breakfast and be all set. Well, I was partly right. Sure enough, when I looked at the screen there was an error on it, but not a BSOD: a black and white text screen with a bunch of characters and a line with an error on it. I rebooted, waited for the Server 2012 logo and then went out to get the newspaper. When I came back it was still booting, so I decided to wait for it to complete. Instead, it threw another BSOD (a real BSOD this time). I did another reboot and seconds later up came a BIOS message: “PARITY ERROR”.

I figured it must be a bad RAM chip, and while 16 GB wouldn’t be great, I could live with that if I had to cut down. But things only got worse. Now the server wouldn’t even boot. I don’t mean I kept getting parity errors or a BSOD; I mean nothing would happen. No BIOS, nothing. Best as I can tell, my server had succumbed to a known issue with the motherboard.

The technical term for this is “I was hosed”. But, in true DR spirit, I had backup plans and other ideas. The biggest issue is, I had always assumed my issue would be drive failure, hence backups, RAID, etc. I did not expect a full motherboard failure.

On one hand, this is almost the best time of the year for such an event. Work is slow, I could work around this, and it wouldn’t normally be a big issue. However, there were some confounding issues. For one, my daughter is in the midst of applying to colleges and needs to submit portfolio items. These are of course saved on the server. Normally I’d move the server data drive to another machine and say “just go here”, but she’s already stressed enough; I didn’t want to add another concern. And then, much to my surprise, when I called ASRock customer service, I learned they’re apparently closed until January! Yes, they apparently have no one available for a week. So much for arguing for an RMA. And finally of course, even if I could do an RMA, with the current situation with shipping packages, who knew when I would get it back.

So, backup Plan A was to dig out an old desktop I had in house and move the drives over. This actually worked out pretty well except for one issue. The old desktop only has 2 GB of RAM in it! My server will boot, but my VMs aren’t available. Fortunately for this week that’s not an issue.

And Plan B was to find a cheap desktop at Best Buy, have my wife pick it up, and when she got home, move the server disks to that and have a reasonably powered machine as a temporary server. That plan was great, but for various reasons I haven’t overcome yet, the new machine won’t boot from the server drive (it acts like it doesn’t even see it). So, for now I’m stuck with Plan A.

I’ve since moved on to Plan C and ordered a new Mobo (ironically another ASRock, because despite this issue, it’s been rock solid for 4+ years) and expect to get it by the 5th. If all goes well I’ll be up and running with a real server by then, just in time for the New Year.

Now, Plan D is still to get ASRock to warranty the old one (some people have successfully argued for this because it appears to be a known defect). If that works, then I’ll order another case, more RAM and another OS license and end up with a backup server.

Should I have had a backup server all along? Probably. If nothing else, having a backup domain controller really is a best practice. But the reality is, this type of failure is VERY rare, and the intersection of circumstances that really requires me to have one is even rarer. So I don’t feel too bad about not having a fully functional backup server until now. At most, I lost a few hours of sleep yesterday. I didn’t lose any client time, business or real money. So, the tradeoff was arguably worth it.

The truth is, a DR plan needs to scale with your needs and budget. If downtime simply costs you a few hours of your time coming up with a workaround (like mine did), then perhaps sticking with the workaround, if you can’t afford more, is acceptable. Later you can upgrade as your needs require it and your budget allows for it. For example, I don’t run a production 24×7 SQL Server, so I’m not worried about clustering, even after I obtain my backup server.

If you can work in a degraded fashion for some time and can’t afford a top-notch DR solution, that might be enough. But consider that closely before going down that route.

On the other hand, if, like my largest client, downtime can cost you thousands or even millions of dollars, then you had darn well better invest in a robust DR solution. I recently worked with them on testing the DR plan for one of their critical systems. As I mentioned, it probably cost them tens of thousands of dollars just for the test itself. But they now have VERY high confidence that if something happened, their downtime would be under 4 hours and they would lose very little data. For their volume of business, it’s worth it. For mine, a few hours of downtime and a few days of degraded availability is OK and cost effective. But, given I have a bit of extra money, I figure it’s now worth mitigating even that.

In closing because this IS the Internet… a couple of cat pictures.

The friendlier one
The shyer but smarter one

Backups Are Useless

I’m going to take a controversial stand and argue that backups are useless.

Over the last few months I’ve worked with a client of mine to test their Disaster Recovery procedure for one of their major in-house applications. This involved multiple several-hour meetings with anywhere from 5 to 10 people at each meeting, sometimes more. Each hour probably cost the client thousands of dollars. Running these meetings and tests probably cost the client well over $100K.

This is ignoring the costs of the associated hardware, the power for the backup datacenter, the cost of heating and cooling, and of course the licensing. I wouldn’t be surprised if they easily spend more than $1 million a year on backups and the like.

And for what? A fairly low probability event?

I mean sure, if their system failed and they had no Disaster Recovery plan it could cost them tens of millions of dollars in business and perhaps even end up putting hundreds of people out of work. But they’d find other jobs. In the meantime, all that money spent on backups could have been spent on other things like lunches for the employees (and maybe a pizza or two for select consultants). Think of the boost to the pizza economy that would have been!

So, don’t do backups.

Oh, and don’t wear a mask. They’re hot, sweaty, and really, not even 0.1% of the US population has died. And sure, you might get COVID-19, but you’ll probably survive. Sure, you might have some long-term cognitive issues, but hey, that’s sort of like the employees at my client above who, if the company went under, could simply find another job. I mean it’s not a big deal. Amirite?

Now let’s be serious. If a DBA came into your business and said not to bother doing backups, you’d probably laugh at them. Do backups. And of course wear a mask. There is so much evidence it makes a difference. And socially distance for now. And reconsider large family gatherings for the next month or two, if only to help increase the odds that you can have such a gathering a year from now.

Part of this post was prompted by a question on Quora from a user asking how to recover their database if they didn’t have a backup. I hated to tell them that it might be too late and there was quite likely little they could do. And I’ve read too many heart-wrenching stories from nurses who have had to hold the hand of a dying patient who had thought Covid was no big deal or a hoax. So, please, take precautions. Even if nothing happens to you, it may happen to those close to you.

That said, I will repeat an adage about backups I heard a few SQL Saturdays ago: “Backups don’t matter, restores do!” So do backups, but restore them every once in a while to make sure they actually work!

For the record, with my client, not only did the official DR test run go smoothly, we beat our RTO and RPO by huge margins. If disaster strikes, it’s highly likely this customer will weather it without threatening the future of the company.

Caving and SQL

Longtime readers know that I spend a lot of my time talking about and teaching caving, more specifically cave rescue, and SQL Server, more specifically the operations side. While in some ways they are very different, there are areas where they overlap. In fact I wrote a book taking lessons from both, and from airplane crashes, to talk about IT Disaster Management.

Last week was a week where the two overlapped. One of the grottoes in the NSS (think of it like a SQL user group) sponsored a talk on Diversity and Inclusion in the caving community. The next day, SQL Pass had a virtual panel on the exact same subject.

Welcoming

Let me start with saying that one thing I appreciate about both communities is that they will welcome pretty much anyone. You show up and ask to be involved and someone will generally point you in the right direction.  In fact several years ago, I heard an Oracle DBA mention how different the SQL community was from his Oracle experience, and how welcoming and sharing we could be.

This is true in the caving community. I recall an incident decades ago where someone from out of town called up a caving friend he found in the NSS membership manual and said, “hey, I hear you go caving every Friday, can I join you?” The answer was of course yes. I know I can go many places in this country, look up a caver, and instantly be pointed to a great restaurant, some great caves, and even possibly some crash space to sleep.

So let’s be clear, BOTH communities are very welcoming.

And I hear that a lot when the topic of diversity and inclusion comes along. “Oh we welcome anyone. They just have to ask.”

But…

Well, there are two issues there, and they’re similar in both communities. The less obvious one is that often anyone is welcome, but after that, there are barriers, some obvious, some less so. Newcomers start to hear the subtle comments, the subtle behaviors. For example, in caving, modesty is often not a big deal. After crawling out of a wet muddy hole, you may think nothing of tearing off your clothes in the parking lot and changing. Perhaps you’re standing behind a car door, but that’s about it. It’s second nature, it’s no big deal. But imagine now that you’re the only woman in that group. Sure, you were welcomed into the fold and had a blast caving, but how comfortable are you with this sudden lack of modesty? Or you’re a man, but come from a cultural or religious background that places a high premium on modesty?

In the SQL world, no one is getting naked in the datacenters (I hope). But it can be subtle things there too. “Hey dudes, you all want to go out for drinks?” Now many folks will argue, “dudes is gender neutral”. And I think in most cases it’s INTENDED to be. But turn around and ask them, “are you attracted to dudes?” and suddenly you see there is still a gender attached. There are other behaviors too. There’s the classic case of the manager who switched email signatures with one of his reports and how the attitudes of the customers changed, simply based on whose signature was on the email.

So yes, both groups definitely can WELCOME new folks and folks outside of the majority, but do the folks they welcome remain welcomed? From talking to people who aren’t in the majority, the answer I often get is “not so much.”

An Interlude

“But Greg, I know…” (insert BIPOC, woman, or other member of a minority here) “…they’re a great DBA” or “they’re a great caver! Really active in the community.” And you’re right. But you’re also seeing survivorship bias. In some cases, they did find themselves in a more welcoming space that continued to be welcoming. In some cases you’re seeing the ones who forged on anyway. But think about it: half our population is made up of women. Why aren’t half our DBAs? In fact, the number of women in IT is declining! And if you compare the number of women in high school or college who express an interest in IT to the number still in IT in their 30s, you’ll find it drops. Women are welcome, until they’re not.

In the caving community, during an online discussion where people of color were speaking up about the barriers they faced, one person, a white male, basically said, “there’s no racism in caving, we’ll welcome anyone.” A POC pointed out that “as a black man in the South, trust me, I do NOT feel safe walking through a field to a cave.” The white man continued to say, “sure, but there’s no racism in caving,” completely dismissing the other responder’s concerns.

There’s Still More…

The final point I want to make, however, is that “we welcome people” is a necessary, but not sufficient, step. Yes, I will say pretty much every caver I know will welcome anyone who shows an interest. But that’s not enough. For one thing, in many communities, simply enjoying the outdoors is not a large part of the culture. This may mean that they’re not even aware that caving is a possibility. Or that even if it is, they may not know how to reach out and find someone to take them caving.

Even if they overcome that hurdle, while caving can be done on the cheap, there is still the matter of getting some clothing, a helmet, some lights. There’s the matter of getting TO the cave.

In the SQL world, yes, anyone is welcome to a SQL Saturday, but what if they don’t have a car? Is mass transit an option? What if they are hearing impaired? (I’ve tried unsuccessfully 2 years in a row to provide an ASL interpreter for our local SQL Saturday. I’m going to keep trying.) What if they’re a single parent? During the work week they may have school and daycare options, but that may not be possible for a SQL Saturday or even an after-hours event. I even had someone point out to me, during my talk on how to present, that up until I mentioned it they had not realized I was using a laser pointer. Why? Because they were colorblind and never saw the red dot. It was something that I, a non-colorblind person, had never even considered. And now I wonder, how many other colorblind folks had the same issue, but never said anything?

In Conclusion

It’s easy and honestly tempting to say, “hey, we welcome anyone” and think that’s all there is to it. The truth is, it takes a LOT more than that. If nothing else, if you’re like me, an older, cis-het white male, take the time to sit in on various diversity panels and LISTEN. If you’re invited to ask questions or participate, do so, but in a way that acknowledges your position. Try not to project your experiences onto others. Only once have I avoided a field to get to a cave, because the farmer kept his bull there. But I should not project MY lack of fear about crossing a field onto members of the community who HAVE experienced that fear.

Listen for barriers and work to remove them. Believe others when they mention a barrier. It may not be a barrier for you, but it is for them. When you can, try to remove barriers BEFORE others bring them up. Don’t assume a barrier doesn’t exist because no one mentions it. Don’t say, “is it ok if I use a red laser pointer?” because you’re now putting a colorblind person on the spot and singling them out. That will discourage them. Instead, for example, find a “software” pointer (it’s on my list of things to do) that highlights items directly on the screen. This also works great for large rooms where there may be multiple projection screens in use.

If caving, don’t just assume “oh, folks know how to find us”; reach out to community groups, ask them if they’re interested, and offer to help. (Note: I did try this earlier this year, but never heard back, and because of the impact of Covid, I’m waiting until next year to try again.)

Don’t take offense. Unless someone says, “hey, Greg, you know you do…”, they’re not talking about you specifically, but about an entire system. And no one is expecting you to personally fix the entire system, but simply to work to improve it where you can. It’s a team effort. That said, maybe you do get called out. I had a friend call me out on a tweet I made. She did so privately. And she did so because she knew I’d listen. I appreciated that. She recognized I was human, that I make mistakes, and that given the chance, I’ll listen and learn. How can one take offense at that? I saw it as a sign of caring.

Finally realize, none of us are perfect, but we can always strive to do better.

So, today, give some thought not only to how you can claim your community, whatever it may be, is welcoming, but to what efforts you can make to ensure it is.

 

On a separate note, check out my latest writing for Red-Gate, part II on Parameters in PowerShell.

It’s too late?

I want to start with a sobering thought. It’s too late to contain this pandemic. I’m watching the news as slowly more and more states in the US issue versions of “shelter in place” or “stay at home” orders. But I think in most cases, it’s too late. The virus has probably already spread so much that self-isolation won’t be nearly as effective as it would have been had the states issued the same orders a week or two earlier. That said, it’s most likely still better than doing nothing.

Human beings at times are lousy at risk analysis. If a risk is immediate, we can react well, but the longer it stretches out or the further away it is, the harder it is to get people to react. Almost any climate scientist who has studied anthropogenic global warming has known for a decade or more that we have a problem, that the window for solving it is narrowing quickly, and that the longer we wait, the harder it will become.

Yet too many of us put off the problem for another day.

So it is with the Covid-19 virus. “Oh we don’t have to lock down just yet, let’s wait another day.” And I’ll admit, sitting in the state that is the center of the virus outbreak here in the US, I’m tempted to say, “25,000 isn’t TOO bad, we can manage that.” But that’s the lizard part of my brain reacting. It’s the emotional part. Then the rational part kicks in. If we use one of the numbers bandied about, doubling every 4 days, that means by this weekend, in New York State alone, it will be 50,000. By April 1st, 100,000. By the end of April, it could be the entire state. Those numbers are hard to comprehend.

That said, I’m also hopeful. Modelling pandemics is pretty much pure math, but reality is more complex and often luck can play a huge factor. Let me try to explain.

First, we need to heed the words of experts like Dr. Fauci and others who are basing their remarks and recommendations on the inexorable exponential rise in expected infections. They are giving basically the worst case scenario if their recommendations are followed. And that’s proper. That’s really what you have to plan for.

Let me take a little side trip and mention a cave rescue in Vermont several years ago. By the time I had gotten the call to show up and to call out other rescuers, the injured party had been in the cave for several hours. I didn’t know much about the extent of their injuries other than it was a fall and that it was in a Vermont cave, which almost certainly meant operating in tight quarters. I grabbed a box of Freihofer cookies, a lawn chair (my fellow cave rescuers will understand the reference), a contact list of other potential rescuers, and my son. While I drove, he’d read off a name and I’d say “yes, call” or “Nope, next name.” On the hour-plus drive to the rescue we managed to contact at least two other people who could get there. (It turns out, as I surmised, several of the folks I wanted to call were members of the original caving party.)

Once there, my son and I were driven partway to the cave entrance and trudged the rest of the way. I talked with the folks on the scene to gather information and then dressed to go into the cave to gather first-hand information. I still hadn’t gained much information other than to know it was potentially shaping up to be a serious rescue. The person had been climbing a cable ladder when they fell and injured themselves. This meant, based on the information at hand, a worst-case scenario of an evac through tight passages with the patient in a SKED stretcher. I was playing the role of Dr. Fauci at that point, preparing for the worst based on the information I had.

Fortunately, literally at the moment I was about to enter the cave, one of the members of the original caving party crawled out and said, “he’s right behind me, he’ll be out in a minute or so.” It turned out his injuries were fairly minor and, with the members of his own caving party, he was able to get out of the cave under his own power.

I got back to Incident Command about an hour later and was informed, “oh, by the way, you’ve got at least 3 cavers who showed up to help. We held them at the bottom of the road. What should we tell them?”  My answer was simply, “Thanks and to go home.”

I relate this story not so much to talk about cave rescue specifically but to point out that even when planning for the worst, you may get a lucky break. But you can’t rely on one. Let me give an alternate scenario. Let’s say I had not called out the other rescuers, had gotten to the cave and crawled in, realized the situation was a worst-case scenario, crawled back out, and then initiated a call-out. It would probably have meant at least an extra 90 minutes before the extra resources were on the scene. It would have meant the patient was exposed to hypothermic conditions for another 90 minutes. It would have meant 90 more minutes of pain. It would have meant fewer brains working to solve the problem.

Getting back to Covid-19. Will we get lucky? I don’t know. I actually suspect we might. One “advantage” of an increasing population of sick people is we can better model it and we can also perform more drug trials. We may discover certain populations react differently to the disease than others and be able to incorporate that into the treatment plan. I don’t know. But I do know, we need to plan for the worst, and hope for a bit of luck. In the meantime, hunker down and let’s flatten the curve.

And if you’ve read this far and want to know how to make some pita bread, I’ll let you in on the two secrets I’ve learned: VERY hot oven (I typically bake mine at about 500-550F for 2 minutes on 1 side, and 1 minute on the other) and roll it out thinner than you might think.