Just a short blog post today since I’m actually in the middle of a call with a client as we test our failover scenario.
Right now I’m calling it a success even though the SQL Server hasn’t come up yet.
Why am I calling it a success? Because we learned that our current plan has a gaping hole concerning how the iSCSI drives fail over. Yes, technically we failed to fail over as quickly as we expected.
But, we’ve learned that before this system went into production. So that’s a success. This raises our confidence level for next time.
In all honesty, we often learn more from our failures than from our successes. For example, before NASA would allow SpaceX to fly a crew on Crew Dragon, they required several abort tests, one of which involved launching a Falcon 9 and then, in mid-flight, firing the Crew Dragon abort engines. This resulted in the destruction of the Falcon 9 (which was expected) but proved the abort plans worked. Note however that for Orion on Artemis, NASA has decided such a test is not necessary. The decision-making process behind that choice is worthy of a blog post of its own.
In any case, with the current DR test, we expect to have things finally failed over in the next hour or two. Then we’ll update our playbook and have a lot more confidence.
Moral of the story: test your DR. Assume things will go wrong the first time, because they will, but it's far better for that to happen before you go to production. This is not the first time I've had a failover not go as planned, but it has always been prior to production.
My faithful readers get a double dose today, only because when I wrote my earlier post I had not yet seen the invite for this month's T-SQL Tuesday. Otherwise I would have started with this post (and perhaps written a better version of it; this one will be a bit hurried).
Like many, I'm picking PASS Summit. No, not very creative, but true and accurate. I should note my first conference was SQL Connections, back in I believe 2006 or 2007 in Orlando, and that had a fairly important impact on me too. But my first PASS Summit in 2015 had a bigger one. I managed to go in place of our SQL Server User Group organizer, provided I attended the User Group update the day before and also represented us officially in that capacity. I of course did both.
But I also had an ulterior motive for going. Two of my best friends from college lived in Seattle and I had not seen them in years, in fact well over a decade. So it was a good chance to catch up with them. (Let me just say, flying from the East Coast to the West Coast and trying to go to bed at 1:00 AM West Coast time, but waking up at 7:00 AM, doesn't work well!)
That said, the real reason this conference was so important was that I met Kathi Kellenberger (@AuntKathi). She gave a presentation on how to get published. For years I had given thought to writing a book, and with the recent death of my father, who had always wanted to write the Great American Novel, this seemed like an interesting session to attend. She of course gave a great presentation. I spoke briefly with her afterwards and then went on to the next session. But her session stayed in my mind. Later that day I tracked her down, asked further questions, and before I knew it I was introduced to her rep at Apress.com. Very quickly I was discussing my idea with him; he expressed an interest and suggested I submit a more formal proposal via email. Within a few weeks of the conference I did so, and my idea was accepted.

That was the easy part. Translating my thoughts to paper was a bit harder. But a year later, by the 2016 PASS Summit, I was a published author. My dad wasn't around to see it, but the book was dedicated to him. It wasn't the Great American Novel and honestly, sales never lived up to even my most pessimistic expectations, but that doesn't matter. Someone paid me for my writing! And you can still buy a copy of IT Disaster Response: Lessons Learned in the Field, my take on combining IT disaster response with thoughts on plane crashes and cave rescues. It's not the most technical book, nor was it intended to be; it was meant to be a different, more holistic way of looking at responding to disasters. Instead of talking about "do backups like this," it talks about using ICS (Incident Command System) and CRM (Crew Resource Management) techniques to help respond to your disaster.
I'm not here to sell you on my book, but to talk about how that one conference and that one chance encounter with the right person changed my life. But I won't stop you from buying it. It's a quick and, I think, fun read! And you might even learn something.
I've enjoyed all my PASS Summits, including 2020 when I finally had a chance to present (albeit remotely), and SQL Saturdays (where I've learned a LOT and owe a great deal of thanks to more people than I can name for all they've taught me), but that first Summit was the one that probably had the most impact.
I mentioned recently that I had picked up a copy of the book The Last Stand of the Tin Can Sailors. I just finished it and would highly recommend it. The author, James Hornfischer, does an excellent job of interweaving the fates of the ships and their crews over the course of several chapters. He conveys an excellent sense of the fear and the sense of duty among the sailors. He also includes several maps to help readers orient themselves as the battle unfolds. He appears to have done his research, which included numerous interviews with the survivors, readings of the ships' logs, and more. The one area of missing information, and he admits it, is an adequate understanding of the Japanese side of the battle. This appears to be due in part to a lack of access to such logs, and I suspect a language barrier and the difficulty of travelling to Japan.
I mention this because it's important to understand that the story he writes, nearly 60 years after the battle, gives a far fuller picture of what happened than any of the participants had that day. But even now that story is missing pieces.
To quickly recap, the Imperial Japanese Navy was given the mission of breaking up MacArthur's landings on Leyte, which were intended to reclaim the Philippines. Like many Japanese naval plans it was audacious, but it also required meticulous planning and timing. And it involved a decoy fleet. This is an important element of what precipitated the last stand. At this point in the war (late 1944), the Japanese Navy had few planes and few experienced pilots, so their aircraft carriers were not an effective force. This despite the fact that the Japanese themselves had shown at Pearl Harbor that the future of surface naval warfare belonged almost exclusively to aircraft. So they decided to use their aircraft carriers as bait for the Third Fleet, commanded by Admiral Halsey. Bait he took: hook, line, and sinker.
This left the northern edge of the Seventh Fleet, guarding the San Bernardino Strait, basically undefended except for 3 task forces, Taffies 1-3, with just a slew of "jeep" carriers, destroyers, and destroyer escorts. Taffy 3 was the northernmost of these and the one directly engaged by the Japanese fleet. They were soon to be met by the IJN Yamato and the rest of Admiral Kurita's fleet of battleships and cruisers. By any measure, Taffy 3 was outgunned and outmatched. Yet by the end of the day, despite the loss of 2 destroyers, 1 destroyer escort, and 2 escort carriers, the Japanese fleet had lost 3 heavy cruisers, had 3 more cruisers and a destroyer damaged, had lost 52 aircraft (compared to the US losing 23), and was in full retreat.
At this point, and for the last 77 years, one could reasonably ask, "Why?" What drove Admiral Kurita's decision to withdraw? Unfortunately, most of the answers are predicated on guesswork, educated guesswork, but guesswork all the same. The simple answer appears to be twofold. For one, he didn't know whether Admiral Halsey had taken the bait; in fact it appears he thought Halsey hadn't, and that he was attacking fleet carriers, not escort carriers, and hence a much larger American fleet than was actually present. But despite his erroneous belief about the American Third Fleet's position, he was most likely correct in his appraisal of the future of the mission: he did not believe he could continue forward and disrupt the landings. Since that was the primary goal of his mission and it would most likely fail, it appears he saw no point in risking the rest of his fleet, and he withdrew.
One can speculate about what would have happened had he continued the battle. My personal, and mostly uneducated, guess is that he probably would have succeeded in sinking the other 2 carriers of Taffy 3 and perhaps the rest of the destroyers and destroyer escorts. However, his position was extremely precarious, with a growing number of American aircraft starting to make sorties from Taffy 2 and from an improvised airstrip the Army had prepared and the pilots from Taffy 3 had basically taken over. It's most likely he would have ended up with several more of his own ships on the ocean floor, including the Yamato.
So, he made what he thought was the best decision based on the information he had at the time. As did Halsey when he took the bait of the Northern Force and its basically defanged Japanese carriers.
So why do I recap all of this? Because I think it’s topical to a lot of what we do at times. This past weekend I was upgrading a SQL Server for a customer. Fairly routine work. And I ran into problems. Things I wasn’t expecting. It threw me off. Fortunately I was able to work around the issues, but it got me thinking about other upgrades and projects I’ve done.
The reality is, in IT (as well as life) we make plans to get things done. Sometimes they’re well thought out plans with lots of research done prior to the plan and everything is written down in detail to make sure nothing is forgotten.
And then… something unexpected happens. The local internet glitches. It turns out there’s a patch missing you had been told was there. Or there’s a patch there you didn’t know was there. Or a manager unexpectedly powers down the server during your data center move without telling you (yes, that happened to me once).
When things go majorly wrong, we’ll do a post-mortem. We’ll look back and say “Oh, that’s where things went wrong.” But we have to remind ourselves, at the time, we didn’t know better. We may not have had all the information on hand. When reviewing decisions, one has to separate “what do we know now” from “what did they know then.”
Now we know "…Halsey acted stupidly," to quote a famous movie. He shouldn't have taken the bait. We know Kurita probably should have turned back earlier (since the other half of the pincer had been turned back by the Seventh Fleet, putting the Japanese plan in serious jeopardy), or perhaps pressed on a bit longer before turning back (and taken out a few more escort carriers). But we shouldn't judge their decisions based on what we know, only on what they knew then.
Finally, I'm going to end with a quote from the battle, spoken by Lieutenant Commander Robert W. Copeland of the USS Samuel B. Roberts (DE-413) to his crew over the 1MC: "This will be a fight against overwhelming odds from which survival cannot be expected. We will do what damage we can." And that they did. Among other things, they launched their torpedoes at the IJN heavy cruiser Chōkai, hitting and disabling it, and then took on another Japanese cruiser with their 5″ guns until finally a shell took out their remaining engine room and they ended up dead in the water.
I can’t begin to fathom the heroism and bravery of the men of Taffy 3 that day. If you can, find the time to get a copy of the book and to read it.
P.S. The title of this post has an interesting story of its own, and I know at least one reader will know it all too well.
I managed to skip two weeks of writing, which is unusual for me, but I was busy with other business, primarily leading a weeklong NCRC cave rescue class for Level 1 students last week. I had previously led such a class over three weekends last year, and have helped teach the Level 2 class multiple times. Originally this past week was supposed to be our National weeklong class, but back in February we had agreed to postpone it due to the unknown status of the ongoing Covid pandemic. However, due to huge demand and the success of vaccinations, we decided to do a "Regional" class limited to just Level 1 students. This would help handle the pent-up demand, create students for the Level 2 class that would be at National, and serve as sort of a test run of our facilities before the much larger National.
There’s an old saying that no plan survives the first contact with the enemy. In cave rescue this is particularly true. It also appears to be true in cave rescue training classes!
The first hitch was the drive up to the camp we were using. The road had been stripped down to the base dirt level and they were doing construction. Not a huge issue, just a dusty one. But for cavers, dust is just mud without the water. This would come into play later in the week.
Once at the camp, as I was settling in and confirming the facilities, the first thing I noticed was that the scissor lift we had used to rig ropes in the gym last time was gone. A few texts later I learned it had only been on loan to the camp the past two years and was no longer available. This presented our first real challenge: how to get ropes up over the beams 20-30′ in the air.
But shortly afterward I realized I had a far greater issue. The custom-made rigging plates we use to tie off the ends of the ropes to the posts were still sitting in my garage at home. I had completely forgotten them. This was resolved by a well-timed call to an instructor heading toward the camp, who, via a longer detour than he expected, was able to get them. Had that call come another 5 minutes later, his detour would probably have doubled, so the timing was decent.
I figured the week was off to a good start at that point! Honestly though, we solved the problems and moved on. I went to bed fairly relaxed.
All went well until Monday. This was the day we were supposed to do activities on the cliffs. Several weeks ago, my son and I, along with two others, had gone to the cliffs, which were on the same property as the camp but accessible only by leaving the camp and approaching from a public road, in order to clear away debris and do other work to make them usable. I was excited to show them off. Unfortunately, due to a weather forecast of impending thunderstorms all day, we made the decision to revise our schedule and move cliff day to the next day. There went Plan A. Plan B became "go the next day."
On Tuesday I and a couple of other instructors got in my car to head to the cliffs in advance of the students so we could scope things out and plan the activities. We literally got to the bottom of the road from the main entrance to the camp, where we were going to turn onto the road under construction, only to find the road closed there with a gaping ditch dug across it. So much for Plan B. We went back to the camp, told the students to hang on, and then I headed out again, hoping to basically take a loop around and approach the access road to the cliffs from the opposite direction. After about a 3 mile detour we came to the other end of the road and found it closed there too. Despite trying to sweet-talk the flag person, we couldn't get past (we could have lied and said we lived on the road, but with 8-10 other cars arriving behind us in a caravan saying the same thing, we figured that might look suspicious). There went Plan C. We called an instructor back at the camp and headed back.
We got there and it turned out an instructor had already come up with Plan D, which was to see if we could access the cliffs by crossing a field the camp owned and going through the woods. It might involve some hiking, but it might be doable. While there were dirt-bike paths, nothing there worked for us. So that plan fell apart. We were up to Plan E now, which proposed further swapping some training around, but we realized that would impact our schedule too much. Now on to Plan F: we decided to head to a local cave which we thought would have some suitable cliffs outside.
That worked. It worked out quite well, actually. We lost maybe an hour to 90 minutes with all the plans, but we ultimately came upon one that worked. We were able to teach the skills we wanted and accomplish our educational objectives.
Often we wake up with a plan in our heads for what we will do that day. Most days those plans work out. But then there are the days where we have to adapt. Things go sideways. Something breaks, or something doesn't go as planned. In the NCRC we have an unofficial motto, Semper Gumby – "Always Be Flexible." Sometimes you have to completely change plans (cancelling due to the threat of thunderstorms), other times you may have to adapt (finding other possible routes to the cliffs), and finally you may need to reconsider how to meet your objectives in a new way (finding different cliffs).
My advice: don't lock yourself into only one solution. It's a recipe for failure.
Last week I wrote about how in many crisis situations you should actually stop and take 5 minutes to assess the situation, take a deep breath, and maybe even make a cup of tea. The point was, in many cases, we’re not talking life or death, and by taking a bit longer to respond we can have a better response.
I pointed out that you don't always have that luxury. That happened to my mom's partner within days of my writing last week's post. While at work at a local supermarket chain, he heard someone shout "Man down." Next thing he knew, a young man was lying on the floor having seizures. He jumped into action and provided the appropriate, immediate first aid. This included telling someone to call 911. Apparently no one else, including his manager, responded at first. But he had learned how to respond in his basic training in the Army decades ago. That training stuck.
My mom called me to talk about this and wondered why no one else had responded (she knows of my interest in emergency response and the like). I pointed out that it's a variety of factors, but it often comes down to people not knowing how to respond, or assuming someone else has already responded. This discussion prompted a quick Facebook post by me that I'm expanding upon here.
Let me ask you this: if someone collapsed in front of you at the mall, would you know what to do? What would you do? Would you do it?
The reality is, unfortunately many would not respond. So here’s my advice.
Get some training
You do not need to become an EMT to respond. In fact most training can be done in just a few hours.
Take a First Aid and a CPR course. Make sure the CPR course includes a segment on how to use an AED (Automatic External Defibrillator). I’ve taken several such courses over the years and try to remain certified.
Take a Stop the Bleed class. This is a bit different from your standard First Aid class. I haven't taken it yet, but plan to when I can find one near me (I may even look into getting one set up when I have a bit more free time).
“911, what’s your emergency?”
Call 911. Anyone can do this. I would recommend teaching even your young children to do this if they find you or someone else unconscious. Even if they can't communicate many details, 911 operators are trained to gather what information they can, and they have ways (usually electronic) of determining the address and dispatching help. (Please note, if your child or someone else calls 911 by accident, please do NOT hang up. Simply let them know it was a mistake. It happens, they understand. But if they aren't made aware, they WILL dispatch resources.)
TELL someone specific to call 911. If you're about to render aid, do NOT assume someone has already called 911 or will. In a crowd, groupthink happens and everyone starts to freeze and/or assume someone else has it handled. My advice: don't just say "someone call 911." Point to a specific person and tell them to call 911. Odds are, they will do it. In many cases in an emergency, folks are simply looking for someone to take charge and give them direction. Now, someone else may have already called 911, or it may end up that multiple people call 911. THAT IS OK. That's far better than no one calling if it's an emergency. In the event of a heart attack, minutes count. The sooner 911 is called, the better.
Respond
This may sound obvious, but be prepared to act. Again, it’s a common trope that in large crowds, people tend NOT to act, because in part they expect someone else already has it covered. Be that person who does act.
Years ago in the northern Virginia area, I witnessed a car get t-boned on the far side of an intersection from me. There were 3 lanes of traffic in either direction. NO ONE stopped to check on the drivers. I had to wait for the light to change before I could cross the intersection and check on them. Fortunately, the driver of the car I checked on was fine, other than some very minor injuries from their air bag deploying. And by this time, another witness had finally stopped to check on the 2nd car. They too were fine. But several dozen people had witnessed the accident and only the two of us had responded. If the drivers had been seriously injured and no one had responded, things could have been much worse for them.
Carry gloves, maybe more
Carry nitrile gloves with you. It sounds perhaps a bit silly or trite, but they don't take up room and you can toss them in your backpack, glove compartment (yes, really, you can put gloves in there), your purse, etc. If you do come across someone who is injured, especially if blood or other bodily fluids are present, don them. I even carry a tiny disposable rebreather mask for CPR in my work backpack. It takes up no room but it's there if I need it.
When you enter public buildings, look to see if they have a sign about AED availability. Note it and, if possible, where the AED is. In addition to telling someone to call 911, be prepared to tell someone, "Get the AED, I think there's one next to the desk in reception."
Get your employer involved
Get your work to sponsor training. And honestly, while many companies might offer video tutorials with a quick online quiz at the end, I think those are a bare minimum. Hands-on training is FAR more effective. There are a number of reasons for this as I understand it, including the fact that you're often engaging multiple pathways to the brain (tactile as well as visual and auditory), and that a certain level of stress can actually improve memorization.
Seeing a video about how to use an AED is very different from holding a training unit in your hands, feeling its weight, and hearing it give you instructions directly. Applying a bandage is far more realistic when your mock patient is lying there groaning in pain. Even getting into the action of telling someone "Call 911" is far more impactful when you do it in a hands-on manner and not simply by checking a box in an online quiz.
Find out what resources are available in the office. Is there a first aid kit? What’s in it? For larger offices, I would argue they should have an AED and perhaps a Stop the Bleed kit. When’s the last time the AED batteries were tested? Who is responsible for that?
This works
In the case of the “man down” that prompted this post, they are reportedly doing fine and suffered no injuries.
I know of a local case at a school where a student collapsed. A coach and the school nurse responded. And while the nurse especially had more training, what saved the student's life was having an AED on site and available. Even if the school nurse or coach had not been there, in theory any bystander could have responded in a similar fashion.
As I said above, you don’t have to be a highly trained EMT or the like to make an impact and save someone from further injury or even save a life. You simply need to have some basic training and be willing to respond.
This weekend I had the pleasure of moderating Brandon Leach‘s session at Data Saturday Southwest. The topic was “A DBA’s Guide to the Proper Handling of Corruption”. There were some great takeaways and if you get a chance, I recommend you catch it the next time he presents it.
But there was one thing that stood out that he mentioned that I wanted to write about: taking 5 minutes in an emergency. The idea is that sometimes the best thing you can do in an emergency is take 5 minutes. Doing this can save a lot of time and effort down the road.
Now, obviously, there are times when you can’t take 5 minutes. If you’re in an airplane and you lose both engines on takeoff while departing La Guardia, you don’t have 5 minutes. If your office is on fire, I would not suggest taking 5 minutes before deciding to leave the building. But other than the immediate life-threatening emergencies, I’m a huge fan of taking 5 minutes. Or as I’ve put it, “make yourself a cup of tea.” (note I don’t drink tea!) Or have a cookie!
Years ago, when the web was young (and I was younger), I wrote sort of a first-aid quiz web page. Nothing fancy or formal, just a bunch of questions with hyperlinks to the bottom. It was self-graded. I don't recall the exact wording of one of the questions, but it was something along the lines of "You're hiking and someone stumbles and breaks their leg. How long should you wait before you run off to get help?" The answer was basically "after you make some tea."
This came about after hearing Dr. Frank Hubbell, the founder of SOLO, talk about an incident in the White Mountains of New Hampshire where the leader of a Boy Scout troop passed out during breakfast. Immediately two scouts started to run down the trail to get help. While doing so, one slipped, fell off a bridge, and broke his leg. It turns out the leader had simply passed out from low blood sugar and once he woke up and had some breakfast he was fine. The poor scout with the broken leg, though, wasn't quite so fine. If they had waited 5 minutes, the outcome would have been different.
The above is an example of what some call “Go Fever”. Our adrenaline starts pumping and we feel like we have to do something. Sitting still can feel very unnatural. This can happen even when we know rationally it’s NOT an emergency. Years ago during a mock cave rescue training exercise, a student was so pumped up that he started to back up and ran his car into another student’s motorcycle. There was zero reason to rush, and yet he had let go fever hit him.
Taking the extra 5 minutes has a number of benefits. It gives you the opportunity to catch your breath and organize the thoughts in your head. It gives you time to collect more data. It also sometimes gives the situation itself time to resolve.
But, and Brandon touched upon this a bit, and I've talked about it in my own talk "Who's Flying the Plane", for this you often need strong support from management. Management obviously wants problems fixed as quickly as possible. This often means management puts pressure on us IT folks to jump into action. That can lead to bad outcomes. I once had a manager who told my team (without me realizing it at the time) to reboot a SQL Server because it was acting very slowly. This was while I was in the middle of remotely trying to diagnose it. Not only did this not solve the problem, it made things worse: a rebooting server is exactly 100% not responsive, and even when it comes back up it has to load a lot of pages into cache, so it will respond slowly after the reboot. And in this case, as I was pretty sure would happen, the reboot didn't solve the problem (we were hitting a flaw in our code that was resulting in huge table scans). While non-fatal, taking an extra 5 minutes would have eliminated that outage and gotten us that much closer to solving the problem.
Brandon also gave a great example of a corrupted index and how easy it can be to solve. If your boss is pressuring you for a solution NOW and you don’t have the opportunity to take those 5 minutes, you might make a poor decision that leads to a larger issue.
My takeaway for today is threefold:
Be prepared to take 5 minutes in an emergency
Take 5 minutes today, to talk to your manager about taking 5 minutes in an emergency. Let them know NOW that you plan on taking those 5 minutes to calm down, regroup, maybe discuss with others what’s going on and THEN you will respond. This isn’t you being a slacker or ignoring the impact on the business, but you being proactive to ensure you don’t make a hasty decision that has a larger impact. It’s far easier to have this conversation today, than in the middle of a crisis.
If you're a manager, tell your reports that you expect them to take 5 minutes in an emergency.
It’s true. Even if they don’t realize it. Or even if they claim they do. They really don’t.
I’ve made this point before. Of course this is hyperbole. But a recent post by Taryn Pratt reminded me of this. I would highly recommend you go read Taryn’s post. Seriously. Do it. It’s great. It’s better than my post. It actually has code and examples and the like. That makes it good.
That said, why the title here? Because again, I want to emphasize that what your boss really cares about is business continuity. At the end of the day they want to know, "If our server crashes, can we recover?" And the answer had better be "Yes." This means you need to be able to restore those backups, or have another form of recovery.
Log-Shipping
It seems to me that over the years log-shipping has sort of fallen out of favor. “Oh we have SAN snapshots.” “We have Availability Groups!” “We have X.” “No one uses log-shipping any more, it’s old school.”
In fact this recently came up in a DR discussion I had with a client and their IT group. They use SAN replication software to replicate data from one data center to another. "Oh, you don't need to worry about shipping logs or anything, this is better."
So I asked questions: was it block-level, file-level, byte-level, or what? How much latency was there? How could we be sure the data was hardened on the receiving side? I never really got clear answers to any of that other than, "It's never failed in testing."
So I asked the follow-up question: "How was it tested?" I'm sure their answer was supposed to reassure me. "Well, during a test, we'd stop writing to the primary, shut it down, and then redirect the clients to the secondary." And yes, that's a good test, but it's far from a complete test. Here's the thing: many disasters don't allow the luxury of cleanly stopping writes to the primary. They can occur for many reasons, but in many cases the failure is basically instantaneous. This means that data was in flight. Where in flight? Was it hardened to the log? Was that data in flight to the secondary? Inquiring minds want to know.
Now this is not to say these many methods of disk-based replication (as opposed to SQL-based, which is a different beast) aren't effective or don't have their place. It's simply to say they're not perfect and one has to understand their limitations.
So back to log-shipping. I LOVE log-shipping. Let me start with a huge caveat: in an unplanned outage, your secondary will only be as up to date as the most recent log backup you've restored. This could be an issue. But the upside is, you should have a very good idea of what's in the database, and your chances of a corrupted block of data or the like are very low.
But there are two things I love about it.
Every time I restore a log file, I've tested the backup of that log file. This may seem obvious, but it gives me a constant check on my backups. If my backups fail for any reason (lack of space, a bad block written and not noticed, etc.), I'll know as soon as my next restore fails. Granted, my FULL backups aren't being restored all the time, but I've got at least some more evidence that my backup scheme in general is working. (And honestly, if I really needed to, I could back up my copy and use that in a DR situation.)
It can make me look like a miracle worker. I have, in the past, in a shop where developers had direct access to prod and had been known to mess up data, used log-shipping to save the day. Either on my DR box, or on a separate box I'd keep around that was too slow CPU-wise for DR but had plenty of disk space, I'd set it to delay applying logs for 3-4 hours. In the event of most DR events, it was fairly simple to catch up on log-shipping and bring the DR box online. But more often than not, I used it (or my CPU-weak but disk-heavy box) in a different way. I'd get a report from a developer: "Greg, umm, I, well, not sure how to say this, but I just updated the automobile table so that everyone has a White Ford Taurus." I'd simply reply, "Give me an hour or so, I'll see what I can do." Now the reality is, it never took me an hour. I'd simply look at the log-shipped copy I had, apply any logs I needed to catch up to just before their error, then script out the data and fix it in production. They always assumed I was restoring the entire backup or something like that. That wasn't the case, in part because doing so would have taken far more than an hour, and would have caused a complete production outage.
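To give a flavor of what that looked like, here's a minimal sketch of rolling a log-shipped standby copy forward to just before the mistake. The database name, backup path, standby file, column names, and timestamp are all hypothetical placeholders, not the actual scripts I used.

```sql
-- Roll the standby copy forward, stopping just before the bad UPDATE hit production.
RESTORE LOG AutoDB_Standby
FROM DISK = N'\\drserver\logship\AutoDB_20150612_1400.trn'
WITH STANDBY = N'D:\standby\AutoDB_undo.dat',  -- keeps the copy readable between restores
     STOPAT  = N'2015-06-12T13:55:00';         -- a moment before the mistake

-- With the copy sitting at that point in time, script out the still-correct rows...
SELECT AutomobileID, OwnerID, Make, Model, Color
FROM   AutoDB_Standby.dbo.Automobile;

-- ...and use that output to repair the data in production with an ordinary UPDATE.
```

The repair ran against the delayed copy, not a full restore over production, which is why it never took anywhere near an hour.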
There was another advantage to my second use of log backups: I got practice at manually applying logs, WITH NORECOVERY, WITH STANDBY, and the like. I'm a firm believer in Train as you Fight.
Yes, in an ideal world, a developer will never have such unrestricted access to Production (and honestly it's gotten better, I rarely see that these days) and you should never need to deal with an actual DR, but we don't live in an ideal world.
So, at the end of the day, I don't care if you do log-shipping, Taryn Pratt's automated restores, or something else, but do restores, both automated and manual. Automated because they'll test your backups. Manual because they'll hone your skills for when your primary is down and your CEO is breathing down your neck as you huddle over the keyboard trying to bring things back.
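If you've never done a manual test restore, the shape of it is roughly the sketch below. The database name, backup share, logical file names, and paths are placeholders; run something like this on a non-production instance and adjust to your own environment.

```sql
-- Restore the most recent full backup under a throwaway name on a test instance.
RESTORE DATABASE Sales_RestoreTest
FROM DISK = N'\\backupshare\Sales\Sales_full.bak'
WITH MOVE N'Sales'     TO N'E:\RestoreTest\Sales_RestoreTest.mdf',
     MOVE N'Sales_log' TO N'E:\RestoreTest\Sales_RestoreTest_log.ldf',
     RECOVERY,
     STATS = 10;

-- The restore proves the backup file is usable; CHECKDB proves the data inside is sound.
DBCC CHECKDB (Sales_RestoreTest) WITH NO_INFOMSGS;

-- Clean up when you're done.
DROP DATABASE Sales_RestoreTest;
```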
Reminder
As a consultant, I'm always looking for new clients. My primary focus is helping to outsource your on-prem DBA needs. If you need help, let me know!
So, by now, you may have all heard about the vehicle that got stuck trying to go through a somewhat narrow passage. No, I'm not talking about the container ship known as the Ever Given. Rather, I'm talking about my car and the entrance to my garage!
Yes, due to circumstances I’ll elucidate, for a few minutes the driver’s side of my car and the left side of my garage door opening attempted to occupy the same spot in space and time. It did not end well. The one consolation is that this mishap was not visible from space!
Now I could argue, "But it wasn't my fault! My daughter was driving." But that's not really accurate or fair. Yes, she was driving, but it was my fault. She's still on her learner's permit. This requires, among other things, a licensed driver (that would be me) in the vehicle, observing what she is doing. She did great on the 8-mile drive home from high school. So great, in fact, that when she paused and asked about pulling into my garage, I said "go for it."
To understand her hesitation, I have to explain that the garage is perpendicular to the driveway and the turn into it is fairly tight. It's certainly NOT a straight shot to get in. I've done it hundreds of times in the last 5 years (since the garage was added to the house), so I've got it down. Generally my biggest concern is the passenger-side front bumper "sweeping" into the garage door opening or the wall as I enter. I don't actually give much thought to the driver's side.
So, I gave her the guidance I thought necessary: “Ok, stay to the far right on the driveway, this gives you more room to turn.” “Ok good, start turning. Great. Ok. Ayup, you’ve cleared the door there, start to straighten out.” “Ok you’re doing…” Here the rest of the cockpit voice recorder transcript will be redacted other than for the two sounds, a “thunk” and then a “crunch”. The rest of the transcript is decidedly not family friendly.
The investigator, upon reviewing the scene and endlessly replaying the sounds in his head, came to the following conclusions:
The "thunk" was the sound of the fold-away mirror impacting the door frame and doing as intended: folding away.
The "crunch" was the sound of the doors (yes, both driver's-side doors) impacting said door frame.
Both the driver and the adult in charge were more focused on the front passenger bumper than on the distance between the driver's side and the door frame. Remedial training needs to be done here.
Anyway, I write all this because, despite what I said earlier, in a way this is a bit about the Ever Given and other incidents. Yes, my daughter was driving, but ultimately it was my responsibility to ensure the safe movement of the vehicle. Now, if she had had her license, I might feel differently. But the fact is, I failed. So, as bad as she felt, I felt worse.
In the case of the Ever Given, it's a bit more complex: the captain of a ship is ultimately responsible for the safe operation of their vessel. But in areas such as the Suez Canal, ships also take on pilots, who are in theory more familiar with the currents, winds, and other factors local to that specific area than the captain may be. I suspect there will be a bit of finger-pointing. Ultimately though, someone was in charge and had ultimate responsibility. That said, their situation was different and I'm not about to claim it was simply oversight like mine. My car wasn't being blown about by the wind, or subject to currents or what's known as the bank effect.
What's the takeaway? At the end of the day, in my opinion and experience, the best leaders are the ones who give the credit and take the blame. As a former manager, that was always my policy. There were times when things went great and I made sure my team got the credit. And when things went sideways, I stood up and took the blame. When a datacenter move at a previous job went sideways, I stepped up and took the blame. I was the guy in charge. And honestly, I think doing that helped me get my next job. I recall that in the interview, when the interviewer asked me about the previous job, I explained what happened and my responsibility for it. I think my forthrightness impressed him and helped lead to the hiring decision. The funny part is, when I was let go from the previous job, my boss also took responsibility for his failures in the operation. It's one reason I still maintain a lot of respect for him.
So yes, my car doors have dents in them that can be repaired. The trim on my garage door needs some work. And next time BOTH my daughter and I will be more careful. But at the end of the day, no one was injured or killed and this mistake wasn’t visible from space.
This post is the result of several different thoughts running through my head, combined with a couple of items I've seen on social media in the past few days. The first was a question posted to #SQLHelp on Twitter asking what a DBA should do when coming into a situation with a SQL Server in an unknown configuration. The second was a comment a friend made about how "it can't get any worse," and several of us cheekily corrected him, saying it can always get worse. And of course I'm still dealing with my server that died last week.
To the question of what to do with an unknown SQL Server, there were some good answers, but I chimed in saying my absolute first step would be to make backups. Several folks had made good suggestions about looking at system settings and possibly changing them, possibly re-indexing, etc. My point though was, all of that could wait. If the server had been running up until now, fixing those might be very helpful, but not fixing them would not make things worse. On the other hand, if there were no up-to-date backups and the server failed, the owner would be in a world of hurt. Now, for full disclosure, I was "one-upped" when someone pointed out that, assuming they did have backups, what one really wanted to do was a restore. I had to agree. The truth is, no one needs backups; what they really need are restores. But the ultimate point is really the same: without a tested backup, your server can only get much worse if something goes wrong.
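For what it's worth, my "backup first" step on an inherited server looks something like this minimal sketch; the database name and backup path are placeholders.

```sql
-- Take a full backup without disturbing whatever backup chain may already exist.
BACKUP DATABASE MysteryDB
TO DISK = N'\\backupshare\MysteryDB\MysteryDB_baseline.bak'
WITH COPY_ONLY,     -- doesn't reset the differential base or break an existing chain
     CHECKSUM,      -- catch damaged pages as they're read
     COMPRESSION,
     STATS = 10;

-- At a bare minimum, confirm the file is readable. The real test, of course, is an actual restore.
RESTORE VERIFYONLY
FROM DISK = N'\\backupshare\MysteryDB\MysteryDB_baseline.bak'
WITH CHECKSUM;
```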
I’ve had to apply this thinking to my own dead server. Right now it’s running in a Frankenbeast mode on an old desktop with 2GB of RAM. Suffice to say, this is far from ideal. New hardware is on order, but in the meantime, most things work well enough.
I actually have a newer desktop in the house I could in theory move my server to. It would be a vast improvement over the current Frankenbeast: 8GB of RAM and a far faster CPU. But I can't. It doesn't see the hard drive. Or more accurately, it won't see an OS on it. After researching, I believe the reason comes down to a technical detail about how the hard drive is set up (namely, the boot drive is partitioned as MBR and it needs to be GPT). I'll come back to this in a minute.
In the meantime, let's take a little detour to mid-April 1970. NASA has launched two successful lunar landings and the third, Apollo 13, is on its way to the Moon. The crew had survived a launch anomaly that came within a hair's breadth of aborting their mission before they even made orbit. Hopes were high. Granted, Ken Mattingly was back in Houston, a bit disappointed he had been bumped from the flight due to his exposure to rubella. (The vaccine had just been released in 1969 and as such, he had never been vaccinated, and had not had the disease as a child. Vaccines work, folks. Get vaccinated lest you lose your chance to fly to the Moon!)
A routine mission operation was to stir the oxygen tanks during the flight. Unfortunately, due to a Swiss cheese alignment of issues, this nearly proved disastrous: the stir caused a spark, which caused an "explosion" that blew out the tank, ruptured a panel on the Service Module, and did further damage. Very quickly the crew found themselves in a craft losing oxygen and, more importantly, losing electrical power. Contrary to what some might think, the loss of oxygen wasn't an immediate concern in terms of breathing or astronaut health. But without oxygen to run through the fuel cells, there was no electricity. Without electricity, they would soon lose their radio communication with Earth, the onboard computer used for navigation and control of the spacecraft, and their ability to fire the engines. Things were quickly getting worse.
I won't continue to go into details, but through a lot of quick thinking as well as a lot of prior planning, the astronauts made it home safely. The movie Apollo 13, while a somewhat fictionalized account of the mission (for example, James Lovell said the argument among the crew never happened, and Ken Mattingly wasn't at KSC for the launch), is actually fairly accurate.
As you may be aware, part of the solution was to use the engine on the Lunar Module to change the trajectory of the combined spacecraft. This was a huge key in saving the mission.
But this leads to two questions that I've seen multiple times. The first is why they didn't try to use the Service Module (SM) engine, since it was far more powerful and had far more fuel, and in theory they could have turned around without having to loop around the Moon. This would have shaved some days off the mission and gotten the astronauts home sooner.
NASA quickly rejected this idea for a variety of reasons, one was a fairly direct reason: there didn’t appear to be enough electrical power left in the CSM (Command/Service Module) stack to do so. The other though was somewhat indirect. They had no knowledge of the state of the SM engine. There was a fear that any attempt to use it would result in an explosion, destroying the SM and very likely the CM, or at the very least, damaging the heatshield on the CM and with a bad heatshield that would mean a dead crew. So, NASA decided to loop around the Moon using the LM descent engine, a longer, but far less risky maneuver.
Another question that has come up is why they didn't jettison the now-dead, deadweight SM. This would have meant less mass, and arguably been easier for the LM to handle. Again, the answer is because of the heatshield. NASA had no data on how the heatshield on the CM would hold up after being exposed to the cold of space for days and feared it could develop cracks. It had been designed to be protected by the SM on the flight to and from the Moon. So, it stayed.
The overriding argument here was “don’t risk making things worse.” Personally, my guess is given the way things were, firing the main engine on the SM probably would have worked. And exposing the heatshield to space probably would have been fine (since it was so overspecced to begin with). BUT, why take the risk when they had known safer options? Convenience is generally a poor argument against potentially catastrophic outcomes.
So, in theory, these days it's trivial to upgrade an MBR disk to a GPT one. But if something goes wrong, or that's not really the root cause of my issues, I go from a crippled but working server to a dead server I have to rebuild from scratch. Fortunately, I have options (including now a new disk, so I can essentially mirror the existing one, have an exact copy, and try the MBR-to-GPT conversion on that copy), but they may take another day or two to implement.
And in the same vein, whether it's a known SQL Server or an unknown one you're working on, PLEASE make backups before you make changes, especially anything dramatic that risks data loss. (And I'll add a side note: if you can, avoid restarting SQL Server while diagnosing issues; you lose a LOT of valuable information in the DMVs.)
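To show what I mean, here are two of the DMVs whose contents are cumulative only since the last restart, so a reboot wipes exactly the evidence you may need. This is just an illustrative pair; there are many more.

```sql
-- Cumulative wait statistics since the instance last started (or the stats were manually cleared).
SELECT TOP (10) wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;

-- Statistics for query plans currently in cache; also gone after a restart.
SELECT TOP (10) execution_count, total_worker_time, total_logical_reads
FROM sys.dm_exec_query_stats
ORDER BY total_worker_time DESC;
```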
So things CAN get worse. But that doesn’t mean there’s any need to take steps that will. Be cautious. Have a backout plan.
I wrote once before about a day being a "Monday" and a week later about it not being a "Monday." Well, yesterday was another Monday. And it reminded me of the value of DR planning, and of how important it is to scale it to your actual needs and budget.
There's an old saying that the cobbler's children have no shoes, and there's some truth to that. Well, my kids have shoes, but yesterday reminded me I still want to improve my home DR strategy.
I had actually planned on sleeping in late, since it's the week between Christmas and New Year's, my largest client is basically doing nothing, and everyone else in the house is sleeping in this week. But old habits die hard, and after one of the cats woke me up to get fed, I decided to check my email. That's when I noticed some of the tabs open in Chrome were dead. I'm not sure what I looked at next, but it caused me to ping my home server: nothing.
While that's very unusual, it wouldn't be the first time it did a BSOD. I figured I'd go to the basement, reboot, grab the paper and some breakfast, and be all set. Well, I was partly right. Sure enough, when I looked at the screen there was an error on it, but not a BSOD: a black-and-white text screen with a bunch of characters and a line with an error on it. I rebooted, waited for the Server 2012 logo, and then went out to get the newspaper. When I came back it was still booting, but I decided to wait for it to complete. Instead, it threw another BSOD (a real BSOD this time). I rebooted again, and seconds later up came a BIOS message: "PARITY ERROR".
I figured it must be a bad RAM chip and while 16 GB wouldn’t be great, I could live with that if I had to cut down. But, things only got worse. Now the server wouldn’t even boot. I don’t mean as in I kept getting parity errors or a BSOD but as in, nothing would happen, no BIOS, nothing. Best as I can tell my server had succumbed to a known issue with the motherboard.
The technical term for this is “I was hosed”. But, in true DR spirit, I had backup plans and other ideas. The biggest issue is, I had always assumed my issue would be drive failure, hence backups, RAID, etc. I did not expect a full motherboard failure.
On one hand, this is almost the best time of the year for such an event. Work is slow, I could work around this, it wouldn’t normally be a big issue. However, there were some confounding issues. For one, my daughter is in the midst of applying to colleges and needs to submit portfolio items. These are of course saved on the server. Normally I’d move the server data drive to another machine and say “just go here” but she’s already stressed enough, I didn’t want to add another concern. And then much to my surprise, when I called ASRock customer service, they’re apparently closed until January! Yes, they apparently have no one available for a week. So much for arguing for an RMA. And finally of course, even if I could do an RMA, with the current situation with shipping packages, who knew when I would get it.
So, backup Plan A was to dig out an old desktop I had in the house and move the drives over. This actually worked out pretty well except for one issue: the old desktop only has 2 GB of RAM in it! My server will boot, but my VMs aren't available. Fortunately, for this week that's not an issue.
And Plan B was to find a cheap desktop at Best Buy, have my wife pick it up, and when she got home, move the server disks to that and have a reasonably powered machine as a temporary server. That plan was great, but for various reasons I haven't overcome yet, the new machine won't boot from the server drive (it acts like it doesn't even see it). So I'm stuck with Plan A for now.
I’ve since moved on to Plan C and ordered a new Mobo (ironically another ASRock, because despite this issue, it’s been rock solid for 4+ years) and expect to get it by the 5th. If all goes well I’ll be up and running with a real server by then, just in time for the New Year.
Now, Plan D is still to get ASRock to warranty the old one (some people have successfully argued for this because it appears to be a known defect). If that works, then I'll order another case, more RAM, and another OS license and end up with a backup server.
Should I have had a backup server all along? Probably. If nothing else, having a backup domain controller really is a best practice. But the reality is, this type of failure is VERY rare, and the intersection of circumstances that really requires me to have one is more rare. So I don’t feel too bad about not having a fully functional backup server until now. At the most, I lost a few hours of sleep yesterday. I didn’t lose any client time, business or real money. So, the tradeoff was arguably worth it.
The truth is, a DR plan needs to scale with your needs and budget. If downtime simply costs you a few hours of your time coming up with a workaround (like mine did), then sticking with the workaround, if you can't afford more, is perhaps acceptable. Later you can upgrade as your needs require it and your budget allows for it. For example, I don't run a production 24×7 SQL Server, so I'm not worried about clustering, even after I obtain my backup server.
If you can work in a degraded fashion for some time and can’t afford a top-notch DR solution, that might be enough. But consider that closely before going down that route.
On the other hand, if, like my largest client, downtime can cost you thousands or even millions of dollars, then you had darn well better invest in a robust DR solution. I recently worked with them on testing the DR plan for one of their critical systems. As I mentioned, it probably cost them tens of thousands of dollars just for the test itself. But they now have VERY high confidence that if something happened, their downtime would be under 4 hours and they would lose very little data. And for their volume of business, it's worth it. For mine, a few hours of downtime and a few days of degraded availability is OK and cost effective. But, given I have a bit of extra money, I figure it's now worth mitigating even that.
In closing because this IS the Internet… a couple of cat pictures.