T-SQL Tuesday #194 – Why I Don’t Take CRaP from Anyone

It’s been quite a while since I participated in a T-SQL Tuesday, and with my new career path there’s less reason to do so, but this topic appealed to me.

This month Louis Davidson asked us to talk about a mistake we made and what we learned from it.

I have two mistakes. The first I call the White Ford Taurus Mistake. Back before the turn of the millennium (yes, I’ve been working with SQL Server for that long) I was consulting at a start-up (where I was later hired as their IT manager) that helped car dealers sell their inventory online. We were young, fast, and nimble. And sometimes had no clue what we were doing. I technically didn’t make this mistake, but I was there when it happened. A dealer had uploaded their inventory, but there was a mistake in the details for one vehicle, so we had a request to update that car. An update was made directly on the production server. I forget the exact statement, but it was something like “UPDATE AUTO SET MANUFACTURER='Ford', TYPE='Taurus', COLOR='White'”. You’ll notice the distinct lack of a WHERE clause. Since we ran it without any sort of explicit transaction around it, suddenly every car in the database was now a White Ford Taurus.
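We drilled a safer habit into everyone afterwards. Here’s a minimal sketch of that habit; the VIN column and its value are made up for illustration, since the real schema is long gone:

```sql
-- The dangerous version: no WHERE clause, no transaction.
-- UPDATE AUTO SET MANUFACTURER = 'Ford', TYPE = 'Taurus', COLOR = 'White';

-- The safer habit: scope the change and verify the row count before committing.
BEGIN TRANSACTION;

UPDATE AUTO
SET    MANUFACTURER = 'Ford',
       TYPE         = 'Taurus',
       COLOR        = 'White'
WHERE  VIN = '1FABP52U1WA123456';   -- hypothetical key for the one vehicle

IF @@ROWCOUNT = 1
    COMMIT TRANSACTION;             -- we touched exactly the row we expected
ELSE
    ROLLBACK TRANSACTION;           -- anything else means the predicate was wrong
```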

I can assure you we didn’t make that level of error again!

The second error was in some ways worse and led to the title of this blog. This occurred two years later, when we had matured, had actual servers, and had moved into a datacenter in New York City (incidentally, it’s now the Google building at 111 8th Ave, but back then it housed several different hosting providers). I had been making a lot of trips to our datacenter over the past few months, so my boss suggested that for this trip I take my wife along and we catch a show. It was a great plan. Until it wasn’t. This trip was supposed to be fairly simple:

  • Fail over our Classified Ads database to our backup server – This was basically telling Enterprise Manager to back up the database on production and restore it onto the backup server, something like the sketch after this list. (I’ll be honest, I forget what this feature was called; it was so long ago.)
  • Update a hardware driver on the main server
  • Fail back
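For context, the “failover” in that first step was really just a full backup of the production database restored onto the backup server. In today’s T-SQL it would look roughly like this; the database name, share, and path are all made up:

```sql
-- On the production server: take a full backup of the classified ads database.
BACKUP DATABASE ClassifiedAds
TO DISK = N'\\backupserver\backups\ClassifiedAds.bak'
WITH INIT;

-- On the backup server: restore it, overwriting whatever copy is already there.
RESTORE DATABASE ClassifiedAds
FROM DISK = N'\\backupserver\backups\ClassifiedAds.bak'
WITH REPLACE, RECOVERY;
```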

In theory this should have taken about 1-2 hours tops.

The first time I tried the failover it failed. I forget the exact reason, but it was a simple fix. However, what I didn’t realize was that Enterprise Manager had flipped the direction of the migration on the screen.

So after fixing the initial problem I hit the button again. This time it was successful. And very fast. Too fast. It took me a second to realize the problem. Enterprise Manager had done exactly what I told it to do. In this case it had copied an empty database over the production database.

No problem, right? Simply restore the most recent backup.

Big problem. There was no recent backup.

This was one of the worst calls I ever had to make to my boss. Fortunately he took it in stride, called our Wisconsin office, and had them start reloading the data. Since the only product involved was our classified ads database, and that data had a high churn rate (over the course of seven days it was essentially all new data), they basically had to redo seven days of loads in one day. Unfortunately that was going to take about eight hours, which meant I had to postpone any plans until then. Once it was done, I headed back to the datacenter, made a backup, did the failover correctly this time, and tried the driver update. That ran into its own issues, but they aren’t the subject of this post.

I ended up leaving the datacenter at about 3:00 AM. Needless to say my wife and I didn’t have our date night.

So what did I learn from this? To review and plan every production change. As a result I developed what we internally called our Change, Review, Analysis, Plan document, or CRAP for short. Since that didn’t look good in an email, we dropped the Analysis and it simply became the Change, Review, and Plan (CRaP) document.

Somewhere I probably still have a copy of one, but the basics were a stratification of the risk and the steps taken to reduce it. For example, in the mistake above, a key step would simply have been “ensure a recent backup exists.” Had I done that before my mistake, the recovery would have taken about 20 minutes, not eight hours.
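The check itself is trivial. In modern T-SQL (which we certainly didn’t have then) it might be nothing more than a quick look at the backup history in msdb before touching anything:

```sql
-- When was each database last fully backed up? Run this before any risky change.
SELECT    d.name                    AS database_name,
          MAX(b.backup_finish_date) AS last_full_backup
FROM      sys.databases AS d
LEFT JOIN msdb.dbo.backupset AS b
       ON b.database_name = d.name
      AND b.type = 'D'              -- 'D' = full database backup
GROUP BY  d.name
ORDER BY  last_full_backup;         -- NULLs (never backed up) sort first
```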

So the goal of the document wasn’t just to prevent mistakes, but to assume they would happen, analyze the impact, and work out what steps could be taken to minimize the risks and reduce the recovery time if necessary. It also had an area for determining who needed to sign off on a change. Something with minimal impact (such as adding a column to a table) might only require my signature and one other. On the other hand, if, say, we were doing a full failover test where things going bad could mean a complete outage for an extended time, the CEO would have to sign off.

Over the years I sometimes had employees grumble when I rejected the CRaP they had submitted to me for not being detailed enough or not covering all the bases. But in at least one case, one of my network managers came back after such a rejection and said, “Hmm, you’re right. I missed this particular failure mode. I’m going to do this in person instead of remotely.” The update he was doing ran smoothly, but if it had failed, the outage would have been measured in minutes, not hours.

So, in short: assume you’ll make mistakes, but have a plan for handling them.
