Testing

This ties in with the concept of experimentation. The other night Thomas Grohser related a story that amounted to, “yeah, the database failed, we tried to do a restore, and found out we couldn’t.”

Apparently their system could somehow make backups, but couldn’t restore them. BIG OOPS. (They did eventually manage to create an empty database and replay 4.5 years of transaction logs to recover their data. That’s impressive in its own right.)

This is not the first time I’ve worked with a client, or heard of a company, whose disaster recovery plan didn’t survive its first real test. It may sound obvious, but companies need to test their DR plans. In fact, I’m working with a partner on a new business to help companies think about their DR plans. Note, we’re NOT writing or creating DR plans for companies; we’re going to focus on how companies actually implement and test the DR plans they have.

Fortunately, I’m currently working with a client with an uncommon use case: they wanted the previous night’s backup restored to a different server every day.

They also wanted to log-ship the database in question to another location.

This wasn’t hard to implement.

But what is very nice about this setup is that every 15 minutes we get a built-in, automatic test of their log backups. If for any reason the log backups stop working, or a log gets corrupted, we’ll know in fairly short order.

And with the daily restored copy of the database, we’ll know within a day if their full backups fail. They’re in a position where they’ll never find out 4.5 years later that their backups don’t work.

This client’s DR plan needs a lot of work; they actually have nothing formal written down. However, they know for a fact that their data is safe. That’s a huge improvement over a company that has a DR plan but has no idea whether its data is safe.

Moral of the story: I’d rather know my data is safe and my DR plan needs work than have a DR plan but not have safe data.

Who’s Flying the Plane

I mentioned in an earlier post my interest in plane crashes. I had been toying with a presentation based on this concept for quite a while.

A little over a month ago, at the local SQL Server user group here in Albany, I offered to present at the February meeting. I gave them a choice of topics: a talk on Entity Framework and how its defaults can be bad for performance, or a talk on plane crashes and what IT can learn from them. They chose the latter. I guess plane crashes are more exciting than a dry talk on EF.

In any event, the core of the presentation is based on the two plane crashes mentioned in the earlier post: Eastern Airlines Flight 401, the L-1011 that crashed in Florida in 1972, and US Airways Flight 1549, the “Miracle on the Hudson,” in 2009.

I don’t want to reproduce the entire talk here (in part because I’m hoping to present it elsewhere) but I want to highlight one slide:

Flight 401 vs 1549

  • Flight 401 – Perfectly good aircraft
  • Flight 1549 – About as bad as it gets
  • Flight 401 – 101 Fatalities/75 Survivors
  • Flight 1549 – 0 Fatalities

Flight 401 had a burned-out nose gear indicator light and crashed.

Flight 1549 had two non-functional engines and everyone got off safely.

The difference was good communication, planning, and a focus at all times on who was actually flying the airplane.

Think about this the next time you’re in a crisis. Are you communicating well? How is your planning? Is someone actually focused on making sure things don’t get worse because you’re focusing on the wrong problem? I touch upon that here when I talk about driving.

The moral: always make sure someone is “flying the plane”.

On Call

I want to pass on a video I’ve finally gotten around to watching:

Dave O’Conner speaks

I’ve managed a number of on-call teams, with varying levels of success. One point I’d add that makes a difference: good buy-in from above.

He addresses several good points, most of which I fully agree with and have, at various times, adopted at my various jobs.

One thing he mentions is availability. Too often folks claim they need 99.999% uptime. My question has often been “Why?”, followed by, “Are you willing to pay for that?” Often the why boils down to “umm… because…” and the answer on paying for it is “no,” at least once they realize the true cost.
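
The arithmetic behind that question is worth spelling out. A quick back-of-the-envelope query (purely illustrative) shows how little downtime each extra nine actually allows:

```sql
-- How much downtime per year does each "number of nines" really permit?
SELECT Availability,
       CAST((1.0 - Availability / 100.0) * 365.25 * 24 * 60 AS DECIMAL(10,1)) AS DowntimeMinutesPerYear
FROM (VALUES (99.9), (99.99), (99.999)) AS A (Availability);
-- 99.9   -> ~526 minutes (almost 9 hours)
-- 99.99  -> ~53 minutes
-- 99.999 -> ~5 minutes
```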

I also had a rule that I sometimes used: “If there’s no possible response, or no response is necessary, don’t bother alerting!”

An example might be traffic flow. I’ve seen setups where, if traffic exceeds a certain threshold even once in, say, a one-hour period (assume monitoring every 5 seconds), a page goes out. Why? By the time you respond, the spike is gone and there’s nothing to do.

A far better response is to automate it such that if it happens more than X times in Y minutes, THEN send an alert.
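
As a rough illustration of that pattern, the alert logic boils down to something like the following (the dbo.TrafficSamples table, the thresholds, and the mail profile are all hypothetical):

```sql
-- Assume the monitor inserts a row into dbo.TrafficSamples each time the threshold is breached.
DECLARE @Breaches INT;

SELECT @Breaches = COUNT(*)
FROM dbo.TrafficSamples
WHERE SampleTime >= DATEADD(MINUTE, -15, GETDATE());   -- Y = 15 minutes

IF @Breaches > 5                                        -- X = 5 breaches
BEGIN
    EXEC msdb.dbo.sp_send_dbmail
        @profile_name = 'DBA Mail',                     -- assumes an existing Database Mail profile
        @recipients   = 'oncall@example.com',
        @subject      = 'Traffic threshold exceeded repeatedly',
        @body         = 'More than 5 breaches in the last 15 minutes; someone should take a look.';
END
```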

In some cases, simply retrying works.  In the SQL world I’ve seen re-index jobs fail due to locking or other issues.  I like my sleep.  So I set up most of my jobs to retry at least once on failure.

Then, later, I’ll review the logs. If I see a constant pattern of retries, I’ll schedule time to fix the underlying problem.

At one client, we had an issue where a job would randomly fail maybe once a month.  They would page someone about it, who would rerun the job and it would succeed.

I looked at the history and realized that simply adding a delay of about 5 minutes on failure and then retrying would reduce the number of times someone had to be called from about once a month to roughly once every 3 years. Fifteen minutes of reviewing the problem during a normal 9-to-5 day, plus 5 minutes of checking the math and implementing the fix, meant the on-call person got more sleep every month. A real win.
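
In SQL Server Agent terms, that kind of fix is just two settings on the job step. A sketch, with a hypothetical job name:

```sql
-- Wait 5 minutes and retry once before the step (and therefore the job) is reported as failed.
EXEC msdb.dbo.sp_update_jobstep
    @job_name       = N'Nightly Re-Index',   -- hypothetical job name
    @step_id        = 1,
    @retry_attempts = 1,                      -- retry once on failure
    @retry_interval = 5;                      -- minutes to wait between attempts
```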

Moral of the story: Not everything is critical, and if something is, handle it as if it is, not as an afterthought.

Rolling in the Deep

I was at SQL Saturday in Boston this past weekend and I sat in on a session given by Paresh Motiwala: “Why do we shun using tools for DBA job?”

It’s a decent question, and sometimes rolling your own (hence the title here, and yes, I’m listening to Adele right now) is the right answer.

But often, it’s not.

A case in point is DBAs who avoid using the built-in SQL Server maintenance plans, especially for simple tasks such as backups.

Now, I’ll start out by saying straight up that sometimes they’re not the optimal solution. (You may want to back up certain partitions on a different rotation schedule, as one example.)

Below is a recent situation I came across.

But, if you do decide to roll your own, please do NOT do the following:

  1. First write a sproc that has to be placed into each database as it’s created in order to run the backup. (If you DO decide to go this route, please make this a scripted part of your release procedure so it’s not missed.)
  2. If you DO in fact ignore suggestion #1, I will point out that it doesn’t do much good to make the first step of your sproc a check for whether the database exists. BIG HINT: If the database doesn’t exist, the sproc won’t run in the first place! Yes, I know you’re saying, “Obviously,” but this is the situation I just came across.
  3. If you DO in fact ignore suggestion #2, you probably don’t need to call your own special “alert sproc”.
  4. But if you DO in fact ignore suggestion #3, make sure your “alert sproc” does more than call a “email sproc” and pass it a few parameters.
  5. And if you DO in fact ignore suggestion #4, please make sure your “email sproc” does a bit more than build a message and call the built-in SQL Server stored proc to send an email. (A sketch of that direct call follows this list.)
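
For contrast, here is a minimal sketch of everything that “alert sproc”/“email sproc” chain ultimately needs to do: one direct call to Database Mail (the profile name and recipients are hypothetical):

```sql
-- Build the message and hand it straight to Database Mail; no wrapper sprocs required.
DECLARE @subject NVARCHAR(255) = N'Backup failure on ' + @@SERVERNAME;

EXEC msdb.dbo.sp_send_dbmail
    @profile_name = 'DBA Mail',             -- assumes an existing Database Mail profile
    @recipients   = 'dba-team@example.com',
    @subject      = @subject,
    @body         = N'The backup job reported a failure. Check the job history for details.';
```

In practice, a SQL Server Agent operator notification on job failure gets you the same result with no custom code at all.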

Now granted, there may be reasons to do some of the above. Perhaps your alert sproc also calls some sort of 3rd party monitoring tool. Or your email needs are very specific.

In the case I just rectified, none of that was true.  So there was a lot of additional complexity (which really didn’t work anyway) for no good reason.

Another problem with this roll-your-own backup setup was that it used the same filename every time. On Monday the backup name was M:\Backups\FOO_FULL.BAK. On Tuesday it was M:\Backups\FOO_FULL.BAK again, and so on.

In theory (and generally in practice) each of these files would be backed up to a 3rd party, so it was theoretically possible to find the backup for a specific day, but that added complexity, and probably not the kind you want in an actual DR situation. Also, if for some reason the backup to the 3rd party failed (but the local backups continued), they’d definitely lose the ability to restore specific days.

In addition, the person who built this procedure set up differential backups for MOST databases to run every 15 minutes. I wouldn’t necessarily call that a terrible idea, but in this case it was almost certainly not the best approach. And again, the same file name was used each time.

This means that in a DR event, the company could restore the previous night’s full backup and, if they wanted, the most recent diff backup, and that was it. If they wanted to restore to a point in time in between, that was impossible, and in my experience that is a far more common restore need than most others.

Finally, the developer who wrote all this clearly did not understand what the CHECKPOINT command does. He had scheduled a checkpoint job to run every 30 minutes. In general that is not only unnecessary but probably a bad idea, and in this case not only was it unnecessary, the reason given in the job comments was completely wrong: he seemed to think it would keep the transaction logs from growing.

That is of course NOT what CHECKPOINT does, and sure enough, on the one database still using the FULL recovery model, the transaction log was far larger than the actual database. (Fortunately it was a lightly used database, or the disk might have filled up years ago.)
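
For the record, what actually keeps a FULL-recovery transaction log in check is a regular log backup, not CHECKPOINT. A minimal sketch (database name and path hypothetical):

```sql
-- In the FULL recovery model, only a log backup marks log space as reusable;
-- CHECKPOINT triggers log truncation only under the SIMPLE recovery model.
BACKUP LOG [FOO]
TO DISK = N'M:\Backups\FOO_LOG.trn'   -- in practice, with a unique name, as below
WITH COMPRESSION, STATS = 10;
```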

Since discovering all this, I’ve replaced all that complexity with a set of maintenance jobs. These guarantee each system database is backed up weekly (for their needs this should be fine) with unique names. User databases are backed up nightly and retained for 4 nights (perhaps longer once we determine how much disk space we want to set aside for this). Transaction log backups run every 15 minutes, and these too have unique names.
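
The maintenance jobs generate unique file names automatically; for anyone rolling their own, the equivalent is simply to stamp the date and time into the name, roughly like this (paths hypothetical):

```sql
-- Build a unique, timestamped backup file name instead of overwriting the same file every night.
DECLARE @file NVARCHAR(260) =
    N'M:\Backups\FOO_FULL_' + CONVERT(NVARCHAR(8), GETDATE(), 112)       -- yyyymmdd
    + N'_' + REPLACE(CONVERT(NVARCHAR(5), GETDATE(), 108), ':', '')      -- hhmm
    + N'.bak';

BACKUP DATABASE [FOO]
TO DISK = @file
WITH COMPRESSION, CHECKSUM, STATS = 10;
```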

Now the customer can restore to any point in time in the last 4 days (even further back if they go to their 3rd-party backups), to within 15 minutes of a failure, and in some cases, if the log is still available and they can back up the tail of the log, to the instant before the failure.
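
A point-in-time restore under that scheme looks roughly like this (database, file names, and times are hypothetical):

```sql
-- Restore the full backup without recovering, apply each intervening log backup
-- WITH NORECOVERY, then stop in the final log backup just before the failure.
RESTORE DATABASE [FOO]
FROM DISK = N'M:\Backups\FOO_FULL_20160214_2200.bak'
WITH NORECOVERY, REPLACE;

-- ...apply earlier log backups WITH NORECOVERY here...

RESTORE LOG [FOO]
FROM DISK = N'M:\Backups\FOO_LOG_20160215_0915.trn'
WITH STOPAT = N'2016-02-15T09:10:00', RECOVERY;
```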

If they add databases, they don’t have to worry about remembering to install 3 separate sprocs for each one and add new jobs to SQL Server Agent.

Now they not only have a far more robust backup plan, they have one that is far easier to maintain.  Oh and one that will actually send an email if there’s a problem with the backup.

The moral: don’t make things more complex unless you absolutely need to, and if you do, make sure you actually achieve the goals you set out to achieve.

Practicing for Disaster

I’ve had this post by Wayne Hale in my queue for a while; I’ve wanted to comment on it but until recently have been too busy to do so.

One of my current clients is required to do an annual DR test. Since the end of the year is approaching, they’re trying to get the test in. Part of the requirement is an “official” test observed by an outside auditor.

So, being smart, and since a lot has changed in the past year, we decided to schedule a dry-run or two beforehand.

Well let’s just say those have not gone as expected.

Some might consider the dry-runs failures.

I don’t. I consider them successes. We are finding out now, in a controlled environment with no real time pressures, where we are weak and need to fix things.

It’s far better to find these problems now than during the audited test, or worse, during an actual disaster! So the dry-runs are serving their purpose: they’re helping us find the holes before it’s too late.

That said, I’m happy to say that the part I’m most involved with, the SQL log shipping, has been working well. The only issue this week was a human error by another DBA, completely unrelated to the DR test, and within minutes of discovering his error he executed the proper procedure to begin fixing it. The total fix on his end took no more than 5 minutes, and other than monitoring, the effort on my end took no more than 5 minutes as well. That’s an excellent example of a robust design and set of procedures.

Today’s moral is don’t just have a DR plan, practice it. And not every failure is really a failure.