It’s true. Even if they don’t realize it. Or even if they claim they do. They really don’t.
I’ve made this point before. Of course this is hyperbole. But a recent post by Taryn Pratt reminded me of this. I would highly recommend you go read Taryn’s post. Seriously. Do it. It’s great. It’s better than my post. It actually has code and examples and the like. That makes it good.
That said, why the title here? Because again, I want to emphasize that what your boss really cares about is business continuity. At the end of the day they want to know, “If our server crashes, can we recover?” And the answer had better be “Yes.” This means you need to be able to restore those backups, or have another form of recovery.
Log-Shipping
It seems to me that over the years log-shipping has sort of fallen out of favor. “Oh, we have SAN snapshots.” “We have Availability Groups!” “We have X.” “No one uses log-shipping anymore; it’s old school.”
In fact, this recently came up in a DR discussion I had with a client and their IT group. They use SAN replication software to replicate data from one data center to another. “Oh, you don’t need to worry about shipping logs or anything, this is better.”
So I asked questions: was it block-level, file-level, byte-level, or what? How much latency was there? How could we be sure the data was hardened on the receiving side? I actually never got really clear answers to any of that, other than, “It’s never failed in testing.”
So I asked the follow-up question, “How was it tested?” I’m sure their answer was supposed to reassure me: “Well, during a test, we’d stop writing to the primary, shut it down, and then redirect the clients to the secondary.” And yes, that’s a good test, but it’s far from a complete test. Here’s the thing: many disasters don’t allow the luxury of cleanly stopping writes to the primary. They can occur for many reasons, but in many cases the failure is basically instantaneous. This means data was in flight. Where? Was it hardened to the log? Was it in flight to the secondary? Inquiring minds want to know.
Now, this is not to say these various methods of disk-based replication (as opposed to SQL-based, which is a different beast) aren’t effective or don’t have their place. It’s simply to say they’re not perfect, and one has to understand their limitations.
So back to log-shipping. I LOVE log-shipping. Let me start with a huge caveat: in an unplanned outage, your secondary will only be as up to date as the most recent log backup. This could be an issue. But the upside is, you should have a very good idea of what’s in the database, and your chances of a corrupted block of data, or the like, are very low.
But there are two things I love about it.
- Every time I restore a log file, I’ve tested the backup of that log file. This may seem obvious, but it gives me a constant check on my backups. If my backups fail for any reason (lack of space, a bad block written and not noticed, etc.), I’ll know as soon as my next restore fails. Granted, my FULL backups aren’t being restored all the time, but I’ve got at least some additional evidence that my backup scheme in general is working. (And honestly, if I really needed to, I could back up my copy and use that in a DR situation.)
- It can make me look like a miracle worker. I have, in the past, in a shop where developers had direct access to prod and had been known to mess up data, used log-shipping to save the day. Either on my DR box, or on a separate box I kept around that was too slow CPU-wise for DR but had plenty of disk space, I’d set it to delay applying logs for 3-4 hours. In the event of most DR events, it was fairly simple to catch up on log-shipping and bring the DR box online. But more often than not, I used it (or my CPU-weak but disk-heavy box) in a different way. I’d get a report from a developer: “Greg, umm, well, not sure how to say this, but I just updated the automobile table so that everyone has a White Ford Taurus.” I’d simply reply, “Give me an hour or so, I’ll see what I can do.” Now the reality is, it never took me an hour. I’d simply take the log-shipped copy I had, apply any logs I needed to catch up to just before their error, then script out the data and fix it in production (a sketch of that catch-up follows this list). They always assumed I was restoring the entire backup or something like that. That wasn’t the case, in part because doing so would have taken far more than an hour and would have caused a complete production outage.
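For the curious, here’s a minimal sketch of that catch-up. The database name, paths, and timestamps are all hypothetical; the key pieces are WITH STANDBY (which keeps the copy readable between restores; WITH NORECOVERY is the alternative when you don’t need to read it) and STOPAT (which halts replay just before the bad update):

```sql
-- All names, paths, and times here are hypothetical.
-- Apply the outstanding log backups to the log-shipped copy,
-- stopping just before the bad UPDATE ran.
RESTORE LOG AutoSales
FROM DISK = N'\\drshare\logs\AutoSales_20240115_1330.trn'
WITH STANDBY = N'D:\standby\AutoSales_undo.dat', -- copy stays readable
     STOPAT = N'2024-01-15T13:58:00';            -- halt just before the error

-- The pre-error rows are now readable on the standby copy and can be
-- scripted out and applied as a targeted fix in production.
SELECT VIN, Make, Model, Color
FROM AutoSales.dbo.Automobile;
```

You’d repeat the RESTORE LOG for each log backup in sequence until you hit the STOPAT time.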
There was another advantage to my second use of log backups: I got practice manually applying logs, WITH NORECOVERY and the like. I’m a firm believer in Train as you Fight.
Yes, in an ideal world, a developer will never have such unrestricted access to production (and honestly, it’s gotten better; I rarely see that these days) and you should never need to deal with an actual DR, but we don’t live in an ideal world.
So, at the end of the day, I don’t care if you do log-shipping, Taryn Pratt’s automated restores, or something else, but do restores, both automated and manual. Automated, because they’ll continually test your backups. Manual, because they’ll hone your skills for when your primary is down and your CEO is breathing down your neck as you huddle over the keyboard trying to bring things back.
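As a bare-bones illustration of the automated side (not Taryn’s implementation; her post has the real thing), the core of such a job is just a restore to a scratch instance followed by a consistency check. The names and paths below are hypothetical:

```sql
-- Hypothetical names and paths; run on a scratch/test instance.
-- Restore the latest full backup under a throwaway name...
RESTORE DATABASE AutoSales_RestoreTest
FROM DISK = N'\\backupshare\AutoSales_full.bak'
WITH MOVE N'AutoSales'     TO N'E:\restoretest\AutoSales.mdf',
     MOVE N'AutoSales_log' TO N'E:\restoretest\AutoSales_log.ldf',
     RECOVERY, REPLACE;

-- ...then prove the restored copy is actually healthy, not merely restorable.
DBCC CHECKDB (AutoSales_RestoreTest) WITH NO_INFOMSGS;
```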
Reminder
As a consultant, I’m always looking for new clients. My primary focus is helping to outsource your on-prem DBA needs. If you need help, let me know!