SQL Saturday NYC 2017 Recap

Realizing that perhaps SOME entry is better than no entry, I figured I’d write a short one.

This weekend, I spent about 27 hours in NYC with my wife Randi. While one goal of the trip was definitely a mini-vacation, the actual impetus was again a SQL Saturday. Again, I was selected to present my talk, “Tips that Saved My Bacon”. Apparently it was well received, since the people nice enough to give feedback on the provided forms gave me top marks across the board.

While that’s always refreshing, it does make me wonder what the other folks thought. Did they go away satisfied, feeling there was no useful feedback to provide? Or did they feel they couldn’t provide criticism since I might be insulted? Or did they simply not bother? I’ll never know.

I’ll say now that good feedback is always appreciated by me. (And feedback that I’m good is always an ego boost. 🙂)

I’ll be presenting again in a couple of weeks at SQL Saturday Philadelphia, this time giving two talks: my Bacon talk again and my IT and Plane Crashes talk. A twofer, if you will.

But, this weekend got me thinking about my weekends this year. I’ll have spent at least 15 days on Cave Rescue stuff (several weekends plus a week of teaching), at least 4 SQL Saturdays (Chicago (passed), New York City (passed), Philadelphia and Albany), and 3 days at SQL Summit. So that’s at least 26 days donating time to organizations that I believe strongly in.

What do you do with your time?


Don’t Break the Chain!

If one backup is good, two is better right?

Not always.

Let me start by saying I’ve often been very skeptical of SQL Server backups done by 3rd party tools. There are really two reasons. For one, many years ago (when I first started working with SQL Server) they often simply weren’t good. They had issues with consistency and the like. Over time, and with the advent of services like VSS, that issue is now moot (though I’ll admit old habits die hard).

The second reason is that I hate relying on things I don’t have complete control over. As a DBA, I feel it’s my responsibility to make sure backups are done correctly AND are usable. If I’m not completely in the loop, I get nervous.

Recently, a friend had a problem that brought this issue to light. He was asked to go through their SQL Server backups to find the time period when a particular record was deleted so they could develop a plan for restoring the data deleted in the primary table and in the subsequent cascaded deletes. Nothing too out of the ordinary. A bit tedious, but nothing too terrible.

So, he did what any DBA would do: he restored the full backup of the database for the date in question. Then he found the first transaction log backup and restored that. Then he tried to restore the second transaction log backup.

The log in this backup set begins at LSN 90800000000023300001, which is too recent to apply to the database. An earlier log backup that includes LSN 90800000000016600001 can be restored.

Huh? Yeah, apparently there’s a missing log.  He looks at his scheduled tasks. Nope, nothing scheduled. He looks at the filesystem.  Nope, no files there.

He tries a couple of different things, but nope, there’s definitely a missing file. Anyone who knows anything about SQL Server backups knows that you can’t break the chain: each log backup picks up exactly where the previous one left off, and if one is missing you can’t restore past the gap. This can work both ways. I once heard of a situation where the FULL backups weren’t recoverable, but they were able to create a new empty database and apply five years’ worth of transaction logs. Yes, 5 years’ worth.

This was the opposite case. They had the full backup they wanted, but couldn’t restore even 5 hours worth of logs.
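
As an aside, if you’re ever chasing a gap like this by hand, RESTORE HEADERONLY will show you the LSN range a backup file covers (the file name here is just for illustration):

-- Inspect the LSN range a backup file covers
RESTORE HEADERONLY FROM DISK = N'M:\Backups\FOO_LOG_1.TRN'
-- In the output, each log backup's FirstLSN must match the previous
-- log backup's LastLSN for the chain to be unbroken.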

So where was that missing transaction log backup?

My friend did some more digging in the backup history tables in msdb and found this tidbit:

backup_start_date: 11/9/2016 0:34
backup_finish_date: 11/9/2016 0:34
first_lsn: 90800000000016600000
last_lsn: 90800000000023300000
physical_device_name: NUL

There was the missing transaction log backup. It was taken a few minutes after the full backup, and it was definitely not part of the scheduled backups he had set up. The best he can figure is that the sysadmin had set the SAN snapshot software to take a full backup at midnight and then, for some reason, a transaction log backup just minutes later.

That would have been fine, except for one critical detail. See that last value, physical_device_name? It’s set to NUL. So the missing backup wasn’t made to tape, or another spot on the disk, or anyplace like that. It was sent to the great bit-bucket in the sky. In other words, my friend was SOL, simply out of luck.
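
If you want to check your own servers for backups quietly vanishing like this, a query along these lines against the standard msdb history tables should surface them:

-- Find any backups whose destination was the NUL device
SELECT bs.database_name,
       bs.backup_start_date,
       bs.backup_finish_date,
       bs.type,              -- D = full, I = differential, L = log
       bs.first_lsn,
       bs.last_lsn,
       bmf.physical_device_name
FROM msdb.dbo.backupset bs
JOIN msdb.dbo.backupmediafamily bmf
    ON bs.media_set_id = bmf.media_set_id
WHERE bmf.physical_device_name IN ('NUL', 'NUL:')
ORDER BY bs.backup_start_date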

Now, fortunately, the original incident, while a problem for his office, wasn’t a major business-stopping incident. And while he can’t fix the original problem he was facing, he discovered the issues with his backup procedures long before a major incident did occur.

I’m writing about this incident for a couple of reasons. For one, it emphasizes why I feel so strongly about realistic DR tests. Don’t just write your plan down. Do it once in a while. Make it as realistic as it can be.

BTW, one of my favorite tricks, which I use for multiple reasons, is to set up log shipping to a 2nd server. Even if the 2nd server can never be used for production because it lacks the performance, you’ll know very quickly if your chain is broken.
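
As a quick sanity check on the secondary, the restore history in msdb will show how current each database is; if the chain breaks, the last restore date stops advancing. Something along these lines, run on the secondary, does the trick:

-- How recently was each database restored to on this server?
SELECT destination_database_name,
       MAX(restore_date) AS last_restore
FROM msdb.dbo.restorehistory
GROUP BY destination_database_name
ORDER BY last_restore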

Also, I thought this was a great example of how doing things twice doesn’t necessarily make things more resistant to disaster. Yes, had this been set up properly it would have resulted in two separate, full backups being taken, in two separate places. That would have been better. But because of a very simple mistake, the setup was worse than if only one backup had been written.

I’d like to plug my book: IT Disaster Response due out in a bit over a month. Pre-order now!

Deep Drilling

I was reviewing the job history on one of the DR servers of a client of mine. I noticed something funny. The last job recorded in the job history table (msdb.dbo.sysjobhistory for those playing along at home) was recorded in January of this year.

But jobs were still running. It took me a while to track it down, but through some sleuthing I solved the problem. First, I thought the msdb database might have filled up (though that event should have generated an error I’d have seen). Nope.

Then I thought perhaps the table itself was full somehow. Nope, only about 32,000 records.  No luck.

I finally tried to run sp_sqlagent_log_jobhistory manually with some made-up job information.

Msg 8115, Level 16, State 1, Procedure sp_sqlagent_log_jobhistory, Line 99
Arithmetic overflow error converting IDENTITY to data type int.
Arithmetic overflow occurred.

Now we’re getting someplace.  After a minor diversion of my own doing I then ran

DBCC CHECKIDENT('sysjobhistory', NORESEED)

This returned a value of 2147483647. Hmm, that number looks VERY suspicious. A quick check of Books Online confirmed that’s the max value of a signed int.

Now, the simple solution, which worked for me in this case was to issue a

truncate table sysjobhistory

This removed all the rows in the table AND reset the IDENTITY value. Normally I’d hate to lose history information, but since this was 6 months old and seriously out of date, it was acceptable. I could have merely reset the IDENTITY seed value, but there’s no guarantee I wouldn’t have then had collisions within the table later on. So this was the safest solution.
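
For reference, the two options look roughly like this (both run in msdb; the second is the one I avoided):

USE msdb
GO
-- Option 1 (what I did): removes all rows AND resets the IDENTITY value
TRUNCATE TABLE dbo.sysjobhistory

-- Option 2 (avoided): keeps the rows and resets only the seed.
-- New rows could then collide with instance_id values still in the table.
-- DBCC CHECKIDENT('dbo.sysjobhistory', RESEED, 0)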

But wait, there was more. It kept bugging me that I had somehow hit the 2 billion value limit on this table’s IDENTITY column. Sure, the server handles log shipping for about a dozen databases and as a result runs about 48 jobs an hour, plus other jobs. But a year of that should generate less than 1 million rows. This database server hasn’t been running for 2 thousand years.

So, I decided to monitor things a bit and wait for a few jobs to run.

Then, I executed the following query.

select max(instance_id) from sysjobhistory

This returned a value along the lines of 232031. Somehow, in the space of an hour or less, my sysjobhistory IDENTITY value had increased by over 232,000. This made no sense. But it did explain how I hit 2 billion!

So I started looking at the sysjobhistory table in detail. And I noticed gaps. Some make sense: if a job has multiple steps, it may temporarily insert a row and then roll it back once the job is done and put in a job completion record, and with the way IDENTITY columns work, this explains some small gaps. For example, there was a gap in instance_id from 868 to 875. OK, that didn’t bother me. BUT, the next value after 875 was 6,602. That was a huge gap! Then I saw a gap from 6,819 to 56,692. Another huge gap. As the movie says, “Something strange was going on in the neighborhood”.
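
Finding the gaps themselves is easy enough with a windowing function (this sketch assumes SQL Server 2012 or later, for LAG):

-- Look for large gaps in sysjobhistory's instance_id values
WITH h AS (
    SELECT instance_id,
           LAG(instance_id) OVER (ORDER BY instance_id) AS prev_id
    FROM msdb.dbo.sysjobhistory
)
SELECT prev_id, instance_id, instance_id - prev_id AS gap
FROM h
WHERE instance_id - prev_id > 100
ORDER BY gap DESC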

I did a bit more drilling and found 3 jobs handling log shipping from a particular server that were showing HUGE amounts of history. Drilling deeper, I found they were generating errors: “Could not delete log file….” Sure enough, I went to the directories where the files were stored, and there were log files going back to November. Each directory had close to 22,000 log files that should have been deleted and weren’t.

Now I was closer to an answer. Back in November we had had issues with this server and I had to do a partial rebuild of it. And back then I had had some other issues related to log-shipping and permissions. I first checked permissions, but everything seemed fine.

I then decided to check file attributes, and sure enough all these files (based on the subdirectory attribute setting) had the R (read-only) attribute set. No wonder they couldn’t be deleted.

Now I’m trying to figure out how they got their attribute values set to R. (This is a non-traditional log-shipping setup, so it doesn’t use the built-in SQL Server tools to copy the files. It uses rsync to copy files through an SSH tunnel.)

So the mystery isn’t fully solved. It won’t be until I understand why they had an R value and whether it will happen again. That particular issue I’m still drilling into. But at least now I know why I hit the 2 billion mark in my history table.

But, this is a good example of why it’s necessary to follow through an error to its root cause. All too often as an IT manager I’ve seen people who reported to me fix the final issue, but not the root cause. Had I done that here, i.e. simply cleared the history and reset the IDENTITY value, I’d have faced the same problem again a few weeks or months from now.

Moral of the story: When troubleshooting, it’s almost always worth taking the time to figure out not just what happened and fixing that, but WHY it happened and preventing it from happening again.


Testing

This ties in with the concept of experimentation. Thomas Grohser related a story the other night of a case of “yeah, the database failed and we tried to do a restore and found out we couldn’t.”

Apparently their system could somehow make backups but couldn’t restore them. BIG OOPS. (They did manage to create an empty database and replay 4.5 years of transaction logs to recover their data. That’s impressive in its own right.)

This is not the first time I’ve worked with a client, or heard of a company, whose disaster recovery plans didn’t survive their first actual use. It may sound obvious, but companies need to test their DR plans. I’m in fact working with a partner on a new business to help companies think about their DR plans. Note, we’re NOT writing or creating DR plans for companies; we’re going to focus on how companies go about actually implementing and testing their DR plans.

Fortunately, right now I’m working with a client that has an uncommon use case: they wanted a restore of the previous night’s backup to a different server every day.

They also wanted to log-ship the database in question to another location.

This wasn’t hard to implement.
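
The nightly piece boils down to a scheduled restore on the second server. A minimal sketch, with all database, file, and path names made up for illustration:

-- Hypothetical nightly job on the verification server:
-- restore last night's full backup over the previous copy
RESTORE DATABASE ClientDB_Verify
FROM DISK = N'\\backupshare\ClientDB\ClientDB_Full.bak'
WITH MOVE N'ClientDB'     TO N'D:\Data\ClientDB_Verify.mdf',
     MOVE N'ClientDB_log' TO N'L:\Logs\ClientDB_Verify.ldf',
     REPLACE, STATS = 10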

But what is very nice about this setup is that every 15 minutes we have a built-in automatic test of their log backups. If for any reason log backups stop working, or a log gets corrupted, we’ll know in fairly short order.

And, with the database copy, we’ll know within a day if their full backups fail. They’re in a position where they’ll never find out 4.5 years later that their backups don’t work.

This client’s DR plan needs a lot of work; they actually have nothing formal written down. However, they know for a fact their data is safe. This is a huge improvement over companies that have a DR plan but have no idea if their data is safe.

Moral of the story: I’d rather know my data is safe and my DR plan needs work than have a DR plan but not have safe data.

Experimenting

There are times when you have to take at face value what you are told.

There are 1.31 billion people living in China. This is according to several sources (which probably all go back to the same official document from the Chinese government). I’m willing to believe that number. I’m certainly not going to go to China and start counting heads. For one, I don’t have the time; for another, I might look awfully weird doing so. It’s also accurate enough for any discussions I might have about China. But if I were going to knit caps for every person in China, I might want a more accurate number.

That said, sometimes one shouldn’t take facts at face value. A case in point is given below. Let me start out by saying the person who gave me this fact wasn’t wrong. At least, they’re no more wrong than the person who tells me that the acceleration due to gravity is 9.8m/s². No, they are at worst inaccurate and more likely imprecise. Acceleration due to gravity here on Earth IS roughly 9.8m/s². But it varies depending on where on the surface I am. And if I’m on the Moon it’s a completely different value.

Sometimes it is in fact possible to actually test, and often worth it. I work with SQL Server, and this is very true here. If a DBA tells you with absolute certainty that a specific setting should be set, or a query must be written a specific way, or an index rebuilt automatically at certain times, ask why. The worst answer they can give is, “I read it some place.” (Please note, this is a bit different from saying, “Generally it’s best practice to do X.” Now we’re back to saying 9.8m/s², which is good enough for most things, but may not be good enough if, say, you want to precisely calibrate a piece of laboratory equipment.)

The best answer is “because I tested it and found that it works best”.

So, last night I had the pleasure of listening to Thomas Grohser speak on the SQL IO engine at a local SQL Server User Group meeting. As always it was a great talk. At one point he was talking about backups and various ways to optimize them. He made a comment about setting the maxtransfersize to 4MB being ideal. Now, I’m sure he’d be the first to add the caveat, “it depends”. He also mentioned how much compression can help.

But I was curious and wanted to test it. Fortunately I had access to a database that was approximately 15GB in size. This seemed like the perfect size with which to test things.

I started with:

backup database TESTDB to disk='Z:\backups\TESTDB_4MB.BAK' with maxtransfersize=4194304

This took approximately 470 seconds and had a transfer rate of 31.151 MB/sec.

backup database TESTDB to disk='Z:\backups\TESTDB_4MB_COMP.BAK' with maxtransfersize=4194304, compression

This took approximately 237 seconds and a transfer rate of 61.681 MB/sec.

This is almost twice as fast.  While we’re chewing up a few more CPU cycles, we’re writing a lot less data.  So this makes a lot of sense. And of course now I can fit more backups on my disk. So compression is a nice win.

But what about the maxtransfersize?

backup database TESTDB to disk='Z:\backups\TESTDB.BAK'

This took approximately 515 seconds and a transfer rate of 28.410 MB/sec. So far, it looks like changing the maxtransfersize does help a bit (about 8%) over the default.

backup database TESTDB to disk='Z:\backups\TESTDB_comp.BAK' with compression

This took approximately 184 seconds with a transfer rate of 79.651 MB/sec.  This is the fastest of the 4 tests and by a noticeable amount.
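
To recap the four runs against the same ~15GB database:

options                               time (sec)   rate (MB/sec)
default                               515          28.410
maxtransfersize=4MB                   470          31.151
compression only                      184          79.651
maxtransfersize=4MB + compression     237          61.681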

Why? I honestly don’t know. If I were really trying to optimize my backups, most likely I’d run each of these tests 5-10 more times and take an average. This may be an outlier. Or perhaps the 4MB test with compression ran slower than normal. Or there may be something about the disk setup in this particular case that makes it the fastest method.

The point is, this is something that is easy to setup and test. The entire testing took me about 30 minutes and was done while I was watching tv last night.

So before you simply read something on some blog someplace about “you should do X to SQL Server” take the time to test it. Perhaps it’s a great solution in your case. Perhaps it’s not. Perhaps you can end up finding an even better solution.


Bacon

Another SQLSaturday has come and gone. This one in Philadelphia. I had actually been to this location twice before for SQL Server User Group meetings. I have a client in the area, so sometimes my schedule syncs up and I manage to get in some time with the Philly SQL Server User Group. It’s free, and it’s always a good way to learn something new.

Several months ago there were several memes going around on the Internet about how much money it cost to rescue Matt Damon from various places (WWII Europe, from beyond a black hole, from Mars). Today I caught a clip of a deleted scene from The Martian where his character talks about how much was spent to rescue one person.

It comes down to the fact that humans are social creatures and we like being part of a community. Sometimes I stop and think about the communities I’m in and how they overlap or don’t overlap.

As I’ve mentioned, I’m an instructor with the National Cave Rescue Commission. I’m also the Northeast Regional Coordinator for said organization. We technically don’t rescue folks, we teach folks to rescue people. We do it because the people who are best suited to rescue cavers are other cavers.  We do it because we know that someday there will be someone stuck or injured or lost in a cave and they may need our help. And we do it because we know that person might be us.  And if we’re not willing to help others, who will help us?

The other community on my mind today is the SQLSaturday community. Sure, we’re generally not dealing with life or death decisions, and I can’t think of the last time someone’s life was on the line as I was doing work as a DBA. But the concept is still the same. We’re only as good as the community. Sure, there are some brilliant DBAs out there. There are some folks who speak T-SQL more fluently than many of us speak our native language. But ultimately, even they often will rely on the knowledge of others because there’s simply too much to know about this platform. So we come together and share.

Both of these communities, cave rescuers and SQL Saturday presenters, share a couple of things in common. We have a love for what it is that we do. We desire to help others. And we don’t do it for the profit or the glory. When I teach any of the NCRC cave rescue courses, I get paid with my meals. The same is true with SQL Saturdays. Sometimes I might get a free shirt. And generally I’m paying for my own travel to any of these events.

However, there’s one other form of payment I strive for in each case. That of passing on knowledge and of gaining knowledge. With the cave rescue classes, I can often recall distinctly the look on a student’s face when a concept I’m teaching registers and they have what I call the “aha” moment. That’s worth the time and effort for me.

The same is true when I teach at SQL Saturdays. If I have a single attendee come up to me and say, “Hey, I really liked your talk. The idea for X really resonated,” I feel like I’ve accomplished my goal.

Equally, I almost always come away having learned something myself. That too makes it worth my while.

My talk yesterday was “Tips that have saved my Bacon”. I’ll admit I was a bit nervous giving this talk, since the first time I gave it, at SQLSaturday NYC 2015, only about 8 people showed up. So I wasn’t sure how popular it would be this time, and being slotted for the first session in the morning, I wasn’t sure if folks would be up yet.

Fortunately, over 25 people showed up and seemed to enjoy it. And yes, I did have several people come up to me afterwards thanking me for the session and that they enjoyed it. So, my mission was accomplished. I also received some good critical feedback for improving it next time.

I’ll be speaking again at SQLSaturday here in Albany NY on July 30th. This time though will be a different topic, one I really enjoy and so far I’ve received great feedback on. It’s not about SQL specifically, but more about how IT and Management can learn from plane crashes.

I hope you can make it.

Documentation

Do it, it’s important.

Ok, I suppose I should expand a bit upon that and in this case add an actual example.

So last night, I again attended the local SQL Server User Group meeting. The talk this month was by Ray Kim and was on documentation for techies. While we all agree that documentation is good, it’s sort of interesting how rarely most techs actually do it. Ray’s talk covered some of this and further talked about exactly how valuable it is. In addition, several audience members spoke about how proper documentation saved their company a great deal of money, simply by giving their tech support people the ability to answer questions far faster.

I got thinking about some of the clients I’ve worked for and how I’ve wanted to document stuff, but often they have very little actually set up in the way of procedures to handle documentation. This is unfortunate, because it can cost them money. For example, for a client right now I’m working on automating a task. It turns out that there’s not much documentation, so I’m basically struggling to figure things out as I go.

One thing you hear tech folks talk about a lot is “oh, the code is self-documenting”. And sometimes it is. Since I work in SQL, often, but not always, it’s clear what the code is doing. For example

Select firstname, lastname from Clients where ClientID=@ClientID

probably doesn’t need a comment saying what it does. It’s pretty clear. But a more complex query might need some commenting, or it may need some explanation as to why a particular approach was taken. For example, I was recently writing a stored procedure where the WHERE clause was not quite what one would expect if one were to naively write it in the most obvious manner. However, the obvious manner would have resulted in a table scan of a very large table. By writing what I did, I could ensure a seek would occur.
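
As a purely hypothetical illustration (not the actual client code), the classic version of this problem is wrapping the column in a function, which prevents an index seek:

-- Obvious version: the function on the column forces a scan of dbo.Orders
SELECT OrderID, OrderDate
FROM dbo.Orders
WHERE YEAR(OrderDate) = 2016

-- Less obvious version: a range predicate on the bare column
-- lets SQL Server seek on an index over OrderDate
SELECT OrderID, OrderDate
FROM dbo.Orders
WHERE OrderDate >= '20160101'
  AND OrderDate < '20170101'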

I also had a habit which, after thinking about it last night and testing it today, I’m going to modify a bit. Often I’d write procedures such as:

-- Usage: Exec FOO
-- Author: Greg D. Moore
-- Date: 2016-03-15
-- Version: 1.0
-- This simply returns bar when executed
if OBJECT_ID('foo', 'p') is not null drop procedure foo
go
create procedure foo
as
select 'bar'
go

Now, technically this is a script (T-SQL) that will drop and then create the procedure, so it’s more than just the procedure itself. But it’s useful for me because I can ensure I’m running the latest and greatest, dropping the old version if it exists before creating the new one.

But, last night got me thinking. What happens if, 3 years down the road, someone comes along and needs to edit my code? Let’s say the client didn’t do a good job of keeping track of source code, and they have to extract the scripts to create the procedures from SQL Server itself, using, say, SSMS.

The results end up looking much more like this:

USE [Baz]
GO
/****** Object:  StoredProcedure [dbo].[foo]    Script Date: 03/15/2016 10:47:22 ******/
IF  EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[foo]') AND type in (N'P', N'PC'))
DROP PROCEDURE [dbo].[foo]
GO
USE [Baz]
GO
/****** Object:  StoredProcedure [dbo].[foo]    Script Date: 03/15/2016 10:47:22 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
create procedure [dbo].[foo]
as
select 'bar'
GO

Ignore the extra USE statements and the SSMS-generated comments and SET statements. Notice my comments are gone. This actually makes sense: in the first script, the comments occur before a GO statement, so the SQL engine treats them as completely separate from the batch that creates the actual stored proc. All my useful comments are now history.

BUT, there’s a simple solution. Move the comments to after the first GO statement.

if OBJECT_ID('foo', 'p') is not null drop procedure foo
 
go
 
-- Usage: Exec FOO
-- Author: Greg D. Moore
-- Date: 2016-03-15
-- Version: 1.0
-- This simply returns bar when executed
-- Version: 1.1
-- Comments moved below GO statement
 
create procedure foo
as
 
select 'bar'
go

Now if I use SSMS to generate my script I get:

USE [Baz]
GO

/****** Object: StoredProcedure [dbo].[foo] Script Date: 03/15/2016 10:48:53 ******/
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[foo]') AND type in (N'P', N'PC'))
DROP PROCEDURE [dbo].[foo]
GO

USE [Baz]
GO

/****** Object: StoredProcedure [dbo].[foo] Script Date: 03/15/2016 10:48:53 ******/
SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

-- Usage: Exec FOO
-- Author: Greg D. Moore
-- Date: 2016-03-15
-- Version: 1.0
-- This simply returns bar when executed
-- Version: 1.1
-- Comments moved below GO statement

create procedure [dbo].[foo]
as

select 'bar'

GO

Now my great documentation is preserved. This is a small thing, but down the road it could save the next developer a lot of trouble.

So, stop and think about not only documentation, but how to make sure it’s preserved and useful in the future.

Rolling in the Deep

I was at SQL Saturday in Boston this past weekend and I sat in on a session given by Paresh Motiwala: “Why do we shun using tools for DBA job?”

It’s a decent question and sometimes rolling your own (hence the title here and yes I’m listening to Adele right now) is the right answer.

But often, it’s not.

A case in point is DBAs who avoid using the built-in SQL Server maintenance plans, especially for simple tasks such as backups.

Now, I’ll start out by saying straight up, sometimes they’re not the optimal solution. (You may want to back up certain partitions on a different rotation schedule, and the like, as one example.)

Below is a recent situation I came across.

But, if you do decide to roll your own, please do NOT do the following:

  1. First write a sproc that has to be placed into each database as it’s created in order to run the backup. (If you DO decide to go this route, please make this a scripted part of your release procedure so it’s not missed.)
  2. If you DO in fact ignore suggestion #1, I will point out that it doesn’t do much good to make the first step of your sproc a check to see if the database exists. BIG HINT: If the database doesn’t exist, the sproc won’t run in the first place! Yes, I know you’re saying, “Obviously”, but this is the situation I just came across.
  3. If you DO in fact ignore suggestion #2, you probably don’t need to call your own special “alert sproc”.
  4. But if you DO in fact ignore suggestion #3, make sure your “alert sproc” does more than call a “email sproc” and pass it a few parameters.
  5. And if you DO in fact ignore suggestion #4, please make sure your “email sproc” does a bit more than build a message and call the built-in SQL Server stored proc to send an email.

Now granted, there may be reasons to do some of the above. Perhaps your alert sproc also calls some sort of 3rd party monitoring tool. Or your email needs are very specific.

In the case I just rectified, none of that was true.  So there was a lot of additional complexity (which really didn’t work anyway) for no good reason.

Another problem this roll-your-own backup setup had was that it used the same filename every time for its backups. I.e., on Monday the backup name was M:\Backups\FOO_FULL.BAK. On Tuesday it was M:\Backups\FOO_FULL.BAK, etc.

In theory (and generally in practice) each of these would be backed up to a 3rd party, so it was in theory possible to find the backup for a specific day, but that was an added complexity, and probably not one you want in an actual DR situation. Also, if for some reason the backup to the 3rd party failed (but the local backups continued), they’d definitely lose the ability to restore specific days of backups.

In addition, the person who built this procedure set up differential backups for MOST databases to run every 15 minutes. Now, I wouldn’t necessarily call that a terrible idea, but in this case it was almost certainly not the best approach in my opinion. However, again, the same file name was used each time.

This means that in a DR event, the company could restore the previous night’s backup and, if they wanted, the most recent diff backup, and that was it. If they wanted to restore to a point in time in between, that was impossible. And in my experience that is far more common than most other restore needs.
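
Point-in-time recovery is exactly what uniquely named log backups buy you. With an unbroken chain, a restore along these lines works (all names here are made up for illustration):

-- Restore to 9:27 AM using the full backup plus the log chain
RESTORE DATABASE FOO FROM DISK = N'M:\Backups\FOO_FULL_20160314.BAK'
    WITH NORECOVERY, REPLACE
RESTORE LOG FOO FROM DISK = N'M:\Backups\FOO_LOG_20160315_0915.TRN'
    WITH NORECOVERY
RESTORE LOG FOO FROM DISK = N'M:\Backups\FOO_LOG_20160315_0930.TRN'
    WITH STOPAT = '2016-03-15 09:27:00', RECOVERY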

Finally, the developer who wrote all this clearly did not understand what the CHECKPOINT command does. He had scheduled a checkpoint job to run every 30 minutes. Again, in general, not only is that not necessary, it’s probably a bad idea. However, in this case it was not only unnecessary, the reason given in the job comments was completely wrong. He seemed to think it would keep the transaction logs from growing.

This is of course NOT what it does. A checkpoint merely flushes dirty pages to disk; in FULL recovery, only a transaction log backup allows log space to be reused. And sure enough, on the one database still with FULL RECOVERY enabled, the transaction log was far larger than the actual database. (Fortunately it was a lightly used database, or the disk might have filled up years ago.)

Since discovering all this, I’ve gone and replaced all this complexity with a set of maintenance jobs. These will guarantee each system database is backed up weekly (for their needs this should be fine) with unique names. User databases will be backed up nightly and retained for 4 nights (perhaps to be extended once we determine how much disk space we want to set aside for this). Transaction log backups will be taken every 15 minutes. These too will have unique names.
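
The maintenance plans handle the unique names for you, but if you do insist on rolling your own, the core of the fix is simply timestamping the file name. A minimal sketch, with database and path names made up:

-- Hypothetical roll-your-own version: build a unique, timestamped file name
DECLARE @file nvarchar(260) =
    N'M:\Backups\FOO_FULL_' +
    CONVERT(nvarchar(8), GETDATE(), 112) +                    -- date as YYYYMMDD
    N'_' +
    REPLACE(CONVERT(nvarchar(8), GETDATE(), 108), ':', '') +  -- time as HHMMSS
    N'.BAK'

BACKUP DATABASE FOO
TO DISK = @file
WITH COMPRESSION, CHECKSUM, STATS = 10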

Now the customer can restore to any point in time in the last 4 days (or even further back if they go to their 3rd party backups), up to 15 minutes before a failure (and in some cases, if the log is still available and they can back up the tail of the log, up to the instant before the failure).

If they add additional databases, they don’t have to worry about remembering to put in 3 separate sprocs for each database added and adding new jobs to the SQL Server Agent.

Now they not only have a far more robust backup plan, they have one that is far easier to maintain.  Oh and one that will actually send an email if there’s a problem with the backup.

The moral is: don’t make things more complex unless you absolutely need to, and if you do, make sure you actually achieve the goals you’re trying to achieve.

Practicing for Disaster

I’ve had this post by Wayne Hale in my queue for a while, since I’ve wanted to comment on it, but until lately I’ve been too busy to do so.

One of my current contracts requires an annual DR test. Since the end of the year is approaching, they’re trying to get the test in. Part of the test requires an “official” run observed by an outside auditor.

So, being smart, and since a lot has changed in the past year, we decided to schedule a dry-run or two beforehand.

Well let’s just say those have not gone as expected.

Some might consider the dry-runs failures.

I don’t. I consider them successes. We are finding out now, in a controlled environment with no real time pressures, where we are weak and need to fix things.

It’s far better to do this now than during the audited test or even better than during an actual disaster event! So the dry-runs are serving their purpose, they’re helping us find the holes before it’s too late.

That said, I will claim that the part I’m most involved with, the SQL log shipping, has been working well. The only issue this week was a human error made by another DBA, completely unrelated to the DR test; within minutes of discovering his error he executed the proper procedure to begin fixing it. The total fix on his end took no more than 5 minutes, and other than monitoring, the effort on my end took no more than 5 minutes. That’s an excellent example of a robust design and set of procedures.

Today’s moral is don’t just have a DR plan, practice it. And not every failure is really a failure.

Shiny Things

It’s not uncommon for fads or “shiny things” to rise to the top. In my work in cave rescue, for example, there’s a lot of really cool equipment that comes out every year. And often a particular piece or type of equipment will grab people’s attention, and folks will start to use it all over the place. A good example of this is the various “rope grabs”, such as a Rescucender. Suddenly everyone is teaching, “Oh, you should use this tool, it’s so much better.” In part, people fall into the trap of thinking it’s better because it’s newer.

Another example is ascending systems. For climbing ropes in caves there are Frogs, Ropewalkers, Texas systems, and more. They all have their advantages. And quite honestly, for the most part, they are far better than what they replaced: the traditional “3-knot” system. Unfortunately, as an instructor I’ve come across a large number of experienced vertical cavers who have NO idea how to climb on a 3-knot system.

Rescucenders have their place, and they had their peak of prominence. For a while it seemed they were used and taught all over the place. But a lot of teams I know are going back to focusing on the good old Prusik knot for a rope grab. This, of course, is the most common knot used in 3-knot climbing systems.

In one case, the newer, better, shinier thing was found to be not necessarily all that better. In the other case, the newer equipment really is better, but the basics shouldn’t be forgotten. You never know when you’ll need to fashion Prusiks out of shoelaces.

I bring this up because of an article brought to my attention today discussing SQL vs. NoSQL. Many have told me that NoSQL will replace a traditional SQL RDBMS. Of course, I was told the same thing about OODBMSs and other database systems. So far many of the replacements have withered on the vine and are long gone. Some still thrive in niche applications. I don’t think NoSQL will wither and die; nor is it simply a small niche. Yet neither will it completely replace SQL databases. Both have incredible powers and are very good at what they do. Both, I think, are here to stay for the foreseeable future.

But don’t fall into the trap of thinking you have a data problem and assuming the newer solution, NoSQL, is the better solution. It might be. It might not. If I have to climb a rope, I’ll use the “newer” (it’s decades old now) technology of my Frog climbing system. If I need to put a rope grab on a rope, I’ll almost certainly go back to the “older” (it’s all relative) technology of a Prusik knot.

The actual needs of the application drive the solution. Don’t allow the solution to drive the design.

A good rule of thumb is: Anytime anyone sets up a debate, especially about technology as an X vs. Y debate, they’re probably framing it wrong from the get-go. It’s rarely X xor Y, but almost always X and/or Y, or even possibly Z.