Deep Drilling

I was reviewing the job history on one of the DR servers of a client of mine. I noticed something funny. The last job recorded in the job history table (msdb.dbo.sysjobhistory for those playing along at home) was recorded in January of this year.

But jobs were still running. It took me awhile to track it down, but through some sleuthing I solved the problem. First, I thought the msdb database might have filled up (though that event should have generated an error I’d have seen).  Nope.

Then I thought perhaps the table itself was full somehow. Nope, only about 32,000 records.  No luck.

I finally tried to run sp_sqlagent_log_jobhistory manually with some made up job information.

Msg 8115, Level 16, State 1, Procedure sp_sqlagent_log_jobhistory, Line 99
Arithmetic overflow error converting IDENTITY to data type int.
Arithmetic overflow occurred.

Now we’re getting someplace.  After a minor diversion of my own doing I then ran

DBCC CheckIDENT('sysjobhistory',NORESEED)

This returned a value of 2147483647. Hmm, that number looks VERY suspicious. A quick check of Books Online confirmed that’s the max value of a signed int.

Now, the simple solution, which worked for me in this case was to issue a

truncate table sysjobhistory

This removed all the rows in the table AND reset the IDENTITY value. Normally I’d hate to lose history information, but since this was 6 months old and seriously out of data it was acceptable. I could have merely reset the IDENTITY seed value, but there’s no guarantee I would not have then had collisions within the table later on. So this was the safest solution.

But wait, there was more. It kept bugging me that I had somehow reached the 2 BILLION row limit for this table. Sure, it handles log-shipping for about a dozen databases and as a result does about 48 jobs an hour, plus other jobs.  But for a year that should generate less than 1 million rows.  This database server hasn’t been running for 2 thousand years.

So, I decided to monitor things a bit and wait for a few jobs to run.

Then, I executed the following query.

select max(instance_id) from sysjobhistory

This returned a value along the lines of 232031.  Somehow, in the space of an hour or less, my sysjobhistory IDENTITY column had increased by over 232,000. This made no sense. But it did explain how I hit 2 billion rows!

So I started looking at the sysjobhistory table in detail. And I noticed gaps. Some make sense (if a job has multiple steps, it may temporarily insert a row and then roll it back once the job is done and put in a job completion record, and with the way IDENTITY columns work, this explains some small gaps). For example, there was a gap in instance_id from 868 to 875. Ok that didn’t bother me. BUT, the next value after 875 was 6,602. That was a huge gap! Then I saw a gap from 6,819 to 56,692. Another huge gap. As the movie says, “Something strange was going on in the neighborhood”.

I did a bit more drilling and found 3 jobs that were handling log-shipping from a particular server were showing HUGE amounts of history. Drilling deeper, I found they were generating errors, “Could not delete log file….”. Sure enough I went to the directories where the files were stored and there were log files going back to November.  Each directory had close to 22,000 log files that should have been deleted and weren’t.

Now I was closer to an answer. Back in November we had had issues with this server and I had to do a partial rebuild of it. And back then I had had some other issues related to log-shipping and permissions. I first checked permissions, but everything seemed fine.

I then decided to check attributes and sure enough all these files (based on the subdirectory attribute setting) had the R (readonly) value set. No wonder they couldn’t be deleted.

Now I’m trying to figure out how they got their attribute values set to R. (This is a non-traditional log-shipping setup, so it doesn’t use the built in SQL Server tools to copy the files. It uses rsync to copy files through an SSH tunnel).

So the mystery isn’t fully solved. It won’t be until I understand why they had an R value and if it will happen again.  That particular issue I’m still drilling into. But at least now I know why I hit the 2 billion row limit in my history table.

But, this is a good example of why it’s necessary to follow through an error to its root cause. All too often as an IT manager I’ve seen people who reported to me fix the final issue, but not the root cause. Had I done that here, i.e. simply cleared the history and reset the IDENTITY value, I’d have faced the same problem again a few weeks or months from now.

Moral of the story: When troubleshooting, it’s almost always worth taking the time to figure out not just what happened and fixing that, but WHY it happened and preventing it from happening again.

 

Practicing for Disaster

I’ve had this post by Wayne Hale in my queue for awhile since I’ve wanted to comment on it for awhile and until lately have been to busy to do so.

One of my current contracts requires them to do an annual DR test.  Since the end of the year is approaching, they’re trying to get the test in. Part of the test requires an “official” test observed by an outside auditor.

So, being smart, and since a lot has changed in the past year, we decide to schedule a dry-run or two before hand.

Well let’s just say those have not gone as expected.

Some might consider the dry-runs failures.

I don’t. I consider them successes. We are finding out now, in a controlled environment with no real time pressures, where we are weak and need to fix things.

It’s far better to do this now than during the audited test or even better than during an actual disaster event! So the dry-runs are serving their purpose, they’re helping us find the holes before it’s too late.

That said, I have to claim the part that I’m most involved with, the SQL Log-Shipping has been working well.  The only issue this week with that was a human error made by another DBA that was completely unrelated to the DR test and within minutes of him discovering his error he executed the proper procedure to begin fixing it.  The total fix on his end took no more than 5 minutes and other than monitoring on my end, the effort on my end took no more than 5 minutes.  That’s an excellent example of a robust design and set of procedures.

Today’s moral is don’t just have a DR plan, practice it. And not every failure is really a failure.

The obvious isn’t so obvious

Ok, apologies to my loyal reader or two for not blogging in a LONG time.  Either life has been too busy, or when things have calmed down, I haven’t had anything to blog about.

Normally I don’t want to blog weird technical stuff, but this time I will.

I’m currently working for a client doing some DBA work.  Lots of fun actually.  

Said client has an issue. They want to get have a standby datacenter.  Great idea.  Log-shipping will work well here for them.  Just one catch: corporate won’t allow anything through the firewall unless it’s SSH.  Hmm.

No real problem, I figure I can do “log-shipping in the blind”.  Basically take the transaction logs from the primary server, use rsync to get through the firewall and apply them to the secondary server.  If need be, I’ll custom write the jobs.  It should be easy.

Here’s one example:

http://sqlblog.com/blogs/merrill_aldrich/archive/2011/05/19/case-study-secure-log-shipping-via-ssl-ftp.aspx

The key part (for me) is the part there “Half a Log-Shipping Config”

Pretty straightforward, no DBA would balk at that.

So, I set it all up. Setup some scheduled tasks.  Make a backup, copy it to the secondary server and restore it.  So far so good.  Take logs created by existing log backup job, and watch as they automatically get shipped to the datacenter.  

“Hey, this is going great. I’ll be done before lunch.”

Manually apply a log or two to make sure there are no issues.

“Works perfect.  Lunch is starting to look real good. Maybe I’ll break early and go out.”

Setup scheduled task to run the job created above.

Fails. “Oh fudge. (Hmm, maybe some fudge after lunch?)”

Oh wait, just a typo in the directory.  

Rebuild job. Run it. Success!

Great, let’s check things out.  

“Hmm, the .TUF (transaction undo file) isn’t there.”

Let’s look at the jobs.

Now here, you’ll have to bear with me. Again, the corporate rules will NOT permit me to cut-paste even simple text error messages from a server in the datacenter.

But basically get a bunch of messages along the lines of:

2013-06-26 05:00:01.92 Skipped log backup file. Secondary DB: ‘MyDemo’, File: ‘\\NAS\Logshipping\MyDemo_201306261156.trn’

A bunch of these.

Well, a lot of Googling suggested that since log flies were empty (this database doesn’t get much traffic) SQL Server was smart enough to know there was nothing to apply.

Sure enough, manually applying them showed there was nothing for them to do. I needed a real transaction log with a transaction in it. No problem. Go to original database. Do a quick update.  Create transaction log and wait for the automated log copier to get it to the right place.

“Hmm, maybe I’ll just do a late lunch today.”

Get it to the secondary server. Run the job.

“Hmm. Skipped the files I expected it to skip.  No problem.”

Gets to the file I expect it to apply:

(now I’m retyping rather than cutting/pasting)

2013-06-26 13:14:05.43 Found first log backup to restore. Secondary DB: ‘MyDemo’; File: 

‘\\NAS\Logshipping\MyDemo_201306261605.trn’ 

2013-06-26 13:14:05.44 The restore option was successful. Secondary Database ‘MyDemo’ Number of log backup files restored: 0

“Huh? “

Ayup. You read it right. It found the right file. It said the restore was successful. Then it said 0 files were restored.

Now, I eventually broke for lunch.  And dinner. And bed. And breakfast. And another lunch and dinner and more sleep. Also a couple of bike rides and a bunch of meetings and other stuff.

But truth is, I was stumped. I tried everything. Rebuilt the job. Tried manually updating the tables on the secondary to tell it I had already applied a log (that got me a weird datetime error I never quite figured out)

I could manually apply the logs just fine.

Log-shipping between two servers within the datacenter worked just fine.

Why wasn’t this working? I mean a log file is a log file right?

Well, apparently yes and no.

The ONE difference I noticed was that the transaction logs from the working test had their filenames using UTC time stamps, including seconds.

The ones I was shipping were from a standard maintenance plan I had setup and times were in local time w/o the time stamp.

BTW, this http://www.sqlservercentral.com/Forums/Topic351549-146-1.aspx#bm438143 is what also helped me wonder about this.

“It couldn’t be THAT simple could it?”

So I setup a database in the primary datacenter, full logging, and setup log-shipping on that side. Now, I still couldn’t provide any data about the secondary, but I could at least script the primary side that does the backups.

Set that up.  Shipped over a database, some logs. Now I’d like to say it worked on the first try, but I ran into some issues (completely of my own doing, so not relevant here).

But 3rd time is the charm as they say.  Got a backup over to the secondary.  Let the autosync job move over some transaction logs (including several with transactions).

Then, on the secondary ran the job I had previously handcrafted and. SUCCESS.

So, yes, you can do “blind” log-shipping (which I knew).

You can setup the secondary by hand.

But apparently you can’t use your existing transaction log job. You’re better off setting up the log-backups on the primary using the normal tools and then shipping those.

So lesson learned.

Sometimes, it’s enough to know there’s a right answer to keep you driving towards it.

And now time for dinner.