I was reviewing the job history on one of a client’s DR servers when I noticed something funny. The last job recorded in the job history table (msdb.dbo.sysjobhistory for those playing along at home) was from January of this year.
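If you want to check your own servers, a query along these lines shows the most recent entry (run_date and run_time are stored as integers, yyyymmdd and hhmmss):

select top (1) instance_id, run_date, run_time
from msdb.dbo.sysjobhistory
order by instance_id desc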
But jobs were still running. It took me a while to track down, but with some sleuthing I solved the problem. First, I thought the msdb database might have filled up (though that should have generated an error I’d have seen). Nope.
Then I thought perhaps the table itself was somehow full. Nope, only about 32,000 records. No luck there either.
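Both of those are quick checks, something along these lines:

-- how full is msdb?
use msdb;
exec sp_spaceused;

-- how many rows does the history table actually hold?
select count(*) from msdb.dbo.sysjobhistory;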
I finally tried running sp_sqlagent_log_jobhistory manually with some made-up job information.
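If you want to reproduce this, the call looks roughly like the sketch below. I’m showing only the core parameters, and the names are from memory, so double-check the procedure’s actual definition in msdb (sp_helptext will show it) before running it yourself.

-- hypothetical call with made-up values; parameter names from memory
declare @job_id uniqueidentifier;
select top (1) @job_id = job_id from msdb.dbo.sysjobs;

exec msdb.dbo.sp_sqlagent_log_jobhistory
    @job_id = @job_id,
    @step_id = 0,            -- 0 = job outcome record
    @message = N'manual test',
    @run_status = 1,         -- 1 = succeeded
    @run_date = 20250101,    -- int, yyyymmdd (made-up)
    @run_time = 120000,      -- int, hhmmss (made-up)
    @run_duration = 0;

On this particular server, the result was: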
Msg 8115, Level 16, State 1, Procedure sp_sqlagent_log_jobhistory, Line 99
Arithmetic overflow error converting IDENTITY to data type int.
Arithmetic overflow occurred.
Now we’re getting someplace. After a minor diversion of my own making, I then ran
DBCC CheckIDENT('sysjobhistory',NORESEED)
This returned a value of 2147483647. Hmm, that number looked VERY suspicious. A quick check of Books Online confirmed it’s the maximum value of a signed int (2^31 - 1).
Now, the simple solution, which worked for me in this case, was to issue a
truncate table sysjobhistory
This removed all the rows in the table AND reset the IDENTITY value. Normally I’d hate to lose history information, but since this was 6 months old and seriously out of date, it was acceptable. I could have merely reset the IDENTITY seed value, but then there’s no guarantee I wouldn’t have had collisions within the table later on. So this was the safest solution.
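For completeness, the reseed-only route would have looked something like this. It’s just a sketch, and it illustrates the risk: reseed below instance_id values that still exist in the table and later inserts can collide with them.

-- alternative I decided against: keep the rows, reset only the seed
DBCC CheckIDENT('sysjobhistory', RESEED, 0)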
But wait, there was more. It kept bugging me that I had somehow burned through 2 BILLION IDENTITY values for this table. Sure, it handles log-shipping for about a dozen databases and as a result runs about 48 jobs an hour, plus other jobs. But over a year that should generate less than 1 million rows. This database server hasn’t been running for 2 thousand years.
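To put rough numbers on it (ignoring the other jobs):

-- back-of-the-envelope sanity check
select 48 * 24 * 365                as approx_history_rows_per_year,  -- about 420,000
       2147483647 / (48 * 24 * 365) as years_to_exhaust_a_signed_int  -- about 5,000

Even rounding generously up to a million rows a year, that’s still more than two thousand years to burn through a signed int with normal job activity.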
So, I decided to monitor things a bit and wait for a few jobs to run.
Then, I executed the following query.
select max(instance_id) from sysjobhistory
This returned a value along the lines of 232031. Somehow, in the space of an hour or less, my sysjobhistory IDENTITY value had increased by over 232,000. This made no sense. But it did explain how it hit 2 billion!
So I started looking at the sysjobhistory table in detail. And I noticed gaps. Some made sense (if a job has multiple steps, it may temporarily insert a row and then roll it back once the job is done and put in a job completion record, and since IDENTITY values aren’t reused after a rollback, that explains some small gaps). For example, there was a gap in instance_id from 868 to 875. Ok, that didn’t bother me. BUT, the next value after 875 was 6,602. That was a huge gap! Then I saw a gap from 6,819 to 56,692. Another huge gap. As the movie says, “Something strange was going on in the neighborhood”.
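If you want to hunt for gaps like these yourself, a query along these lines does it (LAG needs SQL Server 2012 or later, and the 100-row threshold is arbitrary):

-- find big jumps between consecutive instance_id values
with h as
(
    select instance_id,
           lag(instance_id) over (order by instance_id) as prev_id
    from msdb.dbo.sysjobhistory
)
select prev_id, instance_id, instance_id - prev_id as gap
from h
where instance_id - prev_id > 100
order by gap desc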
I did a bit more drilling and found that 3 jobs handling log-shipping from a particular server were showing HUGE amounts of history. Drilling deeper, I found they were generating errors: “Could not delete log file….”. Sure enough, when I went to the directories where the files were stored, there were log files going back to November. Each directory had close to 22,000 log files that should have been deleted and weren’t.
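For the curious, something along these lines shows which jobs are piling up history rows and how many of those rows are failures (run_status = 0 is a failed step):

-- history row counts per job, with failures broken out
select j.name,
       count(*) as history_rows,
       sum(case when h.run_status = 0 then 1 else 0 end) as failed_rows
from msdb.dbo.sysjobhistory as h
join msdb.dbo.sysjobs as j
  on j.job_id = h.job_id
group by j.name
order by history_rows desc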
Now I was closer to an answer. Back in November we’d had issues with this server and I’d had to do a partial rebuild of it. At the time I’d also run into some other issues related to log-shipping and permissions. I checked permissions first, but everything seemed fine.
I then decided to check file attributes, and sure enough all of these files (based on the subdirectory’s attribute setting) had the R (read-only) attribute set. No wonder they couldn’t be deleted.
Now I’m trying to figure out how their attributes got set to R. (This is a non-traditional log-shipping setup, so it doesn’t use the built-in SQL Server tools to copy the files. It uses rsync to copy files through an SSH tunnel.)
So the mystery isn’t fully solved. It won’t be until I understand why the R attribute was set and whether it will happen again. I’m still drilling into that particular issue. But at least now I know why I hit the 2 billion IDENTITY limit in my history table.
But this is a good example of why it’s necessary to follow an error through to its root cause. All too often as an IT manager I’ve seen people who reported to me fix the immediate issue, but not the root cause. Had I done that here, i.e. simply cleared the history and reset the IDENTITY value, I’d have faced the same problem again in a few weeks or months.
Moral of the story: When troubleshooting, it’s almost always worth taking the time to figure out not just what happened and fix that, but WHY it happened and prevent it from happening again.