Deep Drilling

I was reviewing the job history on one of the DR servers of a client of mine. I noticed something funny. The last job recorded in the job history table (msdb.dbo.sysjobhistory for those playing along at home) was recorded in January of this year.

But jobs were still running. It took me a while to track it down, but through some sleuthing I solved the problem. First, I thought the msdb database might have filled up (though that event should have generated an error I'd have seen). Nope.

Then I thought perhaps the table itself was full somehow. Nope, only about 32,000 records.  No luck.

I finally tried to run sp_sqlagent_log_jobhistory manually with some made up job information.

Msg 8115, Level 16, State 1, Procedure sp_sqlagent_log_jobhistory, Line 99
Arithmetic overflow error converting IDENTITY to data type int.
Arithmetic overflow occurred.

Now we’re getting someplace.  After a minor diversion of my own doing I then ran

DBCC CheckIDENT('sysjobhistory',NORESEED)

This returned a value of 2147483647. Hmm, that number looks VERY suspicious. A quick check of Books Online confirmed that’s the max value of a signed int.

Now, the simple solution, which worked for me in this case, was to issue a

truncate table sysjobhistory

This removed all the rows in the table AND reset the IDENTITY value. Normally I'd hate to lose history information, but since this was 6 months old and seriously out of date, it was acceptable. I could have merely reset the IDENTITY seed value, but there's no guarantee I would not have then had collisions within the table later on. So this was the safest solution.
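
For reference, the reseed route I decided against would have been a one-liner along these lines (a sketch only; it assumes you're running it in msdb, and reseeding back to a low value is exactly what invites the collisions I mentioned):

DBCC CHECKIDENT('sysjobhistory', RESEED, 0)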

But wait, there was more. It kept bugging me that I had somehow reached the 2 BILLION row limit for this table. Sure, it handles log-shipping for about a dozen databases and as a result does about 48 jobs an hour, plus other jobs.  But for a year that should generate less than 1 million rows.  This database server hasn’t been running for 2 thousand years.

So, I decided to monitor things a bit and wait for a few jobs to run.

Then, I executed the following query.

select max(instance_id) from sysjobhistory

This returned a value along the lines of 232031.  Somehow, in the space of an hour or less, my sysjobhistory IDENTITY column had increased by over 232,000. This made no sense. But it did explain how I hit 2 billion rows!

So I started looking at the sysjobhistory table in detail. And I noticed gaps. Some make sense (if a job has multiple steps, it may temporarily insert a row and then roll it back once the job is done and put in a job completion record, and with the way IDENTITY columns work, this explains some small gaps). For example, there was a gap in instance_id from 868 to 875. Ok that didn’t bother me. BUT, the next value after 875 was 6,602. That was a huge gap! Then I saw a gap from 6,819 to 56,692. Another huge gap. As the movie says, “Something strange was going on in the neighborhood”.
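
If you want to hunt for gaps like this yourself, something along these lines works (a sketch, not the exact query I ran; LAG requires SQL Server 2012 or later):

SELECT prev_id + 1 AS gap_start,
       instance_id - 1 AS gap_end,
       instance_id - prev_id - 1 AS gap_size
FROM (SELECT instance_id,
             LAG(instance_id) OVER (ORDER BY instance_id) AS prev_id  -- previous row's id
      FROM msdb.dbo.sysjobhistory) AS h
WHERE instance_id - prev_id > 1      -- only rows where ids were skipped
ORDER BY gap_size DESC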

I did a bit more drilling and found 3 jobs that were handling log-shipping from a particular server were showing HUGE amounts of history. Drilling deeper, I found they were generating errors, “Could not delete log file….”. Sure enough I went to the directories where the files were stored and there were log files going back to November.  Each directory had close to 22,000 log files that should have been deleted and weren’t.
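
(A quick way to see which jobs are piling up history rows is something like the following; again, a sketch rather than the exact query I used.)

SELECT j.name, COUNT(*) AS history_rows
FROM msdb.dbo.sysjobhistory AS h
JOIN msdb.dbo.sysjobs AS j ON j.job_id = h.job_id
GROUP BY j.name
ORDER BY history_rows DESC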

Now I was closer to an answer. Back in November we had had issues with this server and I had to do a partial rebuild of it. And back then I had had some other issues related to log-shipping and permissions. I first checked permissions, but everything seemed fine.

I then decided to check attributes, and sure enough all these files (based on the subdirectory attribute setting) had the R (read-only) attribute set. No wonder they couldn't be deleted.

Now I'm trying to figure out how they got their attribute values set to R. (This is a non-traditional log-shipping setup, so it doesn't use the built-in SQL Server tools to copy the files. It uses rsync to copy files through an SSH tunnel.)

So the mystery isn’t fully solved. It won’t be until I understand why they had an R value and if it will happen again.  That particular issue I’m still drilling into. But at least now I know why I hit the 2 billion row limit in my history table.

But, this is a good example of why it’s necessary to follow through an error to its root cause. All too often as an IT manager I’ve seen people who reported to me fix the final issue, but not the root cause. Had I done that here, i.e. simply cleared the history and reset the IDENTITY value, I’d have faced the same problem again a few weeks or months from now.

Moral of the story: When troubleshooting, it’s almost always worth taking the time to figure out not just what happened and fixing that, but WHY it happened and preventing it from happening again.


Experimenting

There are times when you have to take at face value what you are told.

There are 1.31 billion people living in China. This according to several sources (that all probably go back to the same official document from the Chinese government.)  I’m willing to believe that number. I’m certainly not going to go to China and start counting heads. For one, I don’t have the time, for another, I might look awfully weird doing so. It’s also accurate enough for any discussions I might have about China. But if I were going to knit caps for every person in China I might want a more accurate number.

That said, sometimes one shouldn't take facts at face value. A case in point is given below. Let me start out by saying the person who gave me this fact wasn't wrong. At least they're no more wrong than the person who tells me that the acceleration due to gravity is 9.8 m/s². No, they are at worst inaccurate and more likely imprecise. Acceleration due to gravity here on Earth IS roughly 9.8 m/s². But it varies depending on where on the surface I am. And if I'm on the Moon it's a completely different value.

Sometimes it is in fact possible to actually test things, and it's often worth it. I work with SQL Server and this is very true here. If a DBA tells you with absolute certainty that a specific setting should be set, or a query must be written a specific way, or an index rebuilt automatically at certain times, ask why. The worst answer they can give is, "I read it someplace." (Please note, this is a bit different from saying, "Generally it's best practice to do X". Now we're back to saying 9.8 m/s², which is good enough for most things, but may not be good enough if, say, you want to precisely calibrate a piece of laboratory equipment.)

The best answer is “because I tested it and found that it works best”.

So, last night I had the pleasure of listening to Thomas Grohser speak on the SQL Server IO engine at the local SQL Server User Group meeting. As always it was a great talk. At one point he was talking about backups and various ways to optimize them. He made a comment about setting the maxtransfersize to 4MB being ideal. Now, I'm sure he'd be the first to add the caveat, "it depends". He also mentioned how much compression can help.

But I was curious and wanted to test it. Fortunately I had access to a database that was approximately 15GB in size. This seemed like the perfect size with which to test things.

I started with:

backup database TESTDB to disk='Z:\backups\TESTDB_4MB.BAK' with maxtransfersize=4194304

This took approximately 470 seconds and had a transfer rate of 31.151 MB/sec.

backup database TESTDB to disk='Z:\backups\TESTDB_4MB_COMP.BAK' with maxtransfersize=4194304, compression

This took approximately 237 seconds and a transfer rate of 61.681 MB/sec.

This is almost twice as fast.  While we’re chewing up a few more CPU cycles, we’re writing a lot less data.  So this makes a lot of sense. And of course now I can fit more backups on my disk. So compression is a nice win.
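
(If you want to see what compression is buying you on your own systems, msdb keeps the numbers; this assumes SQL Server 2008 or later, where backupset records compressed_backup_size.)

SELECT TOP (10)
       database_name,
       backup_finish_date,
       backup_size / 1048576.0 AS size_mb,
       compressed_backup_size / 1048576.0 AS compressed_mb,
       backup_size * 1.0 / NULLIF(compressed_backup_size, 0) AS compression_ratio
FROM msdb.dbo.backupset
WHERE type = 'D'                     -- full database backups only
ORDER BY backup_finish_date DESC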

But what about the maxtransfersize?

backup database TESTDB to disk='Z:\backups\TESTDB.BAK'

This took approximately 515 seconds and a transfer rate of 28.410 MB/sec. So far, it looks like changing the maxtransfersize does help a bit (about 8%) over the default.

backup database TESTDB to disk='Z:\backups\TESTDB_comp.BAK' with compression

This took approximately 184 seconds with a transfer rate of 79.651 MB/sec.  This is the fastest of the 4 tests and by a noticeable amount.

Why? I honestly don't know. If I were really trying to optimize my backups, most likely I'd run each of these tests 5-10 more times and take an average. This may be an outlier. Or perhaps the 4MB test with compression ran slower than normal. Or there may be something about the disk setup in this particular case that makes it the fastest method.
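
If I did want to repeat the runs, a quick and dirty harness could look something like this (a sketch; TESTDB, the backup path, and five runs are just my assumptions, and WITH INIT overwrites each file on every run):

CREATE TABLE #timings (run int, seconds int);

DECLARE @i int = 1, @start datetime2, @file nvarchar(260);

WHILE @i <= 5
BEGIN
    SET @file = N'Z:\backups\TESTDB_run' + CAST(@i AS nvarchar(10)) + N'.BAK';
    SET @start = SYSDATETIME();

    -- the option(s) being tested
    BACKUP DATABASE TESTDB TO DISK = @file
        WITH MAXTRANSFERSIZE = 4194304, COMPRESSION, INIT;

    INSERT INTO #timings VALUES (@i, DATEDIFF(second, @start, SYSDATETIME()));
    SET @i += 1;
END;

-- average the runs
SELECT AVG(seconds * 1.0) AS avg_seconds FROM #timings;

DROP TABLE #timings;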

The point is, this is something that is easy to set up and test. The entire testing took me about 30 minutes and was done while I was watching TV last night.

So before you simply read something on some blog someplace about “you should do X to SQL Server” take the time to test it. Perhaps it’s a great solution in your case. Perhaps it’s not. Perhaps you can end up finding an even better solution.


Documentation

Do it, it’s important.

Ok, I suppose I should expand a bit upon that and in this case add an actual example.

So last night, I again attended the local SQL Server User Group meeting. The talk this month was by Ray Kim and was on Documentation for Techies. While we all agree that documentation is good, it's sort of interesting how rarely most techs actually do it. Ray's talk covered some of this and further talked about exactly how valuable it is. In addition, several audience members spoke about how proper documentation saved their company a great deal of money, simply by giving their tech support people the ability to answer questions far faster.

I got thinking about some of the clients I've worked for and how I've wanted to document stuff, but often they have very little actually set up in the way of procedures to handle documentation. This is unfortunate, because it can cost them money. For example, for a client right now I'm working on automating a task. It turns out that there's not much documentation, so I'm basically struggling to figure things out as I go.

One thing you hear tech folks talk about a lot is "oh, the code is self-documenting". And sometimes it is. Since I work in SQL, often, but not always, it's clear what the code is doing. For example,

Select firstname, lastname from Clients where ClientID=@ClientID

probably doesn't need a comment saying what it does. It's pretty clear. But a more complex query might need some commenting, or it may need some explanation as to why a particular approach was taken. For example, I was recently writing a stored procedure where the WHERE clause was not quite what one would expect if one were to write it naively in the most obvious manner. However, the obvious manner would have resulted in a table scan of a very large table. By writing what I did, I could ensure a seek would occur.
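
As a contrived illustration of that kind of thing (the table and column names are made up, and this is not the actual procedure), both of these return the same rows, but the first wraps the indexed column in a function and forces a scan, while the second lets the optimizer seek on an index over OrderDate. That's exactly the sort of choice worth a comment:

-- Non-SARGable: the function on the column forces a scan of the whole table
SELECT OrderID FROM dbo.Orders WHERE YEAR(OrderDate) = 2016

-- SARGable: leaving the column bare allows an index seek on OrderDate
SELECT OrderID FROM dbo.Orders WHERE OrderDate >= '20160101' AND OrderDate < '20170101'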

I also had a habit, which after thinking about last night and testing today, I’m going to modify a bit. Often I’d write procedures such as:

-- Usage: Exec FOO
-- Author: Greg D. Moore
-- Date: 2016-03-15
-- Version: 1.0
-- This simply returns bar when executed
if OBJECT_ID('foo', 'p') is not null drop procedure foo
go
create procedure foo
as
select 'bar'
go

Now, note that technically this is a T-SQL script that will drop and then create the procedure, so it's more than just the procedure definition. But it's useful for me because I can ensure I'm running the latest and greatest and drop the old one if it exists before running it.

But, last night got me thinking. What happens if 3 years down the road someone comes along and needs to edit my code? Let's say the client didn't do a good job of keeping track of source code and they have to extract the scripts to create the procedures from SQL Server itself using, say, SSMS.

The results end up looking much more like this:

USE [Baz]
GO
/****** Object:  StoredProcedure [dbo].[foo]    Script Date: 03/15/2016 10:47:22 ******/
IF  EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[foo]') AND type in (N'P', N'PC'))
DROP PROCEDURE [dbo].[foo]
GO
USE [Baz]
GO
/****** Object:  StoredProcedure [dbo].[foo]    Script Date: 03/15/2016 10:47:22 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
create procedure [dbo].[foo]
as
select 'bar'
GO

Ignore the extra USE statements and the SSMS generated comments and SET statements. Notice my comments are gone.  This actually makes sense because in the first script, the comments occur before a GO statement so the SQL engine interprets them as completely separate from the statements to create the actual stored proc.  All my useful comments are now history.

BUT, there’s a simple solution. Move the comments to after the first GO statement.

if OBJECT_ID('foo', 'p') is not null drop procedure foo
 
go
 
-- Usage: Exec FOO
-- Author: Greg D. Moore
-- Date: 2016-03-15
-- Version: 1.0
-- This simply returns bar when executed
-- Version: 1.1
-- Comments moved below GO statement
 
create procedure foo
as
 
select 'bar'
go

Now if I use SSMS to generate my script I get:

USE [Baz]
GO

/****** Object: StoredProcedure [dbo].[foo] Script Date: 03/15/2016 10:48:53 ******/
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[foo]') AND type in (N'P', N'PC'))
DROP PROCEDURE [dbo].[foo]
GO

USE [Baz]
GO

/****** Object: StoredProcedure [dbo].[foo] Script Date: 03/15/2016 10:48:53 ******/
SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

-- Usage: Exec FOO
-- Author: Greg D. Moore
-- Date: 2016-03-15
-- Version: 1.0
-- This simply returns bar when executed
-- Version: 1.1
-- Comments moved below GO statement

create procedure [dbo].[foo]
as

select 'bar'

GO

Now my great documentation is preserved. This is a small thing but down the road could save the next developer a lot of trouble.

So, stop and think about not only documentation, but how to make sure it’s preserved and useful in the future.

Who’s Flying the Plane

I mentioned in an earlier post my interest in plane crashes. I had been toying with a presentation based on this concept for quite a while.

A little over a month ago, at the local SQL Server User Group here in Albany, I offered to present for the February meeting. I gave them a choice of topics: a talk on Entity Framework and how its defaults can be bad for performance, or a talk on plane crashes and what IT can learn from them. They chose the latter. I guess plane crashes are more exciting than a dry talk on EF.

In any event, the core of the presentation is based on the two plane crashes mentioned in the earlier post, Eastern Airlines Flight 401, the L-1011 crash in Florida in 1972 and US Airways Flight 1549, the Miracle on the Hudson in 2009.

I don’t want to reproduce the entire talk here (in part because I’m hoping to present it elsewhere) but I want to highlight one slide:

Flight 401 vs 1549

  • Flight 401 – Perfectly good aircraft
  • Flight 1549 – About as bad as it gets
  • Flight 401 – 101 Fatalities/75 Survivors
  • Flight 1549 – 0 Fatalities

Flight 401 had a burned-out nose gear indicator light and crashed.

Flight 1549 had two non-functional engines and everyone got off safely.

The difference was good communication, planning, and a focus at all times on who was actually flying the airplane.

Think about this the next time you're in a crisis. Are you communicating well? How is your planning? And is someone actually focused on making sure things don't get worse because you're focusing on the wrong problem? I touch upon that here when I talk about driving.

The moral: always make sure someone is “flying the plane”.

Avoiding Mistakes

People often treat mistakes as unavoidable. Or sometimes people think mistakes are made because the person making them is unfamiliar with the situation or environment.

The truth is far more complex and often mistakes are not only avoidable, but they’re a direct result of the person being overly familiar with the situation or environment.

A post that came across my desk the other day discusses the Normalization of Deviance. No, I’m not talking about how references to 50 Shades of Grey are all over the place. I’m talking about how we come to accept errors as “ok” or even normal.

A classic example of this is the O-ring burnthrough seen on shuttle flights prior to the Challenger disaster. The original specs called for no burnthrough; any burnthrough was not acceptable. Yet once it was observed, the basic attitude was that it hadn't caused a problem, so it was acceptable. At one point, apparently, when an O-ring had burned through approximately 33% of the way, a claim was made that the O-rings had a safety factor of 3. This is a gross misapplication of the concept of a safety factor, since the O-rings were specced to have zero burnthrough. By moving the goalposts, they permitted further launches to occur and burnthroughs to continue until 51-L, when seven lives were lost. This was a huge management error. In this case the mistake was to ignore the original rule and essentially rewrite it without adequate review. The engineers had become used to the new norm, despite it being wrong.

In the example given in the first link above, a different form of deviance occurred. This was a social deviance that apparently made the lack of use of checklists acceptable.

In the crash, a large red warning device was completely ignored. One would think this was the mistake that caused the crash. However, it's really secondary to the original problem. The original problem is that checklists were developed precisely because humans CAN fail to notice large red warning devices. By not performing the checklist, a mistake was missed and lives were lost. Everything else is sort of fluff.

For pilots, take-offs become a routine procedure. So routine that they begin to make simple mistakes. Had this been their first time flying, or even their first time in that particular model of aircraft, they most certainly would have been paying attention. This is why checklists exist in cases like this: to eliminate the mistakes routine can introduce. Either pilot should have questioned the lack of the take-off checklists and insisted on their use.

They didn’t and people died.


Post hoc ergo propter hoc

One of my favorite shows is The West Wing and there is an episode of the same name as this post. Unfortunately for you, Aaron Sorkin is a better writer than I.

That said, this concept, “After it, therefore because of it” is a common mistake many of us make when forming theories. It’s related to the concept that correlation is not causation.

I was reminded of this the other night when another phrase entered my mind: “Rain Follows The Plow”. This was a hopeful theory in the 19th century that as settlers settled past the 100th Meridian, the rain would follow where they plowed. Simply put, by farming the land, rainfall would increase.

The theory sounds a bit perverse until one considers that, for a while, rainfall did seem to increase as more land came under the plow. So there was some basis for the idea at first; the correlation seemed to match. However, this just ended up being a short-term climate change.

Unfortunately the theory was also a product of the idea that humans were the center of creation. As the subsequent Dust Bowl and other issues showed, however, this theory was (excuse the bad pun) all wet.

Sometimes correlation is not causation and we should not let our all too human biases influence our theories.

Fortunately, properly done, science is eventually self-correcting. Scientists make mistakes, but over time, the winnowing process eliminates them.  The idea of scientific racism was once extremely popular, but over time has clearly been shown to be false.  The idea of an ether was shown to be false.

Meanwhile, other theories have continued to hold up to intense scrutiny. As weird as quantum mechanics appears to be, evidence continues to mount that much of the current theory is in fact correct. When scientists discover particles that appear to travel faster than light, the default assumption continues to be (and so far correctly) that there is an error in the experiment.

Not much of a moral here other than just because the rooster crows when the sun rises, don’t mistake the crowing for the cause of the sunrise.


GIGO

A huge tenet of programming is GIGO: Garbage In/Garbage Out.

Years ago I was practicing for a play (Night of January 16th by Ayn Rand); I was the bailiff. At one point in the play I'm handed a copy of a check that is evidence, and I'm supposed to "read" what's on the check. Of course, since it's a play, I have my lines memorized.

But during this dress rehearsal I was given a piece of paper with actual writing on it. Unfortunately it was just some random writing. My brain went into segfault and I stopped. Part of my brain wanted to read what was on the piece of paper. Part of my brain wanted to say my lines, but it could no longer remember them.

It was a perfect example of how easy it is to scramble the input for our brains.  In the actual performances we made sure the piece of paper was actually blank.

I was reminded of this the other night when Steve Harvey made his gaffe on live television. I was curious how he could make such a mistake, but I had my suspicions. And I was right. The cue card apparently was VERY poorly designed and his visual input system (i.e. his eyes and brain) screwed up. Read here for more details. Bad input led to bad output.

These are humorous examples, but in the software world, these can be very dangerous.

At one point during the shuttle program, they found an error where the arm thought it had rotated more than 360 degrees, a physical impossibility. This link has some details (though in my recollection the issue was not a rounding error but that the code went from 0-360 instead of 0-359 or 1-360). Garbage in could have led to potentially bad garbage out.

Much more recently however, here’s an example of intentional “garbage” in. This is part of the encryption software used on many firewalls. Your bank or other financial institution for example may be using this code.

Ironically true garbage, as in a purely random number, might be better. But here it seems someone poisoned the input with their specific number and then set it up to use the results in a dangerous manner. I say dangerous because the 3rd party using this code may not realize that they’re completely vulnerable to having all their data seen.  About the only thing worse than unencrypted data is data you think is encrypted but isn’t.  In the former, I’m probably going to pay far more attention to who has access.  I’ll add too that some of us suspect the NSA had a hand in this.

This is, by the way, why I highly recommend folks don't write their own encryption. Unless you're an expert, you're liable to screw it up.

Moral: So be careful of your inputs, they definitely influence your outputs, both in code and in your brain.

Page 81

One of the things that has always fascinated me is human memory; how we create them, what sticks, what doesn’t and how it evolves.

Many people think that human memories are rather static. In reality, that's far from the truth. As we saw in the 1990s, it is easy to form false memories and easy to conflate them.

One detail that is interesting is that human memories are a bit like DRAM in a computer.  In essence when we recall a memory, we have to basically read out the memory space and write it back.  One of the side effects is this can actually help strengthen memories.  However, it also means when it’s written back, other memories can be conflated with it and a new, slightly different memory is formed.

There are two main ways of remembering something that stand out to me as I write this: repetition and what I'll call "sudden shock".

Many things we need to repeat until we remember them.  An example is a child learning their times tables. There’s really not much context and really only rote repetition will cause these to sink in.

At the other end of the spectrum are the memories that are etched in our minds. “Where were you when Challenger blew up?”  “How did you first hear about 9/11?”  If you ask someone of the right age, they’ll know exactly when/where they were and probably recall vivid details.

If you ask them where they were on the 3rd shuttle mission, they’d probably have no clue.  The same is true if you ask them what they were doing on 9/9.

In between are more general memories. Memories of childhood that don’t necessarily have a specific timestamp or even importance.  I recall playing in some woods behind my house growing up, but there was nothing really significant about the time or place. I have no idea why I have that memory.

I mentioned above that memories can be modified or manipulated. There's some work on treating PTSD this way: helping patients recall specific events under controlled circumstances and essentially rewriting the memory into something that doesn't cause an attack. (Propranolol is one drug being experimented with to do this.)

Strangely, there's one memory of mine that persists that, while not a real issue, is sort of pointless and annoying to me. It's "Page 81".

What’s that you ask? Many years ago (let’s just say before I was a teenager I think) I was staying at my cousin’s grandmother’s house.  On the bookshelf they had a copy of Jaws 2. I started reading it but had to leave before I could finish it. Since I knew I’d be back the next summer I decided to remember what page I was on. I repeated the page number to myself over and over again. And to this day, I can remember, I was on page 81 of Jaws 2 when I stopped reading.  Of course decades later I have no idea what happened in pages 1-80 so the memory doesn’t do me much good. But there it is. It’s still there. Page 81.

As a note, most of this post was based on memory (I had to look up the name of the drug) so some details may be wrong.

Page 81.


Impossible Things

“I daresay you haven’t had much practice,” said the Queen. “When I was your age, I always did it for half-an-hour a day. Why, sometimes I’ve believed as many as six impossible things before breakfast.”

I wanted to relate a story that happened to a colleague of mine. She maintains servers in two separate datacenters. All the servers are on the same domain. So, in theory, updating her password in one datacenter updates it in both. And this is the way it had been for the past year: things worked as expected.

Recently she noticed, however, that after updating her password, one datacenter was still using the old password while the other was using the new one.

In a proper domain this shouldn’t be possible, but apparently it was.  She spent some time confirming it before calling the IT department.

When she explained the problem to IT, their response was, “That’s not possible.  We synch the servers every hour and run an exception report every night that would show that.”

IT managed to fix the problem (after finally acknowledging it.)

However, the troubling problem to me is not that the passwords got out of sync. Sometimes the impossible does happen. What's more troubling is that the sync apparently never fixed the problem, and the exception report never showed an issue in the month or more the problem existed.

The moral here?

It's not that the impossible sometimes happens (though that would be good for a future blog post). It's that when your alerts and warnings fail, it's time to take a hard look at them, since they're obviously not doing you much good.

The obvious isn’t so obvious

Ok, apologies to my loyal reader or two for not blogging in a LONG time.  Either life has been too busy, or when things have calmed down, I haven’t had anything to blog about.

Normally I don’t want to blog weird technical stuff, but this time I will.

I’m currently working for a client doing some DBA work.  Lots of fun actually.  

Said client has an issue. They want to have a standby datacenter.  Great idea.  Log-shipping will work well here for them.  Just one catch: corporate won't allow anything through the firewall unless it's SSH.  Hmm.

No real problem, I figure I can do “log-shipping in the blind”.  Basically take the transaction logs from the primary server, use rsync to get through the firewall and apply them to the secondary server.  If need be, I’ll custom write the jobs.  It should be easy.

Here’s one example:

http://sqlblog.com/blogs/merrill_aldrich/archive/2011/05/19/case-study-secure-log-shipping-via-ssl-ftp.aspx

The key part (for me) is the section titled "Half a Log-Shipping Config".

Pretty straightforward, no DBA would balk at that.

So, I set it all up. Set up some scheduled tasks.  Make a backup, copy it to the secondary server and restore it.  So far so good.  Take logs created by the existing log backup job, and watch as they automatically get shipped to the datacenter.

“Hey, this is going great. I’ll be done before lunch.”

Manually apply a log or two to make sure there are no issues.

“Works perfect.  Lunch is starting to look real good. Maybe I’ll break early and go out.”

Set up a scheduled task to run the job created above.

Fails. “Oh fudge. (Hmm, maybe some fudge after lunch?)”

Oh wait, just a typo in the directory.  

Rebuild job. Run it. Success!

Great, let’s check things out.  

“Hmm, the .TUF (transaction undo file) isn’t there.”

Let’s look at the jobs.

Now here, you’ll have to bear with me. Again, the corporate rules will NOT permit me to cut-paste even simple text error messages from a server in the datacenter.

But basically I get a bunch of messages along the lines of:

2013-06-26 05:00:01.92 Skipped log backup file. Secondary DB: 'MyDemo', File: '\\NAS\Logshipping\MyDemo_201306261156.trn'

A bunch of these.

Well, a lot of Googling suggested that since the log files were empty (this database doesn't get much traffic), SQL Server was smart enough to know there was nothing to apply.

Sure enough, manually applying them showed there was nothing for them to do. I needed a real transaction log with a transaction in it. No problem. Go to original database. Do a quick update.  Create transaction log and wait for the automated log copier to get it to the right place.
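
(On the primary, that's just something along these lines; the table name and the local backup path here are made up for illustration, and the copy job/rsync then moves the file along.)

USE MyDemo;
GO
-- any small change so the next log backup actually contains a transaction
UPDATE dbo.SomeTable SET LastUpdated = GETDATE() WHERE ID = 1;

BACKUP LOG MyDemo TO DISK = N'D:\Backups\MyDemo_manual.trn';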

“Hmm, maybe I’ll just do a late lunch today.”

Get it to the secondary server. Run the job.

“Hmm. Skipped the files I expected it to skip.  No problem.”

Gets to the file I expect it to apply:

(now I’m retyping rather than cutting/pasting)

2013-06-26 13:14:05.43 Found first log backup to restore. Secondary DB: 'MyDemo'; File: '\\NAS\Logshipping\MyDemo_201306261605.trn'

2013-06-26 13:14:05.44 The restore option was successful. Secondary Database 'MyDemo' Number of log backup files restored: 0

“Huh? “

Ayup. You read it right. It found the right file. It said the restore was successful. Then it said 0 files were restored.

Now, I eventually broke for lunch.  And dinner. And bed. And breakfast. And another lunch and dinner and more sleep. Also a couple of bike rides and a bunch of meetings and other stuff.

But truth is, I was stumped. I tried everything. Rebuilt the job. Tried manually updating the tables on the secondary to tell it I had already applied a log (that got me a weird datetime error I never quite figured out).

I could manually apply the logs just fine.
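
(For reference, applying one by hand is just a RESTORE LOG that leaves the database waiting for the next file, something like the lines below; use STANDBY instead of NORECOVERY if you want the secondary readable between restores, which is also what produces that .TUF file.)

RESTORE LOG MyDemo
    FROM DISK = N'\\NAS\Logshipping\MyDemo_201306261605.trn'
    WITH NORECOVERY;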

Log-shipping between two servers within the datacenter worked just fine.

Why wasn’t this working? I mean a log file is a log file right?

Well, apparently yes and no.

The ONE difference I noticed was that the transaction logs from the working test had their filenames using UTC time stamps, including seconds.

The ones I was shipping were from a standard maintenance plan I had set up, and their times were local time without that time stamp format.

BTW, this http://www.sqlservercentral.com/Forums/Topic351549-146-1.aspx#bm438143 is what also got me wondering about this.

“It couldn’t be THAT simple could it?”

So I set up a database in the primary datacenter with full logging, and set up log-shipping on that side. Now, I still couldn't provide any data about the secondary, but I could at least script the primary side that does the backups.

Set that up.  Shipped over a database, some logs. Now I’d like to say it worked on the first try, but I ran into some issues (completely of my own doing, so not relevant here).

But the 3rd time is the charm, as they say.  Got a backup over to the secondary.  Let the autosync job move over some transaction logs (including several with transactions).

Then, on the secondary, I ran the job I had previously handcrafted and... SUCCESS.

So, yes, you can do “blind” log-shipping (which I knew).

You can setup the secondary by hand.

But apparently you can’t use your existing transaction log job. You’re better off setting up the log-backups on the primary using the normal tools and then shipping those.

So lesson learned.

Sometimes, it’s enough to know there’s a right answer to keep you driving towards it.

And now time for dinner.