“So, why are you sitting here?”

I had been anticipating the question, and it was a fair one; after all, I was one of two men sitting at the Women in Technology Birds of a Feather table at PASS Summit. But let me back up a bit.

Last week was the PASS Summit in Seattle, the annual event I mentioned two weeks ago that I was headed to. Several thousand people attend, and to promote networking in the massive lunch hall, a number of tables are set aside for particular topics, i.e. “birds of a feather”. So if there’s a particular topic or interest group you’re associated with, you can sit at such a table and know you’re among like-minded friends. For example, on Day One I had sat at the “Virtual and Local User Group” table. But today, I found myself at the Women in Technology table.

So why?

Let’s back up even further. I grew up in a small town in the northwest corner of Connecticut. I can’t say my parents were poor, but we probably lived below what many would consider a middle-class lifestyle. However, I was very fortunate to have hard-working parents and grandparents who helped, and more than a bit of privilege.  What do I mean by this? One example comes to mind. A couple of years after college when I was first consulting, I needed a small business loan to cover a project for a client. I literally walked into the local bank and on my word got the loan I needed. Even then I realized I had a bit of privilege going on there.

As I’ve grown older, I’ve listened to more and more testimonies from women and persons of color and continued to realize how much of my life I’ve taken for granted. As a result, I’ve worked to listen to others and to try to increase their access to opportunities, so they can gain the same privilege I was simply born with by being a white male.

So why was I there?

The question was not a surprise, since the table host, Kathi Kellenberger, had said she wanted to go around the table and ask folks why they were there. Fortunately she hadn’t started with me! This gave me time to think about my answer.

To listen. To listen to two women of color talk about their struggles and their efforts to make it in the world of SQL DBAs. To listen to other women talk about their experiences and to learn from them.

So I gave that and a bit more as my answer and then shut up and listened. It was a great lunch and a great experience. As my friend and WIT Virtual Group co-leader (along with Kathi), Rie Irish, is wont to say, “if women could solve these problems we’d have done so by now. We need your help.”

So to my fellow men out there, I would say: be an ally. Attend the WIT Luncheon (which was the day before) at PASS Summit. Encourage women to speak at your user group and at SQL Saturdays, stop others from interrupting them during meetings, and amplify their ideas. And sometimes, just shut up and listen. And if you’re involved with SQL Server and PASS and want more information, reach out to Rie and Kathi and contact the Virtual Group they manage, Women in Technology. Trust me, men are welcome as allies.

 

The obvious isn’t so obvious

Ok, apologies to my loyal reader or two for not blogging in a LONG time.  Either life has been too busy, or when things have calmed down, I haven’t had anything to blog about.

Normally I don’t blog about weird technical stuff, but this time I will.

I’m currently working for a client doing some DBA work.  Lots of fun actually.  

Said client has an issue. They want to have a standby datacenter. Great idea. Log-shipping will work well for them here. Just one catch: corporate won’t allow anything through the firewall unless it’s SSH. Hmm.

No real problem, I figure I can do “log-shipping in the blind”. Basically, take the transaction logs from the primary server, use rsync to get them through the firewall, and apply them to the secondary server. If need be, I’ll write the jobs by hand. It should be easy.

Here’s one example:

http://sqlblog.com/blogs/merrill_aldrich/archive/2011/05/19/case-study-secure-log-shipping-via-ssl-ftp.aspx

The key part (for me) is the section there titled “Half a Log-Shipping Config”.

Pretty straightforward, no DBA would balk at that.

So, I set it all up. Set up some scheduled tasks. Make a backup, copy it to the secondary server, and restore it. So far so good. Take the logs created by the existing log backup job, and watch as they automatically get shipped to the datacenter.
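For reference, the “half” of the config that lives on the secondary really just boils down to a restore WITH NORECOVERY. A minimal sketch, assuming the files land on a share the secondary can read; the paths are illustrative, not the client’s actual setup:

-- On the primary: take the initial full backup (path is illustrative)
BACKUP DATABASE MyDemo TO DISK = '\\NAS\Logshipping\MyDemo_full.bak';

-- On the secondary, once rsync has moved the file over:
-- restore WITH NORECOVERY so the database stays in a restoring state
-- and can keep accepting transaction log restores
RESTORE DATABASE MyDemo
    FROM DISK = '\\NAS\Logshipping\MyDemo_full.bak'
    WITH NORECOVERY;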

“Hey, this is going great. I’ll be done before lunch.”

Manually apply a log or two to make sure there are no issues.
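(Manually applying a log is just a RESTORE LOG, again WITH NORECOVERY so the next one can follow. The filename below is one of the shipped logs, used purely for illustration.)

RESTORE LOG MyDemo
    FROM DISK = '\\NAS\Logshipping\MyDemo_201306261156.trn'
    WITH NORECOVERY;  -- leave the database restoring, ready for the next log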

“Works perfect.  Lunch is starting to look real good. Maybe I’ll break early and go out.”

Set up a scheduled task to run the job created above.

Fails. “Oh fudge. (Hmm, maybe some fudge after lunch?)”

Oh wait, just a typo in the directory.  

Rebuild job. Run it. Success!

Great, let’s check things out.  

“Hmm, the .TUF (transaction undo file) isn’t there.”

Let’s look at the jobs.

Now here, you’ll have to bear with me. Again, the corporate rules will NOT permit me to cut-paste even simple text error messages from a server in the datacenter.

But basically I get a bunch of messages along the lines of:

2013-06-26 05:00:01.92 Skipped log backup file. Secondary DB: ‘MyDemo’, File: ‘\\NAS\Logshipping\MyDemo_201306261156.trn’

A bunch of these.

Well, a lot of Googling suggested that since the log files were empty (this database doesn’t get much traffic), SQL Server was smart enough to know there was nothing to apply.

Sure enough, manually applying them showed there was nothing for them to do. I needed a real transaction log with a transaction in it. No problem. Go to the original database. Do a quick update. Create a transaction log backup and wait for the automated log copier to get it to the right place.
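Something along these lines is all it takes (the table and column are made-up placeholders; the point is just to get a real transaction into the next log backup):

-- On the primary: make a trivial change so the next log backup isn't empty
UPDATE dbo.SomeTable SET LastTouched = GETDATE() WHERE ID = 1;  -- placeholder table/column

-- Then take a log backup for the copy job to pick up (path is illustrative)
BACKUP LOG MyDemo TO DISK = '\\NAS\Logshipping\MyDemo_manual.trn';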

“Hmm, maybe I’ll just do a late lunch today.”

Get it to the secondary server. Run the job.

“Hmm. Skipped the files I expected it to skip.  No problem.”

Gets to the file I expect it to apply:

(now I’m retyping rather than cutting/pasting)

2013-06-26 13:14:05.43 Found first log backup to restore. Secondary DB: ‘MyDemo’; File: ‘\\NAS\Logshipping\MyDemo_201306261605.trn’

2013-06-26 13:14:05.44 The restore operation was successful. Secondary Database: ‘MyDemo’, Number of log backup files restored: 0

“Huh? “

Ayup. You read it right. It found the right file. It said the restore was successful. Then it said 0 files were restored.

Now, I eventually broke for lunch.  And dinner. And bed. And breakfast. And another lunch and dinner and more sleep. Also a couple of bike rides and a bunch of meetings and other stuff.

But truth is, I was stumped. I tried everything. Rebuilt the job. Tried manually updating the tables on the secondary to tell it I had already applied a log (that got me a weird datetime error I never quite figured out).
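(For the curious, the bookkeeping the restore job relies on lives in msdb; something like the query below shows what it thinks it last restored. I’m quoting the table and column names from memory, so verify them before poking at anything.)

-- On the secondary: what does the restore job think it last restored?
SELECT secondary_database, last_restored_file, last_restored_date
FROM msdb.dbo.log_shipping_secondary_databases;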

I could manually apply the logs just fine.

Log-shipping between two servers within the datacenter worked just fine.

Why wasn’t this working? I mean a log file is a log file right?

Well, apparently yes and no.

The ONE difference I noticed was that the transaction logs from the working test had their filenames using UTC time stamps, including seconds.

The ones I was shipping were from a standard maintenance plan I had set up, and their names used local times without that UTC time stamp.

BTW, this thread, http://www.sqlservercentral.com/Forums/Topic351549-146-1.aspx#bm438143, is also what got me wondering about this.

“It couldn’t be THAT simple could it?”

So I set up a database in the primary datacenter, with full logging, and set up log-shipping on that side. Now, I still couldn’t give it any details about the secondary, but I could at least script the primary side that does the backups.
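Scripted out, the primary side looks roughly like this. A sketch only: the parameter values are placeholders and I’m quoting the procedure from memory, so check the exact arguments against Books Online before using it.

-- On the primary: set up just the backup half of log shipping
EXEC sp_add_log_shipping_primary_database
      @database = N'MyDemo'                      -- database to ship
    , @backup_directory = N'D:\Logshipping'      -- local folder for log backups
    , @backup_share = N'\\PRIMARY\Logshipping'   -- share the copy/rsync job reads from
    , @backup_job_name = N'LSBackup_MyDemo'
    , @backup_retention_period = 4320;           -- minutes to keep old log backups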

Set that up. Shipped over a database backup and some logs. Now, I’d like to say it worked on the first try, but I ran into some issues (completely of my own doing, so not relevant here).

But the third time’s the charm, as they say. Got a backup over to the secondary. Let the autosync job move over some transaction logs (including several with transactions).

Then, on the secondary, I ran the job I had previously handcrafted and... SUCCESS.

So, yes, you can do “blind” log-shipping (which I knew).

You can setup the secondary by hand.

But apparently you can’t use your existing transaction log job. You’re better off setting up the log-backups on the primary using the normal tools and then shipping those.

So lesson learned.

Sometimes, it’s enough to know there’s a right answer to keep you driving towards it.

And now time for dinner.

 

Eliminating the impossible

“How often have I said to you that when you have eliminated the impossible, whatever remains, however improbable, must be the truth?” – Sherlock Holmes

Since I haven’t found this issue elsewhere, and since it’s been awhile since I’ve blogged, I figured I’d post.

So, the scenario is:

A Windows 2008 R2 cluster that pre-existed my arrival on the scene. Two nodes, set up to run SQL Server. SQL Server would run fine on Node A, but a failover to Node B would fail.

The back history I got on the setup was neither complete nor detailed, but the problem was suspected to be with DNS or Active Directory.

I arrive on the scene and one of my jobs is to set up additional clustered SQL instances on this Windows cluster. I do so, expecting to have the exact same issue. Nope. Things work fine once I figure out the rights my user needed but didn’t have in order to ADD the second node (Log on as a service, btw). (For the time being I built a 1-node cluster, yes, you can do that, and then, once I had the rights, simply added the 2nd node.)

So, now I’m in the situation with a 2-node cluster and 3 SQL instances.  Two fail over as expected.  One (the original) does not.

Time to put my debugging hat on.

I won’t bore you with the details.  Suffice to say I tried a lot.

Compared ipconfig /all results – everything was the same. (What wasn’t the same, I made the same where it made sense to. Still no joy.)

Pinged the WINS and DNS servers from both boxes. OK, here was a difference. Node A could ping both its primary and its secondary WINS server. Node B could NOT ping its secondary WINS server. Interesting. But it didn’t really seem like the issue, since it couldn’t explain why the other 2 instances would fail over just fine.

Checked out the registry.  Same in both cases.

Start to look at error logs. At first, nothing. Then I realize that, according to the timestamps, a SQL error log IS being created on Node B. I look even more closely. The service is actually STARTING! But then it’s stopping. And in between there’s a bunch of errors about not being able to log in. Very strange.

So now I try to tackle the problem from a different angle. I fail over the disk and IP resources but don’t tell the cluster service to startup SQL Server.

Then, I go to the command line and start the service manually.

Works fine. Connections can be made, etc.  Of course the cluster service doesn’t think it’s up, but that’s to be expected and ok at this point.

But, this is only a partial test.  Since maybe it’s my user that can do this, but not the service account.

So, go to the services screen, change SQL Server to start up using my account, and confirm that works. Great.

Change it back to the designated service account and start it manually from there.  Starts just fine.

BUT, no login errors.

Finally that part clicks. The thing trying to log in and run a query is the CLUSTER Service itself. It’s simply the heartbeat check the Cluster Service uses to make sure the instance started. No wonder it keeps attempting to start SQL Server and then failing: it never hears the heartbeat.

Since it takes about a minute for the startup to actually fail, I confirm that I can connect to SQL Server in that minute window.  Sure enough, no problem, at least until the Cluster Service fails it.

So basically SQL Server is in fact running properly and starting up properly. It’s simply that the Cluster Service can’t confirm it is running so it shuts SQL Server down.

I tried several things, all of which ended up in blind alleys.

Then, as I was poking around SQL Server Configuration Manager on Node B, it dawned on me to look at the SQL Native Client settings and compare them to Node A. The one critical difference was that Node B had some aliases set up. They looked correct, but following a troubleshooting axiom of mine, “one of these things is not like the other”, I decided to rename them (not delete them, since another axiom is “don’t break anything you can’t fix”) so they wouldn’t be used.

I then tested the failover, fully expecting this not to solve the problem. The failover worked just fine. Wow. That surprised me. Of course, I never trust anything I can’t replicate. Changed the aliases back to their original form. Test failover. It fails. Change them back to the updated names and things work again.

I had my solution.

Now, my blog is intended to be more about thinking than actual technical issues, but for this I’ll make an exception. So, for future reference (and for Google):

The error I received in the SQL Error logs was:

2013-02-20 08:36:47.74 Logon Login failed for user ”. The user is not associated with a trusted SQL Server connection. [CLIENT: 192.168.3.44]
2013-02-20 08:36:47.74 Logon Error: 17806, Severity: 20, State: 2.

No Googling for this helped. (It’s a common error in other contexts, but none of the hits I found were helpful here.)

But otherwise, this was your basic troubleshooting.

  • Eliminate possibilities
  • Try variations
  • When you think you’ve solved it, replicate it.

And, no matter how improbable it is (I never would have guessed aliases), if you’ve eliminated everything else, it must be that.

SQL Server User Group

Not much of a post tonight.  I didn’t realize it had been over two months since I last posted.  I’m sure all my faithful readers (all 1 or 2 of them?) have been holding their breath.

Anyway, tonight I went to the local SQL Server Users Group meeting. They had a remote demonstration on query tuning, one of my favorite topics.

In fact performance tuning in general has been a topic I’ve often enjoyed.

First rule: There is always a performance bottleneck. This is one of the first rules I recall reading. At first people object, “but the system is fast enough.” That may be true. BUT, there is still something limiting it from being better. Of course, you might not be able to fix that limit. But more importantly, it may not matter.

Second rule: It may not matter.  If your query is already running in subsecond times, it may not be worth spending any time on optimizing it any further.  Or, if your query takes 1 hour to run, but runs at night when nothing else is running, it may not matter.

Third rule: Optimize only what you need to. A classic example of this I’ve seen is a report that runs overnight and slows down other processes. You start to optimize it and then think to ask, “is anyone still using this report?” You find out the report is no longer being used. Now you can achieve the holy grail (and perhaps the only exception to rule 1): infinite optimization. Delete the report and suddenly you have infinitely optimized it.

BTW, these rules don’t apply just to SQL Server.

Optimize your life and enjoy more of your time doing things you enjoy.

White Ford Taurus

So, I’m listening to the 24 Hours of PASS webinars. The current topic is “I Was Young and Didn’t Know Any Better” and the panelists are sharing war stories of mistakes they’ve made.

So far they all sound familiar.  So I thought I’d share one of mine.  Well technically not my mistake, but one that I adopted.

Many moons ago, I was advising a company that was involved in building websites for car dealerships. One day they needed to do an update to the live data. This was back in the days when all code changes and updates were cowboy updates. Of course you ran the query on the live database the first time. You didn’t necessarily have a staging database or even, as was later discovered, good backups.

Apparently a customer needed to update a car in their inventory.

UPDATE AUTO SET cartype = 'White Ford Taurus'

Nice, syntactically valid… and a disaster.  Ayup.  Suddenly every car in the database at every dealership was now a White Ford Taurus.

Ever since then we called that the “White Ford Taurus” problem.

Now, I might mock doing updates on live data, but sometimes it’s necessary. I’m curious how others prevent their own “White Ford Taurus” problems.

Personally, I just now make EXTRA effort to make sure I have a WHERE clause.

But I also tend to almost always do it as:

BEGIN TRAN
UPDATE AUTO SET cartype = 'White Ford Taurus'
-- if the update touched anything other than exactly one row, undo it
IF @@ROWCOUNT <> 1 ROLLBACK TRAN ELSE COMMIT TRAN

Or sometimes I’ll simply hardcode the ROLLBACK TRAN, run it once, see what happens, and then rerun it with a COMMIT TRAN.
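In practice that looks something like the following. (The WHERE clause and key column here are made up for the example; the habit is the point, not the exact query.)

BEGIN TRAN
UPDATE AUTO SET cartype = 'White Ford Taurus' WHERE carid = 42  -- carid is a hypothetical key column
-- first pass: roll back no matter what, and eyeball the "(n row(s) affected)" count
ROLLBACK TRAN
-- once the row count looks right, rerun the batch with COMMIT TRAN in place of ROLLBACK TRAN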

So, if rather than updating the 1 row I want, I find myself updating 1000s of rows, I’ll catch myself and be safe.

Sure, it’s not perfect; both it and the WHERE clause require me to make sure I don’t forget them. But the more ways to catch a mistake, the better.

Obviously avoiding ad-hoc updates on live data is preferable, but when you can’t, be extra careful.  And of course make sure you have good backups. But that goes without saying.