
About Greg Moore

Founder and owner of Green Mountain Software, a consulting firm based in the Capital District of New York focusing on SQL Server. Formerly, a consulting DBA ("and other duties as assigned") by day, and sometimes night, and caver by night (and sometimes day). Now, a PA student working to add PA-C after my name so I can work as a Physician Assistant. When I'm not in front of a computer or with my family I'm often out hiking, biking, caving or teaching cave rescue skills.

Impossible Things

“I daresay you haven’t had much practice,” said the Queen. “When I was your age, I always did it for half-an-hour a day. Why, sometimes I’ve believed as many as six impossible things before breakfast.”

I wanted to relate a story that happened to a colleague of mine.  She maintains servers in two separate datacenters, all on the same domain.  So, in theory, updating her password updates it for the servers in both datacenters.  And this is the way it had been for the past year: things worked as expected.

Recently, however, she noticed that after updating her password, one datacenter was still using the old password while the other was using the new one.

In a proper domain this shouldn’t be possible, but apparently it was.  She spent some time confirming it before calling the IT department.

When she explained the problem to IT, their response was, “That’s not possible.  We synch the servers every hour and run an exception report every night that would show that.”

IT managed to fix the problem (after finally acknowledging it).

To me, though, the troubling part is not that the passwords got out of sync. Sometimes the impossible does happen. What’s more troubling is that the hourly sync apparently never fixed the problem and the exception report never showed an issue in the month-plus the problem existed.

The moral here?

It’s not that the impossible sometimes happens (though that would be good for a future blog post).  It’s that when your alerts and warnings fail, it’s time to take a hard look at them, since they’re obviously not doing you much good.

Link

Ok, perhaps this isn’t Buckland, but alerts can be important.

I’ve been wanting to write something on alerts for quite a while, and this isn’t quite it. Rather, I’ll reference another article on alerting, which sums up quite well much of what I’ve wanted to say.

To single out one important rule: if you’re going to get an alert, be prepared to act upon it!

Knowing that you pegged your server CPU at 100% every once in a while might be useful, but it’s probably not something to wake people up about.  And if it’s hitting 100% infrequently, there’s probably nothing worth doing. On the other hand, if it’s routinely hitting 100% CPU, perhaps your action plan is to spin up another web server, or move load to a different database. Or perhaps your plan is even to do nothing. But planning to do nothing, and accepting that, is very different from simply doing nothing because you have no idea what to do.

Note that alerting is very different from monitoring and logging.  If my CPU is hitting 100% once a week for 5 seconds, then twice a week for 6 seconds, and then 4-5 times a day for 10 seconds, I want to start making plans. But again, I probably don’t want to wake someone up.
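If you want to watch that kind of trend on SQL Server itself, one low-effort option is to sample the scheduler monitor ring buffer, which records the SQL process’s CPU use roughly once a minute. A minimal sketch (the ring buffer format is undocumented and can change between versions, so treat this as illustrative):

    -- Pull the last 30 CPU samples (about one per minute) that SQL Server
    -- already keeps in the scheduler monitor ring buffer.
    SELECT TOP (30)
           DATEADD(ms, -1 * CAST(si.ms_ticks - rb.[timestamp] AS int), GETDATE()) AS sample_time,
           rb.rec.value('(./Record/SchedulerMonitorEvent/SystemHealth/ProcessUtilization)[1]', 'int') AS sql_cpu_pct
    FROM (SELECT [timestamp], CONVERT(xml, record) AS rec
          FROM sys.dm_os_ring_buffers
          WHERE ring_buffer_type = N'RING_BUFFER_SCHEDULER_MONITOR'
            AND record LIKE N'%<SystemHealth>%') AS rb
    CROSS JOIN sys.dm_os_sys_info AS si
    ORDER BY sample_time DESC;

Dump those samples into a table on a schedule and you have a trend to plan against, with nobody getting woken up.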

Monitor, yes. Alert: maybe.

That’s it for tonight.

 

The obvious isn’t so obvious

Ok, apologies to my loyal reader or two for not blogging in a LONG time.  Either life has been too busy, or when things have calmed down, I haven’t had anything to blog about.

Normally I don’t blog about weird technical stuff, but this time I will.

I’m currently working for a client doing some DBA work.  Lots of fun actually.  

Said client has an issue. They want to have a standby datacenter.  Great idea.  Log-shipping will work well for them here.  Just one catch: corporate won’t allow anything through the firewall unless it’s SSH.  Hmm.

No real problem; I figure I can do “log-shipping in the blind”.  Basically, take the transaction logs from the primary server, use rsync to get them through the firewall, and apply them to the secondary server.  If need be, I’ll write custom jobs.  It should be easy.

Here’s one example:

http://sqlblog.com/blogs/merrill_aldrich/archive/2011/05/19/case-study-secure-log-shipping-via-ssl-ftp.aspx

The key part (for me) is the section titled “Half a Log-Shipping Config”.

Pretty straightforward, no DBA would balk at that.

So, I set it all up. Set up some scheduled tasks.  Make a backup, copy it to the secondary server, and restore it.  So far so good.  Take the logs created by the existing log backup job, and watch as they automatically get shipped to the datacenter.
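In T-SQL terms the seeding step is nothing exotic. A sketch, using the MyDemo database that shows up in the log messages below and a hypothetical backup file name:

    -- On the primary: full backup to the share the rsync job watches.
    BACKUP DATABASE MyDemo
        TO DISK = N'\\NAS\Logshipping\MyDemo_full.bak';

    -- On the secondary, once the file has crossed over: restore it,
    -- leaving the database in a restoring state so logs can be applied.
    RESTORE DATABASE MyDemo
        FROM DISK = N'\\NAS\Logshipping\MyDemo_full.bak'
        WITH NORECOVERY;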

“Hey, this is going great. I’ll be done before lunch.”

Manually apply a log or two to make sure there are no issues.
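Each manual apply is the same one-liner (again a sketch; the file name is one from later in this post):

    -- Apply the next log in sequence; NORECOVERY leaves the door open for more.
    RESTORE LOG MyDemo
        FROM DISK = N'\\NAS\Logshipping\MyDemo_201306261156.trn'
        WITH NORECOVERY;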

“Works perfect.  Lunch is starting to look real good. Maybe I’ll break early and go out.”

Set up a scheduled task to run the job created above.

Fails. “Oh fudge. (Hmm, maybe some fudge after lunch?)”

Oh wait, just a typo in the directory.  

Rebuild job. Run it. Success!

Great, let’s check things out.  

“Hmm, the .TUF (transaction undo file) isn’t there.”

Let’s look at the jobs.

Now here, you’ll have to bear with me. Again, the corporate rules will NOT permit me to cut-paste even simple text error messages from a server in the datacenter.

But basically I get a bunch of messages along the lines of:

2013-06-26 05:00:01.92 Skipped log backup file. Secondary DB: ‘MyDemo’, File: ‘\\NAS\Logshipping\MyDemo_201306261156.trn’

A bunch of these.

Well, a lot of Googling suggested that since the log files were empty (this database doesn’t get much traffic), SQL Server was smart enough to know there was nothing to apply.

Sure enough, manually applying them showed there was nothing for them to do. I needed a real transaction log with a transaction in it. No problem. Go to the original database. Do a quick update.  Create a log backup and wait for the automated log copier to get it to the right place.
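In sketch form (the table name is hypothetical; the log file name is the one from the job output below):

    -- Touch a row so the next log backup actually contains a transaction.
    UPDATE dbo.SomeTable
    SET    LastTouched = GETDATE()
    WHERE  SomeTableID = 1;

    -- Back up the log to the share the automated copy job watches.
    BACKUP LOG MyDemo
        TO DISK = N'\\NAS\Logshipping\MyDemo_201306261605.trn';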

“Hmm, maybe I’ll just do a late lunch today.”

Get it to the secondary server. Run the job.

“Hmm. Skipped the files I expected it to skip.  No problem.”

Gets to the file I expect it to apply:

(now I’m retyping rather than cutting/pasting)

2013-06-26 13:14:05.43 Found first log backup to restore. Secondary DB: ‘MyDemo’; File: ‘\\NAS\Logshipping\MyDemo_201306261605.trn’

2013-06-26 13:14:05.44 The restore operation was successful. Secondary Database: ‘MyDemo’. Number of log backup files restored: 0

“Huh? “

Ayup. You read it right. It found the right file. It said the restore was successful. Then it said 0 files were restored.

Now, I eventually broke for lunch.  And dinner. And bed. And breakfast. And another lunch and dinner and more sleep. Also a couple of bike rides and a bunch of meetings and other stuff.

But the truth is, I was stumped. I tried everything. Rebuilt the job. Tried manually updating the tables on the secondary to tell it I had already applied a log (that got me a weird datetime error I never quite figured out).
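For the curious, that attempt looked roughly like this (a sketch from memory; hand-editing the msdb log-shipping tables is unsupported, and the weird datetime error suggests I got some detail wrong):

    -- Try to convince the secondary's bookkeeping that a log was already applied.
    UPDATE msdb.dbo.log_shipping_secondary_databases
    SET    last_restored_file = N'\\NAS\Logshipping\MyDemo_201306261156.trn',
           last_restored_date = GETDATE()
    WHERE  secondary_database = N'MyDemo';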

I could manually apply the logs just fine.

Log-shipping between two servers within the datacenter worked just fine.

Why wasn’t this working? I mean, a log file is a log file, right?

Well, apparently yes and no.

The ONE difference I noticed was that the transaction logs from the working test had their filenames using UTC time stamps, including seconds.

The ones I was shipping were from a standard maintenance plan I had set up, and the times were local time, without that UTC-style stamp.

BTW, this http://www.sqlservercentral.com/Forums/Topic351549-146-1.aspx#bm438143 is also what got me wondering about this.

“It couldn’t be THAT simple could it?”

So I set up a database in the primary datacenter, with full logging, and set up log-shipping on that side. Now, I still couldn’t provide any data about the secondary, but I could at least script the primary side that does the backups.
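Trimmed way down, the primary-side setup that the normal tools script out looks something like this (a sketch; the parameter values are hypothetical and the monitor/schedule pieces are omitted):

    -- Create the log-shipping backup job on the primary.
    EXEC master.dbo.sp_add_log_shipping_primary_database
         @database                = N'MyDemo',
         @backup_directory        = N'\\NAS\Logshipping',
         @backup_share            = N'\\NAS\Logshipping',
         @backup_job_name         = N'LSBackup_MyDemo',
         @backup_retention_period = 4320,  -- minutes, i.e. 3 days
         @overwrite               = 1;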

Set that up.  Shipped over a database, some logs. Now I’d like to say it worked on the first try, but I ran into some issues (completely of my own doing, so not relevant here).

But the third time’s the charm, as they say.  Got a backup over to the secondary.  Let the autosync job move over some transaction logs (including several with transactions).

Then, on the secondary, I ran the job I had previously handcrafted and… SUCCESS.

So, yes, you can do “blind” log-shipping (which I knew).

You can setup the secondary by hand.

But apparently you can’t use your existing transaction log job. You’re better off setting up the log-backups on the primary using the normal tools and then shipping those.

So lesson learned.

Sometimes, it’s enough to know there’s a right answer to keep you driving towards it.

And now time for dinner.

 

Eliminating the impossible

“How often have I said to you that when you have eliminated the impossible, whatever remains, however improbable, must be the truth?” – Sherlock Holmes

Since I haven’t found this issue elsewhere, and since it’s been a while since I’ve blogged, I figured I’d post.

So, the scenario is:

A Windows 2008 R2 cluster, pre-existing before I arrived on the scene.  Two nodes, set up to run SQL Server.  SQL Server would run fine on Node A, but a failover to Node B would fail.

The back history of the setup was neither complete nor detailed, but the problem was suspected to be with DNS or Active Directory.

I arrive on the scene, and one of my jobs is to set up additional clustered SQL instances on this Windows cluster. I do so, expecting to have the exact same issue. Nope. Things work fine once I figure out the rights my user needed but didn’t have in order to ADD the second node (Log on as a service, btw). (For the time being I built a 1-node cluster, which yes, you can do, and then once I had the rights, simply added the 2nd node.)

So, now I’m in the situation with a 2-node cluster and 3 SQL instances.  Two fail over as expected.  One (the original) does not.

Time to put my debugging hat on.

I won’t bore you with the details.  Suffice to say I tried a lot.

Compared ipconfig /all results: everything the same. (What wasn’t the same, I made the same where it made sense to. Still no joy.)

Pinged the WINS and DNS servers from both boxes. OK, here was a difference.  Node A could ping both its primary and its secondary WINS server.  Node B could NOT ping its secondary WINS server.  Interesting. But, didn’t really seem like the issue since it couldn’t explain why the other 2 instances would fail over just fine.

Checked out the registry.  Same in both cases.

Start to look at error logs.  At first, nothing.  Then I realize that, according to the timestamps, a SQL error log IS being created on Node B.  I look even more closely. The service is actually STARTING!  But then it’s stopping. And in between there’s a bunch of errors about not being able to log in.  Very strange.

So now I try to tackle the problem from a different angle. I fail over the disk and IP resources but don’t tell the cluster service to start up SQL Server.

Then, I go to the command line and start the service manually.

Works fine. Connections can be made, etc.  Of course the cluster service doesn’t think it’s up, but that’s to be expected and ok at this point.

But this is only a partial test, since maybe it’s my user that can do this, but not the service account.

So, go to the services screen, change SQL Server to start up using my account, and confirm that works.  Great.

Change it back to the designated service account and start it manually from there.  Starts just fine.

BUT, no login errors.

Finally that part clicks.  The thing trying to log in and run a query is the CLUSTER Service itself. It’s simply the heartbeat the Cluster Service uses to make sure the instance started. No wonder it attempts to start the resource and then fails it: it never hears the heartbeat.
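As I understand it (the details vary by version), that heartbeat is the cluster resource DLL’s IsAlive check, which amounts to a trivial query run over a local connection, something like:

    -- Roughly what the IsAlive heartbeat amounts to.  If this query can't
    -- even log in, the cluster assumes the instance is dead and fails it.
    SELECT @@SERVERNAME;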

Since it takes about a minute for the startup to actually fail, I confirm that I can connect to SQL Server in that minute window.  Sure enough, no problem, at least until the Cluster Service fails it.

So basically SQL Server is in fact running properly and starting up properly. It’s simply that the Cluster Service can’t confirm it is running so it shuts SQL Server down.

I started to try several things that all ended up in blind alleys.

Then, as I was poking around the SQL Server Configuration Manager on Node B, it dawned on me to look at the SQL Native Client settings and compare them to Node A. The one critical difference was that Node B had some aliases set up.  They looked correct, but following a troubleshooting axiom of mine, “One of these things is not like the other,” I decided to rename them (not delete them, since another axiom is “don’t break anything you can’t fix”) so they wouldn’t be used.

I then tested the failover, fully not expecting this to solve the problem. The failover worked just fine. Wow. That surprised me. Of course, I never trust anything I can’t replicate.  Changed the aliases back to their original form. Tested failover. It fails. Changed them back to the updated names and things work again.

I had my solution.

Now, my blog is intended to be more about thinking than actual technical issues, but for this I’ll make an exception.  So, for future reference (and for Google):

The error I received in the SQL Error logs was:

2013-02-20 08:36:47.74 Logon Login failed for user ”. The user is not associated with a trusted SQL Server connection. [CLIENT: 192.168.3.44]
2013-02-20 08:36:47.74 Logon Error: 17806, Severity: 20, State: 2.

No Googling for this helped. (It’s a common error in other contexts; none of the results I found were helpful here.)

But otherwise, this was your basic troubleshooting.

  • Eliminate possibilities
  • Try variations
  • When you think you’ve solved it, replicate it.

And, no matter how improbable it is (I never would have guessed aliases), if you’ve eliminated everything else, it must be the answer.

Interesting post from someone I know

Well so much for getting back on the bandwagon of posting more.

But in my defense, I’ve been busy this past month.

Someone I worked with briefly has written a post that I really liked, and I think it’s on-topic to some of the stuff here, namely “how we think.”

http://www.ratha.com/monotropism#comment-244

I’ve always been fascinated by not only intelligence, but HOW we think. Why can some folks solve a problem faster than others? Why can some folks walk into a room and “see” everything there, while others miss many details?

Years ago I developed the idea that folks had varying amounts of I/O “bandwidth” and more and more research seems to be bearing this out at a very general level. Ratha’s post explores a specific concept here of monotropism vs. polytropism and how some folks tend to focus on one thing to the extreme.

Following Directions

Not much to say here other than this link may have saved me a lot of work.

I would have saved myself even more work if I had paid close attention to the last step:

Don’t miss this step, it’s very important: Select the new document; press Ctrl + A; press F9.

Just to add: what this does (select everything, then update all the fields) is make sure the correct images get merged in.

Git ‘r Done

It’s rare that I post items in such quick succession, but I’m trying to post a bit more often and these topics work together.

In my previous post I mentioned the group of people I work with on the NCRC Educational Committee.  But I wanted to follow that up with a comment about a goal.

In cave rescue, our goal is to get the patient to the surface as quickly and safely as possible in as good or better shape than we found them.

Ultimately that goal should drive pretty much everything we do on a rescue.

Sometimes though, students fail to see it that way. On one hand it is amusing when we watch students take a simple problem and over-complicate it. Sometimes two instructors will look at each other and ask, “why are they doing it THIS way and not THAT way?” During training it’s easy to refocus them and remind them what the goal is. At the end of the week of training we have a mock rescue where the students are on their own. At this point, if they lose focus of the goal, they may take longer than they expect or wish to.

During one practice I was on as a student, a discussion began about how to move the litter with the patient in it under a tight, low roof along a stream passage. After a minute or two of discussion, two of the other students and I looked at each other and realized the rest of the group was too focused on convincing one another that their way was the best way to move the patient. The entire time they were trying to “win” the conversation, which had apparently become their current goal, the patient wasn’t moving.

So, the three of us simply moved the patient quickly and safely to the other side of the obstruction. After about a minute, the conversation stopped and the folks on the other side of the discussion realized the patient they were arguing about moving, had been moved. Things improved from there.

Now, by no means should it sound like I’ve never lost focus (see my post The Hunger Games for an example of a potentially more dangerous situation where I definitely lost focus of the correct goal.)

But this leads to a question: “How should the goal be accomplished?”

To give an example, perhaps I can build and operate a beautiful 4:1 haul system where every leg collapses the optimal amount and I can operate it with just 2 people.  Or, I can put up a 1:1 haul system that’s inelegant and requires 6-7 people to operate it.  Both will move the patient, but which one is “better?”

Well, honestly, “it depends”.  If I have plenty of extra people and I’m short on time, I’ll go with the 1:1 almost every time.  It’s simple and it works.  It can be set up in just a few minutes and requires very little equipment.

But what if I’m tight on people and I have the time?  Perhaps then the 4:1 is the proper solution.
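To put rough numbers on it (back-of-the-envelope, ignoring friction and rope stretch): with a 100 kg patient-and-litter load, a 4:1 asks for only about 100/4 = 25 kg of pull, manageable for 2 people, while a 1:1 asks for the full 100 kg, which is exactly why it takes 6-7 people to run.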

This is where experience and judgement come into play.  Both systems “Git ‘r Done” and both can help me with my goal of getting the patient to daylight. And that is my goal in a rescue.  My goal in a rescue is NOT to build a beautiful 4:1.  My goal is to build a system that gets the patient out safely and quickly.  If a 4:1 will work best to accomplish the goal, I’ll do that.  If it won’t, I’ll forgo it, no matter how sweet and sexy it may seem to me.

We sometimes teach students a handy metric of two questions to ask themselves:

  1. Does it work?
  2. Is it safe?

A 4:1 that isn’t fully rigged when the patient arrives fails the first question, no matter how elegant it is and how well it may operate when it’s finally rigged.

On the other hand, if the 1:1 is fully rigged, but I don’t have enough folks to operate it, it also fails the first question.  However, if I have enough folks to operate it, I’m not going to start discussing how there might be a better way to rig it with fewer people.  Note that “is it the BEST or OPTIMAL solution” isn’t part of the metric.  In this case, it really doesn’t matter.

Keeping these questions in mind can often head off extra conversation (such as the example above of the patient not moving while folks were discussing the BEST way to move him).

So, when you face the task of solving a problem, especially one with time pressure and that is most likely a one-off, ask yourself if the solution you currently have is safe and if it works.  If you can answer yes to both, then Git ‘r done.

Smart People

Like many, from time to time, I’ve had the distinction of being “the smartest person in the room”. (Often that’s when I’m the only guy in the room, but that’s just a minor detail.)

This past weekend though, I had the pleasure of being in a room full of people where I was definitely not the smartest guy in the room.  One of my side activities is working with the Educational Committee of the National Cave Rescue Commission. This entails, among other things, face-to-face meetings once or twice a year, during which we work on the curriculum, trying to improve it every year.  This weekend’s meeting had 8 people (including myself) in attendance.  The people there bring an extremely wide range of skills to the table: medical experience, SRT experience, grammar experience, experience with ropes and devices, rescue experience, and more.

With such a diverse background, there are times when not everyone is in agreement on various teaching points.  But, while there may be spirited debate at times, everyone still keeps the end goal in mind: developing the best possible curriculum for cave rescue.

However, one has to be careful about “the smartest guy in the room.”  There is an old joke about a plane about to crash. It has 3 passengers (a Boy Scout, a priest, and a Nobel Prize winner) and 2 parachutes.

The Nobel Prize winner grabs a pack, shouts, “My discoveries will save the world, I deserve to live,” and jumps out of the plane.

The priest tells the Boy Scout, “Son, please take the last parachute.  I have lived a good life and I am prepared to meet the Lord”.

The Boy Scout turns to the priest and says, “Don’t worry, Father, the smartest guy in the world just grabbed my backpack and jumped out of the plane.”

Being smart doesn’t make one immune from error.  But surrounding oneself with smart people can often lead to better solutions.

You don’t have to be the smartest person in the room, but you should at least try to surround yourself with them.

Think your way

Years ago, Thomas J. Watson at IBM thought up a single-word slogan: Think.

Later, Apple countered with Think Different.

I’ve always been fascinated with how people think.  One thing I’ve learned over the years is most people don’t think nearly as much as they think they do.  This isn’t necessarily a bad thing.  Actually in many ways it’s most likely an evolutionary adaptation that for the most part works.

For example, if a million years ago, you suddenly saw your ancestors all running away, you’d probably leap up and follow them without waiting to confirm there was a sabre-toothed tiger hot on their heels.  By not taking the extra second or two to think, you were able to pass on your genes.

But, in today’s highly evolved society, often that exact same reaction is not the right move.

I was reminded of this today getting off the train in New York City.  I tend to sit in the middle of the car (a lot of thinking has gone into this, actually!).  So when I get up, I always have a choice of which way to go to get off the train.  Now, in a perfect world it wouldn’t matter: since I’m equidistant from both ends of the car, it shouldn’t matter which way I go.  But often it does.  Today, actually, I was a little forward of the middle.  I turned back to see more than half the car lined up behind me, heading towards the rear of the car.  In part this was because the stairs from the platform to the station are in that direction.

Everyone was standing still.  I walked forward and was out of the car and at the stairs before the rear of the car had even begun to empty.  I’m not sure what the hold-up was. But the fact was, more than half the car apparently saw someone else get up and head towards the back, so they did too.  They never took the split second to look forward to see if that way would be any faster.

Now you may be saying, “big deal.” And you’re right. Getting off a train car a few seconds ahead of other people usually isn’t.

But I’ve heard of similar things happening in more critical situations. How many times have you listened to the flight attendant or movie announcer point out the fire exits and then promptly forgotten about them?  How many folks make note of all the exits in a building when they enter?

Studies of fires and incidents in public facilities show that folks tend to do one of two things.

One, they go where everyone else is going.  Again, this makes sense at a basic level.  There is the underlying assumption, whether they realize it or not, that the person in front of them, being the one moving, must know something they don’t.  Of course, it never dawns on them that the person in front of them may simply be moving in that direction because of the person in front of THEM.  The tragic side effect is that often you get crowding, trampling, or wedges of people by the exit.  Often you’ll find a number of deaths occur near the most visible or most used exit of a facility, in many cases because wedges of people get stuck trying to get through the door.

The other thing people tend to do is head towards the exit they know about.  In fact, this is most likely what causes the person at the front of the pack above to start moving in the direction they’re moving.

Unfortunately, building codes generally assume that in an emergency, folks will use all available exits (or some subset) in roughly equal numbers.  Experience shows that in a crisis, this is rarely the case.  (I do believe this assumption is starting to change, and some building codes are changing to reflect that.)

(As an aside, one of the more remarkable survival stories I recall from 9/11 was a small group of people who essentially MADE their own exit by cutting a hole through a fireproof wall into another area from which they could escape.  So, sometimes consider making your own exit.)

Now, since I hate to end these passages on a somber note, and because I want to relate this more to the IT industry, I’ll point out that often the successful businesses are not the ones that do what the herd does.  They do it differently.

So, two exercises left to the reader.  Next time you go to a movie theater or restaurant, take note of the exits and pick one you might not have noticed before to head to.  Or, if you’re stuck on a train or in a long line, stop to think whether there’s another way people are overlooking.

The other one is more fun, and one my friends did in high school.  At the next group photo, pretend to see something and point at it.  See how many people follow along, because obviously if you saw something, they must see something.  Then ask the last person who pointed what they were looking at.  I’ll bet in most cases you’ll hear, “whatever it was they were looking at!”

 

A Solution Exists

I’m reminded of an old, very geeky joke:

A mathematician, a scientist, and an engineer each take a room at a hotel.  They each wake up separately to discover a small fire in their wastebasket.

The scientist sits for a minute, calculates the amount of heat being created by the fire, determines the amount of water required to put it out, fills up his ice-bucket to that precise amount and tosses it on the fire, putting it out.  There’s no excess water and the mess is minor so he goes back to sleep.

The engineer looks at the fire, grabs the ice-bucket, fills it to the top with water, figures there’s more than enough water in it, and tosses it on the fire, with a margin of water to spare.  The fire goes out and he goes back to sleep.

The mathematician looks at the fire, looks at the ice-bucket, determines a solution exists and goes back to sleep.

When solving a problem, sometimes knowing a solution exists (or doesn’t exist) is half the battle.

I’m currently trying to win a programming contract to update a gentleman’s website. One of the requirements may be to integrate with some sort of payment service.  Some quick research showed such a solution “existed”.  I had achieved the level of mathematician.  The service suggested is PayPal’s https://www.paypal.com/ipn service.

But it’s one thing to show a solution exists; I had to actually work with the solution to make sure that, if the time comes, I can implement it.  So, I spent a few hours tonight working with the code example provided by PayPal and a database I created to test the proposed solution. After a couple of minor errors, I was able to use the test sandbox to call my callback URL and record the responses to the database.  I had achieved the basics, but this was also the hardest part.  So I have achieved the level of scientist here.
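The database side of that test was nothing fancy; something like this sketch (the names are hypothetical, and ‘VERIFIED’/‘INVALID’ are the two answers PayPal’s IPN validation postback can return):

    -- Logging table for raw IPN posts during sandbox testing.
    CREATE TABLE dbo.IpnResponse
    (
        IpnResponseID int          IDENTITY(1,1) NOT NULL PRIMARY KEY,
        ReceivedAt    datetime     NOT NULL DEFAULT GETDATE(),
        RawPost       varchar(max) NOT NULL,  -- full form-encoded body PayPal sent
        VerifyStatus  varchar(20)  NULL       -- 'VERIFIED' or 'INVALID' after postback
    );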

Now, if the customer gives me the go-ahead, I’ll have to expand the code to actually parse the request string and other details and integrate it with his code and database.  That said, that part is actually fairly trivial in many ways as I will control the entire process there.

So I haven’t quite achieved the level of engineer, but if the time comes, I will.

So that fire is out.  On to the next one.