Missing the Old Ways

I’ll admit, sometimes I’m a curmudgeon. Sometimes I miss the old ways. Last night was a case in point. My wife asked me to look at her computer. For some reason, all her Office apps had suddenly stopped working after a reboot. I tried a few simple things and, sure enough, I’d get a pop-up saying “This App can’t Open” every time. Googling brought up a page that seemed helpful and had a number of recommendations. I resisted the second option, creating or using a new Microsoft account, because I’m not keen on extra accounts. I’ll save you the trouble of reading the rest of this post and say that I finally did that, using MY existing Microsoft account, and magically everything started working. I then removed my account and things seem to continue to work.

But I’m frustrated. I miss the old days when one installed software and it, well, frankly, stayed installed. I really don’t think one should have to worry about software like Office suddenly breaking because an online account isn’t available or the like. I’d be ok with certain features not being available (e.g. saving to the cloud automatically) but basic functionality shouldn’t suddenly break on a reboot.

I’ll admit there are days I miss DOS when things were really pretty simple.

Of course, the irony is I’m writing this one on a dual-screen computer with 16 gigs of memory, an ungodly number of programs open, and an even larger number of browser tabs open. So I’m not entirely against new things. But I do want basic stuff just to work.

And those are my thoughts for this week.

Up a Creek…

Actually more like an upstream problem. Or two.

Or, another week in the life of a DBA and other duties as assigned

So a few weeks ago a developer at a client of mine reported that some recently deployed code wasn’t working. I tried it and of course it worked for me. That isn’t unusual since I have sysadmin rights on that box. I tried EXECUTE AS using her ID and it failed. Not uncommon; sometimes in production, permissions don’t get promoted the way they should. So I checked her permissions and the permissions of the users actively trying to call the stored procedure. Strangely, they all had the proper permissions, at least as far as I could tell. Then I had that light bulb insight and realized I had been misinterpreting the error message EXECUTE AS was giving me: “The server principal “DOMAIN\UserXYZ” is not able to access the database “ImportantDB” under the current security context.”

I realized I had been troubleshooting the wrong problem. It wasn’t that the person didn’t have permissions to execute, it was that no one other than sysadmins had the right to CONNECT to the database. A simple:

GRANT CONNECT TO [DOMAIN\UserXYZ];

And all was good to go! Or was it? It took us some digging to figure out why this had happened on a production database. Apparently, when the database was designed in Dev, the developers had the rights they needed to connect, so they never thought about who else might need to connect, and they created it with very limited CONNECT grants. When the database was moved to production, in this case with a backup and restore, the same lack of rights moved upstream with it.
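As a rough sketch of how one might confirm this kind of problem before and after the fix, the T-SQL below reuses the login and database names from the error message above. It assumes you are allowed to impersonate the login and that a database user already exists for it. Note that database-level CONNECT is granted with GRANT CONNECT TO a principal; CONNECT SQL is the server-level permission.

```sql
-- Sketch only: names come from the error message above; adjust for your environment.

-- 1. Impersonate the login and ask whether it can reach the database at all.
--    (Requires IMPERSONATE permission on the login, or sysadmin.)
EXECUTE AS LOGIN = N'DOMAIN\UserXYZ';
SELECT HAS_DBACCESS(N'ImportantDB') AS can_access;  -- 1 = yes, 0 = no
REVERT;
GO

-- 2. See which database principals currently hold (or are denied) CONNECT.
USE ImportantDB;
GO
SELECT pr.name, pr.type_desc, pe.state_desc
FROM sys.database_permissions AS pe
JOIN sys.database_principals AS pr
    ON pr.principal_id = pe.grantee_principal_id
WHERE pe.permission_name = N'CONNECT';
GO

-- 3. Grant CONNECT at the database level.
--    Assumption: a database user already exists for this login.
GRANT CONNECT TO [DOMAIN\UserXYZ];
GO
```

Running step 2 in Dev before a backup-and-restore promotion would be one way to spot a missing grant before it travels upstream.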

Now, in the opposite direction, a vendor wanted to send a file to my client in their UAT environment. I fired up the PowerShell script to retrieve the file and decrypt it. The decryption failed. It took me a while to figure out the problem. The client has a rule that every 2 years we must upgrade our RSA Public Key with them. Ironically, I had just completed the most recent update last month and moved it into production. Apparently though, their rule doesn’t apply to their UAT environment. Which came as sort of a shock to me, since they’re always so insistent on us following their security requirements. Of course, beyond the irony of them not following their own rules, the file they had asked us to download wasn’t there.

The PM contacted them and they assured him the file would be there on Friday. Well here it is Tuesday and we’ll see if this time the file is there.

In any case, this time, the problem wasn’t promoting a change from UAT to PROD, but the client’s failure to move a change from PROD to UAT.

Such is life.

So sometimes I’m up the creek without the paddle and sometimes I’m down the creek…

SQL Server Scheduled Email Queries

Every once in a while I come across a problem that surprises me both because it exists and because the solution turns out to be trivial. This is one of those. Basically, a developer at a client wanted to set up a scheduled task that would execute a query and email the results. Everything appeared to work, but the emails never went out. There were no obvious errors other than the emails never arriving. Some digging via Profiler showed that the SQL Agent user was having permission issues. But basically giving it every permission possible didn’t solve it.

So let me walk you through it. First, let’s create a real simple stored procedure. This assumes you have AdventureWorks2014 installed (yeah, it’s old, but it’s what I had handy).

Use adventureworks2014
GO
Create or Alter Procedure Send_Test_Email
as
exec msdb.dbo.sp_send_dbmail @recipients='test@example.com', @body='This is the body of the email', @subject='Test Email w query embedded',
@query='select * from adventureworks2014.sales.SalesOrderHeader where subtotal > 150000';

Now, here’s an important detail I did discover during testing. If my query didn’t actually query a table, but instead was, say, @query='select getdate()', things worked fine.

That said, if you simply execute the query above in SSMS, it should work just fine. (I’d recommend you put your own email in it for testing. This way you’ll know if the email is actually being sent.)

Before you do that, also create the following stored procedure:

Use adventureworks2014
GO
Create or Alter Procedure Send_Test_Email_Query_attached
as
exec msdb.dbo.sp_send_dbmail @recipients='test@example.com', @body='This is the body of the email', @subject='Test Email w query embedded',
@query='select * from adventureworks2014.sales.SalesOrderHeader where subtotal > 150000',
@attach_query_result_as_file = 1,
@query_result_no_padding = 1,
@query_attachment_filename = 'Sales Report.csv',
@query_result_separator = ';',
@exclude_query_output = 1,
@append_query_error = 0,
@query_result_header = 0

This will execute the same query as the original stored procedure, but place the contents in a CSV file as an attachment.

Again, if you execute the above directly from SSMS, you should receive the email without an issue. This is basically what the client was attempting to do.

Now create the following job:

USE [msdb]
GO

/* Object: Job [Send AdventureWorks Email] Script Date: 7/27/2021 9:12:49 AM */
BEGIN TRANSACTION
DECLARE @ReturnCode INT
SELECT @ReturnCode = 0
/* Object: JobCategory [[Uncategorized (Local)]] Script Date: 7/27/2021 9:12:49 AM */
IF NOT EXISTS (SELECT name FROM msdb.dbo.syscategories WHERE name=N'[Uncategorized (Local)]' AND category_class=1)
BEGIN
EXEC @ReturnCode = msdb.dbo.sp_add_category @class=N'JOB', @type=N'LOCAL', @name=N'[Uncategorized (Local)]'
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback

END

DECLARE @jobId BINARY(16)
EXEC @ReturnCode = msdb.dbo.sp_add_job @job_name=N'Send AdventureWorks Email',
@enabled=1,
@notify_level_eventlog=0,
@notify_level_email=0,
@notify_level_netsend=0,
@notify_level_page=0,
@delete_level=0,
@description=N'No description available.',
@category_name=N'[Uncategorized (Local)]',
@owner_login_name=N'sa', @job_id = @jobId OUTPUT
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
/* Object: Step [Send email] Script Date: 7/27/2021 9:12:50 AM */
EXEC @ReturnCode = msdb.dbo.sp_add_jobstep @job_id=@jobId, @step_name=N'Send email',
@step_id=1,
@cmdexec_success_code=0,
@on_success_action=1,
@on_success_step_id=0,
@on_fail_action=2,
@on_fail_step_id=0,
@retry_attempts=0,
@retry_interval=0,
@os_run_priority=0, @subsystem=N'TSQL',
@command=N'exec Send_test_email_Query_attached',
@database_name=N'adventureworks2014',
@flags=0
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
EXEC @ReturnCode = msdb.dbo.sp_update_job @job_id = @jobId, @start_step_id = 1
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
EXEC @ReturnCode = msdb.dbo.sp_add_jobserver @job_id = @jobId, @server_name = N'(local)'
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
COMMIT TRANSACTION
GOTO EndSave
QuitWithRollback:
IF (@@TRANCOUNT > 0) ROLLBACK TRANSACTION
EndSave:
GO

If you execute this job as is, you should see the following:

Success! Or is it?

But you will never get the email!
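When a job reports success but nothing arrives, one place to look is what Database Mail itself recorded. The following is a minimal sketch using the standard msdb Database Mail views; it is not part of the original troubleshooting, and the TOP (20) cutoff is arbitrary.

```sql
-- Sketch: check what Database Mail recorded, regardless of what the job reported.
USE msdb;
GO

-- Most recent mail items and their status (sent, unsent, retrying, failed).
SELECT TOP (20) mailitem_id, sent_status, subject, send_request_date, send_request_user
FROM dbo.sysmail_allitems
ORDER BY send_request_date DESC;

-- Most recent Database Mail log entries.
SELECT TOP (20) log_id, event_type, log_date, description, mailitem_id
FROM dbo.sysmail_event_log
ORDER BY log_date DESC;
```

If no row shows up for the job’s run, the message was never queued, which points at the call to sp_send_dbmail rather than at mail delivery.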

Modify the above job and replace the line:

@command=N'exec Send_test_email_Query_attached',

with

@command=N'exec Send_test_email',

Now if you run the job, it will fail!

This time it clearly failed!

But as is usual with sp_send_dbmail, the error message isn’t overly helpful:

Executed as user: NT AUTHORITY\SYSTEM. Failed to initialize sqlcmd library with error number -2147467259. [SQLSTATE 42000] (Error 22050). The step failed.

But it does help give a clue. The problem seems to be some sort of permissions issue. So I’ll admit, at this point I tried all sorts of solutions, including setting up a proxy user, giving unfettered rights to my SQL Agent user and other things (thinking once I got it working I could then lock things back down). Instead, I found a much easier solution buried in a thread on the Microsoft site.

You’ll note that when I wrote the original stored procedure I fully qualified the table name. This is generally useful, but here I did it because I was cheating and knew I’d need it for this demo.

The solution is actually VERY simple and I’ll show it both graphically and via a script.

First: change the database to MSDB and then fully qualify your call to the stored procedure as below:

Execute it from the MSDB database

However, there’s one more critical step: under the ADVANCED tab in the step change the Run As User to dbo:

Seems simple, but it’s critical
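If you would rather not click through the dialog, the same two changes can probably be applied to the existing step with sp_update_jobstep instead of dropping and recreating the job. This is a sketch that assumes the job and step created earlier in this post (job name 'Send AdventureWorks Email', step 1):

```sql
-- Sketch: apply the fix to the existing job step in place.
-- Assumes the job name and step number from the job created above.
EXEC msdb.dbo.sp_update_jobstep
    @job_name           = N'Send AdventureWorks Email',
    @step_id            = 1,
    @database_name      = N'msdb',   -- run the step from msdb
    @database_user_name = N'dbo',    -- the "Run as user" setting on the Advanced tab
    @command            = N'exec adventureworks2014.dbo.Send_Test_Email_Query_attached';
```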

Now if you script out the scheduled task it should look like:

USE [msdb]
GO

/* Object: Job [Send AdventureWorks Email] Script Date: 7/27/2021 9:26:37 AM */
EXEC msdb.dbo.sp_delete_job @job_id=N'08a55e18-eecf-4d4c-8197-8135a0d7520b', @delete_unused_schedule=1
GO

/* Object: Job [Send AdventureWorks Email] Script Date: 7/27/2021 9:26:37 AM */
BEGIN TRANSACTION
DECLARE @ReturnCode INT
SELECT @ReturnCode = 0
/* Object: JobCategory [[Uncategorized (Local)]] Script Date: 7/27/2021 9:26:37 AM */
IF NOT EXISTS (SELECT name FROM msdb.dbo.syscategories WHERE name=N'[Uncategorized (Local)]' AND category_class=1)
BEGIN
EXEC @ReturnCode = msdb.dbo.sp_add_category @class=N'JOB', @type=N'LOCAL', @name=N'[Uncategorized (Local)]'
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback

END

DECLARE @jobId BINARY(16)
EXEC @ReturnCode = msdb.dbo.sp_add_job @job_name=N'Send AdventureWorks Email',
@enabled=1,
@notify_level_eventlog=0,
@notify_level_email=0,
@notify_level_netsend=0,
@notify_level_page=0,
@delete_level=0,
@description=N'No description available.',
@category_name=N'[Uncategorized (Local)]',
@owner_login_name=N'sa', @job_id = @jobId OUTPUT
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
/* Object: Step [Send email] Script Date: 7/27/2021 9:26:37 AM */
EXEC @ReturnCode = msdb.dbo.sp_add_jobstep @job_id=@jobId, @step_name=N'Send email',
@step_id=1,
@cmdexec_success_code=0,
@on_success_action=1,
@on_success_step_id=0,
@on_fail_action=2,
@on_fail_step_id=0,
@retry_attempts=0,
@retry_interval=0,
@os_run_priority=0, @subsystem=N'TSQL',
@command=N'exec adventureworks2014.dbo.Send_Test_Email_Query_attached',
@database_name=N'msdb',
@database_user_name=N'dbo',
@flags=0
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
EXEC @ReturnCode = msdb.dbo.sp_update_job @job_id = @jobId, @start_step_id = 1
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
EXEC @ReturnCode = msdb.dbo.sp_add_jobserver @job_id = @jobId, @server_name = N'(local)'
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
COMMIT TRANSACTION
GOTO EndSave
QuitWithRollback:
IF (@@TRANCOUNT > 0) ROLLBACK TRANSACTION
EndSave:
GO

If you then execute THIS scheduled task you’ll get both a success message:

And you should actually receive the email with the query results attached.

So an issue that surprised me by existing at all ended up having a fairly simple solution.

“We’re up to plan F”

I managed to skip two weeks of writing, which is unusual for me, but I was busy with other business, primarily last week leading an NCRC weeklong class of cave rescue for Level 1 students. I had previously led such a class over three weekends last year, and have helped teach the Level 2 class multiple times. Originally this past week was supposed to be our National weeklong class, but back in February we had agreed to postpone it due to the unknown status of the ongoing Covid pandemic. However, due to a huge demand and the success of vaccinations, we decided to do a “Regional” Class just limited to Level 1 students. This would help handle the pent-up demand, create students for the Level 2 class that would be at National, and serve as sort of a test run of our facilities before the much larger National.

There’s an old saying that no plan survives the first contact with the enemy. In cave rescue this is particularly true. It also appears to be true in cave rescue training classes!

The first hitch was the drive up to the camp we were using. The road had been stripped down to the base dirt level and they were doing construction. Not a huge issue, just a dusty one. But for cavers, dust is just mud without the water. But this would come into play later in the week.

Once at the camp, as I was settling in and confirming the facilities, the first thing I noticed was that the scissor lift we had used to rig ropes in the gym last time was gone. A few texts and I learned it had only been on loan to the camp the past two years and was no longer available. This presented our first real challenge: how to get ropes up over the beams 20-30′ in the air.

But shortly after, I realized I had a far greater issue. The custom-made rigging plates we use to tie off the end of the ropes to the posts were still sitting in my garage at home. I had completely forgotten them. This was resolved by a well-timed call to an instructor heading towards the camp, who, via a longer detour than he expected, was able to get them. Had that call waited another 5 minutes, his detour would probably have doubled. So the timing was decent.

I figured the week was off to a good start at that point! Honestly though, we solved the problems and moved on. I went to bed fairly relaxed.

All went well until Monday. This was the day we were supposed to do activities on the cliffs. Several weeks ago, my son and I, along with two others had gone to the cliffs, which were on the same property as the camp, but accessible only by leaving the camp and accessing from a public road, in order to clear away debris and do other work to make them usable. I was excited to show them off. Unfortunately, due to the weather forecast of impending thunderstorms all day we made the decision to revise our schedule and move cliff day to the next day. There went Plan A. Plan B became “go the next day.”

On Tuesday a couple of other instructors and I got in my car to head to the cliffs in advance of the students so we could scope things out and plan the activities. We literally got to the bottom of the road from the main entrance to the camp, where we were going to turn on to the road under construction, only to find the road closed there with a gaping ditch dug across it. So much for Plan B. We went back to the camp, told students to hang on and then I headed out again, hoping to basically take a loop around and approach the access road to the cliffs from the opposite direction. After about a 3 mile detour we came to the other end of the road and found it closed there. Despite trying to sweet talk the flag person, we couldn’t get past (we could have lied and said we lived on the road, but after 8-10 other cars would have arrived in a caravan saying the same thing we thought that might be suspicious). There went Plan C. We called an instructor back at the camp and headed back.

We got there and turns out an instructor had already come up with Plan D, which was to see if we could access the cliffs by crossing a field the camp owned and going through the woods. It might involve some hiking, but it might be doable. While there are dirt-bike paths, there’s nothing there that worked for us. So that plan fell apart. We were up to Plan E now. Plan E was proposed to further swap some training, but we realized that would impact our schedule too much. Now on to Plan F. For Plan F, we decided to head to a local cave which we thought would have some suitable cliffs outside.

That worked. It worked out quite well, actually. We lost maybe an hour to 90 minutes with all the plans, but we ultimately came upon a plan that worked. We were able to teach the skills we wanted and accomplish our educational objectives.

Often we wake up with a plan in our heads for what we will do that day. Most days those plans work out. But, then there are the days where we have to adapt. Things go sideways. Something breaks, or something doesn’t go as planned. In the NCRC we have an unofficial motto, Semper Gumby – “Always be Flexible”. Sometimes you have to completely change plans (cancelling due to the threat of thunderstorms), others you may have to try to adapt (finding other possible routes to the cliffs) and finally you may need to reconsider how to meet your objectives in a new way (finding different cliffs).

My advice: don’t lock yourself into only one solution. It’s a recipe for failure.

Take 5 Minutes

This weekend I had the pleasure of moderating Brandon Leach’s session at Data Saturday Southwest. The topic was “A DBA’s Guide to the Proper Handling of Corruption”. There were some great takeaways and if you get a chance, I recommend you catch it the next time he presents it.

But there was one thing that stood out that he mentioned that I wanted to write about: taking 5 minutes in an emergency. The idea is that sometimes the best thing you can do in an emergency is take 5 minutes. Doing this can save a lot of time and effort down the road.

Now, obviously, there are times when you can’t take 5 minutes. If you’re in an airplane and you lose both engines on takeoff while departing La Guardia, you don’t have 5 minutes. If your office is on fire, I would not suggest taking 5 minutes before deciding to leave the building. But other than the immediate life-threatening emergencies, I’m a huge fan of taking 5 minutes. Or as I’ve put it, “make yourself a cup of tea.” (note I don’t drink tea!) Or have a cookie!

Years ago, when the web was young (and I was younger) I wrote sort of a first-aid quiz web-page. Nothing fancy or formal, just a bunch of questions with hyperlinks to the bottom. It was self-graded. I don’t recall the exact wording of one of the questions but it was something along the lines of “You’re hiking and someone stumbles and breaks their leg, how long should you wait before you run off to get help?” The answer was basically “after you make some tea.”

This came about after hearing Dr. Frank Hubbell, the founder of SOLO, talk about an incident in the White Mountains of New Hampshire where the leader of a Boy Scout troop passed out during breakfast. Immediately two scouts started to run down the trail to get help. While doing so, one slipped and fell off a bridge and broke his leg. Turns out the leader simply had passed out from low blood sugar and once he woke up and had some breakfast was fine. The poor scout with the broken leg though wasn’t quite so fine. If they had waited 5 minutes, the outcome would have been different.

The above is an example of what some call “Go Fever”. Our adrenaline starts pumping and we feel like we have to do something. Sitting still can feel very unnatural. This can happen even when we know rationally it’s NOT an emergency. Years ago during a mock cave rescue training exercise, a student was so pumped up that he started to back up and ran his car into another student’s motorcycle. There was zero reason to rush, and yet he had let go fever hit him.

Taking the extra 5 minutes has a number of benefits. It gives you the opportunity to catch your breath and organize the thoughts in your head. It gives you time to collect more data. It also sometimes gives the situation itself time to resolve.

But, and Brandon touched upon this a bit, and I’ve talked about it in my own talk “Who’s Flying the Plane”, often for this, you need strong support from management. Management obviously wants problems fixed, as quickly as possible. This often means management puts pressure on us IT folks to jump into action. This can lead to bad outcomes. I once had a manager who told my team (without me realizing it at the time) to reboot a SQL Server because it was acting very slowly. This was while I was in the middle of remotely trying to diagnose it. Not only did this not solve the problem, it made things worse, because a rebooting server is exactly 100% not responsive, and even once it comes back up, it has to load a lot of pages into cache and will respond slowly for a while. And in this case, as I was pretty sure would happen, the reboot didn’t solve the problem (we were hitting a flaw in our code that was resulting in huge table scans). While non-fatal, taking an extra 5 minutes would have eliminated that outage and gotten us that much closer to solving the problem.

Brandon also gave a great example of a corrupted index and how easy it can be to solve. If your boss is pressuring you for a solution NOW and you don’t have the opportunity to take those 5 minutes, you might make a poor decision that leads to a larger issue.

My takeaway for today is threefold:

  1. Be prepared to take 5 minutes in an emergency
  2. Take 5 minutes today, to talk to your manager about taking 5 minutes in an emergency. Let them know NOW that you plan on taking those 5 minutes to calm down, regroup, maybe discuss with others what’s going on and THEN you will respond. This isn’t you being a slacker or ignoring the impact on the business, but you being proactive to ensure you don’t make a hasty decision that has a larger impact. It’s far easier to have this conversation today, than in the middle of a crisis.
  3. If you’re a manager, tell your reports that you expect them to take 5 minutes in an emergency.

“Houston, we’re venting something into Space…”

This post is the result of several different thoughts running through my head combined with a couple of items I’ve seen on social media in the past few days. The first was a question posted to #SQLHelp on Twitter asking what a DBA should do on walking into a situation with a SQL Server in an unknown configuration. The second was a comment a friend made about how “it can’t get any worse” and several of us cheekily corrected him saying it can always get worse. And of course I’m still dealing with my server that died last week.

To the question of what to do with an unknown SQL Server, there were some good answers, but I chimed in saying my absolute first thing would be to make backups. Several folks had made good suggestions about looking at system settings and possibly changing them, possibly re-indexing, etc. My point though was, all that could wait. If the server had been running up until now, then while fixing those things might be very helpful, not fixing them would not make things worse. On the other hand, if there were no up-to-date backups and the server failed, the owner would be in a world of hurt. Now, for full disclosure, I was “one-upped” when someone pointed out that assuming they did have backups, what one really wanted to do was a restore. I had to agree. The truth is, no one needs backups, what they really need are restores. But the ultimate point is really the same: without a tested backup, your server can only get much worse if something goes wrong.
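As a first look on an unfamiliar server, something like the following sketch shows when each database last had a full backup, according to the backup history in msdb. It only reflects history recorded on this instance (which may have been purged), so treat an empty result as a prompt to investigate rather than as proof.

```sql
-- Sketch: when was each database last fully backed up, per msdb history?
SELECT d.name AS database_name,
       d.recovery_model_desc,
       MAX(b.backup_finish_date) AS last_full_backup
FROM sys.databases AS d
LEFT JOIN msdb.dbo.backupset AS b
    ON b.database_name = d.name
   AND b.type = 'D'            -- 'D' = full database backup
WHERE d.name <> N'tempdb'      -- tempdb cannot be backed up
GROUP BY d.name, d.recovery_model_desc
ORDER BY last_full_backup;     -- NULLs (no recorded backup) sort first
```

A recent date here still says nothing about whether the backup can be restored; only a test restore answers that.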

I’ve had to apply this thinking to my own dead server. Right now it’s running in a Frankenbeast mode on an old desktop with 2GB of RAM. Suffice to say, this is far from ideal. New hardware is on order, but in the meantime, most things work well enough.

I actually have a newer desktop in the house I could in theory move my server to. It would be a vast improvement over the current Frankenbeast: 8GB of RAM and a far faster CPU. But, I can’t. It doesn’t see the hard drive. Or more accurately, it won’t see an OS on it. After researching, I believe the reason comes down to a technical detail about how the hard drive is set up (namely, the disk is partitioned as what’s known as MBR and it needs to be GPT). I’ll come back to this in a minute.

In the meantime, let’s take a little detour to mid April, 1970. NASA has launched two successful Lunar landings and the third, Apollo 13 is on its way to the Moon. They had survived their launch anomaly that came within a hair’s breadth of aborting their mission before they even made orbit. Hopes were high. Granted, Ken Mattingly was back in Houston, a bit disappointed he had been bumped from the flight due to his exposure to rubella. (The vaccine had just been released in 1969 and as such, he had never been vaccinated, and had not had it as a child. Vaccines work folks. Get vaccinated lest you lose your chance to fly to the Moon!)

Stack of Swiss cheese slices showing holes lined up.

A routine mission operation was to stir the oxygen tanks during the flight. Unfortunately, due to a Swiss Cheese effect of issues, this nearly proved disastrous when it caused a spark which caused an “explosion” which blew out the tank and ruptured a panel on the Service Module and did further damage. Very quickly the crew found themselves in a craft quickly losing oxygen but more importantly, losing electrical power. Contrary to what some might think, the loss of oxygen wasn’t an immediate concern in terms of breathing or astronaut health. But, without oxygen to run through the fuel cells, it meant there was no electricity. Without electricity, they would soon lose their radio communication to Earth, the onboard computer used for navigation and control of the spacecraft and their ability to fire the engines. Things were quickly getting worse.

I won’t continue to go into details, but through a lot of quick thinking as well as a lot of prior planning, the astronauts made it home safely. The movie Apollo 13, while a somewhat fictionalized account of the mission (for example James Lovell said the argument among the crew never happened, and Ken Mattingly wasn’t at KSC for the launch), is actually fairly accurate.

As you may be aware, part of the solution was to use the engine on the Lunar Module to change the trajectory of the combined spacecraft. This was a huge key in saving the mission.

But this leads to two questions that I’ve seen multiple times. The first is why they didn’t try to use the Service Module (SM) engine, since it was far more powerful and had far more fuel and they in theory could have turned around without having to loop around the Moon. This would have saved some days off the mission and gotten the astronauts home sooner.

NASA quickly rejected this idea for a variety of reasons. One was fairly direct: there didn’t appear to be enough electrical power left in the CSM (Command/Service Module) stack to do so. The other was somewhat indirect: they had no knowledge of the state of the SM engine. There was a fear that any attempt to use it would result in an explosion, destroying the SM and very likely the CM, or at the very least damaging the heatshield on the CM, and a bad heatshield would mean a dead crew. So, NASA decided to loop around the Moon using the LM descent engine, a longer, but far less risky maneuver.

Another question that has come up was why they didn’t eject the now dead and deadweight, SM. This would have meant less mass, and arguably been easier for the LM to handle. Again, the answer is because of the heatshield. NASA had no data on how the heatshield on the CM would hold up after being exposed to the cold of space for days and feared it could develop cracks. It had been designed to be protected by the SM on the flight to and from the Moon. So, it stayed.

The overriding argument here was “don’t risk making things worse.” Personally, my guess is given the way things were, firing the main engine on the SM probably would have worked. And exposing the heatshield to space probably would have been fine (since it was so overspecced to begin with). BUT, why take the risk when they had known safer options? Convenience is generally a poor argument against potentially catastrophic outcomes.

So, in theory, these days it’s trivial to convert an MBR disk to a GPT one. But, if something goes wrong, or that’s not really the root cause of my issues, I end up going from a crippled, but working server to a dead server I have to rebuild from scratch. Fortunately, I have options (including now a new disk so I can essentially mirror the one disk, have an exact copy and try the MBR->GPT solution on that one) but they may take another day or two to implement.

And in the same vein, whether it’s a known SQL Server or an unknown one you’re working on, PLEASE make backups before you make changes, especially anything dramatic that risks data loss. (And I’ll add a side note: if you can, avoid restarting SQL Server when diagnosing issues, since you lose a LOT of valuable information in the DMVs.)
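A minimal sketch of a “before I touch anything” backup might look like the following. The database name and the file path are placeholders, not anything from a real system, and COPY_ONLY is used on the assumption that you don’t want to disturb an existing backup chain.

```sql
-- Sketch: take a one-off backup before making changes.
-- [YourDatabase] and the path are hypothetical placeholders.
BACKUP DATABASE [YourDatabase]
TO DISK = N'D:\Backups\YourDatabase_prechange.bak'
WITH COPY_ONLY,   -- does not affect the existing differential/log backup sequence
     CHECKSUM,    -- verify page checksums while writing the backup
     INIT,        -- overwrite any existing backup sets in this file
     STATS = 10;  -- progress message every 10 percent

-- Confirm the backup file is complete and readable.
RESTORE VERIFYONLY
FROM DISK = N'D:\Backups\YourDatabase_prechange.bak'
WITH CHECKSUM;
```

RESTORE VERIFYONLY is only a sanity check; in the spirit of “what you really need are restores”, restoring the file somewhere else is the real test.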

So things CAN get worse. But that doesn’t mean there’s any need to take steps that will. Be cautious. Have a backout plan.

Broken Potshards

Another email from a customer: “Greg, I can’t invoice this client, it keeps coming up blank, why?”

I grab the most recent copy of their database, go through the steps and find out not only is she right, it’s worse than she described. If I pick a random client, the invoice appears correctly AND I can even rebill them. But if I pick the client in question, not only does their current invoice come up blank, when I go to rebill them, the resulting PDF not only shows the invoices for EVERY client, but shows EVERY invoice that client has ever had (fortunately this organization only bills once a year.)

This is a custom app that a local vendor, which has since gone out of business, wrote for them years ago. My customer approached me about 3 years ago to fix a few bugs in their app, and since then they’ve become a small but reliable source of income. While they call the app “the database”, the reality of course is that while there’s a SQL Server database backing the app, most of my work is actually done supporting the app in VB.Net.

I generally don’t consider myself a VB.Net programmer, despite having done a fair amount of work in it for this customer, for an app for the National Cave Rescue Commission (NCRC), and for a large multinational several years ago. I generally prefer DBA work. So why do I do it?

Because it’s fun and because it involves what I call software archeology. I liken my work for this customer and the work for the multinational to what an archeologist does when they find a bunch of potshards: they try to reassemble them and figure out what they were intended to do.

For this customer, often I’m actually not fixing code, I’m drilling into the code and the database to determine “what did the original developer intend?” “What business assumptions were made in the original design specs?” This means sometimes the customer will email me and say something like, “Greg, when I try to add a member to this group, why does it not work?” And I dig through the code and realize it was never intended that you could add a specific individual to a group. What you do is say that client has the following positions on that group and then within the client “this person fills this position.” In other words, the original business case was that as a client updated its own individuals, the memberships in groups would reflect that. It’s a fine way of approaching the problem and honestly, works well. Except the current main user of the program was approaching the issue from the opposite direction: a new client had signed up and wanted to have people in specific groups so she went to the groups and tried to add specific people.

So, there was nothing wrong with the code, nor was there anything wrong with the design, just a different approach. But it took me several hours of digging through the undocumented code to determine why she couldn’t do what she wanted and how to go about doing it.

So what’s the deal with the most recent case? Well, it’s not a bug per se, though I’ll probably fix some code to prevent the problem. It turns out that their clients are charged based on how many groups or committees they’re members of and whether they’re domestic or international clients (in some cases they can be both). There’s code to calculate a discount if they’re both an international client and a domestic client, based on how many committees they’re on. However, the code assumes that the discount only applies up to so many committee memberships. That limit isn’t hardcoded; it’s more a result of some math that, in this specific case, instead of returning a discount (even a $0 one), failed to return any discount at all, because halfway through the SQL calculations it was returning a NULL. And of course $discount = $numcomm1-$numcomm2, where one of those is NULL, will result in $discount being NULL.
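The NULL behavior is easy to see in isolation. What follows is only a minimal sketch, not the customer's actual code: the variable names simply echo the pseudo-formula above, and the values are made up for illustration.

```sql
-- Hypothetical values; @numcomm2 stands in for a lookup that found no row.
DECLARE @numcomm1 int = 5;
DECLARE @numcomm2 int = NULL;

SELECT
    @numcomm1 - @numcomm2            AS discount_raw,   -- NULL: any arithmetic with NULL yields NULL
    @numcomm1 - ISNULL(@numcomm2, 0) AS discount_safe;  -- 5: the missing value is treated as 0
```

Whether 0 is the right default is a business question rather than a technical one, which is why it pays to work out what the original design intended before wrapping things in ISNULL.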

So, technically the code should handle this better, but it was obvious once the pot shards were put together that the original designers and design specs never envisioned this particular combination of memberships (and in fact I think in this case it’s a mistake since in previous years the client was only a domestic client, not both).

It was a fun little mystery and I think I’ve solved the current issue for my customer, but eventually we’ll need to think about how to approach this issue in case it happens again in the future.

Personally, I tend to enjoy these little mysteries of trying to figure out what the code is doing, but more importantly why. It can be insightful.

And now back to some PowerShell code for my largest client that actually involves some real database work.

The Value of Testing

This is one of those posts where I wish I could show actual code snippets, but since it involves a 3rd party vendor for one of my clients and I don’t have permission, I can’t.

So, I’m forced unfortunately to talk about the issue in a roundabout way.

My client uses a 3rd party tool to track documents. I’ve mentioned this before. They’ve been growing fairly fast and running into performance issues. I suppose growing fast is a good thing, but having performance issues is not.

In any case, using Query Store, I was able to send the vendor a list of queries and stats about them for them to review and to ideally improve the queries that needed work.

Yesterday they got back to me. The email was essentially “we took this first query (let’s call it Doubly-Joined) and rewrote it as this second query (let’s call it Singly-Joined).” I looked at the two queries, which join 4 tables. They’re very similar to each other, but the first one did join in the main table a second time (hence why I’m calling it Doubly-Joined). It’s not clear why this was done. The second query basically removed the second join and, in the select clause, changed the aliases that pointed at the second join to point at the first join instead. This does give them a slightly different query plan, but ultimately, they return the same number of rows.

The first query plan
The second query plan

As you can see, the 2nd query plan is definitely a bit simpler (ignore the one warning, it’s not something that appears to be fixable here).

So, a naive take would be “we removed an unnecessary join, so of course it should be faster!” But is it?

Sometimes intuition can be correct, sometimes not so much. In this case though, it’s easy to confirm by seeing exactly how many rows are being read in each query.

I wrapped each query in a

Set Statistics IO ON/OFF
Set Statistics TIME ON/OFF

block and ran it. Here are the results

The Doubly-Joined

Table 'Table1'. Scan count 0, logical reads 337264, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Table2'. Scan count 0, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Table3'. Scan count 0, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Table4'. Scan count 1, logical reads 396, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

The Singly-joined

Table 'Table1'. Scan count 0, logical reads 337260, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Table2'. Scan count 0, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Table3'. Scan count 0, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Table4'. Scan count 1, logical reads 396, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

The relevant change is in the logical reads for Table1. The Singly-Joined query consistently performed 4 fewer logical reads. Now, if the original number had been 8 and had dropped in half to 4, I’d be happy. But the change from 337264 to 337260 (roughly 0.001%) leaves me a bit underwhelmed. Furthermore, under multiple runs, the second query did not consistently use less CPU time; sometimes it took longer to run. Further testing was consistent in the lack of apparent improvement.

Needless to say, I don’t think this query improvement will help much. I’ve reached out to the vendor to see if they can provide more details, but honestly, I’m not hoping for much.

Query Store Saves the Day

It’s never a good thing when you get an impromptu meeting invite on the weekend and the subject line is “Sync Error”. I honestly didn’t even see the invite until the meeting had been going on for over an hour.

I called in and was brought up to speed. A 3rd party tool one of my clients uses was having major timeout issues. Normally it’s fine, but my client was taking advantage of the weekend to do a very large import of data and the tool wasn’t keeping up.

I both love and hate being thrown into situations like this. I hate it because often I have very little information to go on, but also love it, because it can be a good challenge. So, I wanted to collect some data. Fortunately the database in question runs on SQL Server 2016. This blog post covers a bit of what we did and ends with why I am so grateful for Query Store.

Query Store and the First Graphs

I quickly enabled Query Store and grabbed a quick report. With help from the 3rd party’s support, I was able to focus on a particular query.
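For anyone who hasn’t turned Query Store on before, a minimal sketch of doing it in T-SQL on SQL Server 2016 looks something like this. The database name VendorDb is a placeholder, not the real one, and the default settings are assumed to be acceptable.

```sql
-- VendorDb is a hypothetical name; substitute your own database.
ALTER DATABASE [VendorDb] SET QUERY_STORE = ON;
ALTER DATABASE [VendorDb] SET QUERY_STORE (OPERATION_MODE = READ_WRITE);

-- Confirm it is actually collecting (the state should read READ_WRITE).
USE [VendorDb];
SELECT actual_state_desc,
       desired_state_desc,
       current_storage_size_mb,
       max_storage_size_mb
FROM sys.database_query_store_options;
```

The same thing can be done from the database properties dialog in SSMS; the script is just easier to repeat.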

Query Store first graph

Initial Query Store screen grab

So, right away, I knew that at times this query could flip to a pretty bad query plan. I was curious as to why. But while poking around, I noticed something else going on. The database was at the SQL Server 2008 compatibility level, despite running on SQL Server 2016. Now I know when we upgraded the server a year ago the 3rd party vendor didn’t guarantee compatibility with 2016, so we had left it in its old compatibility level. Since then apparently the vendor had qualified it and I confirmed with their support who was on the line that I could change the compatibility level to SQL Server 2016. Of course, I wanted to see if this would make a difference, so I grabbed another one of the problematic queries and looked at the query plan both before and after.
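Checking and changing the compatibility level takes only a couple of statements. This is a sketch with a placeholder database name (VendorDb), and it assumes, as was the case here, that the vendor has confirmed support for the newer level.

```sql
-- VendorDb is a hypothetical name. 100 = SQL Server 2008, 130 = SQL Server 2016.
SELECT name, compatibility_level
FROM sys.databases
WHERE name = N'VendorDb';

ALTER DATABASE [VendorDb] SET COMPATIBILITY_LEVEL = 130;
```

The change can be reversed by setting the level back to 100, which makes it a fairly low-risk experiment as long as someone is watching the results.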

Compatibility level 100

Query Plan at SQL Server 2008 Level

Compatibility level 130

Query Plan at SQL Server 2016 Level

As you’ll note, the 2008 plan uses 2 hash matches, the 2016 uses two merge joins. That’s interesting by itself, but after collecting a bit of data, I saw the 2016 plan was running in an average of 45ms. The 2008 plan had been averaging 1434ms. That’s quite the improvement, simply by a single change!

That said, I still wasn’t entirely comfortable with what was going on and dug a bit deeper.

Digging Deeper

The change to the compatibility level had essentially eliminated the green bar in the above graph. This was good. But the blue bar to the left of it was still an issue. It also had a similar issue with flipping between two different query plans, but this was even worse.

Query Store second graph

Better, but note that one query really stands out!

I find this particular chart to be the most useful. I set a custom time frame (in this case 3 hours) and looked at the total duration of 25 queries that had accumulated the most time running. It’s pretty clear that one query dominates and working on this is probably where I want to spend my efforts. It’s also very hard to pick out, but the query (#12) from the first graph that I had looked at, has improved so much that it’s now moved to 12th on the list from the 2nd position.  That’s quite an improvement and simply by changing the compatibility level! More on my thoughts on that below.
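The same view is available without the GUI by querying the Query Store catalog views. The query below is only a rough approximation of what that chart shows, assuming a 3 hour window and the top 25 queries by accumulated duration; it is not what SSMS runs internally.

```sql
-- Top 25 queries by total duration over the last 3 hours.
-- avg_duration is stored in microseconds, so divide by 1000 for milliseconds.
SELECT TOP (25)
       q.query_id,
       qt.query_sql_text,
       SUM(rs.count_executions)                            AS executions,
       SUM(rs.avg_duration * rs.count_executions) / 1000.0 AS total_duration_ms
FROM sys.query_store_runtime_stats AS rs
JOIN sys.query_store_runtime_stats_interval AS rsi
     ON rsi.runtime_stats_interval_id = rs.runtime_stats_interval_id
JOIN sys.query_store_plan AS p
     ON p.plan_id = rs.plan_id
JOIN sys.query_store_query AS q
     ON q.query_id = p.query_id
JOIN sys.query_store_query_text AS qt
     ON qt.query_text_id = q.query_text_id
WHERE rsi.start_time >= DATEADD(HOUR, -3, SYSDATETIMEOFFSET())
GROUP BY q.query_id, qt.query_sql_text
ORDER BY total_duration_ms DESC;
```

Sorting by executions instead of total_duration_ms gives the "executed the most" view, which is handy for spotting fast queries that are called constantly.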

The more I thought about it, the more I started to focus on statistics. This was an educated guess based on the fact that my client was doing a LOT of inserts and updates into a particular table. There’s another issue I’ll also discuss. This one I couldn’t fix unfortunately, but if the 3rd party can, I think they’ll see a HUGE improvement in performance.

Slow query plan

Slow version of the query

Fast version of the query

These look VERY similar, except the positions of the Key Lookups and the Index Seeks are swapped. That may not seem like much, but the slow version was on average taking about 93.95 ms and the fast version was on average taking about 0.11 ms. That’s a HUGE difference, about 850x! It took me a bit to realize what was going on, but let’s talk about that Key Lookup. Even with the faster version, it’s obvious that if I can eliminate that, I could get things to be even faster! The problem is that the query wants to return some columns not covered by the IX_FileID index. That’s generally easy to fix, and while I’m loath to make updates to 3rd party packages, I was willing to test this one out by making it a covering index. Unfortunately, this is where I was stymied. One of the columns is an IMAGE datatype and you can’t throw those into an index. I’ve recommended to the 3rd party vendor that they try to change this. It wouldn’t be easy, but it could have dramatic performance improvements here and elsewhere (I had run into this problem last year while trying to tackle another performance issue).
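For readers who haven’t built a covering index, here is a sketch of the general shape. Only the index name IX_FileID comes from the real system; the table name, key column, and included columns are all invented for illustration.

```sql
-- dbo.Files, FileID, FileName and ModifiedDate are hypothetical names.
-- DROP_EXISTING = ON rebuilds the existing IX_FileID with the extra columns.
CREATE NONCLUSTERED INDEX IX_FileID
    ON dbo.Files (FileID)
    INCLUDE (FileName, ModifiedDate)
    WITH (DROP_EXISTING = ON);

-- An image, text, or ntext column cannot appear in the INCLUDE list,
-- so a query that returns one still needs the Key Lookup.
-- A varbinary(max) column, by contrast, is allowed as an included column.
```

That last point is why changing the column’s datatype is the real fix: the index change only becomes possible once the column is no longer IMAGE.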

I should note, that even though this query is actually very fast, it is executed so much that its total time dominates in the system. This is one reason why any improvement here would have a dramatic impact.

Statistics

In any case, looking at these two query plans and doing some further testing confirmed my hypothesis and also suggests why changing the compatibility level helped so much: statistics were very quickly getting out of whack.

I was able to confirm this by grabbing some data from the Query Store for just the last hour, and it showed only the slow version of the query was running. I then forced an update of stats on the table in question and immediately saw the query flip over to the faster plan. This continued for a while before it flipped back to the slower version.
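A minimal sketch of that check-then-update step, using a placeholder table name (dbo.BusyTable) rather than the vendor’s real one:

```sql
-- dbo.BusyTable is a hypothetical name for the heavily modified table.
-- See how stale each statistic is: rows modified since the last update.
SELECT s.name AS stats_name,
       sp.last_updated,
       sp.rows,
       sp.rows_sampled,
       sp.modification_counter
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE s.object_id = OBJECT_ID(N'dbo.BusyTable');

-- Force a refresh of every statistic on the table (default sampling).
UPDATE STATISTICS dbo.BusyTable;
```

A large modification_counter relative to rows is a reasonable hint that the optimizer is working from a stale picture of the data.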

We developed a plan, which I’ll admit upfront didn’t work. We decided that updating the stats on that particular table every hour might give us tremendous performance gains. And in fact it did initially. BUT, what we found was that after an hour of inserts, running the update stats for that table took about 45-60 seconds, and the vendor’s tool has a hard-coded 30 second timeout. And because of the way this particular tool works, after a failure you have to start from scratch on every run. Since the job can take 4-6 hours to run, we couldn’t update stats every hour, even though I would have liked to.

Query Store third graph

The graph that showed our plan wasn’t working

The graph above shows how, at the time the update stats was running (that particular column of the Query Store graphic is cut off), the query times jumped to 30 seconds. So while overall updating the stats is a good thing, here it was definitely killing our process.

Above I mentioned that changing the compatibility level still had an impact here. What I didn’t show here was that I was also looking at a bunch of statistics histograms and could see how bad things had gotten in some cases. But this is an area where SQL Server 2016 makes a difference! It does a better job in the background of keeping statistics reasonably accurate (still not as good as a full update, but it can help dramatically). This is, I believe, a huge part of why the first query addressed above improved AND stayed improved.

Loving Query Store

They say a picture is worth a thousand words. Honestly, I probably could have figured out the above issues by running a bunch of queries, looking at some DMVs, statistics histograms and the like. But it would have taken longer. Note too that you can query the Query Store directly. But the ability to instantly look at a graph, see what’s taking the most time, or executing the most, or a variety of other parameters makes the graphical interface to Query Store EXTREMELY valuable. I was able to instantly zero in on a couple of key queries and focus my energies there. By varying the timeframes I was looking at, I could try changes and see the impact within minutes. I could also look at the stored query plans and make judgments based on what they showed.

If you’re NOT using Query Store to debug performance issues, start doing it. To be honest, I haven’t used it much. I wouldn’t call myself an expert in it by any means. But, I was able to pull it up and almost instantly have insight into my client’s issues and was able to make actionable suggestions.

And to quote the product manager there after I fixed the first query simply by changing compatibility mode, “A good DBA is like having a good mechanic to work on your car.” That one made me smile.

Oh and I’ve been known to swap out the alternator on my old Subaru in under 10 minutes and have replaced the brakes a number of times. So if this DBA thing doesn’t work out, I guess I’ve got another career I can look into!

Final Note

Per my NDA, I obviously haven’t named my client. But also, simply out of respect, I haven’t named the third party tool. I don’t want folks thinking I’m trying to besmirch their name. Their product is a fine one and I’d recommend it if asked. But my client is one of their larger users and sometimes pushes it to the limits, so we sometimes find some of the edge cases. So nothing here is meant to disparage the 3rd party tool in any way (though they should replace that image field since it really doesn’t need to be one!)

Trust but Verify

This is one of those posts where you’ll just have to trust me. Honestly.

I want to talk about indexes.

About a week ago, a friend on a chat system I use mentioned how one of their colleagues had said, “oh, we don’t have to optimize the database, the server is fast enough” or words to that effect. All of us in the discussion blanched a bit. Yes, when I started in the business a 10GB database was considered large, and because of the memory limit with 32-bit SQL, we were limited to 2GB (or 3GB if you took the right steps) of memory, so it was literally impossible to keep a large database in memory. Of course now we routinely deal with databases 100s of GB in size with machines that can easily have .5TB of memory or more. This means except for writes, an entire database can easily be kept in memory.

But that said, optimization still matters. Last week I was debugging an ETL process that I’ve helped a client with. I’d love to show screen shots, but my NDA won’t allow me (hence my asking you to trust me). Ok, that’s partly a lie. I couldn’t provide too many details even if I wanted to, but the bigger issue is, I’ve since closed the windows that showed the scripts in question and the results of my changes.

One of the last things each step in the ETL does is write back to the source table an updated Salesforce ID. It’s actually a bit more complicated because what it really does is write to either a Success table or an Error table and, depending on a factor or two, a trigger will then update the source table. I had previously debugged and improved the performance of the trigger. But something was still bothering me about the performance. I looked a bit deeper, and one of the things that trigger does if there’s a success is make sure to remove the row from the Error table. This was taking longer than I expected it should, so I dug into it and I noticed that the Error table had no index.

I can’t show the original queries I used, but I can show an example of the impact of adding a simple clustered index. (See, you can’t even trust me to say I won’t show any examples! You’d better read the entire post to verify what I’m really writing!)

Here’s an example query (with some changes to hide client specific data)

select * from ErrorTable where SF__External_Id__c='005A000022IouWqIAX'

It’s a very simple query (and simpler than the actual one I was dealing with) but is enough to show the value of a proper index.

Now, in my original query, the Query Tuning Advisor actually suggested an index on SF__External_ID__c. In the example above it didn’t. There’s a canard among many DBAs that the QTA is generally useless and often it is, though I think it’s gotten better. As a consultant, I can often come into a new client and can tell when someone has gone crazy with the QTA and adopted EVERY SINGLE suggestion. In other words, they trusted it, but they never verified it. Why is this a problem? Well at times the QTA can be overly aggressive in my experience, suggesting indices that really provide little benefit, or if you add an index in response to a select query that is run say once a day, but where there are 1000s of updates a day, you might actually slow down your updates (since now the update also has to update the index). And as mentioned above, sometimes it might fail to suggest an index. (I think in this case, it didn’t suggest one on my example because the size of the underlying table was far smaller than before).

So, I like to verify that the index I’ll add will make a difference. In cases like this, I often go old school and simply bracket my test queries

set statistics IO ON
set statistics Time ON
select * from ErrorTable where SF__External_Id__c='005A000022IouWqIAX'
set statistics IO OFF
set statistics Time OFF

And then I enable Actual Execution Plan.

The results I received without any sort of index are below. Some key numbers are highlighted in red.

SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.
SQL Server parse and compile time: 
   CPU time = 47 ms, elapsed time = 63 ms.
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.

(2 rows affected)
Table 'ErrorTable'. Scan count 1, logical reads 3570, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1 row affected)

SQL Server Execution Times:
   CPU time = 16 ms,  elapsed time = 15 ms.
SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.

You’ll notice the physical reads are 0. This is nice. This means everything is in memory.

In this case, because I’m familiar with how the ErrorTable is accessed I decided a clustered index on SF__External_Id__c would be ideal. (all my updates, inserts, deletes, and selects use that to access this table).
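The index itself is a one-liner. This is a sketch rather than the exact statement from the client system: the index name and the dbo schema are assumptions. It is deliberately not declared UNIQUE, since the example query returns 2 rows for a single ID.

```sql
-- CIX_ErrorTable_SF_External_Id and the dbo schema are hypothetical.
-- Non-unique clustered index on the column every statement filters on.
CREATE CLUSTERED INDEX CIX_ErrorTable_SF_External_Id
    ON dbo.ErrorTable (SF__External_Id__c);
```

Because a clustered index determines how the table itself is stored, building it rewrites the table, so on a large or busy table it is worth scheduling for a quiet period.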

I added the index and reran the query:

SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 1 ms.
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.

(2 rows affected)
Table 'ErrorTable'. Scan count 1, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1 row affected)

SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.

SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.

Note the number of logical reads dropped by about a factor of 1000. My elapsed time dropped from 15 ms to 0 ms (or rather, to under a millisecond, which SQL Server reported as 0).

If we look at the graphical query plan results we see something similar:

First, without the index:

Table scan to find 2 rows

Table Seek to find 2 rows

That’s nice, I now know I’m doing a seek rather than a scan, but is that enough? I mean if the ErrorTable only has 2 rows, a seek is exactly the same as a scan!

So let’s dig deeper:

Query plan showing details for a scan

Query plan showing details for a seek

Here you can definitely see the dramatic improvement. Instead of reading in over 100,000 rows (at a bit over 2.5 KB per row, or over 270MB) we only need to read in 2 rows, for a total of just over 5 KB of data.

No wonder it’s faster. In fact, in the ETL process where it was originally taking about 1 minute to process 1000 rows, my query with the index was now processing 3000 rows in under 10 seconds.

The above is a bit of a contrived example, but it’s based on actual performance tuning I did last week. And this isn’t meant to be a lesson in actual performance tuning, but more to show that if you make a change (in this case adding an index) you can’t just trust it will work; you should VERIFY that it has made a difference, and more importantly, that it makes a difference for your workload. I’ve seen the QTA often make valid, but useless, index suggestions because someone ran an uncommonly used query against it and assumed the recommendation was good. Or, they’ve made assumptions about the size of the table.

So never just trust an index will help, but actually VERIFY it will help.