“Houston, We Have an Opportunity”

This is not quite the famous quote from the movie Apollo 13, one of my favorite movies. And, well, we do have a problem. But also an opportunity. I’ll get to the opportunity in a bit, but first, the problem.

My Event and Disappointment

The problem, of course, is COVID-19. At the end of the week in which I’m writing this, I was scheduled to help host the National Weeklong Cave Rescue training seminar for the NCRC. For years, I and others had been working on the planning of this event. It’s our flagship event and can attract 100 or more people from across the country, and occasionally from other countries. Every year we hold it in a different state in order to allow our students to train in different cave environments across the country (New York caves are different from Georgia caves, which are different from Oregon caves, and more) as well as to make it easier for attendees to attend a more local event.

Traditionally, the events held in New York (this would have been the fourth National Seminar there) have attracted fewer people than our events in Alabama, which is famous for its caves. But this year was different: with lots of marketing and a great base camp, we were on track not only for the largest seminar yet in New York, but for one on par with our largest seminars anywhere.

Then, in February, I started to get nervous. I had been following the news and seeing that, unlike previous outbreaks of various flus and other diseases, COVID-19 looked like it would be something different. This wasn’t going to pass quite as quickly. This was going to have an impact. As a result, I started to make contingency plans with the rest of my planning staff. I consulted with our Medical Coordinator. I talked to the camp. By March I had started to regain some optimism, but I still wasn’t 100% confident we could pull this off. And then the questions from attendees started to come in: “Are we still having it?” “What are the plans?” And a week or two later: “I need to cancel. My agency/school/etc. won’t cover the cost this year.” Finally, by the start of April, after talking to several of my planning staff and my fellow regional coordinators, it became obvious we could not, in good conscience, host a seminar in June. Yes, here in upstate New York the incidence of COVID-19 is dropping quickly. We’re starting the re-opening process. Honestly, if folks came here, I would NOT be worried about them getting infected from a local source.

However, we would have had nearly 100 people coming from across the country, including states where infection rates are climbing. Many would be crammed into planes for hours, or making transfers at airports with other travelers. So while locally we might be safe, if we held this event, where folks are in classrooms for 3-4 hours a day, then in cars to and from cave and cliff sites, and then often in caves for hours, it’s likely we would have become ground zero for a spike in infections. That would have been neither a wise nor an ethical thing to do. So we postponed until next year.

I mention all that because of what happened to me about three weeks ago, and then the news from last week: 2020 PASS Summit is going virtual. About three weeks ago I was asked to participate in a meeting with some of the folks who help run PASS, as well as other user group leaders. The goal was to discuss how to make a virtual Summit a great experience if it went virtual. This was one of several such meetings, and we brought up and discussed a lot of ideas. NONE of us were looking forward to PASS 2020 being virtual, but we all agreed it was better than nothing. And of course, as you’re well aware, last week PASS made the decision, I suspect due in large part to the very same reasons we postponed our cave rescue training event.

Sadness and Disappointment

Mostly I’ve heard, if not happy feedback, at least resigned feedback. People have accepted the reality that PASS will be virtual. I wrote above about my experience with having to postpone our cave rescue training (because it’s so hands-on, it’s impossible to host it virtually). It was not an easy decision. I’ll admit I was frustrated, hurt, disappointed, and more. I and others had put in a LOT of hard work only to have it all delayed. I know the organizers of Summit must be feeling the same way. And I know many of us attendees must feel the same. Sure, Houston is not Seattle, and I’ve come to have a particular fondness for Seattle, in part because of the opportunity to see friends there, but I was looking forward to going to Houston this year (as was my wife) and checking out a new city.

One thing that has helped buoy my emotions in regard to our weeklong cave rescue class is that over half the attendees said, “Roll my registration over to next year. I’m still planning on coming!” That was refreshing and unexpected. Honestly, I was hoping for maybe a quarter of them at best to say so. This gave me hope and the warm fuzzies.

Opportunity

Let me start by stating the obvious: a virtual event will NOT be the same as an in-person event. There will most definitely be things missing. Even with the attempts PASS will make to recreate the so-called “hallway track” of impromptu discussions, and to host other virtual events to mimic the real thing, it won’t be the same. You won’t get to check out the Redgate booth in person, hang out on the sofa at Minionware, or get your free massage courtesy of VMWare.


After a great massage courtesy VMWare.

And we’ll miss out on:


Achievement unlocked: PASS Summit 2019 Selfie with Angela Tidwell!

But, we’ll still have a LOT of great training and vendors will have virtual rooms and more.

So what’s the opportunity? Accessibility!

Here’s the thing: I LOVE PASS Summit. I think it’s a great training and learning opportunity. But let’s face it, it can get expensive, especially when you figure in travel, hotel, and food costs. This year, though, most of those costs disappear. This means that when you go to your boss, they have even less of an excuse to say, “Sorry, it’s not in the budget.” And honestly, if they DO say that, I would seriously suggest you consider paying for it out of your own pocket. Yes, I realize money might be tight, but after all the wonderful training you can then update your resume and start submitting it to companies that actually invest in training their employees.

I would also add that, from my understanding, while convention centers are by law ADA accessible, this doesn’t necessarily mean everyone with a disability can attend. There can be non-physical barriers that interfere. Hosting virtually gives more people the ability to “attend” in a way that works for them. It might be in a quiet, darkened room if they’re sensitive to noise and lights. It might be replaying sessions over and over again if they need to hear things in that fashion. It very well could be taking advantage of recorded sessions and the like in ways that I, an able-bodied person, am not even aware of. So that’s a second way in which it’s accessible.

Now, I know folks are questioning, “Well, if it’s virtual, why should we pay anything, especially if vendors are still paying a sponsorship fee?” There are several answers to that, and none of them by themselves are complete, but I’ll list some. For one, I haven’t confirmed it, but I’m fairly confident vendors are paying a lot less for sponsorship, because they won’t get the same face-to-face contact. For another, PASS takes money to run. While we often think of it as a single big weeklong event, there’s planning and effort that goes on throughout the year. This is done by an outside organization that specializes in running organizations like PASS. (Note the PASS Board is still responsible for the decision making and the direction of PASS as a whole, but day-to-day operations are generally outsourced. This is far from uncommon.) Those costs don’t disappear. There are other costs that don’t automatically disappear because the event is no longer physical. And of course there are costs that a virtual event has that the physical event doesn’t. Now EVERY single session will be available as a live stream (as well as recorded for later download), and this requires enough bandwidth and tools to manage them. And it requires people to help coordinate. Making an event virtual doesn’t automatically make it free to run.

The Future

Right now, I’m on track for hosting the NCRC Weeklong Cave Rescue training event next year at the location we planned on for this year. Our hope, of course, is that by then COVID-19 will be a manageable problem. But in the meantime, I’ll keep practicing my skills, sharing my knowledge, and, when and where I can, caving safely. And as always, I’m willing to take new folks caving. If you’re interested, just ask!

I don’t know what the PASS 2021 Summit will bring or even where it will be. But I know this year we can make the most of the current situation and turn it into an opportunity to make PASS something new and more affordable. Yes, it will be different. But we can deal with that. So, register today and let’s have a great PASS 2020 Summit in the meantime! I look forward to seeing you there. Virtually, of course!

“It’s a Jump to the left…

… and then a double-hop to the right.” Or something like that.

I’ve commented before on the fact that I’m a consultant. I enjoy it. People will ask me what I do, and it varies. At one client, they refer to their VB app as “the database,” and because they found an ad of mine on Google where I talked about database administration, they hired me. About 80% of the work I do for them is actually on the VB app or related to it; very little is what I’d consider traditional database work. But that’s ok, they’re a pleasure to work with and I enjoy the work. Another client I recently worked with asked me to help them conduct an audit of their web-based product and help them with some steps to make it more secure. I was more than happy to help.

And then there’s my largest, by far, client. I actually get to do a fair amount of work that most of my #sqlfamily would recognize as “database work”. But there, perhaps more than any other, I describe my duties as “DBA and other duties as assigned.”  So between the work at this client and all my other clients, I’m often jumping or stepping around stuff.

The Double-Hop

This time, though, I was asked to double-hop. What is that exactly? It’s an issue that has to do with how Windows can pass security credentials from one server to another. This article, while old, describes it well. This was essentially the situation my client was trying to solve: users needed to use their Active Directory (AD) credentials to log into the reporting server (RS_Server) and run a report that in turn accessed data on a separate database server (DB_Server); thus, the double-hop. Now, from my point of view, this isn’t really database work, but since the reporting server talks to the database server, it was dropped in my lap under “other duties as assigned.”

Now, honestly, this SHOULD have been simple to solve. It wasn’t. One reason was that, like many companies, this client has a separate team that handles much of their infrastructure needs, such as AD requests. And those requests have to go through tickets. To be clear, I support this concept, in theory. In practice, it can often take 2-3 weeks for even simple requests to go through. This meant that my first attempt at solving the double-hop failed. Their IT department did exactly what I requested. Unfortunately, there was a typo in my ticket. So it failed. So, round two. And round two didn’t work. Nor did round three. At this point, though, it wasn’t due to typos or mistakes on my end.

I started reading every article I could. My great editor at Redgate, Kathi Kellenberger, has one, and trust me, I wasn’t too shy to ask at that point! But nothing was working. I even asked another DBA at the client (she actually heads up a different group and is their lead DBA). She pointed me to one of her people, saying, “Talk to him, he solved it.” I did, and he hadn’t. His solution was the one we were trying to avoid (basically using a fixed user in the datasource).

Frustration Sets In

I was getting frustrated. Fortunately, at this point I started to exploit a loophole in the ticketing process. Since the problem wasn’t being fixed, I was able to keep the ticket open and ended up getting assigned someone from their IT group who was as interested in fixing this as I was. This meant that rather than “open a ticket, wait 1-2 weeks, have the ticket be closed as complete, test, find out it failed, rinse, repeat” we could now actually schedule Zoom sessions and make changes in real time. AND… nothing we tried worked.

At this point you’re probably saying, “Yeah, yeah, get to the point. Did you solve it?”  The answer is yes, but I wanted you to feel a bit of my pain first, and I needed to make this post long enough to make it worth posting.

A Solution!

Now, let me say, I wish I could write out an exact recipe card solution for you. For various reasons, I can’t. But bear with me.

Finally, we found an additional resource in the IT group who had solved this before. His first recommendations were, yet again, the solutions I mentioned above. Once he saw they didn’t work, he agreed we were trying the right thing. So he said, “Well, let’s try a solution known as ‘Resource-based Kerberos Constrained Delegation.’” This didn’t work at first either.

But then he suggested that we turn on 128- and 256-bit Kerberos encryption on the DB_Server SQL service account. Bingo, that worked. Mostly. More on that in a second.

So here’s the setup we ended up with.

  1. RS_Server – running reporting services under an account domain\RS_Server_Service
  2. DB_Server – running SQL Server under an account domain\DB_Server_Service
  3. Setup some SPNs
    1. MSSQLsvc/DB_Server domain\DB_Server_Service
    2. MSSQLsvc/DB_Server.domain.com domain\DB_Server_Service
    3. Note in this case you do not appear to need one on the RS_Server side.
  4. Run a PowerShell script
    1. $FEIdentity1 = Get-ADUser -Identity domain\RS_Server_Service
    2. $BEIdentity = Get-ADUser -Identity domain\DB_Server_Service
    3. Set-ADUser $BEIdentity -PrincipalsAllowedToDelegateToAccount $FEIdentity1
  5. At this point things should have worked, but they didn’t until we enabled the encryption options on the service account:

    Kerberos Encryption

  6. Then on our test box, things magically worked! Ok, not quite magically, but things worked. We had a solution.
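For reference, the SPNs in step 3 are typically registered with the setspn utility. The server and account names below are this post’s placeholders, so adjust for your environment:

```powershell
# Register the SQL Server SPNs from step 3 against the service account
# (run with domain admin rights; -S checks for duplicates before adding)
setspn -S MSSQLSvc/DB_Server domain\DB_Server_Service
setspn -S MSSQLSvc/DB_Server.domain.com domain\DB_Server_Service

# Depending on your setup, a port-specific SPN such as
# MSSQLSvc/DB_Server.domain.com:1433 may also be needed.

# List what's registered, to verify
setspn -L domain\DB_Server_Service
```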

And I was even more ecstatic when, later that day, I tested this on a second report server box we had, and it too was suddenly working without any changes. And this was a box where we had NOT even set up an SPN for the original double-hop solution, so I was pretty confident that Resource-based Kerberos Constrained Delegation was working. In addition, in the rsreportserver.config file, the only authentication type enabled was NTLM.

The next step was to try this on a production server. In that case, I did have to reconfigure the service it was running under to use the domain account domain\RS_Server_Service.

And… my test failed.

I was at wits’ end. I couldn’t quite figure out what was different. I checked my service names, my SPNs, the rsreportserver.config file, and more. Nothing was working. I took a break, came back, and had an idea. In the datasource I changed it from:

Data source=DB_Server;Initial Catalog=TestDB

to

Data source=DB_Server.domain.com;Initial Catalog=TestDB

Bingo, it worked! A little digging confirmed my suspicion. This client actually has multiple DNS domains, and the DNS suffix ordering and the like under the TCP/IP settings was different on this box than on the other two boxes. And that made the difference.

Sure enough, when I tried deploying to a fourth box, I had the same issue, but changing the datasource to the Fully Qualified Domain Name (FQDN) solved it.
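If you suspect something similar, a quick way to compare what each box actually resolves is a sketch like the following (the server names are placeholders from this post):

```powershell
# Compare how the short name and the FQDN resolve from this box
Resolve-DnsName DB_Server | Select-Object Name, IPAddress
Resolve-DnsName DB_Server.domain.com | Select-Object Name, IPAddress

# Show the DNS suffix search list this machine applies to short names;
# a different ordering here can explain different behavior between boxes
Get-DnsClientGlobalSetting | Select-Object -ExpandProperty SuffixSearchList
```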

So, my take-aways for this week:

  • Resource-based Kerberos Constrained Delegation may be a better way of solving the double-hop problem than the solution generally proposed.
  • Once you’ve set up the “target” SQL Server service account and source Reporting Server service accounts, additional reporting servers can be added to the mix without needing assistance from a domain admin.
  • It appears you still need an SPN (well multiple) for the SQL Server itself.
  • You need to run a PowerShell script to set up the accounts. Note that if you run it again, it overwrites the old settings, so you need to add ALL of the source accounts in a single step.
  • Depending on your domain’s security setup, you may need to enable 128/256-bit Kerberos encryption on the service account.
  • DNS resolution may determine if you can use just the NetBIOS name or the FQDN in your data sources.
  • This solution will NOT work if you need to cross domains or have more complex setups, but in general, it can be simpler to setup and to maintain, especially if you have limited access to making changes to AD in the first place.
  • My reading indicates this only works on Windows 2012 and beyond. But you shouldn’t be running older versions of Windows in any case!
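A couple of the points above can be sketched in PowerShell (the account names are this post’s placeholders, and the second reporting account is hypothetical). Because Set-ADUser replaces the delegation list each time, every source account must be listed in one call, and the property can be read back to verify:

```powershell
# Grant delegation rights for TWO reporting-server accounts in one step;
# re-running Set-ADUser REPLACES the list, so include all of them each time
$FEIdentity1 = Get-ADUser -Identity RS_Server_Service
$FEIdentity2 = Get-ADUser -Identity RS_Server2_Service   # hypothetical second box
$BEIdentity  = Get-ADUser -Identity DB_Server_Service
Set-ADUser $BEIdentity -PrincipalsAllowedToDelegateToAccount $FEIdentity1, $FEIdentity2

# Read the setting back to confirm what's currently allowed
Get-ADUser DB_Server_Service -Properties PrincipalsAllowedToDelegateToAccount |
    Select-Object -ExpandProperty PrincipalsAllowedToDelegateToAccount
```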

If I get the time and energy, I may setup a test environment in my home lab to further experiment with this and write up better demos, but for now, use this as you can. Hopefully it’ll save you some of the stress I experienced.

And that’s it from here, back to other duties as assigned.


Quiet Time and Errors

I wrote last week about finally taking apart our dryer to solve the loud thumping issue it had. The dryer noise had become an example of what is often called normalization of deviance. I’ve written about this before, more than once. It is a very common occurrence and one I would argue is at times acceptable. If we reacted to every change in our lives, we’d be overwhelmed.

That said, like the dryer, some things shouldn’t be allowed to deviate too far from the norm, and some things are more important than others. If I get a low-gas warning, I can probably drive another 50 miles in my car. If I get an overheated-engine warning, I probably shouldn’t try to drive another 50 miles. The trick is knowing what’s acceptable and what’s not.

Yesterday I wrote about some scripting I had done. This was in response to an issue that came up at a customer site. Nightly, a script runs to restore a database from one server to a second server. Every morning we’d get an email saying it was successful. But there was a separate email, about a separate task designed to run against that database, that indicated a failure. And that particular failure was actually pretty innocuous. In theory.

You can see where I’m going here. Because we were trusting the email of the restore job over the email from the second job, we assumed the restore was fine. It wasn’t. The restore was failing every night but sending us an email indicating a success.

We had unwittingly accepted a deviance from the norm. Fortunately, the production need for this database hadn’t started yet. But it will soon. This is what drove me to rewrite and redeploy the scripts on Friday and on Monday.

And here’s the kicker. With the new script, we discovered the restore had also been failing on a second server (for a completely different reason!).

Going back to our dryer, it really is amazing how much we had come to expect the thunking sound and how much quieter it is now. I’ve done nearly a half-dozen loads since I finally put in the new rollers, and every time I push the start button I still cringe, waiting to hear the first thunk. I had lived with the sound so long I had internalized it as normal. It’s going to take a while to overcome that reaction.

And it’s going to take a few days or even weeks before I fully trust the restore scripts and don’t cringe a bit every morning when I open my email for that client and check for the status of overnight jobs.

But I’m happy now. I have a very quiet dryer and I have a better set of scripts and setup for deploying them. So the world is better. On to the next problems!

Checking the Setup

A quick post outside of my usual posting schedule.

I was rewriting a T-SQL sproc I have that runs nightly to restore a database from one server to another. It had been failing for reasons beyond the scope of this article. But one of the issues we had was that we didn’t know it was failing. The error checking was not as good as I would have liked. I decided to add a step that would email me on an error.

That’s easy enough to do. In this case I wanted to use the stored procedure sp_notify_operator. This is useful since I don’t have to worry about passing in an email address or changing it if I need to update things; I can just update the operator. However, the various servers at this client had been installed over a several-year period, and I wasn’t sure that all of them had the same operator configured. I was also curious as to where the operators’ emails went on those machines. And I had a decent number of machines I wanted to check.
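The notification step itself can be as simple as wrapping the restore in TRY/CATCH. This is a minimal sketch, not my client’s actual script; the operator name and backup path are hypothetical placeholders:

```sql
-- Minimal sketch; 'SQL Alerts' and the backup path are placeholders
BEGIN TRY
    RESTORE DATABASE TestDB
        FROM DISK = N'\\backupserver\backups\TestDB.bak'
        WITH REPLACE;
END TRY
BEGIN CATCH
    EXEC msdb.dbo.sp_notify_operator
        @name = N'SQL Alerts',
        @subject = N'Nightly restore of TestDB failed',
        @body = N'The restore failed. Review the SQL Server error log and job history.';
    THROW;  -- re-raise so the job itself still reports failure
END CATCH;
```

Note this assumes Database Mail is configured and SQL Server Agent is running, since sp_notify_operator relies on both.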

Fortunately, due to previous work (and you can read more here) I have a JSON file on my box so I can quickly loop through a list of servers (or if need be by servers in a particular environment like DEV or QA).

$serverobjlist = Get-Content -Raw -Path "$env:HomeDrive$env:HomePath\documents\WindowsPowerShell\Scripts\SQLServerObjectlist.json" | ConvertFrom-Json

foreach ($computername in $serverobjlist.computername)
{
    $results = Invoke-Sqlcmd -ServerInstance $computername -Query "select name, email_address from msdb.dbo.sysoperators"
    Write-Host $computername $results.name $results.email_address
    $results = Invoke-Sqlcmd -ServerInstance $computername -Query "select name from msdb.dbo.sysmail_profile"
    Write-Host $computername $results.name `n
}

This gave me a list of which operators were on which servers and where the emails went. Now, if this were a production script I’d probably have made things neater, but this worked well enough to do what I needed. Sure enough, one of the servers (ironically, one of the more recently installed ones) was missing the standard mail profile we set up. That was easy to fix because, of course, I have that scripted out. Open the T-SQL script on that server, run it, and all my servers now had the standard mail profile.
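For anyone who hasn’t scripted out a Database Mail profile before, a minimal sketch looks something like this. The profile, account, and SMTP names are hypothetical, not my client’s actual script:

```sql
-- Create a mail account, a profile, and tie them together (names are placeholders)
EXEC msdb.dbo.sysmail_add_account_sp
    @account_name    = N'Standard Mail Account',
    @email_address   = N'sqlalerts@example.com',
    @mailserver_name = N'smtp.example.com';

EXEC msdb.dbo.sysmail_add_profile_sp
    @profile_name = N'Standard Mail Profile';

EXEC msdb.dbo.sysmail_add_profileaccount_sp
    @profile_name    = N'Standard Mail Profile',
    @account_name    = N'Standard Mail Account',
    @sequence_number = 1;
```

Keeping this in a script is what makes it trivial to bring a stray server back in line with the others.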

Once I had confirmed my new restore script could run on any of the servers and correctly send email if there was an error it was time to roll it out.


Successful deploy to the UAT environment

So, one quick PowerShell script, an updated T-SQL script, and a PowerShell deploy script later, my new sproc has been deployed to UAT and other environments.

And best of all, because it was logged, I knew exactly when I had done it and on what servers and that everything was consistent.

I call that a win for a Monday. How is your week starting?


Scripting Made Easier

I was recently asked to write a quick blog post for a local tech group. You can read it here. I commented on some trends in SQL Server there. But I wanted to add a thought on a particular positive trend I’ve been seeing in the design of SQL Server itself.

As regular readers will know I’m a huge fan of PowerShell and have been using and writing about it more and more. But, I came across a requirement last week where PowerShell would have been overkill. Basically, my client and I had identified a number of databases that had the wrong owner. We wanted to change them over to the proper owner.

Now, being old school, my first reaction was to call sp_changedbowner. Any DBA who has been around for enough years has probably seen it and used it. But if you look at the most recent documentation page from Microsoft, you’ll notice a light blue box with a warning:

 Important

This feature is in maintenance mode and may be removed in a future version of Microsoft SQL Server. Avoid using this feature in new development work, and plan to modify applications that currently use this feature. Use ALTER AUTHORIZATION instead.

Now, on one hand, since this was a one-off script, I could have safely used sp_changedbowner. But there is an issue with it: it has to be run in the database whose owner you want to change. There is no parameter to specify the database. To change the owner on a single database, that’s not an issue. Go into the database and run the sproc.

But, when you might have a dozen databases, that becomes a lot of work.

Now, my first inclination for something like this is to write a query similar to:

select name, 'sp_changedbowner ''newowner''' from sys.databases where owner_sid != SUSER_SID('oldowner');

This would generate a nice list, except that we’d still have to go into each database and execute the command. Now, I could get a little smarter and write something like:

select 'use ' + name + ' sp_changedbowner ''newowner'' ' from sys.databases where owner_sid != SUSER_SID('oldowner');

This has the advantage of putting in the USE command so I can change databases, but I would still have to manually insert carriage-return/linefeeds.

So it’s a slight improvement, but only a slight one. Now, I could resort to PowerShell, get a list of databases, and loop through them with a ForEach, but as I said, I wanted to keep this simple since it was basically a one-off (well, really a three-off, since I had to do it on DEV, UAT, and PROD). But the point was still the same: I wanted to keep it fairly simple and try to keep it all in T-SQL.
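For completeness, the PowerShell route I decided against might have looked something like this (the server name and owner are placeholders, and this assumes the SqlServer module’s Invoke-Sqlcmd is available):

```powershell
# Find databases with the wrong owner, then fix each one in a loop
$databases = Invoke-Sqlcmd -ServerInstance "DEV_Server" `
    -Query "select name from sys.databases where owner_sid <> SUSER_SID('newowner');"

foreach ($db in $databases)
{
    Invoke-Sqlcmd -ServerInstance "DEV_Server" `
        -Query "alter authorization on database::[$($db.name)] to newowner;"
}
```

It works, but for a handful of databases it’s more machinery than the T-SQL below.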

But if we go back to the notice from Microsoft I quoted above, we end up with a better solution. ALTER AUTHORIZATION lets you put the name of the database right in the command. So something like this will work quite smoothly:

select 'alter authorization on database::[' + name + '] to newowner' from sys.databases where owner_sid != SUSER_SID('oldowner');

This results in a column that I can then cut and paste into a new window and execute without having to make any edits. It’s also a solution I know will work in the future.

I’ll add a final way some DBAs might have approached this, and that is with the undocumented stored procedure sp_msforeachdb. However, I try to avoid that as much as possible because it is, in fact, undocumented, and honestly, I find the syntax a bit confusing at times.

Anyway, the real point I wanted to make in this post was not necessarily how to change the DB owner, but to make an observation. Over the years, I’ve noticed that Microsoft has been improving T-SQL and making tasks that were once unnecessarily complicated easier to script out. It’s easier to change a DB owner now than it was years ago. This is evident in other areas of SQL Server. There are many items that previously were easier done via SSMS, or could pretty much only be done via SSMS, that can now be done via script. (And of course, pretty much any action you can do in SSMS now has a “Script” option, so you can script it out, modify it if needed, and save it.)

So, my pattern lately has been, “script everything” and Microsoft has certainly been making that easier!

An Ounce of Prevention?

There’s an old saying in medicine: “When you hear hoof beats, think horses, not zebras.” Contrary to what one might think after watching House, the truth is that when you’re presented with a set of symptoms, you start with the most likely, well, because it is most likely! But as House illustrates, sometimes it can be the unlikely.

In First Aid, especially wilderness or backcountry medicine, there’s an acronym that is often used called SAMPLE. This is a mnemonic to help rescuers remember what data to gather:

  • S – Signs/Symptoms – What do you see, observe? (i.e. what’s going on).
  • A – Allergies – Perhaps the problem is an allergic reaction, or they might be allergic to whatever drug you want to give them.
  • M – Medicines – What medications are they on? Perhaps they’re diabetic and haven’t taken their insulin, or they’re on an anti-seizure medicine and need some.
  • P – Past, pertinent medical history. You don’t care that they broke their ankle when they were 5. But perhaps they just underwent surgery a few weeks ago? Or perhaps they have a history of dislocating their shoulder.
  • L – Last oral intake. Have they eaten or drunk anything recently? This will drive your decision tree in a number of ways.
  • E – Events leading up to the injury.  Were they climbing to the top of the cliff and fell, or did they simply collapse at the bottom? The former may suggest you look for a possible spinal injury, the latter probably indicates something else.

I’m reminded of this because of how I spent my Presidents’ Day. I woke up, checked a few emails, and noticed that two I was expecting from a client’s servers had never arrived. I logged in to see what was going on. It turns out there was nothing going on. No, not as in “nothing wrong going on,” but more as in “nothing at all was happening; the databases weren’t operating right.” The alert system could connect to the database server, so no alerts had been sent, but actually accessing several of the databases, including master, resulted in errors.

Master.mdf corruption

Not an error you want to wake up to!

And what was worse, was the DR server was exhibiting similar symptoms!

So modifying SAMPLE a bit:

  • S – Signs/Symptoms – Well, databases are throwing corruption errors on two servers.  This was extended to the ERRORLOG files on both servers.
  • A – Allergies – Well, servers don’t have allergies, but how about known bugs? That’s close enough.  Nope, nothing that seems to apply here.
  • M – Medicines – I’ll call this antivirus software and <redacted>. (For client privacy reasons I can’t specify the other piece of software I want to specify, but I’ll come back to it.)
  • P – Past, pertinent medical history. Nothing; these servers had been running great up until now. One has been in production for over 2 years; the other has been up for about 2 months, having been brought up as a DR box for the first.
  • L – Last oral intake. Let’s make this last data intake. Due to forensics, we determined the corruption on both servers occurred around 3:00 AM EST.  Checking our logs, jobs, and other processes, there’s nothing special about the data the primary server took in at this time. If anything, disk I/O was lower than average.  And, fortunately, we can easily recreate any data that was sent to the server after the failure.
  • E – Events leading up to the injury. This is where things get interesting.
    • There were some zero-day patches applied to both servers over the weekend.
    • On Saturday, I had finally set up log-shipping between the two servers.

So, we’ve got three possibilities. Well, four really.

  1. The zero-day patches caused an issue about 48 hours later. Possible, but unlikely, given the client has about 1600 servers that were also patched and have not had issues.
  2. Log-shipping somehow caused problems. But again, the new log-shipping setup had run for about 36 hours without issue. And, best we can tell, the corruption occurred on the secondary BEFORE the primary. And log-shipping doesn’t apply to the master database or the ERRORLOG file.
  3. Some unknown interaction.  This I think is the most likely and is where the <redacted> from the M above comes into play.
  4. Pure randomness, and it’ll never happen again. I hate this option because it just leaves me awake at night. I added this only for completeness.

Without going into detail, our current theory is that some weird interaction between <redacted> and log-shipping is our cause. Of course the vendor of <redacted> is going to deny this (and has) but it’s the only combination of factors that seems to explain everything. (I’ve left out a number of additional details that also helped us get to this conclusion).

So for now, we’ve disabled log-shipping and are going to make some changes to the environment before we try log-shipping again.

Normally I think horses, but we might have a herd of zebras on this one. And ironically, setting up for DR may have actually caused a Sev 1 outage. So the ounce of prevention here may not have been worth it!

And who said my Wilderness First Aid wouldn’t come in handy?

Them’s the brakes…

I need brake work done on my car.  I scheduled an appointment with my mechanic for Thursday.

Now normally brake work might not be worth writing about, but this time it got me thinking. About 9 years ago I taught myself to change the brakes on my Subaru. It wasn’t that hard, but it definitely took longer than I would have liked. I’ve learned how to do it much faster since then.

Back then I did it because I was between jobs and had more time than money. It made sense at the time. Since then I’ve generally continued to do them myself. It doesn’t take long and as I’ve said in the past, sometimes I like working with my hands. It reminds me not all of life is simply databases and servers.

So why this time? The simple answer is, my garage floor is cold! Ayup, it’s a funny thing, but in the winter, things get cold! And honestly, I just don’t feel like sitting/lying on the concrete while I do the work. So, I’ll pay an expert who can pop it up on a lift and do the work in probably half the time I can. I remind myself it’s why I work: so I can pay others to do work I don’t care to do, or that they can do better than I can.

Years back, at my last job as Director and later VP of IT, I changed our email provider from one company to another (don’t ask me who they were; I can’t recall over a decade later). My team didn’t like this. They kept insisting they could run the mail servers themselves. I had no doubt that they could. Running an Exchange server isn’t the most difficult thing in the world. That said, I kept saying no, for a simple reason: we weren’t an email company! Providing reliable email for a business like ours meant, at the very least, strong spam protection and of course high availability, ideally with geographic redundancy. That meant at least two Exchange servers, spam filters, and more. This alone would have cost more than what we paid annually. Add in the time my team would have spent on email issues, and it just wasn’t going to be worth it.

And, we’re seeing that more and more with services such as Azure. Sure, many businesses still change their own brakes, err, run their own SQL Server. But more and more are outsourcing to platforms such as Azure. Still, there are a number of companies that for various reasons don’t do that. Fortunately for them, there are consultants like me to help them with their servers. SQL Server has always been easier to maintain than some of its competition, especially when it was first available. But that has never meant it needed no intervention. Someone still needs to make sure that jobs are being run, that backups are available, and more.

So, some days I change my own brakes because it’s fun and easy, some days I pay someone.  Some days my clients handle their own SQL Server issues, and other days, they pay me.

No real revelation here. No real advice on when to draw the line on outsourcing, just an observation. But in my case, my concrete is too cold, so I’ll pay someone else to do the dirty work.

How Much We Know

Last night I had the privilege of introducing Grant Fritchey as our speaker to our local user group. He works for Redgate, who was a sponsor. The topic was 10 Steps Towards Global Data Compliance. Between that and a discussion I had with several members during the informal food portion of our meeting, I was reminded of something that’s been on my mind for a while.

As I’ve mentioned in the past, I’ve worked with SQL Server since the 4.21a days. In other words, I’ve worked with SQL Server for a very long time. As a result, I recall when SQL Server was just a database engine. There was a lot to it, but I think it was safe to say that one could justifiably consider themselves an expert in it with a sufficient amount of effort. And as a DBA, our jobs were fairly simple: tune a query here, set up an index update job there, do a restore from backups once in a while. It wasn’t hard, but there was definitely enough to keep a DBA busy.

But, things have changed. Yes, I still get called upon to tune a query now and then. Perhaps I’m making sure stats are updated instead of rerunning an index rebuild, and I still get called upon to restore a database now and then. But now my job includes so much more. Yesterday I was writing a PowerShell script for a client. This script calls an SFTP server, downloads a file, unzips it, and then calls a DTSX package to load it into the database. So now I’m expected to know enough PowerShell to get around. I need to know enough SSIS to write some simple ETL packages. And the reason I was rewriting the PowerShell script was to make it more robust and easier to deploy, so that when I build out the DR box for this client, I can more easily drop it in place and maintain it going forward. Oh, did I mention that we’re looking at setting up an Availability Group using an asynchronous replica in a different data center? And before we even build that out, I need to consult with the VMware team to get a couple of quick and dirty VMs set up so I can do some testing.

And that was just Monday.  Today with another client I need to check out the latest build of their application, deploy a new stored procedure, and go over testing it with their main user. Oh, and call another potential client about some possible work with them. And tomorrow, I’ll be putting the finishing touches on another PowerShell article.

So what does this have to do with last night’s meeting on Global Data Compliance? Grant made a point that in a sense Data Compliance (global or otherwise) is a business problem. But guess who will get charged with solving it, or at least portions of it?  Us DBAs.

As I started out saying, years ago it was relatively easy to be an expert in SQL Server. It was basically a single product, and the lines between it and other work tended to be fairly distinct and well drawn. Today though, it’s no longer just a database engine. Microsoft correctly calls it a data platform. Even PASS has gone from being an acronym for the Professional Association for SQL Server to simply PASS.

Oh, there are still definitely experts in specific areas of the Microsoft Data Platform, but I’d say they’re probably rarer now than before. Many of us are generalists.

I mentioned above that I’d probably be more likely to update stats than rebuild an index these days. And while I still deal with backups, even just the addition of compression has made that less onerous, as I worry less about disk space, network speed, and the like. In many ways, the more mundane tasks of SQL Server have become automated, or at least simpler, and take up less of my time. But that’s not a problem for me; I’m busier than ever.

So, long gone are the days where knowing how to install SQL Server and run a few queries is sufficient. If one wants to work in the data platform, one has to up their game. And personally, I think that’s a good thing. What do you think? How has your job changed over the past decade or more? I’d love to hear your input.

Crossing the Threshold…

So it’s the usual story. You need to upgrade a machine, the IT group says, “no problem, we can virtualize it, it’ll be better! Don’t worry, yes, it’ll be fewer CPUs, but they’ll be much faster!”

So, you move forward with the upgrade. Twenty-three meetings later, 3 late nights, one OS upgrade, and two new machines forming one new cluster, you’re good. Things go live.  And then Monday happens. Monday of course is the first full day of business and just so happens to be the busiest day of the week.

Users are complaining. You look at the CPU and it’s hitting 100% routinely. Things are NOT distinctly better.

You look at the CPUs and you notice something striking:

[CPU graph: CPU 8 is showing a problem]

4 of the CPUs (several are missing from this graphic) are showing virtually no utilization while the other 8 are going like gangbusters. Then it hits you: the way the IT group set up the virtual CPUs was not what you needed. They set up 6 sockets with 2 cores each, for a total of 12 cores. This shouldn’t be a problem, except that SQL Server Standard Edition uses the lesser of 4 sockets or 24 cores. Because your VM has 6 sockets, SQL Server refuses to use two of them.

You confirm the problem by running the following query:

SELECT scheduler_id, cpu_id, status, is_online FROM sys.dm_os_schedulers

This shows only 8 of your 12 CPUs are marked VISIBLE ONLINE.
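If you prefer a summary over eyeballing 12 rows, the same DMV can be grouped by status. A quick sketch (on the misconfigured VM above you’d expect 8 schedulers VISIBLE ONLINE and 4 VISIBLE OFFLINE; the DAC scheduler may also appear with its own status):

```sql
-- Count CPU schedulers by status; VISIBLE OFFLINE schedulers are cores
-- the edition or affinity settings won't let SQL Server use.
SELECT [status], COUNT(*) AS scheduler_count
FROM sys.dm_os_schedulers
WHERE [status] LIKE 'VISIBLE%'
GROUP BY [status];
```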

This is fortunately an easy fix. A quick outage and your VM is reconfigured to 2 sockets with 6 cores apiece. Your CPU graphs now look like:

[CPU graph: a better CPU distribution]

This is closer to what you want to see, but of course since you’re doing your work at night, you’re not seeing a full load. But you’re happier.

Then Monday happens again. Things are better, but you’re still not happy. The CPUs are running on average at about 80% utilization. This is definitely better than 100%, but your client’s product manager knows they’ll need more processing power in coming months, and running at 80% doesn’t give you much growth potential. The product manager would rather not have to buy more licenses.

So, you go to work. And since I’m tired of writing in the 2nd person, I’ll start writing in 1st person moving forward.

There are a lot of ways to approach a problem like this, but often when I see heavy CPU usage, I want to see what sort of wait stats I’m dealing with. They may not always give me the best answer, but I find them useful.

Here are the results of one quick query.

Fortunately, this being a new box, it was running SQL Server 2016 with the latest service pack and CU. This meant that I had some more useful data.

[Wait stats results: CXPacket and CXConsumer telling the tale]

Note one of the suggestions: Changing the default Cost Threshold for Parallelism based on observed query cost for your entire workload.

Given the load I had observed, I guessed the Cost Threshold was way too low. It was in fact set to 10. At that setting, during testing, I saw a CPU graph that looked like this:

[CPU graph: 43.5% at Cost Threshold of 10]

I decided to change the Cost Threshold to 100 and the graph quickly became:

[CPU graph: 25% at Cost Threshold of 100]

Dropping from 43.5% to 25.6%. That’s a savings you can take to the bank!
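The change itself is a couple of lines of T-SQL. Note that Cost Threshold for Parallelism is an advanced option, so it has to be exposed first (a sketch; run as a sysadmin, and ideally test the value against your own workload rather than taking 100 as gospel):

```sql
-- Expose advanced options, then raise the Cost Threshold for Parallelism
-- so only genuinely expensive queries go parallel.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'cost threshold for parallelism', 100;
RECONFIGURE;
```

The change takes effect immediately, with no restart, which is what made the quick A/B testing below practical.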

Of course, that could have been a fluke, so I ran several 5-minute snapshots: I set the threshold to 10 for 5 minutes and collected data, then set it to 100 for 5 minutes and collected data.

Cost Threshold = 10:

  Run   CXPacket waits   CXPacket wait time (ms)   CXConsumer waits   CXConsumer wait time (ms)
  1     635533           5611743                   563830             3551016
  2     684578           4093190                   595943             2661527
  3     674500           4428671                   588635             2853673

Cost Threshold = 100:

  Run   CXPacket waits   CXPacket wait time (ms)   CXConsumer waits   CXConsumer wait time (ms)
  1     0                0                         0                  0
  2     41               22                        13                 29443
  3     1159             8156                      847                4328

You can see that over 3 runs, a threshold of 100 versus 10 made a dramatic difference in the total time spent waiting in each 5-minute window.
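For anyone who wants to reproduce this kind of before/after measurement, it can be done with a simple delta against sys.dm_os_wait_stats. A minimal sketch (the temp-table name is my own; the 5-minute delay matches the windows used above):

```sql
-- Snapshot the CX* waits, wait 5 minutes, then report the delta for the window.
SELECT wait_type, waiting_tasks_count, wait_time_ms
INTO #cx_before
FROM sys.dm_os_wait_stats
WHERE wait_type IN ('CXPACKET', 'CXCONSUMER');

WAITFOR DELAY '00:05:00';

SELECT w.wait_type,
       w.waiting_tasks_count - b.waiting_tasks_count AS waits_in_window,
       w.wait_time_ms        - b.wait_time_ms        AS wait_ms_in_window
FROM sys.dm_os_wait_stats AS w
JOIN #cx_before AS b ON b.wait_type = w.wait_type;

DROP TABLE #cx_before;
```

The counters in sys.dm_os_wait_stats are cumulative since the last restart (or manual clear), which is why the delta, not the raw numbers, is what matters.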

The other setting that can play a role in how parallelism impacts performance is MAXDOP. In this case, testing didn’t show any real performance difference when changing that value.

At the end of the day though, I call this a good day. A few hours of my consulting time saved the client thousands of dollars by avoiding the wrong and expensive road of adding more CPUs and SQL Server licenses. There’s still room for improvement, but going from a box where only 8 of the 12 CPUs were being used, running at 100%, to a box where average CPU usage is close to 25% is a good start.

What’s your tuning success story?

52

52 is an interesting number. It’s the number of weeks in the year. It’s the number of cards in a deck. It’s the number of Earths in the DC Multiverse. It’s an untouchable number, something I just learned. It’s the atomic number of tellurium. In fact, Wikipedia lists a good deal of trivia associated with it: 52.

It also just happens to be the number of times I’ve been around the Sun, though strictly speaking that depends on whether you’re counting sidereal or tropical years, and on the fact that I was born at night. But I think we’re close enough.

And it just so happens my birthday falls on the day I usually blog. So rather than something technical (though if I can get permission from a client, I may have something fun and technical soon) I thought I’d post some reflections and thoughts.

For me birthdays are both interesting and boring. I’m glad I’ve reached another milestone. But honestly, after age 25 when my car insurance rates went down, I haven’t given individual birthdays much thought. That’s not strictly true, I sometimes think about the fact that I’ve passed the point where statistically I’m looking at fewer days ahead of me than behind me.

I grew up in a small town, Falls Village CT, and parts of me never have left it. When I stop to daydream, my thoughts take me to the town green where we often played, or the woods behind my dad’s house, or the sand quarry behind the depot I grew up in. It was a safe and quiet life. I watched the world move from bell-bottoms to Reagan power ties.

While just a teenager, I and some friends ran not one but two Monopoly marathons to raise money for the Muscular Dystrophy Association. The first year we played for 100 hours (in teams), and the second, 150 hours. I was and am still quite proud of the organizing that took and the money we raised.

Since then I’ve done a lot. I got to thinking about that last night at the Capital Area SQL Server User Group meeting. I’m proud to lead this group. I really enjoy, as I’ve noted before, giving back to the community that has helped me so much.

I’m proud to be a Regional Coordinator for the National Cave Rescue Commission. I can literally say the work the NCRC does saves lives. It’s an honor, and humbling, when folks come up to me and tell me how their training has made a difference in people’s lives.

I’m proud of much of what I’ve done in my avocations and vocations, even if at times I’m a victim of imposter syndrome. There are still many times when someone will ask me a question and my first thought is, “why are they asking me, I’m just a kid and… oh wait… no, I am the expert here and I’m far from being a kid.” This is especially true when people I look up to turn around and ask me for advice. This happens a lot in the SQL world.

I’m very proud of my family, especially my son and daughter who are growing up to be wonderful adults, capable of critical and deep thinking. They will make an impact on the world and I don’t think as a parent I could want for anything more.

And of course proud of my wife, but I can’t take credit there, she’s a wonderful person in her own right. I just married well.

One common thread I’ve realized in my life is that I enjoy teaching and sharing my knowledge. I also, as anyone who knows me can attest, love a good debate. Bring your facts to the table. Teach me something, change my mind, or be willing to have yours changed. I still recall a debate I had with someone once about a detail of the Constitution. She made one claim, I made another. We finally settled it by finding a copy of the text and seeing who was right. Afterwards there was no rancor or hurt. We both appreciated the intellectual exercise and the correction of fact. But even opinions can at times be changed. At Summit I had two pieces of white chocolate from New Zealand. They changed my mind about white chocolate! I enjoyed them.

People often ask me what I want for my birthday and the truth is, I rarely want material things. Honestly, unless it’s a new Tesla (and NOT the truck) I can and will probably buy it for myself.

But here goes:

  • Another 52 years – hey, why not? We’re making medical breakthroughs, and it’s possible I’m wrong and I’ve got more days ahead of me than behind me. Right now, I’d love that.
  • Learn something – challenge yourself in the next year to learn a new skill or a new topic. Don’t get stuck doing the same things all the time.  Earlier this year I finally took the time to start learning about Extended Events. Who knows what I’ll learn in this coming year.
  • Teach someone something – every one of you has a skill someone else doesn’t. Share.
  • Related to that: if you’re a caver, or have a friend or family member who is a caver, get them to take the 2020 National Cave Rescue Seminar.
  • Have a friendly debate with someone. Realize it’s not about winning or losing, but an exchange of ideas. Bring your facts to the table and recognize your opinions. Be open-minded.  Be prepared to say, “You know what, you’re right, I was wrong.” This is not losing a debate.  And be prepared to acknowledge someone saying the above to you. Accept the words graciously, don’t lord it over them. This is not about winning a debate.
  • Be kind.  If nothing else in the coming year, be kind.

And that’s it for turning 52.

P.S. Unrelated, but check out my latest article at Redgate’s Simple-talk: Building a Countdown Timer with PowerShell