“It’s a Jump to the left…

… and then a double-hop to the right.” Or something like that.

I’ve commented before on the fact that I’m a consultant. I enjoy it. People will ask me what I do, and it varies. At one client they refer to their VB app as “the database,” and because they found an ad of mine on Google where I talked about database administration, they hired me. About 80% of the work I do for them is actually on the VB app or related; very little is what I’d consider traditional database work. But that’s ok, they’re a pleasure to work with and I enjoy the work. Another client I recently worked with asked me to help them conduct an audit of their web-based product and help them with some steps to make it more secure. I was more than happy to help.

And then there’s my largest, by far, client. I actually get to do a fair amount of work that most of my #sqlfamily would recognize as “database work”. But there, perhaps more than any other, I describe my duties as “DBA and other duties as assigned.”  So between the work at this client and all my other clients, I’m often jumping or stepping around stuff.

The Double-Hop

This time, though, I was asked to solve a double-hop. What is that exactly? It’s an issue that has to do with how Windows can pass security credentials from one server to another. This article, while old, describes it well. This was essentially the situation my client was trying to solve: users needed to use their Active Directory (AD) credentials to log into the Reporting Server (RS_Server) and run a report that in turn accessed data on a separate database server (DB_Server); thus, the double-hop. Now, from my point of view, this isn’t really database work, but since the reporting server talks to the database server, it was dropped in my lap under “other duties as assigned.”

Now, honestly, this SHOULD be simple to solve. It wasn’t. One reason was that, like many companies, this client has a separate team that handles much of their infrastructure needs, such as AD requests. And those requests have to go through tickets. To be clear, I support this concept, in theory. In practice, it can often take 2-3 weeks for even simple requests to go through. This meant that my first attempt at solving the double-hop failed. Their IT department did exactly what I requested; unfortunately, there was a typo in my ticket. So it failed. So, round two. And round two didn’t work. Nor did round three. At this point, though, it wasn’t due to typos or mistakes on my end.

I started reading every article I could. My great editor at Red-Gate, Kathi Kellenberger, has one, and trust me, I wasn’t too shy to ask at that point! But nothing was working. I even asked another DBA at the client (she heads up a different group and is their lead DBA). She pointed me to one of her people, saying, “Talk to him, he solved it.” I did, and he hadn’t. His solution was the one we were trying to avoid (basically using a fixed user in the datasource).

Frustration Sets In

I was getting frustrated. Fortunately, at this point I started to exploit a loophole in the ticketing process. Since the problem wasn’t being fixed, I was able to keep the ticket open and ended up getting assigned someone from their IT group who was as interested in fixing this as I was. This meant that rather than “open a ticket, wait 1-2 weeks, have the ticket closed as complete, test, find out it failed, rinse and repeat,” we could now actually schedule Zoom sessions and make changes in real time. AND… nothing we tried worked.

At this point you’re probably saying, “Yeah, yeah, get to the point. Did you solve it?”  The answer is yes, but I wanted you to feel a bit of my pain first, and I needed to make this post long enough to make it worth posting.

A Solution!

Now, let me say, I wish I could write out an exact recipe card solution for you. For various reasons, I can’t. But bear with me.

Finally, we found an additional resource in the IT group who had solved this before. His first recommendations were, yet again, the solutions I mentioned above. He saw they didn’t work and agreed we were trying the right thing. So he said, “Well, let’s try a solution known as Resource-based Kerberos Constrained Delegation.” This didn’t work at first either.

But then he suggested that we turn on 128- and 256-bit Kerberos encryption on the DB_Server SQL account. Bingo, that worked. Mostly. More on that in a second.

So here’s the setup we ended up with.

  1. RS_Server – running Reporting Services under the account domain\RS_Server_Service
  2. DB_Server – running SQL Server under the account domain\DB_Server_Service
  3. Set up SPNs for the SQL Server service account:
    1. MSSQLsvc/DB_Server domain\DB_Server_Service
    2. MSSQLsvc/DB_Server.domain.com domain\DB_Server_Service
    3. Note that in this case you do not appear to need one on the RS_Server side.
  4. Run a PowerShell script (a consolidated, runnable sketch follows after this list):
    1. $FEIdentity1 = Get-ADUser -Identity RS_Server_Service
    2. $BEIdentity = Get-ADUser -Identity DB_Server_Service
    3. Set-ADUser $BEIdentity -PrincipalsAllowedToDelegateToAccount $FEIdentity1
  5. At this point things should have worked, but they didn’t until we enabled the encryption options:

    (Screenshot: Kerberos encryption options)

  6. Then on our test box, things magically worked! Ok, not quite magically, but things worked. We had a solution.
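
To pull steps 3 and 4 together, here’s what the whole change looks like as a single PowerShell session. To be clear, this is my reconstruction using the placeholder names from the list above, not a copy of the exact commands we ran, and it assumes the ActiveDirectory module plus sufficient rights to modify the accounts.

# Sketch only: consolidates steps 3 and 4 above using the placeholder account
# names (RS_Server_Service and DB_Server_Service). Requires the ActiveDirectory
# PowerShell module and rights to change these accounts in AD.

# Step 3: register the SPNs against the SQL Server service account.
# (setspn -S checks for duplicates before adding.)
setspn -S MSSQLsvc/DB_Server domain\DB_Server_Service
setspn -S MSSQLsvc/DB_Server.domain.com domain\DB_Server_Service

# Step 4: tell the SQL Server service account which accounts may delegate to it.
# Get-ADUser -Identity takes the SAM account name rather than domain\name.
$FEIdentity1 = Get-ADUser -Identity RS_Server_Service
$BEIdentity = Get-ADUser -Identity DB_Server_Service

# This property is overwritten, not appended to, so pass ALL of the
# reporting-server service accounts in this one call.
Set-ADUser $BEIdentity -PrincipalsAllowedToDelegateToAccount $FEIdentity1

# Optional sanity check: see what is currently allowed to delegate.
Get-ADUser $BEIdentity -Properties PrincipalsAllowedToDelegateToAccount |
    Select-Object -ExpandProperty PrincipalsAllowedToDelegateToAccount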

And I was even more ecstatic when, later that day, I tested this on a second report server box we had, and it too suddenly was working without any changes. And this was a box where we had NOT even set up an SPN for the original double-hop solution, so I was pretty confident that the Resource-based Kerberos Constrained Delegation was working. In addition, in the rsreportserver.config file, the only authentication enabled was NTLM.

The next step was to try this on a production server. In that case, I did have to reconfigure the service it was running under to use the domain account domain\RS_Server_Service.

And… my test failed.

I was at my wits’ end. I couldn’t quite figure out what was different. I checked my service names, my SPNs, the rsreportserver.config file, and more. Nothing was working. I took a break, came back, and had an idea. In the datasource I changed the connection string from:

Data source=DB_Server;Initial Catalog=TestDB

to

Data source=DB_Server.domain.com;Initial Catalog=TestDB

Bingo, it worked! A little digging confirmed my suspicion. This client actually has multiple DNS domains, and the DNS suffix ordering and related settings under TCP/IP were different on this box than on the other two boxes. And that made the difference.

Sure enough when I tried deploying to a fourth box, I had the same issue, but changing it to the Fully Qualified Domain Name (FQDN) solved my issue.

So, my take-aways for this week:

  • Resource-based Kerberos Constrained Delegation may be a better way of solving the double-hop problem than the solution generally proposed.
  • Once you’ve set up the “target” SQL Server service account and the source Reporting Server service accounts, additional reporting servers can be added to the mix without needing assistance from a domain admin.
  • It appears you still need an SPN (well, multiple SPNs) for the SQL Server itself.
  • You need to run a PowerShell script to set up the accounts. Note that if you run it again, it overwrites the old settings, so you need to add ALL of the source accounts in a single step.
  • Depending on your domain’s security setup, you may need to enable 128/256-bit Kerberos encryption.
  • DNS resolution may determine whether you can use just the NetBIOS name or need the FQDN in your data sources.
  • This solution will NOT work if you need to cross domains or have more complex setups, but in general it can be simpler to set up and to maintain, especially if you have limited access to making changes to AD in the first place.
  • My reading indicates this only works on Windows 2012 and beyond. But you shouldn’t be running older versions of Windows in any case!

If I get the time and energy, I may setup a test environment in my home lab to further experiment with this and write up better demos, but for now, use this as you can. Hopefully it’ll save you some of the stress I experienced.

And that’s it from here, back to other duties as assigned.


Quiet Time and Errors

I wrote last week about finally taking apart our dryer to solve the loud thumping issue it had. The dryer noise had become an example of what is often called normalization of deviance. I’ve written about this before more than once. This is a very common occurrence and one I would argue is at times acceptable. If we reacted to every change in our lives, we’d be overwhelmed.

That said, like the dryer, some things shouldn’t be allowed to deviate too far from the norm, and some things are more important than others. If I get a low gas warning, I can probably drive another 50 miles in my car. If I get an overheated engine warning, I probably shouldn’t try to drive another 50 miles. The trick is knowing what’s acceptable and what’s not.

Yesterday I wrote about some scripting I had done. This was in response to an issue that came up at a customer site. Nightly, a script runs to restore a database from one server to a second server. Every morning we’d get an email saying it was successful. But there was a separate email, about a separate task designed to run against that database, that indicated a failure. And that particular failure was actually pretty innocuous. In theory.

You can see where I’m going here. Because we were trusting the email of the restore job over the email from the second job, we assumed the restore was fine. It wasn’t. The restore was failing every night but sending us an email indicating a success.

We had unwittingly accepted a deviance from the norm. Fortunately, the production need for this database hadn’t started yet. But it will soon. This is what drove me to rewrite and redeploy the scripts on Friday and on Monday.
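
I can’t share the client’s scripts, but the heart of the fix is easy to sketch: stop trusting the restore job’s own “success” email and have the script confirm, on the target server, that a restore actually landed recently. Something along these lines, where the database name and the 24-hour threshold are illustrative values, not the client’s:

-- Sketch: verify a recent restore actually happened by checking msdb's restore
-- history on the target server, rather than trusting the job's own output.
-- N'TestDB' and the 24-hour window are illustrative.
DECLARE @LastRestore datetime;

SELECT @LastRestore = MAX(rh.restore_date)
FROM msdb.dbo.restorehistory AS rh
WHERE rh.destination_database_name = N'TestDB';

IF @LastRestore IS NULL OR @LastRestore < DATEADD(HOUR, -24, GETDATE())
    RAISERROR(N'TestDB has not been restored in the last 24 hours.', 16, 1);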

And here’s the kicker: with the new script, we discovered the restore had also been failing on a second server (for a completely different reason!).

Going back to our dryer, it really is amazing how much we had come to expect the thunking sound and how much quieter it is now. I’ve done nearly a half-dozen loads since I finally put in the new rollers, and every time I push the start button I still cringe, waiting to hear the first thunk. I had lived with the sound so long I had internalized it as normal. It’s going to take a while to overcome that reaction.

And it’s going to take a few days or even weeks before I fully trust the restore scripts and don’t cringe a bit every morning when I open my email for that client and check for the status of overnight jobs.

But I’m happy now. I have a very quiet dryer and I have a better set of scripts and setup for deploying them. So the world is better. On to the next problems!

Getting My Hands Dirty…

… and my clothes cleaned. Or more importantly, dried.

Before I was a programmer I worked for my dad in construction over the summers of high school. It was good solid work. I enjoy working with my hands at times. For one thing, you see and feel the results of your accomplishments in a very tangible manner. For another, you generally can measure the impact of your effort.

After my dad died, I wrote about using a drill of his to work on the addition that was to become my office. I liked the heft and feel of it. I knew I was accomplishing something with it. Being a programmer, sometimes it’s hard to experience that. Currently, for example, I’m working on an ETL script using PowerShell to SFTP down a file, extract it to some tables, and then feed it into Salesforce. For me, it’s just a bunch of data. Yeah, there are some fun challenges: learning how to set up and deal with GPG, and designing a robust and secure process because some of the data is sensitive. But once the project is finished, it’ll run silently, and other than an occasional email, I won’t think much about it. It won’t impact my day-to-day life in any way, and I won’t be able to point to it and say, “See, THERE is something I did.”

But my dryer, on the other hand, now that’s different. For a while now (say at least a year or two), whenever I’ve run a load it’s made a fearsome rumbling sound. It’s been annoying, but we managed to live with it up until 6 or 7 weeks ago. Generally I’d do most of the laundry on Sundays, and if there was a load or two left, on Monday while everyone else was out of the house or at school. But obviously things changed. My wife’s office is in the room next to the laundry room. Whereas for me the rumbling was faint and simply background noise, for her it was quite noticeable. I tried to work the loads around her work schedule, especially since she’s on so many conference calls for her job, but it was getting less and less practical.

It was finally time for me to do something about it. Now, had I been smart, I’d have started the project on a Monday. But, I’m not always that smart. So, Saturday came around and I disassembled the dryer.

I was fairly confident I knew what the problem was. I assumed that either something had wrapped around one of the rollers for the drum, or a bearing in a roller had seized. If it was the former, the fix would be trivial and I’d have the whole thing back together before dinner. If it was the latter, I figured a shot of silicone or other lubricant and I could at least get a few more weeks out of it while I ordered the parts. And since the tight screws were now loosened and I knew how to take it apart, the final fix would go quickly.

Well, as they say, you know what happens when one assumes. I was wrong about the first guess; it was not anything as simple as something wrapped around the roller. And I was even more wrong about it being a seized or flattened bearing. See, for that assumption to be valid, the bearing assembly inside the wheel has to actually exist.

(Photo: Bearing assembly? What bearing assembly?)

It’s a bit hard to make out, but inside the blue part of the wheel above, and behind the plastic triangle, there is supposed to be a nice little bearing assembly.  There is none.

(Photo: a better view of the roller)

You can see the wear on the inner hub.  This is what in the trade is called “less than optimal.”

More seriously though, it unfortunately meant that this was not going to be a quick fix. I had been planning on ordering the parts, but this made it a bit more of a rush. The dryer contains four of these rollers, and as such I ordered a four-pack, since my general assumption on items like this is that if one has worn, all four have worn. Now, none of the other three have shown nearly as much damage, but I figure, I’m in there, I might as well make it right.

What’s most interesting to me, is that there’s literally NO sign of the roller assembly in the dryer. However it got destroyed, it was pretty cataclysmic.

I also took the time to clean out the rest of the interior space and correctly deduced that the moisture sensor was covered with lint. Now that I know where it is, I can keep that clean in the future.

In any case, sometime later this week, I’ll get my package, swap out the rollers and reassemble the dryer and start doing laundry again. Quietly.

But, unlike the ETL I’m writing above, this change will have a direct, noticeable impact on my life that I’ll be aware of every time I do a load of clothes. I like that.

This week’s takeaway? I do enjoy my job and the challenges that come with it, but there’s something to be said for doing work you can touch and feel and experience the tangible impact.

(Photo: my best sourdough yet!)

And perhaps I shouldn’t be posting pictures of homemade bread after talking about dirty hands. Don’t worry, I washed my hands!

An Ounce of Prevention?

There’s an old saying in medicine: “When you hear hoof beats, think horses, not zebras.” Contrary to what one might think after watching House, the truth is that when you’re presented with a set of symptoms, you start with the most likely, well, because it is the most likely! But as House illustrates, sometimes it can be the unlikely.

In First Aid, especially wilderness or backcountry medicine, there’s an acronym that is often used called SAMPLE. This is a mnemonic to help rescuers remember what data to gather:

  • S – Signs/Symptoms – What do you see and observe? (i.e., what’s going on).
  • A – Allergies – Perhaps the problem is an allergic reaction, or they might be allergic to whatever drug you want to give them.
  • M – Medicines – What medications are they on? Perhaps they’re diabetic and haven’t taken their insulin, or they’re on an anti-seizure medicine and need some.
  • P – Past, pertinent medical history – You don’t care that they broke their ankle when they were 5. But perhaps they just underwent surgery a few weeks ago? Or perhaps they have a history of dislocating their shoulder.
  • L – Last oral intake – Have they eaten or drunk anything recently? This will drive your decision tree in a number of ways.
  • E – Events leading up to the injury – Were they climbing to the top of the cliff and fell, or did they simply collapse at the bottom? The former may suggest you look for a possible spinal injury; the latter probably indicates something else.

I’m reminded of this because of how I spent my Presidents’ Day. I woke up, checked a few emails, and noticed that two I was expecting from a client’s servers never arrived. I logged in to see what was going on. It turns out there was nothing going on. No, not as in “nothing wrong going on,” but more as in “nothing at all was happening; the databases weren’t operating right.” The alert system could connect to the database server, so no alerts had been sent, but actually accessing several of the databases, including master, resulted in errors.

(Screenshot: master.mdf corruption errors. Not an error you want to wake up to!)

And what was worse, the DR server was exhibiting similar symptoms!
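
I won’t walk through every diagnostic we ran, but for anyone waking up to similar errors, the obvious first steps are a consistency check on each suspect database and a look at what SQL Server has already flagged. A minimal version:

-- Consistency check on a suspect database (repeat per database).
-- NO_INFOMSGS keeps the output down to just the errors.
DBCC CHECKDB (N'master') WITH NO_INFOMSGS;

-- Has SQL Server already recorded any damaged pages?
SELECT database_id, file_id, page_id, event_type, error_count, last_update_date
FROM msdb.dbo.suspect_pages;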

So modifying SAMPLE a bit:

  • S – Signs/Symptoms – Databases are throwing corruption errors on two servers. This extended to the ERRORLOG files on both servers.
  • A – Allergies – Well, servers don’t have allergies, but how about known bugs? That’s close enough. Nope, nothing that seems to apply here.
  • M – Medicines – I’ll call this antivirus software and <redacted>. (For client privacy reasons I can’t specify the other piece of software, but I’ll come back to it.)
  • P – Past, pertinent medical history – Nothing; these servers had been running great up until now. One has been in production for over 2 years, the other up for about 2 months, having been brought up as a DR box for the first.
  • L – Last oral intake – Let’s make this last data intake. From our forensics, we determined the corruption on both servers occurred around 3:00 AM EST. Checking our logs, jobs, and other processes, there’s nothing special about the data the primary server took in at this time. If anything, disk I/O was lower than average. And, fortunately, we can easily recreate any data that was sent to the server after the failure.
  • E – Events leading up to the injury – This is where things get interesting.
    • There were some zero-day patches applied to both servers over the weekend.
    • On Saturday, I had finally set up log-shipping between the two servers.

So, we’ve got three possibilities, well four really.

  1. The zero-day patches caused an issue about 48 hours later. Possible, but unlikely, given the client has about 1,600 servers that were also patched and have not had issues.
  2. Log-shipping somehow caused problems. But again, the new log-shipping setup had run for about 36 hours without issue. And, best we can tell, the corruption occurred on the secondary BEFORE the primary. And log-shipping doesn’t apply to the master database or the ERRORLOG file.
  3. Some unknown interaction. This, I think, is the most likely, and it’s where the <redacted> from the M above comes into play.
  4. Pure randomness, and it’ll never happen again. I hate this option because it just leaves me awake at night. I added it only for completeness.

Without going into detail, our current theory is that some weird interaction between <redacted> and log-shipping is our cause. Of course, the vendor of <redacted> is going to deny this (and has), but it’s the only combination of factors that seems to explain everything. (I’ve left out a number of additional details that also helped us get to this conclusion.)

So for now, we’ve disabled log-shipping and are going to make some changes to the environment before we try log-shipping again.

Normally I think horses, but we might have a herd of zebras on this one. And ironically, setting up for DR may have actually caused a Sev 1 outage. So the ounce of prevention here may not have been worth it!

And who said my Wilderness First Aid wouldn’t come in handy?


Brake (sic) it to Fix it!

Last week I wrote about getting my brakes fixed. Turns out I made the right choice; besides shoes and pads, I needed new calipers. Replacing them was a bigger job than I would have wanted to deal with. Of course it cost a bit more than I preferred, but I figure being able to make my car stop is a good thing. I had actually suspected I’d need new calipers, because one of the signs my brakes needed to be replaced was the terrible grinding and shrieking sound my left front brake was making. That’s never a good sign.

As a result I started to try to brake less if I could help it. And I cringed a bit every time I did brake. It became very much a Pavlovian response.

About two weeks ago, a colleague of mine said, “Hey, did you notice the number of errors for that process went way up?” I had to admit, nope, I had not. I had stopped looking at the emails in detail. They were for an ETL process that I had written well over a year ago. About 6 months ago, due to new, and bad, data being put into the source system, the ETL started to have about a half dozen rows it couldn’t process.  As designed, it sent out an email to the critical parties and I received copies.  We talked about the errors and decided that they weren’t worth tracking down at the time.  I objected because I figure if you have an error, you really should fix it. But I got outvoted and figured it wasn’t my concern at that point. As a result, we simply accepted that we’d get an email every morning with a list of rows in error.

But, as my colleague pointed out, about 3 months ago, the number of errors had gone up. This time it wasn’t about a half dozen, it was close to 300. And no one had noticed.  We had become so used to the error emails, our Pavlovian response was to ignore them.

But this number was too large to ignore. I ended up doing two things. The first, and the one I could deploy without jumping through hoops, was to update the error email. Instead of simply showing the rows of errors, it now included a query that put a table at the top showing how many errors there were and in which tables. This was much more effective, because now a single glance can easily show whether the number of errors has increased or gone down (if we get no email, that means we’ve eliminated all the errors, which is the ultimate goal in my mind).
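
The real query is tied to the client’s schema, so here’s just the shape of it; the table and column names below are made-up stand-ins:

-- Sketch of the summary that now heads the error email: error rows per source
-- table for the last day, so one glance shows whether things got better or worse.
-- dbo.ETL_ErrorLog, SourceTable, and ErrorDate are illustrative names.
SELECT SourceTable,
       COUNT(*) AS ErrorRows
FROM dbo.ETL_ErrorLog
WHERE ErrorDate >= DATEADD(DAY, -1, GETDATE())
GROUP BY SourceTable
ORDER BY ErrorRows DESC;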

I was able to track down the bulk of the 300 new errors to a data dictionary disagreement (everyone raise their hand who has had a customer tell you one thing about their data, only to discover that the details are really different) that popped up when a large amount of new data was added to the source system.

I’ve since deployed that change to the DEV environment and, now that we’re out of the end-of-month code freeze for this particular product, will be deploying it to production this week.

Hopefully though the parties that really care about the data will then start paying attention to the new email and squawking when they see a change in the number of bad rows.

In the meantime, it’s going to take me a while to stop cringing every time I press my brakes. They no longer make any bad sounds, and I like that, but I’m not used to the absence of grinding noises again. Yet. In both cases, my client and I had accepted the normalization of deviance and internalized it.

I wrote most of this post in my head last week while remembering some other past events that are in part related to the same concept.  As a result this post is dedicated to the 17 American Astronauts who have perished directly in the service of the space program (not to diminish the loss of the others who died in other ways).

  • Apollo 1 – January 27th, 1967
  • Challenger – STS-51L – January 28th, 1986
  • Columbia – STS-107 – Launch January 16th, 2003, break-up, February 1st, 2003.


Them’s the brakes…

I need brake work done on my car.  I scheduled an appointment with my mechanic for Thursday.

Now normally brake work might not be worth writing about, but this time it got me thinking. About 9 years ago I taught myself to change the brakes on my Subaru. It wasn’t that hard, but it definitely took longer than I would have liked. I’ve learned how to do it much faster since then.

Back then I did it because I was between jobs and had more time than money. It made sense at the time. Since then I’ve generally continued to do them myself. It doesn’t take long and as I’ve said in the past, sometimes I like working with my hands. It reminds me not all of life is simply databases and servers.

So why this time? The simple answer is, my garage floor is cold! Ayup, it’s a funny thing, but in the winter, things get cold! And honestly, I just don’t feel like sitting or lying on the concrete while I do the work. So I’ll pay an expert who can pop the car up on a lift and do the work in probably half the time I can. I remind myself that this is why I work: so I can pay others to do work I don’t care to do, or that they can do better than I can.

Years back, at my last job as the Director, and later VP, of IT, I changed our email provider from one company to another (don’t ask me who they were, I can’t recall over a decade later). My team didn’t like this. They kept insisting they could run the mail servers themselves. I had no doubt that they could. Running an Exchange server isn’t the most difficult thing in the world. That said, I kept saying no, for a simple reason: we weren’t an email company! Providing reliable email for a business like ours meant, at the very least, strong spam protection and of course high availability, ideally with geographic redundancy. That meant at least two Exchange servers, spam filters, and more. This alone would have cost more than what we paid annually. Now add in the time my team would have spent on email issues, and it just wasn’t going to be worth it.

And we’re seeing that more and more with services such as Azure. Sure, many businesses still change their own brakes, err, run their own SQL Server. But more and more are outsourcing to platforms such as Azure. Still, there are a number of companies that, for various reasons, don’t do that. Fortunately for them, there are consultants like me to help them with their servers. SQL Server has always been easier to maintain than some of its competition, especially when it was first available. But that has never meant it needs no intervention. Someone still needs to make sure that jobs are being run, that backups are available, and more.

So, some days I change my own brakes because it’s fun and easy, some days I pay someone.  Some days my clients handle their own SQL Server issues, and other days, they pay me.

No real revelation here. No real advice on when to draw the line on outsourcing, just an observation. But in my case, my concrete is too cold, so I’ll pay someone else to do the dirty work.

Fixing a “Simple” Leak

I was reminded of the 80/20 rule yesterday as I was briefing my team on a project I needed them to work on.  Ok, actually I was telling my kids about this, but it was still for a project.

In this case, I was pointing out that 80% of the effort for a project often is used on completing only 20% of the project. Researching this a bit further, there’s an actual name for this rule: the Pareto Principle. We’ve probably all come across this rule or some variation of it in our lives.

While not exactly the situation here, the point I was trying to make with them is that sometimes when fixing a simple problem, the issues and work that spin off easily take 10 times as much time and effort as the original problem.

In this specific case, I was asking them to do what was essentially an annoying and dirty job, clean up some edges to sheetrock before I finally re-enclosed a portion of the basement ceiling. They started later in the day than I would have preferred, but honestly, as this project has been sitting on hold for a few months, a couple more hours didn’t really matter. That said, I think the actual work took them longer than they thought it would.

So what prompted yesterday’s work was a small leak that I fixed over a year ago.

Over the years, we had noticed a slight leak in the downstairs bathroom. It wasn’t always apparent, but it was slowly getting worse. Essentially water was soaking into the sheetrock and walls of the finished portion of our basement.

Now, I’m a fairly handy guy. I spent several summers in high school and a bit of college working for my father in the construction trade. While he’d point out that my finish work needs more work, in general, if it’s involved in residential construction, I’ve done it and I feel comfortable doing it. But besides learning how to use the tools, I also learned an important lesson: no project is ever as simple as it appears. This is actually one reason I hate starting home-improvement projects. I know it’s going to turn into a lot more work than it originally looks like. It isn’t exactly the 80/20 rule, but I’m reminded of it. So let me dive a bit into what was, and still is, involved in fixing this simple leak.

First, I had to identify where the leak was. From lots of inspection, guesswork, and experience, I guessed it was in the wall behind the tiles. Ok, so that means ripping out the tiles and the plaster and lath underneath. That’s simple enough, and honestly fun, albeit dusty.

Second, once the leak was found, replace the plumbing. With modern PEX plumbing, that’s actually the quickest part of the work. So yes, actually FIXING the leak took maybe an hour. I will note I also took the opportunity to move the showerhead up by about 9″. One thing I hate is a low showerhead!

(Photo: new PEX plumbing to the showerhead)

But hey, while I’m in here, I might as well run the wiring for a fan for the bathroom, because it’s always needed one. And given a weird quirk of construction in this house, the easiest way is to run it into the long wall of the bathtub/shower, then over to the outside wall, and then up to the approximate location of where the fan will go. So there’s another hour or two for a project that was started to fix a leak.

Oh and while I’m at it, let me take some photos with a tape measure in them of where the pipes are for future reference. So there’s a bit more work.

Great, the leak is fixed.

Except, obviously, the shower can’t be used as is with open walls. So now it’s a matter of getting backerboard, putting that in, and sealing it. What I used is waterproof as it is, so we left it at that. And I say we, because for much of this project both the kids were helping with it. And quite honestly, that’s where this part of the project sat for months. The shower was usable, though a bit ugly.

But, we still had the basement to deal with. That meant ripping out the damaged sheetrock and studs that had rotted. That was fun. Not! For that I actually used a full respirator, body suit, and sprayed anti-fungal stuff liberally. Some of the water damage here was actually older than the bathroom leak and was due to poor grading and then more recently, runoff from the roof of my addition (a problem that gutters finally solved).

But hey, if we’re putting in new walls, we might as well put in better insulation, try to seal the old concrete (hint: that went poorly), and, oh, put in a couple of dimmable lights for a work area and some network jacks.

(Photo: basement wall in progress)

Once that was all roughed in, it was time to at least sheetrock the walls.

(Photo: sheetrocked walls)

And that is basically where the basement project sat until yesterday. I’ll come back to that in a minute.

As for the bathroom, there was only so long I wanted to look at the backerboard. It was time to finally tile it. Oh, but before I could do that, I had to put that fan in. Besides finding just enough room in the outside wall between the framing for the 2nd floor, the window, and other vagaries, it just fit. Of course, that was just half the battle. The other half was then wiring up the switch. Oh, and while I was at it, I might as well run a circuit for a GFCI outlet, since the bathroom was lacking one. Once all that was done, THEN I could tile.

(Photo: tiled and grouted)

And yes, you may note the window does intrude into the shower space. Hey, I didn’t build the house!  What you can’t see is the replaced trim on the top edge and left edge of the tile that my daughter literally spent hours sanding and resealing. It looks great.

Oh and I still have to find replacement cones for behind the handles! So that’s another thing to do for the project.

But back to the basement. The area in front of the wall has become my son’s de facto computer space when he’s home from college. It was “good enough.” But the ceiling still needed its sheetrock replaced, and I needed to tape and paint the new wall. This had waited until now.

The problem with the ceiling is that when I pulled down the old stuff, I didn’t have nice clean edges to butt the new sheetrock against. It was ragged where it had broken, or had broken at awkward places, so I couldn’t easily put in new sheetrock.

But I also took advantage of this time to reroute all my network drops so they will be hidden in the ceiling and come out nicely to my rack.

So yesterday, the kids did the dirty work of trimming the edges, cleaning stuff up, etc. It looks great and will make my job of sheetrocking much easier.

(Photo: the open basement ceiling)

By the way, I should note that the board you see sticking out is what I had used in the past when I had to slide in here to do some wiring or other work. This was sort of my own private Jefferies Tube. This should now be relatively easy to sheetrock, right?

Well, except for one small detail.

(Photo: Houston, we have a problem.)

Yes, that is a piece of electrical cable that was run OUTSIDE the studs, essentially in the joint where two pieces of sheetrock met at an inside corner. I absolutely HAVE to move this before I can sheetrock.

So that’s going to be a few more hours of work before I can even start to sheetrock. I have to identify which circuit this is, cut power, cut the wire, reroute it, put the ends in a junction box (which code says can’t be hidden!) and then make sure it’s safe.

After all that work, I can finally get around to sheetrocking the ceiling. Then I’ll have to mud and tape all the joints, sand, mud again, prime and then finally get the walls and ceiling painted.

But the good news is, the leak is fixed. That was the easy part!

Does this “simple” project of fixing a leak remind you of any projects at work? It does for me!

Crossing the Threshold…

So it’s the usual story. You need to upgrade a machine, the IT group says, “no problem, we can virtualize it, it’ll be better! Don’t worry, yes, it’ll be fewer CPUs, but they’ll be much faster!”

So, you move forward with the upgrade. Twenty-three meetings later, 3 late nights, one OS upgrade, and two new machines forming one new cluster, you’re good. Things go live.  And then Monday happens. Monday of course is the first full day of business and just so happens to be the busiest day of the week.

Users are complaining. You look at the CPU and it’s hitting 100% routinely. Things are NOT distinctly better.

You look at the CPUs and you notice something striking:

(Screenshot: per-CPU utilization, with CPU 8 showing a problem)

Four of the CPUs (several are missing from this graphic) are showing virtually no utilization, while the other 8 are going like gangbusters. Then it hits you: the way the IT group set up the virtual CPUs is not what you needed. They set up 6 sockets with 2 cores each, for a total of 12 cores. This shouldn’t be a problem, except that SQL Server Standard Edition uses the lesser of 4 sockets or 24 cores. Because your VM has 6 sockets, SQL Server refuses to use two of them.

You confirm the problem by running the following query:

SELECT scheduler_id, cpu_id, status, is_online FROM sys.dm_os_schedulers

This shows only 8 of your 12 CPUs are marked VISIBLE ONLINE.

This is fortunately an easy fix. A quick outage and your VM is reconfigured to 2 sockets with 6 cores apiece. Your CPU graphs now look like:

(Screenshot: a better CPU distribution)

This is closer to what you want to see, but of course since you’re doing your work at night, you’re not seeing a full load. But you’re happier.

Then Monday happens again. Things are better, but you’re still not happy. The CPUs are running at about 80% utilization on average. This is definitely better than 100%, but your client’s product manager knows they’ll need more processing power in the coming months, and running at 80% doesn’t give you much growth potential. The product manager would rather not have to buy more licenses.

So, you go to work. And since I’m tired of writing in the 2nd person, I’ll start writing in 1st person moving forward.

There’s a lot of ways to approach a problem like this, but often when I see heavy CPU usage, I want to see what sort of wait stats I’m dealing with. It may not always give me the best answer, but I find them useful.

Here are the results of one quick query.

Fortunately, this being a new box, it was running SQL Server 2016 with the latest service pack and CU. This meant I had some more useful data.
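
The query itself didn’t make it into this post; a stripped-down sketch of the kind of wait-stats look I mean (not necessarily the exact script I ran, which also offered suggestions and filtered out the usual benign waits) is:

-- Top waits accumulated since the last restart (or since the stats were last
-- cleared), ordered by total wait time.
SELECT TOP (10)
       wait_type,
       waiting_tasks_count,
       wait_time_ms,
       signal_wait_time_ms
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;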

(Screenshot: CXPacket and CXConsumer telling the tale)

Note one of the suggestions: Changing the default Cost Threshold for Parallelism based on observed query cost for your entire workload.

Given the load I had observed, I guessed the Cost Threshold was way too low. It was in fact set to 10. With that setting, during testing, I saw a CPU graph that looked like this:

(Screenshot: 43.5% CPU at a Cost Threshold of 10)

I decided to change the Cost Threshold to 100 and the graph quickly became:

(Screenshot: 25% CPU at a Cost Threshold of 100)

Dropping from 43.5% to 25.6%. That’s a savings you can take to the bank!
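
For reference, the change itself is nothing exotic; it’s the standard sp_configure route (cost threshold for parallelism is an advanced option, so that has to be exposed first):

-- Raise the cost threshold for parallelism from 10 to 100.
EXEC sys.sp_configure N'show advanced options', 1;
RECONFIGURE;
EXEC sys.sp_configure N'cost threshold for parallelism', 100;
RECONFIGURE;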

Of course, that could have been a fluke, so I ran several 5-minute snapshots where I would set the threshold to 10, collect some data, and then set it to 100 for 5 minutes and collect data.

Cost Threshold = 10
  Run   CXPacket waits   CXPacket wait time (ms)   CXConsumer waits   CXConsumer wait time (ms)
  1     635533           5611743                   563830             3551016
  2     684578           4093190                   595943             2661527
  3     674500           4428671                   588635             2853673

Cost Threshold = 100
  Run   CXPacket waits   CXPacket wait time (ms)   CXConsumer waits   CXConsumer wait time (ms)
  1     0                0                         0                  0
  2     41               22                        13                 29443
  3     1159             8156                      847                4328

You can see that over three runs, the difference between having a threshold of 10 versus 100 made a dramatic difference in the total time spent waiting in the 5-minute window.
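
The snapshots themselves were simple. Roughly this pattern, which diffs the two parallelism wait types over a 5-minute window (my reconstruction of the approach, not the exact script):

-- Capture the parallelism waits, wait 5 minutes, capture again, and diff.
IF OBJECT_ID('tempdb..#before') IS NOT NULL DROP TABLE #before;

SELECT wait_type, waiting_tasks_count, wait_time_ms
INTO #before
FROM sys.dm_os_wait_stats
WHERE wait_type IN (N'CXPACKET', N'CXCONSUMER');

WAITFOR DELAY '00:05:00';

SELECT ws.wait_type,
       ws.waiting_tasks_count - b.waiting_tasks_count AS waits_in_window,
       ws.wait_time_ms - b.wait_time_ms AS wait_ms_in_window
FROM sys.dm_os_wait_stats AS ws
JOIN #before AS b ON b.wait_type = ws.wait_type
WHERE ws.wait_type IN (N'CXPACKET', N'CXCONSUMER');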

The other setting that can play a role in how parallelization impacts performance is MAXDOP. In this case, testing didn’t show any real performance difference when changing that value.

At the end of the day, though, I call this a good day. A few hours of my consulting time saved the client $1,000s by avoiding the wrong, and expensive, road of adding more CPUs and SQL licenses. There’s still room for improvement, but going from a box where only 8 of the 12 CPUs were being used and were running at 100%, to a box where the average CPU usage is close to 25%, is a good start.

What’s your tuning success story?

Small Victories

Ask most DBAs and they’ll probably tell you they’re not huge fans of triggers. They can be useful, but they’re hard to debug. Events last week reminded me of that. Fortunately, a little debugging made a huge difference.

Let me set the scene, but unfortunately, since this was work for a client, I can’t really use many screenshots (or rather, to do so would take far too long to sanitize them in the time I allocate to writing my weekly blog posts).

The gist is, my client is working on a process to take data from one system and insert it into their Salesforce system. To do so, we’re using a 3rd-party tool called Pentaho. It’s similar to SSIS in some ways, but based on Java.

Anyway, the process I was debugging was fairly simple: take account information from the source and upsert it into Salesforce. If the account already exists in Salesforce, great, simply perform an update. If it’s new data, perform an insert. At the end of the process, Pentaho returns a record that contains the original account information and the Salesforce ID.

So far so good. Now, the original author of the system had set up a trigger so that when these records are returned, it can update the original source account record with the Salesforce ID if it didn’t exist previously. I should note that updating the accounts is just one of many possible transformations the entire process runs.

After working on the Pentaho ETL (extract, transform, load) for a bit and getting it stable, I decided to focus on performance. There appeared to be two main areas of slowness, the upsert to Salesforce and the handling of the returned records. Now, I had no insight into the Salesforce side of things, so I decided to focus on handling the returned records.

The problem, of course, was that Pentaho was sort of hiding what it was doing. I had to get some insight there. I knew it was doing an insert into a master table of successful records, and then a trigger updated the original account.

Now, being a 21st-century DBA and taking into account Grant Fritchey’s blog post on Extended Events, I had previously set up an Extended Events session on this database. I had to tweak it a bit, but I got what I wanted in short order.

CREATE EVENT SESSION [Pentaho Trace SalesForceData] ON SERVER
ADD EVENT sqlserver.existing_connection(
    ACTION(sqlserver.session_id)
    WHERE ([sqlserver].[username]=N'TempPentaho')),
ADD EVENT sqlserver.login(SET collect_options_text=(1)
    ACTION(sqlserver.session_id)
    WHERE ([sqlserver].[username]=N'TempPentaho')),
ADD EVENT sqlserver.logout(
    ACTION(sqlserver.session_id)
    WHERE ([sqlserver].[username]=N'TempPentaho')),
ADD EVENT sqlserver.rpc_starting(
    ACTION(sqlserver.session_id)
    WHERE ([package0].[greater_than_uint64]([sqlserver].[database_id],(4)) AND [package0].[equal_boolean]([sqlserver].[is_system],(0)) AND [sqlserver].[username]=N'TempPentaho')),
ADD EVENT sqlserver.sql_batch_completed(
    ACTION(sqlserver.session_id)
    WHERE ([package0].[greater_than_uint64]([sqlserver].[database_id],(4)) AND [package0].[equal_boolean]([sqlserver].[is_system],(0)) AND [sqlserver].[username]=N'TempPentaho')),
ADD EVENT sqlserver.sql_batch_starting(
    ACTION(sqlserver.session_id)
    WHERE ([package0].[greater_than_uint64]([sqlserver].[database_id],(4)) AND [package0].[equal_boolean]([sqlserver].[is_system],(0)) AND [sqlserver].[username]=N'TempPentaho'))
ADD TARGET package0.ring_buffer(SET max_memory=(1024000))
WITH (MAX_MEMORY=4096 KB,EVENT_RETENTION_MODE=ALLOW_SINGLE_EVENT_LOSS,MAX_DISPATCH_LATENCY=30 SECONDS,MAX_EVENT_SIZE=0 KB,MEMORY_PARTITION_MODE=NONE,TRACK_CAUSALITY=ON,STARTUP_STATE=OFF)
GO

It’s not much, but it lets me watch incoming transactions.

I could then fire off the ETL in question and capture some live data. A typical returned result looked like

exec sp_execute 1,N'SourceData',N'GQF',N'Account',N'1962062',N'a6W4O00000064zbUAA','2019-10-11 13:07:22.8270000',N'neALaRggAlD/Y/T4ign0vOA==L',N'Upsert Success'

Now, that’s not much, but I knew what the insert statement looked like, so I could build an insert statement wrapped in a BEGIN TRAN/ROLLBACK so I could test the insert without actually changing my data. I then tossed in SET STATISTICS IO ON and enabled Include Actual Execution Plan so I could see what was happening.
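
That harness looked roughly like the following; the table and column names here are illustrative stand-ins for the client’s real schema, and the values simply echo the captured call above:

-- Sketch of the test harness: run the same kind of insert the ETL performs,
-- look at the IO and the actual plan the trigger produces, then roll it all
-- back so no data actually changes. dbo.SalesforceResults and its column list
-- are made-up names standing in for the real table.
SET STATISTICS IO ON;

BEGIN TRAN;

INSERT INTO dbo.SalesforceResults
    (SourceSystem, Transformation, External_Id__c, Id, ProcessedAt, Status)
VALUES
    (N'SourceData', N'Account', N'1962062', N'a6W4O00000064zbUAA',
     '2019-10-11 13:07:22.827', N'Upsert Success');

ROLLBACK TRAN;

SET STATISTICS IO OFF;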

“Wait, what’s this? What’s this 300K rows read? And why is it doing a clustered index scan on this table?” This was disconcerting. The field I was comparing on was the clustered index; it should have been a seek!

So I looked more closely at the trigger. There were two changes I ended up making.

       -- Link Accounts
       -- Original version: the CAST on the clustered key (ClientId) forced a
       -- clustered index scan on GQF_AccountMaster.
       --MERGE INTO GQF_AccountMaster T
       --USING Inserted S
       --ON (CAST(T.ClientId AS VARCHAR(255)) = S.External_Id__c
       --AND S.Transformation in ('Account'))
       --WHEN MATCHED THEN UPDATE
       --SET T.SFID = S.Id
       --;

       -- Rewritten version: filter to Account transformations first (the ETL
       -- inserts these result rows one at a time, so the single-row subquery
       -- against Inserted is safe here), then compare ClientId directly so the
       -- clustered index can be used for a seek.
       if (select transformation from Inserted) ='Account'
       begin
              MERGE INTO GQF_AccountMaster T
              USING Inserted S
              ON T.ClientId  = S.External_Id__c
              WHEN MATCHED THEN UPDATE
              SET T.SFID = S.Id;
       end

An astute DBA will notice that CAST in there. Given the design, the Inserted table’s External_Id__c field is sort of a catch-all for all sorts of IDs, and some in fact can be up to 255 characters. However, in the case of an Account, it’s a varchar(10).

The original developer probably put the CAST in there since they didn’t want to blow up the MERGE statement if it compared a transformation other than an Account. (From what I can tell, T-SQL does not guarantee short-circuit evaluation; if I’m wrong, please let me know and point me to definitive documentation.) However, the minute you cast that column, you lose the ability to seek using the index; you have to use a scan.
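
If you want to see the effect in isolation, it’s easy to reproduce against the same table (assuming, as noted above, that ClientId is the clustered index key; the literal is just the value from the captured call):

-- Non-sargable: the CAST on the indexed column forces a clustered index scan.
SELECT SFID
FROM GQF_AccountMaster
WHERE CAST(ClientId AS VARCHAR(255)) = '1962062';

-- Sargable: compare the column as-is and the optimizer can seek on the index.
SELECT SFID
FROM GQF_AccountMaster
WHERE ClientId = '1962062';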

So I rewrote the commented section into an IF to guarantee we were only dealing with Account transformations, and then I stripped out the CAST.

Then I reran and watched. My index scan of 300K rows was down to a seek of 3 rows. The trigger now performed in subsecond time. Not bad for an hour or so of work. That and some other improvements meant we could now handle a few thousand inserts and updates in the time it previously took to do 10 or so. It’s one of those days where I like to think my client got their money’s worth out of me.

Slight note: next week I will be at PASS Summit, so I’m not sure if/when I’ll be blogging. But follow me on Twitter @stridergdm.

IIS FTP 530 error

To my usual readers: you can ignore this. This is simply my way of making sure that the next time I have to google this issue, or someone else does, there’s a better chance of finding a solution. Note, there are other places posting the same solution.

But simply put, over the weekend a client’s IT group rebooted a server and an FTP process started to fail. It took a lot of digging to solve it.

It appears that two things happened:

The IUSR access to the FTP root directory got munged. (It’s not 100% clear, and this may not be a necessary step, but things did NOT work for me until I re-added it, and then they continued to work when I removed the IUSR again to see if I could recreate the problem.)

The part that was 100% necessary was this: https://manage.accuwebhosting.com/knowledgebase/941/FTP-Error-530-User-cannot-log-in-home-directory-inaccessible.html (go to Step 8 and add all users). Why this setting was removed, I don’t know. But adding it back after the IUSR step seems to have solved the issue.

That said, before anyone asks, “Why in the world are you using FTP in the 21st century?” I won’t disagree, other than to say this is purely an internal process that simply moves some non-PII data to a 2nd server. Not a huge justification, but there it is.