A Speaker’s Timeline

This post will be short, for reasons that are hopefully obvious by the end.

Sometime in February

Hmm, I should put together some ideas to submit to present to SQL Summit in Houston (not Dallas as Mistress SQL pointed out to me) this year.

March 16th

An update, the call for speakers has been postponed. Darn.

March 23rd

Call for speakers is finally open!

March 30th

Submit 3 possible topics.

April 1st

Approach a fellow speaker about a possible joint session, but after discussion, decide not to go ahead with the idea.

June 3rd

Get an update, Summit will be virtual this year. Thankfully I didn’t book any tickets or hotel rooms in Dallas.

July 20th 6:49 PM EDT

Woohoo! I got the email! One of my submissions got selected to present!

July 20th 6:50 PM EDT

Crap, now I actually have to write the entire thing!

July 20th 6:51 PM EDT

Wait, and it’s going to be virtual too. That’s going to make it a bit more of a challenge to present. But I’m up to it!

Sometime in August

I really should get started. Hmm, here’s one of the scripts I want to present.

But honestly, I’m preparing to teach a bunch of cavers and medical students cave rescue, I need to concentrate on that first.

September 5th

I just biked over 100 miles. I’m certainly not working on my presentation THIS weekend.

Later in September

Ok, now I’m going to sit down and really work through this. Here’s a basic outline.

October 1st

Oh wait, it’s going to be virtual AND I have to prerecord it? How is that supposed to work? I had better read up at the speaker portal!

October 2nd

Huh, ok, that sorta makes sense, upload the slides, do a recording, but I still don’t get how it’ll work with a presentation like mine with lots of demos. Well I’ll figure it out.

October 6th around 11 PM EDT

Well the PowerPoint template deck they provided looks pretty slick. I should start prepping my slides.

October 6th, approximately 5 minutes later

There, got the first slide done. Of course it’s only my name and pronouns, etc. But it’s a start.

Oh and the 2nd slide is done, but that’s simply the default PASS slide talking about chapters, SQL Saturday etc, so technically I didn’t do anything there.

I’ll start working on the closing slides.

October 7th, sometime after midnight

Ok, about 5 slides done. I’ll like to myself and say I’ve made great progress!

October 9th, approximately 10:00 PM EDT

Ok, I’ll at least start writing out the scripts I need.

October 9th, 20 minutes later

What the bloody hell? Why is this script failing? I’ve got to present this. If I can’t get this script working how is anyone going to believe that I know PowerShell, let alone actually use it.

October 9th, 5 minutes later

Well, damn, that was an embarrassing mistake, just had the , in the wrong place

October 10th around 9:00 PM EDT

Hmm, to properly demo this, I really need to run against 3-4 SQL Servers and I really don’t want to spin up a bunch of VMS and I can’t use my development one, too much proprietary data there.

I know, NOW is a perfect time to start to learn to use Docker! Why not? And besides Cathrine Wilhemsen has a great post on it. I’ll simply follow that.

2 hours and 1 reboot later

Hey, would you look at that? I’ve actually got a docker container running SQL. This is awesome!

Another minute later

But why can’t I actually connect? What network is it on? Why did I decide docker was easier? Why did I even submit this proposal? What the heck am I doing here? What is the meaning of life?

5 more minutes

That’s it, I’m going to bed.

October 11th, late night

Oh, I get it it now, I didn’t setup a full separate network, it’s bridged and that’s why it’s showing 0.0.0.0. I just need to change the port and I’m good to go!

A minute later

This is pretty awesome. Not what I’d do for a production setup, but definitely works for my demos. Now if I were really smart, I’d also setup persistent storage and the like, but this is good enough. And honestly now, setup a loop, increment a variable and bam, I’ve got 4 instances of SQL running in docker, 2 are 2017 and 2 are 2019. This is really incredible. I’m proud of myself.

Oh and even better, I’m doing all this in a PowerShell script, so I can actually make it PART of my presentation!

October 12th 2:26 PM EDT

Send off an email to the Program folks at PASS asking about how the recording stuff works with demos. Eagerly awaiting a reply.

October 15th, another late night

Yes, there’s a theme here, much of my work is being done late at night. It seems to work for me. But dang that deadline is getting closer!

October 16th, late night, again

Watched some Schitt$ Creek with the family. “Why didn’t we start watching this sooner? It’s hilarious! But I need to work on my presentation some more.”

Get all the PowerShell scripts basically done. I’m happy with it, need to work on my speaking script some.

October 19th 3:00 PM EDT

Get off the phone with a fellow Cave Rescue expert. Just before I get off, I mention my upcoming virtual, prerecorded session I have to finish. He says, “Oh, you know I just did 2-3 of those for a rescue conference, exact same format. It worked out really well. I can send you some details and feedback.”

I find that reassuring.

Also recheck email, still no answer from the folks at PASS on my questions about demos, etc.

October 19th, guess what time

I’ve finished everything, even updated the slides and scripts a bit more. I’m a bit worried I’m going to run too long, but decide to do my first of several practice run throughs.

Do my first full run through. Stop and correct a few mistakes or rough edges here and there. I’m not too worried if I run over now since I know I’ve artificially added some time.

October 19th, 42 minutes later

I get done, look at the PowerPoint timer: 42 minutes. “CRAP! I need this to be 60 minutes!” I’m not too worried, I can add more, but I’m not sure where and I don’t want to simply add fluff for the sake of fluff. I need to give this some thought.

Later on October 19th

Talking to a friend of mine who among other things has a background in adult education. She doesn’t know SQL or PowerShell, but she’s a good sounding board and she’s going to sit through my next run-through, not so much for the technical details but to give feedback on the flow and perhaps suggestions on where I may be making too many assumptions on what my listeners will know.

October 20th Early Morning

It’s a Tuesday, time to blog. As always I face that question, what should I blog about?

“I know, I’ll blog about how I’m getting my presentation together and the deadline is fast approaching. I can’t be the only speaker that often finds themselves up against the deadline and panicking.”

Next 36 hours

Add a bit more content and run through it 2-3 more time and then… RECORD! (technically it looks like I have until the 26th to upload my recording, but I want to get done early).

Conclusion

The above may or may not be a wholly accurate timeline or description of the process I’ve gone through trying to get my presentation ready for Pass Virtual Summit. I may have elided a few details and over-hyped a few others, but in general it’s close to true and accurate. Despite my always best intentions, I find myself often working up close to the deadline for submissions. Since for Summit they want NEW presentations, I can’t simply dust-off one of my previous presentations and use that, so there’s definitely more work involved here.

And honestly up until I learned it was going to be prerecorded, I thought I’d have most of October to work on it. The deadline to get the slides and recordings submitted sort of threw my original timeline for working on it in the dumpster so I’m actually a bit further behind than I expected to be.

On the other hand, I really did learn to use Docker and I think that’s valuable and I am making that part of my presentation. And, when all is said and done, I think I’ll be happy with it. I think though like any good speaker, I’ll look back and think “well next time, I’ll have to improve this or that.” There’s always room for improvement. I’m not keen on giving it prerecorded. I value the instantaneous feedback I get from the audience. So that will be different. But I at least can elicit questions during the presentation and there’s a life Q&A afterwards. But, I’ll still be nervous.

I’m in awe of speakers who get their presentations all prepped and prepared months in advance, but I suspect there’s a number out there like me, that don’t operate that way. And I suspect there’s a few who are even more nervous than I thinking, “OMG, am I the only one in this spot?” Nope, you’re not. Or rather, “Please let me know I’m not the only one!”

See you all at Summit, at least virtually!

And in the meantime there’s another possible deadline coming up I need to think about…

Projects You’re Proud Of?

When I present, I start my presentations with a brief bio of myself and one item on there I generally have is a comment that I like to solve problems. This may sound obvious, but it’s true and I think describes my goal well. Yesterday, while on a 3 hour zoom call with a client, we got talking about various projects and it made me think about some of the problems I’ve solved over the years.

There are several I could talk about, but one came up yesterday. Several years ago at a previous client, the head of their call center came to me with an issue. They had a process where they’d export their call center logs and input them into SQL Server. Except to call it a process was a bit of an overstatement. It was a series of about 4-5 steps, all except the initial export were done manually. This meant that every morning one of their IT people would take a file from a Linux server, copy it locally, and then import it into Access where several macros were run on it to transform it and then the person would import it into the SQL Server where they could then run reports. There were several possible areas for mistakes to happen and while mistakes weren’t routine, they tended to happen about once or twice a month. On a good day, it would take about 1/2 an hour to do the manual import, on a bad day, over an hour. So in a month, one of their IT people could easily spend 15 or more hours on it, or over 180 hours a year.

In addition, adding new meta-data into the process was error-prone and he couldn’t do as often as he liked. He asked if I could take a look at it and automate it. While SSIS is not an area of expertise, I was familiar enough with it to know it was a good fit and said I’d work on a solution. It took some effort, but eventually I had a solution in place. The entire process now runs automatically in about 5 minutes and he can add or remove the meta-data he needs to by updating a single SQL table. He was quite pleased.

I’m also proud to say the only real time there’s been an issue with the process is when they had to for business reasons IP their entire internal network. They unfortunately scheduled this for a week when I was not only on vacation, but spending that week at some National Parks and Forests in the South Dakota area. The remoteness of these locations meant that my connectivity was very limited. I let their IP team know what changes had to be made to a config file to make things work, but in the aftermath of other issues they had to deal with this was missed. Fortunately, once I found the right place to sit in the National Forest we were camped in and get enough of a cell signal to log into their network, I was able to make the update and fix things. Since then, things have worked without a hitch.

I like this particular project, not just because it’s been so problem free, but because I think I can clearly point to a problem a client had and that I helped solve. Now that IT person can spend their time on more important issues.

It also is an example of a mantra I think is generally true:

Anything that can be automated should be automated.

There’s other projects I may write about at other times (including a few involving PowerShell) but that’s it for today.

What projects are YOU proud of? I’d love to hear from you.

Trust but Verify

This is one of those posts where you’ll just have to trust me. Honestly.

I want to talk about indexes.

About a week ago, a friend on a chat system I use mentioned how one of their colleagues had mentioned, “oh, we don’t have to optimize the database, the server is fast enough” or words to those effect. All of us in the discussion blanched a bit. Yes, when I started in the business a 10GB database was considered large and because of the memory limit with 32-bit SQL, we were limited to 2GB (or 3GB if you took the right steps) of memory so it was literally impossible to keep a large database in memory. Of course now we routinely deal with databases 100s of GB in size with machines that can easily have .5TB of memory or more. This means except for writes, an entire database can easily be kept in memory.

But that said, optimization still matters. Last week I was debugging an ETL process that I’ve helped a client with. I’d love to show screen shots, but my NDA won’t allow me (hence my asking you to trust me). Ok, that’s partly a lie. I couldn’t provide too many details if I wanted to, but the bigger issue is, I’ve since closed the windows I that showed the scripts in questions and the results of my changes.

One of the last things each step in the ETL does is write back to the source table an updated Sales Force id. It’s actually a bit more complicated because what it really does is write to either a Success table or an Error table and depending on a factor or two, a trigger will then update the source table. I had previously debugged and improved the performance of the trigger. But something was still bothering me about the performance. I looked a bit deeper and one of the things that trigger does if there’s a success is make sure to remove the row from the Error table. This was taking longer than I suspected it should, so I dug into it and I noticed that the Error table had no index.  

I can’t show the original queries I used, but I can show an example of the impact of adding a simple clustered index. (See, you can’t even trust me to say I won’t show any examples! You’d better read the entire post to verify what I’m really writing!)

Here’s an example query (with some changes to hide client specific data)

select * from ErrorTable where SF__External_Id__c='005A000022IouWqIAX'

It’s a very simple query (and simpler than the actual one I was dealing with) but is enough to show the value of a proper index.

Now, in my original query, the Query Tuning Advisor actually suggested an index on SF__External_ID__c. In the example above it didn’t. There’s a canard among many DBAs that the QTA is generally useless and often it is, though I think it’s gotten better. As a consultant, I can often come into a new client and can tell when someone has gone crazy with the QTA and adopted EVERY SINGLE suggestion. In other words, they trusted it, but they never verified it. Why is this a problem? Well at times the QTA can be overly aggressive in my experience, suggesting indices that really provide little benefit, or if you add an index in response to a select query that is run say once a day, but where there are 1000s of updates a day, you might actually slow down your updates (since now the update also has to update the index). And as mentioned above, sometimes it might fail to suggest an index. (I think in this case, it didn’t suggest one on my example because the size of the underlying table was far smaller than before).

So, I like to verify that the index I’ll add will make a difference. In cases like this, I often go old school and simply bracket my test queries

set statistics IO ON
set statistics Time ON
select * from ErrorTable where SF__External_Id__c='005A000022IouWqIAX'
set statistics IO OFF
set statistics Time OFF

And then I enable Actual Execution Plan.

The results I received without any sort of index are below. Some key numbers are highlighted in red.

SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.
SQL Server parse and compile time: 
   CPU time = 47 ms, elapsed time = 63 ms.
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.

(2 rows affected)
Table 'ErrorTable'. Scan count 1, logical reads 3570, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1 row affected)

SQL Server Execution Times:
   CPU time = 16 ms,  elapsed time = 15 ms.
SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.

You’ll notice the physical reads are 0. This is nice. This means everything is in memory.

In this case, because I’m familiar with how the ErrorTable is accessed I decided a clustered index on SF__External_Id__c would be ideal. (all my updates, inserts, deletes, and selects use that to access this table).

I added the index and my reran the query:

SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 1 ms.
SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 0 ms.

(2 rows affected)
Table 'ErrorTable'. Scan count 1, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1 row affected)

SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.

SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 0 ms.

Note the number of logical reads dropped by about a factor of 1000. My elapsed time dropped from 15 ms to 0 ms (or rather less than .05 ms so SQL Server rounded down).

If we look at the graphical query plan results we something similar:

First, without the index:

Trust_but_Verify_Query Table Scan

Table scan to find 2 rows

Trust_but_Verify_Query Table Seek

Table Seek to find 2 rows

That’s nice, I now know I’m doing a seek rather than a scan, but is that enough? I mean if the ErrorTable only has 2 rows, a seek is exactly the same as a scan!

So let’s dig deeper:

Trust_but_Verify_Query Table Scan Details

Query plan showing details for a scan

Trust_but_Verify_Query Table Seek Details

Query plan showing details for a seek

Here you can definitely see the dramatic improvement. Instead of reading in over 100,00 rows (at a bit over 2.5 KB per row, or over 270MB) we only need to read in 2 rows, for a total of just over 5 KB of data.

Now wonder it’s faster. In fact, in the ETL process where it was originally taking about 1 minute to process 1000 rows, my query with the index was now executing 3000 rows in under 10 seconds.

The above is a bit of a contrived example, but it’s based on actual performance tuning I did last week. And this isn’t meant to be a lesson in actual performance tuning, but more to show that if you make a chance (in this case adding an index) you can’t just trust it will work, but you should VERIFY that it has made a difference, and more importantly, that it makes a difference for your workload. I’ve seen GTA often make valid, but useless index suggestions because someone ran an uncommonly used query against it and assumed the recommendation was good. Or, they’ve made assumptions about the size of the table.

So never just trust an index will help, but actually VERIFY it will help.

 

Giving Blood and Pride Month

I gave blood yesterday. It got me thinking. First, let me show a few screenshots:

male blood donor shot 1

7 Male Donor #1 screen shot

female blood donor shot 1

Female Donor #1 screen shot

Let me interject here I’m using the terms Male and Female based on the criteria I selected in the American Red Cross’s Fast Pass screen. More on why I make that distinction further on. But first two more screen shots.

female blood donor shot 2

Pregnancy question highlighted for female

male blood donor shot 2

No pregnancy question for males

Now, on the face of it, this second set of questions especially almost seems to make sense: I mean if I answered Male early on in the questionnaire, why by asked about a pregnancy? But what I’m asked at the beginning is about my gender, not my actual child-bearing capability. Let me quote from Merriam-Webster:

2-b: the behavioral, cultural, or psychological traits typically associated with one sex

Or from the World Health Organization:

Gender refers to the roles, behaviours, activities, attributes and opportunities that any society considers appropriate for girls and boys, and women and men. Gender interacts with, but is different from, the binary categories of biological sex.

Who can be pregnant?

So above, really what the Red Cross is asking isn’t about my gender, but really my ability to be pregnant. Now, this is a valid medical concern. There are risks they want to avoid in regards to pregnant women, or recently pregnant women giving blood. So their ultimate goal isn’t the problem, but their initial assumption might be. A trans-man might still be able to get pregnant, and a trans-woman might be incapable of getting pregnant (as well as a cis-woman might be incapable.) And this is why I had the caveat above about using the terms male and female. I’m using the terms provided which may not be the most accurate.

Assumptions on risk factors

The first set of images is a problematic in another way: it is making assumptions about risk factors. Now, I think we can all agree that keeping blood borne pathogens such as HIV out of the blood supply is a good one. And yes, while donated blood is tested, it can be even safer if people who know they are HIV or at risk for it can potentially self-select themselves out of the donation process.

But…

Let me show the actual question:

Male Male 3 month contact question

Question 21, for Men

This is an improvement over the older restrictions that were at one year and at one point “any time since 1977”. Think about that. If a man had had sex with another man in 1986, but consistently tested negative for HIV/AIDS for the following 30+ years, they could not give blood under previous rules. By the way, I will make a note here that these rules are NOT set by the American Red Cross, but rather by the FDA. So don’t get too angry at the Red Cross for this.

The argument for a 3 month window apparently was based on the fact that HIV tests now are good enough that they can pick up viral particles after that window (i.e. at say 2 months, you may be infected, but the tests may not detect it.)

Based on the CDC information I found today, in 2018, male-to-male sexual contact resulted in 24,933 new infections. The 2nd highest category was heterosexual contact (note the CDC page doesn’t seem to specify the word sexual there.) So yes, statistically it appears male-male sexual contact is a high-risk category.

But…

I know a number of gay and bisexual men. I don’t inquire about their sexual habits. However, a number are either married or appear to be in monogamous relationships. This means if they want to give blood and not lie on the forms, they have to be celibate for at least 3 months at a time!  But hey if you’re a straight guy and had sex with 4 different women in the last week, no problem, as long as you didn’t pay any of them for sex! I’ll add that more than one gay man I know wants to give blood and based on their actual behavior are in a low risk category, but can’t because of the above question.

Why do I bring all this up at the end of Pride Month and what, if anything does it have to do with database design (something I do try to actually write about from time to time)?

As a cis-het male (assigned at birth and still fits me) it’s easy to be oblivious to the problematic nature of the questions on such an innocuous and arguably well-intended  form. The FDA has certain mandates that the Red Cross (and other blood donation agencies) must follow. And I think the mandates are often well-intended. But, there are probably better ways of approaching the goals, in the examples given above, of helping to rule out higher-risk donations. I’ll be honest, I’m not always sure the best way.  To some extent, it might be as simple as rewording the question. In others, it might be necessary to redesign the database to better reflect the realities of gender and sex, after all bits are cheap.

But I want to tie this into something I’ve said before: diversity in hiring is critical and I think we in the data world need to be aware of this. There are several reasons, but I want to focus on one for now.

Our Databases Model the World as We Know It.

The way we build databases is an attempt to model the world. If we are only aware of two genders, we will build our databases to reflect this. But sometimes we have to stop and ask, “do we even need to ask that question?” For one thing, we potentially add the issue of having to deal with Personally Identifiable Information that we don’t really need.  For another, we can make assumptions: “Oh they’re male, they can’t get pregnant so this drug won’t be an issue.”

Now, I’m fortunate enough to have a number of friends who fall into various places on the LGBTQIA+ (and constantly growing collection of letters) panoply and the more I listen, the more complexity I see in the world and how we record it.

This is not to say that you must go out instantly and hire 20 different DBAs, each representing a different identity. That’s obviously not practical. But, I suspect if your staff is made up of cis-het men, your data models may be suffering and you may not even be aware of it!

So, listen to others when they talk about their experiences, do research, get to know more people with experiences and genders and sexualities different from yours. You’ll learn something and you also might build databases. But more importantly, you’ll get to know some great people and become a better person yourself. Trust me on that.

 

 

 

Checking the Setup

A quick post outside of my usual posting schedule.

I was rewriting a T-SQL sproc I have that runs nightly to restore a database from one server to another. It had been failing for reasons beyond the scope of this article. But one of the issues we had was, we didn’t know it was failing. The error-checking was not as good as I would have liked. I decided to add a step that would email me on an error.

That’s easy enough to do. In this case I wanted to be able to use the stored procedure sp_notify_operator. This is useful since I don’t have to worry about passing in an email address or changing it if I need to update things. I can update the operator. However, the various servers at this client had been installed over a several year period and I wasn’t sure that all of them had the same operator configured. And I was curious as to who the emails the operators went to on those machines.  Now, I had a decent number of machines I wanted to check.

Fortunately, due to previous work (and you can read more here) I have a JSON file on my box so I can quickly loop through a list of servers (or if need be by servers in a particular environment like DEV or QA).

$serverobjlist = Get-Content -Raw -Path “$env:HomeDrive$env:HomePath\documents\WindowsPowerShell\Scripts\SQLServerObjectlist.json” | ConvertFrom-Json
 
foreach ($computername in $serverobjlist.computername)
{
$results = Invoke-Sqlcmd -ServerInstance $computername -query “select name, email_address from msdb.dbo.sysoperators”
write-host $computername $results.name $results.email_address
$results = Invoke-Sqlcmd -ServerInstance $computername -query “select name from msdb.dbo.sysmail_profile”
write-host $computername $results.name `n
}

This gave me a list of what operators were on what servers and who the emails went to. Now if this were a production script I’d probably have made things neater, but this worked well enough to do what I needed. Sure enough, one of the servers (ironically one of the ones more recently installed) was missing the standard mail Profile we setup. That was easy to fix because of course I have that scripted out. Open the T-Sql script on that server, run it, and all my servers now had the standard mail profile.

Once I had confirmed my new restore script could run on any of the servers and correctly send email if there was an error it was time to roll it out.

deploy

Successful deploy to the UAT environment

So one quick PowerShell Script, an updated T-SQL Script and a PowerShell Deploy Script and my new sproc has been deployed to UAT and other environments.

And best of all, because it was logged, I knew exactly when I had done it and on what servers and that everything was consistent.

I call that a win for a Monday. How is your week starting?

 

 

Crossing the Threshold…

So it’s the usual story. You need to upgrade a machine, the IT group says, “no problem, we can virtualize it, it’ll be better! Don’t worry, yes, it’ll be fewer CPUs, but they’ll be much faster!”

So, you move forward with the upgrade. Twenty-three meetings later, 3 late nights, one OS upgrade, and two new machines forming one new cluster, you’re good. Things go live.  And then Monday happens. Monday of course is the first full day of business and just so happens to be the busiest day of the week.

Users are complaining. You look at the CPU and it’s hitting 100% routinely. Things are NOT distinctly better.

You look at the CPUs and you notices something striking:

cpu not being used

CPU 8 is showing a problem

4 of the CPUs (several are missing on this graphic) are showing virtually no utilization while the other  8 are going like gang-busters.  Then it hits you, the way the IT group setup the virtual CPUs was not what you needed.  They setup 6 sockets with 2 cores each for a total of 12 cores. This shouldn’t be a problem except that SQL Server Standard Edition uses the lower of either 4 sockets or 24 cores. Because your VM has 6 sockets, SQL Server refuses to use two of them.

You confirm the problem by running the following query:

SELECT scheduler_id, cpu_id, status, is_online FROM sys.dm_os_schedulers

This shows only 8 of your 12 CPUs are marked visible_online.

This is fortunately an easy fix.  A quick outage and your VM is reconfigured to 2 sockets with 6 cores a piece. Your CPU graphs now look like:

better CPU

A better CPU distribution

This is closer to what you want to see, but of course since you’re doing your work at night, you’re not seeing a full load. But you’re happier.

Then Monday happens again.  Things are better, but you’re still not happy. The CPUs are running on average at about 80% utilization. This is definitely better than 100%. But your client’s product manager knows they’ll need more processing power in coming months and running at 80% doesn’t give you much growth potential. The product manager would rather not have to buy more licenses.

So, you go to work. And since I’m tired of writing in the 2nd person, I’ll start writing in 1st person moving forward.

There’s a lot of ways to approach a problem like this, but often when I see heavy CPU usage, I want to see what sort of wait stats I’m dealing with. It may not always give me the best answer, but I find them useful.

Here’s the results of one quick query.

Fortunately, this being a new box, it was running SQL Server 2016 with the latest version service pack and CU.  This mean that I had some more useful data.

CXPackets

CXPackets and CXConsumer telling the tale

Note one of the suggestions: Changing the default Cost Threshold for Parallelism based on observed query cost for your entire workload.

Given the load I had observed, I guessed the Cost Threshold was way too low. It was in fact set to 10.  With that during testing I saw a CPU graph that looked like this:

43 percent CPU

43.5% at Cost Threshold of 10

I decided to change the Cost Threshold to 100 and the graph quickly became:

25 percent CPU

25% at Cost Threshold of 100

Dropping from 43.5% to 25.6%. That’s a savings you can take to the bank!

Of course that could have been a fluke, so I ran several 5 minute snapshots where I would set the threshold to 10, collect some data and then to 100 for 5 minutes and collect data.

CXPacket_10      CXPacket_10_Waittime_MS
635533 5611743
684578 4093190
674500 4428671
CXConsumer_10              CXConsumer_10_Waittime_MS
563830 3551016
595943 2661527
588635 2853673
CXPacket_100   CXPacket_100_Waittime_MS
0 0
41 22
1159 8156
CXConsumer_100            CXConsumer_100_Waittime_MS
0 0
13 29443
847 4328

You can see that over 3 runs the difference between having a threshold of 10 versus 100 made a dramatic difference in the total time spent waiting in the 5 minute window.

The other setting that can play a role in how parallelization can impact performance is MAXDOP. In this case testing didn’t show any real performance differences with changing that value.

At the end of the day though, I call this a good day. A few hours of my consulting time saved the client $1,000s of going down the wrong and expensive road of adding more CPUs and SQL licenses. There’s still room for improvement, but going from a box where only 8 of the 12 CPUs were being used and were running at 100% to a box where the average CPU usage is close to 25% is a good start.

What’s your tuning success story?

Small Victories

Ask most DBAs and they’ll probably tell you they’re not a huge fan of triggers.  They can be useful, but hard to debug.  Events last week reminded me of that. Fortunately a little debugging made a huge difference.

Let me set the scene, but unfortunately since this was work for a client, I can’t really use many screenshots. (or rather to do so would take far too long to sanitize them in the time I allocate to write my weekly blog posts.)

The gist is, my client is working on a process to take data from one system and insert it into their Salesforce system.  To do so, we’re using a 3rd party tool called Pentaho. It’s similar to SSIS in some ways, but based on Java.

Anyway, the process I was debugging was fairly simple. Take account information from the source and upsert it into Salesforce. If the account already existed in Salesforce, great, simply perform an update. If it’s new data, perform an insert.  At the end of the process Pentaho returns a record that contains the original account information and the Salesforce ID.

So far so good. Now, the original author of the system had setup a trigger so when these records are returned it can update the original source account record with the Salesforce ID if it didn’t exist previously. I should note that updating the accounts is just one of many possible transformations the entire process runs.

After working on the Pentaho ETL (extract, transform, load) for a bit and getting it stable, I decided to focus on performance. There appeared to be two main areas of slowness, the upsert to Salesforce and the handling of the returned records. Now, I had no insight into the Salesforce side of things, so I decided to focus on handling the returned records.

The problem of course was that Pentaho was sort of hiding what it was doing. I had to get some insight there. I knew it was doing an Insert into a master table of successful records and then a trigger to update the original account.

Now,  being a 21st Century DBA and taking into account Grant Fritchey’s blog post on Extended Events I had previously setup a Extended Events Session on this database. I had to tweak it a bit, but I got what I wanted in short order.

CREATE EVENT SESSION [Pentaho Trace SalesForceData] ON SERVER
ADD EVENT sqlserver.existing_connection(
    ACTION(sqlserver.session_id)
    WHERE ([sqlserver].[username]=N'TempPentaho')),
ADD EVENT sqlserver.login(SET collect_options_text=(1)
    ACTION(sqlserver.session_id)
    WHERE ([sqlserver].[username]=N'TempPentaho')),
ADD EVENT sqlserver.logout(
    ACTION(sqlserver.session_id)
    WHERE ([sqlserver].[username]=N'TempPentaho')),
ADD EVENT sqlserver.rpc_starting(
    ACTION(sqlserver.session_id)
    WHERE ([package0].[greater_than_uint64]([sqlserver].[database_id],(4)) AND [package0].[equal_boolean]([sqlserver].[is_system],(0)) AND [sqlserver].[username]=N'TempPentaho')),
ADD EVENT sqlserver.sql_batch_completed(
    ACTION(sqlserver.session_id)
    WHERE ([package0].[greater_than_uint64]([sqlserver].[database_id],(4)) AND [package0].[equal_boolean]([sqlserver].[is_system],(0)) AND [sqlserver].[username]=N'TempPentaho')),
ADD EVENT sqlserver.sql_batch_starting(
    ACTION(sqlserver.session_id)
    WHERE ([package0].[greater_than_uint64]([sqlserver].[database_id],(4)) AND [package0].[equal_boolean]([sqlserver].[is_system],(0)) AND [sqlserver].[username]=N'TempPentaho'))
ADD TARGET package0.ring_buffer(SET max_memory=(1024000))
WITH (MAX_MEMORY=4096 KB,EVENT_RETENTION_MODE=ALLOW_SINGLE_EVENT_LOSS,MAX_DISPATCH_LATENCY=30 SECONDS,MAX_EVENT_SIZE=0 KB,MEMORY_PARTITION_MODE=NONE,TRACK_CAUSALITY=ON,STARTUP_STATE=OFF)
GO

It’s not much, but it lets me watch incoming transactions.

I could then fire off the ETL in question and capture some live data. A typical returned result looked like

exec sp_execute 1,N'SourceData',N'GQF',N'Account',N'1962062',N'a6W4O00000064zbUAA','2019-10-11 13:07:22.8270000',N'neALaRggAlD/Y/T4ign0vOA==L',N'Upsert Success'

Now that’s not much, but I knew what the Insert statement looked like so I could build an insert statement wrapped with a begin tran/rollback around it so I could test the insert without actually changing my data.  I then tossed in some set statistics IO ON and enabled Include Actual Execution Plan so I could see what was happening.

“Wait, what’s this? What’s this 300K rows read? And why is it doing a clustered index scan on this table?”  This was a disconcerting. The field I was comparing was the clustered index, it should be a seek!

So I looked more closely at the trigger. There were two changes I ended up making.

       -- Link Accounts
       --MERGE INTO GQF_AccountMaster T
       --USING Inserted S
       --ON (CAST(T.ClientId AS VARCHAR(255)) = S.External_Id__c
       --AND S.Transformation in ('Account'))
       --WHEN MATCHED THEN UPDATE
       --SET T.SFID = S.Id
       --;
       
       if (select transformation from Inserted) ='Account'
       begin
              MERGE INTO GQF_AccountMaster T
              USING Inserted S
              ON T.ClientId  = S.External_Id__c
              WHEN MATCHED THEN UPDATE
              SET T.SFID = S.Id
       end

An astute DBA will notice that CAST in there.  Given the design, the Inserted table field External_Id__C is sort of a catch all for all sorts of various types of IDs and some in fact could be up to 255 characters. However, in the case of an Account it’s a varchar(10).

The original developer probably put the CAST in there since they didn’t want to blow up the Merge statement if it compared a transformation other than an Account. (From what I can tell, T-SQL does not guarantee short-circuit evaluation, if I’m wrong, please let me know and point me to definitive documentation.) However, the minute you cast that, you lose the ability to seek using the index, you have to use a scan.

So I rewrote the commented section into an IF to guarantee we were only dealing with Account transformations and then I stripped out the cast.

Then I reran and watched. My index scan of 300K rows was down to a seek of 3 rows. The trigger now performed in subsecond time. Not bad for an hour or so of work. That and some other improvements meant that now we could handle a few 1000 inserts and updates in the time it was previously taking to do 10 or so.  It’s one of those days where I like to think my client got their money’s worth out of me.

Slight note: Next week I will be at PASS Summit so not sure if/when I’ll be blogging. But follow me on Twitter @stridergdm.

Call 911, If You Can

Also known as “things have changed”

For one my of clients I monitor and maintain some of the jobs that run on their various servers. One of them had started to fail about two weeks ago. The goal of the job was basically to download a file from one server, transfer it to another and upload it.  Easy-peasy. However, sometimes the job fails because there’s no file to transfer (which really shouldn’t be a failure, but just a warning).  So, despite the fact that it had failed multiple days in a row, I hadn’t looked at it. And of course no one was complaining (though that’s not always a good reason to ignore a job failure!)

So yesterday I took a look and realized the error message was in fact incorrect. It wasn’t failing because of a lack of a new file, but because it could no longer log into the primary server. A quick test showed the password had been changed. This didn’t really surprise me as this client is going through and updating a number of accounts and passwords. This was simply obviously one and we had missed this one. (Yes, this is where better documentation would obviously be a good idea.  We’re working on that.)

So, I figured the fix would be easy, simply email the right person, get the new password and update the process.  I also was taking the time to update the script to that the password would be encrypted moving forward, right now it’s in plain text and to give the correct error in the event of login in failure.

Well, the person who should have the password wasn’t even aware of this process. As we exchanged emails, and the lead developer chimed in, the conclusion was that this process probably shouldn’t be using this account, and that perhaps even then, this process may no longer be necessary.

So, now my job is to track down the person who did or does rely on this process, find out if they still are and then finish updating the password.  Of course if they’re not, we’ll stop this process. In some ways that’s preferable since it’s one less place to worry about a password and one less place to maintain.

Now, the above details are somewhat specific to this particular job, but, I’m sure all of us have found a job running on a server someplace and wondered, “What is this doing?” Sometimes we find out it’s still important. Sometimes we discover that it’s no longer necessary. In a perfect world, our documentation would always be up to date and our procedures would be such that we immediately remove unnecessary jobs.

But the real world is far messier unfortunately.

(and since the full photo got cropped in the header, here it is again)

Call 911. If you can

Apparently not only can guest rooms can not be called from this phone

And as a reminder, if you enjoy my posts, please make sure to subscribe.

 

SSMS 2017 RegEx

A short technical post on one thing I’ve found annoying.

Anyone who has worked with computers in the last decade has probably used regular expressions in some form or another, or “regexs” as they’re known as. First (as far as I know) popularized in Perl (though their history stretches back to the 1950s), they’ve become standard fair in most languages and tools and are very useful for complex matching and find and replaces.  And for anyone who has worked with computers for over two decades they often look like line noise when someone would pick up the phone in the house when you were dialed in via an actual modem.

That said, I first learned about their value in SQL Server Management Studio due to a great talk by Sean McCown. (note the post’s byline is Jen’s, but trust me, I saw Sean give the talk. 😉

One of the powerful features is the ability to “tag” part of an expression so you can use it in your replace statement (see the above link for more details.)

But, here’s the thing, somewhere along the line (I think it was post SSMS 2014) Microsoft changed the rules of the game and it can be hard to find the new rules!  They have a post on using regexp in SSMS 2017. But as far as I can tell, it’s the old 2014 info, simply rebranded. Some of it, especially the tagging part does not appear to work for me. If anyone CAN make it work in 2017, please let me know how.

Let me give you an example. Just today I was given a script that had a lot of statements similar to:

DROP TABLE If Exists ##TempAttribute

Let me say I LOVE the new “If Exists” option for DROP Table, but, I needed to run this on SQL Server 2008R2 (I know, don’t ask!) and that syntax won’t work.

I needed to replace it with something like:

IF OBJECT_ID('##TempAttribute', 'U') IS NOT NULL
  DROP TABLE ##TempAttribute; 

Now, I’m naturally lazy and didn’t want to have to find and replace all 100 or so instances of this by hand. So, I reached for regexes… and… well it didn’t go well.

Based on the old syntax my find should look something like for the find:

DROP Table if exists {\#\#[A-z_0-9]+}

And for the replace

if object_ID('\1', 'U') is not null drop table \1;

Except, for the life of me, that wasn’t working. Every time I tried to tag the table name using the braces {} my regex would fall apart.  So of course I searched and got the link from Microsoft above that still suggests tagging with braces.  From previous experience I knew it was wrong, but that IS the official Microsoft page after all, so I kept doubting myself.

But, I was right in remembering things had changed.

The proper syntax is:

Drop table if exists (\#\#[A-z_0-9]+)

and

if object_ID('$1', 'U') is not null drop table $1;

The change to the search expression is subtle, changes braces to curved parenthesis .

The change to the replace is about the same, changing a \ to a $ but I suspect (I have not confirmed) that you’re no longer limited to just 9 tagged expressions.

And before anyone chimes in, I do realize there are some other ways of writing the search expression (such as I could have used :w+ instead in SSMS 2014 that would have worked in my particular case, since there were no temp tables with numbers, but this would not have worked in SSMS 2017), but this worked for me and wasn’t the point of this post. My goal was to focus on the change in tagging expressions.

Regular Expressions are still one of those things that don’t come very easily to me so I often struggle with the syntax and it doesn’t help when Microsoft changes the syntax rules between versions of SSMS (my understand they did so to make SSMS functionality better match Visual Studio, so I’m ok with this), but overall, I find them EXTREMELY useful and if you haven’t played with them, I highly recommend you start learning. There can be a bit of a learning curve for anything too complex, but it’s worth it.

My advice, start with learning how to “grab” the beginning and the “end” of a line and then go from there. This is the most useful thing to me at times when I want to add or remove something at the start or end of every line.

Happy expressing!

DTSX Error

Not really a blog post of the typical form, this is more so add content to the Internet and hopefully have Google find it for someone else.

So, I inherited a DTSX package from a former project. Who hasn’t been in that position before, right?

No problem I could make most of it do what I wanted except for ONE Data Flow Task. Or more accurately, ADO NET Source.  This was connecting to a 3rd party database on the client server.  Not a problem, except I can’t hit that 3rd party database from my desktop and, unfortunately, I can’t install Visual Studio on the client’s server. So, for most of my changes, I had to disable that data flow task to make my other edits.  Annoying, but not a show-stopper in this particular case.

Until… I had to actually edit that Source.  I could not add new OUTPUT Columns under the Source Output.  I think this is because I couldn’t connect to the actual data source to validate stuff. I could be wrong. But anyway, I had to resort to editing the XML directly. This is always a bit dangerous, but Danger is my middle name. (Ok, maybe not, but my middle initial IS D.)

And then I committed my changes, loaded it to the client computer and ran it.

Well, sort of.  The data flowed like it should and then I got:

System.NullReferenceException: Object reference not set to an instance of an object.   at Microsoft.SqlServer.Dts.Pipeline.DataReaderSourceAdapter.PrimeOutput(Int32 outputs, Int32[] outputIDs, PipelineBuffer[] buffers)  at Microsoft.SqlServer.Dts.Pipeline.ManagedComponentHost.HostPrimeOutput(IDTSManagedComponentWrapper100 wrapper, Int32 outputs, Int32[] outputIDs, IDTSBuffer100[] buffers, IntPtr ppBufferWirePacket)

The other error was:

SSIS Error Code DTS_E_PRIMEOUTPUTFAILED.  The PrimeOutput method on ADO Source returned error code 0x80004003.  The component returned a failure code when the pipeline engine called PrimeOutput(). The meaning of the failure code is defined by the component, but the error is fatal and the pipeline stopped executing.  There may be error messages posted before this with more information about the failure

I added some more error handling and tried everything, but, I couldn’t stop the error, even though all the data actually was flowing.  This was weird. The data WAS flowing exactly the way I wanted. But, the package would fail due to the above error.

I finally created a test package with JUST the Data Flow Task and tried debugging that. I still had no luck. But at least the XML was far easier to parse.

After looking at it for the 42nd time, I finally noticed… I had added the column to the Output Columns under the ADO NET Source Output, but I had NOT put it under the ADO NET Source Error Output.

So, even though there were no errors, apparently DTSX would still fail because of the missing column. Once I added that, everything was solved.