Giving Blood and Pride Month

I gave blood yesterday. It got me thinking. First, let me show a few screenshots:

male blood donor shot 1

Male Donor #1 screen shot

female blood donor shot 1

Female Donor #1 screen shot

Let me interject here: I’m using the terms Male and Female based on the criteria I selected in the American Red Cross’s Fast Pass screen. More on why I make that distinction further on. But first, two more screen shots.

female blood donor shot 2

Pregnancy question highlighted for female

male blood donor shot 2

No pregnancy question for males

Now, on the face of it, this second set of questions especially almost seems to make sense: I mean, if I answered Male early on in the questionnaire, why be asked about a pregnancy? But what I’m asked at the beginning is about my gender, not my actual child-bearing capability. Let me quote from Merriam-Webster:

2-b: the behavioral, cultural, or psychological traits typically associated with one sex

Or from the World Health Organization:

Gender refers to the roles, behaviours, activities, attributes and opportunities that any society considers appropriate for girls and boys, and women and men. Gender interacts with, but is different from, the binary categories of biological sex.

Who can be pregnant?

So really, what the Red Cross is asking about isn’t my gender, but my ability to be pregnant. Now, this is a valid medical concern. There are risks they want to avoid with regard to pregnant, or recently pregnant, women giving blood. So their ultimate goal isn’t the problem, but their initial assumption might be. A trans-man might still be able to get pregnant, and a trans-woman might be incapable of getting pregnant (just as a cis-woman might be). And this is why I had the caveat above about using the terms male and female: I’m using the terms provided, which may not be the most accurate.

Assumptions on risk factors

The first set of images is problematic in another way: it makes assumptions about risk factors. Now, I think we can all agree that keeping blood-borne pathogens such as HIV out of the blood supply is a good goal. And yes, while donated blood is tested, it can be even safer if people who know they are HIV-positive, or at risk for it, can self-select out of the donation process.

But…

Let me show the actual question:

Male-male 3 month contact question

Question 21, for Men

This is an improvement over the older restrictions, which were at one year and at one point “any time since 1977”. Think about that. If a man had had sex with another man in 1986, but consistently tested negative for HIV/AIDS for the following 30+ years, he could not give blood under the previous rules. By the way, I will note here that these rules are NOT set by the American Red Cross, but rather by the FDA. So don’t get too angry at the Red Cross for this.

The argument for a 3-month window apparently was based on the fact that HIV tests are now good enough to pick up viral particles after that window (i.e., at say 2 months you may be infected, but the tests may not detect it).

Based on the CDC information I found today, in 2018, male-to-male sexual contact resulted in 24,933 new infections. The 2nd highest category was heterosexual contact (note the CDC page doesn’t seem to specify the word sexual there.) So yes, statistically it appears male-male sexual contact is a high-risk category.

But…

I know a number of gay and bisexual men. I don’t inquire about their sexual habits. However, a number are either married or appear to be in monogamous relationships. This means if they want to give blood and not lie on the forms, they have to be celibate for at least 3 months at a time! But hey, if you’re a straight guy who had sex with 4 different women in the last week, no problem, as long as you didn’t pay any of them for sex! I’ll add that more than one gay man I know wants to give blood and, based on his actual behavior, is in a low-risk category, but can’t because of the above question.

Why do I bring all this up at the end of Pride Month, and what, if anything, does it have to do with database design (something I do try to actually write about from time to time)?

As a cis-het male (assigned at birth and still fits me) it’s easy to be oblivious to the problematic nature of the questions on such an innocuous and arguably well-intended form. The FDA has certain mandates that the Red Cross (and other blood donation agencies) must follow, and I think the mandates are often well-intended. But there are probably better ways of approaching the goals, in the examples given above, of helping to rule out higher-risk donations. I’ll be honest, I’m not always sure of the best way. To some extent, it might be as simple as rewording the question. In other cases, it might be necessary to redesign the database to better reflect the realities of gender and sex; after all, bits are cheap.
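
To make that concrete, here’s a rough sketch of what a donor table might look like if we stop conflating the concepts. The table and column names are mine and purely illustrative, not anything the Red Cross actually uses:

CREATE TABLE dbo.Donor
(
    DonorId            INT IDENTITY(1,1) PRIMARY KEY,
    GenderIdentity     NVARCHAR(50) NULL, -- self-described; not used for medical screening
    SexAssignedAtBirth CHAR(1) NULL,      -- captured only if it is actually needed
    CanBecomePregnant  BIT NULL,          -- the medical question, asked directly
    RecentlyPregnant   BIT NULL           -- rather than inferred from gender
);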

But I want to tie this into something I’ve said before: diversity in hiring is critical and I think we in the data world need to be aware of this. There are several reasons, but I want to focus on one for now.

Our Databases Model the World as We Know It.

The way we build databases is an attempt to model the world. If we are only aware of two genders, we will build our databases to reflect this. But sometimes we have to stop and ask, “do we even need to ask that question?” For one thing, we potentially add the issue of having to deal with Personally Identifiable Information that we don’t really need. For another, we can make assumptions: “Oh, they’re male, they can’t get pregnant, so this drug won’t be an issue.”

Now, I’m fortunate enough to have a number of friends who fall into various places on the LGBTQIA+ (and constantly growing collection of letters) panoply and the more I listen, the more complexity I see in the world and how we record it.

This is not to say that you must go out instantly and hire 20 different DBAs, each representing a different identity. That’s obviously not practical. But, I suspect if your staff is made up of cis-het men, your data models may be suffering and you may not even be aware of it!

So, listen to others when they talk about their experiences, do research, and get to know more people with experiences and genders and sexualities different from yours. You’ll learn something, and you might also build better databases. But more importantly, you’ll get to know some great people and become a better person yourself. Trust me on that.


Checking the Setup

A quick post outside of my usual posting schedule.

I was rewriting a T-SQL sproc I have that runs nightly to restore a database from one server to another. It had been failing for reasons beyond the scope of this article. But one of the issues we had was that we didn’t know it was failing; the error-checking was not as good as I would have liked. I decided to add a step that would email me on an error.

That’s easy enough to do. In this case I wanted to use the stored procedure sp_notify_operator. This is useful since I don’t have to worry about passing in an email address or changing it if I need to update things; I can just update the operator. However, the various servers at this client had been installed over a several-year period and I wasn’t sure that all of them had the same operator configured. I was also curious as to where the operators’ emails went on those machines. And I had a decent number of machines I wanted to check.
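
For reference, the notification step looks roughly like this (the operator, profile, and sproc names here are placeholders, not the client’s actual ones):

BEGIN TRY
    EXEC dbo.NightlyRestore;   -- placeholder for the actual restore sproc
END TRY
BEGIN CATCH
    DECLARE @body NVARCHAR(MAX) =
        N'Nightly restore failed on ' + @@SERVERNAME + N': ' + ERROR_MESSAGE();
    EXEC msdb.dbo.sp_notify_operator
        @profile_name = N'Standard Mail Profile',
        @name         = N'DBA Team',
        @subject      = N'Nightly restore failure',
        @body         = @body;
END CATCH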

Fortunately, due to previous work (and you can read more here) I have a JSON file on my box so I can quickly loop through a list of servers (or if need be by servers in a particular environment like DEV or QA).

# Load the list of servers from the JSON file described in the earlier post
$serverobjlist = Get-Content -Raw -Path "$env:HomeDrive$env:HomePath\documents\WindowsPowerShell\Scripts\SQLServerObjectlist.json" | ConvertFrom-Json

foreach ($computername in $serverobjlist.computername)
{
    # Who are the operators on this server and where do their emails go?
    $results = Invoke-Sqlcmd -ServerInstance $computername -Query "select name, email_address from msdb.dbo.sysoperators"
    Write-Host $computername $results.name $results.email_address

    # What Database Mail profiles exist on this server?
    $results = Invoke-Sqlcmd -ServerInstance $computername -Query "select name from msdb.dbo.sysmail_profile"
    Write-Host $computername $results.name `n
}

This gave me a list of what operators were on what servers and where the emails went. Now, if this were a production script I’d probably have made things neater, but this worked well enough to do what I needed. Sure enough, one of the servers (ironically one of the more recently installed ones) was missing the standard mail profile we set up. That was easy to fix because of course I have that scripted out. Open the T-SQL script on that server, run it, and all my servers now had the standard mail profile.
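
If you don’t have yours scripted out, it’s worth doing; it’s only a few calls along these lines (the account, profile, and server names here are made up):

EXEC msdb.dbo.sysmail_add_account_sp
    @account_name    = N'Standard Mail Account',
    @email_address   = N'sqlalerts@example.com',
    @mailserver_name = N'smtp.example.com';

EXEC msdb.dbo.sysmail_add_profile_sp
    @profile_name = N'Standard Mail Profile',
    @description  = N'Standard profile for job and alert notifications';

EXEC msdb.dbo.sysmail_add_profileaccount_sp
    @profile_name    = N'Standard Mail Profile',
    @account_name    = N'Standard Mail Account',
    @sequence_number = 1;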

Once I had confirmed my new restore script could run on any of the servers and correctly send email if there was an error, it was time to roll it out.

deploy

Successful deploy to the UAT environment

So one quick PowerShell script, an updated T-SQL script, and a PowerShell deploy script later, my new sproc has been deployed to UAT and other environments.

And best of all, because it was logged, I knew exactly when I had done it and on what servers and that everything was consistent.

I call that a win for a Monday. How is your week starting?


Crossing the Threshold…

So it’s the usual story. You need to upgrade a machine, the IT group says, “no problem, we can virtualize it, it’ll be better! Don’t worry, yes, it’ll be fewer CPUs, but they’ll be much faster!”

So, you move forward with the upgrade. Twenty-three meetings later, 3 late nights, one OS upgrade, and two new machines forming one new cluster, you’re good. Things go live.  And then Monday happens. Monday of course is the first full day of business and just so happens to be the busiest day of the week.

Users are complaining. You look at the CPU and it’s hitting 100% routinely. Things are NOT distinctly better.

You look at the CPUs and you notice something striking:

cpu not being used

CPU 8 is showing a problem

4 of the CPUs (several are missing from this graphic) are showing virtually no utilization while the other 8 are going like gangbusters. Then it hits you: the way the IT group set up the virtual CPUs was not what you needed. They set up 6 sockets with 2 cores each for a total of 12 cores. This shouldn’t be a problem, except that SQL Server Standard Edition uses the lesser of 4 sockets or 24 cores. Because your VM has 6 sockets, SQL Server refuses to use two of them.

You confirm the problem by running the following query:

SELECT scheduler_id, cpu_id, status, is_online FROM sys.dm_os_schedulers

This shows that only 8 of your 12 schedulers are marked VISIBLE ONLINE.

This is fortunately an easy fix. A quick outage and your VM is reconfigured to 2 sockets with 6 cores apiece. Your CPU graphs now look like:

better CPU

A better CPU distribution

This is closer to what you want to see, but of course since you’re doing your work at night, you’re not seeing a full load. But you’re happier.

Then Monday happens again.  Things are better, but you’re still not happy. The CPUs are running on average at about 80% utilization. This is definitely better than 100%. But your client’s product manager knows they’ll need more processing power in coming months and running at 80% doesn’t give you much growth potential. The product manager would rather not have to buy more licenses.

So, you go to work. And since I’m tired of writing in the 2nd person, I’ll start writing in 1st person moving forward.

There are a lot of ways to approach a problem like this, but often when I see heavy CPU usage, I want to see what sort of wait stats I’m dealing with. They may not always give me the best answer, but I find them useful.
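
If you want to roll your own rather than use a canned script, something along these lines will get you started (a simplified sketch, not the exact query I ran; a real version filters out far more of the benign wait types):

SELECT TOP (10)
    wait_type,
    waiting_tasks_count,
    wait_time_ms,
    signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (N'SLEEP_TASK', N'LAZYWRITER_SLEEP', N'CHECKPOINT_QUEUE',
                        N'SQLTRACE_BUFFER_FLUSH', N'BROKER_TASK_STOP')
ORDER BY wait_time_ms DESC;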

Here are the results of one quick query.

Fortunately, this being a new box, it was running SQL Server 2016 with the latest service pack and CU. This meant that I had some more useful data.

CXPackets

CXPackets and CXConsumer telling the tale

Note one of the suggestions: Changing the default Cost Threshold for Parallelism based on observed query cost for your entire workload.

Given the load I had observed, I guessed the Cost Threshold was way too low. It was in fact set to 10. With that setting, during testing I saw a CPU graph that looked like this:

43 percent CPU

43.5% at Cost Threshold of 10

I decided to change the Cost Threshold to 100 and the graph quickly became:

25 percent CPU

25% at Cost Threshold of 100

Dropping from 43.5% to 25.6%. That’s a savings you can take to the bank!
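
For the record, flipping the setting only takes a couple of lines (cost threshold is an advanced option, so it has to be exposed first):

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'cost threshold for parallelism', 100;   -- or 10 for the comparison runs
RECONFIGURE;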

Of course that could have been a fluke, so I ran several 5-minute snapshots where I would set the threshold to 10, collect some data, and then set it to 100 for 5 minutes and collect data.

CXPacket_10     CXPacket_10_Waittime_MS
635533          5611743
684578          4093190
674500          4428671

CXConsumer_10   CXConsumer_10_Waittime_MS
563830          3551016
595943          2661527
588635          2853673

CXPacket_100    CXPacket_100_Waittime_MS
0               0
41              22
1159            8156

CXConsumer_100  CXConsumer_100_Waittime_MS
0               0
13              29443
847             4328

You can see that over the 3 runs, a threshold of 100 versus 10 made a dramatic difference in the total time spent waiting during the 5-minute window.

The other setting that can play a role in how parallelization can impact performance is MAXDOP. In this case testing didn’t show any real performance differences with changing that value.
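
One low-impact way to test that, rather than changing the server-wide setting, is to hang a query-level hint on one of the heavier queries and compare (dbo.BigTable here is just a stand-in for one of the actual workload tables):

SELECT COUNT(*) FROM dbo.BigTable OPTION (MAXDOP 1);   -- force the query to run serially
SELECT COUNT(*) FROM dbo.BigTable OPTION (MAXDOP 4);   -- allow up to 4 CPUs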

At the end of the day though, I call this a good day. A few hours of my consulting time saved the client thousands of dollars by avoiding the wrong, and expensive, road of adding more CPUs and SQL licenses. There’s still room for improvement, but going from a box where only 8 of the 12 CPUs were being used and running at 100%, to a box where the average CPU usage is close to 25%, is a good start.

What’s your tuning success story?

Small Victories

Ask most DBAs and they’ll probably tell you they’re not huge fans of triggers. They can be useful, but they’re hard to debug. Events last week reminded me of that. Fortunately, a little debugging made a huge difference.

Let me set the scene, though unfortunately, since this was work for a client, I can’t really use many screenshots (or rather, to do so would take far too long to sanitize them in the time I allocate to write my weekly blog posts).

The gist is, my client is working on a process to take data from one system and insert it into their Salesforce system.  To do so, we’re using a 3rd party tool called Pentaho. It’s similar to SSIS in some ways, but based on Java.

Anyway, the process I was debugging was fairly simple. Take account information from the source and upsert it into Salesforce. If the account already existed in Salesforce, great, simply perform an update. If it’s new data, perform an insert.  At the end of the process Pentaho returns a record that contains the original account information and the Salesforce ID.

So far so good. Now, the original author of the system had set up a trigger so that when these records are returned, it can update the original source account record with the Salesforce ID if it didn’t exist previously. I should note that updating the accounts is just one of many possible transformations the entire process runs.

After working on the Pentaho ETL (extract, transform, load) for a bit and getting it stable, I decided to focus on performance. There appeared to be two main areas of slowness, the upsert to Salesforce and the handling of the returned records. Now, I had no insight into the Salesforce side of things, so I decided to focus on handling the returned records.

The problem of course was that Pentaho was sort of hiding what it was doing, so I had to get some insight there. I knew it was doing an insert into a master table of successful records and then a trigger was updating the original account.

Now, being a 21st Century DBA and taking into account Grant Fritchey’s blog post on Extended Events, I had previously set up an Extended Events session on this database. I had to tweak it a bit, but I got what I wanted in short order.

CREATE EVENT SESSION [Pentaho Trace SalesForceData] ON SERVER
ADD EVENT sqlserver.existing_connection(
    ACTION(sqlserver.session_id)
    WHERE ([sqlserver].[username]=N'TempPentaho')),
ADD EVENT sqlserver.login(SET collect_options_text=(1)
    ACTION(sqlserver.session_id)
    WHERE ([sqlserver].[username]=N'TempPentaho')),
ADD EVENT sqlserver.logout(
    ACTION(sqlserver.session_id)
    WHERE ([sqlserver].[username]=N'TempPentaho')),
ADD EVENT sqlserver.rpc_starting(
    ACTION(sqlserver.session_id)
    WHERE ([package0].[greater_than_uint64]([sqlserver].[database_id],(4)) AND [package0].[equal_boolean]([sqlserver].[is_system],(0)) AND [sqlserver].[username]=N'TempPentaho')),
ADD EVENT sqlserver.sql_batch_completed(
    ACTION(sqlserver.session_id)
    WHERE ([package0].[greater_than_uint64]([sqlserver].[database_id],(4)) AND [package0].[equal_boolean]([sqlserver].[is_system],(0)) AND [sqlserver].[username]=N'TempPentaho')),
ADD EVENT sqlserver.sql_batch_starting(
    ACTION(sqlserver.session_id)
    WHERE ([package0].[greater_than_uint64]([sqlserver].[database_id],(4)) AND [package0].[equal_boolean]([sqlserver].[is_system],(0)) AND [sqlserver].[username]=N'TempPentaho'))
ADD TARGET package0.ring_buffer(SET max_memory=(1024000))
WITH (MAX_MEMORY=4096 KB,EVENT_RETENTION_MODE=ALLOW_SINGLE_EVENT_LOSS,MAX_DISPATCH_LATENCY=30 SECONDS,MAX_EVENT_SIZE=0 KB,MEMORY_PARTITION_MODE=NONE,TRACK_CAUSALITY=ON,STARTUP_STATE=OFF)
GO

It’s not much, but it lets me watch incoming transactions.

I could then fire off the ETL in question and capture some live data. A typical returned result looked like

exec sp_execute 1,N'SourceData',N'GQF',N'Account',N'1962062',N'a6W4O00000064zbUAA','2019-10-11 13:07:22.8270000',N'neALaRggAlD/Y/T4ign0vOA==L',N'Upsert Success'

Now that’s not much, but I knew what the Insert statement looked like, so I could build an insert statement wrapped with a begin tran/rollback around it and test the insert without actually changing my data. I then tossed in SET STATISTICS IO ON and enabled Include Actual Execution Plan so I could see what was happening.
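
The harness looked more or less like this (the table and column names below are stand-ins; the real insert matched the actual master table the trigger fires on):

SET STATISTICS IO ON;

BEGIN TRAN;

INSERT INTO dbo.SalesForceReturnData   -- stand-in name for the real master table
    (SourceSystem, Transformation, External_Id__c, Id, ResultMessage)
VALUES
    (N'GQF', N'Account', N'1962062', N'a6W4O00000064zbUAA', N'Upsert Success');

ROLLBACK;   -- see the reads and the actual plan without changing any data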

“Wait, what’s this? What’s this 300K rows read? And why is it doing a clustered index scan on this table?” This was disconcerting. The field I was comparing on was the clustered index key; it should have been a seek!

So I looked more closely at the trigger. There were two changes I ended up making.

       -- Link Accounts
       --MERGE INTO GQF_AccountMaster T
       --USING Inserted S
       --ON (CAST(T.ClientId AS VARCHAR(255)) = S.External_Id__c
       --AND S.Transformation in ('Account'))
       --WHEN MATCHED THEN UPDATE
       --SET T.SFID = S.Id
       --;
       
       if (select transformation from Inserted) ='Account'
       begin
              MERGE INTO GQF_AccountMaster T
              USING Inserted S
              ON T.ClientId  = S.External_Id__c
              WHEN MATCHED THEN UPDATE
              SET T.SFID = S.Id
       end

An astute DBA will notice that CAST in there. Given the design, the Inserted table field External_Id__c is sort of a catch-all for all sorts of various types of IDs, and some in fact could be up to 255 characters. However, in the case of an Account it’s a varchar(10).

The original developer probably put the CAST in there since they didn’t want to blow up the MERGE statement if it compared a transformation other than an Account. (From what I can tell, T-SQL does not guarantee short-circuit evaluation; if I’m wrong, please let me know and point me to definitive documentation.) However, the minute you cast that column, you lose the ability to seek using the index; you have to use a scan.

So I rewrote the commented section into an IF to guarantee we were only dealing with Account transformations and then I stripped out the cast.

Then I reran and watched. My index scan of 300K rows was down to a seek of 3 rows. The trigger now performed in subsecond time. Not bad for an hour or so of work. That and some other improvements meant we could now handle a few thousand inserts and updates in the time it previously took to do 10 or so. It’s one of those days where I like to think my client got their money’s worth out of me.

Slight note: Next week I will be at PASS Summit so not sure if/when I’ll be blogging. But follow me on Twitter @stridergdm.

Call 911, If You Can

Also known as “things have changed”

For one of my clients I monitor and maintain some of the jobs that run on their various servers. One of them had started to fail about two weeks ago. The goal of the job was basically to download a file from one server, transfer it to another, and upload it. Easy-peasy. However, sometimes the job fails because there’s no file to transfer (which really shouldn’t be a failure, but just a warning). So, despite the fact that it had failed multiple days in a row, I hadn’t looked at it. And of course no one was complaining (though that’s not always a good reason to ignore a job failure!)

So yesterday I took a look and realized the error message was in fact incorrect. It wasn’t failing because of a lack of a new file, but because it could no longer log into the primary server. A quick test showed the password had been changed. This didn’t really surprise me, as this client is going through and updating a number of accounts and passwords. This was obviously one of those, and we had missed it. (Yes, this is where better documentation would obviously be a good idea. We’re working on that.)

So, I figured the fix would be easy: simply email the right person, get the new password, and update the process. I was also taking the time to update the script so that the password would be encrypted moving forward (right now it’s in plain text) and so that it would give the correct error in the event of a login failure.

Well, the person who should have the password wasn’t even aware of this process. As we exchanged emails, and the lead developer chimed in, the conclusion was that this process probably shouldn’t be using this account, and that perhaps even then, this process may no longer be necessary.

So, now my job is to track down the person who did or does rely on this process, find out if they still do, and then finish updating the password. Of course, if they’re not relying on it, we’ll stop the process. In some ways that’s preferable, since it’s one less place to worry about a password and one less place to maintain.

Now, the above details are somewhat specific to this particular job, but, I’m sure all of us have found a job running on a server someplace and wondered, “What is this doing?” Sometimes we find out it’s still important. Sometimes we discover that it’s no longer necessary. In a perfect world, our documentation would always be up to date and our procedures would be such that we immediately remove unnecessary jobs.

But the real world is far messier unfortunately.

(and since the full photo got cropped in the header, here it is again)

Call 911. If you can

Apparently not only can guest rooms not be called from this phone

And as a reminder, if you enjoy my posts, please make sure to subscribe.


SSMS 2017 RegEx

A short technical post on one thing I’ve found annoying.

Anyone who has worked with computers in the last decade has probably used regular expressions in some form or another, or “regexes” as they’re known. First popularized (as far as I know) in Perl, though their history stretches back to the 1950s, they’ve become standard fare in most languages and tools and are very useful for complex matching and find-and-replace. And for anyone who has worked with computers for over two decades, they often look like the line noise you’d hear when someone picked up the phone in the house while you were dialed in via an actual modem.

That said, I first learned about their value in SQL Server Management Studio due to a great talk by Sean McCown. (Note the post’s byline is Jen’s, but trust me, I saw Sean give the talk. 😉)

One of the powerful features is the ability to “tag” part of an expression so you can use it in your replace statement (see the above link for more details.)

But here’s the thing: somewhere along the line (I think it was post SSMS 2014) Microsoft changed the rules of the game, and it can be hard to find the new rules! They have a post on using regular expressions in SSMS 2017. But as far as I can tell, it’s the old 2014 info, simply rebranded. Some of it, especially the tagging part, does not appear to work for me. If anyone CAN make it work in 2017, please let me know how.

Let me give you an example. Just today I was given a script that had a lot of statements similar to:

DROP TABLE If Exists ##TempAttribute

Let me say I LOVE the new “If Exists” option for DROP TABLE, but I needed to run this on SQL Server 2008R2 (I know, don’t ask!) and that syntax won’t work.

I needed to replace it with something like:

IF OBJECT_ID('##TempAttribute', 'U') IS NOT NULL
  DROP TABLE ##TempAttribute; 

Now, I’m naturally lazy and didn’t want to have to find and replace all 100 or so instances of this by hand. So, I reached for regexes… and… well it didn’t go well.

Based on the old syntax, my find should look something like this:

DROP Table if exists {\#\#[A-z_0-9]+}

And for the replace

if object_ID('\1', 'U') is not null drop table \1;

Except, for the life of me, that wasn’t working. Every time I tried to tag the table name using the braces {}, my regex would fall apart. So of course I searched and got the link from Microsoft above, which still suggests tagging with braces. From previous experience I knew it was wrong, but that IS the official Microsoft page after all, so I kept doubting myself.

But, I was right in remembering things had changed.

The proper syntax is:

Drop table if exists (\#\#[A-z_0-9]+)

and

if object_ID('$1', 'U') is not null drop table $1;

The change to the search expression is subtle: the braces become parentheses.

The change to the replace is about the same, changing a \ to a $, but I suspect (I have not confirmed) that you’re no longer limited to just 9 tagged expressions.

And before anyone chimes in, I do realize there are other ways of writing the search expression (for example, I could have used :w+ in SSMS 2014, which would have worked in my particular case since there were no temp tables with numbers in their names, but it would not have worked in SSMS 2017). But that wasn’t the point of this post; my goal was to focus on the change in tagging expressions.

Regular expressions are still one of those things that don’t come very easily to me, so I often struggle with the syntax, and it doesn’t help when Microsoft changes the syntax rules between versions of SSMS (my understanding is they did so to make SSMS functionality better match Visual Studio, so I’m OK with this). But overall, I find them EXTREMELY useful, and if you haven’t played with them, I highly recommend you start learning. There can be a bit of a learning curve for anything too complex, but it’s worth it.

My advice, start with learning how to “grab” the beginning and the “end” of a line and then go from there. This is the most useful thing to me at times when I want to add or remove something at the start or end of every line.

Happy expressing!

DTSX Error

Not really a blog post of the typical form; this is more to add content to the Internet and hopefully have Google find it for someone else.

So, I inherited a DTSX package from a former project. Who hasn’t been in that position before, right?

No problem, I could make most of it do what I wanted, except for ONE Data Flow Task. Or more accurately, one ADO NET Source. This was connecting to a 3rd party database on the client’s server. Not a problem, except I can’t hit that 3rd party database from my desktop and, unfortunately, I can’t install Visual Studio on the client’s server. So, for most of my changes, I had to disable that data flow task to make my other edits. Annoying, but not a show-stopper in this particular case.

Until… I had to actually edit that Source. I could not add new Output Columns under the Source Output. I think this is because I couldn’t connect to the actual data source to validate stuff. I could be wrong. But anyway, I had to resort to editing the XML directly. This is always a bit dangerous, but Danger is my middle name. (Ok, maybe not, but my middle initial IS D.)

And then I committed my changes, loaded it to the client computer and ran it.

Well, sort of.  The data flowed like it should and then I got:

System.NullReferenceException: Object reference not set to an instance of an object.   at Microsoft.SqlServer.Dts.Pipeline.DataReaderSourceAdapter.PrimeOutput(Int32 outputs, Int32[] outputIDs, PipelineBuffer[] buffers)  at Microsoft.SqlServer.Dts.Pipeline.ManagedComponentHost.HostPrimeOutput(IDTSManagedComponentWrapper100 wrapper, Int32 outputs, Int32[] outputIDs, IDTSBuffer100[] buffers, IntPtr ppBufferWirePacket)

The other error was:

SSIS Error Code DTS_E_PRIMEOUTPUTFAILED.  The PrimeOutput method on ADO Source returned error code 0x80004003.  The component returned a failure code when the pipeline engine called PrimeOutput(). The meaning of the failure code is defined by the component, but the error is fatal and the pipeline stopped executing.  There may be error messages posted before this with more information about the failure

I added some more error handling and tried everything, but I couldn’t stop the error, even though all the data actually was flowing. This was weird. The data WAS flowing exactly the way I wanted, but the package would fail due to the above error.

I finally created a test package with JUST the Data Flow Task and tried debugging that. I still had no luck. But at least the XML was far easier to parse.

After looking at it for the 42nd time, I finally noticed… I had added the column to the Output Columns under the ADO NET Source Output, but I had NOT put it under the ADO NET Source Error Output.

So, even though there were no errors, apparently DTSX would still fail because of the missing column. Once I added that, everything was solved.


Moving the Needle – Hard

One of the things I enjoy is problem solving, or “debugging”. I don’t necessarily mean debugging code, though I’ve done plenty of that. One particular class of problems I like solving is when something isn’t working “right”. I’m currently involved in one such issue.

Just before the holidays, the lead developer at one of my clients put me in touch with a team in another division to help them solve some performance issues they were having with their SQL Server. This is the sort of issue I generally like to sink my teeth into.

I started poking around and asking questions. I was a bit crushed when, in the initial review, they listed all the things they had tried and I had to nod my head sagely (which, being a remote worker, went unnoticed by them) because they had tried all the basic things. They had, fortunately for them, ruled out a lot of the easy fixes.

So now it came down to some digging. I won’t go into too many details, but I will cover some of the things uncovered and tried. For one thing, they have 44 SQL jobs that run every 20 seconds and basically poll a database to see if there’s any work to be done. So, every 20 seconds 44 SQL jobs would fire up, do a quick select, and then go back to sleep. On their new server, they were on average taking 6 seconds apiece. In addition, the CPU would spike to 100% for about 5-6 seconds and then drop back down. We are also seeing a lot of wait states of the MSQL_XP variety, accounting for about half the time the system is waiting and averaging about 61.1 ms each time. (Thanks to Brent Ozar’s script here!)

We tried three things, two helped, one didn’t.

First, I asked them to spread the jobs out. So now, basically 2-3 jobs are started every second. This means over a 20-second period all 44 jobs still run, but not all at once. This had an immediate impact: the jobs were now taking about 2-3 seconds. A small victory.

Secondly, we changed the MAXDOP setting from 0 to 4. This appeared to have no impact on the jobs, which in retrospect makes a lot of sense. Each job is a separate task and basically single-threaded, so SQL Agent won’t care about the MAXDOP.

(For those who aren’t familiar with SQL Server, MAXDOP is short for “Maximum Degree of Parallelism”. This controls how much SQL Server will try to spread out a task among its CPUs. Say you had 100 tests to grade and sort into alphabetical order and you had 1 person to grade them. That one person would have to do all the work. You might decide that having 100 people is 100 times faster, since every person can grade a test at the same time. But then you have to hand out the 100 tests and then collect them and re-sort them back into alphabetical order, and this takes longer than you think. So by playing around, you realize it’s actually faster to have only 10 people grade and sort them. In other words, sometimes the effort of spreading out the work takes longer than the time saved by spreading it out.)
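
For what it’s worth, applying the change is the easy part; it’s deciding on the value that takes the thought:

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max degree of parallelism', 4;   -- was 0, i.e. use all 32 CPUs
RECONFIGURE;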

But one thing that didn’t change was the CPU spike. Still, since the poll jobs were twice as fast, we were happy with that improvement.

However, the real goal of the poll jobs was to wake up ETL jobs that handle large amounts of data. These were running about half as fast as they’d like or expected.

Here, MAXDOP does seem to have changed things.  In most cases, the ETL jobs are running close to twice as fast.

But here’s the funny thing: I didn’t really care. Yes, that was our goal, but I’d have been content if they had run twice as slow. Why? Because at the point we changed the MAXDOP setting, my goal wasn’t to improve performance, it was simply to move the needle, hard. What I meant by that was, by changing the MAXDOP from 0 (use all 32 CPUs) to 4, I was fairly confident, for a variety of reasons, that I’d impact performance. And I did in fact expect performance to improve. But there were really 3 possible outcomes:

  1. It improved. Great, we know we’re on the right track, let’s tweak it some more.
  2. It got worse. Great, this is probably NOT the solution, but let’s try it the other way and instead of 4 CPUs, try say 16 or even a larger value. At least we know that the MAXDOP is having an impact.
  3. Nothing changed. In this case, we can pretty much rule out parallelization being a factor at all.

In other words by forcing SQL Server to use only 4 CPUs instead of all 32, I expected a change. If I didn’t see a change, one way or the other, I could mostly rule out parallelization.

Finally, once we saw the impact of a MAXDOP of 4, we started to play with the cost threshold for parallelism. In this case we ended up with option 3 above. We tried a fairly small value (5) and a fairly large value (100) and haven’t seen much of a difference. So the cost threshold doesn’t seem to have much of an impact.

So, we’re not fully there yet; there are a number of other factors we need to consider. But sometimes when you’re approaching a problem, don’t be afraid to move the needle, in any direction, hard; it can tell you whether you should continue down that path. In this case, with MAXDOP it indicated we were on the right track, but with the cost threshold, we’re probably not.

We’ve got a lot more to do, including seeing if we can eliminate or speed up the MSQL_XP wait states, but we’re on our way. (For the record, I don’t expect much change on this one; it’s really SQL Server saying, “hey, I called out to an external procedure and am waiting to hear back”, so we can’t tweak the query or do other things that would make much of a difference.)


Oil Change Time and a Rubber Ducky

Sometimes, the inspiration for this blog comes from the strangest places. This time… it was an oil change.

I had been putting off changing my oil for far too long and finally took advantage of some free time last Friday to get it changed. I used to change it myself, but for some reason, with this new car (well, new used car, but that’s a story for another day) I’ve always paid to get it changed. (And why I stopped changing it myself is also a blog post for another day.)

Anyway, I’ve now gone twice to the local Valvoline. This isn’t really an ad for Valvoline specifically, but more a comment on what I found interesting there.

So, at most places where I’ve had my oil changed, you park, go in, give them your name and car keys, and wait. Not here; they actually have you drive the car into the bay itself and you sit in the car the entire time. I think this is a bit more efficient, but since they have a pit under the car instead of lifting the car, I suppose they do risk someone driving their car into the pit (yes, it’s guarded by a low rail on either side, but you know there are drivers just that bad out there).

So, while sitting there I observed them doing two things I’m a huge fan of: using a checklist and calling out.

As I’ve talked about in my book and here in my own blog, I love checklists. I recommend the book The Checklist Manifesto. Checklists help reduce errors. And while changing oil is fairly simple, mistakes do happen: the wrong oil gets put in, the drain plug isn’t properly tightened, too much gets put in, etc.

So hearing them call out and seeing them check off on the computer what they were doing helps instill confidence. Now, I’m sure most, if not all, oil change places do this, but if you’re sitting in the waiting room, you don’t get to see it.

But they also did something else which I found particularly interesting: they did a version of Pointing and Calling. This is a very common practice in the Japanese railway system. One study showed it reduced accidents by almost 85%. So while changing my oil, the guy above would call out what he was doing. It was tough to hear everything he was calling out, but I know at one point the call was “4.5 Maxlife”. He then proceeded to put what I presume was 4.5 quarts of the semi-synthetic oil into my engine (I know it was the right oil because I could see which nozzle he selected). I didn’t count the clicks, but I believe there were 9. Now, other than the feedback of the 9 clicks, the guy in the pit couldn’t know for sure that it was the right oil and amount, but I’m going to guess he had a computer terminal of his own, and had his screen said “4 quarts standard” he’d have spoken up. But even if he didn’t have a way of confirming the call, by speaking it out loud the guy above was engaging more of his brain in his task, which made it less likely he’d make a mistake.

I left the oil change with a high confidence that they had done it right. And I was glad to know they actually were taking active steps to ensure that.

So, what about the rubber duck?

Well, a while back I started to pick up the habit of rubber duck debugging. Working at home, alone, it’s often hard to show another developer my code and ask, “Why isn’t this working?” So now, if I encounter a problem and can’t seem to figure out why it’s not working, I pull out a rubber duck and start working through the code line by line. It’s amazing how well this works. I suspect that by taking the time to slow down and process the information, and by engaging more of my brain (now the verbal and auditory portions), like pointing and calling, it helps bring more of my limited brain power to bear on the problem. And if that doesn’t work, I still have my extended brain.

PS As a reminder, this coming Saturday I’m speaking at SQL Saturday Philadelphia. Don’t miss it!

Hours for the week

Like I say, I don’t generally post SQL-specific stuff because, well, there are so many blogs out there that do. But what the heck.

I had a problem the other day. I needed to return the hours worked per day over a date range for a specific employee, and if they worked no hours on a given day, return 0. So basically I had to deal with gaps.

There are lots of solutions out there; this is mine:

ALTER PROCEDURE GetEmployeeHoursByDate @startdate date, @enddate date, @userID varchar(25)
AS

-- Usage: exec GetEmployeeHoursByDate '2018-01-07', '2018-01-13', 'gmoore'

-- Author: Greg D. Moore
-- Date: 2018-02-12
-- Version: 1.0

-- Get the totals for the days in question

SET NOCOUNT ON

-- First let's create a simple table that just has the range of dates we want
-- (note: the recursive CTE is capped at 100 levels unless an OPTION (MAXRECURSION) hint is added)

; WITH daterange AS (
SELECT @startdate AS WorkDate
UNION ALL
SELECT DATEADD(dd, 1, WorkDate)
FROM daterange s
WHERE DATEADD(dd, 1, WorkDate) <= @enddate)

select dr.workdate as workdate, coalesce(a.dailyhours, 0) as DailyHours from
(
-- Here we get the hours worked and sum them up for that person.
select ph.WorkDate, sum(ph.Hours) as DailyHours from ProjectHours ph
where ph.UserID = @userid
and ph.workdate >= @startdate and ph.workdate <= @enddate
group by ph.workdate
) as a
-- now join our table of dates to our hours and put in 0 for dates we don't have hours for
right outer join daterange dr on dr.WorkDate = a.WorkDate
order by workdate

GO

There are probably better ways, but this worked for me. What’s your solution?