Crossing the Threshold…

So it’s the usual story. You need to upgrade a machine, the IT group says, “no problem, we can virtualize it, it’ll be better! Don’t worry, yes, it’ll be fewer CPUs, but they’ll be much faster!”

So, you move forward with the upgrade. Twenty-three meetings later, 3 late nights, one OS upgrade, and two new machines forming one new cluster, you’re good. Things go live.  And then Monday happens. Monday of course is the first full day of business and just so happens to be the busiest day of the week.

Users are complaining. You look at the CPU and it’s hitting 100% routinely. Things are NOT distinctly better.

You look at the CPUs and you notices something striking:

cpu not being used

CPU 8 is showing a problem

4 of the CPUs (several are missing on this graphic) are showing virtually no utilization while the other  8 are going like gang-busters.  Then it hits you, the way the IT group setup the virtual CPUs was not what you needed.  They setup 6 sockets with 2 cores each for a total of 12 cores. This shouldn’t be a problem except that SQL Server Standard Edition uses the lower of either 4 sockets or 24 cores. Because your VM has 6 sockets, SQL Server refuses to use two of them.

You confirm the problem by running the following query:

SELECT scheduler_id, cpu_id, status, is_online FROM sys.dm_os_schedulers

This shows only 8 of your 12 CPUs are marked visible_online.

This is fortunately an easy fix.  A quick outage and your VM is reconfigured to 2 sockets with 6 cores a piece. Your CPU graphs now look like:

better CPU

A better CPU distribution

This is closer to what you want to see, but of course since you’re doing your work at night, you’re not seeing a full load. But you’re happier.

Then Monday happens again.  Things are better, but you’re still not happy. The CPUs are running on average at about 80% utilization. This is definitely better than 100%. But your client’s product manager knows they’ll need more processing power in coming months and running at 80% doesn’t give you much growth potential. The product manager would rather not have to buy more licenses.

So, you go to work. And since I’m tired of writing in the 2nd person, I’ll start writing in 1st person moving forward.

There’s a lot of ways to approach a problem like this, but often when I see heavy CPU usage, I want to see what sort of wait stats I’m dealing with. It may not always give me the best answer, but I find them useful.

Here’s the results of one quick query.

Fortunately, this being a new box, it was running SQL Server 2016 with the latest version service pack and CU.  This mean that I had some more useful data.

CXPackets

CXPackets and CXConsumer telling the tale

Note one of the suggestions: Changing the default Cost Threshold for Parallelism based on observed query cost for your entire workload.

Given the load I had observed, I guessed the Cost Threshold was way too low. It was in fact set to 10.  With that during testing I saw a CPU graph that looked like this:

43 percent CPU

43.5% at Cost Threshold of 10

I decided to change the Cost Threshold to 100 and the graph quickly became:

25 percent CPU

25% at Cost Threshold of 100

Dropping from 43.5% to 25.6%. That’s a savings you can take to the bank!

Of course that could have been a fluke, so I ran several 5 minute snapshots where I would set the threshold to 10, collect some data and then to 100 for 5 minutes and collect data.

CXPacket_10      CXPacket_10_Waittime_MS
635533 5611743
684578 4093190
674500 4428671
CXConsumer_10              CXConsumer_10_Waittime_MS
563830 3551016
595943 2661527
588635 2853673
CXPacket_100   CXPacket_100_Waittime_MS
0 0
41 22
1159 8156
CXConsumer_100            CXConsumer_100_Waittime_MS
0 0
13 29443
847 4328

You can see that over 3 runs the difference between having a threshold of 10 versus 100 made a dramatic difference in the total time spent waiting in the 5 minute window.

The other setting that can play a role in how parallelization can impact performance is MAXDOP. In this case testing didn’t show any real performance differences with changing that value.

At the end of the day though, I call this a good day. A few hours of my consulting time saved the client $1,000s of going down the wrong and expensive road of adding more CPUs and SQL licenses. There’s still room for improvement, but going from a box where only 8 of the 12 CPUs were being used and were running at 100% to a box where the average CPU usage is close to 25% is a good start.

What’s your tuning success story?

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s