Marshmallows Part II

I’ll admit, I can rarely tell in advance when one of my posts will hit all the buttons and generate views and when it will fall flat. But since I don’t always write for my audience, and sometimes write for my own reasons, I can live with that.

So, how do you follow up on a post that didn’t receive many views? Write a follow-up post. You can call me a slow learner.

Actually, it’s about learning. Last time I wrote about my microwave and a quick experiment with marshmallows to prove it was really dead. After two days without a microwave, it was time to get a new one. Of course I couldn’t get the one I wanted because the space it had to fit into was limited. That could have been resolved, but it would have meant redoing the cabinet. And if I were going to redo that cabinet, I might as well redo the rest of the cabinets. And if I’m going to redo the cabinets, I really need to redo the counters. Very quickly, a $100 replacement microwave I could pick up in an hour would become a three-week, $10,000 kitchen remodel. I opted for the $100 microwave over the one I really wanted.

And the results are shown at the top of the post (and below, in case the top image doesn’t appear).

Picture: 10 seconds of marshmallows in the microwave

It’s quite interesting to me. The best heating was beyond the area of the rotating plate. But this also shows the value of the rotating plate: there are a few dead spots, and if I put something there to heat and everything were stationary, it would take forever to heat since there’s little to no microwave energy in those spots. (This can get complex because of the wavelength, the height of the material, etc.)

Now, I’d have done more experiments, but it seems a certain someone in the house enjoys marshmallows more than I do and had eaten a bunch, so this was all I had left.

But, I have a working microwave and I’ve proven how important the rotating plate can be (not that I had much doubt).

And that’s science to me: doing experiments and learning.

Oh, and about the SQL query I was updating: it’s hopefully going into production this week. I was able to eke out about a 10-20% improvement. Beyond that, there wasn’t much I could do because it ends up scanning an entire table, on purpose. There’s only so much you can do there.

One last thing: there may not be a post next week because I’ll be teaching at the NCRC weeklong cave training class in Indiana and will have limited internet and time.

Marshmallows

Though I attended RPI, which is generally considered an engineering school, my degree is a BS in Computer Science. I say that because I consider myself more of a scientist than an engineer at times. And honestly, we all start out as scientists, but many of us lose that along the way.

Anyone who has had a small child has observed a scientist in action. No, they’re not in a lab full of test tubes and beakers and flasks giving off noxious smells. But they are in the biggest lab there is, the world. They also don’t necessarily realize it. Nor do parents. But every time they drop a Cheerio, they’re testing gravity. Fortunately (or unfortunately, depending on your point of view), so far they’ve managed to prove every time that gravity works. This is the most obvious example, but when you stop to think about it, much of the first few years of life is all about experimenting. Most of the time it goes well, but sometimes, as a burnt hand will attest, the experiment has a less than ideal outcome.

And it’s the fear of burned hands that leads parents to utter that common refrain, “Don’t touch that!” or the variation “Don’t do that!” Over time, our experimentation gets reined in until we do very little of it. This can be inhibiting.

Years ago I used to teach an “Introduction to Windows” adult education class. It was, I believe, a 6-week class, and I taught several over the course of a couple of years. It didn’t take me long to realize the biggest constraint on the students’ ability to succeed in the class was that they had internalized “Don’t do that, you might break something.” Once I realized that, half my teaching pedagogy simply became, “Touch that, you won’t break it, and if you do, it’s not a big deal, and if it is, we’ll fix it anyway.” Seriously, more than anything else, I had to encourage most of my students to experiment with the computer.

More recently I realized I had stopped doing as many experiments in my life as I should be doing. About a week and a half ago I attended a Wilderness Medicine Conference a friend of mine had told me about. At the end of the very wet, cold, rainy day, a bunch of us went outside and tried to start a fire. Starting a fire, let alone in such conditions, was something most of the students had never done. I had, but not in years. With some effort, and experimentation, including using the box from a single-serving package of Froot Loops, we finally managed to get the fire going.

But this got me thinking. When I go hiking, I carry a tiny ziplock bag in my jacket with some firestarting materials. They’re there in case of an emergency. But, the thing is, I had never actually tried them, and I realized that if I didn’t know how well they worked in practice, I couldn’t rely on them in an emergency. So, I went outside and started a fire. And I learned that yes, my materials ARE adequate, but the dryer lint needed to be pulled apart more than I realized. I tried again later in the week and added a toilet paper roll to form a sort of chimney so the starting fire would draft better. This, along with pulling the lint apart more, worked even better, and a single match was sufficient this time. This gave me more confidence that in an emergency, in less than ideal conditions, I could get an actual fire going.

But, I wasn’t done! Our microwave broke this weekend. But, before I wrote it off, I wanted to make sure it wasn’t a fluke or something else. So, in this case I decided to get a bag of marshmallows and lay them out inside the microwave to see if I was getting ANY energy out of the magnetron. Turns out, nope, nada, nothing. So, today or tomorrow I will be buying a new microwave. But, it was a fun, and later tasty, experiment.

Without delving deep into the scientific method here, I’ll say that at a simple level, science is about having a hypothesis and testing it. The testing is the important part.

To bring this back to SQL: first, you have a hypothesis that your backups will work. Have you tested that hypothesis? If not, do so immediately. Even if they do work, you might learn something now that will be important when you have to do it for real. Perhaps you learn the volume your backups are on only has write access. Or perhaps you learn you need to retrieve your encryption keys and the person who controls access to them is on vacation. Or perhaps your RTO is 4 hours and the restore takes 6 hours. So, experiment.
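If you’ve never actually run a restore test, it can be as simple as the sketch below. This is a minimal, hypothetical example (the file paths, database name, and logical file names are placeholders, not anything from a real system). RESTORE VERIFYONLY only checks that the backup media is readable, so the full RESTORE DATABASE is the part that really tests the hypothesis, and timing it tells you whether you can meet your recovery objectives.

    -- Minimal sketch of a backup/restore test. All names and paths are hypothetical.
    -- Quick sanity check: is the backup file readable and internally consistent?
    RESTORE VERIFYONLY
    FROM DISK = N'\\backupshare\sql\MyDatabase_Full.bak';

    -- The real test: restore it as a copy somewhere safe and see how long it takes.
    RESTORE DATABASE MyDatabase_RestoreTest
    FROM DISK = N'\\backupshare\sql\MyDatabase_Full.bak'
    WITH MOVE N'MyDatabase'     TO N'D:\SQLData\MyDatabase_RestoreTest.mdf',
         MOVE N'MyDatabase_log' TO N'L:\SQLLogs\MyDatabase_RestoreTest.ldf',
         STATS = 10;   -- progress messages every 10 percent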

Picture: capture of a random query plan

Recently, for one client, I’ve spent some time experimenting with various changes to help improve the performance of some queries. Not everything I tried worked, but some things did. So again, experiment.

I’m curious what recent experiments you may have done, SQL or otherwise. What were their outcomes?

 

Punditry

We’re all experts on everything. Don’t think so? Go to any middle school or high school soccer game and you’ll be amazed at how many parents are suddenly experts on soccer. It’s also amazing how many of them are parents of future NCAA Division I scholarship soccer players.

Seriously though, we’re all guilty of this from time to time. I’ve done it and if you’re honest, you’ll admit you’ve done it.

Yesterday the world suffered a loss: the near destruction of Notre Dame. Early in the fire, our President tweeted:

“Perhaps flying water tankers could be used to put it out. Must act quickly!”

As many have pointed out, this was actually a terrible idea: dropping tons of water onto an already collapsing roof would most likely do more damage than good. But while I think it’s easy to mock the President for his tweet, I won’t. In some ways it reminds me of the various suggestions that were made last summer during the Thai Cave Rescue. We all want to help and often will blurt out the first idea that comes to mind. I think it’s human nature to want to help.

But, here’s the thing: there really are experts in the field (or, to use a term I see in my industry and dislike at times because it just sounds bad: SMEs, Subject Matter Experts).

And sometimes, being an SME does give you some insight into other domains, and you can offer something useful. But one thing I’ve found is that no matter how much I know on any subject, there’s probably someone who knows more. I’ve written about plane crashes and believe I have a more than passing familiarity with the area. Perhaps a lot more than the average person. But there’s still a lot I don’t know, and if I were asked by a news organization to comment on a recent plane crash, I’d probably defer to people with far more experience than I have.

Having done construction (from concrete work in basements to putting the cap on a roof), I again have more than a passing familiarity with construction techniques and how fire can affect them. That said, I’ll leave the real building and firefighting techniques to the experts.

And I will add another note: even experts can disagree at times. Whether it’s attending a SQL Saturday or the PASS Conference itself, or sitting in a room with my fellow cave rescue instructors, it can be quite enlightening to see the different takes people will have on a particular question. Often no one is wrong, but they bring different knowledge to the table or different experiences.

And finally, you know what, sometimes the non-expert CAN see the problem, or a solution, in a way that an expert can’t. But that said, at the end of the day, I’ll tend to trust the experts.

And that’s the truth because I’m an expert on punditry.

Shouldn’t that be plugged in?

That was the question a friend of mine asked in 6th grade. As a result, I developed what I call the Charlie M. rule, after my friend. It was sort of a Show and Tell day and we were supposed to talk about our hobbies. I brought in a circle of HO scale track (18″ radius for those interested), my locomotive (a model GP-38), some cars, and of course the transformer to power it all.

I set it all up in front of the class and dutifully tried to demonstrate it. Nothing moved. I checked to make sure the engine was properly on the tracks: check. I made sure the wires were connected to the transformer: check. I made sure the wires were connected to the track: check. I was stumped: check. Finally Charlie raised his hand and asked, “Shouldn’t that be plugged in?” Ayup, in all my nervousness and hurry, I had forgotten the most basic step: plugging in the transformer.

I try to keep this in mind when troubleshooting: check the obvious. I ran into it again over the weekend while trying to get my BMW Z3 running. (Side note: no, consulting does not pay that well. This is one of the few tangible items I have left from my dad’s estate.) It had stopped running late last fall, and at the time I spent a little while trying to make it run, without much success. Finally, with the family’s help, I pushed and pulled it into a shelter and left it there for the winter.

I wasn’t planning on worrying about it until later this month, but then… well let’s just say when I put the large box with metal corners into the rear of the Subaru, I forgot to check the obvious and slammed the rear hatch down on the box. Well, the box, realizing it didn’t have enough room, decided to take advantage of the metal corner and proceeded to make more room by punching out the rear window of the Subaru. Oops.  Such a simple mistake, but a large one.

So, while waiting for the Subaru to get fixed, I decided it was time to get the BMW on the road.

Now, due to the symptoms, I knew it wasn’t a dead battery or bad gas. So, taking advantage of what I call my extended brain, I asked others for help. We had narrowed the problem down to either the clutch interlock switch or the starter. Neither looked like it would be an easy self-service job, and I was getting frustrated. I finally decided that perhaps checking the OBD-II codes might yield more information. Strangely though, the reader didn’t power up; there were no codes to read. That struck me as strange. So here I did check the obvious: I took the reader to the Subaru and made sure it worked. And it worked fine on the Subaru. I went back to my extended brain and mentioned that.

“Oh, have you checked the fuses?”

“Nah I thought about it, but everything seems to have power.”

“You sure, sounds like the onboard computer fuse might be blown.”

So, I trudged out and took off the fuse cover. Now, I don’t really believe in fate or signs from God, but it was weird: in the list of about 40 fuses, the first one my eyes fell on was labeled Computer. “Nah, can’t be.”

I pulled it, and sure enough, it was burned out. I replaced it, got in the car, and thought, “It can’t be that easy, can it?” A turn of the key and the next thing I knew, the six cylinders were purring.

All that work and frustration because I had overlooked the basics.

This is far from the first time I’ve overlooked the basics. And I bet you’ve done the same thing. I have a theory about why we do this, and it’s in part because the basics ARE so fundamental that we assume it has to be something else. In my model train example, dirty track and loose wires, especially in an ad-hoc setup, are arguably a more common issue than forgetting to plug in the transformer. In the BMW’s case, because literally everything else worked, I assumed power was getting to the computer. And honestly, even now, thinking about it, I’m surprised the dashboard lights at startup didn’t look any different with the computer getting no power.

I’ve seen this in databases and elsewhere. I was recently trying to do a quick restore of a database from one machine to another and the obvious wasn’t working. It took me a bit to remember the client’s new security setup prevented this specific case for these two machines. Once I remembered that, the problem and subsequent solution were obvious.

This in part goes back to why I like using a rubber-duck at times. It can force you to review your assumptions and check the basics.

Having a problem? Employ the Charlie M. rule and check the basics.

 

DTSX Error

This isn’t really a blog post of the typical form; it’s more to add content to the Internet and hopefully have Google find it for someone else hitting the same error.

So, I inherited a DTSX package from a former project. Who hasn’t been in that position before, right?

No problem: I could make most of it do what I wanted, except for ONE Data Flow Task. Or more accurately, its ADO NET Source. This was connecting to a 3rd-party database on the client’s server. Not a problem, except I can’t reach that 3rd-party database from my desktop and, unfortunately, I can’t install Visual Studio on the client’s server. So, for most of my changes, I had to disable that Data Flow Task to make my other edits. Annoying, but not a show-stopper in this particular case.

Until… I had to actually edit that Source.  I could not add new OUTPUT Columns under the Source Output.  I think this is because I couldn’t connect to the actual data source to validate stuff. I could be wrong. But anyway, I had to resort to editing the XML directly. This is always a bit dangerous, but Danger is my middle name. (Ok, maybe not, but my middle initial IS D.)

And then I committed my changes, loaded it onto the client’s computer, and ran it.

Well, sort of. The data flowed as it should, and then I got:

System.NullReferenceException: Object reference not set to an instance of an object.   at Microsoft.SqlServer.Dts.Pipeline.DataReaderSourceAdapter.PrimeOutput(Int32 outputs, Int32[] outputIDs, PipelineBuffer[] buffers)  at Microsoft.SqlServer.Dts.Pipeline.ManagedComponentHost.HostPrimeOutput(IDTSManagedComponentWrapper100 wrapper, Int32 outputs, Int32[] outputIDs, IDTSBuffer100[] buffers, IntPtr ppBufferWirePacket)

The other error was:

SSIS Error Code DTS_E_PRIMEOUTPUTFAILED.  The PrimeOutput method on ADO Source returned error code 0x80004003.  The component returned a failure code when the pipeline engine called PrimeOutput(). The meaning of the failure code is defined by the component, but the error is fatal and the pipeline stopped executing.  There may be error messages posted before this with more information about the failure

I added some more error handling and tried everything, but I couldn’t stop the error. This was weird. The data WAS flowing exactly the way I wanted, but the package would still fail with the above error.

I finally created a test package with JUST the Data Flow Task and tried debugging that. I still had no luck. But at least the XML was far easier to parse.

After looking at it for the 42nd time, I finally noticed… I had added the column to the Output Columns under the ADO NET Source Output, but I had NOT put it under the ADO NET Source Error Output.

So, even though there were no error rows, the package would apparently still fail because of the column missing from the error output. Once I added it there too, everything was solved.

 

Moving the Needle – Hard

One of the things I enjoy is problem solving, or “debugging”. I don’t necessarily mean debugging code, though I’ve done plenty of that. One particular class of problems I like solving is when something isn’t working “right”. I’m currently involved in one such issue.

Just before the holidays, the lead developer at one of my clients put me in touch with a team in another division to help them solve some performance issues they were having with their SQL Server. This is the sort of issue I generally like to sink my teeth into.

I started poking around and asking questions. I was a bit crushed when, in the initial review, they listed all the things they had tried and I had to nod my head sagely (which, being a remote worker, went unnoticed by them) because they had tried all the basic things. They had, fortunately for them, already ruled out a lot of the easy fixes.

So now it came down to some digging. I won’t go into too many details, but I will cover some of the things we uncovered and tried. For one thing, they have 44 SQL jobs that run every 20 seconds and basically poll a database to see if there’s any work to be done. So, every 20 seconds, 44 SQL jobs would fire up, do a quick select, and then go back to sleep. On their new server, they were taking an average of 6 seconds apiece. In addition, the CPU would spike to 100% for about 5-6 seconds and then drop back down. We were also seeing a lot of waits of the MSQL_XP variety, accounting for about half of the system’s wait time and averaging about 61.1 ms each. (Thanks to Brent Ozar’s script here!)
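For reference, you can eyeball cumulative waits straight from the DMV with something like the rough sketch below (Brent Ozar’s scripts filter out the benign waits and do the math far better; this is just the bare-bones version):

    -- Rough look at cumulative wait stats since the last restart/clear.
    SELECT  wait_type,
            waiting_tasks_count,
            wait_time_ms,
            wait_time_ms / NULLIF(waiting_tasks_count, 0) AS avg_wait_ms
    FROM    sys.dm_os_wait_stats
    WHERE   wait_type = 'MSQL_XP'   -- or drop the filter and just take the top waits
    ORDER BY wait_time_ms DESC;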

We tried three things, two helped, one didn’t.

First, I asked them to spread the jobs out. So now, basically 2-3 jobs are started every second. This means that over a 20-second period all 44 jobs still run, just not all at once. This had an immediate impact: the jobs were now taking about 2-3 seconds apiece. A small victory.
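I won’t claim this is exactly how they did it, but conceptually staggering the jobs is just a matter of offsetting each schedule’s start time by a second or two, something like this hypothetical sketch against msdb (the schedule names are made up):

    -- Hypothetical sketch: phase the "every 20 seconds" schedules so the jobs
    -- don't all wake up in the same second. @active_start_time is HHMMSS.
    USE msdb;
    GO
    EXEC dbo.sp_update_schedule @name = N'Poll job 01 - every 20s', @active_start_time = 000000;  -- :00, :20, :40
    EXEC dbo.sp_update_schedule @name = N'Poll job 02 - every 20s', @active_start_time = 000001;  -- :01, :21, :41
    EXEC dbo.sp_update_schedule @name = N'Poll job 03 - every 20s', @active_start_time = 000002;  -- :02, :22, :42
    -- ...and so on across the rest of the 44 jobs.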

Secondly, we changed the MAXDOP setting from 0 to 4. This appeared to have no impact on the jobs. In retrospect that makes a lot of sense: each job is a separate task running a basically single-threaded query, so the MAXDOP setting doesn’t really come into play for them.

For those who aren’t familiar with SQL Server, MAXDOP is short for “Maximum Degree of Parallelism.” This controls how much SQL Server will try to spread a single task across its CPUs. For example, say you had 100 tests to grade and sort into alphabetical order, and one person to do it. That one person would have to do all the work. You might decide that having 100 people is 100 times faster, since every person can grade a test at the same time. But then you have to hand out the 100 tests, collect them, and re-sort them into alphabetical order, and that takes longer than you think. So by playing around, you realize it’s actually faster to have only 10 people grade and sort them. In other words, sometimes the effort of spreading out the work takes longer than the time saved by spreading it out.
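For the curious, the server-wide setting itself is just an sp_configure change, roughly like this (it’s an advanced option, so ‘show advanced options’ has to be enabled first):

    -- Change the server-wide max degree of parallelism from 0 (use all CPUs) to 4.
    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    EXEC sp_configure 'max degree of parallelism', 4;
    RECONFIGURE;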

One thing that didn’t change, though, was the CPU spike. But since the poll jobs were twice as fast, we were happy with that improvement.

However, the real purpose of the poll jobs was to wake up ETL jobs that handle large amounts of data. These were running about half as fast as the team would like or expected.

Here, MAXDOP does seem to have changed things.  In most cases, the ETL jobs are running close to twice as fast.

But, here’s the funny thing. I didn’t really care. Yes, that was our goal, but I’d have been content if they had run twice as slow. Why? Because at the point we changed the MAXDOP setting, my goal wasn’t to improve performance, it was simply to move the needle, hard. What I meant by that was: by changing MAXDOP from 0 (use all 32 CPUs) to 4, I was fairly confident, for a variety of reasons, that I’d impact performance. And I did in fact expect performance to improve. But there were really 3 possible outcomes:

  1. It improved. Great, we know we’re on the right track, let’s tweak it some more.
  2. It got worse. Great, this is probably NOT the solution, but let’s try the other direction: instead of 4 CPUs, try 16 or an even larger value. At least we know that MAXDOP is having an impact.
  3. Nothing changed. In this case, we could pretty much rule out parallelization being a factor at all.

In other words by forcing SQL Server to use only 4 CPUs instead of all 32, I expected a change. If I didn’t see a change, one way or the other, I could mostly rule out parallelization.

Finally, once we saw that a MAXDOP of 4 made a difference, we started to play with the cost threshold for parallelism. In this case we ended up with option 3 above: we tried a fairly small value (5) and a fairly large value (100) and haven’t seen much of a difference. So the cost threshold doesn’t seem to have much of an impact.
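That setting is adjusted the same way as MAXDOP, roughly (the default is 5):

    -- Try a larger cost threshold for parallelism (an advanced option, like MAXDOP).
    EXEC sp_configure 'cost threshold for parallelism', 100;
    RECONFIGURE;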

So, we’re not fully there yet; there are a number of other factors we need to consider. But sometimes when you’re approaching a problem, don’t be afraid to move the needle hard, in any direction; the result can tell you whether you should keep pursuing that approach. In this case, with MAXDOP it indicated we were on the right track, but with the cost threshold, we’re probably not.

We’ve got a lot more to do, including seeing if we can eliminate or speed up the MSQL_XP waits, but we’re on our way. (For the record, I don’t expect much change on this one. It’s really SQL Server saying, “Hey, I called out to an external procedure and am waiting to hear back,” so we can’t tweak the query or do other things that would make much of a difference.)

Copying a Large File

It was a pretty simple request actually. “Can you copy over the Panama database from FOO\WAS_21 to server BAR\LAX_45?”

“Sure, no problem.”

Of course it was a problem. Here’s the issue. This is at one of my clients. They have a couple of datacenters with hundreds of servers in each. In addition, they have servers in different AD domains. This helps them partition functionality and security requirements. Normally, copying files between servers within a datacenter isn’t an issue. Even copying files between the different domains in the same datacenter isn’t normally too bad. To be clear, it’s not great: between servers in the same domain they appear to have 1Gbps connections, while between domains the firewall seems to throttle things down to 100Mbps.

The problem is when copying between different domains in different datacenters. This can be abysmally slow. That was my problem this week.  WAS_21 and LAX_45 were in different datacenters, and in different domains.

Now, for small files, I can use the copy-and-paste functionality built into RDP and simply paste the file across. That doesn’t work for large files, and the file in this case was 19GB. So that was out.

Fortunately, through the Citrix VDI they provide, I have a temp folder I can use. So, easily enough, I could copy the 19GB file from FOO\WAS_21 to that. That took just a few minutes.  Then I tried to copy it from there to BAR\LAX_45. This was slow, but looked like it would work.  It was going to take 4-5 hours, but they didn’t need the file for a week.

After about 4.5 hours, my RDP session locked up. I logged out and back in and saw the copy had failed. I tried again. This time at just under 4.5 hours I noticed an out of memory error. And then my session locked up.

So, apparently this wasn’t going to work. The obvious solution was to split the file (it was already compressed) into multiple files; except I’m not allowed to install most software on the servers. So that wasn’t a great option. I probably could have installed something like 7zip and then uninstalled it, but I didn’t want to deal with that and the paperwork that would result.

So I fell back to an old friend: Robocopy.  This appeared to be working great. Up until about 4.5 hours.  And guess what… another out of memory error.

But I LIKE challenges like this.

So I looked more closely. Robocopy has a lot of options. Two stuck out. The first: /Z, restartable mode. That looked good. I figured worst case, I’d start the copy, let it fail at about 85% done, and then resume it.

But then the holy grail: /J :: copy using unbuffered I/O (recommended for large files). 

Wow… unbuffered… that looks good. Might use less memory.

So I gambled and tried both. And lo and behold, 4:19 later… the file was copied!
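For anyone landing here from a search, the invocation was essentially of this form (the source folder, destination share, and file name below are placeholders, not the real ones):

    REM Copy one large file using restartable mode (/Z) and unbuffered I/O (/J).
    robocopy D:\CitrixTemp \\LAX_45\D$\Restore BigDatabase.bak /Z /J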

So, it was an annoying problem but… I had solved it.  I like that!

So the take-away: Don’t give up. There’s always a way if you’re creative enough!