Debugging – Powerhell (sic)

Generally I don’t plan to talk too much about specific programming problems. There are too many blogs out there about how to solve particular problems, and most of them are generally good and, even better, accurate.

I’m going to bore most of you, so if you want to jump to the summary, that’s fine. I won’t be insulted (heck I won’t even know!)

That said, last night I spent a lot of time debugging a particular PowerShell script I had written for a client.

I want to talk a bit about what was going on and how I approached the problem.

So, first the setup. It’s really rather simple:

  1. Log into an SFTP site
  2. Download a zip file
  3. Expand the zip file
  4. Run an SSIS package to import the files in the zip package to a SQL Server

I had written the script several months ago and it’s been in production since then.  It’s worked great except for one detail. When it was put into production, we didn’t yet have the service account to run it under, so I set it up to run as a scheduled task using my account information. Not ideal, but workable for now.

Since then, we received the service account. So, I had made several attempts to move to the service account, but it wasn’t working.  Yesterday was the first time in a while I had time to work on it, so I started looking at it in more detail.

Permissions Perhaps?

So, the first problem I had was, I’d run the scheduled task and nothing would happen. Ok, that’s not entirely accurate. The script would start, but never run to completion. It just sort of sat there. And it certainly was NOT downloading the zip file.

My first thought was it might be a file permission or other security issue. So the first thing I did was write a really simple script.

get-date | out-file "D:\foo\timetest.txt"

Then I set up a scheduled task and ran it manually. And sure enough, timetest.txt showed up exactly where it was supposed to.

So much for security in the OS.

out-file to the Rescue!

So, my next step was probably one of the oldest debug techniques ever: I put statements in the code to write to a file as I hit various steps in the code.  Not quite as cool as a debugger, but hey, it’s simple and it works.

It was simple stuff like:

"Path $SftpPath Set" | out-file ftp_log.txt -append

BTW, if you’re not familiar with PowerShell, one of the nice things is that in the above statement (because the string is double-quoted), it’ll automatically insert the value of the variable $SftpPath into the string it’s printing out.
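The quoting matters: double quotes expand variables, single quotes don’t. A quick illustration (the path value here is just a made-up example):

```powershell
$SftpPath = '/incoming/daily'   # hypothetical value

"Path $SftpPath Set"   # double quotes expand: Path /incoming/daily Set
'Path $SftpPath Set'   # single quotes stay literal: Path $SftpPath Set
```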

This started to work like a charm. And then… jackpot!

"Setting Session using $sftpURL and Password: $password and Credentials: $Credential" | out-file ftp_log.txt -append

When I ran it, I’d get a line in the log like:

Setting Session using secureftp.example.com and Password: System.Security.SecureString and Credentials: System.Management.Automation.PSCredential

Not the most informative, but useful. And it’s nice to know that one can’t easily just print out the password!

But, when I ran it as the service I was getting something VERY different.

Setting Session using secureftp.example.com and Password:  and Credentials:

HUGE difference. What the heck was happening to the password and credentials?

That’s when the first lightbulb hit.

Secure credentials in PowerShell

Let’s take a slight side trip here.  We all know (right, of course you do) that we should never store passwords in cleartext in our scripts.  So I did some digging and realized that PowerShell actually has a nice way to handle this. You can convert a password to a SecureString, serialize it as an encrypted string, and write that out to a file.  Then when you need credentials, you read from the file and it gets parsed and handed to whatever needs it.

$password = get-content $LocalFilePath\cred.txt | ConvertTo-SecureString

Of course first you have to get the password INTO that cred.txt file.

That’s easy!

if (-not (test-path $LocalFilePath\cred.txt))
{
    read-host -AsSecureString | ConvertFrom-SecureString | Out-File $LocalFilePath\cred.txt
}

So, the first time the program is run, if the cred.txt file isn’t found, the user is prompted for it, they enter the password, it gets put into a secure string and written out. From then on, the cred.txt file is used and the password is nice and secure.

And I confirmed that the service account could actually see the cred.txt file.

So what was happening?

The Key!

It took me a few minutes, and when I hit a problem like this I do what I often do: step away from the keyboard. I had to stop and think for a bit. Then it dawned on me. “How does PowerShell know how to encrypt and decrypt the password in a secure fashion? I mean at SOME point it needs to have a clear-text version of it to pass to the various cmdlets. It wouldn’t be very secure if just anyone could decrypt it.”  That’s when it struck me! The encryption key (and decryption key) has to be based on the USER running the cmdlet! (Under the hood, ConvertFrom-SecureString with no -Key parameter uses the Windows Data Protection API, which ties the encryption to the current user account.)

Let me show some more code

$password = get-content $LocalFilePath\cred.txt | ConvertTo-SecureString
$Credential = New-Object System.Management.Automation.PSCredential ('ftp_example', $Password)

$sftpURL = 'secureftp.example.com'

$session = New-SFTPSession -ComputerName $sftpURL -Credential $Credential

So first I’d get the password out of my cred.txt file and then create a Credential Object with a username (ftp_example) and the aforementioned password.

Then I’d try to open a session on the SFTP server using that credential.  And that’s exactly where it was hanging when I ran this under the service account. Now I was on to something. Obviously the $password and $Credential weren’t getting set, as we saw from my debug statements.  It wasn’t able to decrypt cred.txt!

Great, now I just need to have the service create its own cred.txt.  But, I can’t use the same technique where the first time the service runs it prompts the user for the password.  This bothers me. I still don’t have a perfect solution.
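One alternative I’ve considered (just a sketch, not what’s in production) is to encrypt with an explicit AES key instead of the default user-tied one. Both ConvertTo-SecureString and ConvertFrom-SecureString take a -Key parameter, so any account that can read the key file can decrypt the credential. Of course, that shifts the problem to protecting the key file with NTFS permissions:

```powershell
# Generate a 256-bit AES key once and store it (protect this file!).
# Note: -Encoding Byte is Windows PowerShell syntax; PowerShell 7 uses -AsByteStream.
$key = New-Object byte[] 32
[Security.Cryptography.RandomNumberGenerator]::Create().GetBytes($key)
$key | Set-Content "$LocalFilePath\cred.key" -Encoding Byte

# First run: prompt for the password and encrypt it with the key.
read-host -AsSecureString | ConvertFrom-SecureString -Key $key |
    Out-File "$LocalFilePath\cred.txt"

# Any later run, under any account that can read both files:
$key      = Get-Content "$LocalFilePath\cred.key" -Encoding Byte
$password = Get-Content "$LocalFilePath\cred.txt" | ConvertTo-SecureString -Key $key
```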

For now I fell back on:

if (-not (test-path $LocalFilePath\cred_$env:UserName.txt))
{
    # after running once, remove password here!
    "password_here" | ConvertTo-SecureString -AsPlainText -Force | ConvertFrom-SecureString | Out-File $LocalFilePath\cred_$env:UserName.txt
}

Note I changed the name of the cred.txt file to include the username, so I could have one when I ran it and another when the service account ran it. This makes testing far easier AND solves conflicts when debugging (i.e. “which credentials are currently being used?”).

So for now the documentation will be “after running the first time, remove the password”. I think more likely I’ll write a one-time script that runs to set up a few things and then deletes itself. We’ll see.

Anyway, this worked much better. Now my debug lines were getting the password and all. BUT… things were STILL hanging.  Talk about frustrating. I finally tracked it down to the line

$session = New-SFTPSession -ComputerName $sftpURL -Credential $Credential

Caving Blind

As you know, I love caving. And for caving, a headlamp is a critical piece of equipment. Without it you have no idea where you’re going.  That’s how I felt here. Things were simply hanging and I was getting NO feedback.

I tried

$session = New-SFTPSession -ComputerName $sftpURL -Credential $Credential | out-file ftp_log.txt -append

But I still wasn’t getting anything. I tried a few other things, including setting parameters on how New-SFTPSession would handle errors and warnings. But still nothing. It was like caving blind. The cmdlet was being called, but I couldn’t see ANY errors.  Of course when I ran it as myself, it ran fine. But when I ran it as a service, it simply hung.

I was getting frustrated I couldn’t get any feedback. I figured once I had feedback, the rest would be simple. I was mostly right.

I needed my headlamp! I needed to see what was going on!

Finally it dawned on me: “wrap it in a try/catch and log the exception”. So now the code became:

try
{
$session = New-SFTPSession -ComputerName $sftpURL -Credential $Credential
}
catch
{
$exception = $_.Exception.Message
"Error in New-SFTPSession - $exception - " | out-file ftp_log.txt -append
}

Now THIS got me somewhere. I was getting a timeout message!  Ok, that confused me. I mean I knew I had the right username and password, and it never timed out when I ran it. Why was it timing out when the service ran it?

So again, I did what I do in cases like this. Got up, went to the fridge, got some semi-dark chocolate chips and munched on them and ran through my head, “why was the secure FTP session timing out in one case, but not the other?”  At times like this I feel like Jack Ryan in Hunt for Red October, “How do you get a crew to want to get off a nuclear sub…”

Eureka!

DOH, it’s SECURE FTP. The first time it runs it needs to do a key exchange!  I need to accept the host key!  At first I panicked though. How was I supposed to get the service to accept the key?  Fortunately it turned out to be trivial. There’s actually a parameter, -AcceptKey. I added that… and everything ran perfectly.
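With that, the call inside my try/catch became (this assumes the Posh-SSH module, which is where New-SFTPSession comes from):

```powershell
try
{
    # -AcceptKey accepts and caches the host key on the first connection,
    # so no interactive confirmation is needed when running as a service.
    $session = New-SFTPSession -ComputerName $sftpURL -Credential $Credential -AcceptKey
}
catch
{
    $exception = $_.Exception.Message
    "Error in New-SFTPSession - $exception - " | out-file ftp_log.txt -append
}
```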

Except the SSIS package

That’s a separate issue and again probably related to security and that’s my problem to debug today. But, I had made progress!

Summary

Now, quite honestly, the above is probably more detail than most care to read. But let me highlight a couple of things I think are important when debugging or troubleshooting in general.

First, simplify the problem and eliminate the obvious. This is what I did with my first script. I wanted to make sure it wasn’t something simple and obvious. Once I knew I could write to a file using the service account that ruled out a whole line of questions. So I could move on to the next step.

Often I use an example from Sesame Street of “one of these things is not like another.”  In this case, I had to keep figuring out what was different between when I ran it and when the service account ran things. I knew up front that anything requiring keyboard input would be a problem. But, I thought I had that covered. Of course starting to compare and contrast the results of decrypting the cred.txt file showed that there was a problem there. And the issue with the initial acceptance of the SFTP key was another problem.

So, gather information and compare what works to what doesn’t work. Change one thing at a time until you narrow down the problem.

The other issue is being able to get accurate and useful information. Using debugging statements is often considered old school, and I know some developers look down on them, but often the quick and dirty works. So I had no problem using them. BUT, there are still cases where they won’t work: if, for example, a cmdlet hangs or doesn’t write to standard output. Catching the exception definitely solved this.

The biggest problems I really ran into here is I’m still a beginner with PowerShell. I’m loving it, but I often still have to lookup things. But the basic troubleshooting skills are ones I’ve developed and honed over the years.

And quite seriously, sometimes don’t be afraid to walk away and do something else for a bit. It almost always brings a fresh perspective. I’ve been known to take a shower to think about a problem. Working from home that is easier for me than it might be for someone in an office. But even then, go take a walk. Or do something.  Approach the problem with a fresh mind. All too often we start down the wrong path and just keep blindly going without turning around and reevaluating our position.

When we teach crack and crevice rescue in cave rescue we tell our students, “If you’re not making progress after 15 minutes, let someone else try. The fresh approach will generally help.”

Looking back on this, the problems seem pretty obvious. But it took a bit of work to get there. And honestly, I love the process of troubleshooting and/or debugging. It’s like being a detective, compiling one clue after another until you get the full picture.

So now on to comment some of the code and then figure out why the SSIS package isn’t executing now (I have ideas, we’ll see where I get to!)

Too Secure

There’s an old joke in IT that the Security Office’s job isn’t done until you can’t do yours.

There’s unfortunately at times some truth to that.  And it can be a bigger problem than you might initially think.

A recent example comes to mind. I have one client that has set up fairly strict security precautions. I’m generally in favor of most of them, even if at times they’re inconvenient. But recently, they made some changes that were frustrating, to say the least, and potentially problematic.  Let me explain.

Basically, at times I have to transfer a file created on a secured VM I control to one of their servers (that in theory is a sandbox in their environment that I can play in). Now, I obviously can’t just cut and paste it. Or perhaps that’s not so obvious, but yeah, for various reasons, through their VDI, they have C&P disabled. I’m ok with that. It does lessen the chance of someone accidentally cutting and pasting the wrong file to the wrong machine.

So what I previously did was something that seemed strange, but worked. I’d email the file to myself and then open a browser session on said machine and get the file there. Not ideal, and I’ll admit there are security implications, but it does cause the file to get virus scanned in at least two places, and I’m very unlikely to send myself a dangerous file.

Now, for my webclient on this machine, I tended to use Firefox. It was kept up to date and as far as I know, up until recently, on their approved list of programs.  Great. This worked for well over a year.

Then, one day last week, I go to the server in question and there’s no Firefox. I realized this was related to an email I had seen earlier in the week about their security team removing Firefox from a different server, “for security reasons”. Now arguably that server didn’t need Firefox, but still, my server was technically MY sandbox. So, I was stuck with IE. Yes, their security team thinks IE is more secure than Firefox.  Ok, no problem I’ll use IE.

I go ahead, enter my userid and supersecret password. Nothing happens. Try a few times since maybe I got the password wrong. Nope. Nothing.  So I tried something different to confirm my theory and get the dreaded, “Your browser does not support cookies” error. Aha, now I’m on to something.

I jump into the settings and try several different things to enable cookies completely. I figure I can return things to the way they want after I get my file. No joy. Despite enabling every applicable option, it wouldn’t override the domain settings and cookies remained disabled.  ARGH.

So, next I figured I’d re-download FF and use that. It’s my box after all (in theory).

I get the install downloaded, click on it and it starts to install. Great! What was supposed to be a 5 minute problem of getting the file I needed to the server is about done. It’s only taken me an hour or two, but I can smell success.

Well, turns out what I was smelling was more frustration. Half-way through the install it locks up. I kill the process and go back to the file I downloaded and try again. BUT, the file isn’t there. I realize after some digging that their security software is automatically deleting certain downloads, such as the Firefox install.

So I’m back to dead in the water.

I know, I’ll try to use Dropbox or OneDrive. But… both require cookies to get setup.  So much for that.

I’ve now spent close to 3 hours trying to get this file to their server.  I was at a loss as to how to solve this. So I did what I often do in situations like this. I jumped in the shower to think.

Now, I finally DID manage to find a way, but I’m actually not going to mention it here. The how isn’t important (though keeping the details private is probably at least a bit important).

Anyway, here’s the thing. I agree with trying to make servers secure. We in IT have too many data breaches as it is. BUT, there is definitely a problem with making things TOO secure. Actually two problems. The first is the old joke about how a computer encased in cement at the bottom of the ocean is extremely secure. But also unusable.  So, their security measures almost got us to the state of making an extremely secure  but useless computer.

But the other problem is more subtle. If you make things too secure, your users are going to do what they can to bypass your security in order to get their job done. They’re not trying to be malicious, but they may end up making things MORE risky by enabling services that shouldn’t be installed or by installing software you didn’t authorize, thus leaving you in an unknown security state (for the record, I didn’t do either of the above.)

Also, I find it frustrating when steps like the above are taken, but some of the servers in their environment don’t have the latest service packs or security fixes. So, they’re fixing surface issues, but ignoring deeper problems. While I was “nice” in what I did (i.e. I technically didn’t violate any of their security measures in the end), I did work to bypass them. A true hacker most likely isn’t going to be nice. They’re going to go for the gold and go through one of at least a dozen unpatched security holes to gain control of the system in question. So as much as I can live with their security precautions of locking down certain software, I’d also like to see them actually patch the machines.

So, security is important, but let’s not make it so tight people go to extremes to bypass it.


Mistakes were made

I generally avoid Reddit, I figure I have enough things in my life sucking my time. But from time to time one link comes across my screen that I find interesting. This is one of them.

The user accidentally deleted a production database. Now, I think we can all agree that deleting a database in production is a “bad thing”. But, whose fault is this really?

Yes, one could argue the employee should have been more careful, but let’s back up.

The respondents in the thread raise several good points.

  • Why were the values in the documentation ones that pointed to a LIVE, production database? Why not point to a dev copy or, better yet, one that doesn’t really exist? They expect the person to update/change the parameters anyway, so worst case, if they put the fake ones in, nothing happens.
  • Why didn’t Production have backups? This is a completely separate question, but a very important one!
  • Why fire him? As many pointed out, he had just learned a VERY valuable lesson, and taught the company a very valuable lesson too!

I’ll admit, I’d done something similar in my career at one of my employers. Except I wasn’t an intern, I was the Director of IT, and my goal in fact WAS to do something on the live database. The mistake I made was a minor one in execution (reversed the direction of an arrow on the GUI before hitting the Execute button) but disastrous in terms of impact. And of course there wasn’t a recent enough backup.

I wasn’t fired for that.  I did develop and enforce our change control documents after that and always ensured, as much as possible, that we had adequate backups. (Later in my career, a larger muckup did get me… “given the opportunities to apply my skills elsewhere”, but there were other factors involved and I arguably wasn’t fired JUST for the initial issue.)

As the Director of IT, I made a point of telling my employees that story. And I explained to them, that I expected them to make mistakes. If they didn’t they probably weren’t trying hard enough. But I told them the two things that I wouldn’t accept would be lying about a mistake (trying to cover it up, or blame others, etc) and repeatedly making the same mistake.

I wrote in an earlier post that mistakes were avoidable. But as I pointed out, it’s actually more complex than that. Some mistakes are avoidable. Or, at least they can be managed. For example, it is unfortunately likely that at some point, someone, somewhere, will munge production data.  Perhaps they won’t delete it all, or perhaps they’ll make a White Ford Taurus type mistake, but it will happen.  So you have safeguards in place. First, limit the number of people in a position to make such a mistake. Second, have adequate backups. There are probably other steps you can take to reduce the chance of error and mitigate it when it does eventually happen. Work on those.

But don’t fire the person who just learned a valuable lesson. They’re the one that is least likely to make that mistake again.  Me, I’d probably fire the CTO for not having backups and for having production values in documentation like that AND for firing the new guy.

When Life hands you Lemons

You make lemonade! Right? Ok, but how?

Ok, this is the 21st Century, now we use mixes. That makes it even easier, right?

But, I’ve given this some thought, and like many procedures there’s not necessarily a right way to do it. That said, I may change the procedure I use.

Ok, so I use one of those little pouches that make a lemon-flavored drink. I’m hesitant to call it actual lemonade, but let’s go with it.

Typically my process is to take the container, fill a drinking glass and if the container is empty, or has only a little bit left in it, make more. (Obviously if there’s a lot left, I just put the container back in the refrigerator. 🙂

So still pretty simple, right? Or is it?

Basically you put the powder in the container and then add water.

Or do you put the water in and then add the powder?

You may ask, “What difference does it make?”

Ultimately, it shouldn’t, in either case you end up with a lemon-flavored drink as your final product.

All along I’ve been going the route of putting the powder in first then adding the water. There was a rational reason for this: the turbulence of the water entering the container would help mix it and it would require less shaking. I thought this was pretty clever.

But then one night as I was filling the container with water (it was sitting in the sink) I got distracted, and by the time I returned my attention to it, I had overfilled the container and water was flowing over the top.  Or rather, somewhat diluted lemon-flavored drink was flowing over the top.  I had no idea how long this had been going on, but I knew I had an over-filled container that had to have a bit more liquid poured off before I could put it away. It also meant the lemon-flavored drink was going to be diluted by an unknown amount. That is less than optimal.

So the simple solution I figured was to change my procedure. Add the water first and then add the flavoring. That way if there was too much water in the container, I could just pour off the extra and then add the proper amount of powder and have an undiluted lemon-flavored drink.

That worked fine until one day as I was pouring the package, it slipped through my fingers into a half-filled container.  Now I had to find a way to fish it out. Ironically, the easiest way to do it was to overfill it so the package would float to the top. Of course now I was back to diluted lemon-flavored drink. And who knows what was on the outside of the powder package that was now inside the water.

Each procedure has its failure modes. Both, when successful, get me to the final solution.

So, which one is better?

I put in the powder first and then put in the water. I could say I have a rational reason like preferring slightly diluted lemon-flavored drink over a possibly contaminated lemon-flavored drink from a dropped in packet.

But the truth is, it really doesn’t matter which order I do the steps in. Neither failure is completely fatal, and in fact they’re about equivalent in their seriousness.

Old habits die hard, so I stick with my original method.

But, the point is that even in a process as simple as making lemon-flavored drink, there’s more than one way to do it, and either way may be workable. Just make sure you can justify your reasoning.

Deep Drilling

I was reviewing the job history on one of the DR servers of a client of mine. I noticed something funny. The last job recorded in the job history table (msdb.dbo.sysjobhistory for those playing along at home) was recorded in January of this year.

But jobs were still running. It took me a while to track it down, but through some sleuthing I solved the problem. First, I thought the msdb database might have filled up (though that event should have generated an error I’d have seen).  Nope.

Then I thought perhaps the table itself was full somehow. Nope, only about 32,000 records.  No luck.

I finally tried to run sp_sqlagent_log_jobhistory manually with some made up job information.

Msg 8115, Level 16, State 1, Procedure sp_sqlagent_log_jobhistory, Line 99
Arithmetic overflow error converting IDENTITY to data type int.
Arithmetic overflow occurred.

Now we’re getting someplace.  After a minor diversion of my own doing I then ran

DBCC CHECKIDENT('sysjobhistory', NORESEED)

This returned a value of 2147483647. Hmm, that number looks VERY suspicious. A quick check of Books Online confirmed that’s the max value of a signed int.

Now, the simple solution, which worked for me in this case was to issue a

truncate table sysjobhistory

This removed all the rows in the table AND reset the IDENTITY value. Normally I’d hate to lose history information, but since this was 6 months old and seriously out of date, it was acceptable. I could have merely reset the IDENTITY seed value, but there’s no guarantee I would not have then had collisions within the table later on. So this was the safest solution.
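For comparison, the reseed-only route I decided against would have looked something like this (a sketch; the existing rows still hold the low instance_id values, which is exactly where the collision risk comes from):

```sql
-- Reseed the IDENTITY without deleting history. Risky: old rows still
-- occupy the low instance_id values, so new inserts can collide with them.
DBCC CHECKIDENT('sysjobhistory', RESEED, 0);
```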

But wait, there was more. It kept bugging me that I had somehow reached the 2 BILLION limit on the IDENTITY value for this table. Sure, it handles log-shipping for about a dozen databases and as a result runs about 48 jobs an hour, plus other jobs.  But a year of that should generate less than 1 million rows.  This database server hasn’t been running for 2 thousand years.

So, I decided to monitor things a bit and wait for a few jobs to run.

Then, I executed the following query.

select max(instance_id) from sysjobhistory

This returned a value along the lines of 232031.  Somehow, in the space of an hour or less, my sysjobhistory IDENTITY value had increased by over 232,000. This made no sense. But it did explain how I hit the 2 billion limit!

So I started looking at the sysjobhistory table in detail. And I noticed gaps. Some make sense (if a job has multiple steps, it may temporarily insert a row and then roll it back once the job is done and put in a job completion record, and with the way IDENTITY columns work, this explains some small gaps). For example, there was a gap in instance_id from 868 to 875. Ok that didn’t bother me. BUT, the next value after 875 was 6,602. That was a huge gap! Then I saw a gap from 6,819 to 56,692. Another huge gap. As the movie says, “Something strange was going on in the neighborhood”.
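Spotting those gaps by eyeball gets old fast. A quick sketch of a query to surface them (LAG needs SQL Server 2012 or later; the threshold of 100 is arbitrary):

```sql
-- For each row, compare instance_id to the previous one and report big jumps.
SELECT h.instance_id,
       h.prev_id,
       h.instance_id - h.prev_id AS gap
FROM (
    SELECT instance_id,
           LAG(instance_id) OVER (ORDER BY instance_id) AS prev_id
    FROM msdb.dbo.sysjobhistory
) AS h
WHERE h.instance_id - h.prev_id > 100
ORDER BY gap DESC;
```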

I did a bit more drilling and found 3 jobs that were handling log-shipping from a particular server were showing HUGE amounts of history. Drilling deeper, I found they were generating errors, “Could not delete log file….”. Sure enough I went to the directories where the files were stored and there were log files going back to November.  Each directory had close to 22,000 log files that should have been deleted and weren’t.

Now I was closer to an answer. Back in November we had had issues with this server and I had to do a partial rebuild of it. And back then I had had some other issues related to log-shipping and permissions. I first checked permissions, but everything seemed fine.

I then decided to check attributes and sure enough all these files (based on the subdirectory attribute setting) had the R (readonly) value set. No wonder they couldn’t be deleted.

Now I’m trying to figure out how they got their attribute values set to R. (This is a non-traditional log-shipping setup, so it doesn’t use the built in SQL Server tools to copy the files. It uses rsync to copy files through an SSH tunnel).

So the mystery isn’t fully solved. It won’t be until I understand why they had an R value and if it will happen again.  That particular issue I’m still drilling into. But at least now I know why I hit the 2 billion row limit in my history table.

But, this is a good example of why it’s necessary to follow through an error to its root cause. All too often as an IT manager I’ve seen people who reported to me fix the final issue, but not the root cause. Had I done that here, i.e. simply cleared the history and reset the IDENTITY value, I’d have faced the same problem again a few weeks or months from now.

Moral of the story: When troubleshooting, it’s almost always worth taking the time to figure out not just what happened and fixing that, but WHY it happened and preventing it from happening again.


Experimenting

There are times when you have to take at face value what you are told.

There are 1.31 billion people living in China. This according to several sources (that all probably go back to the same official document from the Chinese government.)  I’m willing to believe that number. I’m certainly not going to go to China and start counting heads. For one, I don’t have the time, for another, I might look awfully weird doing so. It’s also accurate enough for any discussions I might have about China. But if I were going to knit caps for every person in China I might want a more accurate number.

That said, sometimes one shouldn’t take facts at face value. A case in point is given below. Let me start out with saying the person who gave me this fact, wasn’t wrong.  At least they’re no more wrong than the person who tells me that the acceleration due to gravity is 9.8m/s².  No, they are at worst inaccurate and more likely imprecise. Acceleration due to gravity here on Earth IS roughly 9.8m/s². But it varies depending where on the surface I am. And if I’m on the Moon it’s a completely different value.

Sometimes it is in fact possible to actually test, and it’s often worth it. I work with SQL Server and this is very true here. If a DBA tells you with absolute certainty that a specific setting should be set, or a query must be written a specific way, or an index rebuilt automatically at certain times, ask why. The worst answer they can give is, “I read it some place.”  (Please note, this is a bit different from saying, “Generally it’s best practice to do X”. Now we’re back to saying 9.8m/s², which is good enough for most things, but may not be good enough if say you want to precisely calibrate a piece of laboratory equipment.)

The best answer is “because I tested it and found that it works best”.

So, last night I had the pleasure of listening to Thomas Grohser speak on the SQL IO engine at a local SQL Server User Group meeting. As always it was a great talk. At one point he was talking about backups and various ways to optimize them. He made a comment about setting the maxtransfersize to 4MB being ideal. Now, I’m sure he’d be the first to add the caveat, “it depends”. He also mentioned how much compression can help.

But I was curious and wanted to test it. Fortunately I had access to a database that was approximately 15GB in size. This seemed liked the perfect size with which to test things.

I started with:

backup database TESTDB to disk='Z:\backups\TESTDB_4MB.BAK' with maxtransfersize=4194304

This took approximately 470 seconds and had a transfer rate of 31.151 MB/sec.

backup database TESTDB to disk='Z:\backups\TESTDB_4MB_COMP.BAK' with maxtransfersize=4194304, compression

This took approximately 237 seconds and a transfer rate of 61.681 MB/sec.

This is almost twice as fast.  While we’re chewing up a few more CPU cycles, we’re writing a lot less data.  So this makes a lot of sense. And of course now I can fit more backups on my disk. So compression is a nice win.

But what about the maxtransfersize?

backup database TESTDB to disk='Z:\backups\TESTDB.BAK'

This took approximately 515 seconds and a transfer rate of 28.410 MB/sec. So far, it looks like changing the maxtransfersize does help a bit (about 8%) over the default.

backup database TESTDB to disk='Z:\backups\TESTDB_comp.BAK' with compression

This took approximately 184 seconds with a transfer rate of 79.651 MB/sec.  This is the fastest of the 4 tests and by a noticeable amount.

Why? I honestly, don’t know. If I was really trying to optimize my backups, most likely I’d run each of these tests 5-10 more times and take an average. This may be an outlier. Or perhaps the 4MB test with compression ran slower than normal.  Or there may be something about the disk setup in this particular case that makes it the fastest method.
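That repeat-and-average harness is only a few lines of T-SQL. A sketch (WITH INIT overwrites the backup file each pass; the database and path names just match my test above):

```sql
-- Run the compressed backup five times and print each elapsed time.
DECLARE @i int = 1;
DECLARE @start datetime2;

WHILE @i <= 5
BEGIN
    SET @start = SYSDATETIME();

    BACKUP DATABASE TESTDB TO DISK = 'Z:\backups\TESTDB_comp.BAK'
        WITH COMPRESSION, INIT;

    PRINT CONCAT('Run ', @i, ': ', DATEDIFF(second, @start, SYSDATETIME()), ' seconds');
    SET @i += 1;
END
```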

The point is, this is something that is easy to setup and test. The entire testing took me about 30 minutes and was done while I was watching tv last night.

So before you simply read something on some blog someplace about “you should do X to SQL Server” take the time to test it. Perhaps it’s a great solution in your case. Perhaps it’s not. Perhaps you can end up finding an even better solution.


On Call

I want to pass on a video I’ve finally gotten around to watching:

Dave O’Conner speaks

I’ve managed a number of on-call teams to various levels of success. One point I’d add that makes a difference is good buy-in from above.

He addresses several good points, most of which I would fully agree with and even at various times adopted at my various jobs.

One thing he mentions is availability.  Too often folks claim they need 99.999% uptime. My question has often been “why?” and then followed by, “Are you willing to pay for that?”  Often the why boils down to “umm.. because…” and the paying for it was “no”, at least once they realized the true cost.

I also had a rule that I sometimes used: “If there was no possible response or no response necessary, don’t bother alerting!”.

An example might be traffic flow.  I’ve seen setups where if the traffic exceeded a certain threshold even once in, say, a one-hour period (assume monitoring every 5 seconds), a page would go out.  Why? By the time you respond it’s gone and there’s nothing to do.

A far better response is to automate it such that if it happens more than X times in Y minutes, THEN send an alert.

In some cases, simply retrying works.  In the SQL world I’ve seen re-index jobs fail due to locking or other issues.  I like my sleep.  So I set up most of my jobs to retry at least once on failure.

Then, later I’ll review the logs. If I see constant issue of retries I’ll schedule time to fix it.

At one client, we had an issue where a job would randomly fail maybe once a month.  They would page someone about it, who would rerun the job and it would succeed.

I looked at the history and realized simply by putting a delay in of about 5 minutes on a failure and retrying would reduce the number of times someone had to be called from about once a month to once every 3 years or so.  Fifteen minutes of reviewing the problem during a normal 9-5 timeframe and 5 minutes of checking the math and implementing the fix meant the on-call person could get more sleep every month. A real win.
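In SQL Agent, that delayed retry is just two parameters on the job step. A sketch using msdb’s sp_update_jobstep (the job name here is hypothetical):

```sql
-- Retry the step once, waiting 5 minutes after a failure.
EXEC msdb.dbo.sp_update_jobstep
    @job_name       = N'Nightly Reindex',   -- hypothetical job
    @step_id        = 1,
    @retry_attempts = 1,
    @retry_interval = 5;                    -- minutes between retries
```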

Moral of the story: Not everything is critical, and if it is, handle it as if it is, not as a second thought.