Copying a Large File

It was a pretty simple request actually. “Can you copy over the Panama database from FOO\WAS_21 to server BAR\LAX_45?”

“Sure, no problem.”

Of course it was a problem.  Here’s the issue. This is at one of my clients. They have a couple of datacenters and have hundreds of servers in each.  In addition, they have servers in different AD domains.  This helps them partition functionality and security requirements. Normally copying files between servers within a datacenter isn’t an issue. Even copying files between the different domains in the same datacenter isn’t normally too bad. To be clear, it’s not great.  Between servers in the same domain, it appears they have 1GB connections, between the domains, the firewall seems to throttle stuff down to 100MB.

The problem is when copying between different domains in different datacenters. This can be abysmally slow. That was my problem this week.  WAS_21 and LAX_45 were in different datacenters, and in different domains.

Now, for small files, I can use the cut and paste functionality built into RDP and simply cut and paste. This doesn’t work for large files. The file in this case was 19GB.  So this was out.

Fortunately, through the Citrix VDI they provide, I have a temp folder I can use. So, easily enough, I could copy the 19GB file from FOO\WAS_21 to that. That took just a few minutes.  Then I tried to copy it from there to BAR\LAX_45. This was slow, but looked like it would work.  It was going to take 4-5 hours, but they didn’t need the file for a week.

After about 4.5 hours, my RDP session locked up. I logged out and back in and saw the copy had failed. I tried again. This time at just under 4.5 hours I noticed an out of memory error. And then my session locked up.

So, apparently this wasn’t going to work. The obvious solution was to split the file (it was already compressed) into multiple files; except I’m not allowed to install most software on the servers. So that wasn’t a great option. I probably could have installed something like 7zip and then uninstalled it, but I didn’t want to deal with that and the paperwork that would result.

So I fell back to an old friend: Robocopy.  This appeared to be working great. Up until about 4.5 hours.  And guess what… another out of memory error.

But I LIKE challenges like this.

So I looked more closely. Robocopy has a lot of options. There are two that stuck out: /Z – restartable mode. That looked good. I figured worst case, I’d start my backup, let it fail at about 85% done and then resume it.

But then the holy grail: /J :: copy using unbuffered I/O (recommended for large files). 

Wow… unbuffered… that looks good. Might use less memory.

So I gambled and tried both.  And low and behold, 4:19 later… the file was copied!

So, it was an annoying problem but… I had solved it.  I like that!

So the take-away: Don’t give up. There’s always a way if you’re creative enough!

RCA or “get it running!”

How often have any of us resorted to fixing a server issue by simply rebooting the server?  Yes, we’re all friends here, you can raise your hands. Don’t be shy. We all know we’ve done it at some point.

I ask the question because of a recent tweet I saw with the hashtag #sqlhelp where Allan Hirt made a great comment:

Finding root cause is nice, but my goal first and foremost is to get back up and running quickly. Uptime > root cause more often than not.

This got me thinking, when is this true versus when is it not? And I think the answer ends up being the classic DBA answer, “it depends”.

I’m going to pick two well studied disasters that we’re probably all familiar with. But we need some criteria.  In my book IT Disaster Response: Lessons Learned in the Field I used the definition:

Disaster: An unplanned interruption in business that has an adverse impact on finances or other resources.

Let’s go with that.  It’s pretty broad, but it’s a starting point. Now let’s ignore minor disasters like I mention in the book, like the check printer running out of toner or paper on payroll day. Let’s stick with the big ones; the ones that bring production to a halt and cost us real money.  And we’re not going to restrict ourselves to IT or databases, but we’ll come back to that.

The first example I’m going to use is the Challenger Disaster. I would highly recommend folks read Diane Vaughen’s seminal work: The Challenger Launch Decision: Risky Technology, Culture, and Deviance at NASA. That said, we all know that when this occurred, NASA did a complete stand-down of all shuttle flights until a full RCA was complete and many changes were made to the program.

On the other hand, in the famous Miracle on the Hudson, airlines did not stop flying after the water landing. But this doesn’t mean a RCA wasn’t done. It in fact was; just well after the incident.

So, back to making that decision.  Here, it was an easy decision. Shuttle flights were occurring every few months and other than delaying some satellite launches (which ironically may have led to issues with the Galileo probe’s antenna) there wasn’t much reason to fly immediately afterwards.  Also, while the largest points were known, i.e. something caused a burn-thru of the SRB, it took months to determine all the details. So, in this case, NASA could and did stand-down for as long as it took to rectify the issues.

In the event of the Miracle on the Hudson, the cause was known immediately.  That said, even then an RCA was done to determine the degree of the damage, if Sullenberg and Skiles had done the right thing, and what procedural changes needed to be made.  For example one item that came out of the post-landing analysis was that the engine restart checklist wasn’t really designed for low altitude failures such as they experienced.

Doing a full RCA of the bird strike on US Airways 1549 and stopping all over flights would have been an economic catastrophe.  But it was more than simply that. It was clear, based on the millions of flights per year, that this was a very isolated incident. The exact scenario was unlikely to happen again.  With Challenger, there had only been 24 previous flights, and ALL of them had experienced various issues, including blow-bys of the primary O-ring and other issues with the SRBs.

So back to our servers.  When can we just “get it running” versus taking downtime to do a  complete RCA vs other options?

I’d suggest one criteria is, “how often has this happened compared to our uptime?”

If we’ve just brought a database online and within the first week it has crashed, I’m probably going to want to do more of an immediate RCA.  If it’s been running for years and this is first time this issue has come up, I’m probably going to just get it running again and not be as adamant about an immediate RCA. I will most likely try to do an RCA afterwards, but again, I my not push for it as hard.

If the problem starts to repeat itself, I’m more likely to push for some sort of immediate RCA the next time the problem occurs.

What about the seriousness of the problem? If I have a server that’s consistently running at 20% CPU and every once in awhile it leaps up to 100% CPU for a few seconds and then goes back to 20% will I respond the same way as if it crashes and it takes me 10 minutes to get it back up? Maybe.  Is it a web-server for cat videos that I make a few hundred off of every month? Probably not. Is it a stock-trading server where those few seconds costing me thousands of dollars?  Yes, then I almost certainly will be attempting an RCA of some short.

Another factor would be, what’s involved in an RCA? Is it just a matter of copying some logs to someplace for later analysis and that will simply take a few seconds or minutes, or am I going to have to run a bunch of queries, collect data and do other items that may keep the server off-line for 30 minutes or more?

Ultimately, in most cases, it’s going to come down to balancing money and in the most extreme cases, lives.  Determining the RCA now, may save money later, but cost money now. On the other hand, not doing an RCA now might save money now, but might cost money later.  Some of it is a judgement call, some of it depends on factors you use to make your decision.

And yes, before anyone objects, I’m only very briefly touching upon the fact that often an RCA can still be done after getting things working again. I’m just touching upon the cases where it has to be done immediately or evidence may be lost.

So, are your criteria for when you do an RCA immediately vs. getting things running as soon as you can? I’d love to hear them.

And credit for the Photo by j zamora on Unsplash

Oil Change Time and a Rubber Ducky

Sometimes, the inspiration for this blog comes from the strangest places. This time… it was an oil change.

I had been putting off changing my oil for far too long and finally took advantage of some free time last Friday to get it changed. I used to change it myself, but for some reason, in this new car (well new used car, but that’s a story for another day) I’ve always paid to get it changed. (And actually why I stopped changing it myself is also a blog post for another day.)

Anyway, I’ve twice now gone to the local Valvoline. This isn’t really an add for Valvoline specifically but more a comment on what I found interesting there.

So, most places where I’ve had my oil changed, you park, go in, give them your name and car keys and wait. Not here, they actually have you drive the car into the bay itself and you sit in the car the entire time. I think this is a bit more efficient, but since, instead of lifting the car, they have a pit under the car, I suppose they do risk someone driving their car into the pit (yes, it’s guarded by a low rail on either side, but you know there are drivers just that bad out there).

So, while sitting there I observed them doing two things I’m a huge fan of: using a checklist and calling out.

As I’ve talked about in my book and here in my own blogs, I love checklists. I recommend the book The Checklist Manifesto. They help reduce errors.  And while changing oil is fairly simple, mistakes do happen; the wrong oil gets put in, the drain plug isn’t properly tightened, too much gets put in, etc.

So hearing them call out and seeing them check off on the computer what they were doing, helps instill confidence. Now, I’m sure most, if not all oil change places do this, but if you’re sitting in the waiting room, you don’t get to see it.

But they also did something else which I found particularly interesting: they did a version of Pointing and Calling.  This is a very common practice in the Japanese railway system. One study showed it reduced accidents by almost 85%. So while changing my oil, the guy above would call out what he was doing. It was tough to hear everything he was calling out, but I know at one point the call was “4.5 Maxlife”  He then proceeded to put in what I presume was 4.5 quarts of the semi-synthetic oil into my engine (I know it was the right oil because I could see which nozzle he selected). I didn’t count the clicks, but I believe there was 9.  Now, other than the feedback of the 9 clicks, the guy in the pit couldn’t know for sure that it was the right oil and amount, but, I’m going to guess he had a computer terminal of his own and had his screen said “4 quarts standard” he’d have spoken up.  But even if he didn’t have a way of confirming the call, by speaking it out loud the guy above was engaging more of his brain in his task, which was more likely to reduce the chances of him making a mistake.

I left the oil change with a high confidence that they had done it right. And I was glad to know they actually were taking active steps to ensure that.

So, what about the rubber duck?

Well, a while back I started to pick up the habit of rubber duck debugging. Working at home, alone, it’s often hard to show another developer my code and ask, “Why isn’t this working?”  But, if I encounter a problem and I can’t seem to figure out why it’s not working. I now pull out a rubber duck and start working through the code line by line. It’s amazing how well this works.  I suspect that by taking the time to slow down to process the information and by engaging more of my brain (now the verbal and auditory portions), like pointing and calling, it helps bring more of my limited brain power to bear on the problem.  And if that doesn’t work, I still have my extended brain.

PS As a reminder, this coming Saturday I’m speaking at SQL Saturday Philadelphia. Don’t miss it!

Fail-safes

Dam it Jim, I’m a Doctor, not a civil engineer

I grew up near a small hydro-electric dam in CT. I was fascinated by it (and still am in many ways). One of the details I found interesting was that on top of this concrete structure they had what I later found are often called flashboards. These were 2x8s (perhaps a bit wider) running the length of the top of the dam, held in place by wooden supports.  The general idea was they increased the pooling depth by 8″ or so, but in the advent of a very heavy water flow or flood, they could be easily removed (in many cases removed simply by the force of the water itself).  They safely provided more water, but were designed in fact to fail (i.e. give away) in a safe and predictable manner.

This is an important detail that some designers of systems often don’t think about; how to fail. They spend so much time trying to PREVENT a failure, they don’t think about how the system will react in the EVENT of a failure. Properly designed systems assume that at some point failure IS an not only an option, it’s inevitable.

When I was first taught rigging for cave rescue, we were always taught “Have a mainline and a belay”.  The assumption is, that the system may fail. So, we spent a lot of time learning how to design a good belay system. The thinking has changed a bit these days, often we’re as likely to have TWO “mainlines” and switch between them, but the general concept is still the same, in the event of a failure EITHER line should be able to catch the load safely and be able to recover. (i.e. simply catching the fall but not being able to resume operations is insufficient.)

So, your systems. Do you think about failures and recovery?

Let me tell you about the one that prompted this post.  Years ago, for a client I built a log-shipping backup system for them. It uses SSH and other tools to get the files from their office to the corporate datacenter.  Because of the network setup, I can’t use the built-in SQL Server log-shipping copy commands.

But that’s not the real point. The real point is… “stuff happens”. Sometimes the network connection dies. Sometimes the copy hangs, or they reboot the server in the office in the middle of a copy, etc. Basically “things break”.

And, there’s another problem I have NOT been able to fix, that only started about 2 years ago (so for about 5 years it was not a problem.) Basically the SQL Server in the datacenter starts to have a memory leak and applying the log-files fails and I start to get errors.

Now, I HATE error emails. When this system fails, I can easily get like 60 an hour (every database, 4 times an hour plus a few other error emails). That’s annoying.

AND it was costing the customer every time I had to go in and fix things.

So, on the receiving side I setup a job to restart SQL Server and Agent every 12 hours (if we ever go into production we’ll have to solve the memory leak, but at this time we’ve decided it’s such a low priority as to not bother, and since it’s related to the log-shipping and if we failed over we’d be turning off log-shipping, it’s considered even less of an issue). This job comes in handy a bit later in the story.

Now, on the SENDING side, as I’ve said, sometimes the network would fail, they’d reboot in the middle of a copy or something random would make the copy job get “stuck”. This meant rather than simply failing, it would keep running, but not doing anything.

So, I eventually enabled a “deadman’s switch” in this job. If it runs for more than 12 hours, it will kill itself so that it can run normally again at the next scheduled time.

Now, here’s what often happens. The job will get stuck. I’ll start to get email alerts from the datacenter that it has been too long since logfiles have been applied. I’ll go in to the office server, kill the job and then manually run it. Then I’ll go into the datacenter, and make sure the jobs there are running.  It works and doesn’t take long. But, it takes time and I have to charge the customer.

So, this weekend…

the job on the office server got stuck. So I decided to test my failsafes/deadman switches.

I turned off SQL Agent in the datacenter, knowing that later that night my “cycle” job would turn it back on. This was simply so I wouldn’t get flooded with emails.

And, I left the stuck job in the office as is. I wanted to confirm the deadman’s switch would kick in and kill it and then restart it.

Sure enough later that day, the log files started flowing to the datacenter as expected.

Then a few hours later the SQL Agent in the datacenter started up again and log-shipping picked up where it left off.

So, basically I had an end to end test that when something breaks, on either end, the system can recover without human intervention. That’s pretty reassuring. I like knowing it’s that robust.

Failures Happen

And in this case… I’ve tested the system and it can handle them. That lets me sleep better at night.

Can your systems handle failure robustly?

 

 

Debugging – Powerhell (sic)

Generally I don’t plan to talk too much about specific programming problems. There’s too many blogs out there about how to solve particular problems; and most of them are generally good and even better, accurate.

I’m going to bore most of you, so if you want to jump to the summary, that’s fine. I won’t be insulted (heck I won’t even know!)

That said, last night I spent a lot of time debugging a particular PowerShell script I had written for a client.

I want to talk a bit about what was going on and how I approached the problem.

So, first the setup: It’s really rather simple

  1. Log into an SFTP site
  2. Download a zip file
  3. Expand the zip file
  4. Run an SSIS package to import the files in the zip package to a SQL Server

I had written the script several months ago and it’s been in production since then.  It’s worked great except for one detail. When it was put into production, we didn’t yet have the service account to run it under, so I set it up to run as a scheduled task using my account information. Not ideal, but workable for now.

Since then, we received the service account. So, I had made several accounts to move to the service account, but it wasn’t working.  Yesterday was the first time in awhile I had time to work on it so I started looking at it in more detail.

Permissions Perhaps?

So, the first problem I had was, I’d run the scheduled task and nothing would happen. Ok, that’s not entirely accurate. The script would start, but never run to completion. It just sort of sat there. And it certainly was NOT downloading the zip file.

My first thought was it might be a files permission or other security issue. So the first thing I did was write a really simply script.

get-date | out-file “D:\foo\timetest.text”

Then I setup a scheduled task and ran it manually. And sure enough, timetest.txt showed up exactly where it was supposed to.

So much for security in the OS.

out-file to the Rescue!

So, my next step is probably one of the oldest debug techniques ever: I put in statements in the code to write to a file as I hit various steps in the code.  Not quite as cool as a debugger, but hey, it’s simple and it works.

It was simple stuff like:

“Path $SftpPath Set” | out-file ftp_log.txt -append

BTW, if you’re not familiar with Powershell, one of the nice things is in the above statement, it’ll automatically insert the variable $SftpPath into the string it’s printing out.

This started to work like a charm. And then… jackpot!

“Setting Session using $sftpURL and Password: $password and credentials: $Credential” | out-file ftp_log.txt -append

When I ran it, I’d get a line in the log like:

Setting Session using secureftp.example.com and Password: System.Security.SecureString and Credentials: System.Management.Automation.PSCredentials

Not the most informative, but useful. And it’s nice to know that one can’t easily just print out the password!

But, when I ran it as the service I was getting something VERY different.

Setting Session using secureftp.example.com and Password:  and Credentials:

HUGE difference. What the heck was happening to the password and credentials?

That’s when the first lightbulb hit.

Secure credentials in PowerShell

Let’s take a slight side trip here.  We all know (right, of course you do) that we should never store passwords in cleartext in our scripts.  So I did some digging and realized that PowerShell actually has a nice way to handle this. You can pass a plaintext string to a cmdlet and write that out to a file.  Then when you need credentials, you read from the file and it gets parsed and handed to whatever needs it.

$password = get-content $LocalFilePath\cred.txt | ConvertTo-SecureString

Of course first you have to get the password INTO that cred.txt file.

That’s easy!

if (-not (test-path $LocalFilePath\cred.txt))
{
read-host -AsSecureString | ConvertFrom-SecureString | Out-File     $LocalFilePath\cred.txt
}

So, the first time the program is run, if the cred.txt file isn’t found, the user is prompted for it, they enter the password, it gets put into a secure string and written out. From then on, the cred.txt file is used and the password is nice and secure.

And I confirmed that the service account could actually see the cred.txt file.

So what was happening?

The Key!

It took me a few minutes and when I hit a problem like this I do what I often do, step away from the keyboard. I had to stop and think for a bit. Then it dawned on me. “How does Powershell know how to encrypt and decrypt the password in a secure fashion. I mean at SOME point it needs to have a clear-text version of to pass to the various cmdlets. It wouldn’t be very secure if just anyone could decrypt it.  That’s when it struck me! The encryption key (and decryption key) has to be based on the USER running the cmdlet!

Let me show some more code

$password = get-content $LocalFilePath\cred.txt | ConvertTo-SecureString
$Credential = New-Object System.Management.Automation.PSCredential (‘ftp_example’, $Password)

$sftpURL = ‘secureftp.example.com’

$session = New-SFTPSession -ComputerName $sftpURL -Credential $Credential

So first I’d get the password out of my cred.txt file and then create a Credential Object with a username (ftp_example) and the aforementioned password.

Then I’d try to open a session on the SFTP server using that credential.  And that’s exactly where it was hanging when I ran this under the service account. Now I was on to something. Obviously the $password and $Credential wasn’t getting set as we saw from my debug statements.  It wasn’t able to decrypt cred.txt!

Great, now I just need to have the service create its own cred.txt.  But, I can’t use the same technique where the first time the service runs it prompts the user for the password.  This bothers me. I still don’t have a perfect solution.

For now I fell back on:

if (-not (test-path $LocalFilePath\cred_$env:UserName.txt))
{
# after running once, remove password here!
“password_here” | ConvertTo-SecureString -AsPlainText -Force | ConvertFrom-SecureString | Out-File $LocalFilePath\cred_$env:UserName.txt
}

Note I changed the name of the cred.txt file to include the username, so I could have one when I ran it and another when the service account ran it. This makes testing far easier AND solves conflicts when debugging (i.e. “which credentials are currently being used).

So for now the documentation will be “after running the first time, remove the password”. I think more likely I’ll write one-time script that runs to setup a few things and then deletes itself. We’ll see.

Anyway, this worked much better. Now my debug lines were getting the password and all. BUT… things were STILL hanging.  Talk about frustrating. I finally tracked it down to the line

$session = New-SFTPSession -ComputerName $sftpURL -Credential $Credential

Caving Blind

As you know, I love caving. And for caving, a headlamp is a critical piece of equipment. Without it you have no idea where you’re going.  That’s how I felt here. Things were simply hanging and I was getting NO feedback.

I tried

$session = New-SFTPSession -ComputerName $sftpURL -Credential $Credential | out-file ftp_log.txt -append

But I still wasn’t getting anything. I tried a few other things, including setting parameters on how New-SFTPSession would handle errors and warnings. But still nothing. It was like caving blind. The cmdlet was being called, but I couldn’t see ANY errors.  Of course when I ran it as myself, it ran fine. But when I ran it as a service, it simply hung.

I was getting frustrated I couldn’t get any feedback. I figured once I had feedback, the rest would be simple. I was mostly right.

I needed my headlamp! I needed to see what was going on!

Finally it dawned on me, “wrap it in an exception”. So now the code became:

try
{
$session = New-SFTPSession -ComputerName $sftpURL -Credential $Credential
}
catch
{
$exception = $_.Exception.Message
“Error in New-SFTPSession – $exception – ” | out-file ftp_log.txt -append
}

Now THIS got me somewhere. I was getting a timeout message!  Ok, that confused me. I mean I knew I had the right username and password, and it never timed out when I ran it. Why was it timing out when the service ran it?

So again, I did what I do in cases like this. Got up, went to the fridge, got some semi-dark chocolate chips and munched on them and ran through my head, “why was the secure FTP session timing out in one case, but not the other?”  At times like this I feel like Jack Ryan in Hunt for Red October, “How do you get a crew to want to get off a nuclear sub…”

Eureka!

DOH, it’s SECURE FTP. The first time it runs it needs to do a key exchange!  I need to accept the key!  At first I panicked though. How was I supposed to get the service to accept the key?  Fortunately it turned out to be trivial. There’s actually a parameter -Acceptkey. I added that… and everything ran perfectly.

Except the SSIS package

That’s a separate issue and again probably related to security and that’s my problem to debug today. But, I had made progress!

Summary

Now, quite honestly, the above is probably more detail than most care to read. But let me highlight a couple of things I think that are important when debugging or trouble shooting in general.

First, simplify the problem and eliminate the obvious. This is what I did with my first script. I wanted to make sure it wasn’t something simple and obvious. Once I knew I could write to a file using the service account that ruled out a whole line of questions. So I could move on to the next step.

Often I use an example from Sesame Street of “one of these things is not like another.”  In this case, I had to keep figuring out what was different between when I ran it and when the service account ran things. I knew up front that anything requiring keyboard input would be a problem. But, I thought I had that covered. Of course starting to compare and contrast the results of decrypting the cred.txt file showed that there was a problem there. And the issue with the initial acceptance of the SFTP key was another problem.

So, gather information and compare what works to what doesn’t work. Change one thing at a time until you narrow down the problem.

The other issue is being able to get accurate and useful information. Using debugging statements if often old school, and I know some developers look down on them, but often the quick and dirty works. So I had no problem using them. BUT, there are still cases where they won’t work. If for example a cmdlet hangs or doesn’t give output to standard output, they won’t work. Catching the exception definitely solved this.

The biggest problems I really ran into here is I’m still a beginner with PowerShell. I’m loving it, but I often still have to lookup things. But the basic troubleshooting skills are ones I’ve developed and honed over the years.

And quite seriously, sometimes don’t be afraid to walk away and do something else for a bit. It almost always brings a fresh perspective. I’ve been known to take a shower to think about a problem. Working from home that is easier for me than it might be for someone in an office. But even then, go take a walk. Or do something.  Approach the problem with a fresh mind. All too often we start down the wrong path and just keep blindly going without turning around and reevaluating our position.

When we teach crack and crevice rescue in cave rescue we tell our students, “If you’re not making progress after 15 minutes, let someone else try. The fresh approach will generally help.”

Looking back on this, the problems seem pretty obvious. But it took a bit of work to get there. And honestly, I love the process of troubleshooting and/or debugging. It’s like being a detective, compiling one clue after another until you get the full picture.

So now on to comment some of the code and then figure out why the SSIS package isn’t executing now (I have ideas, we’ll see where I get to!)

Too Secure

There’s an old joke in IT that the Security Office’s job isn’t done until you can’t do yours.

There’s unfortunately at times some truth to that.  And it can be a bigger problem than you might initially think.

A recent example comes to mind. I have one client that has setup fairly strict security precautions. I’m generally in favor of most of them, even if at times they’re inconvenient. But recently, they made some changes that were, frustrating to say the least and potentially problematic.  Let me explain.

Basically, at times I have to transfer a file created on a secured VM I control to one of their servers (that in theory is a sandbox in their environment that I can play in). Now, I obviously can’t just cut and paste it. Or perhaps that’s not so obvious, but yeah, for various reasons, through their VDI, they have C&P disabled. I’m ok with that. It does lessen the chance of someone accidentally cutting and pasting the wrong file to the wrong machine.

So what I previously did was something that seemed strange, but worked. I’d email the file to myself and then open a browser session on the said machine and get the file there. Not ideal and I’ll admit there are security implications, but it does cause the file to get virus scanned at at least two places and I’m very unlikely to send myself a dangerous file.

Now, for my webclient on this machine, I tended to use Firefox. It was kept up to date and as far as I know, up until recently, on their approved list of programs.  Great. This worked for well over a year.

Then, one day last week, I go to the server in question and there’s no Firefox. I realized this was related to an email I had seen earlier in the week about their security team removing Firefox from a different server, “for security reasons”. Now arguably that server didn’t need Firefox, but still, my server was technically MY sandbox. So, I was stuck with IE. Yes, their security team thinks IE is more secure than Firefox.  Ok, no problem I’ll use IE.

I go ahead, enter my userid and supersecret password. Nothing happens. Try a few times since maybe I got the password wrong. Nope. Nothing.  So I tried something different to confirm my theory and get the dreaded, “Your browser does not support cookies” error. Aha, now I’m on to something.

I jump into the settings and try several different things to enable cookies completely. I figure I can return things to the way they want after I get my file. No joy. Despite enabling every applicable options, it wouldn’t override the domain settings and cookies remained disabled.  ARGH.

So, next I figured I’d re-download FF and use that. It’s my box after all (in theory).

I get the install downloaded, click on it and it starts to install. Great! What was supposed to be a 5 minute problem of getting the file I needed to the server is about done. It’s only taken me an hour or two, but I can smell success.

Well, turns out what I was smelling was more frustration. Half-way through the install it locks up. I kill the process and go back to the file I downloaded and try again. BUT, the file isn’t there. I realize after some digging that their security software is automatically deleting certain downloads, such as the Firefox install.

So I’m back to dead in the water.

I know, I’ll try to use Dropbox or OneDrive. But… both require cookies to get setup.  So much for that.

I’ve now spend close to 3 hours trying to get this file to their server.  I was at a loss as to how to solve this. So I did what I often do in situations like this. I jumped in the shower to think.

Now, I finally DID manage to find a way, but I’m actually not going to mention it here. The how isn’t important (though keeping the details private are probably at least a bit important.)

Anyway, here’s the thing. I agree with trying to make servers secure. We in IT have too many data breaches as it is. BUT, there is definitely a problem with making things TOO secure. Actually two problems. The first is the old joke about how a computer encased in cement at the bottom of the ocean is extremely secure. But also unusable.  So, their security measures almost got us to the state of making an extremely secure  but useless computer.

But the other problem is more subtle. If you make things too secure, your users are going to do what they can to bypass your security in order to get their job done. They’re not trying to be malicious, but they may end up making things MORE risky by enabling services that shouldn’t be installed or by installing software you didn’t authorize, thus leaving you in an unknown security state (for the record, I didn’t do either of the above.)

Also, I find it frustrating when steps like the above are taken, but some of the servers in their environment don’t have the latest service packs or security fixes. So, they’re fixing surface issues, but ignoring deeper problems. While I was “nice” in what I did; i.e. I technically didn’t violate any of their security measures in the end, I did work to bypass them. A true hacker most likely isn’t going to be nice. They’re going to go for the gold and go through one of at least a dozen unpatched security holes to gain control of the system in question. So as much as I can live with their security precautions of locking down certain software, I’d also like to see them actually patch the machines.

So, security is important, but let’s not make it so tight people go to extremes to by pass it.

 

Mistakes were made

I generally avoid Reddit, I figure I have enough things in my life sucking my time. But from time to time one link comes across my screen that I find interesting. This is one of them.

The user accidentally deleted a production database. Now, I think we can all agree that deleting a database in production is a “bad thing”. But, whose fault is this really?

Yes, one could argue the employee should have been more careful, but, let’s backup.

The respondents in the thread raise several good points.

  • Why were the values in the documentation ones that pointed to a LIVE, production database? Why not point to a dev copy or even better yet, one that doesn’t really exist. They expect the person to update/change the parameters anyway, so worst case if they put in the fake ones in is, nothing happens.
  • Why didn’t Production have backups? This is a completely separate question, but a very important one!
  • Why fire him? As many pointed out, he had just learned a VERY valuable lesson, and taught the company a very valuable lesson too!

I’ll admit, I’d something similar in my career at one of my employers. Except I wasn’t an intern, I was the Director of IT, and my goal in fact WAS to do something on the live database. The mistake I made was a minor one in execution (reversed the direction of an arrow on the GUI before hitting the Execute button) but disastrous in terms of impact. And of course there wasn’t a recent enough backup.

I wasn’t fired for that.  I did develop and enforce our change control documents after that and always ensured, as much as possible that we had adequate backups. (Later in a my career, a larger, bigger muckup did get me… “given the opportunities to apply my skills elsewhere”, but there were other factors involved and I arguably wasn’t fired JUST for the initial issue.)

As the Director of IT, I made a point of telling my employees that story. And I explained to them, that I expected them to make mistakes. If they didn’t they probably weren’t trying hard enough. But I told them the two things that I wouldn’t accept would be lying about a mistake (trying to cover it up, or blame others, etc) and repeatedly making the same mistake.

I wrote in an earlier post that mistakes were avoidable. But as I pointed out, it’s actually more complex than that. Some mistakes are avoidable. Or, at least they can be managed. For example, it is unfortunately likely that at some point, someone, somewhere, will munge production data.  Perhaps they won’t delete it all, or perhaps they’ll do a make a White Ford Taurus type mistake, but it will happen.  So you have safeguards in place. First, limit the number of people in a position to make such a mistake. Second, have adequate backups. There’s probably other steps you can do to reduce the chance of error and mitigate it when it does eventually happen. Work on those.

But don’t fire the person who just learned a valuable lesson. They’re the one that is least likely to make that mistake again.  Me, I’d probably fire the CTO for not having backups and for having production values in documentation like that AND for firing the new guy.