Fail-safes

Dam it Jim, I’m a Doctor, not a civil engineer

I grew up near a small hydro-electric dam in CT. I was fascinated by it (and still am in many ways). One of the details I found interesting was that on top of this concrete structure they had what I later learned are often called flashboards. These were 2x8s (perhaps a bit wider) running the length of the top of the dam, held in place by wooden supports. The general idea was that they increased the pooling depth by 8″ or so, but in the event of a very heavy water flow or flood, they could be easily removed (in many cases removed simply by the force of the water itself). They safely provided more water, but were in fact designed to fail (i.e. give way) in a safe and predictable manner.

This is an important detail that some designers of systems often don’t think about: how to fail. They spend so much time trying to PREVENT a failure that they don’t think about how the system will react in the EVENT of a failure. Properly designed systems assume that at some point failure is not only an option, it’s inevitable.

When I was first taught rigging for cave rescue, we were always taught, “Have a mainline and a belay.” The assumption is that the system may fail, so we spent a lot of time learning how to design a good belay system. The thinking has changed a bit these days; often we’re as likely to have TWO “mainlines” and switch between them, but the general concept is still the same: in the event of a failure, EITHER line should be able to catch the load safely and allow the operation to recover (i.e. simply catching the fall but not being able to resume operations is insufficient).

So, your systems. Do you think about failures and recovery?

Let me tell you about the one that prompted this post. Years ago, I built a log-shipping backup system for a client. It uses SSH and other tools to get the files from their office to the corporate datacenter. Because of the network setup, I can’t use the built-in SQL Server log-shipping copy commands.

But that’s not the real point. The real point is… “stuff happens”. Sometimes the network connection dies. Sometimes the copy hangs, or they reboot the server in the office in the middle of a copy, etc. Basically “things break”.

And there’s another problem I have NOT been able to fix, one that only started about 2 years ago (so for about 5 years it wasn’t a problem). Basically, the SQL Server in the datacenter develops a memory leak, applying the log files starts to fail, and I start to get errors.

Now, I HATE error emails. When this system fails, I can easily get like 60 an hour (every database, 4 times an hour plus a few other error emails). That’s annoying.

AND it was costing the customer every time I had to go in and fix things.

So, on the receiving side I set up a job to restart SQL Server and SQL Server Agent every 12 hours. (If we ever go into production we’ll have to solve the memory leak, but for now we’ve decided it’s such a low priority as to not be worth the effort. And since it’s tied to the log-shipping, which we’d be turning off anyway if we ever failed over, it’s considered even less of an issue.) This job comes in handy a bit later in the story.

Now, on the SENDING side, as I’ve said, sometimes the network would fail, they’d reboot in the middle of a copy or something random would make the copy job get “stuck”. This meant rather than simply failing, it would keep running, but not doing anything.

So, I eventually added a “deadman’s switch” to this job. If it runs for more than 12 hours, it kills itself so that it can run normally again at the next scheduled time.
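The mechanics are simple enough. Here’s a rough sketch of the idea as a small watchdog step scheduled in SQL Server Agent (the job name is made up, and using sysjobactivity with sp_stop_job is just one way to do it, not necessarily how my job is actually wired up):

-- Deadman's switch sketch: stop the copy job if it has been running more than 12 hours.
DECLARE @job_id uniqueidentifier, @started datetime

SELECT @job_id = j.job_id
FROM msdb.dbo.sysjobs j
WHERE j.name = N'LogShip - Copy to datacenter' -- hypothetical job name

SELECT TOP (1) @started = ja.start_execution_date
FROM msdb.dbo.sysjobactivity ja
WHERE ja.job_id = @job_id
AND ja.start_execution_date IS NOT NULL
AND ja.stop_execution_date IS NULL -- still running
ORDER BY ja.start_execution_date DESC

IF @started IS NOT NULL AND DATEDIFF(hh, @started, GETDATE()) >= 12
EXEC msdb.dbo.sp_stop_job @job_id = @job_id -- let it start clean at the next scheduled run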

Now, here’s what often happens. The job will get stuck. I’ll start to get email alerts from the datacenter that it has been too long since log files have been applied. I’ll go into the office server, kill the job, and then manually run it. Then I’ll go into the datacenter and make sure the jobs there are running. It works and doesn’t take long. But it takes time, and I have to charge the customer.
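The alerts themselves are basically a freshness check. Something like the query below is the general idea; the three-hour threshold and the use of msdb.dbo.restorehistory are my assumptions for illustration, not the client’s actual monitoring job:

-- Alert sketch: flag any database whose most recent log restore is older than the threshold.
SELECT rh.destination_database_name, MAX(rh.restore_date) AS LastLogRestore
FROM msdb.dbo.restorehistory rh
WHERE rh.restore_type = 'L' -- log restores only
GROUP BY rh.destination_database_name
HAVING MAX(rh.restore_date) < DATEADD(hh, -3, GETDATE()) -- threshold is an assumption

Any rows returned would turn into those nagging emails.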

So, this weekend…

the job on the office server got stuck. So I decided to test my fail-safes and deadman’s switches.

I turned off SQL Agent in the datacenter, knowing that later that night my “cycle” job would turn it back on. This was simply so I wouldn’t get flooded with emails.

And, I left the stuck job in the office as is. I wanted to confirm the deadman’s switch would kick in and kill it and then restart it.

Sure enough later that day, the log files started flowing to the datacenter as expected.

Then a few hours later the SQL Agent in the datacenter started up again and log-shipping picked up where it left off.

So, basically I had an end-to-end test showing that when something breaks, on either end, the system can recover without human intervention. That’s pretty reassuring. I like knowing it’s that robust.

Failures Happen

And in this case… I’ve tested the system and it can handle them. That lets me sleep better at night.

Can your systems handle failure robustly?

Hours for the week

Like I say, I don’t generally post SQL-specific stuff because, well, there are so many blogs out there that do. But what the heck.

Had a problem the other day. I needed to return the hours worked per day over a date range for a specific employee, and if they worked no hours on a given day, return 0. So basically I had to deal with gaps.

There are lots of solutions out there; this is mine:

Alter procedure GetEmployeeHoursByDate @startdate date, @enddate date, @userID varchar(25)
as

-- Usage: exec GetEmployeeHoursByDate '2018-01-07', '2018-01-13', 'gmoore'

-- Author: Greg D. Moore
-- Date: 2018-02-12
-- Version: 1.0

-- Get the totals for the days in question

set NOCOUNT on

-- First let's create a simple table that just has the range of dates we want.
-- (Note: a recursive CTE is limited to 100 levels of recursion by default, so as
-- written this handles date ranges of up to about 100 days.)

; WITH daterange AS (
SELECT @startdate AS WorkDate
UNION ALL
SELECT DATEADD(dd, 1, WorkDate)
FROM daterange s
WHERE DATEADD(dd, 1, WorkDate) <= @enddate)

select dr.WorkDate as workdate, coalesce(a.DailyHours, 0) as DailyHours from
(
-- Here we get the hours worked and sum them up for that person.
select ph.WorkDate, sum(ph.Hours) as DailyHours from ProjectHours ph
where ph.UserID = @userid
and ph.workdate >= @startdate and ph.workdate <= @enddate
group by ph.workdate
) as a
right outer join daterange dr on dr.WorkDate = a.WorkDate -- now join our table of dates to our hours and put in 0 for dates we don't have hours for
order by workdate

GO

There’s probably better ways, but this worked for me. What’s your solution?

This is secure, right?

So, over the weekend, while I was waiting for my wife’s hockey game to begin, I was exploring the facility. I saw an upper level and was curious whether it was open. But alas, no. It was locked. And I mean locked. Above is a photo of the chain and lock used to keep people out.

There’s a saying that locks only keep out honest people. That’s not entirely true, but there’s some truth to that. Security, especially in my field of IT and databases is extremely important. I am often working with data that contains private information about people. I have to take steps to keep it safe. So, as I mentioned in Too Secure, I don’t have a problem with security per se. There I talked about security that prevented me from doing my job.

Here we have the opposite: making something that looks secure, but really isn’t. I mean, this has a heavy chain and a lock. I’m not sure who did this or why, but I’m sure they felt the lock was important. It’s not a cheap one, either.

This is a case where a simple sign and string, “Balcony closed,” probably would have sufficed. The only purpose the lock really served was to make sure the chain wasn’t stolen, and the purpose of the chain appeared to be to give the lock something to attach to.

Now, sure, someone could have removed a sign and string and said, “oh we never saw it” but again, what’s the ultimate risk if someone got upstairs? It was simply an observation balcony. They weren’t really securing much.

But I’m sure someone felt good about their security.

So, when you’re making something secure, stop and make sure you’re spending your time wisely. Not everything needs the most secure system available. Sometimes a “keep out” sign might be enough, and easier.

Starting off the New Year… wrong

Ok, I had plans for a great blog post on time and how we measure it and how it really is arbitrary. But… that will have to wait. (I suppose that brings its own irony about time and waiting and all that. Hmm, now I may need a post about that!)

But, circumstances suggested a different post.

This one started with an email from my boss last year (who will remain nameless for his sake 🙂 ): “Greg, I hacked together this web page so my team members can track their hours. I need you to make a few tweaks to it. And don’t worry, we’ll replace it by the end of the year.” Yes, that sound you hear is your head hitting the desk like mine, because you KNOW what is coming next. Here we are in January of 2018, and guess what: the web page is still in place.

And of course, it’s not just HIS (now former; he’s moved on) team that is using it. Apparently his boss liked what it could do, so his boss (my boss^2) declared that ALL his teams would use it. So instead of just 4-5 people using it, there are close to 30-40 people using it.

So, remember the first rule of development: “Oh, don’t worry about this, we’ll replace it soon” never happens. His original code made a lot of assumptions (which at the time I suppose seemed reasonable). For example, since many schedules are built around work weeks, he was originally storing hours by the week of the year (and then the day of the week) they were worked in. For example, if someone worked for 6 hours on January 5th of last year, the table stored that as “Week 1, Day 5”.

Now, before anyone thinks this is completely nuts, do keep in mind that many places, including my last job at GE, did everything by week of the year. There are some advantages to this (that I won’t go into here).

For the most part, the code worked and I didn’t care. But at one point I was asked to do a report based not on work weeks, but on an actual calendar month. So I had to figure out, for example, that February 1st occurred during Work Week 5, on the 4th day, and go from there.
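If you’re wondering what that conversion amounts to, it’s roughly this (just my shorthand for the idea, assuming the default US-style settings where weeks run Sunday through Saturday, which is what made 2017 line up so nicely; it’s not necessarily the code I hacked together):

SET DATEFIRST 7 -- weeks run Sunday through Saturday
DECLARE @reportdate date = '2017-02-01'
SELECT DATEPART(wk, @reportdate) AS WorkWeek, -- 5
DATEPART(dw, @reportdate) AS WorkDay -- 4, i.e. Wednesday

Going from the stored weeks back to calendar months is the messier direction, which is why the hacks piled up.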

So, I hacked that together. Then there was a request for another report, and another. Pretty soon it was getting to be a pain to keep track of all this. And I realized another problem: my boss had gotten lucky in 2017. January 1st fell on a Sunday, which meant that Work Week 1, Day 1 was a natural starting point.

I started to worry about what would happen when 2018 rolled around. His code would probably break.  So I finally took the plunge and started to refactor the code.  I also added new features as they were requested.  Things were going great.

Until yesterday. So now we get to another good rule of development: “Use Version Control.” So simple. Even for a one-person shop like me, it’s a good idea. I had put it off for a while, but I finally downloaded the git plug-in for Visual Studio and started to use it.

So yesterday, a user reported that they couldn’t enter hours for last week. I pulled up the app and realized that yes, in fact it was coded in such a way you couldn’t go back into a previous year to enter hours. No problem, I can fix that!

Well, let me add a rule I had sort of missed: “Understand HOW to use Version Control.” I won’t go into details, but let’s just say I wasn’t committing like I should have been, or like I thought I was. So, in an attempt to get a clean base, I merged things… “poorly.” I had the branches I wanted, but I had not properly committed my work to them.

The work of the last few weeks was GONE. I know, you’re saying, “Just go to your backups, Greg, because of course a person who writes about DR has proper backups, right?” Yeah, right! Let’s not go there!

Anyway, I spent 4 hours last night and 2 this morning recreating the code. Fortunately, it dawned on me that, being .NET code, it compiles down to IL, and that with a good decompiler I could get most of the code back without too much effort. I want to give props to RedGate’s .NET Reflector. Of the two decompilers I tried, it was clearly the better one. While I lost my variable names and the decompiled code isn’t quite what I’d call human-written, it was close enough that I could in short order recreate hours and hours of work I had done.

In the meantime, I also talked to some of my programming buddies on an RPI chat server (ask me about it sometime) about git and better procedures.

And here’s where I realized my fundamental mistake. It wasn’t just misunderstanding how Visual Studio was interacting with git and the branches I was creating; it was that somewhere in the back of my head I had decided that commits should be used only when major steps were completed. It was almost like I was afraid I’d run out, or perhaps complicate the history with too many of them. But you know what? Commits aren’t scarce resources. And I’m the only one reading the history. I don’t really care if I’ve got 10 commits or 100. The more the better, in many ways. Had I been making commits much more frequently, I’d have saved myself a lot of work. (And I can discuss having a remote store at some point, etc.)

So really, the point of this hastily written, sleep-deprived blog post isn’t so much to talk about dates and apps that never get replaced, but about a much deeper problem I was having. It wasn’t just failing to fully understand git commands; it was failing to understand how I should be using git in the first place.

So, to riff on an old phrase, “Commit Early, Commit often”.

Oh, and I did hack together a way for that user to enter hours for last year. It’s good enough. I’m sure I won’t need it a year from now. The code will all be replaced or refactored by then, I’m sure!

Too Secure

There’s an old joke in IT that the Security Office’s job isn’t done until you can’t do yours.

There’s unfortunately at times some truth to that.  And it can be a bigger problem than you might initially think.

A recent example comes to mind. I have one client that has set up fairly strict security precautions. I’m generally in favor of most of them, even if at times they’re inconvenient. But recently they made some changes that were frustrating, to say the least, and potentially problematic. Let me explain.

Basically, at times I have to transfer a file created on a secured VM I control to one of their servers (which in theory is a sandbox in their environment that I can play in). Now, I obviously can’t just cut and paste it. Or perhaps that’s not so obvious, but yeah, for various reasons they have cut and paste disabled through their VDI. I’m ok with that. It does lessen the chance of someone accidentally pasting the wrong file to the wrong machine.

So what I previously did was something that seemed strange, but worked: I’d email the file to myself and then open a browser session on said machine and download the file there. Not ideal, and I’ll admit there are security implications, but it does cause the file to get virus-scanned in at least two places, and I’m very unlikely to send myself a dangerous file.

Now, for my web client on this machine, I tended to use Firefox. It was kept up to date and, as far as I know, was until recently on their approved list of programs. Great. This worked for well over a year.

Then, one day last week, I go to the server in question and there’s no Firefox. I realized this was related to an email I had seen earlier in the week about their security team removing Firefox from a different server “for security reasons”. Now, arguably that server didn’t need Firefox, but still, my server was technically MY sandbox. So I was stuck with IE. Yes, their security team thinks IE is more secure than Firefox. Ok, no problem, I’ll use IE.

I go ahead and enter my userid and super-secret password. Nothing happens. I try a few times, in case I got the password wrong. Nope. Nothing. So I try something different to confirm my theory and get the dreaded “Your browser does not support cookies” error. Aha, now I’m on to something.

I jump into the settings and try several different things to enable cookies completely. I figure I can return things to the way they want them after I get my file. No joy. Despite enabling every applicable option, I couldn’t override the domain settings, and cookies remained disabled. ARGH.

So, next I figured I’d re-download FF and use that. It’s my box after all (in theory).

I get the install downloaded, click on it and it starts to install. Great! What was supposed to be a 5 minute problem of getting the file I needed to the server is about done. It’s only taken me an hour or two, but I can smell success.

Well, turns out what I was smelling was more frustration. Half-way through the install it locks up. I kill the process and go back to the file I downloaded and try again. BUT, the file isn’t there. I realize after some digging that their security software is automatically deleting certain downloads, such as the Firefox install.

So I’m back to dead in the water.

I know, I’ll try to use Dropbox or OneDrive. But… both require cookies to get set up. So much for that.

I’ve now spent close to 3 hours trying to get this file to their server. I was at a loss as to how to solve this. So I did what I often do in situations like this: I jumped in the shower to think.

Now, I finally DID manage to find a way, but I’m not going to mention it here. The how isn’t important (though keeping the details private is probably at least a bit important).

Anyway, here’s the thing. I agree with trying to make servers secure. We in IT have too many data breaches as it is. BUT, there is definitely a problem with making things TOO secure. Actually, two problems. The first is the old joke about how a computer encased in cement at the bottom of the ocean is extremely secure, but also unusable. Their security measures almost got us to that state: an extremely secure but useless computer.

But the other problem is more subtle. If you make things too secure, your users are going to do what they can to bypass your security in order to get their jobs done. They’re not trying to be malicious, but they may end up making things MORE risky by enabling services that shouldn’t be enabled or by installing software you didn’t authorize, thus leaving you in an unknown security state. (For the record, I didn’t do either of the above.)

Also, I find it frustrating when steps like the above are taken, but some of the servers in their environment don’t have the latest service packs or security fixes. So, they’re fixing surface issues but ignoring deeper problems. While I was “nice” in what I did, i.e. I technically didn’t violate any of their security measures in the end, I did work to bypass them. A true hacker most likely isn’t going to be nice. They’re going to go for the gold and go through one of at least a dozen unpatched security holes to gain control of the system in question. So as much as I can live with their security precautions of locking down certain software, I’d also like to see them actually patch the machines.

So, security is important, but let’s not make it so tight that people go to extremes to bypass it.

When Life hands you Lemons

You make lemonade! Right? Ok, but how?

Ok, this is the 21st Century, now we use mixes. That makes it even easier, right?

But, I’ve given this some thought, and like many procedures there’s not necessarily a right way to do it. That said, I may change the procedure I use.

Ok, so I use one of those little pouches that make a lemon-flavored drink. I’m hesitant to call it actual lemonade, but let’s go with it.

Typically my process is to take the container, fill a drinking glass, and if the container is then empty, or has only a little bit left in it, make more. (Obviously if there’s a lot left, I just put the container back in the refrigerator. 🙂)

So still pretty simple, right? Or is it?

Basically you put the powder in the container and then add water.

Or do you put the water in and then add the powder?

You may ask, “What difference does it make?”

Ultimately, it shouldn’t, in either case you end up with a lemon-flavored drink as your final product.

All along I’ve been going the route of putting the powder in first then adding the water. There was a rational reason for this: the turbulence of the water entering the container would help mix it and it would require less shaking. I thought this was pretty clever.

But then one night, as I was filling the container with water (it was sitting in the sink), I got distracted, and by the time I returned my attention to it, I had overfilled the container and water was flowing over the top. Or rather, somewhat diluted lemon-flavored drink was flowing over the top. I had no idea how long this had been going on, but I knew I had an over-filled container that needed a bit more liquid poured off before I could put it away. It also meant the lemon-flavored drink was going to be diluted by an unknown amount. That is less than optimal.

So the simple solution I figured was to change my procedure. Add the water first and then add the flavoring. That way if there was too much water in the container, I could just pour off the extra and then add the proper amount of powder and have an undiluted lemon-flavored drink.

That worked fine until one day, as I was pouring in the packet, it slipped through my fingers into the half-filled container. Now I had to find a way to fish it out. Ironically, the easiest way to do it was to overfill the container so the packet would float to the top. Of course, now I was back to diluted lemon-flavored drink. And who knows what was on the outside of the powder packet that was now inside the water.

Each procedure has its failure modes. Both, when successful, get me to the final product.

So, which one is better?

I put in the powder first and then put in the water. I could say I have a rational reason like preferring slightly diluted lemon-flavored drink over a possibly contaminated lemon-flavored drink from a dropped in packet.

But the truth is, it really doesn’t matter which order I do the steps in. Neither failure is completely fatal, and in fact they’re about equivalent in their seriousness.

Old habits die hard, so I stick with my original method.

But, the point is that even in a process as simple as making lemon-flavored drink, there’s more than one way to do it, and either way may be workable. Just make sure you can justify your reasoning.

Testing

This ties in with the concept of experimentation. Thomas Grohser related a story the other night of a case of “yeah, the database failed and we tried to do a restore and found out we couldn’t.”

Apparently their system could somehow make backups, but couldn’t restore them. BIG OOPS. (Apparently they managed to create an empty database and replay 4.5 years of transaction logs to recover their data. That’s impressive in its own right.)

This is not the first time I’ve worked with a client or heard of a company whose disaster recovery plans didn’t survive their first real test. It may sound obvious, but companies need to test their DR plans. I’m in fact working with a partner on a new business to help companies think about their DR plans. Note, we’re NOT writing or creating DR plans for companies; we’re going to focus on how companies actually go about implementing and testing their DR plans.

Fortunately, right now I’m working with a client that had an uncommon use case. They wanted a restore of the previous night’s backup to a different server every day.

They also wanted to log-ship the database in question to another location.

This wasn’t hard to implement.
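At its heart it’s just a scheduled job that restores last night’s full backup onto the second server. Something along these lines (the database name and file paths are made up for illustration; this is a sketch of the idea, not the client’s actual job):

-- Nightly restore-test sketch: prove last night's full backup actually restores.
RESTORE DATABASE WarehouseCopy
FROM DISK = N'\\backupserver\backups\Warehouse_full.bak'
WITH MOVE N'Warehouse' TO N'D:\Data\WarehouseCopy.mdf',
MOVE N'Warehouse_log' TO N'L:\Logs\WarehouseCopy_log.ldf',
REPLACE, -- overwrite yesterday's test copy
STATS = 10

If the restore fails, the job fails, and that failure itself is the signal you want.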

But what is very nice about this setup is that every 15 minutes we have a built-in, automatic test of their log backups. If for any reason log backups stop working, or a log gets corrupted, we’ll know in fairly short order.

And, with the database copy, we’ll know within a day if their backups fail.  They’re in a position where they’ll never find out 4.5 years later that their backups don’t work.

This client’s DR plan needs a lot of work; they actually have nothing formal written down. However, they know for a fact their data is safe. This is a huge improvement over companies that have a DR plan but have no idea if their data is safe.

Moral of the story: I’d rather know my data is safe and my DR plan needs work than have a DR plan but not have safe data.