Fail-safes

Dam it Jim, I’m a Doctor, not a civil engineer

I grew up near a small hydro-electric dam in CT. I was fascinated by it (and still am in many ways). One of the details I found interesting was that on top of this concrete structure they had what I later learned are often called flashboards. These were 2x8s (perhaps a bit wider) running the length of the top of the dam, held in place by wooden supports. The general idea was that they increased the pooling depth by 8″ or so, but in the event of a very heavy water flow or flood, they could be easily removed (in many cases removed simply by the force of the water itself). They safely provided more water, but were in fact designed to fail (i.e. give way) in a safe and predictable manner.

This is an important detail that some designers of systems often don’t think about: how to fail. They spend so much time trying to PREVENT a failure that they don’t think about how the system will react in the EVENT of a failure. Properly designed systems assume that at some point failure is not only an option, it’s inevitable.

When I was first taught rigging for cave rescue, we were always taught “Have a mainline and a belay”. The assumption is that the system may fail. So, we spent a lot of time learning how to design a good belay system. The thinking has changed a bit these days; often we’re as likely to have TWO “mainlines” and switch between them, but the general concept is still the same: in the event of a failure, EITHER line should be able to catch the load safely and allow the operation to recover. (i.e. simply catching the fall but not being able to resume operations is insufficient.)

So, your systems. Do you think about failures and recovery?

Let me tell you about the one that prompted this post. Years ago, I built a log-shipping backup system for a client. It uses SSH and other tools to get the files from their office to the corporate datacenter. Because of the network setup, I can’t use the built-in SQL Server log-shipping copy commands.
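Just to make the idea concrete, here’s a minimal sketch of what a copy step like that could look like, in Python. The directory, host name, and the use of scp over SSH are my assumptions purely for illustration; the actual job uses its own tooling and scheduling.

```python
# Hypothetical sketch of the "copy" half of a custom log-shipping setup:
# push any new transaction-log backups from the office server to the
# datacenter over SSH (scp). Paths and host names are made up.
import subprocess
from pathlib import Path

LOCAL_BACKUP_DIR = Path(r"D:\LogShipping\Outbound")   # where the .trn backups land locally
REMOTE_TARGET = "backupuser@datacenter.example.com:/sqlbackups/incoming/"

def ship_log_backups() -> None:
    for trn in sorted(LOCAL_BACKUP_DIR.glob("*.trn")):
        # scp returns non-zero on failure; check=True raises so the job step fails visibly
        subprocess.run(["scp", str(trn), REMOTE_TARGET], check=True)
        # Mark the file as shipped so the next run doesn't resend it
        trn.rename(trn.with_name(trn.name + ".sent"))

if __name__ == "__main__":
    ship_log_backups()
```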

But that’s not the real point. The real point is… “stuff happens”. Sometimes the network connection dies. Sometimes the copy hangs, or they reboot the server in the office in the middle of a copy, etc. Basically “things break”.

And, there’s another problem I have NOT been able to fix, one that only started about 2 years ago (so for about 5 years it was not a problem). Basically, the SQL Server in the datacenter starts to have a memory leak, applying the log files fails, and I start to get errors.

Now, I HATE error emails. When this system fails, I can easily get like 60 an hour (every database, 4 times an hour plus a few other error emails). That’s annoying.

AND it was costing the customer every time I had to go in and fix things.

So, on the receiving side I set up a job to restart SQL Server and SQL Agent every 12 hours. (If we ever go into production we’ll have to solve the memory leak, but at this time we’ve decided it’s such a low priority as to not bother; and since it’s related to the log-shipping, and a failover would mean turning off log-shipping anyway, it’s considered even less of an issue.) This job comes in handy a bit later in the story.
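A rough sketch of that “cycle” job, again in Python and again with assumptions: I’m guessing a default SQL Server instance on Windows (service names MSSQLSERVER and SQLSERVERAGENT) and that something like Task Scheduler or an Agent job fires it every 12 hours.

```python
# Hypothetical sketch of the receiving-side "cycle" job: bounce SQL Server
# and SQL Agent to clear the memory leak. Assumes a default instance on
# Windows; schedule it every 12 hours via Task Scheduler or similar.
import subprocess

def restart_sql_services() -> None:
    # Stopping MSSQLSERVER with /y also stops dependent services such as the Agent.
    subprocess.run(["net", "stop", "MSSQLSERVER", "/y"], check=True)
    subprocess.run(["net", "start", "MSSQLSERVER"], check=True)
    subprocess.run(["net", "start", "SQLSERVERAGENT"], check=True)

if __name__ == "__main__":
    restart_sql_services()
```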

Now, on the SENDING side, as I’ve said, sometimes the network would fail, they’d reboot in the middle of a copy, or something random would make the copy job get “stuck”. This meant that rather than simply failing, the job would keep running but not actually do anything.

So, I eventually added a “deadman’s switch” to this job: if it runs for more than 12 hours, it kills itself so that it can run normally again at the next scheduled time.
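In spirit, that switch is just a hard ceiling on the copy step’s runtime. Here’s a minimal Python sketch of the idea; the script name it launches is hypothetical, and the real job implements this differently.

```python
# Hypothetical sketch of the deadman's switch: give the copy step a hard
# 12-hour ceiling so a hung transfer kills itself instead of blocking the
# next scheduled run. The child script name is made up.
import subprocess

TWELVE_HOURS = 12 * 60 * 60  # seconds

def run_copy_with_deadline() -> None:
    try:
        # timeout= kills the child process and raises TimeoutExpired if it runs too long
        subprocess.run(["python", "ship_log_backups.py"],
                       timeout=TWELVE_HOURS, check=True)
    except subprocess.TimeoutExpired:
        # The stuck copy has been killed; the next scheduled run starts clean.
        print("Copy exceeded 12 hours; killed so the next run can proceed.")

if __name__ == "__main__":
    run_copy_with_deadline()
```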

Now, here’s what often happens. The job will get stuck. I’ll start to get email alerts from the datacenter that it has been too long since logfiles have been applied. I’ll go into the office server, kill the job, and then manually run it. Then I’ll go into the datacenter and make sure the jobs there are running. It works and doesn’t take long. But it takes time, and I have to charge the customer.

So, this weekend…

the job on the office server got stuck. So I decided to test my fail-safes/deadman’s switches.

I turned off SQL Agent in the datacenter, knowing that later that night my “cycle” job would turn it back on. This was simply so I wouldn’t get flooded with emails.

And, I left the stuck job in the office as is. I wanted to confirm the deadman’s switch would kick in and kill it and then restart it.

Sure enough later that day, the log files started flowing to the datacenter as expected.

Then a few hours later the SQL Agent in the datacenter started up again and log-shipping picked up where it left off.

So, basically I had an end-to-end test showing that when something breaks, on either end, the system can recover without human intervention. That’s pretty reassuring. I like knowing it’s that robust.

Failures Happen

And in this case… I’ve tested the system and it can handle them. That lets me sleep better at night.

Can your systems handle failure robustly?


SQL Data Partners Podcast

I’ve been keeping mum about this for a few weeks, but I’ve been excited about it. A couple of months ago, Carlos L Chacon from SQL Data Partners reached out to me about the possibility of being interviewed for their podcast. I immediately said yes. I mean, hey, it’s free marketing, right? More seriously, I said yes because when a member of my #SQLFamily asks for help, or asks me to help, my immediate response is to say yes. And of course it sounded like fun. And boy was I right!

What had apparently caught Carlos’s attention was my book: IT Disaster Response: Lessons Learned in the Field. (Quick, go order a copy now… that’s what Amazon Prime is for, right? I’ll wait.)

Ok, back? Great. Anyway, the book is sort of a mash-up (to use the common lingo these days) of my interests in IT and cave rescue and plane crashes. I try to combine the skills, lessons learned, and tools from one area and apply them to other areas. I’ve been told it’s a good read. I like to think so, but I’ll let you judge for yourself. Anyway, back to the podcast.

So we recorded the podcast back in January. Carlos and his partner Steve Stedman were on their end and I on mine. And I can tell you, it was a LOT of fun. You can (and should) listen to it here. I just re-listened to it myself to remind myself of what we covered. What I found remarkable was that as much as I was trying to tie it back to databases, Carlos and Steve seemed as interested, if not more so, in cave rescue itself. I was ok with that. I personally think we covered a lot of ground in the 30 or so minutes we talked. And it was great because this is exactly the sort of presentation that, combined with my airplane crash one and others, I’m looking to build into a full-day onsite consult.

One detail I had forgotten about in the podcast was the #SQLFamily questions at the end. I still think I’d love to fly because it’s cool, but teleportation would be useful too.

So, Carlos and Steve, a huge thank you for asking me to participate and for letting me ramble on about one of my interests. As I understand it, Ray Kim has a similar podcast with them coming up in the near future as well.

So the thought for the day is: think about how the skills you learn elsewhere can be applied to your current responsibilities. It might surprise you, and you might do a better job.


What a Lucky Man He Was….

Being a child of the 60s, my musical tastes run the gamut from The Doors to Rachel Platten. In this case, the title of course comes from the ELP song.

Anyway, today’s post is a bit more reflective than some. Since yesterday I’ve been fighting what should be simple code. Years back I wrote a simple website to handle student information for the National Cave Rescue Commission (NCRC). The previous database manager had started with a database designed back in the 80s. It was certainly NOT web friendly. So after some work I decided it was time to make it a bit more accessible to other folks. Fortunately ASP.NET made much of the work fairly easy. It did what I wanted it to do. But now, I’m struggling to figure out how to get and save profile information along with membership info. Long story short, due to a design decision years back, this isn’t as automatic and easy as I’d like. So, I’ve been banging my head against the keyboard quite a bit over the last 24 hours. It’s quite frustrating, actually.

So, why do I consider myself lucky? Because I can take the time to work on this. Through years of hard work, education and honestly a bit of luck, I’m at the point where my wife and I can provide for our family to live a comfortable life and I can get away with working less than a full 40 hours a week. This is important to me as I get older. Quality of life becomes important.

I’ve talked about my involvement in cave rescue in the past and part of that is wearing of multiple hats. Some of which take more work than others.

I am for example:

  • Co-captain of the Albany-Schoharie Cave Rescue Team – This is VERY sporadic, really sort of unofficial, and some years we will have no rescues at all locally.
  • I’m an Instructor with the NCRC – This means that generally a week plus a few days every year, I take time out to travel, at my own expense, to a different part of the country and teach students the skills required to be effective in a cave rescue. For this, I get satisfaction. I don’t get paid and, like I say, I travel at my own expense. Locally I generally take a weekend or two a year to teach a weekend course.
  • I’m a Regional Coordinator with the NCRC – Among other things, this means I again travel at my own expense once a year, generally to Georgia, to meet with my fellow coordinators so we can conduct the business of the NCRC. This may include approving curriculum created by others, reviewing budgets, and other business.
  • Finally, I’m the Database Coordinator. It’s really a bit more of an IT Coordinator role, but the title is what it is. This means not only do I develop the database and the front end, I’m also responsible for inputting data and running reports.

As you can see, this time adds up quickly. I’d say easily, in terms of total time, I dedicate a minimum of two weeks a year to the NCRC. But it’s worth it. I can literally point at rescues and say, “those people are alive because of the work I and others do”. Sometimes it’s direct, like when I’m actually on a rescue; sometimes it’s indirect, when I know it’s a result of the training I and others have provided. But it’s worth it. I honestly can claim I work with some of the best people in the world. There are people here that I would literally put my life on the line for, in part because I know they’d do the same.

So, I’m lucky. I’m lucky that I can invest so much of my time in something I enjoy and love so much.  I’ll figure out this code and I’ll keep contributing, because it’s worth it, and because I’m lucky enough that I can.

How are you lucky?