Fail-safes

Dam it Jim, I’m a Doctor, not a civil engineer

I grew up near a small hydro-electric dam in CT. I was fascinated by it (and still am in many ways). One of the details I found interesting was that on top of this concrete structure they had what I later learned are often called flashboards. These were 2x8s (perhaps a bit wider) running the length of the top of the dam, held in place by wooden supports. The general idea was that they increased the pooling depth by 8″ or so, but in the event of a very heavy water flow or flood, they could be easily removed (in many cases removed simply by the force of the water itself). They safely provided more water, but were in fact designed to fail (i.e. give way) in a safe and predictable manner.

This is an important detail that some designers of systems don’t think about: how to fail. They spend so much time trying to PREVENT a failure that they don’t think about how the system will react in the EVENT of a failure. Properly designed systems assume that at some point failure is not only an option, it’s inevitable.

When I was first taught rigging for cave rescue, we were always taught “Have a mainline and a belay”. The assumption is that the system may fail, so we spent a lot of time learning how to design a good belay system. The thinking has changed a bit these days; often we’re as likely to have TWO “mainlines” and switch between them, but the general concept is still the same: in the event of a failure, EITHER line should be able to catch the load safely and allow us to recover. (Simply catching the fall but not being able to resume operations is insufficient.)

So, your systems. Do you think about failures and recovery?

Let me tell you about the one that prompted this post. Years ago, I built a log-shipping backup system for a client. It uses SSH and other tools to get the files from their office to the corporate datacenter. Because of the network setup, I can’t use the built-in SQL Server log-shipping copy commands.

But that’s not the real point. The real point is… “stuff happens”. Sometimes the network connection dies. Sometimes the copy hangs, or they reboot the server in the office in the middle of a copy, etc. Basically “things break”.

And there’s another problem I have NOT been able to fix, one that only started about 2 years ago (so for about 5 years it was not a problem): the SQL Server in the datacenter starts to develop a memory leak, applying the log files fails, and I start to get errors.

Now, I HATE error emails. When this system fails, I can easily get like 60 an hour (every database, 4 times an hour plus a few other error emails). That’s annoying.

AND it was costing the customer every time I had to go in and fix things.

So, on the receiving side I set up a job to restart SQL Server and Agent every 12 hours (if we ever go into production we’ll have to solve the memory leak, but at this point we’ve decided it’s such a low priority that it’s not worth the effort; and since it’s related to the log-shipping, which we’d be turning off if we ever failed over, it’s considered even less of an issue). This job comes in handy a bit later in the story.

Now, on the SENDING side, as I’ve said, sometimes the network would fail, they’d reboot in the middle of a copy, or something random would make the copy job get “stuck”. This meant that rather than simply failing, it would keep running, but not doing anything.

So, I eventually enabled a “deadman’s switch” in this job. If it runs for more than 12 hours, it will kill itself so that it can run normally again at the next scheduled time.
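For what it’s worth, here’s a minimal sketch of one way to build such a switch, as a watchdog check against msdb that stops the job if it has been running too long. The job name here is just a placeholder, and you’d run this from a job step or a separate job on whatever schedule makes sense for you:

-- Sketch only: stop the copy job if it has been running for more than 12 hours.
DECLARE @jobname sysname = N'LogShipping - Copy files';   -- placeholder name

IF EXISTS (
    SELECT 1
    FROM msdb.dbo.sysjobactivity ja
    JOIN msdb.dbo.sysjobs j ON j.job_id = ja.job_id
    WHERE j.name = @jobname
      AND ja.start_execution_date IS NOT NULL
      AND ja.stop_execution_date IS NULL                         -- still running
      AND ja.start_execution_date < DATEADD(hh, -12, GETDATE())  -- started over 12 hours ago
)
    EXEC msdb.dbo.sp_stop_job @job_name = @jobname;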

Now, here’s what often happens. The job will get stuck. I’ll start to get email alerts from the datacenter that it has been too long since logfiles have been applied. I’ll go into the office server, kill the job and then manually run it. Then I’ll go into the datacenter and make sure the jobs there are running. It works and doesn’t take long. But it takes time, and I have to charge the customer.

So, this weekend…

the job on the office server got stuck. So I decided to test my failsafes/deadman switches.

I turned off SQL Agent in the datacenter, knowing that later that night my “cycle” job would turn it back on. This was simply so I wouldn’t get flooded with emails.

And, I left the stuck job in the office as is. I wanted to confirm the deadman’s switch would kick in and kill it and then restart it.

Sure enough later that day, the log files started flowing to the datacenter as expected.

Then a few hours later the SQL Agent in the datacenter started up again and log-shipping picked up where it left off.

So, basically I had an end-to-end test showing that when something breaks, on either end, the system can recover without human intervention. That’s pretty reassuring. I like knowing it’s that robust.

Failures Happen

And in this case… I’ve tested the system and it can handle them. That lets me sleep better at night.

Can your systems handle failure robustly?

 

 

SQL Data Partners Podcast

I’ve been keeping mum about this for a few weeks, but I’ve been excited about it. A couple of months ago, Carlos L Chacon from SQL Data Partners reached out to me about the possibility of being interviewed for their podcast. I immediately said yes. I mean, hey, it’s free marketing, right?  More seriously, I said yes because when a member of my #SQLFamily asks for help or to help, my immediate response is to say yes.  And of course it sounded like fun.  And boy was I right!

What had apparently caught Carlos’s attention was my book: IT Disaster Response: Lessons Learned in the Field. (Quick, go order a copy now… that’s what Amazon Prime is for, right? I’ll wait.)

Ok, back? Great. Anyway, the book is sort of a mash-up (to use the common lingo these days) of my interests in IT and cave rescue and plane crashes. I try to combine the skills, lessons learned, and tools from one area and apply them to other areas. I’ve been told it’s a good read. I like to think so, but I’ll let you judge for yourself. Anyway, back to the podcast.

So we recorded the podcast back in January. Carlos and his partner Steve Stedman were on their end and I on mine. And I can tell you, it was a LOT of fun. You can (and should) listen to it here. I just re-listened to it myself to remind myself of what we covered. What I found remarkable was that as much as I was trying to tie it back to databases, Carlos and Steve seemed as interested, if not more so, in cave rescue itself. I was ok with that. I personally think we covered a lot of ground in the 30 or so minutes we talked. And it was great because this is exactly the sort of presentation, combined with my airplane crash one and others, that I’m looking to build into a full-day onsite consult.

One detail I had forgotten about in the podcast was the #SQLFamily questions at the end. I still think I’d love to fly because it’s cool, but teleportation would be useful too.

So, Carlos and Steve, a huge thank you for asking me to participate and for letting me ramble on about one of my interests. As I understand it, Ray Kim has a similar podcast with them coming up in the near future as well.

So the thought for the day is: think about how skills you learn elsewhere can be applied to your current responsibilities. It might surprise you, and you might do a better job.

 

 

 

Hours for the week

Like I say, I don’t generally post SQL-specific stuff because, well, there are so many blogs out there that do. But what the heck.

Had a problem the other day. I needed to return the hours worked per day over a date range for a specific employee, and if they worked no hours on a given day, return 0. So basically I had to deal with gaps.

There’s lots of solutions out there, this is mine:

Alter procedure GetEmployeeHoursByDate @startdate date, @enddate date, @userID varchar(25)
as

-- Usage: exec GetEmployeeHoursByDate '2018-01-07', '2018-01-13', 'gmoore'

-- Author: Greg D. Moore
-- Date: 2018-02-12
-- Version: 1.0

-- Get the totals for the days in question

set NOCOUNT on

-- First let's create a simple table that just has the range of dates we want
; WITH daterange AS (
    SELECT @startdate AS WorkDate
    UNION ALL
    SELECT DATEADD(dd, 1, WorkDate)
    FROM daterange s
    WHERE DATEADD(dd, 1, WorkDate) <= @enddate
)

select dr.WorkDate as WorkDate, coalesce(a.DailyHours, 0) as DailyHours
from
(
    -- Here we get the hours worked and sum them up for that person.
    select ph.WorkDate, sum(ph.Hours) as DailyHours
    from ProjectHours ph
    where ph.UserID = @userID
      and ph.WorkDate >= @startdate and ph.WorkDate <= @enddate
    group by ph.WorkDate
) as a
-- Now join our table of dates to our hours and put in 0 for dates we don't have hours for
right outer join daterange dr on dr.WorkDate = a.WorkDate
order by WorkDate

GO
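One caveat worth noting: a recursive CTE like the one above stops at 100 recursions by default, so for a date range much longer than three months the final SELECT would need a MAXRECURSION hint. A quick standalone illustration:

-- Without the hint this errors out past 100 recursions; with it, a full year of dates works.
;WITH daterange AS (
    SELECT CAST('2018-01-01' AS date) AS WorkDate
    UNION ALL
    SELECT DATEADD(dd, 1, WorkDate) FROM daterange WHERE WorkDate < '2018-12-31'
)
SELECT COUNT(*) AS DaysInRange FROM daterange
OPTION (MAXRECURSION 366);   -- or 0 for no limit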

There’s probably better ways, but this worked for me. What’s your solution?

The Basics

Last night at our local SQL Server User Group meeting we had the pleasure of hearing Deborah Melkin speak. I first met Deborah at our Albany SQL Saturday event last year, where she gave “Back to the Basics: T-SQL 101”. Because of the title I couldn’t help but attend. It wasn’t the 101 part by itself that caught my eye; it was the “Back to the Basics”. While geared to beginners, I thought the idea of going back to the basics of something I take for granted was a great idea. She was also a first-time speaker, so I’ll admit, I was curious how she would do.

It was well worth my time. While I’d say most of it was review, I was reminded of a thing or two I had forgotten and was taught a thing or two I didn’t know. But also, very importantly, she had a great ability to break the subject down into a clearly understandable talk. This is actually harder than many people realize. I’ve heard some brilliant speakers who simply can’t convey their message, especially on basic items of knowledge, in a way that beginners can understand.

So, after the talk last summer, I cornered her at the Speaker’s Dinner and insisted she come up with a follow-up, a 201 talk if you will. Last night she obliged, with “Beyond the Select”. What struck me about it was that, other than a great tip in SSMS 17.4 (highlighting a table alias will show you what the base table is), again nothing was really new to me. She talked about UDFs; I’ve attended entire sessions on UDFs. She talked about CTEs; I’ve read extensively about them. She discussed windowing functions; we’ve had one of our presenters present on them locally. The same was true of some of the other items she brought up.

Now, this is NOT a slight at all, but really a compliment. Both as an attendee and as the guy in charge of selecting speakers, it was great to have a broad-reaching topic. Rather than a deep dive, this was a bit of everything that gave the audience a chance to learn a bit of everything if they hadn’t seen it before (and based on the reactions and feedback, I know many learned new stuff) and to compare different methods of doing things. For example, what’s the advantage of a CTE vs. a derived table vs. a temp table? Well, the answer is of course the DBA’s favorite answer: “it depends”.
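To make that concrete, here’s a trivial sketch of the same query written all three ways (using sys.objects just so it runs anywhere; which one is “better” really does depend):

-- CTE
;WITH ObjCounts AS (
    SELECT type, COUNT(*) AS cnt FROM sys.objects GROUP BY type
)
SELECT * FROM ObjCounts WHERE cnt > 1;

-- Derived table
SELECT * FROM (
    SELECT type, COUNT(*) AS cnt FROM sys.objects GROUP BY type
) AS ObjCounts WHERE cnt > 1;

-- Temp table
SELECT type, COUNT(*) AS cnt INTO #ObjCounts FROM sys.objects GROUP BY type;
SELECT * FROM #ObjCounts WHERE cnt > 1;
DROP TABLE #ObjCounts;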

As a DBA with decades of experience and as an organizer, it’s tempting to have a Bob Ward type talk every month. I enjoyed his talk last month. But, honestly, sometimes we need to go back and review the basics. We’ll probably learn something new or relearn something we had forgotten. And with talks like Deborah’s, we get to see the big picture, which is also very valuable.

So my final thought this week is that in any subject, not only should we be doing the deep dives that extend our knowledge, but we should review our basics. As DBAs, we write a select every day. We take it for granted, but how many people can really tell you clearly its order of operations? Review the basics once in a while. You may learn something.
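As a quick refresher (roughly speaking, not a full spec), the logical processing order is FROM (and the JOINs), then WHERE, then GROUP BY, then HAVING, then the SELECT list, and finally ORDER BY. That explains little quirks like this:

-- This works: the alias is visible to ORDER BY, which is evaluated after the SELECT list...
SELECT name, create_date AS created
FROM sys.objects
ORDER BY created;

-- ...but this fails, because WHERE is evaluated before the SELECT list:
-- SELECT name, create_date AS created
-- FROM sys.objects
-- WHERE created > '2018-01-01';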

And that’s why I selected this topic for this week’s blog.

Time, what is it?

Time is a funny thing. We all experience the passage of it, but fundamentally, it’s arbitrary. In the SI system of units, the second is defined as “the duration of 9,192,631,770 periods of the radiation corresponding to the transition between the two hyperfine levels of the ground state of the caesium 133 atom”.

Sure. That’s helpful. To someone. Not to me.

It’s now 2018 AD. Or is that 2018 CE? Or is it 4714, or 1438, or perhaps 5778 or perhaps it’s 2010. Or perhaps it’s year 1?

We can’t agree on what year it is. But we can certainly at least agree on some semblance of the date, correct? Not really. If I’m using Julian dates, my day is going to be 12 hours off of yours.

But, since I’m writing this in the Northeastern US, let’s agree we’ll work with the Common Era calendar and that it’s January 9th, 2018. That should work. Everything should be simple, right? For many people, it probably is. For us programmers, it isn’t.

Let’s start with a simple sentence, “Run the job at midnight on Saturday.” Can we agree on when to run it? Most of us would probably run it just as Saturday is beginning. But some would argue that it’s just as valid to run it at the end of Saturday. I have in fact seen both arguments. This is where our common use of the language can fall apart. Fortunately though, we can remedy the issue. The military, for example, for whom precision timing can be critical, would say something like, “Run the job at 00:00 on Saturday.” Now at least we’re in agreement.

But even then, we love to muck things up. Let me be a pedant for a second and point out that there is technically no such thing as 12 PM or 12 AM. “What?”, you might say. “Of course there is!” Nope, sorry. Going back to the Latin, PM and AM mean Post-Meridiem and Ante-Meridiem, i.e. after and before the middle of the day. So technically, what we call 12 Noon can’t be either (since it’s neither before nor after the middle of the day; it IS the middle of the day). And equally, one can argue that 12 Midnight is neither. And yet, you hear folks use the terms anyway.

Almost always, but not always, 12 PM is associated with noon and 12 AM with midnight. This follows because 12:00:01 IS after the middle of the day, so it makes sense to call the second before that PM as well. Yet, that said, I have seen signs multiple times claiming that a place is open for dinner from, say, 7:00 PM-12:00 PM, or open “All Day from 7:00 AM-12:00 PM”, when they’re clearly open after lunch and even after dinner. (And by the way, does it strike you as strange that we count 12 AM, 1 AM, 2 AM and so on up to 11 AM… and then start over at 12 PM, 1 PM… i.e. the 12 is considered part of the next set of hours even though we immediately move back to 1 after it? It’s one reason I really prefer a 24-hour clock.)

So I get pedantic when folks say 12 PM or 12 AM and ask them to clarify. This is especially important when it comes to scheduling tasks in, say, SQL Server.
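(For what it’s worth, here’s a quick check of how SQL Server itself reads the ambiguous forms, assuming default US-English settings:)

-- SQL Server treats 12:00 AM as midnight (00:00) and 12:00 PM as noon:
SELECT CAST('12:00 AM' AS time) AS TwelveAM,   -- 00:00:00
       CAST('12:00 PM' AS time) AS TwelvePM;   -- 12:00:00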

BUT, we have yet another problem. I’m in the Eastern Time Zone of the US. You might be in the Pacific Time Zone of the US.  I want to hold a meeting at 3:00 PM.  Is that my time or your time?  Most modern scheduling software will correctly put this into my calendar as a 3:00 PM meeting and into yours as a 12 Noon meeting.
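(If you’re storing such times in SQL Server, one way to handle the conversion is AT TIME ZONE, available in SQL Server 2016 and later; a minimal sketch:)

-- Treat the stored value as Eastern, then display it as Pacific.
SELECT CAST('2018-01-09 15:00' AS datetime2)
       AT TIME ZONE 'Eastern Standard Time'
       AT TIME ZONE 'Pacific Standard Time' AS MeetingInPacific;   -- 2018-01-09 12:00:00 -08:00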

Yet that’s still not enough. I was reminded of this a few months ago as I started to put my PASS Summit schedule into my calendar. I put the events in while I was in the Eastern time zone, forgetting that when I got to Seattle, my calendar would “nicely” shift things for me by 3 hours. This meant that a session I had entered for 9:00 AM showed up in Seattle as if I wanted to attend it at 6:00 AM. Fortunately, I discovered the software DOES let me specify what time zone the appointment is actually in when I set it up.

And one more aside. If we schedule a meeting at 3:00 PM (and we both agree on what that means), what happens if I say, “Oh, I have a conflict, can we move that back an hour?” Are we moving it backwards in time to 2:00 PM, or backwards in the sense of “further away” and mean 4:00 PM?

So, we can agree that hours are tough. But it can get worse.  Here’s a question for you. When was George Washington born? My father was always proud that he shared George Washington’s Birthday. But did he?  Was it February 22nd 1732? What if I told you he was born February 11th 1731? Both are technically correct (which is the best kind of correct.)

We all know about leap years. We have them because it doesn’t take exactly 365 days to orbit the sun; it takes closer to 365.25 days. But even THAT isn’t really accurate enough. When Julius Caesar reformed the Roman calendar, he added the concept of leap years. He knew that without them, the calendar would slowly get out of sync with the seasons. Before you knew it, July would start becoming the middle of winter, and who would want that! (Ok, technically it would take a few hundred years, but you get the point.)

The problem was, a leap year every 4 years still wasn’t enough. By 1582, despite leap years, the calendar had “slipped” by 10 days. So Pope Gregory instituted the calendar most of the world now uses, which added a more complex rule: a leap year every 4 years, except that years divisible by 100 are not leap years, unless they’re also divisible by 400, in which case they are leap years.
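In code form, the rule looks something like this (a sketch; plug any year into @year):

DECLARE @year int = 2000;
SELECT CASE
           WHEN (@year % 4 = 0 AND @year % 100 <> 0) OR @year % 400 = 0
           THEN 'Leap year'
           ELSE 'Not a leap year'
       END AS Verdict;   -- 2000 is a leap year; 1900 and 2018 are not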

Now, Pope Gregory had a lot of influence over parts of Europe, so predominantly Catholic countries adopted the changes almost immediately, in 1582. But understandably, the more Protestant countries were a bit slower to adopt. Great Britain and the Colonies didn’t adopt it until 1752, after George Washington’s birth. So, he was born on either day depending on which calendar you’re using, and by then things had slipped by 11 days. (The change in years is a related but different issue that had to do with what was considered the first day of the year. It wasn’t always January 1st, you know.)

Now, when Y2K rolled around, I have to admit, I started to wonder, and still do, what would have happened if we had had computers in 1899. Would they have gotten the lack of a leap year in 1900 correct? My guess is, “probably not”. And in fact, ironically enough, I recall at least one software package (I can’t recall the name) that was well prepared for 2100 and 2200 and 2300, but was NOT prepared for the year 2000 and in fact skipped the leap day that year. (Oh, by the way, to really blow your mind: depending on the calendaring system, the leap day is actually February 24th. Or perhaps a more accurate way of saying it is that the leap day is inserted after the 53rd day of the year.)

In conclusion, let’s just say time has been on my mind a lot lately, partly because I’ve orbited the Sun just over 50 times now and partly because it’s a new year. And as a developer, handling dates can at times be far tougher than it looks at first pass: you have to deal with midnight boundaries, time zones, leap years (or the lack of them) and even wholesale shifts in calendars (and this is ignoring completely different calendars such as the numerous lunar-based ones that may or may not have entire leap months!). Oh, and I completely left out any mention of leap seconds! Yes, you can have December 31st 23:59:60 as a valid time.

But when you think about it, other than the fact that time keeps slippin’ (slippin’, slippin’) into the future, all measurement is really arbitrary.  We pick a 0 point and go with it. We just have to agree at some point (in time, yes, the irony is not lost on me) what that 0 point is and how we’ll measure from there, and we’re all good.

Or are we? I’ll leave you with this one final thought. Far into the future, possibly trillions of years, when the Universe has expanded so much that the distance between particles is far enough that none can interact, essentially nothing will be changing. Can you even have time if you have no clock against which to measure it? Will time eventually run out?

I could go on and on about time, but, I seem to have run out of it.

 

Maps

This is actually about a bit more than simply maps. It’s really about what a map is. It’s an abstraction. This may sound obvious, but it’s an important concept.

“I have a map of the United States… Actual size. It says, ‘Scale: 1 mile = 1 mile.’ I spent last summer folding it. I hardly ever unroll it. People ask me where I live, and I say, ‘E6.'” – Steven Wright

Forget the impracticality of such a map; it makes a fine joke, but it’s still just a map. We don’t live on maps. Piaget and others have done a lot of research into the development of the minds of young children, and it’s fascinating. At first, I think one can argue they live in a very solipsistic world. Nothing exists except what they can see or touch. Soon, though, the concept of object permanence comes into being. But they’re still not quite at the concept of true abstraction. At very young ages, if you hold up a picture of a room to a small enough child, they’ll try to grab stuff out of the picture. And why not? They can grab things in the physical world. Soon they start to realize the picture is just that, not the actual room, but a flattened, portable version of it.

Around age 5 (I’m going on memory here) we start to ask children to draw maps. It might be a map of their room, or a map of how to get to school. They’re crude and often, to our adult minds, wildly inaccurate, but they’re starting to achieve a truer concept of abstraction. Their maps aren’t even miniaturized pictures of the real world; they’re completely new, abstract designs of the world.

Eventually, around 8th grade or so, they’re probably introduced to Algebra. Up until then much of their math might have been things like 8 * ___ = 56. They’ll dutifully fill in the 7. It’s a blank, but it’s not really abstract. Then they start to solve problems like “56/x = 8, solve for x”.  Again they’ll write down 7, but now they’ve learned that a letter can represent a number and be used that way later on.

What’s interesting about the above is, many people never stop to realize that * or / or + or many other mathematical symbols are really just abstractions. Yes, numbers are in a sense an abstraction. There’s nothing inherent about the symbol 5 that means it MUST represent five of an object. It’s just a convenience (and it’s shorter than writing ///// hash marks). But the mathematical symbols are even more of an abstraction. If I wrote 8♒︎__ = 56, most of us could probably figure out what ♒︎ represents there.

Anyway, that aside, why am I writing about maps and ultimately abstraction? Because, it is so vitally important to my job and perhaps the jobs of most of my readers and it’s been on my mind lately.

I’ve been working on an SSIS package for a client. The original spec was “Import these 20 or so (out of 240) CSV files out of this ZIP file.”  I thought about it and after teaching myself a bit more about SSIS (I don’t usually do much in it) I realized I could create a container that did most of the work. And I could assign variables to the container. The variables include things like what database connection to use and where the files might be located. And, most importantly, a variable for the NAME of the file/table (by using a table with the same basename as the file, I can use a single variable and line stuff up nicely.) And part of what I had to (re)-learn was the scope of variables within SSIS packages and how to expose the ones I wanted as parameters.

Now, honestly, it probably took me more time to set up this level of abstraction and make it workable than if I had simply hard-coded everything for the 20 or so tables.

BUT, I’ve got enough experience to guess at what was coming next; and I’m going to guess you know what’s coming next.

“Hey, this is great, but you know what, we think we want all 240 files in that ZIP file loaded. Can you do that?”

“Sure!” And at this point, the problem isn’t the amount of work, it’s the tediousness of it: “create the table schema, create a new file connector, copy the SSIS container, update the variable for the table, go in and reconnect the file connector to the new table, test”. It’s a LOT less work than if I had to hardcode everything by hand. The container name is even automatically updated based on the table variable. This has an unintended, positive side effect: if for some reason I start to duplicate a container (say I forgot I had already imported table Customer_Val), the container name won’t update from the variable, since you can’t have duplicate container names. I caught this very problem the other day, so it turned out to be a very useful, if unintended, extra benefit!

And of course, once I move the final code to their UAT environment and then the Prod environment, it’s simply a matter of updating 3-4 variables in the deploy (for the database server, the file location and where to send emails) and I’m done.

By taking the time to abstract out as much as I could, I’ve made my life easier. Up front it was definitely a bit more work, but in the long run, has saved me a lot of effort.

Nothing revolutionary about the concept to us developers. BUT, stop and try to think of an environment where variables or such layers of abstraction don’t exist. Such environments exist, and there are some valid reasons for them, but ultimately, being able to abstract stuff can make our lives MUCH easier and our code that much more powerful.

And it appears that abstraction at this level is perhaps the ONE real intellectual advantage we have over other intelligence on this planet. And I’m not even so sure about that!

So I’ll end in pointing out that even the written word is really just a level of abstraction. So this column is about itself. Or something like that.

 

SQL Saturday DC 2017 Follow-up

Last week I wrote about heading towards SQL Saturday DC. This week I figured I’d write a follow-up.

I want to start by saying thanks to everyone who organized and set things up and ran it. I’ve only been on the periphery of working the Albany SQL Saturday, but I have an inkling of what it takes to run an event like this. There’s so much work, much of it obvious, but also a myriad of small details, many that never get seen or noticed.

Since I like to talk about “thinking” in this blog, I’m going to mention one such detail that I’m sure no one gave much thought to, but it was there. Chris Bell is one of the principal organizers of the DC event (but far from the only one who works it). I’m going to give away one of his secrets (and I hope he’s ok with that!)

So one of the decisions an event organizer has to make is how long the sessions should be. Generally they run 60 minutes or 75 minutes. (Occasionally you might see some that run 90 minutes.) 15 minutes may not seem like much, but over 4-5 sessions, it can mean the difference between adding or removing an entire session during the course of the day.

Now, as a speaker, I can’t say I have a real preference. A 60-minute session often feels like I have to rush a bit, but a 75-minute session tends to mean I may have to add content, OR perhaps end a bit early, especially if there aren’t a lot of questions at the end.

And that last detail is a critical detail that Chris took advantage of.

See, one of the logistical issues any major event has to worry about is feeding people. At a large event like PASS Summit, they have an entire, huge room dedicated just to lunch, with rows and rows of serving tables. They can, if they have to, get everyone through the lunch line and served in a short period of time. However, at smaller events with a smaller budget, especially 1-day events, it makes little sense to rent a space that large.

So, if you have limited space and limited time, how do you handle lunch, and especially trying to get your speakers and staff through first (so they can be prepared for their next session)?

You take advantage of the fact that every session before lunch isn’t necessarily going to use the full 75 minutes! You start lunch, say, 15 minutes before the END of the pre-lunch sessions. Some sessions will end 15 minutes early, so those folks can get in line right away. One or two might end 10 minutes early, and those folks can get in line just as the first group is finishing getting their lunches. And so on.

So, you end up serving everyone in a reasonable amount of time, but you don’t have huge long lines all of a sudden. You have a much more reasonable distribution of people across your available time.

So, that’s something to think about when scheduling events. Oh, by the way, if you’re doing massive maintenance or data loads on your SQL Server database, you might want to see if you can spread out your I/O in a similar way. (Hmm, I had to get SOME SQL Server content in here, right?)

So, remember, if you think outside the box, sometimes you can get more done than otherwise!

P.S. I couldn’t finish this blog post without a huge shout-out to the @SQLSpouse and stealth Princess Gigi Bell and say thanks again for the card!