“It’s Just a Simple Change”

How often have we heard those words? Or used them ourselves?

“Oh, this is just a simple change, it won’t break a thing.” And then all hell breaks loose.

Yet we also hear the reverse at times: “This is pretty complex, I’ll be surprised if it works the first time, or if it doesn’t break something.” And yet nothing bad seems to happen.

We may observe this, but we don’t necessarily stop to think about the why. I’ve seen this happen a lot in IT, but honestly, I’ve seen it happen elsewhere too, and it often holds true when we read about accidents in areas such as caving.

I argue that in this case the perception is often true. Let me put in one caveat. There’s definitely a bias in our memory: we don’t recall all the times simple changes didn’t break anything, but the times they do, they really stand out.

The truth is, whenever we deal with complex systems, even simple changes aren’t so simple. But we assume they are and then are surprised when they have side effects. “Oh, updating that path here won’t break anything. I only call it in one place, and I’ll update that.” And you’re good. But what you didn’t realize was that another developer liked your script, so they made a copy and are using it for their own purposes, and now their code breaks because of the new path. So your simple change isn’t so simple.

Contrast that with the complex change. I’m in the middle of refactoring a stored procedure. It’s complex. I suspect it’ll break something in production. But, honestly, it probably won’t. Not because I’m an awesome T-SQL developer, but because of our paranoia, we’ll be testing this in UAT quite a bit. In other words, our paranoia drives our testing to a higher level.

I think it behooves us to treat even simple changes with more respect than we do and test them.

In the world of caving we use something called SRT – Single Rope Technique. This is the method we use for ascending and descending a rope. When ascending, if you put your gear on wrong at the bottom, generally there’s no real risk other than possible embarrassment. After all, you’re standing on the ground. But obviously at the top, it’s critical to put your equipment on correctly, lest your first step be your last. Similarly, we practice something known as a changeover: changing from ascending to descending, or descending to ascending, while on rope. When changing from climbing to descending you want to make sure you do it correctly lest you find yourself descending at 9.8 m/s^2. To prevent accidents, we ingrain in students “load and test your descent device before removing your other attachment point.” Basically, while you’re still secured to something at the top, or to your ascending devices if you’re partway up the rope, put your entire weight on your descent device and lower yourself 1-2″. If you succeed, great, then you can detach yourself from whatever you are attached to at the top, or remove your ascending devices. If somehow you’ve screwed something up and the descent device comes off the top or otherwise fails, you’ve got a backup.

Now, I will interject, getting on rope at the top of a pit, or a changeover is something an experienced caver will have done possibly 100s if not 1000s of times. It’s “a simple change”. Yet we still do the test because a single failure can be fatal. And I have in fact seen a person fail to properly test their descent device. And moreover, this wasn’t in a cave, or other dark or cramped space. It was in broad daylight on the edge of the RPI Student Union! This was about as simple as it could get! Fortunately he heard it start to fail and grabbed the concrete railing for dear life. In this particular case a failure most likely would not have been fatal, but would have caused serious injury.

So, despite having gotten on rope 100s of times myself, I ALWAYS test. It’s a simple change. But the test is also simple and there’s no reason to skip it.

The moral of the story: even your simple changes should be tested, lest you find they’re not so simple, or their failures aren’t so minor.

4/20

I was going to start this post by making a crack about getting any cracks about references to 420 out of the way. But then I realized they’re actually apropos of the intent of this post.

Yes, often when folks think of the number 420, the references to marijuana jump out. Not a habit I’ve ever had any interest in, but I’ve been around it enough to feel its effects and I guess I can understand why others might partake. Growing up in the 70s and 80s I was routinely offered it but always declined due to lack of interest. That said, one thing that I never really dwelt on much was what would happen if I got caught with it. My skin color mattered.

Three events though shaped 4/20/21 for me.

I happened to reread (I had come across it earlier) a post by Eva Kor on Quora. Eva Kor was a twin who survived Josef Mengele’s atrocities and spent much of her life talking about them. She was a living witness to the history of the Holocaust, an event we must never forget. Sadly she is gone now, but her writings and voice live on.

4/20 also happens to be the birthday of George Takei. I recall growing up watching him in reruns of the original Star Trek, playing originally a physicist on the Enterprise, but really best known as the ship’s helmsman. To quote Spock, Sulu “is at heart a swashbuckler out of the 18th century”. But I later learned he was also instrumental in bringing attention to a dark period of our own US history during WWII, the internment of US citizens of Japanese heritage. He is, at this writing, still a living witness to those dark days. Unfortunately, the truth is that time will eventually silence his great voice. But that does not mean we can be allowed to forget what the US did to its own citizens.

And finally, of course, 4/20/21 was the reading of the verdict in the George Floyd murder case. Guilty on all three counts. George Floyd’s life was sadly ended with the words “I can’t breathe.” He can’t speak for himself. But fortunately, due to cell phone cameras, and the work of the prosecution, the jury could speak for accountability and hold his murderer responsible.

While the murderer will be held accountable, it will not change the tragedy that such an event should never have happened. There are those that will still argue, “well if he hadn’t resisted arrest…” ignoring the idea that perhaps the initial response, while legal, probably should have been handled very differently. Dr. Mengele’s atrocities were considered legal, but that didn’t make them right by any moral compass I am comfortable with. The Supreme Court in Korematsu v. United States held that the government could force Korematsu to be detained because of his heritage. In the case of George Floyd, the defense argued a reasonable officer would do what George Floyd’s murderer did. The jury rejected that argument. Thankfully. But we know that all too often that argument has held sway. And, honestly, it will again.

So back to 420. The decriminalization of marijuana is quickly becoming the norm. Even my US Senator Chuck “I never found a camera I didn’t like” Schumer posted on Facebook positively about 420 day. These are steps forward. But there is still an ugly racial history to the handling and prosecution of crimes related to marijuana in this country. Blacks, for example, are about twice as likely to be arrested for possession, despite their rate of use being about the same as whites. Like many aspects of the law, it’s clear it’s applied disproportionately and in huge part based on the color of one’s skin. Hence why I never really worried too much about it.

Fortunately, here in New York, part of the rollback of marijuana laws includes vacating 10s of thousands of prior convictions and expunging them from individual records (there are some caveats, however). This is a step towards restorative justice.

So 4/20 represents a confluence of events and perhaps a step forward. But despite Eva Kor’s testimonies, George Takei’s work, still going on today, and the conviction of George Floyd’s murderer, we have a long way to go towards living up to our ideals. They are the voices calling us to do better. And we must. And we must never think the work is done.

Your Boss Doesn’t Care About Backups!

It’s true. Even if they don’t realize it. Or even if they claim they do. They really don’t.

I’ve made this point before. Of course this is hyperbole. But a recent post by Taryn Pratt reminded me of this. I would highly recommend you go read Taryn’s post. Seriously. Do it. It’s great. It’s better than my post. It actually has code and examples and the like. That makes it good.

That said, why the title here? Because again, I want to emphasize that what your boss really cares about is business continuity. At the end of the day they want to know, “if our server crashes, can we recover?” And the answer had better be “Yes.” This means that you need to be able to restore those backups, or have another form of recovery.

Log-Shipping

It seems to me that over the years log-shipping has sort of fallen out of favor. “Oh we have SAN snapshots.” “We have Availability Groups!” “We have X.” “No one uses log-shipping any more, it’s old school.”

In fact, this recently came up in a DR discussion I had with a client and their IT group. They use SAN replication software to replicate data from one data center to another. “Oh you don’t need to worry about shipping logs or anything, this is better.”

So I asked questions like: was it block-level, file-level, byte-level or what? I asked how much latency there was. I asked how we could be sure that data was hardened on the receiving side. I never got really clear answers to any of that other than, “It’s never failed in testing.”

So I asked the follow-up question, “How was it tested?” I’m sure their answer was supposed to reassure me. “Well, during a test, we’d stop writing to the primary, shut it down and then redirect the clients to the secondary.” And yes, that’s a good test, but it’s far from a complete test. Here’s the thing: many disasters don’t allow the luxury of cleanly stopping writes to the primary. They can occur for many reasons, but in many cases the failure is basically instantaneous. This means that data was in flight. Where in flight? Was it hardened to the log? Was that data in flight to the secondary? Inquiring minds want to know.

Now this is not to say these many methods of disk based replication (as opposed to SQL based which is a different beast) aren’t effective or don’t have their place. It’s simply to say, they’re not perfect and one has to understand their limitations.

So back to log-shipping. I LOVE log-shipping. Let me start with a huge caveat. In an unplanned outage, your secondary will only be as up to date as the most recent log backup. This could be an issue. But the upside is, you should have a very good idea of what’s in the database, and your chances of a corrupted block of data or the like are very low.

But there are two facts I love about it.

  1. Every time I restore a log file, I’ve tested the backup of that log file. This may seem obvious, but it does give me a constant check on my backups. If my backups fail for any reason (lack of space, a bad block gets written and not noticed, etc.), I’ll know as soon as my next restore fails. Granted, my FULL backups aren’t being restored all the time, but I’ve got at least some more evidence that my backup scheme in general is working. (And honestly, if I really needed to, I could back up my copy and use that in a DR situation.)
  2. It can make me look like a miracle worker. I have, in the past, in a shop where developers had direct access to prod and had been known to mess up data, used log-shipping to save the day. Either on my DR box, or a separate box I’d keep around that was too slow CPU wise for DR, but had plenty of disk space, I’d set it to delay applying logs for 3-4 hours. In most DR events, it was fairly simple to catch up on log-shipping and bring the DR box online. But more often than not, I used it (or my CPU weak but disk heavy box) in a different way. I’d get a report from a developer, “Greg, umm, I well, not sure how to say this, but I just updated the automobile table so that everyone has a White Ford Taurus.” I’d simply reply, “give me an hour or so, I’ll see what I can do.” Now the reality is, it never took me an hour. I’d simply look at the log-shipped copy I had, apply any logs I needed to catch up to just before their error, then script out the data and fix the data in production. They were always assuming I was restoring the entire backup or something like that. This wasn’t the case, in part because doing so would have taken far more than an hour, and would have caused a complete production outage.
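To make the "catch up to just before their error" step a bit more concrete, here is a rough sketch in Python that only builds the T-SQL text for such a roll-forward. Everything specific in it is an assumption for illustration: the database name, the file paths, the timestamps, and the idea that you know the time of the newest transaction in each log backup. It finishes with RECOVERY, which ends log-shipping to that copy, so it fits a throwaway side copy; on a real log-shipped secondary you would more likely leave the copy in STANDBY so it stays readable and restorable.

```python
from datetime import datetime
from typing import NamedTuple


class LogBackup(NamedTuple):
    path: str             # hypothetical backup file location
    last_entry: datetime  # time of the newest transaction in the file


def build_restore_script(database, backups, applied_through, stop_at):
    """Return RESTORE LOG statements that roll a copy forward to stop_at."""
    # Skip anything the delayed copy has already applied, oldest first.
    pending = sorted(
        (b for b in backups if b.last_entry > applied_through),
        key=lambda b: b.last_entry,
    )
    statements = []
    for backup in pending:
        target = f"RESTORE LOG [{database}] FROM DISK = N'{backup.path}'"
        if backup.last_entry < stop_at:
            # Whole file predates the mistake: apply it and keep restoring.
            statements.append(f"{target} WITH NORECOVERY;")
        else:
            # This file spans the mistake: stop just before it and finish.
            stamp = stop_at.strftime("%Y-%m-%dT%H:%M:%S")
            statements.append(f"{target} WITH STOPAT = '{stamp}', RECOVERY;")
            break
    return statements


# Hypothetical log backups taken every 15 minutes.
backups = [
    LogBackup(r"D:\logship\autos_0900.trn", datetime(2021, 4, 20, 9, 0)),
    LogBackup(r"D:\logship\autos_0915.trn", datetime(2021, 4, 20, 9, 15)),
    LogBackup(r"D:\logship\autos_0930.trn", datetime(2021, 4, 20, 9, 30)),
    LogBackup(r"D:\logship\autos_0945.trn", datetime(2021, 4, 20, 9, 45)),
]

# The copy is current through 09:00; the bad UPDATE ran just after 09:20.
for line in build_restore_script(
    "Autos", backups, datetime(2021, 4, 20, 9, 0), datetime(2021, 4, 20, 9, 20)
):
    print(line)
```

The point of the sketch is the ordering: every log before the mistake goes on in full, the one containing the mistake is cut off with STOPAT, and nothing after it is touched.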

There was another advantage to my 2nd use of log-backups. I got practice at manually applying logs, WITH NORECOVERY and the like. I’m a firm believer in Train as you Fight.

Yes, in an ideal world, a developer will never have such unrestricted access to Production (and honestly it’s gotten better, I rarely see that these days) and you should never need to deal with an actual DR, but we don’t live in an ideal world.

So, at the end of the day, I don’t care if you do log-shipping, Taryn Pratt’s automated restores or what, but do restores, both automated and manual. Automated because it’ll test your backups. Manual because it’ll hone your skills for when your primary is down and your CEO is breathing down your neck as you huddle over the keyboard trying to bring things back.

Reminder

As a consultant, I’m always looking for new clients. My primary focus is helping to outsource your on-prem DBA needs. If you need help, let me know!

Free Cell #1703491

This is a completely random post and for a very select crowd.

I often play Freecell (far too much, but that’s another story). Years ago, when it first came out with Windows XP, I wondered if every game was winnable. Apparently not. That said, I haven’t come across any of the “impossible games”. But I’ve come across a few hard ones.

But none nearly as hard as game #1703491. Usually I can solve most games in 2-5 minutes, sometimes it takes 15-30. I was into this one for over 2 hours before I did something I’ve rarely done. I looked for help. Mostly I wanted to know I wasn’t playing an impossible game. A brief search suggested I wasn’t. A longer search proved I wasn’t. But there was only one cryptic suggestion. I had pretty much settled on this being the most likely path, clearing the 6th column.

Now, small sidebar. To add a bit of a challenge, I have a self-imposed rule that I don’t put cards up on the home cells manually; I let the game move them automatically. In other words, if there’s a free card I could put up there manually, but that won’t go automatically, I won’t put it up. This happens, for example, if the home cells show AH, blank, 2D, AS: I won’t put up the 3 of diamonds. The game won’t automatically put up the 3D until the 2H, 2C and 2S are up there also. Like I say, no real reason other than the extra challenge. I had to break that rule in this game.
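For the curious, the auto-move behavior I’m describing can be sketched in a few lines of Python. This is my reading of the behavior as described above, not the game’s actual code. The function name and the dictionary layout are made up for illustration, and the clause that lets aces and twos always go up is an assumption I added so that the example position (2D up while clubs is still blank) is reachable.

```python
SUITS = ("H", "C", "D", "S")


def moves_up_automatically(suit, rank, home):
    """home maps each suit to the rank of its top home-cell card (0 = blank)."""
    # The card must be the next one needed on its own suit's pile.
    if home[suit] != rank - 1:
        return False
    # Assumption: aces and twos always go up on their own.
    if rank <= 2:
        return True
    # Otherwise every other suit must already be up to one rank below.
    return all(home[other] >= rank - 1 for other in SUITS if other != suit)


# The position from the text: AH, blank, 2D, AS.
home = {"H": 1, "C": 0, "D": 2, "S": 1}
print(moves_up_automatically("D", 3, home))

# Once the 2H, 2C and 2S are up, the 3D goes automatically.
home_later = {"H": 2, "C": 2, "D": 2, "S": 2}
print(moves_up_automatically("D", 3, home_later))
```

My self-imposed rule then amounts to: only let a card go home when this check would pass.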

Anyway, even with that advice, I kept getting stuck.

A common spot I would get to was:

[Screenshot: Making Progress]

[Screenshot: Still not much wiggle room]

This was the first time I had freed up the 8th column. So that was progress and I had considered that key. I’m not sure what took me this long to figure out this combination of moves.

And now the breakthrough. I’m feeling good here. I know once I get the 2 of Hearts up there, I’ll be making real progress!

[Screenshot: This move is obvious]

Now I’m gaining momentum. It may seem tempting to free up that 2 of Spades. Resist that temptation!

[Screenshot: Don’t play the obvious move!]

Rather you want to move that stack on the 5 of Hearts. With that move and a few others you end up at:

[Screenshot: Now we’re making real progress!]

The next few moves are pretty clear. Now we can move up that 2 of Spades and after that the game is clearly winnable.

[Screenshot: Getting Close]

That said, I have to break my own rule one more time, but I don’t care. I’m ready to win.

[Screenshot: Almost There!]

And that’s it! I can relax now!

This Post is Free!

Yes, seriously, other than a bit of your time, it will cost you nothing to read this post. And you might gain something from it. That can be a good value.

As I’ve mentioned in the past, one of the things I do when I’m not doing SQL Server is perform training for those interested in Cave Rescue. I also sometimes blog about it. I have also mentioned that this year I’m organizing the National Cave Rescue Commission’s national weeklong training class. In addition, since apparently I’m not enough of a masochist, I’m also organizing a regional Level 1 only weeklong training class.

Due to generous contributions, the NCRC is able to offer scholarships. For the regional weeklong, we are able to offer 4 scholarships of a value of up to $375 each. This covers 1/2 the cost of training. Applications were due Saturday. Now, we’re hoping for 12-20 students, so this means if everyone applied, they’d have between a 1-in-5 and a 1-in-3 chance of getting a scholarship. Can you guess how many had applied as of Saturday?

Before I answer that, I’ll note my wife used to work as a financial aid director at a local nursing school. They too sometimes offered scholarships. There was one worth I believe $500 that often went unclaimed. Yes, it required a one page essay to be judged to apply. That one page apparently was too high of a barrier for many folks and as a result sometimes it was never awarded. Quite literally a person could have written, “I would like to apply for the scholarship” as their essay and gotten it.

The same thing happened with our regional scholarships. Out of 11 students so far, none applied. This was literally free money sitting on the table. We have decided to extend the scholarship application process until April 23rd and reminded folks they could apply.
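The scholarship odds quoted earlier are easy to check. Here is a small Python sketch of the arithmetic. It assumes every applicant is equally likely to be chosen and that all 4 scholarships get awarded, neither of which the committee is actually bound by; the function name is made up for illustration.

```python
from fractions import Fraction

SCHOLARSHIPS = 4         # number of regional scholarships, from the post
SCHOLARSHIP_VALUE = 375  # dollars ("up to"), from the post


def odds(applicants, awards=SCHOLARSHIPS):
    """Chance that any one applicant is funded, assuming equal odds."""
    return Fraction(min(awards, applicants), applicants)


# A scholarship covers half the cost, so the implied full cost is double.
full_cost = SCHOLARSHIP_VALUE * 2

# 12 and 20 are the hoped-for class sizes; 11 is the enrollment so far.
for students in (12, 20, 11):
    print(students, "applicants ->", odds(students))
print("implied full cost of training:", full_cost)
```

With only 11 students signed up, even if every one of them applied, the odds would be a bit better than 1 in 3. With none applying, the first person to send in an application is close to a sure thing.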

Now, some of the students probably can NOT apply, because they are employees of government agencies that sometimes have rules on what outside funds or gifts can be accepted. This actually increases the odds for the other students. And some may feel that their economic status is good enough that they don’t need to and fear they’d take a scholarship away from someone who has more of a need for it. And that’s a position I can definitely appreciate. But my advice to them is, “let the scholarship committee make that decision.” If they determine someone else needs the money more, or your need is not enough, they will let you know. And if they do give you a scholarship and you feel guilty, pay it forward. Donate to the fund later on, or give the money you saved to other causes.

Besides essentially free money at the NCRC, I got thinking about the amount of free training I’ve received in the SQL Server community. Yes, I’ve paid for PASS Summit a few times, but even if I had never gone to that, the amount of knowledge I’ve gained for free over the past several years has been amazing. Between SQL Saturdays and User Group meetings, the body of knowledge I’ve been exposed to has been absolutely amazing.

And yet, I know folks who shun such activities. I’m not talking about folks who say, “I can’t make it this month because it’s my kid’s birthday”. I’m talking about folks who claim they never learn anything. I don’t understand how that’s possible given the HUGE range of topics I’ve seen at SQL Saturdays and oh so many other free events. Some folks seem to think only the paid events are worth it. And while PASS Summit had certain unique advantages, the truth is, you can listen to almost all the presenters at various free events too.

Yes, time is not free, and I recognize that. But overall, I’m still amazed at the number of folks who overlook the value of free events, or easy to gain scholarships to events. Don’t turn your nose up at free. It can be valuable.

P.S. – for the parents of college bound kids out there, here’s one thing I did in college that netted me a bit of free money. A few days after the semester began, I’d stop by the financial aid office and ask if there was any unclaimed scholarship money I was eligible for. I never netted much, but I did net a few hundred dollars over the years. For 15 minutes of my time, that’s a pretty decent ROI.

Make Security Easy

This will be a short blog this week, but I want to talk again about an issue I have with a client of mine. They make security hard.

This is not to say they don’t take it seriously, or that they are lax. Far from it. They actually are fairly stringent on their security protocols and get after folks on ensuring boxes are consistently patched and that passwords are stringent and details like that. Overall I’d give them probably a B on security. But I can’t quite give them an A.

There’s really two reasons for that:

The first is inconsistency. Let me be clear, getting to their internal network is appropriately difficult. I have to use their secure VPN, with soft-tokens and similar measures. Technically before I can access a box, I have to jump through multiple hurdles. I’m ok with that. What’s a pain is that on some boxes, if I walk away for an extended period of time, the screen remains unlocked and nothing changes. Now, because of my OWN security model my computer will lock FAR sooner than that. And my default mode is to typically lock my own computer anytime I walk away from it (and that’s within my own house). But for some machines, if there’s no keyboard or mouse input, the screen will lock after 15 minutes, but my session won’t ever be logged out. For others, the screen will lock after 15 minutes and my session will be logged out after several hours. There appears to be no real rhyme or reason to this other than a slight correlation with when the box was configured.

Now, in general, I think locking unattended screens can be a good thing. The downside is, due to the nature of my job, I may start work on one machine, flip over to another to do something like update the schema and then flip back to the first, only to find my screen locked. In other cases, it won’t be. It’s inconsistent. Ideally I think it should be consistent.

So, if you have a security protocol, decide on what it is, and make it consistent.
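One cheap way to spot this kind of drift is to compare each machine’s settings against whatever most machines use. Here is a minimal Python sketch of that idea. The host names, the settings, and the inventory itself are all hypothetical; in practice the data would have to come from whatever configuration or inventory tooling you already have.

```python
from collections import Counter


def find_policy_outliers(inventory):
    """Return the most common policy and the hosts that deviate from it."""
    # Treat the most frequently seen policy as the baseline.
    baseline, _count = Counter(inventory.values()).most_common(1)[0]
    outliers = {
        host: policy for host, policy in inventory.items() if policy != baseline
    }
    return baseline, outliers


# Hypothetical inventory: host -> (lock after N minutes, log out after N hours).
# None means that setting never kicks in.
inventory = {
    "sql-01": (15, None),   # locks, but the session is never logged out
    "sql-02": (15, 8),
    "sql-03": (15, 8),
    "app-01": (None, None), # never locks at all
}

baseline, outliers = find_policy_outliers(inventory)
print("baseline policy:", baseline)
for host, policy in sorted(outliers.items()):
    print(host, "differs:", policy)
```

The most common setting isn’t automatically the right one, of course. The point is only that once you’ve decided on a policy, checking for boxes that don’t match it is not hard.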

But the real complaint I have, and this has been true of multiple companies I’ve worked with: make security easy.

Again, with this particular client, on most, but not all boxes, I can easily download and install the required patches. (OS level patches are handled by their internal IT team, which is a huge win.) But some machines have firewall rules in place such that you can’t download the patch directly to the machine. You have to go to a jump box, download the patch there and copy it over. This is fairly inconvenient. Now, if this were consistent across all machines I’d develop procedures around that, but it’s not consistent. This is particularly a problem for software that will only download a stub installer, which then tries to download the actual patch. In this case, if you simply copy over the stub and try to run it to patch the machine, it too will fail. This means you need to find the often hard to find link to download the full patch to the jump box and then copy that over. In some cases, it’s even worse: you have to manually place files where you want them. I had this occur on an update I was doing to a module for PowerShell. I had to download the installer to a jump box, extract what I needed and manually copy the files to the right subdirectory. Now, granted, I get paid by the hour, but I’d like to think my clients pay me for things other than copying files.

I’ve seen another related issue at other clients when it came to patching. They’d patch users’ desktops during the day and default to “reboot in the next 10 minutes” with no option of delaying the patch or reboot. Now, there are possibly zero-day exploits where this might be warranted, but this was the default for ALL Windows patches. This was really discouraging to employees and multiple times caused them to lose work, especially if they were away from the desk during this time and didn’t have a chance to save their work. The sad part is that there are multiple ways this could have easily been handled that would have had far less impact on the employees.

In the end, security is critical, but we should be making it as easy to comply as possible and as consistent as possible. There’s an old adage that the security person doesn’t stop doing their job until they’ve stopped you from doing yours. Don’t make that a truism.