It’s Easier the Other Way!

This is what someone said to me while biking up a hill the other day.

I have a regular bike route that includes a stop at the supermarket. It’s just over 5 miles. Obviously, since it’s a loop, I end up back where I started, but the topography varies along the way. The simple fact of the matter is that the supermarket sits at a lower altitude than the house, about two-thirds of the way along the route in the direction I ride.

So when I tell people about biking this route, I point out that it’s sort of a double-whammy. When I get to the lowest point in the route, I go shopping. To finish my loop, my bike is often heavier with groceries, and I have all the altitude to gain back (which admittedly isn’t much, a bit over 100′) in a short distance (maybe 2000′). It was along this section that my would-be adviser offered his advice. My reply, of course, was “True, but home is that way!” as I pointed up the hill.

Unfortunately, sometimes one can’t do it the easier way. One has to do it the hard way.

But this leads into something else I wrote on Quora.com: Do Bad Programmers Know They Write Bad Code? I only partly addressed the question there, but to sum it up: I think the worst don’t, but the best programmers do, and sometimes they write it intentionally.

The truth is, we probably should always be writing the best code we can. It should trap and handle errors. It should validate inputs. It should fail gracefully.
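In T-SQL terms, that might look something like this (a minimal sketch; the procedure and table names are mine, invented purely for illustration):

    -- "The best code we can": validate inputs and trap errors rather than
    -- assuming the happy path. dbo.AddOrder and dbo.Orders are hypothetical.
    CREATE OR ALTER PROCEDURE dbo.AddOrder
        @CustomerID int,
        @Quantity   int
    AS
    BEGIN
        -- Validate inputs before doing any work.
        IF @CustomerID IS NULL OR @Quantity IS NULL OR @Quantity <= 0
        BEGIN
            RAISERROR('CustomerID and a positive Quantity are required.', 16, 1);
            RETURN;
        END;

        BEGIN TRY
            INSERT INTO dbo.Orders (CustomerID, Quantity)
            VALUES (@CustomerID, @Quantity);
        END TRY
        BEGIN CATCH
            -- Fail gracefully: say what went wrong, then let the caller decide.
            PRINT 'AddOrder failed: ' + ERROR_MESSAGE();
            THROW;
        END CATCH;
    END;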

But often, we need a one-off script. Something that gets the job done here and now.

I recently did a 5-minute lightning round for my SQL User Group on the benefits of using PowerShell. I took two quick and dirty scripts I had written and rewrote them a bit for the presentation. Afterwards, one of the attendees asked me a few questions about my stylistic choices in the code. He was right in general, but I pointed out what my goal was: to show what PowerShell could do, not to show how to write good PowerShell. That said, I probably should have written slightly better code, but it got the job done. It definitely didn’t need error handling and the like. It was good enough.

And ironically, this post is sort of like that. (I love it when I can get meta on my own posts.) I have about a dozen drafts saved in WordPress. Most have just a title and a quick set of notes on what I think I should write about. This one was mostly written and just needed a bit more to flesh it out. It was easier this way than trying to come up with a new topic for this week. Hope you enjoyed it. (And to keep things even easier, I’m going to let WordPress use a random photo for it!)

An Ounce of Prevention?

There’s an old saying in medicine: “When you hear hoofbeats, think horses, not zebras.” Contrary to what one might think after watching House, the truth is that when you’re presented with a set of symptoms, you start with the most likely cause, well, because it’s the most likely! But as House also illustrates, sometimes it can be the unlikely one.

In First Aid, especially wilderness or backcountry medicine, there’s an acronym that is often used called SAMPLE. This is a mnemonic to help rescuers remember what data to gather:

  • S – Signs/Symptoms – What do you see or observe? (i.e., what’s going on.)
  • A – Allergies – Perhaps the problem is an allergic reaction, or they might be allergic to whatever drug you want to give them.
  • M – Medicines – What medications are they on? Perhaps they’re diabetic and haven’t taken their insulin, or they’re on an anti-seizure medicine and need some.
  • P – Past, pertinent medical history – You don’t care that they broke their ankle when they were 5. But perhaps they just underwent surgery a few weeks ago? Or perhaps they have a history of dislocating their shoulder.
  • L – Last oral intake – Have they eaten or drunk anything recently? This will drive your decision tree in a number of ways.
  • E – Events leading up to the injury – Did they fall while climbing to the top of the cliff, or did they simply collapse at the bottom? The former suggests looking for a possible spinal injury; the latter probably indicates something else.

I’m reminded of this because of how I spent my Presidents’ Day. I woke up, checked a few emails, and noticed that two I was expecting from a client’s servers had never arrived. I logged in to see what was going on. It turns out there was nothing going on. No, not as in “nothing wrong going on,” but more as in “nothing at all was happening; the databases weren’t operating right.” The alert system could connect to the database server, so no alerts had been sent, but actually accessing several of the databases, including master, resulted in errors.

[Screenshot: master.mdf corruption errors. Not an error you want to wake up to!]

And what was worse, the DR server was exhibiting similar symptoms!
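For anyone facing something similar: the standard first step to confirm and scope suspected corruption is an integrity check. I’m not reproducing our exact commands here, but the general shape is:

    -- Scope the damage on each affected database (master included).
    -- NO_INFOMSGS keeps the output to just the errors.
    DBCC CHECKDB (master) WITH NO_INFOMSGS, ALL_ERRORMSGS;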

So modifying SAMPLE a bit:

  • S – Signs/Symptoms – Databases are throwing corruption errors on two servers, and the corruption extended to the ERRORLOG files on both.
  • A – Allergies – Servers don’t have allergies, but how about known bugs? That’s close enough. Nope, nothing that seems to apply here.
  • M – Medicines – I’ll call this the antivirus software and <redacted>. (For client privacy reasons I can’t name the other piece of software, but I’ll come back to it.)
  • P – Past, pertinent medical history – Nothing; these servers had been running great up until now. One had been in production for over 2 years; the other had been up for about 2 months, brought up as a DR box for the first.
  • L – Last oral intake – Let’s make this last data intake. Through forensics (see the query after this list), we determined the corruption on both servers occurred around 3:00 AM EST. Checking our logs, jobs, and other processes, there was nothing special about the data the primary server took in at that time. If anything, disk I/O was lower than average. And, fortunately, we can easily recreate any data that was sent to the server after the failure.
  • E – Events leading up to the injury – This is where things get interesting.
    • There were some zero-day patches applied to both servers over the weekend.
    • On Saturday, I had finally set up log-shipping between the two servers.
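The forensics I mentioned under “L” leaned on the kind of query below. SQL Server records damaged pages in msdb.dbo.suspect_pages as it encounters them, which helps bracket when corruption first appeared (a sketch, not our exact query):

    -- When was each bad page recorded? This helps bracket the corruption
    -- window (around 3:00 AM in our case).
    SELECT DB_NAME(database_id) AS DatabaseName,
           file_id,
           page_id,
           event_type,        -- 1-3 = errors (823/824, bad checksum, torn page)
           error_count,
           last_update_date
    FROM   msdb.dbo.suspect_pages
    ORDER BY last_update_date;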

So, we’ve got three possibilities. Well, four really.

  1. The zero-day patches caused an issue about 48 hours later. Possible, but unlikely, given that the client has about 1,600 servers that were also patched and have not had issues.
  2. Log-shipping somehow caused problems. But again, the new log-shipping setup had run for about 36 hours without issue. Best we can tell, the corruption occurred on the secondary BEFORE the primary, and log-shipping doesn’t touch the master database or the ERRORLOG file at all.
  3. Some unknown interaction. This, I think, is the most likely, and it’s where the <redacted> from the M above comes into play.
  4. Pure randomness, and it’ll never happen again. I hate this option because it just leaves me awake at night. I added it only for completeness.

Without going into detail, our current theory is that some weird interaction between <redacted> and log-shipping is the cause. Of course, the vendor of <redacted> is going to deny this (and has), but it’s the only combination of factors that seems to explain everything. (I’ve left out a number of additional details that also helped us reach this conclusion.)

So for now, we’ve disabled log-shipping and are going to make some changes to the environment before we try log-shipping again.

Normally I think horses, but we might have a herd of zebras on this one. And ironically, setting up for DR may have actually caused a Sev 1 outage. So the ounce of prevention here may not have been worth it!

And who said my Wilderness First Aid wouldn’t come in handy?

Bits are cheap

And, unfortunately, as a recent incident in our #SQLFamily community illustrated, at times so is respect. Bear with me as I relate these two ideas and another incident.

Let me start with a statement that should make more sense by the end of this post: My name is Gregory, but I prefer that you call me Greg. My pronouns are he/him/his.

But first, a trip down memory lane. Many of us recall the Y2K issue. This was a direct result of programmers decades ago trying to save bytes of storage (and, to a lesser extent, memory and CPU cycles) because storage was expensive. By storing dates with just the last two digits of the year, they could cut the storage for years in half. This was important back then because it saved money. But, as many of us recall, as the year 2000 approached, this started to cause more and more problems. (As an aside, the first example I’m aware of was brought to my attention by a programmer who worked for a bank in 1970. Seems they suddenly had issues handling 30-year mortgages!)
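Two-digit years haven’t entirely disappeared, either. SQL Server, for one, still has to guess at the century when handed one; its default “two digit year cutoff” setting is 2049, which makes the underlying ambiguity easy to demonstrate:

    -- With SQL Server's default "two digit year cutoff" (2049), two-digit
    -- years 00-49 are read as 2000-2049 and 50-99 as 1950-1999.
    SELECT CAST('01/01/49' AS datetime);  -- 2049-01-01
    SELECT CAST('01/01/50' AS datetime);  -- 1950-01-01

    -- The bank's 1970 problem in miniature: a 30-year mortgage written in
    -- 1970 and stored as year "70" matures in year "00", which sorts before
    -- the year the loan was written.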

Since then, of course, the cost of storage has dropped, and as an industry we’ve moved to storing years as four digits. No one in this day and age would normally question that decision.

But enough of ancient history, let me get to the point of this article: respecting others.

As many readers know, those of us on Twitter will often use the hashtag #SQLFamily.  In the past week I’ve seen two incidents that have illustrated the worst and the best of this family.

In the first case, a member of the community, a woman I had never met, said she was leaving the family; she no longer felt welcome. At an event she had been misgendered, not once, but multiple times. For those who aren’t sure what that means, I will say, without going into background or details (because they’re not important), that she is a trans woman. Several people at the event took it upon themselves to refer to her using male pronouns.

In the most recent case, a fellow speaker, Cathrine Wilhelmsen, tweeted about how she had been addressed as Cathi and Kathi twice in the previous 24 hours. She says this wasn’t the only time, just the most recent, and recent enough for her to comment on.

In both cases, part of the problem is that strangers addressed the person in question in a manner that did not respect them: in the first case by not using the proper pronouns, and in the second by not using the name she had provided.

But that’s only one part of the problem. So let’s address it first: we have members of the #SQLFamily who don’t respect other members. Then we have another issue, one that I think is just as important to address: those who minimize the problem. In the first case, apparently no one called out the folks misgendering the woman. In a situation like that, a show of support can be as simple as saying, “Umm, I think you mean she, not he.” You can also support the use of pronouns on nametags at events or in the bio descriptions for events.

Remember, though: today, bits are cheap. So we can do more. Don’t design your database with a bit field for gender. Make it a table. These are relational databases, after all. Have a table for possible gender identifications, and allow for a method to add rows to it. Have a table for pronouns. There are more than you might think, and people are often crafting additional ones. While the singular they/them is becoming more popular, it’s NOT the only alternative to he/him and she/her.
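A minimal sketch of what I mean (the table and column names here are my own, just for illustration):

    -- Lookup tables instead of a bit field. New identities and pronoun
    -- sets become INSERTs, not schema changes.
    CREATE TABLE dbo.Gender (
        GenderID   int IDENTITY(1,1) PRIMARY KEY,
        GenderName nvarchar(50) NOT NULL UNIQUE
    );

    CREATE TABLE dbo.Pronoun (
        PronounID  int IDENTITY(1,1) PRIMARY KEY,
        Subjective nvarchar(20) NOT NULL,  -- he, she, they, ze, ...
        Objective  nvarchar(20) NOT NULL,  -- him, her, them, zir, ...
        Possessive nvarchar(20) NOT NULL   -- his, hers, theirs, zirs, ...
    );

    CREATE TABLE dbo.Person (
        PersonID      int IDENTITY(1,1) PRIMARY KEY,
        FullName      nvarchar(100) NOT NULL,  -- e.g., Gregory
        PreferredName nvarchar(100) NULL,      -- e.g., Greg
        GenderID      int NULL REFERENCES dbo.Gender (GenderID),
        PronounID     int NULL REFERENCES dbo.Pronoun (PronounID)
    );

    INSERT INTO dbo.Gender (GenderName)
    VALUES (N'Woman'), (N'Man'), (N'Non-binary');

Note the PreferredName column: respecting what someone asks to be called is a data-design question too.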

We are data professionals, after all. We absolutely should not lock our data into a single view of the world when that worldview is changing. (Note: the world itself is not changing; there have been multiple genders throughout recorded history. We’re simply becoming more cognizant of that now.)

In the case of Cathrine being called by another name, keep it simple. Use the name provided, whether in an introduction, on a nametag, or elsewhere. Respect the person’s wishes. And do not, as some did on Twitter, respond with “well, they probably didn’t mean anything” or “eh, just roll with it.” It’s not YOUR name. It’s not YOUR identity. Sure, you might not care if someone calls you Richard, Rick, Ricky, or Dick. But another person might. Their name is part of their identity; respect their wishes. I will add one more note that Cathrine and other women have shared with me: it is almost always men who use nicknames, cute names, or the like without prompting. Yes, fellow men, I’m calling you out. We may not think about it; in fact, I would argue we often don’t think about it. It’s something that privilege allows us. But be aware that your attempt to be friendly or familiar often comes off as diminishing and condescending.

Now, despite the failure of some members of #SQLFamily, I want to celebrate the great people in the community. These two incidents have generated a lot of responses. I’ve seen at least two great posts, one from Jen McCown and another from Kellyn Gorman, and I’m sure there are others. I also have written in the past about being an ally. And while I’ve seen one or two tweets that dismissed Cathrine’s tweet, I’ve seen many members rally to the defense of the women in both incidents. Very importantly, I’ve also seen several tweets from people asking, “How can I help?” or “How can I improve my behavior?” I love that last one. I’m constantly trying to unlearn some of the behaviors I was taught and to be more conscious of what being a white, straight, cis-het male brings to the table. We can always learn to do better.

Yes, our #SQLFamily has some members who could and need to do better. That saddens me. Fortunately as I’ve seen, it also has a lot of members actively striving to do better and help others do better. That gladdens me. Let’s all be the latter.

Respect and disk space don’t cost us much. Let’s learn to be respectful of people and to design databases that can also respect the world around us.

P.S. I want to note that I was purposely vague about the first incident because the specifics weren’t important, and I did not want to draw more attention to a specific person without her permission. In Cathrine’s case, I made a point of respecting her wishes and exchanged messages with her first to make sure she was OK with me bringing more attention to the incident.

Brake (sic) it to Fix it!

Last week I wrote about getting my brakes fixed. Turns out I made the right choice: besides shoes and pads, I needed new calipers. Replacing them was a bigger job than I would have wanted to deal with myself. Of course it cost a bit more than I would have preferred, but I figure being able to make my car stop is a good thing. I had actually suspected I’d need new calipers, because one of the signs that my brakes needed replacing was the terrible grinding and shrieking sound my left front brake was making. That’s never a good sign.

As a result, I started trying to brake less if I could help it. And I cringed a bit every time I did brake. It became very much a Pavlovian response.

About two weeks ago, a colleague of mine said, “Hey, did you notice the number of errors for that process went way up?” I had to admit that, nope, I had not. I had stopped looking at the emails in detail. They were for an ETL process I had written well over a year ago. About 6 months ago, due to new, and bad, data being put into the source system, the ETL started to have about a half dozen rows it couldn’t process. As designed, it sent out an email to the critical parties, and I received copies. We talked about the errors and decided they weren’t worth tracking down at the time. I objected, because I figure if you have an error, you really should fix it, but I got outvoted and figured it wasn’t my concern at that point. As a result, we simply accepted that we’d get an email every morning with a list of the rows in error.

But, as my colleague pointed out, about 3 months ago the number of errors had gone up. This time it wasn’t about a half dozen; it was close to 300. And no one had noticed. We had become so used to the error emails that our Pavlovian response was to ignore them.

But this number was too large to ignore. I ended up doing two things. The first, and the one I could deploy without jumping through hoops, was to update the error email. Instead of simply showing the rows in error, it now includes a query that places a summary table at the top showing how many errors occurred and in which tables. This is much more effective, because now a single glance shows whether the number of errors has gone up or down. (If we get no email at all, that means we’ve eliminated all the errors, the ultimate goal in my mind.)
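The summary query itself is nothing fancy, something along these lines (dbo.EtlErrorLog and its columns are hypothetical stand-ins for the client’s actual tables):

    -- Summarize this morning's errors by source table so a single glance
    -- shows whether things are getting better or worse.
    SELECT   SourceTable,
             COUNT(*) AS ErrorCount
    FROM     dbo.EtlErrorLog
    WHERE    ErrorDate >= CAST(GETDATE() AS date)  -- today's run only
    GROUP BY SourceTable
    ORDER BY ErrorCount DESC;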

I was able to track down the bulk of the 300 new errors to a data dictionary disagreement (raise your hand if you’ve ever had a customer tell you one thing about their data, only to discover the details are really quite different) that popped up when a large amount of new data was added to the source system.

I’ve since deployed that fix to the DEV environment, and now that we’re out of the end-of-month code freeze for this particular product, I’ll be deploying it to production this week.

Hopefully, though, the parties who really care about the data will then start paying attention to the new email and squawking when they see a change in the number of bad rows.

In the meantime, it’s going to take me a while to stop cringing every time I press my brakes. They no longer make any bad sounds, and I like that, but I’m not used to the absence of grinding noises again. Yet. In both cases, my client and I had accepted the normalization of deviance and internalized it.

I wrote most of this post in my head last week while remembering some other past events that are related, in part, to the same concept. As a result, this post is dedicated to the 17 American astronauts who perished directly in the service of the space program (not to diminish the loss of the others who died in other ways).

  • Apollo 1 – January 27th, 1967
  • Challenger – STS-51L – January 28th, 1986
  • Columbia – STS-107 – Launched January 16th, 2003; broke up February 1st, 2003.