Quiet Time and Errors

I wrote last week about finally taking apart our dryer to solve the loud thumping issue it had. The dryer noise had become an example of what is often called normalization of deviance. I’ve written about this before more than once. This is a very common occurrence and one I would argue is at times acceptable. If we reacted to every change in our lives, we’d be overwhelmed.

That said, like the dryer, some things shouldn’t be allowed to deviate too far from the norm, and some things are more important than others. If I get a low gas warning, I can probably drive another 50 miles in my car. If I get an overheated engine warning, I probably shouldn’t try to drive another 50 miles. The trick is knowing what’s acceptable and what’s not.

Yesterday I wrote about some scripting I had done in response to an issue that came up at a customer site. Nightly, a script runs to restore a database from one server to a second server, and every morning we’d get an email saying it was successful. But there was also a separate email, from a separate task designed to run against that database, that indicated a failure. And that particular failure was actually pretty innocuous. In theory.

You can see where I’m going here. Because we were trusting the email from the restore job over the email from the second job, we assumed the restore was fine. It wasn’t. The restore was failing every night but sending us an email indicating success.
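
I won’t reproduce the client’s scripts here, but the underlying trap is easy to sketch. Below is a minimal Python sketch of one way to avoid it, assuming purely for illustration that the restore is kicked off with sqlcmd and the status goes out by email (the server name, script name, and addresses are all made up): the point is that the “success” email only goes out when the restore command actually reports success, rather than unconditionally.

```python
import smtplib
import subprocess
from email.message import EmailMessage

# Purely illustrative command; the real job used its own restore script.
# The -b flag makes sqlcmd return a non-zero exit code when the batch fails.
RESTORE_COMMAND = ["sqlcmd", "-S", "secondary-server", "-b", "-i", "restore_database.sql"]

def send_status(subject: str, body: str) -> None:
    # Addresses and mail server are made up for the sketch.
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = "jobs@example.com"
    msg["To"] = "dba-team@example.com"
    msg.set_content(body)
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)

def nightly_restore() -> None:
    result = subprocess.run(RESTORE_COMMAND, capture_output=True, text=True)
    # The whole point: report success only if the restore actually succeeded.
    # A script that emails "Success" unconditionally is how you end up
    # trusting a job that has been failing every night.
    if result.returncode == 0:
        send_status("Nightly restore: SUCCESS", result.stdout)
    else:
        send_status("Nightly restore: FAILED", result.stderr or result.stdout)

if __name__ == "__main__":
    nightly_restore()
```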

We had unwittingly accepted a deviance from the norm. Fortunately the production need for this database hadn’t started yet. But it will soon. This is what drove me to rewrite and redeploy the scripts on Friday and again on Monday.

And here’s the kicker: with the new script, we discovered the restore had also been failing on a second server (for a completely different reason!).

Going back to our dryer here, it really is amazing how much we had come to expect the thunking sound and how much quieter it is now. I’ve done nearly a half-dozen loads since I finally put in the new rollers, and every time I push the start button I still cringe, waiting to hear the first thunk. I had lived with the sound so long that I had internalized it as normal. It’s going to take a while to overcome that reaction.

And it’s going to take a few days or even weeks before I fully trust the restore scripts and don’t cringe a bit every morning when I open my email for that client and check the status of the overnight jobs.

But I’m happy now. I have a very quiet dryer and I have a better set of scripts and setup for deploying them. So the world is better. On to the next problems!

5 thoughts on “Quiet Time and Errors”

  1. There’s another issue here too, in information overload. Monitoring systems designed to send email when they believe things are normal are at best just extra noise someone has to deal with, and at worst something people become accustomed to ignoring and can easily overlook when it isn’t saying that things are normal. Now clearly you had a bigger problem in a system thinking that things were normal when they weren’t, but still, cut down the noise.

    • Yeah, I actually was thinking about this because last night in our NCRC Medical Interest Group we were talking about how often to take vital signs and how in some cases it may be pointless, since you might not be able to act on the information in any event.

      I actually brought up the point that I’ve been known to disable alerts because they provided no useful or actionable information.

      For another set of jobs we’re able to collect the data in a useful dashboard and I can look at that once in the morning and know the status of about 8 overnight jobs. Far more useful than 8 separate emails.
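
      The dashboard itself isn’t something I can share here, but the idea is simple enough to sketch in Python (the JobStatus shape and the job names below are made up for illustration): collect each job’s outcome in one place and render a single summary, with any failures listed first.

      ```python
      from dataclasses import dataclass

      # Hypothetical shape of an overnight job's outcome; in practice each job
      # would write its status to a shared table or log that the dashboard reads.
      @dataclass
      class JobStatus:
          name: str
          succeeded: bool
          detail: str = ""

      def build_digest(jobs: list[JobStatus]) -> str:
          """Roll the overnight jobs up into one summary, failures first,
          so a single glance answers "did anything break last night?"."""
          failures = [j for j in jobs if not j.succeeded]
          lines = [f"Overnight jobs: {len(jobs) - len(failures)} of {len(jobs)} succeeded"]
          for job in failures:
              lines.append(f"  FAILED: {job.name} - {job.detail}")
          return "\n".join(lines)

      if __name__ == "__main__":
          # Made-up sample data standing in for the ~8 real jobs.
          print(build_digest([
              JobStatus("restore-customer-db", False, "cannot open backup device"),
              JobStatus("rebuild-indexes", True),
              JobStatus("purge-old-logs", True),
          ]))
      ```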

  2. “We had unwittingly accepted a deviance from the norm.”

    I’m not sure I entirely agree… Acceptance of deviation from the norm is failing to value clear warnings of a deteriorating situation. While that applies here to an extent, what you actually did was (in my book) much blacker. The basic fault here isn’t really talking yourself into ignoring a warning – it’s failing to resolve inconsistent indications.

    While it can be seen on the same spectrum, “this isn’t so bad (it hasn’t blown up after all)” isn’t the same as “I chose to believe a potentially broken indication”. The former… you never really know until it’s too late. The latter? It’s a blaring alarm that something is broken.

    Back in the day, I’d have had a long talk with a tech who did the former. But the latter? I’d have _absolutely_ roasted him alive. Think back on the discussions of “ratty data” in Lovell’s Lost Moon.

    See pages 77-79….

    https://books.google.com/books?id=-H2JDwAAQBAJ&pg=PT94&lpg=PT94&dq=ratty+data+apollo+13#v=onepage&q=ratty%20data%20apollo%2013&f=false

    • Fair point. To be fair, and I didn’t explain it well, there was initially very little reason to correlate the failure with the success. The failure, to be technical, was “can’t add user X to database”. This can happen for a variety of reasons. One is that the original database already had the user prior to the restore (which was actually my working theory at the time, since it was far more likely; it also meant the error was really just a warning). Another is that the user didn’t exist on the secondary machine (this is actually the exact case in another setup, where we ended up removing the job because it was pointless). So the correlation was weaker than I make it out to be in the article. It wasn’t as simple as “this showed a successful backup and this one showed a failed backup.”

      That said, I think you’ve got a point: it’s not necessarily the greatest example of normalization of deviance.

      Thanks for the insight.

  3. Pingback: Advanced Braining | greenmountainsoftware
