SQL Saturday DC 2017 Prologue

I’ll apologize upfront: not every blog post is mind-shattering and deep. This is one of the lighter ones.

Once again I’ll be heading to a SQL Saturday, this time in Washington DC. I actually had two to choose from, one in Providence, Rhode Island, and this one. Mostly because I have friends in DC from the days I worked there, I chose to submit some sessions to present in DC.

Submitting presentations is always a bit nerve-wracking for me. You feel confident you’ve got a good session and you hope it’s what the organizers want. I’m not really sure how many SQL Saturdays I submitted sessions to before I was finally selected for my first one in NYC. My topic then (and one I still present from time to time) was “Tips that Saved my Bacon.” I even made bacon cookies to hand out. (Never again, they didn’t come out great.) I knew I was off to a good start when I saw Micron Technologies handing out t-shirts with a bacon motif and even a scent of bacon. Must be kismet.

I got to my room, got set up, and was all set. I was a bit excited and waited for the people to pour in and fill the room. Ok, then I waited for anyone to walk into the room. Finally some people showed up. Great, first hurdle cleared: I had an audience. Not as large as I had hoped with which to share my amazing wisdom, but an adequate one nonetheless.

Before I started, a question came from a woman who I believe was Muslim: “what is bacon?” Of course, with pork being harām to most observant Muslims, she may have never even encountered it. But I realized, too, that I had a bigger issue. She didn’t understand the idiom! I was really off to a great start now. I did my best to explain both what it was and the idiom. She seemed satisfied. And then we were off.

Fortunately I sort of had a plant in the audience: a friend of mine who was attending and had wanted to hear my talk. He claims he got a lot out of it. I hope so.

Since then I’ve given that talk several times and fortunately had better turnouts and better results.

After that I’ve had pretty good luck in getting selected to speak, but sometimes you still get the rejection. I had put in to speak at DC at least twice before and was turned down both times. You take it in stride and try not to take it personally.

So, again, this year I put in to speak at DC. I submitted three different possible presentations, hoping at least one would be accepted.

The final date for submissions came and went. And nothing. Then I saw they had reopened the period for submissions. This didn’t instill much hope that I’d be selected.

And still nothing. Oh well, I decided I’d go to SQL Saturday DC anyway, just to attend and to see my friends in DC.

Then I was at the RedGate After Party at Pass Summit and Chris Hyde walked up to me and said, “Congratulations. I really wanted to see your talk in DC, but apparently we’re presenting at the same time.” So, I had to go through my emails and find the one from DC. (That day I had a system that was giving me some warning messages, so I had to sort through about 100 messages to find the one from SQL Saturday DC, hence why I had originally missed it.)

But, as Chris didn’t say WHICH of my presentations he was going to miss, I pulled out my phone and logged into the site. And lo and behold, I discovered I wasn’t presenting just once, but TWICE! I was completely shocked. And… now in a wee bit of trouble.

You see, my talk “Who’s flying the plane? What IT can learn from plane crashes” is one of my favorites and one I’ve given multiple times before (and the one apparently Chris will miss). But my second talk, “Presently Presenting… Presenting,” was one I hadn’t quite fully written. Ok, I had the outline in my head, but hadn’t written it at all! I generally do NOT recommend this style of doing things. I really like to present to a smaller group first (say my local user group), but I figured this was a good way to give me a kick in the pants and get the talk written. And I was right. The presentation is written, I’ve run through it a few times (and will run through it a few more times before Saturday), and I’m quite happy with it at this point.

So, this coming Saturday, I’ll be giving not just one, but two talks at SQL Saturday in DC. If you’re reading this and already signed up, I’d love to see you there. If we know each other, of course say hi. If we don’t, introduce yourself. I always enjoy meeting new folks.

And if you haven’t signed up, there is unfortunately a wait-list, but you can still add your name to it and if folks cancel, get in.

So, I hope to see you there!

P.S. I’ll give one piece of advice that’ll be in my talk on presenting. If you DO get turned down, don’t take it personally. Take it with grace. SQL Saturday organizers face a lot of challenges in picking presenters and are often overwhelmed with the number of submissions. Trying to argue with them, or worse, calling them names or getting upset with them, is a surefire way to guarantee you do NOT get selected in the future. And organizers talk to each other. You do NOT want to get tagged as “that person”. If you get turned down, accept it gracefully and move on.

She’s smart and good looking.

Now, if you work from home like I do, this exercise won’t really work, but if you work in an office, look around at your coworkers and start to notice what gender they present as. Most likely you’ll notice a lot of men and a few women.

Sexism is alive and well in the tech world. Unfortunately.

We hear a lot about efforts (which I support by the way) like Girls and Data and Girls Who Code. These are great attempts at addressing some of the gender issues in the industry.  We’ve probably all heard about the “Google Manifesto” (and no, I’m not linking to it, since most of the “science” in it is complete crap and I don’t want to give it any more viewership than it has had. But here’s a link to the problems with it.)

We know that grammar school and middle school girls have a strong interest in STEM fields. And yet, by the time college graduation rolls around, we have a disproportionately smaller number of them in the computer sciences, for example. So the above attempts to keep them interested help, but honestly only address part of the problem.

The other side is us men. Yes, us. We can tell our daughters all day long, “you’re smart, you can program,” “You too can be a DBA!” and more. But what do we tell our sons? We need to tell them that women can program. We should be telling them about Ada Lovelace and Admiral Grace Hopper. We should be making sure they realize that boys aren’t inherently better at STEM than girls. We should be making sure they recognize their own language and actions have an impact.

What do we do ourselves when it comes to the office environment? Do we talk too much? Evidence suggests we do.

Do we subconsciously ignore the suggestions of our female coworkers or perhaps subconsciously give more support or credence to the suggestions of our male coworkers? While I can’t find a citation right now, evidence again suggests we do.

Who is represented at meetings?  Are they a good ol’ boys network?  Who do we lunch with, both at work and when we network?

If you’re a member of a user group that has speakers, what does the ratio of speakers look like to you? Do they reflect the group’s ratio? Do they reflect the ratio of the industry?

I think it’s great that we have programs such as Girls who Code and Girls and Data, but we as men have to work on ourselves and work on our actions and reactions.

Some suggestions: “Sometimes, simply shut up.” I’ve started to do this more, especially if I’m in a group of women. LISTEN. And you know what, if you’re thinking right now, “well duh… because women talk so much I’d never get a word in anyway” you’re falling victim to the cliches and perpetuating the problem.

Support the women you work with. If they have a good idea, make sure it gets the same discussion as other ideas. And if one of your coworkers tries to co-opt it as their own, call them on it.  If you have a coworker (and I’ve had these) that is continually cutting off women in meetings, call them on it.

Seek out women speakers for your user groups. I’d suggest for example Rie Irish and her talk “Let her Finish”.  I asked Rie to speak at our local user group. Partly because of serendipity (I contacted one of our women members to let her know about the talk) we got the local Women in Technology group to advertise our meeting and ended up with a number of new members.

And finally, the title. Watch your language. Unless you’re working at a modelling agency or similar, you probably should never be introducing a coworker as “She’s smart and good looking.” Think about it: would you ever introduce a male coworker as “He’s a great DBA and handsome to boot!”? Your coworkers, male or female, are just that: coworkers in a professional setting. Treat them as such.

Two final thoughts:

  1. If somehow this blog post has impacted you more than the brilliant posts of Rie Irish, Mindy Curnutt, or others who have spoken on sexism in the industry, I’d suggest you examine your biases, not give credit to my writing.
  2. If you have suggestions for women speakers for my local user group, especially local ones who can make the second Monday of the month, please let me know.

 




SQL Saturday NYC 2017 Recap

Realizing that perhaps SOME entry is better than no entry, I figured I’d write a short one.

This weekend, I spent about 27 hours in NYC with my wife Randi. While one goal of the trip was definitely a mini-vacation, the actual impetus was again a SQL Saturday. Again, I was selected to present my talk, “Tips that Saved My Bacon.” Apparently it was well received, since the people nice enough to give feedback on the provided forms gave me top marks across the board.

While that’s always refreshing, it does make me wonder what the other folks thought. Did they go away satisfied, feeling there was no useful feedback to provide? Or did they feel they couldn’t be critical since I might be insulted? Or did they simply not bother? I’ll never know.

I’ll say now that good feedback is always appreciated. (And feedback that I’m good is always an ego boost. 🙂)

I’ll be presenting again in a couple of weeks at SQL Saturday Philadelphia, this time two talks: again my Bacon talk, and my IT and Plane Crashes one. A twofer, if you will.

But this weekend got me thinking about my weekends this year. I’ll have spent at least 15 days on Cave Rescue stuff (several weekends plus a week of teaching), at least 4 SQL Saturdays (Chicago (past), New York City (past), Philadelphia, and Albany), and 3 days at SQL Summit. So that’s 26 days at least, donating time to organizations that I believe in strongly.

What do you do with your time?

 

Don’t Break the Chain!

If one backup is good, two is better right?

Not always.

Let me start by saying I’ve often been very skeptical of SQL Server backups done by 3rd party tools. There are really two reasons. For one, many years ago (when I first started working with SQL Server) they often simply weren’t good. They had issues with consistency and the like. Over time, and with the advent of services like VSS, that issue is now moot (though, I’ll admit, old habits die hard).

The second reason was I hate to rely on things that I don’t have complete control over. As a DBA, I feel it’s my responsibility to make sure backups are done correctly AND are usable. If I’m not completely in the loop, I get nervous.

Recently, a friend had a problem that brought this issue to light. He was asked to go through their SQL Server backups to find the time period when a particular record was deleted so they could develop a plan for restoring the data deleted in the primary table and in the subsequent cascaded deletes. Nothing too out of the ordinary. A bit tedious, but nothing too terrible.

So, he did what any DBA would do: he restored the full backup of the database for the date in question. Then he found the first transaction log backup and restored that. Then he tried to restore the second transaction log backup.
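For anyone who hasn’t walked through a point-in-time restore recently, the sequence he was following looks roughly like this. This is a sketch with hypothetical database and file names, not his actual commands:

-- Restore the full backup, leaving the database able to accept more log restores
RESTORE DATABASE InvestigateDB
  FROM DISK = 'Z:\restores\InvestigateDB_FULL.bak'
  WITH NORECOVERY, REPLACE;

-- Apply the first transaction log backup
RESTORE LOG InvestigateDB
  FROM DISK = 'Z:\restores\InvestigateDB_log_1.trn'
  WITH NORECOVERY;

-- Apply the second log backup, stopping just before the suspected delete
RESTORE LOG InvestigateDB
  FROM DISK = 'Z:\restores\InvestigateDB_log_2.trn'
  WITH STOPAT = '2016-11-09 08:00', RECOVERY;

That second log restore is where SQL Server came back with this: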

The log in this backup set begins at LSN 90800000000023300001, which is too recent to apply to the database. An earlier log backup that includes LSN 90800000000016600001 can be restored.

Huh? Yeah, apparently there’s a missing log.  He looks at his scheduled tasks. Nope, nothing scheduled. He looks at the filesystem.  Nope, no files there.

He tries a couple of different things, but nope, there’s definitely a missing file. Anyone who knows anything about SQL Server backups knows that you can’t break the chain. If you do, you can’t restore past the break. This can work both ways. I once heard of a situation where the FULL backups weren’t recoverable, but they were able to create a new empty database and apply five years’ worth of transaction logs. Yes, 5 years’ worth.

This was the opposite case. They had the full backup they wanted, but couldn’t restore even 5 hours’ worth of logs.

So where was that missing transaction log backup?

My friend did some more digging in the backup history tables in msdb and found this tidbit:

backup_start_date:    11/9/2016 0:34
backup_finish_date:   11/9/2016 0:34
first_lsn:            90800000000016600000
last_lsn:             90800000000023300000
physical_device_name: NUL
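If you ever need to do the same digging, a query along these lines will pull every backup recorded for a database, along with where each one was written. This is a sketch (the database name is made up), not necessarily the exact query he used:

-- Every backup msdb knows about for the database, newest first,
-- including the device it was written to
SELECT  bs.database_name,
        bs.type,                      -- D = full, I = differential, L = log
        bs.backup_start_date,
        bs.backup_finish_date,
        bs.first_lsn,
        bs.last_lsn,
        bmf.physical_device_name
FROM    msdb.dbo.backupset AS bs
JOIN    msdb.dbo.backupmediafamily AS bmf
        ON bmf.media_set_id = bs.media_set_id
WHERE   bs.database_name = 'InvestigateDB'
ORDER BY bs.backup_start_date DESC;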

There was the missing transaction log backup. It was a few minutes after the full backup, and definitely not part of the scheduled backups he had set up. The best he can figure is the sysadmin had set the SAN snapshot software to take a full backup at midnight and then, for some reason, a transaction log backup just minutes later.

That would have been fine, except for one critical detail. See that last entry? Yes, ‘physical_device_name’. It’s set to NUL. So the missing backup wasn’t made to tape or another spot on the disk or anyplace like that. It was sent to the great bit-bucket in the sky. In other words, my friend was SOL, simply out of luck.
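For the curious, a backup that vanishes like this is usually the product of something equivalent to the line below (my reconstruction; the actual command came from the SAN tooling). SQL Server treats NUL as a perfectly valid destination: the backup “succeeds,” the history records it, and every byte is discarded while the log chain moves forward anyway:

-- Hypothetical reconstruction: a log backup written to the bit-bucket
backup log InvestigateDB to disk='NUL'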

Now, fortunately, the original incident, while a problem for his office, wasn’t a major business-stopping incident. And while he can’t fix the original problem he was facing, he discovered the issues with his backup procedures long before a major incident did occur.

I’m writing about this incident for a couple of reasons. For one, it emphasizes why I feel so strongly about realistic DR tests. Don’t just write your plan down. Do it once in a while. Make it as realistic as it can be.

BTW, one of my favorite tricks, which I use for multiple reasons, is to set up log shipping to a 2nd server. Even if the 2nd server can never be used for production because it may lack the performance, you’ll know very quickly if your chain is broken.

Also, I thought this was a great example of where doing things twice doesn’t necessarily make things more resistant to disaster. Yes, had this been set up properly it would have resulted in two separate, full backups being taken, in two separate places. That would have been better. But because of a very simple mistake, the setup was worse than if only one backup had been written.

I’d like to plug my book: IT Disaster Response due out in a bit over a month. Pre-order now!

Deep Drilling

I was reviewing the job history on one of the DR servers of a client of mine. I noticed something funny. The last job recorded in the job history table (msdb.dbo.sysjobhistory for those playing along at home) was recorded in January of this year.

But jobs were still running. It took me a while to track it down, but through some sleuthing I solved the problem. First, I thought the msdb database might have filled up (though that event should have generated an error I’d have seen). Nope.

Then I thought perhaps the table itself was full somehow. Nope, only about 32,000 records.  No luck.

I finally tried to run sp_sqlagent_log_jobhistory manually with some made up job information.

Msg 8115, Level 16, State 1, Procedure sp_sqlagent_log_jobhistory, Line 99
Arithmetic overflow error converting IDENTITY to data type int.
Arithmetic overflow occurred.

Now we’re getting someplace. After a minor diversion of my own doing, I then ran:

DBCC CheckIDENT('sysjobhistory',NORESEED)

This returned a value of 2147483647. Hmm, that number looks VERY suspicious. A quick check of Books Online confirmed that’s the max value of a signed int.

Now, the simple solution, which worked for me in this case, was to issue a

truncate table sysjobhistory

This removed all the rows in the table AND reset the IDENTITY value. Normally I’d hate to lose history information, but since this was 6 months old and seriously out of date, it was acceptable. I could have merely reset the IDENTITY seed value, but there’s no guarantee I would not have then had collisions within the table later on. So this was the safest solution.

But wait, there was more. It kept bugging me that I had somehow burned through 2 BILLION IDENTITY values for this table. Sure, it handles log shipping for about a dozen databases and as a result runs about 48 jobs an hour, plus other jobs. But for a year that should generate less than 1 million rows. This database server hasn’t been running for 2 thousand years.

So, I decided to monitor things a bit and wait for a few jobs to run.

Then, I executed the following query.

select max(instance_id) from sysjobhistory

This returned a value along the lines of 232031. Somehow, in the space of an hour or less, my sysjobhistory IDENTITY column had increased by over 232,000. This made no sense. But it did explain how I had burned through 2 billion values!

So I started looking at the sysjobhistory table in detail. And I noticed gaps. Some make sense (if a job has multiple steps, it may temporarily insert a row and then roll it back once the job is done and put in a job completion record, and with the way IDENTITY columns work, this explains some small gaps). For example, there was a gap in instance_id from 868 to 875. Ok that didn’t bother me. BUT, the next value after 875 was 6,602. That was a huge gap! Then I saw a gap from 6,819 to 56,692. Another huge gap. As the movie says, “Something strange was going on in the neighborhood”.
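Spotting those gaps by eyeball gets old quickly. A windowing query roughly like this one (a sketch; LAG needs SQL Server 2012 or later) lists every place the instance_id jumped by more than a handful:

-- Find spots where instance_id jumped by more than 100 between consecutive rows
WITH h AS
(
    SELECT instance_id,
           LAG(instance_id) OVER (ORDER BY instance_id) AS prev_id
    FROM   msdb.dbo.sysjobhistory
)
SELECT prev_id, instance_id, instance_id - prev_id AS gap
FROM   h
WHERE  instance_id - prev_id > 100
ORDER BY gap DESC;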

I did a bit more drilling and found that 3 jobs handling log shipping from a particular server were showing HUGE amounts of history. Drilling deeper, I found they were generating errors: “Could not delete log file….”. Sure enough, I went to the directories where the files were stored and there were log files going back to November. Each directory had close to 22,000 log files that should have been deleted and weren’t.
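For the record, the drilling itself was nothing fancy; counting history rows per job and then skimming the failed steps gets you most of the way there. A sketch:

-- Which jobs are generating the most history?
SELECT   j.name, COUNT(*) AS history_rows
FROM     msdb.dbo.sysjobhistory AS h
JOIN     msdb.dbo.sysjobs AS j ON j.job_id = h.job_id
GROUP BY j.name
ORDER BY history_rows DESC;

-- And what are the noisy ones complaining about?
SELECT TOP (20) j.name, h.run_date, h.run_time, h.message
FROM     msdb.dbo.sysjobhistory AS h
JOIN     msdb.dbo.sysjobs AS j ON j.job_id = h.job_id
WHERE    h.run_status = 0              -- 0 = failed
ORDER BY h.instance_id DESC;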

Now I was closer to an answer. Back in November we had had issues with this server and I had to do a partial rebuild of it. And back then I had had some other issues related to log-shipping and permissions. I first checked permissions, but everything seemed fine.

I then decided to check attributes, and sure enough all these files (based on the subdirectory attribute setting) had the R (read-only) attribute set. No wonder they couldn’t be deleted.

Now I’m trying to figure out how they got their attribute values set to R. (This is a non-traditional log-shipping setup, so it doesn’t use the built-in SQL Server tools to copy the files. It uses rsync to copy files through an SSH tunnel.)

So the mystery isn’t fully solved. It won’t be until I understand why they had the R attribute set and whether it will happen again. That particular issue I’m still drilling into. But at least now I know why I hit the 2 billion IDENTITY limit in my history table.

But this is a good example of why it’s necessary to follow an error through to its root cause. All too often as an IT manager I’ve seen people who reported to me fix the immediate symptom, but not the root cause. Had I done that here, i.e. simply cleared the history and reset the IDENTITY value, I’d have faced the same problem again a few weeks or months from now.

Moral of the story: When troubleshooting, it’s almost always worth taking the time to figure out not just what happened and fixing that, but WHY it happened and preventing it from happening again.


Testing

This ties in with the concept of experimentation. Thomas Grohser related a story the other night of a case of “yeah, the database failed and we tried to do a restore and found out we couldn’t.”

Apparently their system could somehow make backups, but couldn’t restore them. BIG OOPS.  (Apparently they managed to create an empty database and replay 4.5  years of transaction logs and recover their data. That’s impressive in its own right.)

This is not the first time I’ve worked with a client, or heard of a company, whose disaster recovery plans didn’t survive their first actual use. It may sound obvious, but companies need to test their DR plans. I’m in fact working with a partner on a new business to help companies think about their DR plans. Note, we’re NOT writing or creating DR plans for companies; we’re going to focus on how companies go about actually implementing and testing their DR plans.

Fortunately, right now I’m working with a client that had an uncommon use case. They wanted a restore of the previous night’s backup to a different server every day.

They also wanted to log-ship the database in question to another location.

This wasn’t hard to implement.

But what is very nice about this setup is that every 15 minutes we have a built-in automatic test of their log backups. If for some reason log backups stop working or a log gets corrupt, we’ll know in fairly short order.
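On the receiving side, that built-in test is easy to keep an eye on, since msdb on the secondary records every restore. A sketch of the kind of check I mean (the database name and the 30-minute threshold are made up for illustration):

-- Run on the log-shipping secondary: returns a row only if no log
-- has been restored for the database in the last 30 minutes
SELECT  destination_database_name,
        MAX(restore_date) AS last_log_restore
FROM    msdb.dbo.restorehistory
WHERE   destination_database_name = 'ImportantDB'
  AND   restore_type = 'L'              -- L = transaction log restore
GROUP BY destination_database_name
HAVING  MAX(restore_date) < DATEADD(MINUTE, -30, GETDATE());

Wire something like that to an alert and a broken chain announces itself, instead of waiting around to be discovered.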

And, with the database copy, we’ll know within a day if their backups fail.  They’re in a position where they’ll never find out 4.5 years later that their backups don’t work.

This client’s DR plan needs a lot of work; they actually have nothing formal written down. However, they know for a fact their data is safe. This is a huge improvement over companies that have a DR plan, but have no idea if their data is safe.

Moral of the story: I’d rather know my data is safe and my DR plan needs work than have a DR plan but not have safe data.

Experimenting

There are times when you have to take at face value what you are told.

There are 1.31 billion people living in China. This according to several sources (that all probably go back to the same official document from the Chinese government.)  I’m willing to believe that number. I’m certainly not going to go to China and start counting heads. For one, I don’t have the time, for another, I might look awfully weird doing so. It’s also accurate enough for any discussions I might have about China. But if I were going to knit caps for every person in China I might want a more accurate number.

That said, sometimes one shouldn’t take facts at face value. A case in point is given below. Let me start out by saying the person who gave me this fact wasn’t wrong. At least they’re no more wrong than the person who tells me that the acceleration due to gravity is 9.8 m/s². No, they are at worst inaccurate and more likely imprecise. Acceleration due to gravity here on Earth IS roughly 9.8 m/s². But it varies depending on where on the surface I am. And if I’m on the Moon it’s a completely different value.

Sometimes it is in fact possible to actually test things, and often worth it. I work with SQL Server and this is very true here. If a DBA tells you with absolute certainty that a specific setting should be set, or a query must be written a specific way, or an index rebuilt automatically at certain times, ask why. The worst answer they can give is, “I read it some place.” (Please note, this is a bit different from saying, “Generally it’s best practice to do X”. Now we’re back to saying 9.8 m/s², which is good enough for most things, but may not be good enough if, say, you want to precisely calibrate a piece of laboratory equipment.)

The best answer is “because I tested it and found that it works best”.

So, last night I had the pleasure of listening to Thomas Grohser speak on the SQL Server I/O engine at a local SQL Server User Group meeting. As always it was a great talk. At one point he was talking about backups and various ways to optimize them. He made a comment about setting the maxtransfersize to 4MB being ideal. Now, I’m sure he’d be the first to add the caveat, “it depends”. He also mentioned how much compression can help.

But I was curious and wanted to test it. Fortunately I had access to a database that was approximately 15GB in size. This seemed like the perfect size with which to test things.

I started with:

backup database TESTDB to disk='Z:\backups\TESTDB_4MB.BAK' with maxtransfersize=4194304

This took approximately 470 seconds and had a transfer rate of 31.151 MB/sec.

backup database TESTDB to disk='Z:\backups\TESTDB_4MB_COMP.BAK' with maxtransfersize=4194304, compression

This took approximately 237 seconds and a transfer rate of 61.681 MB/sec.

This is almost twice as fast.  While we’re chewing up a few more CPU cycles, we’re writing a lot less data.  So this makes a lot of sense. And of course now I can fit more backups on my disk. So compression is a nice win.

But what about the maxtransfersize?

backup database TESTDB to disk='Z:\backups\TESTDB.BAK'

This took approximately 515 seconds and a transfer rate of 28.410 MB/sec. So far, it looks like changing the maxtransfersize does help a bit (about 8%) over the default.

backup database TESTDB to disk='Z:\backups\TESTDB_comp.BAK' with compression

This took approximately 184 seconds with a transfer rate of 79.651 MB/sec.  This is the fastest of the 4 tests and by a noticeable amount.

Why? I honestly don’t know. If I were really trying to optimize my backups, most likely I’d run each of these tests 5-10 more times and take an average. This may be an outlier. Or perhaps the 4MB test with compression ran slower than normal. Or there may be something about the disk setup in this particular case that makes it the fastest method.

The point is, this is something that is easy to set up and test. The entire testing took me about 30 minutes and was done while I was watching TV last night.
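And if you’d rather not keep a stopwatch handy, msdb already records what you need to compare runs. A sketch of a query (pointed at my hypothetical TESTDB; compressed_backup_size needs SQL Server 2008 or later) showing duration, size written, and effective throughput for each full backup:

-- Compare the test runs: duration, bytes actually written, and throughput
SELECT  bs.backup_start_date,
        bmf.physical_device_name,
        DATEDIFF(SECOND, bs.backup_start_date, bs.backup_finish_date) AS seconds,
        bs.backup_size / 1048576.0            AS data_mb,
        bs.compressed_backup_size / 1048576.0 AS written_mb,
        (bs.backup_size / 1048576.0)
          / NULLIF(DATEDIFF(SECOND, bs.backup_start_date, bs.backup_finish_date), 0) AS mb_per_sec
FROM    msdb.dbo.backupset AS bs
JOIN    msdb.dbo.backupmediafamily AS bmf
        ON bmf.media_set_id = bs.media_set_id
WHERE   bs.database_name = 'TESTDB'
  AND   bs.type = 'D'                    -- full backups only
ORDER BY bs.backup_start_date;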

So before you simply read something on some blog someplace about “you should do X to SQL Server”, take the time to test it. Perhaps it’s a great solution in your case. Perhaps it’s not. Perhaps you’ll end up finding an even better solution.