Yesterday was a Monday. I don’t just mean it was Monday; I mean it was, in the Garfield comic sense of things, A Monday.
As a consultant, I’ve come to expect certain patterns in my workload. For one client, I know that roughly every two months I’m going to spend two weekends patching their SQL Servers. I know certain passwords will need to be updated quarterly or annually. And I know sometimes I’ll have A Monday.
Yesterday was one of those. I woke up, checked my email, and noticed two jobs had not run. So I logged in, and it appeared that the PowerShell script on each server had hung. I killed it and tried to rerun it, but got an error. This wasn’t entirely surprising. The first part of this script downloads a file from a third-party vendor, and last week, for example, their SFTP server had been down. At first I expected this to be the problem again, but further testing showed I was getting inconsistent errors. Finally the script ran. But what normally takes about 20 minutes to download took about 2 hours. We learned later the vendor had done an upgrade to their product over the weekend. This shouldn’t have impacted their SFTP server performance, but here we were. Today (Tuesday) the process took 20 minutes again and is back to normal. Chalk yesterday’s issue up to being A Monday.
Then I took a look at another job that had failed. This one is purely internal: it basically SFTPs a file from a Linux server to a NAS for a backup. A quick check showed that the NAS share was inaccessible. Reporting this triggered an avalanche of emails back and forth. The most interesting line basically came down to “Yes, the internal IT team did a migration of the NAS, but the migration was supposed to be completely transparent to the users.” Famous last words in my book. Honestly, what I found more disturbing was that the failure on the new NAS device was apparently due to a typo. To me, this means that, most likely, all the old shares were recreated on the new device by hand, rather than by a script that read out the old shares and recreated them. In any event, the problem was solved, the job was rerun, and the backup was created on the new NAS. Chalk that one up to being A Monday.
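To illustrate why a scripted comparison catches this kind of typo, here is a minimal Python sketch. It assumes the share names from the old and new devices have already been exported to plain lists. The share names and the `compare_shares` helper are hypothetical, and nothing here talks to a real NAS.

```python
from difflib import get_close_matches


def compare_shares(old_shares, new_shares):
    """Map each share missing from the new device to any likely misspelling.

    Both arguments are plain lists of share names (hypothetical exports from
    the old and new NAS). A missing share with no near match maps to [].
    """
    old_set, new_set = set(old_shares), set(new_shares)
    missing = sorted(old_set - new_set)      # on the old NAS, not on the new one
    unexpected = sorted(new_set - old_set)   # on the new NAS only
    report = {}
    for name in missing:
        # A close match among the unexpected names suggests a hand-typed typo.
        report[name] = get_close_matches(name, unexpected, n=1, cutoff=0.8)
    return report


# Hypothetical share lists, for illustration only.
old = ["backups", "finance", "linux_dumps"]
new = ["backups", "finance", "linux_dmups"]

for share, candidates in compare_shares(old, new).items():
    print(f"missing: {share}; possible typo on new device: {candidates}")
```

Run against the sample lists, the check flags `linux_dumps` as missing and pairs it with the misspelled `linux_dmups`, which is the sort of mismatch that otherwise only surfaces when a job fails.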
Then one of the developers for one of the platforms at this client emailed me and said, “Hey database FOO is in recovery mode, what happened?” Fortunately, I knew exactly what the problem was with this one. Unfortunately, I also knew it was my fault. We had decided to reconfigure that database to be a log-shipped copy of the main database, and I had set it up over the weekend. I had simply forgotten to configure it to place itself in standby/read-only mode after it had applied the most recent logs. I’ll chalk that one up to it being A Monday.
All of the above was taken care of before 10:00 AM. The rest of the day was filled with a variety of other issues and items, including looking at a Hyper-V host machine with 16 physical CPUs, with hyperthreading turned on, hosting 4 VMs: 1 with 4 vCPUs allocated and the other 3 with 8 each. They’re having performance issues. I’m still tackling that one. Looking at that happened on Monday, but it’s not A Monday issue; it’s been an ongoing issue for months.
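For a sense of the numbers on that host, here is a small Python sketch of the vCPU arithmetic. It assumes "16 physical CPUs" means 16 physical cores and that hyperthreading presents 2 logical processors per core. It only adds up the allocations and is not a diagnosis of the performance problem.

```python
# Assumption: 16 physical cores, each exposing 2 logical processors
# because hyperthreading is turned on.
PHYSICAL_CORES = 16
THREADS_PER_CORE = 2

# vCPU allocations from the post: one VM with 4, three VMs with 8 each.
vm_vcpus = [4, 8, 8, 8]


def vcpu_summary(physical_cores, threads_per_core, vcpus):
    """Return the host's logical processor count and vCPU allocation ratios."""
    logical = physical_cores * threads_per_core
    total = sum(vcpus)
    return {
        "logical_processors": logical,
        "total_vcpus": total,
        "vcpus_per_physical_core": total / physical_cores,
        "vcpus_per_logical_processor": total / logical,
    }


summary = vcpu_summary(PHYSICAL_CORES, THREADS_PER_CORE, vm_vcpus)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Under those assumptions the VMs hold 28 vCPUs against 32 logical processors, or 1.75 vCPUs per physical core.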
So what was it about this particular Monday, or Mondays in general?
Well, in this case, all 3 of my early AM issues had one thing in common: upgrades or changes made over the weekend. I’m not going to debate the value or wisdom of the timing here, but I’ll just note that on this particular Monday, it wasn’t just one issue, but three. It was definitely A Monday. But I survived, as did my customer.
Now back to my regularly scheduled workload.