I want to pass on a video I’ve finally gotten around to watching:
I’ve managed a number of on-call teams with varying degrees of success. One point I’d add that makes a difference is good buy-in from above.
He makes several good points, most of which I fully agree with and have adopted at one time or another in my own jobs.
One thing he mentions is availability. Too often folks claim they need 99.999% uptime. My question has often been “why?”, followed by “Are you willing to pay for that?” Often the why boiled down to “umm.. because…” and the answer on paying for it was “no”, at least once they realized the true cost.
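To put a number on what those nines mean, here is a small sketch that turns an availability target into a yearly downtime budget. The function name is mine, and it assumes a 365-day year with downtime measured in minutes; it is an illustration, not anything taken from the video.

```python
# Minutes in a 365-day year (leap years ignored for simplicity).
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_budget_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        budget = downtime_budget_minutes(target)
        print(f"{target}% uptime allows {budget:.1f} minutes of downtime per year")
```

Five nines works out to a little over five minutes a year, which is less time than it takes most people to wake up, log in, and read the alert.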
I also had a rule that I sometimes used: “If there is no possible response, or no response is necessary, don’t bother alerting!”
An example might be traffic flow. I’ve seen setups where a page would go out if the traffic exceeded a certain threshold even once in, say, a one-hour period (assume monitoring every 5 seconds). Why? By the time you respond, the spike is gone and there’s nothing to do.
A far better approach is to automate the check so that an alert goes out only if it happens more than X times in Y minutes.
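As a rough sketch of the "more than X times in Y minutes" idea, the class below keeps the timestamps of recent threshold breaches and only signals an alert once the count inside the window exceeds the limit. The class name, its parameters, and the numbers in the usage note are all hypothetical; a real monitoring system would normally express this in its own rule language.

```python
from collections import deque


class ThresholdAlerter:
    """Alert only when breaches exceed max_breaches within a sliding window."""

    def __init__(self, max_breaches: int, window_seconds: float):
        self.max_breaches = max_breaches      # the "X" in the rule
        self.window_seconds = window_seconds  # the "Y", expressed in seconds
        self._breaches = deque()              # timestamps of recent breaches

    def record(self, timestamp: float, value: float, threshold: float) -> bool:
        """Record one sample and return True if an alert should be sent."""
        if value > threshold:
            self._breaches.append(timestamp)
        # Forget breaches that have aged out of the window.
        while self._breaches and timestamp - self._breaches[0] > self.window_seconds:
            self._breaches.popleft()
        return len(self._breaches) > self.max_breaches
```

With `ThresholdAlerter(max_breaches=3, window_seconds=600)`, a single spike is recorded and then quietly ages out, while a fourth breach inside ten minutes returns `True` and would trigger the page.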
In some cases, simply retrying works. In the SQL world I’ve seen re-index jobs fail due to locking or other issues. I like my sleep. So I set up most of my jobs to retry at least once on failure.
Then, later, I’ll review the logs. If I see a constant stream of retries, I’ll schedule time to fix the underlying problem.
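The retry-then-review pattern might look something like the sketch below. It is not the SQL Server job configuration I actually used; `run_with_retry`, its parameters, and the logger name are made up for illustration, and the job is assumed to be any callable that raises an exception on failure.

```python
import logging
import time

logger = logging.getLogger("jobs")  # hypothetical logger name


def run_with_retry(job, retries=1, delay_seconds=300, sleep=time.sleep):
    """Run job(); on failure, wait and try again.

    The exception is re-raised (which is where a page would be sent)
    only after the final attempt has failed. Every failed attempt is
    logged so the retries can be reviewed later, during normal hours.
    """
    for attempt in range(1, retries + 2):
        try:
            return job()
        except Exception:
            if attempt > retries:
                logger.error("job failed after %d attempts", attempt)
                raise
            logger.warning(
                "attempt %d failed; retrying in %s seconds",
                attempt, delay_seconds, exc_info=True,
            )
            sleep(delay_seconds)
```

The `sleep` parameter exists only so the delay can be stubbed out when testing; in normal use the default `time.sleep` applies.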
At one client, we had an issue where a job would randomly fail maybe once a month. They would page someone about it, who would rerun the job and it would succeed.
I looked at the history and realized that simply waiting about 5 minutes after a failure and then retrying would reduce the number of times someone had to be called from about once a month to once every 3 years or so. Fifteen minutes of reviewing the problem during a normal 9-5 workday, plus 5 minutes of checking the math and implementing the fix, meant the on-call person could get more sleep every month. A real win.
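For anyone who wants to check that kind of math themselves, here is a back-of-the-envelope sketch. The inputs are illustrative assumptions, not the client's actual figures: it supposes the job runs once a day, fails about one run in thirty, and that a retry after the delay fails independently of the first attempt.

```python
def days_between_pages(failure_prob: float, retries: int,
                       runs_per_day: float = 1.0) -> float:
    """Expected days between pages when every attempt must fail to page.

    Assumes each attempt fails independently with probability failure_prob.
    """
    page_prob_per_run = failure_prob ** (retries + 1)
    return 1 / (page_prob_per_run * runs_per_day)


if __name__ == "__main__":
    p = 1 / 30  # assumed: roughly one failure a month for a daily job
    print(f"no retry:  a page every {days_between_pages(p, retries=0):.0f} days")
    print(f"one retry: a page every {days_between_pages(p, retries=1) / 365:.1f} years")
```

Under those assumptions a single retry stretches the gap between pages from about a month to roughly 900 days, which is in the same ballpark as the figure above. The independence assumption is the weak spot: the delay before retrying is there to give whatever caused the first failure time to clear.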
Moral of the story: Not everything is critical, and if something is, handle it as if it is, not as an afterthought.