Just a short blog post today since I’m actually in the middle of a call with a client as we test our failover scenario.
Right now I’m calling it a success even though the SQL Server hasn’t come up yet.
Why am I calling it a success? Because we learned that our current plan has a serious gap in how the iSCSI drives fail over. Yes, technically we failed to fail over as quickly as we expected.
But we learned that before this system went into production. So that’s a success. This raises our confidence level for next time.
In all honesty, we often learn more from our failures than from our successes. For example, before NASA would allow SpaceX to fly a crew on Crew Dragon, they required several abort tests, one of which involved launching a Falcon 9 and then firing the Crew Dragon abort engines in mid-flight. This resulted in the destruction of the Falcon 9 (which was expected) but proved the abort plans worked. Note, however, that for Orion on Artemis, NASA has decided such a test is not necessary. The decision-making process behind that choice is worthy of a blog post of its own.
In any case, with the current DR test, we expect to have things finally failed over in the next hour or two. Then we’ll update our playbook and have a lot more confidence.
Moral of the story: test your DR. Assume things will go wrong the first time, because they will, but it is far better to have that happen before you go to production. This is not the first time I’ve had a failover not go as planned, but, like this one, the others happened prior to production.