Last In - First Out: Simple Steps to Improving Availability

A poorly managed high availability cluster will have lower availability than a properly managed non-redundant system.

That's a bold statement, but I'm pretty sure it's true. The bottom line is that the path to improving system availability begins with the fundamentals of system management, not with redundant or HA systems. Only after you have executed the fundamentals will clustering or high availability make a positive contribution to system availability.

Here are the five transitions that are the critical steps on the path to improved availability:

Transition #1: From ad-hoc system management to structured system management.

Structured system management implies that you understand the fundamentals of building, securing, deploying, monitoring, logging, alerting, and documenting networks, servers, and applications; that you have those fundamentals in place; that you execute them consistently; and that you know every case where you are inconsistent. Ad hoc system management doesn't cut it.

Transition #2: From ad-hoc changes to simple change management.

Simple change management means that you have controls around changes sufficient to determine who/what/when/why on any change to any critical system or application file. Changes are predicted. Changes are documented. Changes are not random, and they do not 'just happen'. A text file 'changes.txt' edited with notepad.exe and stored in c:/changelog/ is not as comprehensive as a million dollar consultant-driven enterprise CMDB that takes years to implement, but it is a huge step in the right direction, and it certainly provides more incremental value at less cost than the big solution.
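
If hand-editing the file feels too loose, a few lines of script will keep the entries consistent. What follows is only a minimal sketch of the who/what/when/why idea, not a prescribed tool: the function name record_change, the tab-separated layout, and the UTC timestamp format are all my assumptions for illustration.

```python
import getpass
from datetime import datetime, timezone
from pathlib import Path


def record_change(log_path, system, what, why, who=None, when=None):
    """Append one who/what/when/why entry to a plain-text change log.

    Hypothetical helper: the name and the line layout are illustrative only.
    """
    who = who or getpass.getuser()
    # Normalize to UTC so entries from different machines sort together.
    when = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
    fields = [when.strftime("%Y-%m-%d %H:%M:%SZ"), who, system, what, why]
    # Collapse tabs and newlines so every change stays on exactly one line.
    line = "\t".join(" ".join(str(field).split()) for field in fields)
    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as log:
        log.write(line + "\n")
    return line
```

A call such as record_change("c:/changelog/changes.txt", "mail server", "raised the connection limit", "queue backed up during peak hours") appends one line that answers all four questions. Because the file is append-only text, it can still be read and searched with notepad.exe.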

Transition #3: From 'i dunno....maybe.....' to root cause analysis.

Failures have a cause. All of them. The 'cosmic ray did it' excuse is bullshit. Find the root cause. Fix the core problem. You need to be able to determine that 'the event was caused by .... and can be resolved by ... and can be prevented from ever happening again by ...'. If you cannot find the cause and you have to resort to killing a process or rebooting a server to restore service, then you must add instrumentation, monitoring or debugging to your system sufficient so that the next time the event happens, you will find the cause.
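
One way to make that instrumentation a habit is to capture evidence every time, before anyone kills the process or reboots the box. The sketch below is an assumption-heavy illustration, not a finished tool: capture_evidence and the idea of passing in named collector functions are hypothetical, and what each collector gathers (thread dumps, open connections, disk usage, and so on) is up to you and your platform.

```python
import json
import time
from pathlib import Path


def capture_evidence(event, collectors, out_dir):
    """Run every collector and save the results before service is restored.

    Hypothetical helper: `collectors` maps a label to a zero-argument
    function that returns something JSON-friendly.
    """
    snapshot = {
        "event": event,
        "captured_at": time.strftime("%Y-%m-%dT%H:%M:%S%z"),
        "results": {},
    }
    for name, collect in collectors.items():
        try:
            snapshot["results"][name] = collect()
        except Exception as exc:
            # One broken collector must not cost us the rest of the evidence.
            snapshot["results"][name] = f"collector failed: {exc!r}"
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / f"{event}-{int(time.time())}.json"
    # default=str keeps odd return types from aborting the write.
    path.write_text(json.dumps(snapshot, indent=2, default=str), encoding="utf-8")
    return path
```

Call it first, restart second. If the snapshot still doesn't explain the failure, that tells you which collector to add before the next occurrence.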

Transition #4: From 'try it...I think it will work..' to 'my tests show that......'

Comprehensive pre-production testing ensures that the systems that you build and the changes that you make will work as expected. You know that they will work because you tested them, and in the rare case that they do not work as expected, you will be able to determine the variation between test and production and devise a test that accommodates the differences.

Transition #5: From non-redundant systems to simple redundancy.

Finally, after you've made transitions one through four, you are ready for implementation of basic active/passive redundancy. Skipping ahead to transition #5 isn't going to get you to your availability goals any sooner.

Remember though, keep it simple. Complexity doesn't necessarily increase availability.

--Mike
