Last week I read a post on Data Center Knowledge by David Bills, our chief reliability strategist. David is responsible for the broad evangelism of the company’s online service reliability programs. His latest item is a follow-on to his earlier articles, “Designing for Dependability in the Cloud” and “Microsoft’s Journey: Solving Cloud Reliability With Software.”

"In part three, I discuss the cultural shift and evolving engineering principles Microsoft is using to help improve the dependability of the services we offer and help customers realize the full potential of the cloud."

David highlights the importance of identifying as many potential failure conditions as possible up front, in the service design phase, so we can map out how the service should react when the unexpected occurs. (So really, it's expected, if you've mapped out the potential issues thoroughly enough.)

"Many services teams employ fault modeling (FMA) and root cause analysis (RCA) to help them improve the reliability of their services and to help prevent faults from recurring. It’s my opinion that these are necessary but insufficient. Instead, the design team should adopt failure mode and effects analysis (FMEA) to help ensure a more effective outcome.

"FMA refers to a repeatable design process that is intended to identify and mitigate faults in the service design. RCA consists of identifying the factors that resulted in the nature, magnitude, location, and timing of harmful outcomes. The primary benefits of FMEA, a holistic, end-to-end methodology, include the comprehensive mapping of failure points and failure modes, which results in a prioritized list of engineering investments to mitigate known failures."

[Figure: FMEA key steps]
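
To make the "prioritized list of engineering investments" idea a bit more concrete, here is a minimal sketch of how a team might rank failure modes. It assumes the classic FMEA scoring convention, a risk priority number (RPN) equal to severity × occurrence × detection, each rated 1 to 10. David's post doesn't prescribe this scoring, and the components, descriptions, and ratings below are made up purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    """One row of a hypothetical FMEA worksheet (ratings are 1..10)."""
    component: str
    description: str
    severity: int    # 1 = negligible impact, 10 = catastrophic
    occurrence: int  # 1 = very rare, 10 = almost constant
    detection: int   # 1 = always caught quickly, 10 = very hard to detect

    @property
    def rpn(self) -> int:
        # Risk priority number: the conventional FMEA ranking score.
        return self.severity * self.occurrence * self.detection


def prioritize(modes):
    """Return failure modes sorted from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)


# Illustrative entries only; not taken from any real service.
FAILURE_MODES = [
    FailureMode("storage", "replica falls behind primary", 7, 4, 6),
    FailureMode("auth", "token service unavailable", 9, 3, 2),
    FailureMode("network", "rack switch drops packets", 5, 5, 3),
]

if __name__ == "__main__":
    for mode in prioritize(FAILURE_MODES):
        print(f"{mode.rpn:4d}  {mode.component}: {mode.description}")
```

The highest-scoring entries are the ones you'd look at first when deciding where to spend engineering time. A real worksheet would also record the effects of each failure and the planned mitigation.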

Akin to our work in scenario-focused engineering, groups should look at the entire stack, from the hardware and software we use to run our datacenters, through the infrastructure and wetware we use to power them, to the components in our cloud offerings.

Worth a quick read.