A brief look at the New Look in complex system failure, error, safety, and resilience by R. I. Cook

There’s been plenty of discussion of several major sites, like Instagram and Netflix, going down due to a cloud host disruption last week. It even made the national news, with CNN and others reporting on the outage. While applauding the rapid self-analyses into the failures (see AWS_June29_2012, WindowsAzure_Feb29_2012), we should keep in mind that those investigations are conducted by insiders with their own experiences and biases. Moreover, questions remain as to whether these types of accidents are foreseeable and, if so, whether they are preventable.

In his latest post, Tim O’Brien encourages us to look at the failures through more than just a technical lens and refers to the famous paper by Dr. Richard Cook of the University of Chicago, “How Complex Systems Fail”. It offers a cross-discipline perspective on complex systems. Among other things, Dr. Cook emphasizes the following:

  • Complex systems run as broken systems (with many latent failures within, but functional because they contain redundancies). This is reminiscent of Urs Hölzle’s statement that “at scale, everything breaks.”
  • Failures in complex systems require the combination of multiple factors (thus, looking for a single ‘root cause’ is fundamentally wrong; in the case of the latest outage, even though an electrical storm was initially blamed, it turned out that a slew of previously unseen bugs extended the downtime).
  • Changes and interventions introduce complexity and new forms of failure (when new technologies are used to eliminate well-understood system failures or to gain high-precision performance, they often introduce new pathways to large-scale, catastrophic failures, which can potentially have an even greater impact than those eliminated by the new technology).
  • More robust system performance requires an appreciation of, and experience with, failure.

For a more in-depth look at complex system failures, with three good, thought-provoking case studies, I also recommend the “Thinking About Accidents and Systems” chapter of the Medication Safety Guide, authored by Cook and O’Connor.

Coming back to the topic of designing your cloud apps for high availability, read this insightful blog post by Twilio’s CTO, Dr. Evan Cooke, on how they’ve managed to scale out their services while minimizing the impact of underlying infrastructure outages. Though he never mentions CQRS, it feels like that’s what they are doing underneath.

Another worthy read is this post by Ingrid Lunden with several experts’ opinions amounting to a seemingly obvious message: to protect your system from host service disruptions, you must run it in multiple availability zones/data centers. While this geo-distribution mitigates some of the risks, I see some businesses needing to take this even further by running their systems not only in multiple zones but also across multiple providers.
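To make the multi-zone, multi-provider idea a bit more concrete, here is a minimal sketch of client-side failover in Python. It is not taken from any of the posts mentioned above. The endpoint names (`provider-a/zone-1` and so on), the `call_with_failover` helper and the stand-in handlers are all hypothetical, and real deployments would more likely rely on health checks, DNS or load balancers than on a simple ordered list.

```python
from typing import Callable, Sequence, Tuple


class AllEndpointsFailed(Exception):
    """Raised when every configured endpoint has failed."""


def call_with_failover(
    endpoints: Sequence[Tuple[str, Callable[[str], str]]],
    request: str,
) -> Tuple[str, str]:
    """Try each endpoint in order and return (endpoint name, response).

    Failures are collected rather than swallowed, so the final error
    lists every factor that contributed, not a single 'root cause'.
    """
    errors = []
    for name, handler in endpoints:
        try:
            return name, handler(request)
        except Exception as exc:  # broad on purpose: any failure triggers failover
            errors.append(f"{name}: {exc}")
    raise AllEndpointsFailed("; ".join(errors))


# Hypothetical stand-ins for real service calls in each zone.
def zone_down(request: str) -> str:
    raise ConnectionError("zone unavailable")


def zone_up(request: str) -> str:
    return f"handled {request}"


# Assumed layout: two zones at one provider, then a zone at a second provider.
ENDPOINTS = [
    ("provider-a/zone-1", zone_down),
    ("provider-a/zone-2", zone_down),
    ("provider-b/zone-1", zone_up),
]

served_by, response = call_with_failover(ENDPOINTS, "GET /status")
print(served_by, "->", response)
```

In this toy setup both zones of the first provider are down, so the request ends up being served by the second provider. Keep in mind Dr. Cook’s point about interventions: the failover logic is itself new complexity, and it can fail in its own ways.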

Image credit: R. I. Cook, 2005 modified from J. Reason, 1990. Used with permission.