There’s been plenty of discussion about several major sites, including Instagram and Netflix, going down because of a cloud host disruption last week. It even made the national news, with CNN and others reporting on the outage. While we can applaud the rapid self-analyses of the failures (see AWS_June29_2012, WindowsAzure_Feb29_2012), we should keep in mind that those investigations are conducted by insiders with their own experiences and biases. Moreover, questions remain about whether these types of accidents are foreseeable and, if so, whether they are preventable.
In his latest post, Tim O’Brien encourages us to look at the failures through more than just a technical lens and refers to the famous paper by Dr. Richard Cook of the University of Chicago, “How Complex Systems Fail”. It offers a cross-discipline perspective on complex systems. Among other things, Dr. Cook emphasizes the following:
For a more in-depth look at complex system failures, with three good, thought-provoking case studies, I also recommend the “Thinking About Accidents and Systems” chapter by Cook and O’Connor in the Medication Safety Guide.
Coming back to the topic of designing your cloud apps for high availability, read this insightful blog post by twilio’s CTO, Dr. Evan Cooke, on how they’ve managed to scale out their services while minimizing the impact of underlying infrastructure outages. Though he never mentions CQRS, it feels like that’s what they are doing underneath.
Another worthy read is this post by Ingrid Lunden, with several experts’ opinions amounting to a seemingly obvious message: to protect your system from host service disruptions, you must run it in multiple availability zones or data centers. While this geo-distribution mitigates some of the risks, I see some businesses that would need to take this even further by running their systems not only in multiple zones but also across multiple providers.
Image credit: R. I. Cook, 2005 modified from J. Reason, 1990. Used with permission.