This post describes the basics of business continuity. Business continuity poses both business and technical challenges. In this series, we focus on the technical challenges, which include failure detection, response, diagnosis, and defect correction. Since failure diagnosis and defect correction are difficult to automate effectively in the platform, however, we narrow the focus even further to failure detection and response.
Failure Detection
Failures are expected during normal operation due to the characteristics of the cloud environment.
They can be detected by monitoring the service. Reactive monitoring observes that the service is not meeting Service Level Objectives (SLOs), while proactive monitoring observes conditions which indicate that the service may fail, such as rapidly increasing response times or memory consumption.
Failure detection is always relative to some observer. For example, in a partial network outage, the service may be available to one observer but not another. If the network outage blocks all external access to the service, it will appear to be unavailable to all observers, even if it is still operating correctly. A service therefore cannot reliably detect its own failure.
Failure Response
Failure response involves two related but different objectives: high availability and disaster recovery. Both boil down to satisfying SLOs: high availability is about maintaining acceptable latencies, while disaster recovery is about recovering from outages in an acceptable amount of time with no more than an acceptable level of data loss.
High Availability
Availability is generally measured in terms of responsiveness. A service is considered responsive if it responds to requests within a specified amount of time. Availability is measured as the amount of time the service is responsive in some time window divided by the length of the window, expressed as a percentage. Most services typically offer at least 99.9% availability, which amounts to 43.2 minutes of downtime or less per month.
Availability SLOs may not offer an accurate indication of the actual availability of a service in practice, because of how responsiveness is defined. For example, a common way to define availability is to say that a service is considered responsive if at least one request succeeds in every 5 minute interval. By this definition, the service can be unresponsive for 4 minutes and 59 seconds in every 5 minute interval, providing an actual availability of 0.3%, while satisfying a stated SLO of 99.9% availability.
Disaster Recovery
Disaster recovery is governed by SLOs that become relevant when the service becomes unavailable.