This post describes the basics of business continuity, which poses both business and technical challenges. In this series, we focus on the technical challenges, which include failure detection, response, diagnosis, and defect correction. Because failure diagnosis and defect correction are difficult to automate effectively in the platform, we narrow the focus even further to failure detection and response.

Failure Detection

Failures are expected during normal operation due to the characteristics of the cloud environment.

Failures can be detected by monitoring the service. Reactive monitoring observes that the service is already failing to meet its Service Level Objectives (SLOs), while proactive monitoring observes conditions that indicate the service may fail soon, such as rapidly increasing response times or memory consumption.
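The difference between the two kinds of monitoring can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Sample` type, the 500 ms latency SLO, and the "latency doubled across the window" rule are all hypothetical choices, not something this post prescribes.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    latency_ms: float  # observed response time for one probe
    succeeded: bool    # whether the probe received a valid response


def reactive_alert(samples, slo_latency_ms=500.0):
    """Reactive check: the most recent probe already violates the SLO.

    The 500 ms threshold is an assumed SLO, chosen only for illustration.
    """
    latest = samples[-1]
    return (not latest.succeeded) or latest.latency_ms > slo_latency_ms


def proactive_alert(samples, growth_factor=2.0):
    """Proactive check: latency has grown by growth_factor across the window.

    This fires on the trend alone, whether or not the SLO is breached yet.
    """
    if len(samples) < 2:
        return False
    first, latest = samples[0], samples[-1]
    return latest.latency_ms >= growth_factor * first.latency_ms


# Latency is still under the assumed SLO, but it has more than tripled.
window = [Sample(100, True), Sample(150, True), Sample(320, True)]
print(reactive_alert(window), proactive_alert(window))
```

In this window the reactive check stays quiet because 320 ms is still under the assumed SLO, while the proactive check fires on the trend.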

Failure detection is always relative to some observer. For example, in a partial network outage, the service may be available to one observer but not another. If the network outage blocks all external access to the service, it will appear to be unavailable to all observers, even if it is still operating correctly. A service therefore cannot reliably detect its own failure.
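To make the observer-relative view concrete, here is a minimal sketch that combines reports from several external observers. The observer names and the three labels are hypothetical; the point is only that the verdict depends on who is looking, not on what the service believes about itself.

```python
def classify(observations):
    """Summarize reachability as seen by a set of external observers.

    observations maps an observer name to True if that observer could
    reach the service, and False otherwise.
    """
    if not observations:
        raise ValueError("at least one observer is required")
    reachable = sum(1 for ok in observations.values() if ok)
    if reachable == len(observations):
        return "available"
    if reachable == 0:
        # The service may still be running correctly behind the outage.
        return "unavailable to all observers"
    return "partially available"


# Partial network outage: one observer is cut off, the other is not.
print(classify({"observer-a": True, "observer-b": False}))
```

Note that the service itself is deliberately not one of the observers in this sketch.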

Failure Response

Failure response involves two related but different objectives: high availability and disaster recovery. Both boil down to satisfying SLOs: high availability is about maintaining acceptable latencies, while disaster recovery is about recovering from outages in an acceptable amount of time with no more than an acceptable level of data loss.

High Availability

Availability is generally measured in terms of responsiveness. A service is considered responsive if it responds to requests within a specified amount of time. Availability is then the amount of time the service is responsive in some time window divided by the length of the window, expressed as a percentage. Many services offer at least 99.9% availability, which amounts to at most 43.2 minutes of downtime in a 30-day month.
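The arithmetic behind these figures is easy to reproduce. The sketch below assumes a 30-day month (43,200 minutes); the function names are ours, chosen for illustration.

```python
MONTH_MINUTES = 30 * 24 * 60  # 43,200 minutes in an assumed 30-day month


def availability_percent(responsive_minutes, window_minutes):
    """Responsive time divided by window length, as a percentage."""
    return 100.0 * responsive_minutes / window_minutes


def downtime_budget_minutes(slo_percent, window_minutes):
    """Maximum downtime in the window that still satisfies the SLO."""
    return window_minutes * (1.0 - slo_percent / 100.0)


# Roughly 43.2 minutes for a 99.9% SLO (subject to floating-point rounding).
print(round(downtime_budget_minutes(99.9, MONTH_MINUTES), 1))
```

The same function gives roughly 4.3 minutes per month for a 99.99% SLO, which shows how quickly each additional "nine" shrinks the budget.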

Availability SLOs may not offer an accurate indication of the actual availability of a service in practice, because of how responsiveness is defined. For example, a common way to define availability is to say that a service is considered responsive if at least one request succeeds in every 5 minute interval. By this definition, the service can be unresponsive for 4 minutes and 59 seconds in every 5 minute interval, providing an actual availability of about 0.3% (1 second in every 300), while still satisfying a stated SLO of 99.9% availability.
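The gap between the two numbers can be checked with a small simulation. This sketch models one hour as a list of per-second booleans, which is a simplifying assumption (real measurements are per request, not per second), and compares the interval-based definition with the actual fraction of responsive time.

```python
def actual_availability(timeline):
    """Percentage of seconds in which the service was responsive."""
    return 100.0 * sum(timeline) / len(timeline)


def interval_availability(timeline, interval_seconds=300):
    """Percentage of intervals containing at least one responsive second."""
    intervals = [
        timeline[i:i + interval_seconds]
        for i in range(0, len(timeline), interval_seconds)
    ]
    responsive = sum(1 for chunk in intervals if any(chunk))
    return 100.0 * responsive / len(intervals)


# Worst case: 299 unresponsive seconds, then 1 responsive second, repeated
# for twelve 5 minute intervals (one hour).
timeline = ([False] * 299 + [True]) * 12

print(interval_availability(timeline))          # every interval counts as responsive
print(round(actual_availability(timeline), 2))  # a fraction of one percent
```

Under these assumptions the interval-based measure reports full availability, while the service was responsive for only 12 of 3,600 seconds.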

Disaster Recovery

Disaster recovery is governed by SLOs that become relevant when the service becomes unavailable.

  • A Recovery Time Objective (RTO) is the maximum amount of time that can elapse before normal operation is restored following a loss of availability. It typically includes the time required to detect the failure and the time required either to restore the affected components of the service or to fail over to new components if the affected components cannot be restored quickly enough.
  • A Recovery Point Objective (RPO) is the point in time to which the service must recover data committed by the user prior to the failure. It is usually expressed as a window of time before the failure, and that window should be less than or equal to what users consider an "acceptable loss" in the event of a disaster.
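As a rough sketch, both objectives can be checked after an incident by comparing timestamps. The function and parameter names here are hypothetical, and we assume the RPO is expressed as a maximum window of data loss before the failure, which is one common reading rather than the only one.

```python
from datetime import datetime, timedelta


def meets_rto(failure_at, restored_at, rto):
    """True if normal operation was restored within the RTO.

    The elapsed time covers detection plus restore or failover.
    """
    return restored_at - failure_at <= rto


def meets_rpo(failure_at, last_recovered_commit_at, rpo):
    """True if the data lost before the failure fits within the RPO window.

    Assumes the RPO is a duration (maximum acceptable data-loss window).
    """
    return failure_at - last_recovered_commit_at <= rpo


failure = datetime(2024, 1, 1, 12, 0)

# Restored 20 minutes after the failure, against an assumed 30 minute RTO.
print(meets_rto(failure, failure + timedelta(minutes=20), timedelta(minutes=30)))

# Newest recovered commit is 3 minutes old, against an assumed 5 minute RPO.
print(meets_rpo(failure, failure - timedelta(minutes=3), timedelta(minutes=5)))
```

The dates and objectives above are made-up values for illustration only.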