The team has recently released a new whitepaper, Disaster Recovery and High Availability for Windows Azure Applications.

The whitepaper outlines the architectural steps needed to disaster-proof a Windows Azure deployment so that the larger business continuity process can be implemented. A business continuity plan is a roadmap for continuing operations under adverse conditions, whether a technology failure, such as a downed service, or a natural disaster, such as a storm or power outage. Application resiliency for disasters is only a subset of the larger DR process, as described in the NIST document Contingency Planning Guide for Information Technology Systems.

The following excerpt, taken from the blog post announcing the paper, describes what it covers. I would strongly suggest that anyone putting mission-critical systems in any cloud provider have a good read of this paper.

 

Characteristics of Resilient Cloud Applications

A well-architected application can withstand capability failures at a tactical level and can also tolerate strategic, system-wide failures at the datacenter level. The following sections define the terminology referenced throughout the document to describe various aspects of resilient cloud services.

High Availability

A highly available cloud application implements strategies to absorb the outage of dependencies, such as the managed services offered by the cloud platform. Despite possible failures of the cloud platform's capabilities, this approach permits the application to continue to exhibit the expected functional and non-functional systemic characteristics as defined by its designers. This is covered in depth in the paper Failsafe: Guidance for Resilient Cloud Architectures.

The implementation of the application needs to factor in the probability of a capability outage. It also needs to consider the business impact such an outage would have on the application before diving deep into the implementation strategies. Without due consideration of the business impact and the probability of hitting the risk condition, the implementation can be expensive and potentially unnecessary.

Consider an automotive analogy for high availability. Even quality parts and superior engineering do not prevent occasional failures. For example, when your car gets a flat tire, the car still runs, but it is operating with degraded functionality. If you planned for this potential occurrence, you can use one of those thin-rimmed spare tires until you reach a repair shop. Although the spare tire does not permit fast speeds, you can still operate the vehicle until the tire is replaced. In the same way, a cloud service that plans for potential loss of capabilities can prevent a relatively minor problem from bringing down the entire application. This is true even if the cloud service must run with degraded functionality.

There are a few key characteristics of highly available cloud services: availability, scalability, and fault tolerance. Although these characteristics are interrelated, it is important to understand each and how they contribute to the overall availability of the solution.

Availability

An available application considers the availability of its underlying infrastructure and dependent services. Available applications remove single points of failure through redundancy and resilient design. When we talk about availability in Windows Azure, it is important to understand the concept of the effective availability of the platform. Effective availability considers the Service Level Agreements (SLAs) of each dependent service and their cumulative effect on the total system availability.

System availability is the percentage of a given time window during which the system is able to operate. For example, the availability SLA for at least two instances of a web or worker role in Windows Azure is 99.95%. This percentage represents the amount of time the roles are expected to be available (99.95%) out of the total time they could be available (100%). It does not measure the performance or functionality of the services running on those roles. However, the effective availability of your cloud service is also affected by the various SLAs of the other dependent services. The more moving parts within the system, the more care must be taken to ensure the application can resiliently meet the availability requirements of its end users.

Consider the following SLAs for a Windows Azure service that uses Windows Azure roles (Compute), Windows Azure SQL Database, and Windows Azure Storage.

 

Windows Azure Service | SLA    | Potential Minutes Downtime/Month (30 days)
Compute               | 99.95% | 21.6
SQL Database          | 99.90% | 43.2
Storage               | 99.90% | 43.2

You must plan for all services to potentially go down at different times. In this simplified example, the application could be down for a total of 108 minutes per month. A 30-day month has 43,200 minutes, so 108 minutes is 0.25% of the total, which gives an effective availability of 99.75% for the cloud service.
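As a rough illustration of the arithmetic above, the following Python sketch combines the individual SLAs into an effective availability figure. The service names and SLA values mirror the table; this is only an illustration of the calculation, not an official Windows Azure tool.

```python
# Rough sketch: combine per-service SLAs into a worst-case effective
# availability figure, as in the example above. Values mirror the table.

MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month

slas = {
    "Compute": 0.9995,       # 99.95%
    "SQL Database": 0.9990,  # 99.90%
    "Storage": 0.9990,       # 99.90%
}

# Worst case assumes each service is down at a different time,
# so the potential downtime minutes simply add up.
total_downtime = sum((1 - sla) * MINUTES_PER_MONTH for sla in slas.values())
effective_availability = 1 - total_downtime / MINUTES_PER_MONTH

print(f"Potential downtime: {total_downtime:.0f} minutes/month")  # ~108
print(f"Effective availability: {effective_availability:.2%}")    # ~99.75%
```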

However, using availability techniques described in this paper can improve this. For example, if you design your application to continue running when SQL Database is unavailable, you can remove that line from the equation. This might mean that the application runs with reduced capabilities, so there are also business requirements to consider. For a complete list of Windows Azure SLAs, see Service Level Agreements.

Scalability

Scalability directly affects availability: an application that fails under increased load is no longer available. Scalable applications are able to meet increased demand with consistent results in acceptable time windows. When a system is scalable, it scales horizontally or vertically to manage increases in load while maintaining consistent performance. In basic terms, horizontal scaling adds more machines of the same size, while vertical scaling increases the size of the existing machines. In the case of Windows Azure, you have vertical scaling options for selecting various machine sizes for compute, but changing the machine size requires a re-deployment. Therefore, the most flexible solutions are designed for horizontal scaling.

This is especially true for compute, because you can easily increase the number of running instances of any web or worker role to handle increased traffic through the Azure Web portal, PowerShell scripts, or code. This decision should be based on increases in specific monitored metrics, so that performance as experienced by users does not noticeably degrade under load. Typically, the web and worker roles store any state externally to allow for flexible load balancing and to gracefully handle any changes to instance counts. Horizontal scaling also works well with services, such as Windows Azure Storage, that do not provide tiered options for vertical scaling.
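A minimal sketch of the kind of metric-driven scale-out decision described above is shown below. The metric names, thresholds, and instance bounds are hypothetical placeholders; the actual scaling action would go through the portal, PowerShell, or your own management code.

```python
# Hedged sketch: decide on a new role instance count from monitored metrics.
# Thresholds and metric choices are illustrative assumptions only.

def target_instance_count(current: int, avg_cpu: float, queue_depth: int,
                          min_instances: int = 2, max_instances: int = 20) -> int:
    """Return the desired number of role instances for the next interval."""
    if avg_cpu > 75 or queue_depth > 1000:
        desired = current + 2   # scale out under sustained load
    elif avg_cpu < 25 and queue_depth < 100:
        desired = current - 1   # scale in when the load drops off
    else:
        desired = current       # hold steady inside the comfort band
    # Keep at least two instances so the compute SLA still applies.
    return max(min_instances, min(max_instances, desired))

print(target_instance_count(current=4, avg_cpu=82.0, queue_depth=1500))  # 6
print(target_instance_count(current=4, avg_cpu=15.0, queue_depth=20))    # 3
```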

Cloud deployments should be seen as a collection of scale-units, which allows the application to be elastic in servicing the throughput needs of its end users. The scale-units are easier to visualize at the web and application server level, as Windows Azure already provides stateless compute nodes through web and worker roles. Adding more compute scale-units to the deployment will not cause any application state management side effects, because compute scale-units are stateless. A storage scale-unit is responsible for managing a partition of data, either structured or unstructured. Examples of storage scale-units include a Windows Azure Table partition, a Blob container, and a SQL Database. Even the use of multiple Windows Azure Storage accounts has a direct impact on application scalability. A highly scalable cloud service needs to be designed to incorporate multiple storage scale-units. For instance, if an application uses relational data, the data needs to be partitioned across several SQL Databases so that the storage can keep up with the elastic compute scale-unit model. Similarly, Windows Azure Storage allows data partitioning schemes that require deliberate design to meet the throughput needs of the compute layer. For a list of best practices for designing scalable cloud services, see Best Practices for the Design of Large-Scale Services on Windows Azure Cloud Services.
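To make the storage scale-unit idea concrete, here is a minimal sketch of routing rows to one of several SQL Database shards by hashing a partition key. The shard names, shard count, and key choice are invented for illustration and are not part of the whitepaper.

```python
# Hedged sketch: route data to one of several storage scale-units
# (for example, separate SQL Databases) by hashing a partition key.
# The shard list and key choice are illustrative assumptions.

import hashlib

SHARDS = ["customers_db_0", "customers_db_1", "customers_db_2"]

def shard_for(partition_key: str) -> str:
    """Map a partition key (e.g. a customer ID) to a stable shard name."""
    digest = hashlib.sha1(partition_key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("customer-42"))   # always resolves to the same shard for the same key
print(shard_for("customer-1337"))
```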

Fault Tolerance

Applications need to assume that every dependent cloud capability can and will go down at some point in time. A fault-tolerant application detects and maneuvers around failed elements so that it can continue to return correct results within a specific timeframe. For transient error conditions, a fault-tolerant system will employ a retry policy. For more serious faults, the application is able to detect problems and fail over to alternative hardware or contingency plans until the failure is corrected. A reliable application is able to properly manage the failure of one or more parts and continue operating. Fault-tolerant applications can use one or more design strategies, such as redundancy, replication, or degraded functionality.
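The retry policy mentioned above can be as simple as exponential backoff around a transient failure. The sketch below is generic Python, not the Windows Azure transient fault handling framework; the operation and the transient-error check are placeholders you would supply.

```python
# Hedged sketch: retry an operation with exponential backoff for transient
# faults, and give up (so a fallback path can take over) for persistent ones.

import random
import time

def with_retries(operation, is_transient, max_attempts=5, base_delay=0.5):
    """Call operation(); retry transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:
            if not is_transient(exc) or attempt == max_attempts:
                raise  # persistent fault, or retries exhausted: fail over instead
            # Exponential backoff with a little jitter to avoid synchronized retries.
            time.sleep(base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.1))
```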

Disaster Recovery

A cloud deployment might cease to function due to a systemic outage of the dependent services or the underlying infrastructure. Under such conditions, a business continuity plan triggers the disaster recovery (DR) process. This process typically involves both operations personnel and automated procedures to reactivate the application at a functioning datacenter, which requires transferring application users, data, and services to the new datacenter using backup media or ongoing replication.
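A highly simplified sketch of the kind of decision a DR runbook automates follows: detect that the primary deployment is unhealthy and redirect traffic to a standby deployment in another datacenter. The probe and redirect functions, threshold, and datacenter name are placeholders for whatever monitoring and traffic-management mechanism you actually use.

```python
# Hedged sketch of a DR failover decision. probe_primary() and
# redirect_traffic_to() stand in for real monitoring and traffic
# management; they are assumptions, not Windows Azure APIs.

import time

FAILURE_THRESHOLD = 3          # consecutive failed probes before failing over
PROBE_INTERVAL_SECONDS = 60

def monitor_and_failover(probe_primary, redirect_traffic_to,
                         secondary="secondary-datacenter"):
    failures = 0
    while True:
        if probe_primary():
            failures = 0
        else:
            failures += 1
            if failures >= FAILURE_THRESHOLD:
                # Systemic outage: trigger the DR process and move users
                # to the standby deployment.
                redirect_traffic_to(secondary)
                return
        time.sleep(PROBE_INTERVAL_SECONDS)
```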

Consider the previous analogy that compared high availability to the ability to recover from a flat tire through the use of a spare. By contrast, disaster recovery involves the steps taken after a car crash where the car is no longer operational. In that case, the best solution is to find an efficient way to change cars, perhaps by calling a travel service or a friend. In this scenario, there is likely going to be a longer delay in getting back on the road as well as more complexity in repairing and returning to the original vehicle. In the same way, disaster recovery to another datacenter is a complex task that typically involves some downtime and potential loss of data. To better understand and evaluate disaster recovery strategies, it is important to define two terms: recovery time objective (RTO) and recovery point objective (RPO).

RTO

The recovery time objective (RTO) is the maximum amount of time allocated for restoring application functionality. This is based on business requirements and is related to the importance of the application. Critical business applications require a low RTO.

RPO

The recovery point objective (RPO) is the acceptable time window of lost data due to the recovery process. For example, if the RPO is one hour, then the data must be completely backed up or replicated at least every hour. Once the application is brought up in an alternate datacenter, the restored data could be missing up to an hour of changes. Like RTO, critical applications target a much smaller RPO.
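As a trivial illustration of how an RPO translates into an operational rule, the sketch below checks whether the most recent backup or replication point is still within the agreed window. The one-hour RPO matches the example above; the function names are illustrative.

```python
# Hedged sketch: verify that the latest backup/replication point still
# satisfies the recovery point objective (one hour, as in the example).

from datetime import datetime, timedelta, timezone
from typing import Optional

RPO = timedelta(hours=1)

def rpo_satisfied(last_recovery_point: datetime,
                  now: Optional[datetime] = None) -> bool:
    """True if failing over right now would lose no more data than the RPO allows."""
    now = now or datetime.now(timezone.utc)
    return now - last_recovery_point <= RPO

last_backup = datetime.now(timezone.utc) - timedelta(minutes=45)
print(rpo_satisfied(last_backup))  # True: the worst-case loss is 45 minutes of data
```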

See Also

Business Continuity for Windows Azure
Business Continuity in Windows Azure SQL Database
High Availability and Disaster Recovery for SQL Server in Windows Azure Virtual Machines
Failsafe: Guidance for Resilient Cloud Architectures
Best Practices for the Design of Large-Scale Services on Windows Azure Cloud Services