Planning a High Availability (HA) and Disaster Recovery (DR) solution for an on-premises environment involves balancing business continuity requirements against the complexity and cost of implementation. It typically relies on hardware and software redundancy as protection against failures. As is well known, HADR can be fairly expensive, and raising HA targets can drive implementation cost up exponentially. Windows Azure provides various built-in platform capabilities for HADR that can help you reduce both complexity and cost.
In this blog post I describe an example of a Windows Azure cross-datacenter DR solution based on real-life projects I was involved in. I also discuss additional considerations for implementing DR for applications running in Windows Azure.
Applications running in Azure benefit immediately from the high availability of the underlying services provided within a datacenter. In addition, Azure offers a number of services that help you deliver a DR solution or cross-datacenter availability if required. The article Business Continuity for Windows Azure describes the fundamental concepts and Azure's built-in capabilities. However, applications still need to be designed and prepared to take advantage of those capabilities in order to provide high availability to the end user. Instead of relying on hardware redundancy as in the on-premises world, you need to think about utilizing redundant services of the underlying cloud platform (e.g. PaaS or IaaS) and design for resiliency: run multiple instances of compute roles, handle transient faults, and plan fallback strategies for identified single points of failure. Because cloud apps tend to be composed of multiple services, it is important to consider what is necessary to achieve HADR for each individual service of your app, instead of thinking about one approach for the entire app. The following two white papers provide excellent guidance on strategies for achieving HADR in a Windows Azure application: Disaster Recovery and High Availability for Windows Azure Applications, and Failsafe: Guidance for Resilient Cloud Architectures.
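To make the transient-fault handling mentioned above concrete, here is a minimal retry sketch in Python with exponential backoff and jitter. The TransientError type and the flaky operation are placeholders for whatever exceptions and calls your platform SDK actually surfaces (e.g. throttling responses or timeouts); this is an illustration of the pattern, not a specific library's API.

```python
import random
import time

# Placeholder for the transient exceptions a real SDK would raise
# (throttling, timeouts, temporary connection loss, and so on).
class TransientError(Exception):
    pass

def with_retries(operation, max_attempts=4, base_delay=0.1):
    """Retry an operation on transient faults with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the fault to a fallback path
            # Back off exponentially, with jitter to avoid synchronized retries.
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))

# Usage: wrap any call to a platform service that may fail transiently.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("throttled")
    return "ok"

print(with_retries(flaky))  # succeeds on the third attempt
```

Non-transient faults should not be retried this way; those are the single points of failure for which you plan explicit fallback strategies instead.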
Before deciding on techniques for achieving high availability and disaster recovery, it is important to define the requirements and expected availability for the sub-services/components of your solution. Typically, sub-components have different requirements for availability, scalability, and performance. For example, from a use case perspective, changing the settings and configuration of an application can have different performance and availability targets than performing a business transaction. Based on the type of service, its requirements, and its implementation details, you will apply different techniques. The following example distinguishes several categories of services and applies a different strategy to each. Figure 1 shows a simplified high-level overview of a cross-datacenter deployment of the system.
Figure 1 – Deployment overview with two Azure datacenters
A management portal provides functionality for configuring and managing different aspects of the solution, including provisioning of tenants, managing accounts and users, monitoring, notifications, billing, etc. The functionality exposed through the management portal UI is supported by a number of services, which, for simplicity, are omitted in Figure 1. For the purposes of this article only the provisioning service is relevant, and it is illustrated above. The management portal and its underlying services use Azure SQL Database to persist the configuration. The application itself – shown as Application Services in Figure 1 – consists of another set of cloud services hosting the application UI (separate from the management portal) and application-specific services. It uses Azure SQL Database as well as Windows Azure Storage (WAS) on the backend for persisting application data.
Each of the described components of the system has different HADR requirements, and I'll discuss each of them separately in more detail.
The management portal is considered extremely important because it provides critical operations management capabilities that affect all tenants. It is deployed in an 'active/active' configuration using Traffic Manager's performance policy to distribute traffic across the datacenters (see Business Continuity for Windows Azure for details on 'active/active' deployments). The data is replicated between the datacenters using SQL Data Sync. For this deployment option the application has to be designed to operate across datacenters. One major challenge is ensuring data consistency. There are several ways to achieve this, and the right one depends heavily on the app's specifics and workload.
In any case, you should consider potential conflicts and conflict resolution handling based on the application logic. If it is too challenging to guarantee data consistency for your application, you can also consider an 'active/passive' deployment. Be aware that in that case you'll have running instances in the secondary location that are not utilized. And even then, when you perform a failover between datacenters, you may still run into data consistency issues that you need to plan for.
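To illustrate what conflict resolution handling can look like, here is a minimal last-writer-wins sketch in Python. This is one illustrative policy, not SQL Data Sync's internal mechanism; the Row type and its per-row modification timestamp are assumptions made for the example, and your application logic may call for a different rule entirely (e.g. merging fields instead of picking a winner).

```python
from dataclasses import dataclass

@dataclass
class Row:
    key: str
    value: str
    modified_utc: float  # last-modified timestamp, assumed tracked per row

def resolve_conflict(primary: Row, secondary: Row) -> Row:
    """Last-writer-wins: keep the row with the newer modification time.
    Ties go to the primary datacenter's copy (the 'hub')."""
    if secondary.modified_utc > primary.modified_utc:
        return secondary
    return primary

# Usage: merge two datacenters' conflicting copies of the same row.
a = Row("tenant-42", "plan=standard", modified_utc=100.0)
b = Row("tenant-42", "plan=premium", modified_utc=105.0)
print(resolve_conflict(a, b).value)  # → plan=premium
```

Last-writer-wins silently discards the losing write, which is acceptable for some configuration data but not for all workloads; that trade-off is exactly why the resolution rule has to come from the application logic.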
The provisioning service – running in both datacenters – is responsible for the deployment of the application itself. This happens in scale units, so the service is used not only for the initial deployment but also for scaling. You can think of this service as an 'active/passive' type of deployment, because it is not used in the secondary datacenter unless there is a need for failover.
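The scale-unit idea can be sketched in a few lines: a scale unit bundles a fixed block of resources, and capacity grows by deploying whole units. The tenants-per-unit capacity and the deploy_unit hook below are hypothetical placeholders for your actual sizing and deployment automation.

```python
# A scale unit bundles a fixed set of resources (compute instances, a
# database, storage) deployed as one block; capacity grows unit by unit.
def units_needed(tenants: int, tenants_per_unit: int) -> int:
    """Number of scale units required to host the given tenant count."""
    return -(-tenants // tenants_per_unit)  # ceiling division

def provision(tenants: int, tenants_per_unit: int, deploy_unit) -> list:
    """Deploy one scale unit at a time via the (hypothetical) deploy_unit hook."""
    return [deploy_unit(i) for i in range(units_needed(tenants, tenants_per_unit))]

# Usage: 2,500 tenants at an assumed 1,000 tenants per unit → 3 scale units.
deployed = provision(2500, 1000, deploy_unit=lambda i: f"scale-unit-{i}")
print(deployed)  # → ['scale-unit-0', 'scale-unit-1', 'scale-unit-2']
```

Because each unit is deployed independently, the same routine serves both initial deployment and later scale-out – and, in the failover case, redeployment into the secondary datacenter.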
Finally, the app itself is deployed in only one datacenter. In the case of a disaster it can be redeployed in the secondary datacenter, a process automated by the provisioning service. The application packages are kept in a blob-storage-based library in the secondary datacenter. The provisioning service can be scaled out to greatly accelerate the provisioning of the application. The app uses two types of data: relational data and additional data stored in blob storage. Two different approaches are used to handle recovery of that data.

The relational data is more critical for the app and is restored from a database export (bacpac) kept in the secondary datacenter. The frequency of exports from the primary site defines the RPO. The RTO for the application depends on the time necessary to import the bacpac (and on the time necessary to deploy the app services, which is typically faster than the bacpac import). For more information on how to export a bacpac file to blob storage, see Business Continuity in Windows Azure SQL Database. Depending on your requirements, you might reduce the RTO by keeping a synchronized copy of the database in the secondary datacenter (e.g. by using SQL Data Sync) – you need to estimate the cost difference between the two approaches. Once the new SQL Database is restored and the application services have been spun up, traffic can be rerouted to the secondary datacenter using Traffic Manager and the app is fully operational.

The data stored in blob storage relies on the built-in geo-replication of Windows Azure Storage. In this specific example, the app is operational even without access to the blob data, so you can fail over to the secondary datacenter even if that data is not immediately available. Currently (as of this writing) a failover of geo-replicated blob storage is performed by Microsoft only in the case of a major disaster or datacenter-wide outage, when the primary site cannot be recovered in a reasonable timeframe.
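To make the RPO/RTO trade-off concrete, the arithmetic for this 'redeploy on disaster' approach can be sketched as below. The bacpac import and the app-service deployment can run in parallel, so the RTO is dominated by the slower of the two plus the rerouting time, while the worst-case RPO equals the export interval (a disaster striking just before the next export loses a full interval of changes). All numbers here are illustrative, not measured values.

```python
def estimated_rto_minutes(bacpac_import_min: float,
                          app_deploy_min: float,
                          reroute_min: float) -> float:
    """RTO: slower of the parallel recovery steps, plus traffic rerouting."""
    return max(bacpac_import_min, app_deploy_min) + reroute_min

def worst_case_rpo_minutes(export_interval_min: float) -> float:
    """Worst-case data loss equals the interval between bacpac exports."""
    return export_interval_min

# Usage: an assumed 45-minute import, 15-minute deployment, 5-minute
# reroute, with bacpac exports every 6 hours.
print(estimated_rto_minutes(45, 15, 5))  # → 50
print(worst_case_rpo_minutes(6 * 60))    # → 360
```

This also shows why a synchronized standby database shortens the RTO: it removes the import time from the critical path, at the cost of continuously running the sync.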
If you want full control to perform a failover to a secondary datacenter at any given time, and/or a failover for individual sets of blobs, you can keep copies of your blob data in the secondary datacenter. This is expected to cost more, but it gives you much greater flexibility for failover to the secondary site.
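One simple way to maintain such copies is an incremental sync that copies only blobs that are new or changed in the primary, keyed by a version tag such as an ETag. The sketch below stands in for real storage API calls with plain dictionaries and a hypothetical copy_blob hook; it illustrates the bookkeeping, not a particular SDK.

```python
def blobs_to_copy(primary: dict, secondary: dict) -> list:
    """Incremental sync: blobs that are new or changed in the primary.
    Both dicts map blob name -> version tag (e.g. an ETag)."""
    return sorted(
        name for name, etag in primary.items()
        if secondary.get(name) != etag
    )

def sync(primary: dict, secondary: dict, copy_blob) -> None:
    """Apply pending copies via the (hypothetical) copy_blob hook."""
    for name in blobs_to_copy(primary, secondary):
        copy_blob(name)
        secondary[name] = primary[name]

# Usage: only 'b' (changed) and 'c' (new) need copying.
src = {"a": "v1", "b": "v2", "c": "v1"}
dst = {"a": "v1", "b": "v1"}
copied = []
sync(src, dst, copied.append)
print(copied)  # → ['b', 'c']
```

Run on a schedule, the interval between sync passes plays the same RPO role for blob data that the export interval plays for the relational data.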
The described example demonstrates several techniques for achieving cross-datacenter HADR for the different components of an application. Typically, in a distributed environment you should consider each component/service individually and decide on the appropriate techniques.

In the example above, the management portal is deployed in an 'active/active' configuration, which provides cross-datacenter availability. The secondary datacenter is prepared for a disaster by having the provisioning service up and running, ready to redeploy the app. In the case of a disaster the application will be redeployed using service packages and data copies stored in blob storage in the secondary datacenter. This scenario is a hybrid of the 'active/passive' and 'redeploy on disaster' strategies described in Business Continuity for Windows Azure. For the given example this is considered the best trade-off between RTO and cost. I have also described possible variations of the example scenario and the considerations that may apply depending on your solution's specific requirements.