Currently, both Windows Azure and SQL Azure offer high availability within a single data center. As long as a data center remains operational and accessible from the Internet, services hosted there can achieve high availability.
Windows Azure uses a combination of resource management, elasticity, load balancing, and partitioning to enable high availability within a single data center. The service developer must do some additional work to benefit from these features.
All services hosted by Windows Azure are collections of web, worker, and/or virtual machine roles. One or more instances of a given role can run concurrently. The number of instances is determined by configuration. Windows Azure uses Fabric Controllers (FCs) to monitor and manage role instances. FCs detect and respond to both software and hardware failures automatically.
The FC dynamically adjusts the number of worker role instances, up to the limit defined by the service through configuration, according to system load.
All inbound traffic to a web role passes through a stateless load balancer, which distributes client requests among the role instances. Individual role instances do not have public IP addresses and are not directly addressable from the Internet. Web roles are stateless, so any client request can be routed to any role instance. Windows Azure raises a StatusCheck event on every role instance every 15 seconds; by responding to it, an instance can indicate whether it is ready to receive traffic from the load balancer.
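The routing behavior described above can be illustrated with a small sketch. It is not the Windows Azure load balancer: the round-robin policy, the `RoleInstance` and `LoadBalancer` classes, and the `busy` flag (standing in for whatever an instance reports on a status check) are all assumptions made for illustration.

```python
class RoleInstance:
    """Hypothetical stand-in for a stateless web role instance."""

    def __init__(self, name):
        self.name = name
        self.busy = False  # assumed to reflect the instance's last status report

    def handle(self, request):
        # Stateless: the response depends only on the request itself.
        return f"{self.name} handled {request}"


class LoadBalancer:
    """Round-robin over instances, skipping any that report busy (assumed policy)."""

    def __init__(self, instances):
        self.instances = list(instances)
        self._next = 0

    def route(self, request):
        # Try each instance at most once, starting where the last call left off.
        for _ in range(len(self.instances)):
            instance = self.instances[self._next]
            self._next = (self._next + 1) % len(self.instances)
            if not instance.busy:
                return instance.handle(request)
        raise RuntimeError("no instance is ready to accept traffic")
```

Because the instances hold no per-client state, the balancer is free to send consecutive requests from the same client to different instances, or to skip an instance that is temporarily busy.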
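FCs use two types of partitions: upgrade domains and fault domains. An upgrade domain is a group of role instances that are upgraded together during a rolling upgrade, so that only part of a service is offline at any one time. A fault domain is a group of role instances that share a single point of hardware failure, such as a rack.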
According to the Windows Azure SLA, Microsoft guarantees that when two or more web role instances are deployed to different fault and upgrade domains, they will have external connectivity at least 99.95% of the time. The service developer cannot control the number of fault domains; Windows Azure allocates them and distributes role instances across them automatically. At least the first two instances of every role are placed in different fault and upgrade domains, which ensures that any role with at least two instances satisfies the SLA.
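As a rough illustration of why two instances are the minimum, the sketch below spreads instances across domains and checks whether losing any single domain still leaves an instance running. The round-robin assignment and the default domain counts (2 fault domains, 5 upgrade domains) are assumptions for the example, not documented allocation behavior.

```python
def place_instances(instance_count, fault_domains=2, upgrade_domains=5):
    """Assign instance i to a (fault_domain, upgrade_domain) pair, round-robin.

    The domain counts are hypothetical defaults chosen for illustration.
    """
    return [(i % fault_domains, i % upgrade_domains) for i in range(instance_count)]


def survives_single_domain_loss(placement):
    """True if taking down any one fault domain or any one upgrade domain
    still leaves at least one instance running."""
    fault_domains_used = {fd for fd, _ in placement}
    upgrade_domains_used = {ud for _, ud in placement}
    return len(fault_domains_used) >= 2 and len(upgrade_domains_used) >= 2
```

With a single instance, `place_instances(1)` puts everything in one fault domain and one upgrade domain, so any failure or upgrade of that domain takes the role offline. With two instances the placement spans two of each, which is the condition the SLA depends on.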
The requirement to keep roles stateless deserves further comment. It implies, among other things, that all related rows in a SQL Azure database should be changed in a single transaction whenever possible. For example, instead of inserting a parent row in one transaction and its children in another, the code should insert the parent and the children in the same transaction, so that if the role instance goes down partway through, the data is still left in a consistent state.
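A minimal sketch of the parent-and-children pattern follows. It uses Python's built-in `sqlite3` module as a stand-in for a SQL Azure connection, and the `orders` and `order_lines` tables are hypothetical; only the transactional shape is the point.

```python
import sqlite3


def create_schema(conn):
    # Hypothetical parent and child tables used only for this illustration.
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE order_lines ("
        "order_id INTEGER NOT NULL REFERENCES orders(id), "
        "item TEXT NOT NULL)"
    )


def insert_order(conn, order_id, customer, items):
    """Insert the parent row and all child rows in one transaction."""
    # Used as a context manager, the connection commits if the block
    # completes and rolls back if any statement raises.
    with conn:
        conn.execute(
            "INSERT INTO orders (id, customer) VALUES (?, ?)",
            (order_id, customer),
        )
        conn.executemany(
            "INSERT INTO order_lines (order_id, item) VALUES (?, ?)",
            [(order_id, item) for item in items],
        )
```

If a child insert fails, or the process dies before the commit, the parent row is not persisted either, so no order is ever stored without its lines.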
Of course, it is not always possible to make all changes in a single transaction. Special care must be taken to ensure that role failures do not cause problems when they interrupt long-running operations that span two or more updates to the persistent state of the service.
For example, in a service that partitions data across multiple stores, if a worker role goes down while relocating a shard, the relocation may not complete, or may be repeated from its inception by a different worker role, potentially causing orphaned data or data corruption. To prevent such problems, long-running operations must be idempotent (i.e., repeatable with the same end result and no additional side effects) and/or incrementally restartable (i.e., able to continue from the point at which they were interrupted).
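The sketch below shows one way a shard relocation could be made both idempotent and incrementally restartable. It is an illustration under stated assumptions: plain dictionaries stand in for the source store, the target store, and a durably persisted checkpoint, and the `fail_after` parameter exists only to simulate a role failure.

```python
def relocate_shard(source, target, checkpoint, fail_after=None):
    """Copy every row from source to target, resuming from the checkpoint.

    source, target: dicts mapping row key -> value (stand-ins for data stores).
    checkpoint: dict assumed to live in durable storage; holds 'last_key'.
    fail_after: if set, simulate a role failure after that many rows are copied.
    """
    copied = 0
    for key in sorted(source):
        last_key = checkpoint.get("last_key")
        if last_key is not None and key <= last_key:
            continue  # restartable: rows up to the checkpoint are already done
        if fail_after is not None and copied >= fail_after:
            raise RuntimeError("simulated role failure")
        target[key] = source[key]  # idempotent: an upsert, safe to repeat
        checkpoint["last_key"] = key
        copied += 1
    checkpoint["done"] = True
```

A second worker that picks up the interrupted relocation calls the same function with the same checkpoint and continues where the first stopped. Running it again after completion changes nothing.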
Finally, all long-running operations should be invoked repeatedly until they succeed. For example, a provisioning operation might be placed in an Azure queue and removed from the queue by a worker role only when it succeeds. Garbage collection may be needed to clean up data created by interrupted operations.
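The queue pattern can be sketched as follows. An in-memory `deque` stands in for an Azure queue, and the `max_attempts` cap with its dead-letter list is an added assumption to keep the example from looping forever; the source only states that a message is removed when the operation succeeds.

```python
from collections import deque


def drain_queue(queue, handler, max_attempts=5):
    """Process messages in order, removing each one only after handler succeeds.

    Returns (attempts per message, messages abandoned after max_attempts).
    """
    attempts = {}
    dead_letter = []
    while queue:
        message = queue[0]  # peek: the message stays queued while in progress
        attempts[message] = attempts.get(message, 0) + 1
        try:
            handler(message)
        except Exception:
            if attempts[message] >= max_attempts:
                dead_letter.append(queue.popleft())  # give up on this message
            continue  # otherwise leave it at the head and retry
        queue.popleft()  # success: only now is the message removed
    return attempts, dead_letter
```

Because a failed attempt leaves the message in place, the handler may run several times for the same message, which is why the operation it performs needs to be idempotent or restartable.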
Common long-running operations that create special challenges include provisioning, deprovisioning, rolling upgrades, data replication, restoring backups, and garbage collection.
SQL Azure uses a combination of replication and resource management to provide high availability within a single data center. Services benefit from these features just by using SQL Azure. No additional work is required by the service developer.
SQL Azure exposes logical rather than physical servers. A logical server is assigned to a single tenant, and may span multiple physical servers. Databases in the same logical server may therefore reside in different SQL Server instances.
Every database has three replicas: one primary and two secondaries. All reads and writes go to the primary, and all writes are replicated asynchronously to the secondaries. However, every transaction commit requires a quorum: the primary and at least one of the secondaries must confirm that the log records are written before the transaction is considered committed. Most production data centers have hundreds of SQL Server instances, so it is unlikely that any two databases with primary replicas on the same machine will have secondary replicas that also share a machine.
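The quorum rule can be expressed in a few lines. This is a simplified sketch, not SQL Azure's replication protocol: the `Replica` class and `commit` function are hypothetical, writes to the secondaries are made synchronously for brevity, and rollback of a log record after a failed commit is not modeled.

```python
class Replica:
    """Hypothetical replica that records log entries while it is available."""

    def __init__(self, name, available=True):
        self.name = name
        self.available = available
        self.log = []

    def write(self, record):
        """Return True if the log record was written, False if the replica is down."""
        if not self.available:
            return False
        self.log.append(record)
        return True


def commit(record, primary, secondaries):
    """A commit succeeds only if the primary and at least one secondary
    confirm that the log record is written."""
    if not primary.write(record):
        return False
    acknowledgements = sum(1 for replica in secondaries if replica.write(record))
    return acknowledgements >= 1
```

With one secondary down the commit still succeeds, since two of the three replicas hold the record. With both secondaries down it fails, because the primary alone does not form a quorum.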
Like Windows Azure, SQL Azure uses a fabric to manage resources. However, instead of a fabric controller, it uses a ring topology to detect failures. Every replica in a cluster has two neighbors, and is responsible for detecting when they go down. When a replica goes down, its neighbors trigger a Reconfiguration Agent (RA) to recreate it on another machine. Engine throttling is provided to ensure that a logical server does not use too many resources on a machine, or exceed the machine’s physical limits.
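To make the ring arrangement concrete, the sketch below computes which nodes are responsible for noticing a given node's failure. The list-based ring and the `ring_neighbors` helper are illustrative assumptions; the source does not describe how the ring is represented or how neighbors probe each other.

```python
def ring_neighbors(ring, node):
    """Return the (previous, next) neighbors of node in a ring given as a list.

    The list wraps around: the first and last entries are adjacent.
    """
    index = ring.index(node)
    previous_node = ring[index - 1]  # index -1 wraps to the last entry
    next_node = ring[(index + 1) % len(ring)]
    return previous_node, next_node
```

For a ring of `["a", "b", "c", "d"]`, the neighbors of `"a"` are `"d"` and `"b"`, so those two are the ones that would detect that `"a"` has gone down and trigger its recreation elsewhere.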