Windows Azure is a Platform as a Service (PaaS), which means it offers a set of components you can combine to solve a problem.
It’s important to understand that some of these services are Stateless and others maintain State. Stateless means (at least in this case) that a system might disappear from one physical location and appear elsewhere. You can think of this as a cashier at the front of a store. If you’re in line, a cashier might take his break, and another person might replace him. As long as the order proceeds, you as the customer aren’t really affected except for the few seconds it takes to change them out. The cashier function in this example is stateless.
The Compute Role Instances in Windows Azure are Stateless. Because of a hardware upgrade, a fault, or any of several other reasons, a Compute Role Instance might stop on one physical server and be started again on another. This is done through the controlling fabric that Windows Azure uses to manage the systems.
It’s important to note that storage in Azure does maintain State. Your data will not simply disappear - it is maintained. In fact, it’s kept in three copies within a single datacenter, and those copies are replicated to another datacenter for safety. Going back to our example, storage is similar to the cash register itself. Even though a cashier leaves, the record of your payment is maintained.
So when a Compute Role Instance disappears and re-appears, whatever was running on that first Instance stops working. If you wrote your code in a Stateless way, then another Role Instance simply re-starts that transaction and keeps working, just like the other cashier in the example.
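To make the cashier analogy concrete, here is a minimal Python sketch. It is not Azure SDK code: `DurableStore` is a hypothetical in-memory stand-in for stateful storage (the cash register), and `process_orders` is a hypothetical stand-in for a Stateless Role Instance (the cashier). The point it tries to show is that when all progress is recorded in the store, a second instance can pick up where the first one stopped.

```python
class DurableStore:
    """Hypothetical stand-in for stateful storage (the 'cash register')."""

    def __init__(self, orders):
        self.pending = list(orders)  # work not yet completed
        self.completed = []          # record of (order, instance) pairs

    def peek(self):
        """Return the next pending order, or None if nothing is left."""
        return self.pending[0] if self.pending else None

    def mark_done(self, order, instance_name):
        # Record who finished the order, then drop it from the pending list.
        self.completed.append((order, instance_name))
        self.pending.remove(order)


def process_orders(instance_name, store, fail_after=None):
    """Hypothetical Stateless worker (the 'cashier').

    It keeps nothing between orders except a local counter; everything it
    needs is read from, and written back to, the store. `fail_after`
    simulates the instance disappearing after handling that many orders.
    """
    handled = 0
    while store.peek() is not None:
        if fail_after is not None and handled >= fail_after:
            return handled  # instance vanishes; pending work stays in the store
        store.mark_done(store.peek(), instance_name)
        handled += 1
    return handled


store = DurableStore(["order-1", "order-2", "order-3"])
first = process_orders("instance-A", store, fail_after=1)  # stops early
second = process_orders("instance-B", store)               # takes over
print(first, second, store.completed)
```

In this sketch the first instance handles one order and vanishes, and the second finishes the remaining two without needing anything from the first.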
But if you have only one Instance of a Role, then whenever that Instance is re-started, or whenever you need to upgrade your own code, you can face downtime, since nothing else is there to take over. That means you should deploy at least two Instances of each Role, not only to scale for load, but so that the first “cashier” has someone to replace them when they disappear. It’s not just a good idea - it’s a requirement for the uptime Service Level Agreement (SLA) in Azure. We point this out right in the Management Portal when you deploy the application:
When you deploy a Role Instance you can also set the “Upgrade Domain”. Placing Roles on separate Upgrade Domains means that your service keeps running while you upgrade (more on upgrades in another post). The process for two Roles looks like this. The example covers the upgrade scenario, so there are four Roles in total - one Web and one Worker running the "older" code, and one of each running the new code. Each of those Roles should have at least two Instances, which covers you for both High Availability and upgrade paths:
The take-away is this - always plan for forward-facing Roles to have at least two copies. For Worker Roles that do background processing, there are ways to architect around this number, but it does affect the SLA if you have only one.
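As a small illustration of that rule of thumb, the sketch below flags any Role planned with fewer than two Instances. The function name `roles_at_risk`, the role names, and the dictionary format are all hypothetical, chosen only for this example; Azure does not expose a deployment plan in this shape.

```python
def roles_at_risk(instance_counts, minimum=2):
    """Return the names of Roles planned with fewer than `minimum` Instances."""
    return sorted(
        name for name, count in instance_counts.items() if count < minimum
    )


# Hypothetical deployment plan: Role name -> number of Instances.
deployment = {"WebRole": 2, "WorkerRole": 1}
print(roles_at_risk(deployment))
```

A check like this could run before deployment, so that a single-Instance Role is a deliberate choice and not an oversight.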
When a service is deployed, the role instances are spread across more than one upgrade domain (assuming you have more than one role instance). When you deploy an update for a service, the Windows Azure fabric controller does a rolling upgrade, making sure that only one upgrade domain at a time is taken offline so the upgrade can be applied. In this way, it ensures that your application is never completely offline. You do temporarily lose some capacity, but the service itself remains available.
The following diagram shows a service with two roles, each with two instances. The service owner wants to deploy an upgrade to the worker role. The Windows Azure Fabric Controller first applies the update to domain 0, upgrading the first instance of the role to V2. Once that is complete, it applies the upgrade to domain 1, upgrading the second of the two instances to V2.
You can set the number of upgrade domains via the service definition.
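The rolling upgrade described above can be simulated in a few lines of Python. This is only a sketch of the idea, not how the fabric controller is implemented: the function names and instance names are hypothetical, and the round-robin placement of instances across domains is an assumption made for the example.

```python
from collections import defaultdict


def assign_upgrade_domains(instances, domain_count):
    """Spread instances across upgrade domains round-robin (an assumption)."""
    return {name: i % domain_count for i, name in enumerate(instances)}


def rolling_upgrade(domains, new_version):
    """Upgrade one domain at a time, yielding a snapshot after each step.

    Each snapshot is (domain, instances that stayed online during the step,
    version of every instance after the step).
    """
    versions = {name: "V1" for name in domains}
    by_domain = defaultdict(list)
    for name, domain in domains.items():
        by_domain[domain].append(name)

    for domain in sorted(by_domain):
        offline = by_domain[domain]  # only this domain is taken down
        online = [name for name in domains if name not in offline]
        for name in offline:
            versions[name] = new_version
        yield domain, online, dict(versions)


# Two worker instances across two upgrade domains, as in the comment above.
domains = assign_upgrade_domains(["worker_0", "worker_1"], domain_count=2)
steps = list(rolling_upgrade(domains, "V2"))
for domain, online, versions in steps:
    print(f"domain {domain}: still serving {online}, versions now {versions}")
```

With two instances in two domains, one instance stays online at every step. With a single instance, the `online` list would be empty during the upgrade, which is the downtime the post warns about.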
If you'd actually read the SLA you'd see that the 99.95% uptime relates only to internet facing roles. Having a single worker role has zero impact on compliance with the SLA.