The Windows Azure platform is a solid, highly available, highly scalable environment - but, as with any system (on-premise or in the cloud), there are risks which could threaten the desired operation of your app (what I refer to herein as the 'normal scenario'). In this article, I discuss the idea that failures can be categorised according to severity and that many of their effects are avoidable; the failure scenarios you might need to deal with (on any cloud platform, but specifically Windows Azure); how to understand and define risk; and why it is important to design for these failures and tolerate them in your cloud solutions.
First though, allow me to highlight my reasons for writing this article:
The idea behind this article is that failure is generally OK: what we try to avoid is individual risks manifesting and becoming a problem. Risk avoidance is absolutely essential to cloud development and, to borrow an old phrase, sometimes you have to 'roll with the punches'. Accept that your cloud application might fail at some point and in some way, and that it's your job to complement the 99.95% availability guarantee with strategies that help you quantify and mitigate the risks beyond the up-time guarantee.
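To put the 99.95% figure into perspective, it can help to convert it into the amount of downtime it still permits. The following is a minimal sketch of that arithmetic only; the function name and the 30-day month are my own assumptions for illustration, and how an availability guarantee is actually measured is defined by the provider's SLA, not by this calculation.

```python
# Downtime that an availability percentage still permits over a period.
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def allowed_downtime_hours(availability_pct, period_hours=HOURS_PER_YEAR):
    """Return the hours of downtime that still satisfy the given availability."""
    return period_hours * (1 - availability_pct / 100)


yearly = allowed_downtime_hours(99.95)
monthly = allowed_downtime_hours(99.95, period_hours=30 * 24)  # assumed 30-day month

print(f"{yearly:.2f} hours per year")                   # 4.38 hours per year
print(f"{monthly * 60:.1f} minutes per 30-day month")   # 21.6 minutes per 30-day month
```

In other words, even when the guarantee is met in full, a few hours of unavailability a year remain possible, and that is the gap your own strategies need to cover.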
Building highly available, highly scalable applications in the cloud requires us to embrace this principle: to understand that we will encounter failures and that many of them can often be transparently handled until normal operation is restored.
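One common way of handling a failure transparently is to retry the failed operation with an increasing delay between attempts. The sketch below is illustrative only and is not tied to any Windows Azure API: `TransientError`, `call_with_retry` and `flaky` are hypothetical names, and deciding which of your real failures count as transient is left to you.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are expected to clear on their own."""


def call_with_retry(operation, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Run operation, retrying on TransientError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of retries: let the caller see the failure
            sleep(base_delay * 2 ** attempt)  # waits 0.5s, 1s, 2s, ...


# Usage: a simulated dependency that fails twice and then recovers.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("simulated outage")
    return "ok"


delays = []  # record the waits instead of actually sleeping
result = call_with_retry(flaky, sleep=delays.append)
```

The caller only ever sees the successful result; the two failed attempts are absorbed by the retry loop.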
Finally, before we continue I want to make an observation:
When you deploy to Windows Azure, you are asked to choose which data centre you want to deploy to. Note that this is singular: you are selecting one data centre. Thus, although the likelihood of a data centre failing is very slim, you do in theory have a single point of failure. You can absolutely mitigate this risk (see the foot of this article if you require pointers on how to mitigate risk above the data centre level). However, in this article I am capping risk mitigation at the failure of a single data centre: we accept that the risk of a data centre failure is minimal, and that is enough for us here. I appreciate that for others this is not acceptable and multiple failover options are required. This scenario is possible on Windows Azure; I point out where to start with it in the footer, and may cover how to do it in depth in a later article.
Let us first begin with a quick discussion about failure; what it means in this context and how it applies to you on Windows Azure:
The basic principle of failure states that at some point:
There is just no such thing as a mechanical item that isn't subject to failure.
In this article I am excluding from the scope any errors and/or risks caused by poorly written application code.
Let's get started with the exercise of defining what the risks are to your deployment on the Windows Azure platform.
"Risk is the potential that a chosen action or activity (including the choice of inaction) will lead to a loss (an undesirable outcome). The notion implies that a choice having an influence on the outcome exists (or existed). Potential losses themselves may also be called risks".1
This definition hints at two things we need to understand: that there is the potential for risk in any situation, and that the outcome of any given situation can be influenced in some way (otherwise, it is a certainty) so as to lessen the effect or prevent it from being noticeable. In this section, we will identify what the risks are and what the effect of each risk manifesting itself is.
Integral to your deployment on Windows Azure should be an understanding of:
For example, when you buy a car, you know that there is a risk that it might get damaged, either by you (racing around again!), or by another road user. Assuming you're a law-abiding citizen, you'll buy insurance to mitigate the risk of damage to your car, or somebody else's. But within your policy document will be a list of expectations around what happens when your car is damaged: you'll be told how long your car will be unavailable, whether you'll have the use of a rental car, etc.
It is the same for deployments on Windows Azure, except this time we're not talking about the effects to your car, rather the effect to your business caused by risk actually becoming a reality (or, 'surfacing').
I've often found that the effects of the risk (the effect the risk has on your app once it has manifested) can generally be categorised according to the following scale (in order of descending severity):
In this discussion, I'm assuming that the primary risk we're attempting to mitigate is downtime caused by loss of connectivity to the data centre. In my example deployment, we're talking about a simple web application with two web roles, two worker roles and a dependency on a database on SQL Azure. If we dig further, our full risk register may look similar to the following:
Only once both your technical team and your business leaders are aware of the risks, their manifested effect and what can technically be achieved to mitigate them, can a discussion take place about the extent to which you wish to implement these measures. Try to avoid the tendency to shoot for 100% availability across 100% of your dependent resources, and remember that different parts of an app can often tolerate different failures differently! Understand that risks have a field of impact, too. For example, a catastrophic data centre failure would affect the whole of your app, whereas the failure of a database would impact only those sections which require connectivity to it.
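One way to reason about a field of impact is to write down which parts of the app depend on which resources, and then ask what is affected when a given resource fails. The sketch below does this for the example deployment (web roles, worker roles and a SQL Azure database); the feature names and their dependencies are hypothetical and purely illustrative.

```python
# Hypothetical features of the example app and the resources each depends on.
FEATURE_DEPENDENCIES = {
    "browse_content": {"web_role"},
    "publish_content": {"web_role", "sql_azure"},
    "background_processing": {"worker_role", "sql_azure"},
}

# Everything hosted in the single data centre.
ALL_COMPONENTS = {"web_role", "worker_role", "sql_azure"}


def field_of_impact(failed_components):
    """Return the features that depend on at least one failed component."""
    return sorted(
        feature
        for feature, dependencies in FEATURE_DEPENDENCIES.items()
        if dependencies & failed_components
    )


database_failure = field_of_impact({"sql_azure"})
data_centre_failure = field_of_impact(ALL_COMPONENTS)

print(database_failure)     # ['background_processing', 'publish_content']
print(data_centre_failure)  # ['background_processing', 'browse_content', 'publish_content']
```

Here the database failure leaves browsing untouched, while the data centre failure takes out everything, which is exactly the distinction worth putting in front of the business when discussing acceptable risk.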
Crucial to this is an open and honest discussion with the business, and with your customers, about what level of risk is acceptable to them. This will determine how much effort goes into your risk avoidance strategy.
On Windows Azure, one significant advantage is that the cost of running a highly available, highly scalable solution that is both maintained and secure is generally orders of magnitude lower than that of the equivalent private, on-premise setup. The last thing you'd want to do is erode that saving by planning and deploying avoidance techniques that are completely over the top: so be reasonable with your understanding of acceptable risk.
This exercise may seem academic and fairly obvious, but it is often overlooked for that very reason. Without it, though, it is difficult to fully appreciate what steps are necessary, or to brief your UX designers properly on the failure scenarios that could naturally occur and that your app may need to surface to its users.
We've covered risk, now let's turn our attention to what we need to do should the worst happen: a risk has manifested and the effect has begun.
It's a common misconception that disaster recovery and risk mitigation are the same thing.
'Disaster recovery' refers to the things you do (either automatically or as part of a manual activity) that restore you to your normal scenario; for example, something exceptional occurred and you have suffered a catastrophic event and need to get back to 'business as usual' as fast as possible, while minimising loss. Risk mitigation, on the other hand, is about the things you can do before a condition occurs that triggers your failure scenario.
So that you can do this effectively, you need to first understand what risk has surfaced, what your recovery options are for that particular risk, and therefore what your recovery strategy and objectives actually are.
Let's put this into context:
Your app went offline due to a failure of a database connection. The effect was that users of your app could no longer publish new content. There are potentially two recovery options available to you here: you could write new content to a separate store temporarily and automatically update the failed database when it becomes available, or you could simply wait until the failed database is available again. Your strategy for recovery from this particular risk is therefore directly dependent on what your business expects you to be able to achieve in this scenario.
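The first of these recovery options can be sketched roughly as follows. This is a simplified illustration under stated assumptions rather than a production design: `FakeDatabase`, `DatabaseUnavailable` and `ContentPublisher` are hypothetical names, the database is simulated in memory, and the temporary store is an in-memory queue where a real deployment would need something durable.

```python
from collections import deque


class DatabaseUnavailable(Exception):
    """Hypothetical error raised when the database cannot be reached."""


class FakeDatabase:
    """In-memory stand-in for the real database, used for illustration only."""

    def __init__(self):
        self.available = True
        self.rows = []

    def insert(self, item):
        if not self.available:
            raise DatabaseUnavailable()
        self.rows.append(item)


class ContentPublisher:
    """Writes to the database, falling back to a temporary store on failure."""

    def __init__(self, database):
        self.database = database
        self.pending = deque()  # temporary store; would need to be durable in practice

    def flush(self):
        """Replay queued items, oldest first, stopping at the first failure."""
        while self.pending:
            self.database.insert(self.pending[0])
            self.pending.popleft()  # remove only after a successful insert

    def publish(self, item):
        try:
            self.flush()  # replay older items first so ordering is preserved
            self.database.insert(item)
            return "stored"
        except DatabaseUnavailable:
            self.pending.append(item)
            return "queued"


# Usage: two posts arrive during an outage, a third after recovery.
db = FakeDatabase()
publisher = ContentPublisher(db)

db.available = False
first = publisher.publish("post-1")
second = publisher.publish("post-2")

db.available = True
third = publisher.publish("post-3")
```

From the user's point of view publishing never stopped; the cost is the extra complexity of the temporary store, which is why the choice between this and simply waiting is a business decision as much as a technical one.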
We've introduced the notion that risks are no less likely to occur on the Windows Azure platform than on-premise, and we know that Azure is capable of recovering from most of these risks without any input from you. What we're trying to look at here is what steps you can take as developers to stop any non-catastrophic effects from impacting your app, causing a 'failure scenario'. If you embrace the concept of expecting failure, it becomes quite easy to see what you must do in order to maintain normal operation during a failure situation. In general, remember you can:
When designing for high availability, it is a good idea to keep these questions in mind:
Do not rely on the availability guarantee: it isn't enough (a 100% up-time guarantee wouldn't be, either) and remember, availability is only one part of the equation. If we go back to the car insurance metaphor, you don't just buy car insurance to mitigate the risk of injury or damage to yourself or to others: you also drive safely and obey traffic rules. What matters, then, is adopting a philosophy and following it through with a series of actions.
In summary, Windows Azure is and will remain a highly available, stable and reliable cloud platform and it will continue to be enhanced and improved over time. As developers though, we have to appreciate that failures of course can, and do, occur. Every object is subject to entropy, and hard disks, network cables and switches are no exception. Understanding that there are parts of the availability equation that you can - and should - take responsibility for is essential to a healthy cloud deployment and arguably, even if your app is deployed on-premise, you might want to consider adopting 'cloud risk principles', too!
My point ultimately is that risk isn't a problem: not understanding the effects of the risks and how your app can restore normal operation with minimal loss and minimal effort is a problem.
Deployments on Windows Azure are typically within a single data centre, and in my experience few developers realise that by simply using Windows Azure Traffic Manager CTP, you can deploy your application to multiple data centres globally, and fail over to your backup data centres within a predetermined period.
I'd just like to finish by saying that I expect this to be the first of a group of related articles and I welcome and encourage all feedback.
If you, or your team, need some help getting started with designing and developing resilient applications on the Windows Azure platform, then please get in touch and we'd be happy to assist.
REFERENCES:
1 Risk (Wikipedia)
How would you deal with the risk of data corruption on Windows Azure?
Hi Rinat. The issue of data corruption is an altogether different problem: do you refer to corruption of data by a failure of a storage resource, or by a bug or misconfiguration in your app for instance? Do you mean blob storage or SQL Azure?
While both carry risks, it is appropriate to first quantify the risk factors once you have determined what the risk actually is. This will determine the scope of work you have to carry out to bring the risk within a known level of tolerance.
Perhaps I will cover this in a separate blog post!
Thanks for reading and taking the time to comment.