Since this blog will focus on OSS and device development for the Windows Azure platform, we want to develop a shared understanding of what it means to design and build applications for the cloud. To make sure that we start with that common understanding, we’ll propose answers to these three questions in this post:

  1. What are the benefits of cloud computing?
  2. What are the principles of cloud computing?
  3. What are the challenges of cloud computing?

Clearly, we can’t hope to explore possible answers to these questions in great depth in one post, so we’ll attempt to answer them only at a high level here. Our focus will be on exploring answers to these questions for any cloud application (not just Windows Azure applications) with the promise of getting into the details of OSS/device development for the Azure platform in the weeks and months ahead. And, we don’t expect our answers to be definitive…cloud computing is still too new and wide-open to expect that. We’d love to hear what you think about what it means to build an application for the cloud.

What are the benefits of cloud computing?

No matter how you slice it, the “benefits” of cloud computing come down to a single benefit: cost savings.

If you are considering using a cloud platform to run an application, then you are probably thinking about the availability and scalability of your application, or the cost savings of not having to maintain your own data center (and if you are not, you should ask yourself why you are considering the cloud at all). Aside from development, your largest costs (or among the largest) are a data center (possibly several) capable of handling peak usage and the people needed to manage it. By using a cloud offering instead of a traditional data center, you remove much of the effort required to manage your machines and gain the ability to rapidly provision new machines as you need them. That rapid provisioning (and de-provisioning), together with the reduced management effort, is what ultimately saves you money. (That is, of course, an over-simplification. Different cloud offerings provide different ways to reduce the cost of running a highly available, highly scalable application, but all are ultimately focused on reducing cost.)

The classic example is an e-commerce application whose usage varies with the time of year (e.g., heavy use during the winter holidays). If you are responsible for the data centers that run the application, you have to make sure you have enough capacity to handle peak traffic. Of course, this means many of your servers sit idle for much of the year – you are paying for more than you use over the course of a year. When you run the same application in the cloud, you can match your capacity much more closely to demand, thus paying only for what you use.

[Diagram: fixed in-house server capacity vs. elastic cloud capacity, plotted against actual demand over time]

This diagram was adapted from a presentation by Josh Holmes.

The diagram above helps to illustrate the benefit. Note that with in-house servers, capacity often far exceeds actual demand (and can sometimes fall short of it!). With servers in the cloud, you can rapidly spin servers up and down to closely match demand, paying only for what you use while always keeping up with demand.

What are the principles of cloud computing?

To take advantage of the benefits of cloud computing, you need to design and build applications with the cloud in mind. The problems you will run into by simply taking an in-house application and deploying it on cloud servers will quickly demonstrate how important this point is.

The basic principles on which good cloud applications are built are the same as those for building any distributed application (with the added twist of building for elastic scalability). If you have been building applications on commodity hardware with availability and scalability in mind, you are likely already familiar with the principles of distributed computing. On the other hand, if you have been solving availability and scalability problems by upgrading hardware (i.e., adding more CPUs and more RAM per server), then you have some work to do in understanding how to design an application for the cloud. In other words, if you understand scale out (as opposed to scale up), then you are well on your way to building applications for the cloud.

The following are the principles of distributed computing:

  • Use a stateless application model. If you are used to scaling up, then you may also be used to designing applications that maintain state. In distributed computing, because you cannot know which server a request will be routed to, you cannot expect the application state to be the same from user request to user request.
    • Use idempotent operations. A consequence of building stateless applications is the possibility of processing the same operation multiple times, with adverse effects. In the e-commerce application example I used earlier, what should happen if a user mistakenly submits an order twice? In an application that maintains state, it would be easy to notify the user that the order had already been submitted and ignore the second request. In a distributed application, the second request may be sent to a server that is unaware of the first request. Unless you plan for this, the second order may mistakenly be processed (see the idempotency sketch after this list).
  • Use loosely coupled, highly focused modules. In other words, build with more pieces, not bigger pieces. This principle encapsulates the idea that in distributed computing, availability and scalability issues are addressed with more servers, not bigger servers (scale out vs. scale up). By building modules that are only loosely dependent on each other and that perform narrow, specialized tasks, it becomes easier to add capacity to an application, easier to troubleshoot, and easier to recover from failures. Examining some of the logical implications of this principle will help drive the point home.
  • Use asynchronous, event-driven processing. In the scale-up world, it may have been common practice to process an event and return immediately (which may have been practical with the right hardware). Consider again the e-commerce example. With a hefty server, it may have been possible to let users browse products and process orders at the same time (at least until the server reached its limits). In distributed computing, you must use the availability of many servers (not hefty servers) to solve this problem. This means designing your application to hand off as much work as possible to background servers – for example, loading orders into a queue and having background servers dedicated to processing the orders (asynchronously) while users continue to browse products on your main servers (see the queue sketch after this list).
    • Use parallelization of event processing. When an application is designed to process events asynchronously, it becomes logical (and relatively easy) to add capacity by processing events in parallel. In the e-commerce example, if orders start flying in, you should be able to add more machines that are dedicated to processing orders, thus increasing the application’s capacity by processing orders in parallel.
    • Use common, standards-based communication between modules. When you build loosely coupled modules, communication between them becomes very important. And, given that application modules may very well be developed in different programming languages, making sure that modules speak a common tongue is essential. The best solution here is a standards-based approach, such as RESTful HTTP APIs exchanging JSON.
  • Use a “shared nothing” architecture. A “shared nothing” architecture is (quoting Wikipedia) “a distributed computing architecture in which each node is independent and self-sufficient, and there is no single point of contention across the system.” If you design your application such that resources are shared, you introduce bottlenecks and make it difficult (perhaps impossible) to scale the application by adding more servers. One example stands out here…
    • Use de-normalized and partitioned data (i.e., use sharding). By de-normalizing (to a certain degree) and horizontally partitioning data into smaller databases, you promote the use of parallelized processes, increase performance, and improve the overall reliability of the application (if one database fails, you don’t bring down the entire application). In the e-commerce example, you might partition data into shards according to product category (see the sharding sketch after this list).
  • Design for failure. Servers will fail. This is not a likelihood, it’s a guarantee. Plan accordingly: build retry logic into your requests (see the retry sketch after this list), design redundant systems, and test them. This principle is essential to the success of any cloud application, yet it is often the one most overlooked. Any good application should be designed to handle failure, but when you can physically lay your hands on your hardware, it’s easy to say to yourself “I’ll just replace X if it fails.” You can’t do that in the cloud.
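
To make the idempotency point concrete, here is a minimal Python sketch of the duplicate-order scenario described above. It assumes a client-generated order ID, and a plain dictionary stands in for a shared data store (in a real distributed application this would be a database or cache that every server can reach); all names here are hypothetical.

```python
import uuid

# A plain dict stands in for a shared data store; in a real distributed
# application this would be a database or cache that every server can
# reach, keyed by order ID.
processed_orders = {}

def submit_order(order_id, cart):
    """Process an order at most once, no matter which server gets the request.

    The client generates order_id when the user first clicks "submit",
    so a duplicate submission carries the same ID.
    """
    if order_id in processed_orders:
        # Duplicate request: return the original result instead of
        # placing (and charging for) the order a second time.
        return processed_orders[order_id]

    result = {"order_id": order_id, "status": "accepted", "items": cart}
    processed_orders[order_id] = result  # record it so retries become no-ops
    return result

# A duplicate submission is now harmless:
oid = str(uuid.uuid4())
first = submit_order(oid, ["widget"])
second = submit_order(oid, ["widget"])  # same ID -> same result, no double order
assert first is second
```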
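
And here is a rough sketch of the queue-based, asynchronous processing pattern, using Python’s standard-library queue and threads to stand in for a durable cloud queue and worker servers. It also illustrates the parallelization point: adding capacity is just a matter of starting more workers.

```python
import queue
import threading

orders = queue.Queue()  # stands in for a durable cloud queue service

def order_worker():
    """Background worker: pulls orders off the queue and processes them."""
    while True:
        order = orders.get()
        if order is None:  # sentinel value: shut this worker down
            break
        print(f"processing order {order['order_id']}")
        orders.task_done()

# Scale out by starting more workers in parallel (more worker servers
# in a real deployment).
workers = [threading.Thread(target=order_worker) for _ in range(4)]
for w in workers:
    w.start()

# The web tier just enqueues and returns to the user immediately.
for i in range(10):
    orders.put({"order_id": i})

orders.join()        # wait until every order has been processed
for _ in workers:
    orders.put(None)  # stop the workers
for w in workers:
    w.join()
```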
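
The sharding idea can be illustrated in a few lines as well. This sketch routes queries by product category, as suggested above; the shard map and host names are made up for illustration.

```python
# Hypothetical shard map: one small database per product category.
SHARDS = {
    "books":       "db-books.example.net",
    "electronics": "db-electronics.example.net",
    "clothing":    "db-clothing.example.net",
}

def shard_for(category):
    """Return the database host that owns this product category."""
    return SHARDS[category]

def get_products(category):
    host = shard_for(category)
    # A real implementation would connect to `host` and run the query;
    # each shard stays small and fast, and losing one shard doesn't
    # take the whole catalog down.
    return f"SELECT * FROM products  -- executed against {host}"

print(get_products("books"))
```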
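
Finally, a minimal sketch of the retry logic mentioned under “design for failure”. It retries a failing operation with exponential backoff and a bit of jitter; the attempt counts and delays are arbitrary placeholders, and fetch_order_status in the usage comment is a hypothetical call.

```python
import random
import time

def with_retries(operation, attempts=5, base_delay=0.5):
    """Run operation(), retrying transient failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: let the failure surface
            # Back off 0.5s, 1s, 2s, ... plus random jitter so many clients
            # don't all retry in lockstep against a recovering server.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))

# Usage: wrap any call that can hit a transient network or server fault.
# result = with_retries(lambda: fetch_order_status(order_id))
```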

The last principle I’d like to call out here (although I’m not sure it’s technically a “principle”) is not necessarily related to distributed computing, but it is related to the value proposition of cloud computing: anticipate scale-up and scale-down needs. To take advantage of the benefits of cloud computing, you have to plan for scaling up and scaling down in advance. Some of your planning can be programmed into your application (e.g., when traffic hits X requests per minute, spin up Y new servers – a toy version of such a rule is sketched below), but other planning may be “manual”. In the context of the e-commerce example, you may want to adjust your programmatic scaling rules for November and December (when you expect more traffic). And, in anticipation of Black Friday or Cyber Monday, you may want to spin up new servers ahead of time (spinning up new servers isn’t instantaneous, so it’s a good idea to do this before a dramatic spike in traffic, not during it).
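
As a rough illustration of a programmatic scaling rule, here is a toy Python function in the spirit of “when traffic hits X requests per minute, spin up Y new servers”. All the thresholds are invented; real values would come from load-testing your own application, and a real implementation would call your cloud platform’s provisioning API rather than just return a number.

```python
def desired_server_count(requests_per_minute, current_servers,
                         scale_up_at=1000, scale_down_at=200,
                         min_servers=2, max_servers=20):
    """Toy autoscaling rule: add a server when per-server load is high,
    drop one when it is low. All thresholds are made-up numbers."""
    load_per_server = requests_per_minute / current_servers
    if load_per_server > scale_up_at:
        return min(current_servers + 1, max_servers)
    if load_per_server < scale_down_at:
        return max(current_servers - 1, min_servers)
    return current_servers

# A monitoring job might run this every few minutes and provision (or
# release) instances to match; before Black Friday, you might simply
# raise min_servers by hand.
print(desired_server_count(requests_per_minute=12000, current_servers=10))  # -> 11
```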

What are the challenges of cloud computing?

Designing according to the principles above may seem like challenge enough for building a cloud application, but there are a few other challenges worth pointing out:

  • Decide whether or not you need the cloud for your app. Not all applications will benefit from being run in the cloud. If high availability and high scalability are not central to the success of your application, then you don’t stand to gain anything by running your application in the cloud. Ask yourself whether your application can take advantage of the benefits of cloud computing (outlined above) before you deploy in the cloud.
  • Minimize data transfer costs. Most people think about compute and storage costs when they think of running an application in the cloud, but delivering data out of a data center costs money too, and that cost is easy to overlook at design time. This means thinking carefully about caching strategies, using a content delivery network (CDN), and so on (see the sketch after this list).
  • Simply write good, robust code. Bad programming practices (such as not designing for failure) will be amplified in a cloud environment. Being disciplined and strictly adhering to basic good coding practices can be challenging, but will go a long way to ensuring the success of your application in the cloud.
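
On the data-transfer point above, one of the simplest levers is HTTP caching. The sketch below uses Python’s standard-library HTTP server just to show the idea; a real application would set the same Cache-Control header through its web framework, so that browsers and a CDN can serve repeat requests without pulling bytes out of your data center again.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class CachedHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"<html><body>product catalog page</body></html>"
        self.send_response(200)
        # Allow any cache (browser or CDN) to reuse this response for a day,
        # cutting the data served directly from your servers.
        self.send_header("Cache-Control", "public, max-age=86400")
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To try it locally:
# HTTPServer(("", 8080), CachedHandler).serve_forever()
```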

So, those are the benefits, principles, and challenges of cloud computing as we see them. As mentioned earlier, we’d love to hear your thoughts in the comments. We’ll use this post (and your comments) as a reference point as we write and publish content about building OSS and device applications for the Windows Azure platform.

Thanks.

-Brian