Hi, I’m Pankaj Arora, a senior manager in the Microsoft IT Global Strategic Initiatives team.

When Microsoft CEO Steve Ballmer publicly declared “We’re all in” on cloud computing in March 2010, he wasn’t just referring to Microsoft’s products. He was also giving his IT organization a mandate to move to cloud computing. Since that declaration, my colleagues and I have learned a lot about what it takes to adopt cloud computing at a global enterprise. We now have cloud deployments of all the common models (SaaS, PaaS, and IaaS), and we’re starting to use Data-as-a-Service.

With numerous deployment experiences under our belt, and with industry predictions of even greater cloud adoption in 2012 as a backdrop, I want this community to know about a book I’ve co-authored with two colleagues titled To the Cloud: Cloud Powering an Enterprise. In summary, the book addresses the Why, What, and How of enterprise cloud adoption. It’s based on our own experiences and best practices in adopting cloud computing, while also drawing on industry and customer experiences.

The following is an excerpt from Chapter 4 of the pre-production version of the book, which is available in print and eBook through Amazon, Barnes & Noble, and McGraw-Hill, among other outlets. You can see more on the book website here.

Feel free to ask questions, and I hope these excerpts (and the book) help you with your cloud computing strategy and deployments.

Pankaj

Architectural Principles

Moving applications and data out of the corporate data center does not eliminate the risk of hardware failures, unexpected demand for an application, or unforeseen problems that arise in production. Designed well, however, a service running in the cloud should be more scalable and fault-tolerant, and perform better than an on-premises solution.

Virtualization and cloud fabric technologies, as used by cloud providers, make it possible to scale out to a theoretically unlimited capacity. This means that application architecture and the level of automation, not physical capacity, constrain scalability. In this section, we introduce several design principles that application engineers and operations personnel need to understand to properly architect a highly scalable and reliable application for the cloud.

Resiliency

A properly designed application will not go down just because something happens to a single scale unit. A poorly designed application, in contrast, may experience performance problems, data loss, or an outage when a single component fails. This is why cloud-centric software engineers cultivate a certain level of pessimism. By thinking of all the worst-case scenarios, they can design applications that are fault tolerant and resilient when something goes wrong.

Monolithic software design, in which the presentation layer and functional logic are tightly integrated into one application component, may not scale effectively or handle failure gracefully. To optimize an application for the cloud, developers need to eliminate tight dependencies and break the business logic and tasks into loosely coupled, modular components that can function independently. Ideally, application functionality will consist of autonomous roles that function regardless of the state of other application components. To minimize enterprise complexity, developers should also leverage reusable services where possible.

We talked about the Microsoft online auction tool earlier. One way to design such an application would be to split it into three components, as each service has a different demand pattern and is relatively asynchronous from the others: a UI layer responsible for presenting information to the user, an image resizer, and a business logic component that applies the bidding rules and makes the appropriate database updates. At the start of the auction, a lot of image resizing occurs as people upload pictures of items they add to the catalog. Toward the end of the auction, as people try to outbid each other, the bidding engine is in higher demand. Each component adds scale units as needed based on system load. If, for example, the image resizer component fails, the entire functionality of the tool is not lost.
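The isolation described above can be sketched with a simple work queue between components. This is an illustrative toy, not the design of the actual auction tool: the class name AuctionApp, its methods, and the in-memory queue are all assumptions, and a real deployment would use a durable queue service offered by the cloud provider.

```python
from collections import deque


class AuctionApp:
    """Toy model: bidding and image resizing are decoupled by a queue (hypothetical names)."""

    def __init__(self, resizer):
        self.resize_queue = deque()  # stand-in for a durable cloud queue
        self.bids = {}               # item_id -> highest bid so far
        self._resizer = resizer      # callable(item_id, image); may fail

    def upload_image(self, item_id, image):
        # The UI layer only enqueues work; it never calls the resizer directly.
        self.resize_queue.append((item_id, image))

    def place_bid(self, item_id, amount):
        # Bidding logic has no dependency on the resizer being healthy.
        if amount <= self.bids.get(item_id, 0):
            return False
        self.bids[item_id] = amount
        return True

    def drain_resize_queue(self):
        """Process queued jobs; on failure, leave the job queued for a later attempt."""
        done = 0
        while self.resize_queue:
            item_id, image = self.resize_queue[0]
            try:
                self._resizer(item_id, image)
            except Exception:
                break
            self.resize_queue.popleft()
            done += 1
        return done
```

If the resizer is down, uploads accumulate in the queue and bids continue to be accepted; once the resizer recovers, calling drain_resize_queue() again works through the backlog.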

Pessimism aside, the redundancy and automation built into cloud models make cloud services more reliable, in general. Often, cloud providers have multiple “availability zones” in which they segment network infrastructure, hardware, and even power from one another. Operating multiple scale units of a single application across these zones can further reduce risk; some providers require this before they will guarantee a higher SLA. Therefore, the real question when considering failure is, what happens if an instance of an application is abruptly rebooted, goes down, or is moved?

  • How will IT know the failure happened?
  • What application functionality, if any, will still be available?
  • Which steps will be required to recover data and functionality for users?

Removing unnecessary dependencies makes applications more stable. If a service that the application relies upon for one usage scenario goes down, other application scenarios should remain available.

For the back-end, because some cloud providers may throttle requests or terminate long-running queries on SQL PaaS and other storage platforms, engineers should include retry logic. For example, a component that requests data from another source could include logic that asks for the data a specified number of times within a specified time period before it throws an exception.
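The retry behavior described above might look like the following minimal sketch. The exception classes, the function name fetch_with_retry, and the default limits are assumptions chosen for illustration; the exponential backoff between attempts is a common addition, not something the text prescribes.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (throttling, timeouts)."""


class RetryExhausted(Exception):
    """Raised when the attempt limit or the time budget is used up."""


def fetch_with_retry(fetch, max_attempts=3, time_budget_s=5.0, base_delay_s=0.1,
                     sleep=time.sleep, clock=time.monotonic):
    """Call fetch() until it succeeds, attempts run out, or the budget would be exceeded."""
    start = clock()
    last_error = None
    attempts = 0
    for attempt in range(1, max_attempts + 1):
        attempts = attempt
        try:
            return fetch()
        except TransientError as exc:
            last_error = exc
        # Wait longer after each failure: base, 2x base, 4x base, ...
        delay = base_delay_s * 2 ** (attempt - 1)
        out_of_time = (clock() - start) + delay > time_budget_s
        if attempt == max_attempts or out_of_time:
            break
        sleep(delay)
    raise RetryExhausted(f"gave up after {attempts} attempt(s)") from last_error
```

Only errors the caller marks as transient are retried; anything else propagates immediately, so a genuine bug is not hidden behind repeated attempts.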

For the occasional reboot of a cloud instance, application design should include a persistent cache so that another scale unit or the original instance that reboots can recover transactions. Using persistent state requires taking a closer look at statelessness—another design principle for cloud-based applications.

Statelessness

Designing for statelessness is crucial for scalability and fault tolerance in the cloud. Whether an outage is unexpected or planned (as with an operating system update), as one scale unit goes down, another picks up the work. An application user should not notice that anything happened. It is important to deploy more than one scale unit for each critical cloud service, if not for scaling purposes, simply for redundancy and availability.

Cloud providers generally require that applications be stateless. During a single session, users of an application can interact with one or more scale unit instances that operate independently in what is known as “stateless load balancing” or “lack of session affinity.” Developers should not hold application or session state in the working memory of a scale unit because there is no guarantee the user will exclusively interact with that particular scale unit. Therefore, without stateless design, many applications will not be able to scale out properly in the cloud. Most cloud providers offer persistent storage to address this issue, allowing the application to store session state in a way that any scale unit can retrieve.
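As a rough illustration of the idea, the sketch below keeps session state in a shared store rather than in any one scale unit. SharedSessionStore and ScaleUnit are hypothetical names, and the in-memory dictionary merely stands in for whatever persistent storage service a provider offers.

```python
class SharedSessionStore:
    """Stand-in for a provider's persistent store (hypothetical; an in-memory dict here)."""

    def __init__(self):
        self._data = {}

    def load(self, session_id):
        # Return a copy so callers cannot mutate stored state by accident.
        return dict(self._data.get(session_id, {}))

    def save(self, session_id, state):
        self._data[session_id] = dict(state)


class ScaleUnit:
    """One application instance. It keeps no session state of its own."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def add_item(self, session_id, item):
        # Load, modify, and save on every request; nothing is kept between calls.
        state = self.store.load(session_id)
        state["items"] = state.get("items", []) + [item]
        self.store.save(session_id, state)
        return state["items"]
```

Because each request reloads the session from the store, the load balancer is free to send consecutive requests from the same user to different scale units.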

Parallelization

Taking advantage of parallelization and multithreaded application design improves performance and is a core cloud design principle. Load balancing and other services inherent in cloud platforms can help distribute load with relative ease. With low-cost rapid provisioning in the cloud, scale units are available on-demand for parallel processing within a few API calls and are decommissioned just as easily.

Massive parallelization can also be used for high performance computing scenarios, such as for real-time enterprise data analytics. Many cloud providers directly or indirectly support frameworks that enable splitting up massive tasks for parallel processing. For example, Microsoft partnered with the University of Washington to demonstrate the power of Windows Azure for performing scientific research. The result was 2.5 million points of calculation performed by the equivalent of 2,000 servers in less than one week,[1] a compute job that otherwise may have taken months.
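The split-and-combine pattern behind such jobs can be sketched on a single machine as follows. Here threads stand in for scale units, the function names are illustrative, and the workload (a sum of squares) is a placeholder; a real job would distribute chunks across provisioned instances using a provider-supported framework.

```python
from concurrent.futures import ThreadPoolExecutor


def split(items, n_chunks):
    """Split items into at most n_chunks contiguous chunks of near-equal size."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def process_chunk(chunk):
    """Placeholder for the real per-chunk computation."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(items, workers=4):
    """Fan the chunks out to workers, then combine the partial results."""
    chunks = split(items, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(process_chunk, chunks))
    return sum(partials)
```

The pattern works when chunks can be processed independently and the partial results can be combined cheaply, which is what makes a task a good fit for scale-out.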

Latency

Software engineers can apply the following general design principles to reduce the potential that network latency will interfere with availability and performance.

  • Use caching, especially for data retrieved from higher latency systems, as would be the case with cross-premises systems.
  • Reduce chattiness and/or payloads between components, especially when cross-premises integration is involved.
  • Geo-distribute and replicate content globally. As previously mentioned, enabling the content delivery network in Windows Azure, for example, allows end users to receive BLOB storage content from the closest geographical location.
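The first principle in the list, caching data retrieved from a higher latency system, can be sketched as a small time-to-live cache. TTLCache, its loader argument, and the 60-second default are assumptions for illustration; the right expiry depends on how stale the data is allowed to be.

```python
import time


class TTLCache:
    """Cache values from a slow loader and refresh them after ttl_s seconds."""

    def __init__(self, loader, ttl_s=60.0, clock=time.monotonic):
        self._loader = loader    # callable(key) that hits the high-latency system
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries = {}       # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]      # fresh hit: no remote call
        value = self._loader(key)  # miss or expired: go to the source
        self._entries[key] = (now + self._ttl_s, value)
        return value
```

Repeated reads within the expiry window are served locally, which also reduces chattiness with the remote system.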

Automated Scaling

For cloud offerings that support auto-scaling features, engineers can poll existing monitoring APIs and use service management APIs to build self-scaling capabilities into their applications. For example, consider utilization-based logic that automatically adds an application instance when traffic spikes unexpectedly or CPU consumption reaches a certain threshold. The same routine might listen for messages instructing instances to shut down once demand has fallen to more typical levels.
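A utilization-based rule of this kind might be sketched as below. The thresholds, the instance limits, and the three callables passed to reconcile are all hypothetical; in practice they would wrap the provider's monitoring and service management APIs.

```python
def desired_instance_count(current, cpu_percent, scale_out_at=75.0, scale_in_at=25.0,
                           min_instances=2, max_instances=10):
    """Add one instance under high CPU, remove one under low CPU, within fixed limits."""
    if cpu_percent >= scale_out_at:
        target = current + 1
    elif cpu_percent <= scale_in_at:
        target = current - 1
    else:
        target = current
    return max(min_instances, min(max_instances, target))


def reconcile(read_cpu_percent, read_instance_count, set_instance_count):
    """One polling step: read metrics, decide, and apply a change only if needed."""
    current = read_instance_count()
    target = desired_instance_count(current, read_cpu_percent())
    if target != current:
        set_instance_count(target)
    return target
```

Keeping the decision in a pure function makes the thresholds easy to test, and the minimum of two instances reflects the redundancy advice given earlier for critical services.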

Some logic might be finance-based. For example, developers can add cost control logic to prevent noncritical applications from auto-scaling under specified conditions or to trigger alerts in case of usage spikes.
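One possible shape for such finance-based logic is sketched below. The function name, the budget figures, and the 80 percent alert threshold are illustrative assumptions, not recommendations.

```python
from typing import NamedTuple


class ScalingPolicyResult(NamedTuple):
    allow_scale_out: bool
    send_alert: bool


def check_cost_policy(is_critical, month_to_date_spend, monthly_budget, alert_fraction=0.8):
    """Block scale-out for noncritical apps that are over budget; alert near the budget."""
    over_budget = month_to_date_spend >= monthly_budget
    return ScalingPolicyResult(
        allow_scale_out=is_critical or not over_budget,
        send_alert=month_to_date_spend >= alert_fraction * monthly_budget,
    )
```

A scaling routine could consult this check before adding an instance, so critical applications keep scaling while noncritical ones stop at their budget.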

Scaling of data is as important as application scaling, and once again, it is a matter of proper design. Rethinking the architecture of an application’s data layer for use in the cloud, while potentially cumbersome, can also lead to performance and availability improvements. For example, if no cloud data storage service offers a solution large enough to contain the data in an existing database, consider breaking the dataset into partitions and storing it across multiple instances. This practice, known as “sharding,” has become standard for many cloud platforms and is built into several, including SQL Azure. Even if this is not necessary initially, it may become so over time as data requirements grow.
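To make the idea of partitioning concrete, here is a minimal sketch of application-level sharding by key hash. It does not represent how SQL Azure or any other platform implements sharding; ShardedStore and shard_for are hypothetical names, and plain dictionaries stand in for separate database instances.

```python
import hashlib


def shard_for(key, shard_count):
    """Map a string key to a shard index using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


class ShardedStore:
    """Route each key to one of several partitions (dicts stand in for databases)."""

    def __init__(self, shard_count):
        self._shards = [{} for _ in range(shard_count)]

    def _shard(self, key):
        return self._shards[shard_for(key, len(self._shards))]

    def put(self, key, value):
        self._shard(key)[key] = value

    def get(self, key):
        return self._shard(key).get(key)

    def shard_sizes(self):
        return [len(shard) for shard in self._shards]
```

One caveat of this simple modulo scheme is that changing the number of shards remaps most keys, so growing the dataset later requires a migration plan or a different partitioning strategy.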