Architecture + Strategy

Musings from David Chou - Architect, Microsoft

January, 2011

  • Architecture + Strategy

    Designing for Cloud-Optimized Architecture


    I wanted to take the opportunity and talk about the cloud-optimized architecture, the implementation model instead of the popular perceptions around leveraging cloud computing as a deployment model. This is because, while cloud platforms like Windows Azure can run a variety of workloads, including many legacy/existing on-premises software and application migration scenarios that can run on Windows Server; I think Windows Azure’s platform-as-a-service model offers a few additional distinct technical advantages when we design an architecture that is optimized (or targeted) for the cloud platform.

    Cloud platforms differ from hosting providers

    First off, the major cloud platforms (regardless how we classify them as IaaS or PaaS) at the time of this writing, impose certain limitations or constraints in the environment, which makes them different from existing on-premises server environments (saving the public/private cloud debate to another time), and different from outsourced hosting managed service providers. Just to cite a few (according to my own understanding, at the time of this writing):

    • Amazon Web Services
      • EC2 instances are inherently stateless; that is, their local storage is non-persistent and non-durable
      • Little or no control over infrastructure that are used beneath the EC2 instances (of course, the benefit is we don’t have to be concerned with them)
      • Requires systems administrators to configure and maintain OS environments for applications
    • Google App Engine
      • Non-VM/OS instance-aware platform abstraction which further simplifies code deployment and scale, though some technical constraints (or requirements) as well. For example,
      • Stateless application model
      • Requires data de-normalization (although Hosted SQL will mitigate some concerns in this area)
      • If the application can't load into memory within 1 second, it might not load and return 500 error codes (from Carlos Ble)
      • No request can take more than 30 seconds to run, otherwise it is stopped
      • Read-only file system access
    • Windows Azure Platform
      • Windows Azure instances are also inherently stateless – round-robin load balancer and non-persistent local storage
      • Also due to the need to abstract infrastructure complexities, little or no control for the underlying infrastructure is offered to applications
      • SQL Azure has individual DB sizing constraints due to its 3-replica synchronization architecture

    Again, just based on my understanding, and really not trying to paint a “who’s better or worse” comparative perspective. The point is, these so-called “differences” exist because of many architectural and technical decisions and trade-offs to provide the abstractions from the underlying infrastructure. For example, the list above is representative of most common infrastructure approaches of using homogeneous, commodity hardware, and achieve performance through scale-out of the cloud environment (there’s another camp of vendors that are advocating big-machine and scale-up architectures that are more similar to existing on-premises workloads). Also, the list above may seem unfair to Google App Engine, but on the flip side of those constraints, App Engine is an environment that forces us to adopt distributed computing best practices, develop more efficient applications, have them operate in a highly abstracted cloud and can benefit from automatic scalability, without having to be concerned at all with the underlying infrastructure. Most importantly, the intention is to highlight that there are a few common themes across the list above – stateless application model, abstraction from infrastructure, etc.

    Furthermore, if we take a cloud computing perspective, instead of trying to apply the traditional on-premises architecture principles, then these are not really “limitations”, but more like “requirements” for the new cloud computing development paradigm. That is, if we approach cloud computing not from a how to run or deploy a 3rd party/open-source/packaged or custom-written software perspective, but from a how to develop against the cloud platform perspective, then we may find more feasible and effective uses of cloud platforms than traditional software migration scenarios.

    Windows Azure as an “application platform”

    Fundamentally, this is about looking at Windows Azure as a cloud platform in its entirety; not just a hosting environment for Windows Server workloads (which works too, but the focus of this article is on cloud-optimized architecture side of things). In fact, Windows Azure got its name because it is something a little different than Windows Server (at the time of this writing). And that technically, even though the Windows Azure guest VM OS is still Windows Server 2008 R2 Enterprise today, the application environment isn’t exactly the same as having your own Windows Server instances (even with the new VM Role option). And it is more about leveraging the entire Windows Azure platform, as opposed to building solely on top of the Windows Server platform.

    For example, below is my own interpretation of the platform capabilities baked into Windows Azure platform, which includes SQL Azure and Windows Azure AppFabric also as first-class citizens of the Windows Azure platform; not just Windows Azure.


    I prefer using this view because I think there is value to looking at Windows Azure platform holistically. And instead of thinking first about its compute (or hosting) capabilities in Windows Azure (where most people tend to focus on), it’s actually more effective/feasible to think first from a data and storage perspective. As ultimately, code and applications mostly follow data and storage.

    For one thing, the data and storage features in Windows Azure platform are also a little different from having our own on-premises SQL Server or file storage systems (whether distributed or local to Windows Server file systems). The Windows Azure Storage services (Table, Blob, Queue, Drive, CDN, etc.) are highly distributed applications themselves that provide a near-infinitely-scalable storage that works transparently across an entire data center. Applications just use the storage services, without needing to worry about their technical implementation and up-keeping. For example, for traditional outsourced hosting providers that don’t yet have their own distributed application storage systems, we’d still have to figure out how to implement and deploy a highly scalable and reliable storage system when deploying our software. But of course, the Windows Azure Storage services require us to use new programming interfaces and models (REST-based API’s primarily), and thus the difference with existing on-premises Windows Server environments.

    SQL Azure, similarly, is not just a plethora of hosted SQL Server instances dedicated to customers/applications. SQL Azure is actually a multi-tenant environment where each SQL Server instance can be shared among multiple databases/clients, and for reliability and data integrity purposes, each database has 3 replicas on different nodes and has an intricate data replication strategy implemented. The Inside SQL Azure article is a very interesting read for anyone who wants to dig into more details in this area.

    Besides, in most cases, a piece of software that runs in the cloud needs to interact with data (SQL or no-SQL) and/or storage in some manner. And because data and storage options in Windows Azure platform are a little different than their seeming counterparts in on-premises architectures, applications often require some changes as well (in addition to the differences in Windows Azure alone). However, if we look at these differences simply as requirements (what we have) in the cloud environment, instead of constraints/limits (what we don’t have) compared to on-premises environments, then it will take us down the path to build cloud-optimized applications, even though it might rule out a few application scenarios as well. And the benefit is that, by leveraging the platform components as they are, we don’t have to invest in the engineering efforts to architect and build and deploy highly reliable and scalable data management and storage systems (e.g., build and maintain your own implementations of Cassandra, MongoDB, CouchDB, MySQL, memcarche, etc.) to support applications; we can just use them as native services in the platform.

    The platform approach allows us to focus our efforts on designing and developing the application to meet business requirements and improve user experience, by abstracting away the technical infrastructure for data and storage services (and many other interesting ones in AppFabric such as Service Bus and Access Control), and system-level administration and management requirements. Plus, this approach aligns better with the primary benefits of cloud computing – agility and simplified development (less cost as a result).

    Smaller pieces, loosely coupled

    Building for the cloud platform means designing for cloud-optimized architectures. And because the cloud platforms are a little different from traditional on-premises server platforms, this results in a new developmental paradigm. I previously touched on this topic with my presentation at JavaOne 2010, then later on at Cloud Computing Expo 2010 Santa Clara; just adding some more thoughts here. To clarify, this approach is more relevant to the current class of “public cloud” platform providers such as ones identified earlier in this article, as they all employ the use of heterogeneous and commodity servers, and with one of the goals being to greatly simplify and automate deployment, scaling, and management tasks.

    Fundamentally, cloud-optimized architecture is one that favors smaller and loosely coupled components in a highly distributed systems environment, more than the traditional monolithic, accomplish-more-within-the-same-memory-or-process-or-transaction-space application approach. This is not just because, from a cost perspective, running 1000 hours worth of processing in one VM is relatively the same as running one hour each in 1000 VM’s in cloud platforms (although the cost differential is far greater between 1 server and 1000 servers in an on-premises environment). But also, with a similar cost, that one unit of work can be accomplished in approximately one hour (in parallel), as opposed to ~1000 hours (sequentially). In addition, the resulting “smaller pieces, loosely coupled” architecture can scale more effectively and seamlessly than a traditional scale-up architecture (and usually costs less too). Thus, there are some distinct benefits we can gain, by architecting a solution for the cloud (lots of small units of work running on thousands of servers), as opposed to trying to do the same thing we do in on-premises environments (fewer larger transactions running on a few large servers in HA configurations).

    I like using the LEGO analogy below. From this perspective, the “small pieces, loosely coupled” fundamental design principle is sort of like building LEGO sets. To build bigger sets (from a scaling perspective), with LEGO we’d simply use more of the same pieces, as opposed to trying to use bigger pieces. And of course, the same pieces can allow us to scale down the solution as well (and not having to glue LEGO pieces together means they’re loosely coupled). Winking smile

    But this architecture also has some distinct impacts to the way we develop applications. For example, a set of distributed computing best practices emerge:

    • asynchronous processes (event-driven design)
    • parallelization
    • idempotent operations (handle duplicity)
    • de-normalized, partitioned data (sharding)
    • shared nothing architecture
    • fault-tolerance by redundancy and replication
    • etc.

    Asynchronous, event-driven design – This approach advocates off-loading as much work from user requests as possible. For example, many applications just simply incur the work to validate/store the incoming data and record it as an occurrence of an event and return immediately. In essence it’s about divvying up the work that makes up one unit of work in a traditional monolithic architecture, as much as possible, so that each component only accomplishes what is minimally and logically required. Rest of the end-to-end business tasks and processes can then be off-loaded to other threads, which in cloud platforms, can be distributed processes that run on other servers. This results in a more even distribution of load and better utilization of system resources (plus improved perceived performance from a user’s perspective), thus enabling simpler scale-out scenarios as additional processing nodes and instances can be simply added to (or removed from) the overall architecture without any complicated management overhead. This is nothing new, of course; many applications that leverage Web-oriented architectures (WOA), such as Facebook, Twitter, etc., have applied this pattern for a long time in practice. Lastly, of course, this also aligns well to the common stateless “requirement” in the current class of cloud platforms.

    Parallelization – Once the architecture is running in smaller and loosely coupled pieces, we can leverage parallelization of processes to further improve the performance and throughput of the resulting system architecture. Again, this wasn’t so prevalent in traditional on-premises environments because creating 1000 additional threads on the same physical server doesn’t get us that much more performance boost when it is already bearing a lot of traffic (even on really big machines). But in cloud platforms, this can mean running the processes in 1000 additional servers, and for some processes this would result in very significant differences. Google’s Web search infrastructure is a great example of this pattern; it is publicized that each search query gets parallelized to the degree of ~500 distributed processes, then the individual results get pieced together by the search rank algorithms and presented to the user. But of course, this also aligns to the de-normalized data “requirement” in the current class of cloud platforms, as well as SQL Azure’s implementation that resulted in some sizing constraints and the consequent best practice of partitioning databases, because parallelized processes can map to database shards and try not to significantly increase the concurrency levels on individual databases, which can still degrade overall performance.

    Idempotent operations – Now that we can run in a distributed but stateless environment, we need to make sure that same process that gets routed to multiple servers don’t result in multiple logical transactions or business state changes. There are processes that could and prefer duplicate transactions, such as ad clicks; but there are also processes that don’t want multiple requests be handled as duplicates. But the stateless (and round-robin load-balancing in Windows Azure) nature of cloud platforms requires us to put more thoughts into scenarios such as when a user manages to send multiple submits from a shopping cart, as these requests would get routed to different servers (as opposed to stateful architectures where they’d get routed back to the same server with sticky sessions) and each server wouldn’t know about the existence of the process on the other server(s). There is no easy way around this, as the application ultimately needs to know how to handle conflicts due to concurrency. Most common approach is to implement some sort of transaction ID that uniquely identifies the unit of work (as opposed to simply relying on user context), then choose between last-writer or first-writer wins, or optimistic locking (though any form of locking would start to reduce the effectiveness of the overall architecture).

    De-normalized, partitioned data (sharding) – Many people perceive the sizing constraints in SQL Azure (currently at 50GB – also note it’s the DB size and not the actual file size which may contain other related content) as a major limitation in Windows Azure platform. However, if a project’s data can be de-normalized to a certain degree, and partitioned/sharded out, then it may fit well into SQL Azure and benefit from the simplicity, scalability, and reliability of the service. The resulting “smaller” databases actually can promote the use of parallelized processes, perform better (load more distributed than centralized), and improve overall reliability of the architecture (one DB failing is only a part of the overall architecture, for example).

    Shared nothing architecture – This means a distributed computing architecture in which each node is independent and self-sufficient, and there is no single point of contention across the system. With data sharding and maintained in many distributed nodes, the application itself can and should be developed using shared-nothing principles. But of course, many applications need access to shared resources. It is then a matter of deciding whether a particular resource needs to be shared for read or write access, and different strategies can be implemented on top of a shared nothing architecture to facilitate them, but mostly as exceptions to the overall architecture.

    Fault-tolerance by redundancy and replication – This is also “design for failures” as referred to many cloud computing experts. Because of the use of commodity servers in these cloud platform environments, system failures are a common thing (hardware failures occur almost constantly in massive data centers) and we need to make sure we design the application to withstand system failures. Similar to thoughts around idempotency above, designing for failures basically means allowing requests to be processed again; “try-again” essentially.

    Lastly, each of the topic areas above is worthy of an individual article and detailed analysis; and lots of content are available on the Web that provide a lot more insight. The point here is, each of the principles above actually has some relationship with, and dependency on, the others. It is the combination of these principles that contribute to an effective distributed computing, and cloud-optimized architecture.

  • Architecture + Strategy

    Run Java with GlassFish in Windows Azure


    At PDC10 (Microsoft’s Professional Developers Conference 2010), Microsoft has again provided affirmation of support for Java in Windows Azure. “We're making Java a first-class citizen with Windows Azure, the choice of the underlying framework, the choice of the development tool.”, said Bob Muglia (President of Server and Tools at Microsoft), during his keynote presentation (transcript). Then during PDC Vijay Rajagopalan delivered a session (Open in the Cloud: Windows Azure and Java) which provided more details on the state of many deliverables, including:

    Vijay also talked about, during his presentation, a successful deployment of Fujitsu’s Interstage application server (a Java EE 6 app server based on GlassFish) in Windows Azure. Plus a whole slew of base platform improvements announced via the Windows Azure team blog, which helped to address many of the limitations we observed last year, such as not being able to use NIO as described in my earlier work with Jetty and Windows Azure.

    Lots of great news, and I was finally able to sit down and try some things hands-on, with the latest release of the Windows Azure SDK (version 1.3; November 2010) that included many of the announced improvements.

    Java NIO now works!

    First off, one major limitation identified previously was that because of the networking sandbox model in Windows Azure (for security reasons) also blocked the loopback adapter which NIO needed. At PDC this was discussed, and the fact that Fujitsu Interstage app server worked (which used GlassFish which used NIO) proved this works. And fortunately, there isn’t anything additional we need to do to “enable” NIO; it just works now. I tried my simple Jetty Azure project by changing it back to using the org.eclipse.jetty.server.nio.SelectChannelConnector, deployed into Windows Azure, and it ran!

    Also worth noting was that the startup time in Windows Azure was significantly improved. My Jetty Azure project took just a few minutes to become accessible on the Internet (it had taken more than 20 minutes at one point in time).

    Mario Kosmiskas also posted an excellent article which showed Jetty and NIO working in Windows Azure (and many great tips which I leveraged for the work below).

    Deploying GlassFish

    Since Fujitsu Interstage (based on GlassFish) already works in Azure, GlassFish itself should work as well. So I thought I’d give it a try and learn from the exercise. First I tried to build on the Jetty work, but started running into issues with needing Visual Studio and the Azure tools to copy/move/package large amounts of files and nested folders when everything is placed inside of the project structure – GlassFish itself has 5000+ files and 1100+ folders (the resulting super-long file path names for a few files caused the issue). This became a good reason to try out the approach of loading application assets into role instances from Blob storage service, instead of packaging everything together inside of the same Visual Studio project (as is the best practice for ASP.NET Web Role projects).

    This technique was inspired by Steve Marx’s article last year (role instance using Blob service), and realized using the work from Mario Kosmiskas’ article (PowerShell magic), I was able to have an instance of GlassFish Server Open Source Edition 3.1 (build 37; latest at the time of this writing) deployed and running in Windows Azure, in a matter of minutes. Below is a screenshot of the GlassFish Administration Console running on a subdomain (in Azure).


    To do this, basically just follow the detailed steps in Mario Kosmiskas’ article. I will highlight the differences here, plus a few bits that weren’t mentioned in the article. Easiest way is to just reuse his Visual Studio project ( I started from scratch so that I could use GlassFish for various names.

    1. Create a new Cloud project, and add a Worker Role (I named mine GlassFishService). The project will open with a base project structure.

    2. Copy and paste in the files (from Mario’s project, under the ‘JettyWorkerRole’ folder, and pay attention to his comments on Visual Studio’s default UTF-8 enocding):

    • lib\ICSharpCode.SharpZipLib.dll
    • Launch.ps1
    • Run.cmd

    Paste them into Worker Role. For me, the resulting view in Solution Explorer is below:


    3. Open ServiceDefinition.cscfg, and add the Startup and Endpoints information. The resulting file should look something like this:

    <?xml version="1.0" encoding="utf-8"?>
    <ServiceDefinition name="GlassFishAzure" xmlns="">
      <WorkerRole name="GlassFishService">
          <Import moduleName="Diagnostics" />
          <Task commandLine="Run.cmd" executionContext="limited" taskType="background" />
          <InputEndpoint name="Http_Listener_1" protocol="tcp" port="8080" localPort="8080" />
          <InputEndpoint name="Http_Listener_2" protocol="tcp" port="8181" localPort="8181" />
          <InputEndpoint name="Http_Listener_3" protocol="tcp" port="4848" localPort="4848" />
          <InputEndpoint name="JMX_Connector_Port" protocol="tcp" port="8686" localPort="8686" />
          <InputEndpoint name="Remote_Debug_Port" protocol="tcp" port="9009" localPort="9009" />

    As you can see, the Startup element instructs Windows Azure to execute Run.cmd as a startup task. And the InputEndpoint elements are used to specify ports that GlassFish server needs to listen to for external connections.

    4. Open Launch.ps1 and make a few edits. I kept the existing functions unchanged. The parts that changed are shown below (and highlighted):

    $connection_string = 'DefaultEndpointsProtocol=http;AccountName=dachou1;AccountKey=<your acct key>
    # JRE
    $jre = ''
    download_from_storage 'java' $jre $connection_string (Get-Location).Path
    unzip ((Get-Location).Path + "\" + $jre) (Get-Location).Path
    # GlassFish
    $glassfish = ''
    download_from_storage 'apps' $glassfish $connection_string (Get-Location).Path
    unzip ((Get-Location).Path + "\" + $glassfish) (Get-Location).Path
    # Launch Java and GlassFish
    .\jre\bin\java `-jar .\glassfish3\glassfish\modules\admin-cli.jar start-domain --verbose

    Essentially, update the script with:

    • Account name and access key from your Windows Azure Storage Blob service (where you upload the JRE and GlassFish zip files into)
    • Actual file names you’re using
    • The appropriate command that references eventual file locations to launch the application (or call another script)

    The script above extracted both zip files into the same location (at the time of this writing, in Windows Azure, it is the local storage at E:\approot). So your commands need to reflect the appropriate file and directory structure based on how you put together the zip files.

    5. Upload the zip files into Windows Azure Storage Blob service. I followed the same conventions of placing them into ‘apps’ and ‘java’ containers. For GlassFish, I used the downloaded directly from For Java Runtime (JRE) I zipped the ‘jre’ directory under the SDK I installed. To do the upload, many existing tools can help you do this from a user’s perspective. I used Neudesic’s Azure Storage Explorer. As a result, Server Explorer in Visual Studio showed this view in Windows Azure Storage (and the containers would show the uploaded files):


    That is it! Now we can upload this project into Windows Azure and have it deployed. Just publish it and let Visual Studio and Windows Azure do the rest:


    Once completed, you could go to port 8080 on the Website URL to access GlassFish, which should show something like this:


    If you’d like to use the default port 80 for HTTP, just go back to ServiceDefinition.cscfg and update the one line into:

    <InputEndpoint name="Http_Listener_1" protocol="tcp" port="80" localPort="8080" />

    GlassFish itself will still listen on port 8080, but externally the Windows Azure cloud environment will receive user requests on port 80, and route them to the VM GlassFish is running in via port 8080.

    Granted, this is a very simplified scenario, and it only demonstrates that the GlassFish server can run in Windows Azure. There is still a lot of work that are needed to enable functionalities Java applications expect from a server, such as systems administration and management components, integration points with other systems, logging, relational database and resource pooling, inter-node communication, etc.

    In addition, some environmental differences such as Windows Azure being a stateless environment, non-persistent local file system, etc. also need to be mitigated. These differences make Windows Azure a little different from existing Windows Server environments, thus there are different things we can do with Windows Azure instead of using it simply as an outsourced hosting environment. My earlier post based on the JavaOne presentation goes into a bit more detail around how highly scalable applications can be architected differently in Windows Azure.

    Java deployment options for Windows Azure

    With Windows Azure SDK 1.3, we now have a few approaches we can pursue to deploy Java applications in Windows Azure. A high-level overview based on my interpretations:

    Worker Role using Visual Studio deployment package – This is essentially the approach outlined in my earlier work with Jetty and Windows Azure. With this approach, all of the files and assets (JRE, Jetty distributable, applications, etc.) are included in the Visual Studio project, then uploaded into Windows Azure in a single deployment package. To kick-off the Java process we can write bootstrapping code in the Worker Role’s entry point class on the C# side, launch scripts, or leverage SDK 1.3’s new Startup Tasks feature to launch scripts and/or executables.

    Worker Role using Windows Azure Storage Blob service – This is the approach outlined in Mario Kosmiskas’ article. With this approach, all of the files and assets (JRE, Jetty distributable, applications, etc.) are uploaded and managed in the Windows Azure Storage Blob service separately, independent of the Visual Studio project that defines the roles and services configurations. To kick-off the Java process we could again leverage SDK 1.3’s new Startup Tasks feature to launch scripts and/or executables. Technically we can invoke the script from the role entry class in C# too, which is what the initial version of the Tomcat Accelerator does, but Java teams may prefer an existing hook to call the script.

    Windows Azure VM Role – The new VM Role feature could be leveraged, so that we can build a Windows Server-based image with all of the application files and assets installed and configured, and upload that image into Windows Azure for deployment. But note that while this may be perceived as the approach that most closely aligns to how Java applications are deployed today (by directly working with the server OS and file system), in Windows Azure this means trading off the benefits of automated OS management and other cloud fabric features.

    And perhaps, with the new Remote Desktop feature in Windows Azure, we can probably manually install and configure Java application assets. But doing so sort of treats Windows Azure as a simple hosting facility (which it isn’t) and defeats the purpose of all the fault-tolerance, automated provisioning and management capabilities in Windows Azure. In addition, for larger deployments (many VM’s) it would become increasingly tedious and error-prone if each VM needs to be set up manually.

    In my opinion, the Worker Role using Windows Azure Storage Blob service approach is ideally suited for Java applications, for a couple of reasons:

    • All of the development and application testing work can still be accomplished in the tools you’re already using. Visual Studio and dealing with Windows Azure SDK and tools are only needed from a deployment perspective – deploying the script that launches the Java process
    • Managing individual zip files (or other archive formats) and at any granularity level, instead of needing to put everything into the same deployment package when using Visual Studio for everything
    • Loading application assets from the Windows Azure Storage Blob service also provides more flexibility for versioning, reuse (of components), and managing smaller units of changes. We can have a whole set of scripts that load different combinations of assets depending on version and/or intended functionality

    Perhaps, at some point we could just upload scripts into Blob service and the only thing is to tell the Windows Azure management portal to run, as the Startup Task, which scripts for which roles and instances from the assets we store in the Blob service. But there are still things we could do with Visual Studio – setting up IntelliTrace, facilitating certificates for Remote Desktop, etc. However, treating the Blob service as a distributed storage system, automating the transfer of specific components and assets from the storage to each role instance, and launching the Java process to run the application, could be a very interesting way for Java teams to run applications in Windows Azure platform.

    Lastly, this approach is not limited to Java applications. In fact, as long as any software components that can be loaded from Blob storage, then installed in an automated manner driven by scripts (and we may even be able to use another layer of script interpreters such as Cygwin), they can use this approach for deployment in Windows Azure. By managing libraries, files, and assets in Blob storage, and maintaining a set of scripts, a team can operate and manage multiple (and multiple versions of) applications in Windows Azure, with comparatively higher flexibility and agility than managing individual virtual machine images (especially for smaller units of changes).

Page 1 of 1 (2 items)