Authors: Jonathan Doyon (Genetec), Konstantin Dotchkoff (Microsoft)


Introduction

Running an app in Windows Azure allows you to address global markets. Once you have built an app for Windows Azure, you can deploy it in a datacenter (DC) of your choice, and you can easily expand your presence to new regions by deploying the app in multiple DCs. You can install an isolated instance of the app in each region, or you can have a cross-datacenter deployment. Here are a few reasons why you might want a cross-datacenter deployment:

  • Serving customers across geographies from a single logical system, exposing a single URL for ease of use
  • Providing cross-datacenter availability (in the case of a major site disaster)
  • Spreading load across multiple DCs for high performance and resiliency


The Stratocast solution, developed by Genetec, is an example of a multi-tenant solution globally distributed across multiple datacenters. Stratocast is a software security solution that allows users to view live and recorded video, safely stored in Windows Azure, from a laptop, tablet, or smartphone. The solution provides the ability to monitor multiple cameras and to play back recorded video.

In this article we describe how the solution addresses several specific requirements:

  • Minimizing latency for video streaming
  • Storing data in a specific location (ideally close to the source, or at a particular location for data governance and compliance reasons)
  • Keeping all data for a tenant at the same location
  • Gradually upgrading subparts of the system

High-Level Concept

Although the Stratocast solution is deployed across multiple Windows Azure datacenters, it appears as a single logical instance to the end customer. The user interface (UI) of the system is completely stateless (from a server perspective) and can serve end-users from any Azure datacenter. The middle-tier components of the solution require affinity based on the location of the tenant's data. Because of that, those two layers (UI and middle-tier) are handled differently.

The end-user access to the web portal is distributed by Windows Azure Traffic Manager. To serve a client request from the "closest" Azure DC, the Performance load-balancing method is used. In addition, Azure CDN can be used for the static content of the UI to further improve the experience.
Once the user has accessed the web portal, the location of the corresponding tenant is determined by performing a lookup against a central Azure SQL Database. This database has information about all tenants in the system and contains the URL of the cloud service responsible for serving each tenant. The location of the tenant is configured during initial provisioning: the customer is required to provide a valid country and state, which is used to select the closest Azure datacenter.
After determining the location of the tenant (using the lookup), the web portal connects to the appropriate cloud service in a specific Azure datacenter. All persisted tenant configuration and data is co-located with the cloud service in the same DC. This ensures that end-users are served from the closest geo-location, while in the background the data for one customer/tenant is kept together (potentially in a different location).
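The tenant lookup can be sketched as follows. This is an illustrative Python sketch, not Stratocast's actual schema or code: a plain dictionary stands in for the central Azure SQL Database, and the tenant names and service URLs are invented.

```python
# Illustrative stand-in for the central tenant database: it maps each
# tenant to the URL of the middle-tier cloud service that owns its data.
# (Tenant IDs and URLs below are invented for this sketch.)
TENANT_DIRECTORY = {
    "contoso":  "https://server-westus-01.cloudapp.net",
    "fabrikam": "https://server-northeu-02.cloudapp.net",
}

def resolve_middle_tier(tenant_id):
    """Return the cloud service URL responsible for the given tenant.

    In the real system this is a query against the central Azure SQL
    Database; here a dictionary lookup stands in for that query.
    """
    try:
        return TENANT_DIRECTORY[tenant_id]
    except KeyError:
        raise LookupError("unknown tenant: " + tenant_id) from None
```

The web portal performs this lookup once per session and then talks directly to the resolved cloud service, so tenant data never has to leave its home datacenter.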
The solution is multi-tenant by design, and each middle-tier cloud service deployment handles multiple tenants. For manageability, the system is also partitioned based on the size of a deployment (e.g. the number of connected cameras), and there are multiple middle-tier deployments in each datacenter. This simplifies tenant management and allows for partial upgrades.
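Capacity-based partitioning of this kind can be sketched as below. The deployment names, capacity numbers, and placement rule are assumptions for illustration; the article does not specify Stratocast's actual placement logic.

```python
# Hypothetical list of middle-tier deployments in one region, each with a
# camera-count capacity limit (all names and numbers are invented).
deployments = [
    {"name": "server-westus-01", "dc": "West US", "cameras": 480, "max": 500},
    {"name": "server-westus-02", "dc": "West US", "cameras": 120, "max": 500},
]

def assign_deployment(dc, cameras_needed):
    """Place a new tenant on the first deployment in the target DC that
    still has room for its cameras."""
    for d in deployments:
        if d["dc"] == dc and d["cameras"] + cameras_needed <= d["max"]:
            d["cameras"] += cameras_needed
            return d["name"]
    raise RuntimeError("no deployment with free capacity; provision a new one")

# A tenant with 50 cameras does not fit on the nearly full first
# deployment, so it lands on the second one.
chosen = assign_deployment("West US", 50)
```

Bounding each deployment's size keeps any single upgrade or failure scoped to a subset of tenants, which is what enables the partial upgrades mentioned above.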

With this topology, a tenant can be located in the West US Azure datacenter while a user accesses the solution from the east coast of the US and is connected to the web portal deployed in the East US datacenter. The web portal determines the "responsible" middle-tier in West US and calls its services as necessary. The UI communicates with the middle-tier in a loosely coupled manner. This is not only good practice because the communication will potentially cross datacenters; it is a recommended key pattern for communication between the components of a cloud app in general (there is a great blog post on this pattern that you might want to take a look at).

Solution Architecture Overview

The following graphic shows an example of the topology with, for simplicity, two Azure datacenters and one middle-tier cloud service deployment in each:

Figure 1 – Deployment overview with two Azure datacenters


Stratocast uses multiple queues to decouple the components and layers of the solution.
The web portal consists of a web role responsible for the UI, a worker role that implements business logic, and a queue and a Service Bus topic for asynchronous communication. Communication from the web role to the worker role goes through a queue (shown as WR Queue in figure 1); response communication from the worker role to the web role is implemented using Service Bus Topics.
Each middle-tier cloud service (i.e. server) has a queue for incoming requests (Server Queue in figure 1). The server picks up a request from the queue, processes it, and places the response on a "response" queue, which in turn is consumed by the web portal. Each web portal has its own queue for receiving messages from the middle-tier (Portal Queue in figure 1).
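The Server Queue / Portal Queue exchange is a classic request/reply-over-queues pattern. The following minimal in-process sketch uses Python's `queue.Queue` as a stand-in for Azure Storage queues; the message fields and operation name are invented for illustration.

```python
import queue

# Stand-ins for one middle-tier Server Queue and one Portal Queue.
server_queue = queue.Queue()   # incoming requests for a middle-tier server
portal_queue = queue.Queue()   # responses destined for one web portal

def portal_send_request(payload, reply_to):
    # The portal names the address of its own response queue in the
    # request, so any server can route the reply back to the right portal.
    server_queue.put({"payload": payload, "reply_to": reply_to})

def server_process_one():
    msg = server_queue.get()
    result = {"status": "ok", "echo": msg["payload"]}
    msg["reply_to"].put(result)   # reply to the queue named in the request

# One round trip: portal -> Server Queue -> server -> Portal Queue.
portal_send_request({"op": "get_camera_state", "camera": 7}, portal_queue)
server_process_one()
response = portal_queue.get()
```

Because the reply address travels with the request, the server needs no per-portal configuration, which is what lets many portals in different datacenters share the same middle-tier deployments.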

Let's expand a bit on the flow of communication between the components. Based on a user interaction, the web portal sends a request for an operation to its worker role using the WR Queue, including a generated unique transaction ID in the message. The web portal UI displays a message that the operation is in progress (e.g. updating camera configuration) and at the same time subscribes to a Service Bus topic for a notification with that transaction ID. The web portal worker role dispatches the request to the "right" middle-tier cloud service, which could be in the same or in a different DC. It places the request on the appropriate Server Queue and includes the transaction ID and the address of its own Portal Queue, where the response should be sent.
After processing the request, the server sends a response back to the Portal Queue. Once the message is received, the web portal worker role performs business logic, persists information as required, and posts a notification for the UI through the Service Bus topic. The web server that is listening on the Service Bus topic for that notification updates the UI with the outcome of the operation. Using Service Bus Topics for the UI notifications allows for long-running transactions (e.g. some operations, such as plugging in a camera, may take a long time) and out-of-order processing.
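The transaction-ID correlation at the heart of this flow can be sketched as follows. This is an illustrative in-process sketch: a dictionary of callbacks stands in for Service Bus topic subscriptions, and the function names are invented.

```python
import uuid

# Pending operations, keyed by transaction ID (stands in for per-transaction
# Service Bus topic subscriptions on the portal side).
pending = {}

def start_operation(update_ui):
    """Portal side: generate a unique transaction ID and register the UI
    callback that should fire when the outcome notification arrives."""
    tx_id = str(uuid.uuid4())
    pending[tx_id] = update_ui
    return tx_id

def on_notification(tx_id, outcome):
    """Fired when the worker role posts an outcome to the topic. Arrival
    order does not matter: each notification is matched to its operation
    by transaction ID, which is what permits long-running operations and
    out-of-order completion."""
    callback = pending.pop(tx_id, None)
    if callback:
        callback(outcome)

# One operation completing and updating the UI (here, a results list).
results = []
tx = start_operation(results.append)
on_notification(tx, "camera configuration updated")
```

Because the UI waits on an ID rather than on a specific connection, a long-running operation can complete minutes later, or after other operations, and still reach the right place.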

For the sake of completeness, we should mention that not all Stratocast communication goes through the queues. There are solution-specific requirements that demand "live" service calls. For example, video streaming is served back to the client directly from the server to minimize latency. Other operations that require "live" information are implemented through web services hosted on the server worker role. With the exception of those "special" operations, all transactional operations use the asynchronous communication pattern described above, which is fundamental when building distributed systems.

Summary

The presented example demonstrates important considerations for cross-datacenter deployments. Apart from the main goal of serving customers from the closest datacenter, having the web portal separated from the middle-tier services, combined with the loosely coupled communication through queues, minimizes the impact of failures of individual components. Software maintenance of the web portal can also be performed more easily by redirecting traffic to another datacenter for the duration of the maintenance. The partitioning of the middle-tier components allows for gradual upgrades. And the failover load-balancing method of Azure Traffic Manager can be used to redirect traffic to another datacenter in the case of a disaster.
In general, the described techniques (distributing an installation across multiple datacenters, partitioning, and decoupling the components of the solution) can improve the overall availability of the system.