Larry Franks and Brian Swan on Open Source and Device Development in the Cloud
One of the projects I’m working on during my day job is pulling together information on how Windows Azure can be used to host social applications (i.e., social games). It’s an interesting topic, but I think I’ve managed to distill it down to the basics and wanted to put it out here for feedback. This post is just going to talk about some high-level concepts and isn’t going to drill into any implementation details.
Note: this post won’t go into details of client implementation; it only examines the server-side technologies and concerns.
The basic requirement for any social interaction is communication. The client sends a message to the server, which sends the message to other users. This can be accomplished internally in the web application if both clients are connected to the same instance, but what about when we scale this out to multiple servers?
Once we scale out, there are a couple of options: instances can communicate with each other directly, or they can pass messages through an intermediary such as a queue.
While direct communication is probably the fastest way to do inter-instance communication, it’s not the best solution in the cloud. Multi-instance direct communication would normally involve building and maintaining a map of which users are on which instances, then directing communication between instances based on which users are interacting. Instances in the cloud may fail over to different hardware if the node they are running on encounters a problem, if the instance needs more resources, and for a variety of other reasons. What it boils down to is that instances are going to fail over, which will cause subsequent communications from the browser to hit a different instance. Because of this, you should never rely on the user-to-server-instance relationship being constant.
It may make more sense to use the Windows Azure Queue service instead, as this allows you to provide guaranteed delivery of messages in a pull fashion. The sender puts messages in, the receiver pulls them off. Queues can be created on the fly, so it would be fairly easy to create one per game instance. The only requirement of the server in this case is that it can correctly determine the queue based on information provided by the client, such as a value stored in a cookie.
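To make the queue-per-session idea concrete, here’s a minimal sketch of how an instance might derive a stable queue name from a value the client sends back in a cookie. The function name and the `player42|match-0007` session id are made up for illustration; the normalization just aims to satisfy the Queue service’s naming rules (lowercase alphanumerics and dashes, 3–63 characters), and the actual calls to create and read the queue are omitted.

```python
import hashlib

def queue_name_for_session(session_id: str) -> str:
    """Map a game session id to a stable, valid queue name."""
    # sha1 hex digits are lowercase alphanumerics, so the result is
    # always a legal queue name regardless of what's in the cookie.
    digest = hashlib.sha1(session_id.encode("utf-8")).hexdigest()[:16]
    return f"game-session-{digest}"

# Any instance that receives a request carrying the same cookie value
# computes the same queue name, so no user-to-instance map is needed.
name = queue_name_for_session("player42|match-0007")
```

Because the name is derived purely from the cookie, it doesn’t matter which instance handles a given request after a failover.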
Beyond queues, other options include the Windows Azure Blob service and Table service. Blobs are definitely useful for storing static assets like graphics and audio, but they can be used to store any type of information. You can also use blob storage with the Content Delivery Network, which makes it a shoo-in for any data that needs to be read directly by the client. Tables can’t be exposed directly to the client, but they are still useful in that they provide semi-structured key/value storage. There is a limit on the amount of data they can store per entity/row (a total of 1 MB), but they provide fast lookup of data if the lookup can be performed using the partition key and row key values. Tables would probably be a good place to store session-specific information that is needed by the server, but not necessarily by the client.
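As a sketch of what keying session data by partition key and row key might look like: below, an in-memory dict simply stands in for the Table service, and the entity shapes (a session id as PartitionKey, a player id as RowKey) are illustrative assumptions rather than a prescribed schema.

```python
# The dict below is a stand-in for the Table service; real code would
# issue REST calls or use a storage client library instead.
session_table = {}

def upsert_entity(partition_key: str, row_key: str, properties: dict) -> None:
    session_table[(partition_key, row_key)] = {
        "PartitionKey": partition_key,  # e.g. the game session id
        "RowKey": row_key,              # e.g. the player id
        **properties,
    }

def get_entity(partition_key: str, row_key: str) -> dict:
    # A point query: both keys are known, so no table scan is needed.
    return session_table[(partition_key, row_key)]

upsert_entity("session-0007", "player42", {"Score": 1200, "Zone": "forest"})
entity = get_entity("session-0007", "player42")
```

The design choice that matters is that every lookup the server performs during play supplies both keys; queries that can’t do that fall back to scans, which is where table performance suffers.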
SQL Azure is more tailored to storing relational data and performing queries across it. For example, if your game has persistent elements such as personal items that a player retains across sessions, you might store those into SQL Azure. During play this information might be cached in Tables or Blobs for fast access and to avoid excessive queries against SQL Azure.
Windows Azure also provides a fast, distributed cache that could be used to store shared data; however, it’s relatively small (4 GB max) and relatively expensive ($325 for 4 GB as of November 7, 2011). Also, it can currently only be used by .NET applications.
I mentioned latency earlier. I’ll skip the technical explanation and just say that latency is the amount of delay your application can tolerate between one user sending a message and other users receiving it. The message may be anything from an in-game e-mail, which tolerates high latency well, to an attempt to poke another player with a sharp stick, which doesn’t.
Latency is usually measured in milliseconds (ms), and the lower the better. The link to AzureScope can provide some general expectations of latency within the Azure network; however, there’s also the latency of the connection between Azure and the client. This is something that’s not as easy to control or estimate ahead of time.
When thinking about latency, you need to consider how immediate a user expects a social interaction to be. I tend to categorize expectations into ‘shared’ and ‘non-shared’ experience categories. In general, the expectation of immediacy is much higher for shared experiences. For example, poking another player with a sharp stick is a shared experience: both users are in the session at the same time and expect to see the result immediately. Sending an in-game e-mail is a non-shared experience: the recipient may not even be online when the message is sent.
The most basic thing you can do to ensure low latency is host your application in a datacenter that is geographically close to your users. This will generally ensure good latency between the client and the server, but there are still a lot of unknowns in the connection between client and server that you can’t control.
Internally in the Azure network, you want to perform load testing to ensure that the services you use (queues, tables, blobs, etc.) maintain a consistent latency at scale. Architect the solution with scaling in mind; don’t assume that one queue will always be enough. Instead, allocate queues and other services dynamically as needed. For example, allocate one queue per game session and use it for all communication between users in the session.
Another concern is the point at which the data being passed exceeds your available bandwidth. A user connecting over dial-up has much less bandwidth than one connecting over fiber, so obviously you need to control how much data is sent to the client. However, you also need to consider how much bandwidth your web application can handle.
According to the billing information on the various role VM sizes at http://msdn.microsoft.com/en-us/library/dd163896.aspx#bk_Billing (see the “What is a Compute Instance” section), different role sizes have different peak bandwidth limits. For the largest bandwidth of 800 Mbps, you would need an ExtraLarge VM size for your web role. Another possibility would be to go with more instances that each have less bandwidth and spread the load out. For example, 8 Small VMs have the same aggregate bandwidth as one ExtraLarge VM.
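A quick back-of-the-envelope check of that trade-off: the per-size peak figures below follow the MSDN table cited above, while the per-user rate passed in is a made-up example number, not a measured one.

```python
# Peak bandwidth per role size, in Mbps, per the MSDN billing table.
PEAK_MBPS = {"ExtraSmall": 5, "Small": 100, "Medium": 200,
             "Large": 400, "ExtraLarge": 800}

def aggregate_mbps(size: str, instance_count: int) -> int:
    """Total peak bandwidth across a set of identical instances."""
    return PEAK_MBPS[size] * instance_count

def max_concurrent_users(size: str, instance_count: int,
                         kbps_per_user: int) -> int:
    # Treat 1 Mbps as 1000 kbps and round down to whole users.
    return aggregate_mbps(size, instance_count) * 1000 // kbps_per_user
```

So 8 Small instances match one ExtraLarge on paper, but the smaller instances also let you scale the count up or down as demand changes instead of paying for the big VM around the clock.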
Compressing data, caching objects on the client, and limiting the size of data structures are all things you should be doing to reduce the bandwidth requirements of your application. That’s really about all you can do.
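As a small illustration of the compression point, here’s a sketch that shrinks a JSON message payload with zlib before it goes over the wire. The message shape is invented, and in practice most web servers can do this transparently via HTTP gzip/deflate when the client advertises support, so treat this as the idea rather than a recommended wire format.

```python
import json
import zlib

def compress_message(message: dict) -> bytes:
    # Compact separators drop the whitespace json.dumps adds by default.
    raw = json.dumps(message, separators=(",", ":")).encode("utf-8")
    return zlib.compress(raw, level=9)

def decompress_message(payload: bytes) -> dict:
    return json.loads(zlib.decompress(payload).decode("utf-8"))

message = {"from": "player42", "action": "poke", "repeat": 3}
payload = compress_message(message)
```

Note that for tiny messages like this one the compressed form can actually be larger than the original, so it’s worth measuring with realistic payload sizes before turning compression on everywhere.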
While this post has focused on Windows Azure technologies, the core information should apply to any cloud: the expectation of immediacy in social interactions provides the baseline that you want to meet in passing messages, while latency and bandwidth act as the limiting factors. But I don’t think I’m done here. What am I missing? Is there another concern beyond communication, or another limiting factor beyond bandwidth and latency?
As always, you can leave feedback in the comments below or send it to @larry_franks on Twitter.
What about Registration / Single Sign On with ACS and the decision of how and where to keep profile data?
Good question, Wade. ACS is definitely the way I would go for authentication. It’s fairly easy to set up, and it allows you to accept Facebook logins (where a lot of the social gaming action is).
As to storing the profile information, you could probably get by in a pinch with table storage. But if you want to model any sort of relationships with the profiles, such as sharing the profile across games or friend relationships with other players, then SQL Azure would be the best option.
What I would probably do is store the data in SQL Azure, then once a user logs on I’d cache the data in either the AppFabric cache or table storage to avoid hits on the database while the user is logged in.
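That’s the classic cache-aside pattern. A minimal sketch of the idea, where plain dicts stand in for both SQL Azure and the cache (real code would call those services, and the profile data here is invented):

```python
# Stand-ins for SQL Azure and the AppFabric/table-storage cache.
profile_db = {"player42": {"name": "Larry", "friends": ["brian"]}}
profile_cache = {}
db_reads = 0  # instrumented so we can see how often the database is hit

def get_profile(user_id: str) -> dict:
    global db_reads
    if user_id in profile_cache:      # cache hit: no database round trip
        return profile_cache[user_id]
    db_reads += 1                     # cache miss: query SQL Azure once...
    profile = profile_db[user_id]
    profile_cache[user_id] = profile  # ...and populate the cache for next time
    return profile

first = get_profile("player42")
second = get_profile("player42")
```

Every lookup after the first comes from the cache, so the database only sees one read per login session instead of one per request.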