The announcement of Windows Azure is a big milestone for us in the Astoria team. We got a chance to add our little contribution to the platform by providing data service interfaces for a couple of the Azure services.

Currently there are two services that use the ADO.NET Data Services runtime: the Windows Azure Tables Service, which was announced this week as part of the whole Windows Azure story, and SQL Data Services, which has been around for a while but got a new experimental Data Services interface this week to coincide with the PDC.

These services -and others that will come in the future also based on Data Services- share a common aspect: they have extreme scalability requirements.

In order to enable them to use our Data Services server runtime we had to extend the data service framework to make it scale in various new dimensions. In the rest of this post I'll summarize some of the walls we hit and the changes we made to the system to handle these scenarios.

Things that already scaled

The Data Services runtime already incorporates many design principles that help with scalability.

For example, the system does not keep any required state between requests (we do cache stuff, but we can throw it away at any time), so scale out of the front-end servers of the storage systems is relatively straightforward. This allows the existing runtime to handle an arbitrarily large number of requests by throwing more front-ends to the problem (as long as the back-end systems can take it, of course).

Also, we don't make any assumptions around the size of the data and provide mechanisms to push-down filters in requests to the data source, so effectively in principle there are no limits to the amount of data that a data service may be fronting.

Hitting the scalability wall

While some things scaled, there are certain aspects in which we ran into a scalability wall that required a number of changes in the system.

Using .NET types to represent the shape of the services is great in a single application, but not-so-great if you have millions of users with hundreds or thousands. We needed another way of describing the "shape of the data in the service", that is the metadata or schema of the service.

Since you can't practically create a distinct type for every user/application/table in the system, that means that the instances of objects that represent data flowing through the data services runtime cannot be of a specific type for each entity type. Instead, we needed independent of the flow format with respect of the declared types.

Metadata and service schema

The data services runtime needs to know the "schema" of each service it exposes. That is, the list of entity-sets, the entity-types of the instances living in those entity sets and the relationships between the various entities.

In a typical data service, the service exposes data for a given application or domain-specific service, so the schema of the service is known and static (within a given version at least) and all the front-end servers simple share the same schema.

The way a service author specifies the schema of a service in the shipping version of the Data Services runtime is by using .NET classes or an Entity Framework model (which in turn generates .NET classes). That works great for application developers, because .NET classes are a simple and natural way of defining the shape of your objects.

Now, if the requirement is to be able to handle millions of applications, each of which can have hundreds or thousands of tables, does that mean that we have to create a .NET type for each service and for each table, and the corresponding number of properties and such? And if so, since the front-end systems are stateless and potentially don't have any affinity to parts of the data, does that mean that any given system may end up having to load up millions of types in memory? To complicate things further, once you load an assembly (the only container in which .NET types can exist), you can't unload it unless you unload the AppDomain.

.NET types are a great solution for the scenarios where the schema is known and more or less bounded, and will continue to be the primary way of creating services in that context. However, we needed something else to handle the high-end side of the spectrum.

To address this need we introduced a new interface that data services can optionally implement. We already had the internals of the system organized more or less like this, but didn't expose it in the first release. The idea is that there is main split between the "upper half" of the runtime that deals with URL translation, LINQ expression tree generation, interceptors, policies and all aspects that make a Data Service look like a Data Service. The "bottom half" is the "data service provider", and is responsible for describing the shape of the service among other things. There are two built-in "data service providers", the Entity Framework provider which is what you use when you create a data service over an Entity Framework model, and the reflection-based provider which is what you use when creating a service on top of an arbitrary object graph. With the new change you can now create new implementations of these data service provider thingies that can obtain and manage metadata any way they want.

The way we interact with the provider is carefully designed to avoid requiring long term state state in the provider or the consumer of the provider in any way, while at the same time allowing the provider to do caching of metadata and control information if desired.

First, we never hold on to information returned by the provider beyond the scope of a single request. So for all we know the provider could be reloading all the metadata in every request. In practice, providers will probably cache this metadata in some way or another.

Second, we load metadata on demand and piecemeal. For example, during URI translation we do a small scale version of the usual binding and semantic analysis that any compiler does, and for that we need metadata. In those cases we don't load all the metadata, but only the pieces we need to do type checks, symbol lookups, etc.

Making metadata dynamic

Another aspects around metadata to consider is the fact that the shape of one of these data services can be altered at any time. For example, the Azure Table service has the concept of tables, and you can add and remove tables whenever you want.

The new scheme with custom data service providers make this possible because we don't remember anything at all across requests. So all the provider needs to do when the underlying shape of the data changes is report a different schema on the next request, and the data services runtime will happily take it.

With .NET types this would have meant creating and re-distributing new types (or creating them on demand on each node), and dealing with not being able to unload the old types from memory. Clearly not an option at this scale.

Flow format independence

With the addition of the "data service provider" interfaces we no longer have .NET types to use for the instances of each entity-type that flows through the system (e.g. from the data source to the runtime via the IEnumerables returned in LINQ queries, and from there to the serialization stack).

Another important change we made in the system is that we no longer assume anything about the shape of each CLR object returned by the query. We treat instances just as "object" all over the code base. When we need to access a member, we use methods in the data service provider interface to do that, imagine something like GetPropertyValue(object o, string name).

That means it's now possible to use some form of generic record type across the system. Not only this avoids the need for specific types, but also allows providers to piggyback control information in the instances themselves, avoid copies from the original format into CLR objects just to flow them through our runtime and a few more benefits.

Impact on LINQ expressions

While having flow format independence is great, it did complicate things for query formulation.

We typically translate URLs to expression trees, and since we have all the CLR types in the server that correspond to the entity types, all the expression trees are nice and clear.

When we're operating against unknown types we can't generated "typed" expression trees anymore. In those cases we still produce expression trees, but the member-access operations (and certain operators) are represented using custom calls to a well-known set of static members. The providers that enable this feature need to know about this and do proper translation of these expression trees.

Extension to the data model

We did one more major change that while it's not directly related to scalability it has a lot to do with the database/storage services in Windows Azure.

In the current version of Data Services types are "closed" in the sense that they have a structure that's final. You list a set of properties for each type and instances of that type cannot have properties added dynamically.

It turns out that the data services we have online have a more flexible model, where each entity has a fixed portion but also a dynamic portion. Typically the fixed portion includes a key or some sort and a version property. The dynamic portion is a property bag where you can add any name/typed-value pair.

We call these types that can be extended on a per-instance basis at runtime "open types". We introduced support for open types in the Data Services runtime such that you can mark a given entity type as "open" in metadata and that would cause the system to allow unknown properties to be set, as well as the use of unknown properties in queries (e.g. in filter predicates).

There is a lot of details around open types that I won't go into here, maybe the topic for another post, but I wanted to point out the change because it was a significant addition.

What do these changes mean for developers?

What does all this mean to current users of data services. Well...not much other than some background on how the system is evolving. Other than open types, services created with custom metadata/custom flow formats are indistinguishable from the ones created the "classic" way.

Furthermore, we will preserve the existing model where creating a service based on some .NET objects or an Entity Framework schema is really straightforward, and we consider that our primary scenario for developers.

At the same time, addressing the needs for the highest-end services out there is important, so many (if not all) of these changes will eventually make it into the shipping product so that other folks out there can use them if they chose to. Beware that these interfaces are not designed to be "nice", but rather optimized for control and efficiency, so it may not be exactly a fun experience, but you'll get all the scalability you'll need out of them.

-pablo