Larry Franks and Brian Swan on Open Source and Device Development in the Cloud
One of the things that will be introduced in the next update to SQL Azure is a database sharding solution called SQL Azure Federations. Brian and I will have some examples of using this feature with specific languages once it releases, but we thought it would be nice to have an introductory article to explain a little about this feature and what it means to OSS developers.
So first, let’s summarize this entire article by stating that federations in SQL Azure should work with any OSS language or technology that can talk to SQL Azure. Federations will be implemented using SQL Azure Transact-SQL, so pretty much anything should be able to work with federated data. Keep reading for details and links to related information.
Your typical database contains one or more tables, which in turn contain rows that are made up of columns. So what happens when you have so much data that the database has grown to a massive size, you're running out of disk space, you're server can't handle all the incoming connections, or your index is huge and it's impacting your query performance? Often you turn to partitioning the data, splitting it up across multiple databases. Instead of one gigantic database, you now have a bunch of smaller ones. In most cases you wouldn't keep these small databases together on the same server, but instead split them up across multiple servers. This allows you to spread the processing load across multiple machines and scale out instead of up.
There's two ways that databases are typically partitioned: horizontally and vertically. Just think about the rows and columns in your database as a grid view; horizontal partitioning is splitting the rows, while vertical partitioning is splitting the columns. Sharding is a form of horizontal partitioning; breaking up your data at the row level. You end up with a bunch of databases (the shards,) each of which contains a slice of a table. So with sharding, you would end up with multiple databases (shards,) all of which contain the same table (let's say the customers table,) but each database (shard) only contains a specific subset of the total rows in the table.
Sharding can be a little tricky to implement, as you need to come up with a way to split the rows across databases that gives good utilization of all databases, and keeps all the data you commonly query on together. For instance, if you have customer table and an orders table, you might potentially use the customerID value for splitting rows. Rows in both tables that contain values 0-100 would go into databaseA, values 101-200 into databaseB, 201-300 into databaseC, etc. The thing to watch out for here is how you are assigning the customerID value; if you're sequentially assigning it to customers, databaseA is going to fill up before B and C are ever used. Ideally you would want to randomly assign the value so that new customers are spread across the databases.
Another challenge of sharding is that your application often has to have an understanding of the databases (shards) involved in the solution, so that it knows if a request comes in for customerID 167 that it needs to connect to databaseB to retrieve that information.
SQL Azure Federations is a sharding solution that is built into SQL Azure. One of the features of federations is that you can take a shard, and split it on the fly without any downtime. So you can create a federation that starts with one shard, covering all potential ranges (let's say 0-900 as an example.) If it starts approaching the maximum database size, or if your performance decreases due to large index size, etc., you can issue a command that splits the initial shard into two shards, both of which now cover a portion of the potential ranges for this federation. For example, one might now cover the range of 0-500 while the other covers 501-900.
Federations don't solve the first sharding problem I mention above, you still have to come up with a value to use that determines how you break the rows up across shards and ensure that it allows you to equally utilize the shards, but it does resolve the problem with having to know what database to connect to.
Going to gloss over these for now, as a full reference for them should be available once the feature goes live. But based on past blog posts from the SQL Azure product group, and some published hands on labs, here's what to expect:
This statement is used to create a new federation. Here’s an example:
CREATE FEDERATION Orders_Federation (CustID INT RANGE)
This would create a new federation named ‘Orders_Federation’. The 'CustID INT RANGE' is the federation distribution key, and specifies how data within this federation will be sharded; by range, using an integer value.
This statement will mark the database you run it in as the federation root database. The federation root database is used when performing some management tasks, and it is also responsible for routing connections to member databases. This command will also create the initial federation member (shard) for this federation. This initial member will cover all values from the min value to max value covered by the distribution type you specified.
Federation members are physically implemented as SQL Azure databases, and can be hosted on separate nodes within the Windows Azure data center. SQL Azure also automatically keeps three copies of them for disaster recovery purposes, just like all other databases in SQL Azure.
This statement is used to connect to a federation. This is where some of the federation magic happens; when connecting to a federation, your application doesn’t need to know anything about the underlying physical federation members, just the federation distribution key. The USE statement will connect to the federation root database, which will then route the connection to the federation member database that contains the rows matching the distribution value you specify. Here’s an example:
USE FEDERATION Orders_Federation (CustID = '100') WITH RESET, FILTERING = OFF
This statement will connect to the Orders_Federation root database, figure out what federation member (shard) contains the value of ‘100’ for the CustID distribution key, and route the connection to that database. FILTERING = OFF instructs the connection to make the entirety of the member database available on this connection. If FILTERING is set to ON, only rows containing the value of ‘100’ will be returned.
You can also connect directly to the federation root by specifying USE FEDERATION ROOT.
This statement is used to create federated tables; tables that will have their data partitioned across the federation members (shards). Here’s an example:
CREATE TABLE Customers(
CustomerID int NOT NULL,
CompanyName nvarchar(50) NOT NULL,
PRIMARY KEY (CustomerId) )
FEDERATED ON (CustID = CustomerID)
Note the FEDERATED ON clause. This associates the CustomerID field of this table with the CustID distribution key for this federation. This is how the federation knows how to split the rows of this table across the federation members (shards). After running this command, the Customers table will exist in every federation member, each of which will cover a subset of the table rows based on the value of the CustomerID field.
You can also create reference tables (tables created without the FEDERATE ON clause,) within a federation member. Generally reference tables hold static information that needs to be used in your queries, such as looking up what state based on a zip code. Since SQL Azure doesn’t support joins across databases, you’d want this reference information to be replicated in each federation member.
This is another bit of federation magic; you can alter the federation layout on the fly without incurring any downtime. This includes SPLIT operations that take one of your existing federation members (shards) and splits it into two new members. Here’s an example:
ALTER FEDERATION Orders_Federation SPLIT AT (CustID = '100')
This would find the federation member that contains the value ‘100’ for the CustID distribution key, and split it into two new federation members at that value. If this was performed against a federation that contained a single federation member (covering min value to max value for the INT type,) then after the operation the federation would contain two federation members; one containing distribution values from min value to 100 and another containing distribution values from 101 to max.
While this operation is going on, all data contained in the member being split is still available. You’re application has no clue that a split operation is going on (unless it initiated the operation or peeks at metadata about databases,) and queries should operate as normal. Once the operation is complete, rows for federated tables originally contained within the source federation member will now be contained with the two new members.
ALTER FEDERATION also supports DROP, which drops an existing federation member (shard). All data contained in the shard is dropped also, but the distribution range covered by the dropped database will now be covered by an adjacent member database.
There won’t be support for a MERGE operation in this release of federations, however it will be implemented at some point in the future: http://blogs.msdn.com/b/cbiyikoglu/archive/2011/07/11/shipping-fast-sharding-to-federations-in-from-pdc-2010-to-2011.aspx
No matter how many SPLIT or DROP operations you perform on a federation, the federation members will cover the entire range of values specified by the federation distribution key. In this case, the range of values for an INT.
SQL Azure federations is an upcoming feature of SQL Azure that allows you to dynamically scale your data, and since it’s implemented as SQL Azure Transact-SQL statements, any OSS technology that can talk to SQL Azure can use it.
Here’s some reference material that I used while creating this article: