Earlier this week we published a whitepaper entitled Sharding with SQL Azure to the TechNet wiki. In the paper, Michael Heydt and Michael Thomassy discuss the best practices and patterns to select when using horizontal partitioning and sharding with your applications.
Specific guidance shared in the whitepaper:
Overview
So what is sharding and partitioning, and why is it important?
Often the need arises where an application's data requires both high capacity for many users and support for very large data sets that require lightning performance. Or perhaps you have an application that by design must be elastic in its use of resources such as a social networking application, log processing solution or an application with a very high number of requests per second. These are all use cases where data partitioning across physical tables residing on seperate nodes; sharding or SQL Azure Federations, is capable of providing a performant scale-out solution.
In order to scale-out via sharding, an architect must partition the workload into independent units of data or atomic units. The application then must have logic built into it to understand how to access this data through the use of a custom sharding pattern or through the upcoming release of SQL Azure Federations.
Multi-Master Sharding Archetype
Also introduced by the paper is a multi-master sharding pattern where all shards are considered read/write, there is no inherent replication of data between shards and the model is referred to as a "shared nothing" as no data is shared or replicated between shards.
Use the Multi-Master Pattern if:
To use sharding with SQL Azure, application architeture must take into account:
SQL Azure Federations
SQL Azure Federations will provide the infrastructure to support the dynamic addition of shards and the movement of atomic units of data between shards providing built-in scale-out capabilities. Federation Members are shards that act as the physical container for a range atomic units. Federations will support online repartioning as well as connections, which are established with the root database to maintain continuity and integrity.
More Information
Additionally, we have spoken publicly about coming SQL Azure Federations technology at both PDC and PASS this year. Since that time we have published a number of blog posts and whitepapers for your perusal:
Nice to see something on sharding here. Even though I have a gruff point with some of Microsoft's staff, as they've reported that this is not sharding. There has been contention but hopefully you guys can work it out and educate those in the ranks that don't understand what sharding really is. With a consistent and well researched education coming form Microsoft, paying close attention to the path the industry has already tread, one can build some solid trust in solution Microsoft presents for sharding. I'm always happy to see good coverage of a topic like this.
Thanks!
thanks, adron. happy holidays!
This is a really good presentation of sharding, may I add a workaround I stumbled onto using F# but the same can be coded in .Net v4.0 Tuples, simplifying, you use the first entry of the tuple for your FederationID (until FERDERATION comes out), and the second is the value. Whatever system you choose, GUID, sequentiallly assigned ID's, ?? ... any will allow sync'ing, I preferred a 256-character string composed of server times, universal time, userID-connectionID & a random with collision checking works w/o the overhead of GUIDs but choices are there for preferences on this.
Printed this, very valuable to have, very well written, thanks o' ton.
Mulling things, only some user actions ever occur so worth handling, logging analysis of a user session history for all data retrieved will indicate what columns to include to increase concurrency by having the user data there as read-only once a datastore is created.
Without federation, why bother, but if you have to do it, seems a critical point to set up for high-concurrency data structures to add to your Azure deployment by design if you want to. My reaction so far is a Tuple, (UserID, UserDataEntity), can be one big row or a row per user per shard ... (FederationID, (UserID, UserDataEntity)).
What I haven't found yet is any performance overhead to using Tuples() this way, so far so good but not sure.
Nice paper.
Some suggestions for enhancements:
- code (or pseudo-code if it it's not firm yet) would help understanding in Section 5 around virtual/physical shard use, connections, data adaptors.
- being more explicit about what is firm and what is "planned" would be helpful.
Finally, is this limited to SQLAzure? That is, if I have sql server within my company, can I use the same techniques? Again, ensuring I know if it's Standard, Enterprise, etc. would be helpful
Thank you.