As you should all know by now, SSDS uses what we call the ACE model rather than a traditional relational model. ACE stands for Authority, Container, and Entity. Keep in mind that you could map your relational tables directly to SSDS Entities; we will cover that in another posting. This post is specifically about how to partition your data and what we consider the "Best Practice".

The first thing you need to do when thinking about partitioning your data is ask yourself some questions...

  • What do my queries look like?
  • How can I maximize my throughput by spreading my data across containers?
  • How much data do I have?
  • What do my use cases look like?

These are all valuable questions to ask yourself when thinking about how you are going to store your data in SSDS. Let's take a moment to review the SSDS architecture and how data is placed on the nodes...

image

If you recall, SSDS is composed of a series of front end servers, which expose our web services, and a series of backend servers, which store the data. The key here is that when you create a container, that container is placed on a backend node that is selected using a proprietary algorithm. So if you store all of your data in a single container, you can guarantee that all your data will be on a single machine (the data is replicated, but for DR purposes). Since SSDS has many, many, many backend servers, why not take advantage of them? Think about it this way: if you issue multiple queries to SSDS, do you want a single machine processing the queries, or do you want many machines processing them? That is where partitioning your data comes in...

At TechEd I talked about a Movie Showtimes application and how best to partition that data.

The relational data model looked like this (many props to the MSN Movies team for giving me their data)

 image

In looking at my queries, access patterns and use cases, I chose to use the Zip Code to partition my data. Why, you ask? Well, first of all, the application is there to present Movie Showtimes. Most if not all users either start with a movie and then choose a location to see the corresponding showtimes, or they pick a location to see what's playing. By placing each zip code's data into its own container, I have a very quick and easy query pattern. I have also spread my data across the machines in the SSDS backend. So if I have 100 users looking for showtimes in 100 different zip codes, chances are that I will get the benefit of having many machines process those requests in parallel. Since I might want to see what movies are playing in a theater near me, I can use one of the many readily available web services out there to give me a list of zip codes that are in close proximity to me and then issue the query to those containers as well.
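To make the "nearby zip codes" pattern concrete, here is a minimal sketch of fanning one query out to several zip code containers at once. It is not the SSDS API: `query_container`, `showtimes_near` and the `FAKE_CONTAINERS` sample data are all hypothetical stand-ins, and the sketch assumes one container per zip code whose id is the zip code itself.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List

# Hypothetical in-memory stand-in for SSDS: container id (zip code) -> entities.
# The theaters, movies and times below are made-up sample data.
FAKE_CONTAINERS: Dict[str, List[dict]] = {
    "98052": [{"theater": "Theater 1", "movie": "Movie A", "time": "19:30"}],
    "98004": [
        {"theater": "Theater 2", "movie": "Movie A", "time": "20:00"},
        {"theater": "Theater 2", "movie": "Movie B", "time": "18:15"},
    ],
}


def query_container(zip_code: str, movie: str) -> List[dict]:
    """Placeholder for a real query against one container (assumption, not the SSDS API)."""
    return [e for e in FAKE_CONTAINERS.get(zip_code, []) if e["movie"] == movie]


def showtimes_near(
    zip_codes: Iterable[str],
    movie: str,
    query: Callable[[str, str], List[dict]] = query_container,
    max_workers: int = 8,
) -> Dict[str, List[dict]]:
    """Query one container per zip code concurrently and group the results by zip code."""
    zips = list(dict.fromkeys(zip_codes))  # drop duplicates, keep the caller's order
    if not zips:
        return {}
    with ThreadPoolExecutor(max_workers=min(max_workers, len(zips))) as pool:
        # Each zip code is an independent request, so the requests can run in parallel.
        results = pool.map(lambda z: query(z, movie), zips)
        return dict(zip(zips, results))
```

Calling `showtimes_near(["98052", "98004"], "Movie A")` sends one request per container. Because each container may live on a different backend node, the requests do not have to queue up behind one another on a single machine.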

The point of all this is that since SSDS has a ton of machines in the backend, you should take advantage of them. Now, one thing I want to make perfectly clear: SSDS is a multi-tenant system. My containers will be placed on machines that have other users' containers on them. I don't want anyone to get the impression that each container is placed on a dedicated machine. We do have the necessary mechanisms in place to ensure that a query to a single container won't consume all the available resources of a single backend machine.

Another point that should be made is with regard to Authorities. Authorities are the unit of geo-location. What that means is that when you provision an Authority, you will get to choose the data center that the Authority is hosted in. With that being said, create your Authority in a data center that is in close proximity to your users. While this functionality is not turned on during the beta, it will be by the time we go live.

For the showtimes app, I put all west coast zip codes in a west coast authority and all east coast zip codes in an east coast authority like so...

image
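As a rough sketch of that layout, the routing from a zip code to its authority and container could look something like the code below. The authority names and the east/west rule are my assumptions for illustration only: the rule simply sends zip codes starting with 0 through 4 to the east coast authority and everything else to the west coast one, which is a crude split rather than the one used in the actual app.

```python
from typing import Tuple

# Hypothetical authority names; the real names are whatever you provision.
EAST_AUTHORITY = "showtimes-east"
WEST_AUTHORITY = "showtimes-west"


def authority_for_zip(zip_code: str) -> str:
    """Pick the authority for a 5-digit US zip code (assumed split on the first digit)."""
    z = zip_code.strip()
    if len(z) != 5 or not (z.isascii() and z.isdigit()):
        raise ValueError(f"not a 5-digit zip code: {zip_code!r}")
    # Assumption: leading digits 0-4 lean east, 5-9 lean west.
    return EAST_AUTHORITY if z[0] in "01234" else WEST_AUTHORITY


def container_address(zip_code: str) -> Tuple[str, str]:
    """Return (authority, container id), with one container per zip code."""
    return authority_for_zip(zip_code), zip_code.strip()
```

With this in place, `container_address("98052")` resolves to the west coast authority and `container_address("10001")` to the east coast one, so each query goes to the data center nearest the users in that zip code.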

So to recap...

  • Take advantage of the Geo-Location aspect of Authorities
    • Choose an authority closest to your users
  • Take maximum advantage of Containers
    • Containers are placed on individual nodes
    • Partitioning your data across containers maximizes your throughput on Query and CRUD operations.

If anything is not clear, or if you have specific questions around your specific scenarios, let me know. I'd be happy to help.

-Dave