Welcome to MSDN Blogs Sign in | Join | Help

Life in the clouds...

My take on everything from SQL Data Services, to coding, to life in general
Best Practice - Data Partitioning

As you all should know by now, SSDS uses what we call the ACE model, compared to a traditional relational model. ACE standing for Authority, Container, and Entity. Keep in mind that you could directly map your relational tables directly to SSDS Entities, we will cover that in another posting. This is specifically about how to partition your data and what we consider the "Best Practice"

The first thing you need to do when thinking about partitioning your data is ask yourself some questions...

  • What do my queries look like?
  • How can I maximize my throughput by spreading my data across containers?
  • How much data do I have?
  • What do my use cases look like?

These are all valuable questions you need to ask yourself when thinking about how you are going to store your data in SSDS. Lets take a moment and review the SSDS Architecture and how data is placed on the nodes...

image

If you recall, SSDS is comprised of a series of front end servers which expose our web services, and a series of backend servers which store the data. The key here is that when you create a container, that container is placed on a backend node that is selected using a proprietary algorithm. So if you store all of your data in a single container, you can guarantee that all your data will be on a single machine (the data is replicated for DR purposes). Since SSDS has many, many, many backend servers, Why not take advantage of them? Think about it this way, if you issue multiple queries to SSDS, do you want a single machine processing the queries, or do you want to have many machines process the queries? That is where partitioning your data comes in...

At TechEd I talked about a Movie Showtimes application and how best to partition that data.

The relational data model looked like this (many props to the MSN Movies team for giving me their data)

 image

In looking at my queries, access patterns and use cases I chose to use the Zip Code to partition my data. Why you ask? Well, first of all the application is to present Movie Showtimes. Most if not all users either start with a movie and then choose a location to see the corresponding showtimes, or they pick a location to see what's playing. By placing each zip codes data into its own authority, I have a very quick and easy query pattern. I also have spread my data across all the machines in the SSDS backend. So if I have 100 users looking for showtimes in 100 different zip codes, I know that chances are that I will get the benefit of having many machines process those request in parallel. Since I might want to see what movies were playing in a theater near me, I can use one of the many readily available web services out there to give me a list of zip codes that are in close proximity to me and then issues the query to those containers as well.

The point to all this is since SSDS has a ton of machines in the backend, you should take advantage of them. Now, one thing I want to make perfectly clear. SSDS is a multi-tenant system. My containers will be placed on machines that have other user's containers on them. I don't want anyone to get the impression that each container is placed on a dedicated machine. We do have the necessary mechanisms in place to ensure that a query to a single container won't consume all the available resources of a single backend machine.

Another point that should be made is with regards to Authorities. Authorities are the unit of geo-location. What that means is when you provision an Authority, you will get to choose the data center that the Authority is hosted in. With that being said, create your Authority in a data center that is in close proximity to your users. While this functionality is not turned on during the beta, it will be by the time we go live.

For the showtimes app, I put all west coast zip codes in a west coast authority and all east coast zip codes in an east coast authority like so...

image

 

So to recap...

  • Take advantage of the Geo-Location aspect of Authorities
    • Choose an authority closest to your users
  • Take maximum advantage of Containers
    • Containers are placed on individual nodes
    • Partitioning your data across containers maximizes your throughput on Query and CRUD operations.

 

If anything is not clear, or if you have specific questions around your specific scenarios, let me know. I'd be happy to help.

-Dave

Posted: Wednesday, June 18, 2008 11:57 AM by davidrob

Comments

jamiet said:

Hi David,

"when you provision an Authority, you will get to choose the data center that the Authority is hosted in"

Hmm...what if my application uses a single authority at the back-end but is used globally? Do you have recommendations to share for that scenario?

Thanks

Jamie

# June 18, 2008 6:13 PM

davidrob said:

Hey Jaime,

That is a great question. I would say at this point choose an Authority that is closest the majority of your users. We are looking at some scenarios to support replication of data across Authorities, but it is too early to comment on it.

HTH,

Dave

# June 18, 2008 6:20 PM

jamiet said:

Hi David,

Thanks for the reply.

As ever when you make far-reaching decisions such as data replication there are going to be trade-offs.

If you were to do this what would the trade-off be? Would you introduce eventual consistency?

-Jamie

# June 19, 2008 2:03 AM

davidrob said:

Jaime,

We are still ironing out the details on this. One point I would like to make is around eventual consistency.

SSDS is from the SQL team at Microsoft. We are an enterprise database team and to be honest things like eventual consistency don't fly here. How can you trust a data platform when if you do an operation and then try to retreive the updated record you may or may not get the most recent copy?

Try telling this to a customer...

"I'm sorry Mr. Jones, but even though you removed the 1000 widgets from your order, due to our data platforms EVENTUAL CONSISTENCY model when you clicked complete order, the delete didn't happen as of yet. Feel free to return the 1000 widgets and unfortunately we need to charge you a 20% restocking fee"

That won't fly will it?

# June 19, 2008 1:03 PM

jamiet said:

That's music to my ears David. I've never been comfortable with the idea of eventual consistency.

As a user I shouldn't really care about HOW you're going to achieve this (i.e. global replication whilst still upholding ACID) but nonetheless I'm intrigued. If you want to share anything around it then I'll be listening!

-Jamie (not Jaime :)

P.S. I've subscribed to email alerts for this comment thread but haven't received any emails - I never have for any MSDN blogs. Good job I subscribed to the RSS feed.

# June 20, 2008 4:18 AM

Life in the clouds... said:

In the first best practice , we looked at how we can gain maximum performance out of SSDS by partitioning

# July 6, 2008 11:43 PM

nickfr said:

David,

an interesting article. I have a scenario where I have in relational terms a table with millions of rows of data. It sounds like I should partition the data into multiple containers - I guess almost like a hash table. I was wondering on the practical limits on this. Can I have a million containers? each with say up to 1000 entities? I guess I don't really understand the performance characteristics of SSDS yet and your articles are the first that address this. Could you or the team expand on this area?

thanks

-NiCK

# August 1, 2008 8:21 AM
Anonymous comments are disabled
Page view tracker