The long-term "storecast"

Philosophy behind the design of SSDS and some personal thoughts

With Sprint 3 winding down, I thought it was a good time to share with you some of the philosophy behind the design of SQL Server Data Services (SSDS) and a few personal thoughts about the experience.  When we started this project two years ago, we realized that we had three fairly difficult problems to solve before we could credibly roll out an internet-scale data service.  The three big problems, in order of complexity, are:

a. Building a scale-free, highly available, consistent data service that is fault tolerant and self-healing
b. Building the service using low-cost, commonly available hardware
c. Building a service that is also cheap to operate - lights-out operation

Solving problems a and b incidentally makes problem c a bit more complex, as you end up with a lot more hardware to manage and the hardware tends to fail more often.  As a team, we made what I think was a wise decision to use technology already proven to solve these problems.  The only area where we had to do a ton of heavy lifting was solving problem a.  That decision allowed us to focus the team's energy on the most difficult problem when it comes to scaling out stateful services.  I am not going to go into the details of our approach.  If you are interested in hearing about this, I urge you to attend PDC 2008 and hear it from the architects who actually solved this problem.

For b and c we mostly used technologies already available within Microsoft and adapted them for stateful services.  Initially we thought this would be fairly easy, but it turned out to be more complicated than we expected, especially because we had to put in infrastructure software that allows us to debug problems with the service without attaching debuggers to the machine or touching the machine.  We had to put in logging and tracing infrastructure, and since we all got the logging and tracing religion, in one of our early sprints we inadvertently logged so much that we effectively shut down the service: there were no resources left to respond to user requests.  Some of our early experiences are fodder for some very interesting hallway conversations.  But they taught us that there are quite a few Ph.D.-level research topics around debugging large-scale distributed systems, and if you are up for it and interested in working on them, do give us a holler.  Even though we have quite a few Ph.D.s on the team, we could use some more help :-))
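As an illustration of the logging lesson above, here is a minimal sketch of one common way to keep diagnostics from starving request handling: a token-bucket budget on low-severity log records.  This is not a description of what SSDS did; the class name BudgetedLogFilter and its parameters are hypothetical, and only Python's standard logging module is assumed.

```python
import logging
import time


class BudgetedLogFilter(logging.Filter):
    """Hypothetical filter: drop chatty records once a per-second budget is spent."""

    def __init__(self, max_per_second, clock=time.monotonic):
        super().__init__()
        self.max_per_second = max_per_second
        self.clock = clock                    # injectable so the budget can be tested
        self.tokens = float(max_per_second)   # start with a full bucket
        self.last_refill = clock()
        self.dropped = 0                      # count of suppressed records

    def filter(self, record):
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the budget.
        elapsed = now - self.last_refill
        self.tokens = min(self.max_per_second,
                          self.tokens + elapsed * self.max_per_second)
        self.last_refill = now
        # Warnings and errors always pass; only lower levels are budgeted.
        if record.levelno >= logging.WARNING:
            return True
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        self.dropped += 1
        return False
```

Attaching it is a single call, for example logging.getLogger("svc").addFilter(BudgetedLogFilter(100)).  The design choice in this sketch is to let warnings and errors bypass the budget, so a flood of trace output cannot hide the messages needed to debug the flood itself.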

Since we made an early decision to limit the number of hard problems we needed to solve, we decided that we would focus less on the features of the service and more on the quality of the service and the cost of standing up and running the service.  The less the service does, we argued, the easier it would be for us to achieve our objectives.  In hindsight, this was probably one of the best decisions we made.  Istvan, Tudor and Nigel deserve special credit for keeping us focused on "less is better".  It also allowed us to learn about the pitfalls of running such a service, including upgrading the service without shutting it down.  We did not shy away from complex problems, but whenever we could limit the surface area without losing a ton of value, we took that path.  We are still in learning mode, discovering every day new workloads that cause "irregular heartbeats" in our service and how to handle them.  But the team has definitely come a long way, working with internal partner teams, working very closely with our operations guys (who by the way are absolutely awesome) and now with our beta customers.

While a service is still about software, and the fundamentals still hold, I think the engineering process, cadence and discipline it requires are quite different from shipping shrink-wrapped software.  It is easier in some dimensions (our test matrix is not huge, for example) and more difficult in others (debugging a large-scale distributed system).  We had to unlearn quite a few things (for instance, we learned that it is better to kill a sick process fast than to try to keep it up at all costs) before we could start climbing the learning curve.  It has really been quite an experience for us, one that I would not trade for anything else.

If I think about this experience as crawl, walk and then run, then as Dave Campbell puts it, "we are just about getting our knees off the ground so we can start to walk".  Yes, we have been cautious, and yes, it is frustrating that the rich capabilities of SQL Server are not accessible to our customers, but I think we are going about it the best way we know how and I am confident we are doing it the right way.  Over time, as we learn more about the system we have built, as we roll out more hardware in the datacenters (and find new problems), and as we learn what it takes to run a 24x7x365 service like SSDS (nobody we know of is running a data service using a commercial database system at this scale and cost point), I can assure you we will start to expose capabilities of the underlying engine.  How quickly and how much will depend on our ability to provide you, our customers, with the quality of service you need to trust your business to SSDS.

Thank you for your patience. 

Published Friday, June 27, 2008 7:21 AM by Soumitra Sengupta

Comments

 

jamiet said:

Great blog entry Soumitra. I really enjoy reading insights like this and I hope to read more (much more) in the future.

-Jamie

June 27, 2008 5:57 AM
