We experience our first major unscheduled downtime

We are deeply sorry that our service experienced its first major unscheduled downtime yesterday.  We work hard to avoid any downtime, but I have known all along that I would be writing about this someday.  You can hope that you never have to write about an outage, but as James Hamilton told me the other day, for a service, "if you never have an unscheduled downtime, you are either lying or you do not have any users".  So here I am, writing the post that I was hoping I would never have to write.  The downtime was reported by Roger Jennings, who as a user received notification of the unavailability as soon as we discovered it.  The definitive root cause is still under investigation, but this downtime gave us the chance to debug a few important things that a service has to handle well to be reliable and trustworthy.

Our internal and external customer communication process worked, and we also identified a few things we can improve.  Our instrumentation worked: an automated email alert was the first clue I got about the downtime, which allowed us to communicate with our customers quickly.  Thanks for all the feedback you provided.  Our restart and stabilization process worked once the all clear was sounded, and again we found a few things we can improve.  Most important of all, the response system that our operations team put in place worked.  We now know more about our systems, and about the overall datacenter environment in which we run, than we did before yesterday.  No unscheduled downtime is good news, and we work hard to avoid it, but we hope to be better for this one.

The team is hard at work determining the definitive root cause, and we will report back with our findings and more details soon.  Trust can only be built through transparency and by consistently delivering on your promises.

Sorry again for any inconvenience we may have caused.

 

Published Thursday, August 28, 2008 2:02 AM by Soumitra Sengupta

Comments

# re: We experience our first major unscheduled downtime

Soumitra,

This is a GOOD thing - I'm glad it happened. The service will go down - better that you learn how to deal with it when you're in beta.

-Jamie

Saturday, August 30, 2008 4:23 AM by jamiet

# re: We experience our first major unscheduled downtime

Hi Jamie:

I think I know what you meant.  While I appreciate your support, I do not think it is ever a good thing for a service to go down.  We all know that it will happen, but in a world where more and more people depend on Internet-based services, an outage in any one of them can have serious implications.  We spend a lot of time making sure that a service does not go down.  We spend even more time figuring out how to recover from such failures and making sure no data is lost when one occurs.  Another property we try to cultivate is graceful degradation as opposed to catastrophic failure.  Unfortunately, we did not degrade gracefully; we went out quite quickly.  Not good.

Soumitra

Wednesday, September 03, 2008 12:59 PM by Soumitra Sengupta