We experienced our first major unscheduled downtime
We are deeply sorry that our service experienced its first major unscheduled downtime yesterday. We work hard to avoid any downtime, but I have known all along that I would be writing about this someday. You can hope that you never have to write about an outage, but as James Hamilton told me the other day, as a service, "if you never have an unscheduled downtime, you are either lying or you do not have any users". So here I am, writing the post that I had hoped I would never have to write. The downtime was reported by Roger Jennings here. Roger, as a user, received notification of the unavailability as soon as we discovered it. The definitive root cause is still under investigation, but this downtime enabled us to debug a few important things that a service has to get right to be reliable and trustworthy.
Our internal and external customer/user communication process worked, and we also identified a few things we can improve. Our instrumentation worked: the automated email alert was the first clue I got about the downtime, which allowed us to communicate with our customers quickly. Thanks for all the feedback you provided. Our restart and stabilization process worked once the all clear was sounded; again, we found a few things we can improve. Most important of all, the response system that our operations team put in place worked. We learned more about our systems, and about the overall datacenter environment in which we run, than we knew before yesterday. While unscheduled downtime is never good news and we work hard to avoid it, we hope to be better for this one.
The team is hard at work determining the definitive root cause, and we will report our findings and provide more details soon. Trust can only be built through transparency and by consistently delivering on your promises.
Sorry again for any inconvenience we may have caused.