Everything you want to know about Visual Studio ALM and Farming
Brian Harry is a Microsoft Technical Fellow working as the Product Unit Manager for Team Foundation Server. Learn more about Brian.
More videos »
We had a pretty serious outage last Thursday all told it was a little over 5 hours. The symptoms were that performance was so bad that the service was basically unavailable for most people (though there was some intermittent access as various mitigation steps were taken). It started around 14:00 UTC and ended a little before 19:30 UTC. This duration and severity makes this one of the worst incidents we’ve ever had on VS Online.
We feel terrible about it and continue to be committed to doing everything we can to prevent outages. I’m sorry for the problems it caused. The team worked tirelessly from Thursday through Sunday both to address the immediate health issues and to fix underlying bugs that might cause recurrences.
As you might imagine, for the past week, we’ve been hard at work trying to understand what happened and what changes we have to make to prevent such things in the future. It is often very difficult to find proof of the exact trigger for outages but you can learn a ton by studying them closely.
On an outage like this, there’s a set of questions I always ask, and they include:
What happened was that one of the core SPS (Shared Platform Services) databases became overwhelmed with database updates and started queuing up so badly that it effectively blocked callers. Since SPS is part of the authentication and licensing process, we can’t just completely ignore it – though I would suggest that if it became very sluggish, it wouldn’t be the end of the world if we bypassed some licensing checks to keep the service responsive.
Though we’ve worked hard on this question, we don’t have any definitive answer (we’re still pursuing it though). We know that before the incident, some configuration changes were made that caused a significant increase in traffic between our “TFS” service and our “SPS” (Shared Platform Service). That traffic involved additional license validation checks that had been improperly disabled. We also know that, at about the same time, we saw a spike in latencies and failed deliveries of Service Bus messages. We believe that one or both of these were key triggers but we are missing some logging on SPS database access to be able to be 100% certain. Hopefully, in the next few days, we’ll know more conclusively.
This is different than the trigger in the sense that the trigger is often a condition that caused some cascading effect. The root cause is more about understanding why the effect cascaded and why it took the system down. It turns out that, I believe, the root cause was that we had accumulated a series of bugs that were causing extra SPS database work to be done and that the system was inherently unstable – from a performance perspective. It just took some poke at the system – in the form of extra identity or licensing churn to cause a ripple effect on these bugs. Most, but not all, of them were introduced in the last few sprints. Here’s a list of the “core” causal bugs that we’ve found and fixed so far:
We’ve also found/fixed a few “aggravating” bugs that made the situation worse but wouldn’t have been bad enough to cause serious issues on their own:
All of these, in some form, have to do with updates to identities in the system that then often cause propagating change notifications (which in some cases were over propagated) that caused extra processing/updates/cache invalidations. It was “unstable” because anything that caused an unexpected increased load in these identity updates would spiral out of control due to multiplicative effects and cycles.
I always want to look beyond the immediate and understand the underlying pattern. This is sometimes called “The 5 whys”. This is, in fact, the most important question in the list. Why did this happen and what can we do differently? Not what bugs did we hit. Why were those bugs there? What should we have done to ensure those bugs were caught in the design/development process before anything went into production?
Let me start with a story. Way back in 2008, when we were beginning to rollout TFS across very large teams at Microsoft, we had a catastrophe. We significantly underestimated the load that many thousands of people and very large scale build labs would put on TFS. We lived in hell for close to 9 months with significant performance issues, painful daily slowdowns and lots of people sending me hate mail.
My biggest learning from that was, when it comes to performance, you can’t trust abstractions. In that case, we were treating SQL Server as a relational database. What I learned is that it’s really not. It’s a software abstraction layer over disk I/O. If you don’t know what’s happening at the disk I/O layer, you don’t know anything. Your ignorance may be bliss – but when you get hit with 10x or 100x scale/performance requirement and you fall over dead. We became very deep in SQL disk layout, head seeks, data density, query plans, etc. We optimized the flows from the top to the very bottom and made sure we knew where all the CPU went, all the I/Os went, etc. When we were done, TFS scaled to crazy large teams and code bases.
We then put in place regression tests that would measure changes, not just in time but also in terms of SQL round trips, etc.
So back to last Thursday… We’ve gotten sloppy. Sloppy is probably too harsh. As with any team, we are pulled in the tension between eating our Wheaties and adding capabilities that customers are asking for. In the drive toward rapid cadence, value every sprint, etc., we’ve allowed some of the engineering rigor that we had put in place back then to atrophy – or more precisely, not carried it forward to new code that we’ve been writing. This, I believe, is the root cause – Developers can’t fully understand the cost/impact of a change they make because we don’t have sufficient visibility across the layers of software/abstraction, and we don’t have automated regression tests to flag when code changes trigger measurable increases in overall resource cost of operations. You must, of course, be able to do this in synthetic test environments – like unit tests, but also in production environments because you’ll never catch everything in your tests.
So, we’ve got some bugs to fix and some more debt to pay off in terms of tuning the interaction between TFS and SPS but, most importantly, we need to put in place some infrastructure to better measure and flag changes in end to end cost – both in test and in production.
The irony here (not funny irony but sad irony), is that there has been some renewed attention on this in the team recently. A few weeks ago, we had a “hack-a-thon” for small groups of people on the team to experiment with new ideas. One of the teams built a prototype of a solution for capturing important performance tracing information across the end-to-end thread of a request. I’ll try to do a blog post in the next couple of weeks to show some of these ideas. And just the week before this incident Buck (our Dev director) and I were having a conversation about needing to invest more in this very scenario. Unfortunately we had a major incident before we could address the gap.
OK, so we learned a lot, but what are we actually going to do about it. Clearly step 1 is mitigate the emergency and get the system back to sustainable health quickly. I think we are there now. But we haven’t addressed the underlying whys yet. So, some plans we are making now include:
As always, hopefully you get some value from the details behind the mistakes we make. Thank you for bearing with us.