Everything you want to know about Visual Studio ALM and Farming
Brian Harry is a Microsoft Technical Fellow working as the Product Unit Manager for Team Foundation Server. Learn more about Brian.
More videos »
Yesterday, we completed our sprint 60 deployment. With it comes two new features: Exporting test plans as HTML documents and a permission to control who can create new work item tags. You can read more about the new features on the service release notes here: http://www.visualstudio.com/news/2014-feb-10-vso.
As you may know, since ~Oct, we’ve had a run of “bad” deployments that caused unacceptable down time. Due to this we made a number of changes. One was to move our deployments to weekends to reduce the number of people that any unexpected interruptions affect. This was the second deployment to happen over a weekend and thankfully went very smoothly. Sprint 59’s weekend deployment did not go well though and we had a good number of people working all day Saturday and Sunday to get the service back to health. While the weekend deployment had the desired effect – it reduced the impact on customers, it confirmed for us the heavy toll it takes on the team.
In fact, we moved sprint 60’s deployment to this weekend (it was supposed to be last weekend) because we didn’t want to risk having to ask the team to miss the Super Bowl (which, I might add, the Seattle Seahawks dominated). Thankfully this one went smoothly and the number of people who had to work the weekend was comparatively small. However, it’s clear weekend deployments aren’t really going to be sustainable – in the sense of keeping a talented group of people willing to work 7 days a week. Next sprint we are going to try doing the deployment on a weekday evening. I ‘m well aware that no time is a good time for an outage. In fact, I even got some customer emails when 59’s weekend deployment experienced issues, but there are less bad times than others. For context, here’s a graph of activity on our service, starting from Sun morning.
In the end, the most important this is getting to the point where every deployment is a “non-event” (meaning no one even notices it happens because it is so seamless) – both for our customers and our employees. Time shifting the deployment isn’t a fix, it’s a mitigation and while it’s a wise precaution to take, the effect on the team is profound enough that we have to be very careful how much we rely on it.
In that light, we have done a lot of technical work in the past few months to “get back on top” of deployments. It became clear that the scale of the service had outgrown the engineering processes and resiliency we had built in and it was time to do more. Among other things we…
I’d like to think that the smooth deployment this weekend was a reflection of the work we’ve been doing but I’d like to see 3 or 4 smooth ones in a row before I’m willing to give too much credit. We’ll continue to work hard on it. I don’t expect to be at a new “plateau” of service stability and excellent for another few months but we’re definitely on the ascent.
Thanks for your patience and understanding as we continue to work to produce the best developer services on the planet. As always, if you see things we can do better, please let us know.