Everything you want to know about Visual Studio ALM and Farming
Brian Harry is a Microsoft Technical Fellow working as the Product Unit Manager for Team Foundation Server.
I’ve been meaning to write about this for a while but somehow the days just slip by and I never find the time.
If you are a reader of my blog then you’ve been seeing my posts on our service updates for months now. But let me rewind a bit and start at the beginning.
About 2 years ago we began the journey to bring TFS to the cloud. In the beginning it was just an experiment – Can we port TFS to Azure and have a viable cloud hosted solution? It took a summer to prove that we could do it and the fall/winter to shore it up and make it production ready. So a little over a year ago, we decided we were serious about this and started asking what a product plan for it looked like.
Obviously our background is as an on-premises mission critical server team – we’ve been doing that for 10 years. Further, we’re part of the Microsoft “machine” and that has its own set of ingrained practices. We shipped on 2-3 year cycles. I believe we had/have a very good 2-3 year cycle with strong customer engagement, good planning, agile release mechanisms (like Power Tools), etc but still – it’s a 2-3 year cycle. We knew going into the cloud space, that wasn’t going to work.
In the cloud, people expect freshness. They expect progress. If your site/app hasn’t progressed recently, people start to assume it’s dead. We needed to rethink a lot about what we do. That thinking started with trying to figure out what we wanted. Like people often do, we started with what we knew and tried to evolve from there. We spent a few months thinking through “What if we do major releases every year and minor releases every 6 months?”, “Major releases every 6 months, patches once a month?”, “What if we do quarterly releases – can we get the release cycle going that fast?”, etc. We spent time debating what constitutes a major release vs a minor release. How much churn are customers willing to tolerate? We went round and round. Ultimately we concluded we were just approaching the question wrong. When a change this big is necessary – forget where you are and just ask where you want to be, and then ask what it would take to get there.
So late last summer, we were shipping service updates about every 4-6 months and we made the call that we were going to go to weekly updates. The goal was to ship new features (not just bug fixes) every week. To some degree it was a statement. Let’s figure out how fast we can go. Some asked why not every day? – certainly some services out there do that. Ultimately I felt that it just wasn’t necessary for our product/customer base/size of team. Maybe someday that will make sense but I didn’t feel it did for where we are today. To avoid having the capabilities delivered in this weekly cadence be random, we decided on a 6 month planning horizon. We’d plan in roughly 6 month chunks and then deliver in weekly increments.
We started working through this last fall (I’ll write more about this effort later) and gradually turned up the frequency from 4-6 month updates. The team was already executing with a Scrum based process, using aligned 3 week sprints. As our release cycles got shorter and shorter, we realized that those 3 week sprints actually formed a natural cadence for the team. We plan the sprint, we build it, we deliver a “potentially shippable increment” – except, in this case, it wasn’t “potentially”; now it really was “shippable”. Because of this natural alignment, we decided to stop the cycle tuning process at 3 weeks and ship feature updates to the service every 3 weeks rather than every week. We knew we still needed ways to update the service to resolve high priority issues more frequently than that, so we instituted a “Patch Monday” plan that says any given Monday we can, if needed, roll out important but not urgent service fixes. We also can roll out urgent fixes any given day, and sometimes do – literally within a few hours of discovering the issue.
The last piece that fell into place was realizing that 6 month planning wasn’t even enough to make sure the team had a clear view of where we were going, so we added a 12-18 month “vision” to make sure we were all rowing in the same direction.
So our cadence is:
12-18 month vision – This is pretty high level and describes the kinds of scenarios we want to enable. It often includes some storyboard that demonstrates the value but is not intended to be either design or a feature list – it is just illustrative of the kind of experience we want to create.
6 month planning – In this window we get more crisp about what features we are building. Here we work out high level cross team commitments – often our work requires coordination across multiple teams. It’s still not design but rather agreement about what scenario we’re delivering and what features support that scenario.
3 week sprints – We generally keep a sprint schedule looking out 2-3 sprints for each feature team. It has a lot more detail for “this sprint” than the next couple but it gives us some clarity on what is coming and when, allows the next level of dependency planning and allows us to understand what kind of progress we are making on our scenarios and balance work where we need to. At the end of every sprint – we deploy to production. Some of what we deploy may be “hidden” behind a feature flag. That enables us to deploy in progress work and, where appropriate, expose it to limited sets of users to test/give feedback.
Patch Monday – Every Monday we are capable of deploying important but non-urgent service fixes. All teams know that if they have something they need to get in, there’s a window every Monday. If there’s nothing needed, we don’t deploy and most of the time we don’t need to.
Daily hotfixes – On any given day, we can patch the service with a hotfix if we have any urgent service issue. In practice, this seems to wind up happening about once every 6 weeks (once every other sprint). It’s usually the result of some regression that got deployed with a sprint payload but sometimes it’s something else like a new load induced problem, etc.
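The “hidden behind a feature flag” idea mentioned above – deploying in-progress work every sprint but only exposing it to limited sets of users – can be illustrated with a small sketch. This is purely illustrative (the flag names, account names, and API are hypothetical, not the actual TFS implementation):

```python
# Minimal feature-flag sketch. In-progress work ships to production with
# every sprint deployment, but stays hidden unless the flag is turned on
# for a given account. All names here are made up for illustration.

class FeatureFlags:
    def __init__(self):
        # flag name -> set of account ids allowed to see the feature,
        # or the special value "all" to enable it for everyone
        self._flags = {}

    def enable_for(self, flag, accounts):
        # Expose the feature to a limited set of accounts for feedback.
        self._flags[flag] = set(accounts)

    def enable_for_all(self, flag):
        # Flip the feature on for everyone once it's ready.
        self._flags[flag] = "all"

    def is_enabled(self, flag, account):
        allowed = self._flags.get(flag)
        if allowed is None:
            return False  # undeclared flags default to off
        return allowed == "all" or account in allowed


flags = FeatureFlags()
# Hypothetical in-progress feature, visible to two pilot accounts only.
flags.enable_for("new-backlog-board", {"contoso", "fabrikam"})

def render_backlog(account):
    # The new UI is deployed alongside the old one; the flag decides
    # which one a given account actually sees.
    if flags.is_enabled("new-backlog-board", account):
        return "new board"
    return "classic board"
```

A real service would back the flag store with persisted configuration so flags can be flipped without a redeployment, which is what makes this useful between sprint deployments.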
I expect as we continue to learn and mature this may evolve further. Maybe someday we’ll break the aligned sprint model and then go to weekly deployments. But for now, this is working well for us and seems to be working well for the consumers of TFSPreview, so we’ll keep doing it.
Once we had started to settle on a pretty rapid cadence for service updates, we realized we were going to have another problem. Some of our new service features are going to require updates to the clients to really expose them in a way that works great for developers. This means that having a 3 week cadence for the service and a 2 year cadence for the client (or even 1 year if you count a service pack in the middle) really isn’t going to work. So last fall, as the service cadence firmed up, we started looking at what to do about the client.
It didn’t take much thinking for us to realize that significant changes to the Visual Studio client every 3 weeks was probably not going to fly. Most customers don’t want to update their clients that often. The quality assurance model for an on-premises release has to be more rigorous than for a cloud service because fixing an issue once it’s deployed is much harder. And so on. So we started looking at a model with a few constraints in mind:
We ultimately settled on quarterly updates as a reasonable trade-off between the costs of frequent updates and the lag behind the service. However, it’s clear that, to accomplish this, we’ll need to be thoughtful about what changes we make in these quarterly updates. 3 months is not enough time to run a full validation pass for arbitrary sets of changes. As such we’ll have to focus mostly on changes “higher in the stack” to minimize the potential destabilizing impact. To use a ridiculous example, making significant feature changes to the .NET Framework every 3 months and deploying them to the world would be a very bad idea given our current abilities.
We introduced a Visual Studio Extensions and Updates feature in VS 2010. As we looked at mechanisms for delivering updates to VS, it was appealing. We also considered Microsoft Update and other options, but we felt the VS Update mechanism was the best fit. Unfortunately, in 2010, it didn’t support the full power we needed to update the breadth of components we felt we might need to update. Fortunately, having started to think this through last fall, we were able to pull together a plan to extend the VS Update mechanism in VS 2012 to support the power that we need.
So the plan, now that VS 2012 has shipped, is to move to a quarterly update cadence for our clients. This won’t, of course, eliminate the need for us to do longer cadence “major updates” too. So expect major releases to continue, but I’m very happy to be able to provide continuing value on a regular basis.
Once we had locked our plans for service updates every 3 weeks and client updates every quarter, the next obvious question was about our on-premises TFS server. It’s clear that we have a large number of customers who are going to continue to want to use an on-premises solution – you might say that’s our bread and butter. We also have some very good hosting partners, filling needs that our Team Foundation Service doesn’t address, who would like to be able to provide the latest capabilities. How are these groups of people going to feel about waiting 2 years for features that the online service gets within 3 weeks of release? In fact it’s worse than that. A few months before we released TFS 2012 we had to start locking down the churn, and as a result the service was getting new features that were not in TFS 2012 *before* TFS 2012 even shipped.
On the other hand, inasmuch as it is true that not everyone wants to update their client every 3 months, it’s even more true that not everyone wants to update their server every 3 months. Further, we don’t have as clean and simple a mechanism for updating the server as we do for the client. It’s also the case that the QA process for a mission critical server is even more involved and costly than for a client.
All this taken together, I’d rather not try to update the on-premises server every 3 months. However, as we’ve started to figure out how to put any cadence plan for the server into action, we are finding that it actually depends a great deal on what kinds of capabilities we are trying to deliver. So we’ve ultimately landed in a place where we’ve said we aren’t going to make a firm commitment to a cadence for the on-premises server. Instead, we’ll “play it by ear”. At this point, it’s clear that we *will* need to update the on-premises server in our first quarterly update later this year. Once we’ve been through one of those cycles, we’ll revisit the cadence question and see if we are in a better position to lock on a cadence or whether we continue to “play it by ear.”
It’s a long post and I’m sorry about that but I wanted to give you some flavor of the journey. The summary is:
The service gets feature updates at the end of every 3 week sprint, guided by 6 month plans and a 12-18 month vision, with Patch Mondays and daily hotfixes available for servicing.
The Visual Studio client moves to a quarterly update cadence, in addition to continuing major releases.
The on-premises TFS server cadence is “play it by ear” for now, starting with an update in our first quarterly update later this year.
Hope this helps.