This is a post I wrote about a week ago and somehow left it in my drafts folder.  Though it’s “old” now, I think there’s still some good stuff here…  The good news is the deployment we did today seems to have gone off without a hitch (knock on wood)!

We did a TFSPreview.com deployment on 4/26 and you may have noticed that we suffered a few hours of down time.  It’s the first upgrade we’ve had go really bad in a long time – the last several have suffered little or no down time.  I wanted to share a little bit about what happened and what we learned from it.

Some context

First, let me set some context.  In my post the other day about changes, I mentioned that there were a lot of non-visible infrastructure changes.  There’s a bit of backstory there that provides some context.  We’ve generally taken a fairly incremental approach to moving TFS into the cloud.  We’ve also been evolving our thinking about what the offering/business model would look like.  One of the changes in our thinking is that the service will be lower cost and have more free accounts than we had expected from the beginning.  No, I don’t have any details to share on that now; we’re still working it out.  Suffice it to say that just merging Codeplex with the service (which I’ve alluded to in previous posts) shifts the model a good bit.  The result is that the straight port of TFS, where every project collection is a separate database, wasn’t going to work from a cost perspective.  We concluded a few months ago that we needed to move to a multi-tenancy model, for at least a portion of our accounts.

We implemented database multi-tenancy over the past couple of months and we finally rolled it out in this last update (though the dial on tenancy is still set to 1 tenant per database for now – we will begin to turn that dial in the coming weeks).  As you might imagine, this was a pretty impactful change.  It affected virtually every feature area to, at least, some degree.  And more than that, it was a pretty big schema change to the database – logically, we had to add a partitioning column to every table in every database.
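To make the partitioning column idea concrete, here is a minimal sketch.  It is not the actual TFS schema: the table work_items, the column partition_id and the helper items_for_tenant are hypothetical names, and an in-memory SQLite database stands in for SQL Azure.  The point is only that every table carries the tenant's partition id in its key and every query filters on it.

```python
import sqlite3

# In-memory SQLite stands in for the real database (an assumption for illustration).
conn = sqlite3.connect(":memory:")

# Hypothetical table: the partition column is part of the primary key,
# so two tenants can reuse the same item_id without colliding.
conn.execute(
    """
    CREATE TABLE work_items (
        partition_id INTEGER NOT NULL,
        item_id      INTEGER NOT NULL,
        title        TEXT    NOT NULL,
        PRIMARY KEY (partition_id, item_id)
    )
    """
)
conn.executemany(
    "INSERT INTO work_items VALUES (?, ?, ?)",
    [
        (1, 1, "Fix login bug"),
        (1, 2, "Update docs"),
        (2, 1, "Add build agent"),
    ],
)
conn.commit()


def items_for_tenant(conn, partition_id):
    """Return (item_id, title) rows for one tenant only."""
    rows = conn.execute(
        "SELECT item_id, title FROM work_items "
        "WHERE partition_id = ? ORDER BY item_id",
        (partition_id,),
    )
    return rows.fetchall()
```

With the dial set to 1 tenant per database, every row in a given database would carry the same partition id; turning the dial up means several ids share the same tables.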

Add to this that we also had some sizeable changes in the communication protocol in the build farm and this was a big release.  In fact, it was big enough that we missed our deployment window 3 weeks before (the first scheduled deployment we have ever missed), but that’s a story for another day.  Suffice it to say that this was the biggest and most impactful deployment we’ve done in the last 6 months.

And things didn’t go well

Because of the nature of the update, we knew we were going to have to schedule some downtime – most of our updates are fully online and only individual tenants are down at any given time, and generally only for a couple of minutes – the overall service is not down.  But this update was different because we also had to make significant changes to what we call the “config DB” – it’s the one central database we have that stores global service configuration and data.  So we decided to start this update at 4:00am to minimize the impact (of course, on a global service, there’s really no convenient time).

  • At 6:28am PDT the upgrade failed.  We quickly got devs investigating.
  • By 7:06am we had a fix.
  • At 7:20am the upgrade was resumed.
  • At this point the upgrade was running slower than expected and we were investigating.
  • At 9:07am we patched a query plan to improve performance
  • At 9:23am the offline portion of the upgrade was complete.

All told, it was about 3 hours and it should have been less than 1 hour.  Clearly not what we want and there’s some good stuff to learn from here.  However, I want to say that I’m very proud of how well the team handled the problem.  The right people were on top of it, quick to respond and quick to move the issue to the next stage.  Their diligence is what kept it to only a 2 hour delay.

It turns out that the root underlying problem had to do with the way the upgrade scripts were authored.  We did not pay enough attention to how the upgrade steps were arranged into transactions and the primary failure was caused by one of the upgrade scripts on the config DB doing too much work inside a single transaction.  It exceeded the capacity of the SQL log and caused the transaction to roll back.  The remedy in this case was to clear out the offending table – it turns out the data in the table that pushed it over the tipping point was transient data that was only relevant while the service was active.  Since the service was down for upgrade none of the data was necessary and could just be deleted.  To be safe, we also reviewed and made other defensive changes to the upgrade scripts.

Things like log space and temp db space are bigger considerations on a shared database service like SQL Azure than they might be on an enterprise SQL Server.  The shared service has to carefully limit how many resources a single tenant can use to prevent it from compromising other tenants.
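One common way to keep an upgrade step from doing too much work inside a single transaction is to process rows in small batches and commit after each one.  The sketch below illustrates that general technique only; it is not the fix we applied (we cleared the table).  The events table, the partition_id backfill and the function backfill_in_batches are hypothetical, and SQLite stands in for SQL Azure.

```python
import sqlite3


def backfill_in_batches(conn, batch_size):
    """Fill in partition_id for rows where it is NULL, one small
    transaction per batch.  Returns the number of batches that
    actually changed rows."""
    batches = 0
    while True:
        cur = conn.execute(
            "UPDATE events SET partition_id = 1 WHERE rowid IN ("
            "SELECT rowid FROM events WHERE partition_id IS NULL LIMIT ?)",
            (batch_size,),
        )
        # Commit after every batch so no single transaction has to
        # log more than batch_size row changes.
        conn.commit()
        if cur.rowcount == 0:
            break
        batches += 1
    return batches


# Hypothetical data: ten rows that still need the new column filled in.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (payload TEXT, partition_id INTEGER)")
conn.executemany(
    "INSERT INTO events (payload) VALUES (?)",
    [(f"event-{i}",) for i in range(10)],
)
conn.commit()

batches = backfill_in_batches(conn, batch_size=4)
remaining = conn.execute(
    "SELECT COUNT(*) FROM events WHERE partition_id IS NULL"
).fetchone()[0]
```

The trade-off is that a failure part way through leaves some batches committed, so a step written this way has to be safe to re-run.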

So what did we learn?

Every one of these is a learning opportunity.  Every time we do something new (in this case a large schema change with a fairly large production data set) we learn new things.  Some thoughts…

Be thoughtful about transaction size.  As we go forward we’re going to make sure we include careful consideration about transaction size in any of our database upgrade scripts.  Our current default authoring mechanism groups all upgrade steps into a single transaction.  In the future, we’ll require developers be explicit about how upgrade steps are grouped.
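As a rough illustration of what “explicit about how upgrade steps are grouped” could look like, here is a minimal sketch.  It is not our authoring mechanism: run_upgrade, the settings table and the sample statements are all hypothetical, and SQLite stands in for the real database.  The script author lists the steps as groups, and each group runs in its own transaction.

```python
import sqlite3


def run_upgrade(conn, groups):
    """Run each group of SQL statements in its own transaction.
    Stops at the first failing group, rolls that group back, and
    returns the number of groups that committed."""
    committed = 0
    for statements in groups:
        conn.execute("BEGIN")
        try:
            for sql in statements:
                conn.execute(sql)
            conn.execute("COMMIT")
        except sqlite3.Error:
            # Only the failing group is undone; earlier groups stay committed.
            conn.execute("ROLLBACK")
            break
        committed += 1
    return committed


# isolation_level=None turns off implicit transactions in the sqlite3
# module, so the explicit BEGIN/COMMIT above fully control grouping.
conn = sqlite3.connect(":memory:", isolation_level=None)

groups = [
    [
        "CREATE TABLE settings (name TEXT PRIMARY KEY, value TEXT)",
        "INSERT INTO settings VALUES ('schema_version', '2')",
    ],
    [
        "INSERT INTO settings VALUES ('tenants_per_db', '1')",
        "INSERT INTO missing_table VALUES (1)",  # deliberately fails
    ],
]

committed = run_upgrade(conn, groups)
names = [row[0] for row in conn.execute("SELECT name FROM settings ORDER BY name")]
```

Here the first group commits and the second rolls back as a unit, so the 'tenants_per_db' row never becomes visible.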

New insights on upgrade testing.  This one is more complicated than you might imagine.  There’s a saying: “There’s no place like production.”  It means that no matter how hard you try, you’ll never be able to fully simulate your production environment – things will still go wrong.  There are diminishing returns on your effort to simulate production with increasing fidelity.  However, that doesn’t mean you shouldn’t do anything.  We don’t try to fully simulate production.  It’s tens of thousands of databases and a ton of data.  Just creating the simulation might take longer than our whole development cycle.  However, given that the config DB is such a core dependency for our system, there’s no reason we couldn’t have tested the upgrade scripts against that database.  In the future we will.  We did, of course, test it against other config DBs but they were smaller.

Listen to your gut.  One of the ironic things here is that the day before the upgrade, the team considered truncating this table as part of the upgrade.  It was discussed and ultimately discarded because the upgrade process hadn’t been tested that way and they were being risk averse.  Often, in these kinds of situations, your instinct is good.  There’s no question you have to be careful about last minute changes, but all too often I’ve seen production teams err on the side of being too change averse, resulting in lower overall application health than necessary.  When your gut says you may have an issue, don’t dismiss it just because “it’s not in the plan”.

Stuff will go wrong.  You can’t stop it.  Sometimes your best offense is a strong defense.  Be ready to jump on things quickly.  Have the right people ready.

Communication.  I’m a very strong advocate for communicating clearly and openly with your customers when you have issues.  It’s something I’ve practiced religiously with our internal service.  This incident has made me realize that we don’t have great mechanisms set up for this for our public service.  We have a service blog that shares status and we update Twitter.  But it’s not good enough.  Issues include:

  1. We need a better way to give notice about upcoming significant servicing events.  It can’t rely on you trolling Twitter or watching our blog.  It needs to be evident in a tool you are likely to be using.
  2. Our blog format is not great.  The information isn’t structured well enough to help you understand what you really need to know about and it’s too focused on current status without enough about why it’s happening or what to expect.
  3. Our error messages are not helpful in this scenario.  “TF30040: The database is not correctly configured.  Contact your Team Foundation Server administrator.” just isn’t going to cut it.

 

Conclusion

It’ll be a while before we make another change of this magnitude, so we’ve got a little time to work out some of these issues before we’re likely to have another test like this.  Thank you for bearing with us as we work through the learning process.  We know an increasing number of people are relying on our service to get their work done and outages are simply not acceptable.  Our goal is clearly no outages – ever, for any reason.  6 months ago all upgrades were “offline”.  Now most are online.  Our goal is for all of them to be online.

Always interested in your feedback and appreciative of your patience,

Brian