
TFS and Reliability and Disaster Recovery

We continue to evolve and improve upon the TFS reliability and disaster recovery story.  Fundamentally, reliability and disaster recovery are about preserving service (or minimizing outages) and eliminating data loss when components of your system fail.  When we look at TFS component failure, we primarily focus on the application tier (or web tier), the data tier and the disk subsystem.

The solutions for each of these failure points can differ.  You can buy redundant hardware: machines with redundant power supplies, network connections, etc.; RAID disk systems with redundant drives, controllers, etc.  You can build amazingly fault-tolerant hardware, but the more you want, the more expensive it gets, and none of it helps if an earthquake collapses your data center.  For both of these reasons, people look to alternative ways of dealing with reliability and disaster recovery.

Looking at the TFS components, let’s examine the reliability and disaster recovery story for each beyond highly reliable hardware.

Application Tier

In V1 of TFS, we tested and supported what we called Application Tier “warm standby”.  This means you can configure a second (or third, etc) application tier and have it ready to take over in the event of a failure in the “active” application tier machine.  In the event of a failure, the secondary application tier needs to be activated – it doesn’t automatically take over.  This process is described here.  Because it requires manual intervention, it generally requires someone to notice what has happened and then an administrator to run the redirection process – and may include updating DNS information to point the old server name to the new server so all of the clients don’t have to be updated.

HP recognized an opportunity here and developed a solution using their HP Systems Insight Manager described here.  This enables the application tier fail over to happen automatically when Insight Manager discovers that the active application tier is no longer functioning properly.

In future versions of TFS, we plan to support multiple (load balanced) active application tiers for the same Team Foundation Server so that, in the event that any one AT fails, a standard load balancer can remove it from the rotation and the system can continue to operate with the remaining functioning application tier machines.
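To make the load-balanced idea concrete, here is a minimal sketch of how a balancer might drop a failed application tier from the rotation.  It is not how TFS or any particular load balancer is implemented; the server names and the `is_healthy` probe are hypothetical stand-ins for whatever health check a real balancer would run.

```python
from typing import Callable, Iterator, List


def healthy_tiers(tiers: List[str], is_healthy: Callable[[str], bool]) -> List[str]:
    """Return the application tiers that currently pass the health probe."""
    return [t for t in tiers if is_healthy(t)]


def round_robin(tiers: List[str], is_healthy: Callable[[str], bool]) -> Iterator[str]:
    """Yield a healthy tier for each request, skipping any that have failed.

    The health probe is re-evaluated on every request, so a tier that fails
    (or recovers) is removed from (or returned to) the rotation automatically.
    """
    i = 0
    while True:
        alive = healthy_tiers(tiers, is_healthy)
        if not alive:
            raise RuntimeError("no healthy application tier available")
        yield alive[i % len(alive)]
        i += 1


if __name__ == "__main__":
    down = {"at2"}  # hypothetical: pretend the second AT has failed
    rotation = round_robin(["at1", "at2", "at3"], lambda t: t not in down)
    print([next(rotation) for _ in range(4)])
```

The point of the sketch is that no administrator has to notice the failure: requests simply stop going to the tier that fails its probe.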

Data Tier

The only TFS solution for Data Tier availability (without data duplication) is SQL Clustering.  Clustering is a hardware configuration where multiple SQL database machines share the same disk subsystem (usually a SAN).  Clustering provides automatic and transparent failover in the event of a failure of the primary SQL Server machine but does not address a failure in the shared disk subsystem.  Although it is a very robust solution, the downside is that it is a fairly expensive solution and requires careful selection and matching of hardware components.  You can read about how to configure TFS data tiers for clustering here.

Data Tier + Disk subsystem

Before shipping TFS V1, we did not test or document any solutions for reliability and disaster recovery of the disk subsystem beyond backup and restore (which for any large system can be a time-consuming task).  Since then we have tested both mirroring and log shipping.  You can read more about them here, here and here.  In both mirroring and log shipping you can configure a secondary, redundant system (data tier machine and disks) that can either be co-located or geographically distributed.  These can help protect against a total catastrophe (like a fire or earthquake), allowing you to get your system back up and running on new hardware (servers and disks) in a short period of time.

The primary difference is that log shipping is a scheduled, periodic update of the secondary database, whereas mirroring is either synchronous or asynchronous with relatively short time lags.  Mirroring also has additional features like a witness server, which enables automatic failover to the secondary in the event the primary becomes unavailable.  Unfortunately, witness-driven failover only works per database and TFS uses 7 different databases, so failover for TFS, even with mirroring, is a manual process.
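The multi-database problem can be illustrated with a small sketch.  It only models the bookkeeping an administrator faces during a manual failover, namely making sure every database ends up with the same principal server.  The database and server names are placeholders, not the real TFS database names, and nothing here talks to SQL Server.

```python
from typing import Dict, List


def is_consistent(principals: Dict[str, str]) -> bool:
    """True when every database is currently served by the same principal."""
    return len(set(principals.values())) == 1


def databases_needing_failover(principals: Dict[str, str], target: str) -> List[str]:
    """List the databases whose principal is not yet the target server."""
    return sorted(db for db, server in principals.items() if server != target)


if __name__ == "__main__":
    # Hypothetical state after only one database has failed over on its own.
    principals = {"db_a": "sql1", "db_b": "sql2", "db_c": "sql1"}
    print(is_consistent(principals))
    print(databases_needing_failover(principals, "sql2"))
```

If only some of the databases moved to the secondary, the application would be looking at a mixed state, which is why the failover has to be driven as a single manual operation across all of them.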

Disk subsystem

In addition to the mirroring and log shipping described above, we support a variety of hardware-level disk solutions.  For example, you can use either RAID5 or RAID10 disk configurations, multiple controller cards, host bus adaptors, etc. to make your disk subsystem fault tolerant.  As a general rule, for any high-traffic system, we recommend RAID10 over RAID5 because RAID10 has substantially better write throughput than RAID5 and TFS is a “write-heavy” application.
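A rough back-of-the-envelope calculation shows where the write throughput gap comes from.  The sketch below uses the commonly quoted small random write penalties (2 physical I/Os per write for RAID10, 4 for RAID5 because of the read-modify-write of data and parity).  The disk count and per-disk IOPS are made-up numbers for illustration, and real controllers with caching will behave differently.

```python
# Commonly quoted physical I/Os per small random write (an approximation).
WRITE_PENALTY = {"RAID10": 2, "RAID5": 4}


def effective_write_iops(level: str, disks: int, iops_per_disk: float) -> float:
    """Estimate sustainable random write IOPS for an array.

    Raw IOPS (disks * iops_per_disk) is divided by the write penalty of the
    RAID level, since each logical write costs that many physical I/Os.
    """
    return disks * iops_per_disk / WRITE_PENALTY[level]


if __name__ == "__main__":
    # Hypothetical array: 8 disks at 150 IOPS each.
    for level in ("RAID10", "RAID5"):
        print(level, effective_write_iops(level, disks=8, iops_per_disk=150))
```

Under these assumptions the same set of disks sustains about twice as many random writes in RAID10 as in RAID5, at the cost of less usable capacity.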

And, of course, as a last resort, we support a strong backup and restore story that includes online and incremental backup.

 

We are in the process of evolving our TFS admin and operations documentation on MSDN.  Over time all of this information will migrate there and should be much easier to find.  In the meantime, I hope this provides some context and pointers to resources you can use as you learn about reliability and disaster recovery for TFS.

Brian

Published Friday, November 24, 2006 8:06 AM by bharry


Comments

# re: TFS and Reliability and Disaster Recovery

Sorry, I'll fix up the links.  I originally wrote it as an email message and didn't recheck the links.

Friday, November 24, 2006 9:35 AM by bharry

# re: TFS and Reliability and Disaster Recovery

Are you also aware how the Google Code subversion repositories store data, redundantly and reliably across their massive infrastructure/BigTable DB?

It's an alternative strategy that's not mentioned in your post.

Friday, November 24, 2006 11:49 AM by RichB

# re: TFS and Reliability and Disaster Recovery

Have you considered adding the ability to install the databases onto separate SQL Server instances/servers?  This would provide some level of scale out.

Also, as the version control database appears to be the most heavily used of the databases, have you considered using SQL Server 2005 table partitioning for very large installations?  It would seem to lend itself to this kind of partitioning.  This would allow the database tables to be spread over more spindles and potentially ease the archiving of historical information.  Another, more complex option would be data-dependent routing, whereby you split the version control database over multiple physical servers and route to the appropriate one based upon some criteria such as project or date.

Sunday, November 26, 2006 3:56 PM by Phil

# re: TFS and Reliability and Disaster Recovery

Yes, we have considered allowing the databases to be installed on separate servers.  In fact, we recently published how you can move Analysis Services and its database to another machine.  I think in the future we'll be making this even more flexible.

We have considered using table partitioning but haven't tried it yet.  It should actually be pretty transparent so I don't expect any problems.  We're looking at setting it up on our dogfood server soon because some of the tables have gotten to be hundreds of millions of rows and table maintenance operations are unreasonably long.

We've thought about database partitioning across multiple servers but have stayed away from it for now.  Part of the problem is that most of the data in TFS is pretty interrelated, so we'd pretty quickly get into distributed transactions, and that's something I'd like to avoid.

Thanks for the thoughts,

Brian

Monday, November 27, 2006 6:53 AM by bharry

# VSTS Links - 11/29/2006

Ars Technica on Keeping up with the Visual Studio Joneses. Brian Harry on TFS and Reliability and...

Wednesday, November 29, 2006 9:47 AM by Team System News

# TFS Reliability and Disaster Recovery

I had written in my previous post about having a high level topic that brings together all the information

Friday, December 01, 2006 9:42 AM by Sudhir Hasbe
