The biggest area of growth in the last month is in work item tracking activity.  The reason is that we hooked up a mirror between our "old" work item tracking system and TFS.  This means that changes in each system are being replicated into the other.  This has added some load to the server (although the increment is fairly minor in terms of available server capacity).

On top of the extra WIT load, we continue to add source branches for more teams as we bring them on - that accounts for most of the file growth (about 6,000,000 files).

The big learning (for me) in the past month or two has been around SAN configuration and performance.  A month or so ago we started getting event log errors on our SQL Server during peak load times saying that some I/O operations were taking more than 15 seconds to complete.  This coincided with some performance degradation for end users during these peak times.

We initially suspected the SAN and ultimately concluded that the SAN is significantly underpowered and that there are some software improvements we can make.

What I learned about our SAN is that it is RAID5 with 16 spindles.  There are 2 problems here:

  • Our application is fairly write intensive and RAID5 is a very poor choice for applications that do a lot of writing.  Some would say it is a bad choice for all database applications.  In investigating this I was directed (by someone on the SQLServer team) to an interesting site: http://www.baarf.com.  We are currently in the process of planning a migration to a RAID10 SAN configuration.
  • 16 spindles is not enough for the number of people we have on this server.  A single spindle can handle 100-150 I/Os per second.  This means our 16 spindles can handle 1600 - 2400 I/Os per second.  We are seeing I/O peaking at about 3900 I/O requests per second.  This means we need about twice as many spindles.  That said, switching from RAID5 to RAID10 will significantly reduce the number of I/Os per second (because a write is only 2 I/Os for RAID10 but 4 I/Os for RAID5), which should reduce the number of spindles we need.
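
For anyone who wants to redo the spindle math for their own server, here is a rough sketch of the arithmetic above in Python.  The function names are just illustrative, and the RAID10 estimate rests on two assumptions that are not measurements from our server: that the 3900 figure counts physical I/Os behind the RAID5 layer, and that half of the logical requests are writes.

```python
import math


def spindle_capacity(spindles, iops_per_spindle):
    """Total I/Os per second a set of spindles can sustain."""
    return spindles * iops_per_spindle


def spindles_needed(peak_iops, iops_per_spindle):
    """Smallest whole number of spindles that covers the peak load."""
    return math.ceil(peak_iops / iops_per_spindle)


def raid10_backend_iops(raid5_backend_iops, write_fraction):
    """Estimate physical I/Os per second after moving from RAID5 to RAID10.

    A read costs 1 physical I/O on both layouts; a write costs 4 on RAID5
    and 2 on RAID10.  write_fraction is the share of logical requests that
    are writes (an assumed value, not something we measured).
    """
    read_fraction = 1 - write_fraction
    logical_iops = raid5_backend_iops / (read_fraction + 4 * write_fraction)
    return logical_iops * (read_fraction + 2 * write_fraction)


if __name__ == "__main__":
    print(spindle_capacity(16, 100))  # 1600
    print(spindle_capacity(16, 150))  # 2400
    print(spindles_needed(3900, 150))  # 26
    print(spindles_needed(3900, 100))  # 39
    # Hypothetical 50% write mix; substitute your own measured ratio.
    print(raid10_backend_iops(3900, 0.5))
```

With a different write mix the RAID10 number moves quite a bit, so treat it as a planning estimate rather than a prediction.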

We also learned some stuff looking at the source of the I/Os.  We tracked I/Os back to the top sprocs generating them and some were obvious - prc_Get and prc_Checkin.  However, some were surprising (prc_iFindPendingChange was one example, I think).  After investigation we found 3 indexes that could be tuned to reduce the number of I/Os we generate for the same operations.  We're continuing to investigate and I expect we'll find further improvements that will reduce the I/O we do per operation.  As always, we'll roll these changes into our next service pack so everyone can benefit from them.

And here are the latest statistics...

Users
Recent users: 680
Users with assigned work items: 1,662
Version control users: 1,312

Work items
Work items: 105,2770
Areas & Iterations: 6,059
Work item versions: 770,438
Attached files: 28,582
Queries: 9,954

Version control
Files/Folders: 19,580,651/2,570,303
LocalVersion: 110.9M
Total compressed file sizes: 193.3G
Workspaces: 2,702
Shelvesets: 4,098
Checkins: 81,292
Pending changes: 397,954

Requests (last 7 days)
Work Item queries: 250,592
Work Item updates: 36,793
Work Item opens: 157,748
Gets: 14,033
Downloads: 4.3M
Checkins: 3,719
Uploads: 10,852
Shelves: 450

As always, comments and questions are welcome (...even encouraged).  You can even tell me if I'm boring you to death :)

Brian