Over the past few weeks we've seen, in two completely separate environments, all host instances go into throttling state 6, which is defined as the host message queue size, the spool table size, or the tracking table size exceeding the specified threshold. In both cases, when I checked the message box counters in perfmon, the tracking table size exceeded 500,000 items (the default threshold as described here). What was interesting, though, is that the fix was completely different for the two environments.

Environment "A"

This environment is a production environment and had been happily chugging along for some time when our ops folks noticed that work items were accumulating in some of the receive locations where BizTalk picks them up. BizTalk was picking up some traffic, but definitely was not keeping up. I looked at all running service instances and saw that we had exactly 29 processes in flight, so I knew that we were likely throttling. Looking at the message box counters confirmed that we were in fact throttling, and the Message Box General Counters pretty explicitly showed why: the Tracking Data Size counter was at ~550,000 items and climbing. Looking in the BizTalkMsgBoxDb database in SQL Management Studio, we saw that two of the TrackingData_1_n tables accounted for 99.9% of our tracking data surplus, and based on the timestamps of the items in the database, we had apparently been accumulating tracking data for about a week, unnoticed, until we hit the magic 500,000 item threshold. That we noticed.
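
For what it's worth, here is roughly the check we ran in SQL Management Studio expressed as a query. It's a sketch that assumes the standard BizTalkMsgBoxDb table layout (TrackingData_0_x and TrackingData_1_x), so verify the names against your own message box before relying on it.

    USE BizTalkMsgBoxDb;

    -- Row counts per TrackingData table; the total should roughly track the
    -- "Tracking Data Size" perfmon counter.
    SELECT t.name      AS TableName,
           SUM(p.rows) AS RowCnt
    FROM sys.tables t
    JOIN sys.partitions p
      ON p.object_id = t.object_id
     AND p.index_id IN (0, 1)          -- heap or clustered index only, so rows aren't double-counted
    WHERE t.name LIKE 'TrackingData[_]%'
    GROUP BY t.name
    ORDER BY RowCnt DESC;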

The resolution

The fact that we'd been accumulating tracking events for a week suggested that one of our tracking jobs or processes was not working. We verified that all of the pertinent SQL jobs were running (DTA Purge and Archive, MessageBox_*_Cleanup, etc.), so we started looking at our tracking host. At some point in the past, a (good) decision was made to create a dedicated tracking host, and because this host has its own service account, it's easy to see its activity in SQL by opening Activity Monitor and filtering on that account. When we shut down the tracking host, we noticed that some of its SQL processes did not clean up, including one with a Wait Time which either reflected the national debt or was roughly the number of milliseconds in a week, correlating to our oldest tracking instance in the database. Killing that process and then restarting the tracking host instances caused the Tracking Data Size counter to begin dropping, and once we crossed back under 500,000 items, the system started processing normally.
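
If you'd rather do the Activity Monitor step in T-SQL, something like the following sketch surfaces the same information. The service account name is a placeholder for whatever your tracking host runs under, and the KILL is left commented out because you'll want to confirm the session (and stop the tracking host instances) first.

    -- Sessions owned by the tracking host's service account, with current wait times.
    -- 'DOMAIN\svc-bts-tracking' is a placeholder; substitute your tracking host account.
    SELECT s.session_id,
           s.status,
           s.last_request_start_time,
           r.wait_type,
           r.wait_time AS wait_time_ms   -- a week is ~604,800,000 ms
    FROM sys.dm_exec_sessions s
    LEFT JOIN sys.dm_exec_requests r
      ON r.session_id = s.session_id
    WHERE s.login_name = 'DOMAIN\svc-bts-tracking'
    ORDER BY r.wait_time DESC;

    -- Once the offending session is identified:
    -- KILL 123;   -- replace 123 with the actual session_id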

Environment "B"

This environment is NOT a production environment, but is rather a virtualized test environment in which we occasionally push production-like volume, and is so forlorn and neglected from an ops management and monitoring perspective that we were considering making a "Please Help" commercial with a Sarah McLachlan soundtrack.  This is the environment where the "Something Monster" visits regularly and time vampires lurk in the shadows between the server racks.

In this case the Something Monster visited in the form of the underlying SAN storage in our environment taking an unscheduled vacation. Bringing the environment back up afterwards took a lot of TLC, as a number of services either didn't restart cleanly or hit crash-on-audit-fail because the logs had filled up. However, once everything was "up", it became obvious that we still weren't keeping up with the aforementioned production-like volume. A quick check of perfmon verified that we were in fact throttling on state 6 due to a high Tracking Data Size counter. In this case, however, the tracking host did not have a hung SQL process. Given that we'd just recovered from a major outage and the SQL Server Agents had been down (who knows when the tracking SQL jobs had last run), I made sure everything was kicked off again, and for good measure ran the DTA purge job with an aggressive hard delete window of one day. Eventually I saw the Tracking Data Size counter clear out to zero, but when I restarted the host instances to start processing again, I noticed the counter ratcheting up once more. Obviously I had alleviated some symptoms, but had not yet found the cure.
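
For reference, a manual run of the purge looks roughly like the following. The stored procedure and parameters are the standard dtasp_PurgeTrackingDatabase signature; the values shown are the aggressive ones used during recovery, not a recommendation for normal retention, and BizTalkDTADb is the default tracking database name.

    USE BizTalkDTADb;

    -- Passing the current time as the "last backup" lets the hard delete proceed
    -- even if you haven't just backed up the tracking database.
    DECLARE @dtLastBackup datetime = GETUTCDATE();

    EXEC dtasp_PurgeTrackingDatabase
         @nLiveHours      = 1,   -- completed data older than the live window is soft-purged
         @nLiveDays       = 0,
         @nHardDeleteDays = 1,   -- anything older than this is removed regardless of state
         @dtLastBackup    = @dtLastBackup;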

The resolution

After several troubleshooting steps, I found that, again, two TrackingData_1_x tables were holding most of the tracking data. A colleague suggested comparing the lastSeqNum value in the TDDS_StreamStatus table in the DTA database with the SeqNum values in the corresponding TrackingData_1_x table. I also found this excellent write-up by Yossi Dahan describing a similar problem, and while the issue that got us into this state was different (our tracking DB had not grown overly large), the lastSeqNum and max SeqNum values were off by several thousand. By updating the lastSeqNum value to be the same as the max(SeqNum) value in the corresponding TrackingData_1_x table and restarting the tracking host instances, the Tracking Data Size counter came back down to a normal, near-zero level.
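
A hedged sketch of what that comparison and fix looked like follows. The table and column names are as they appear in our environment (with the default BizTalkMsgBoxDb and BizTalkDTADb database names), TDDS_StreamStatus may hold more than one row if you have multiple tracking streams, and you should stop the tracking host instances and take backups before touching anything.

    -- 1. The latest sequence number the message box has written for this stream:
    SELECT MAX(SeqNum) AS MaxSeqNum
    FROM BizTalkMsgBoxDb.dbo.TrackingData_1_0;   -- repeat for the other TrackingData_1_x tables

    -- 2. The last sequence number the tracking (TDDS) service thinks it processed:
    SELECT lastSeqNum
    FROM BizTalkDTADb.dbo.TDDS_StreamStatus;

    -- 3. If lastSeqNum lags MAX(SeqNum) by thousands, bring it up to date
    --    (host instances stopped, databases backed up):
    -- UPDATE BizTalkDTADb.dbo.TDDS_StreamStatus
    -- SET lastSeqNum = <MaxSeqNum from step 1>;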

Follow-up

I have a couple of follow-ups I want to do. The first is making sure that SCOM will surface an alert on Tracking Data Size (and Spool Size, and whatever other throttling thresholds we want to monitor) when we get to ~50% of the limit. The second, relating to our issue in Environment "B", is to research whether the hotfix referenced in Yossi's blog, which is specific to BizTalk 2006 R2, should have been ported to BizTalk 2009 as well. My instinct says yes, as the hotfix is dated June 2009 and the RTM file is dated March 2009.