Hang Bucketing, A Better Way

In the previous post I gave a brief introduction to how the first version of hang reporting was implemented on top of the existing crash reporting infrastructure.

Eventually (after Windows XP shipped) a new general-purpose event reporting and bucketing mechanism was built. In a nutshell, this mechanism provides a very flexible way to report incidents to the WER pipeline. It supports a custom named event (dubbed “generic events” internally at Microsoft) and up to 10 custom bucketing parameters, P1 through P10.

During the development of Windows Vista, we decided to do something about the problematic hang bucketing and leverage the new general-purpose reporting to get a much better client-side classification of hang issues. Before I explain which bucketing parameters the hang events use, let me cover something else that is new in Vista:

Another problem with Windows XP hang reporting was that the application process was often hung waiting on an external process. The report would only include a memory dump of the hung application, and developers were often forced to give up because they had no way to debug the other process. (Now, granted, you very often don't need to debug the other process to fix a hang, but that is a discussion for a later blog post.)

To solve this problem, the GetThreadWaitChain API was created. This API allows the caller to discover the blocking graph (or “wait chain”) for a given thread. A trivial example: Thread A is waiting to enter a critical section owned by Thread B, which is making a blocking SendMessage call to Thread C, which is waiting on a mutex owned by Thread D, which is running in another process. The output from the API provides the caller with all of this information.
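
To make the example concrete, here is a toy model of that chain in Python. It does not call GetThreadWaitChain; the tables, the process names and the wait_chain helper are all made up for illustration, and the real API returns richer node information than this.

```python
# Toy model of the A -> B -> C -> D example; NOT the real Windows API.
# Which process each thread runs in (process names are hypothetical).
THREAD_PROCESS = {"A": "app.exe", "B": "app.exe", "C": "app.exe", "D": "other.exe"}

# thread -> (kind of object it is blocked on, thread that owns that object)
BLOCKED_ON = {
    "A": ("critical section", "B"),
    "B": ("SendMessage", "C"),
    "C": ("mutex", "D"),
}


def wait_chain(start):
    """Follow blocked-on edges from `start`.

    Returns (chain, is_cycle): chain is a list of (waiter, object, owner)
    tuples, and is_cycle is True if the walk came back to a thread it had
    already visited, which would indicate a deadlock.
    """
    chain = []
    seen = set()
    thread = start
    while thread in BLOCKED_ON and thread not in seen:
        seen.add(thread)
        obj, owner = BLOCKED_ON[thread]
        chain.append((thread, obj, owner))
        thread = owner
    return chain, thread in seen


chain, is_cycle = wait_chain("A")
for waiter, obj, owner in chain:
    print(f"Thread {waiter} waits on {obj} owned by Thread {owner}")

# The last owner in the chain lives in a different process than Thread A.
last_owner = chain[-1][2]
crosses_process = THREAD_PROCESS[last_owner] != THREAD_PROCESS["A"]
```

In this model the walk ends at Thread D, which is not blocked on anything and runs in a different process from Thread A.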

It's this mechanism that hang reporting uses in Vista not only to discover the wait chain information but also to collect memory dumps of external processes.

Back to the bucketing parameters – the hang reporting infrastructure actually reports hangs to two different generic events: AppHangB1 and AppHangXProcB1 (yes, B1 stands for Beta 1.  Don’t ask…).  On Winqual, AppHangB1 is shown as Event Type "Hang" and AppHangXProcB1 is shown as "Hang XProc".  AppHangB1 has 5 parameters (by convention P1 & P2 are typically Application Name & Version):

P1 - Application Name
P2 - Application Version
P3 - Application Timestamp
P4 - Stack Hash
P5 - Type Code

AppHangXProcB1 has those 5 and adds 2 more:

P6 - Waiting On Application Name
P7 - Waiting On Application Version
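
One way to picture how these parameters form a bucket is as a tuple that acts as a grouping key: reports whose event name and parameters all match land in the same bucket. The sketch below is only an illustration in Python. The HangBucket type, its field names and every sample value are hypothetical, not the WER wire format.

```python
from collections import Counter
from typing import NamedTuple


class HangBucket(NamedTuple):
    """Hypothetical in-memory view of a hang bucket key."""
    event: str                    # "AppHangB1" or "AppHangXProcB1"
    app_name: str                 # P1
    app_version: str              # P2
    app_timestamp: str            # P3
    stack_hash: str               # P4
    type_code: str                # P5
    waiting_on_name: str = ""     # P6 (AppHangXProcB1 only)
    waiting_on_version: str = ""  # P7 (AppHangXProcB1 only)


# Made-up reports: the first two are identical, so they share a bucket.
reports = [
    HangBucket("AppHangB1", "example.exe", "1.0.0.0", "45e0a1b2", "3f2a", "0"),
    HangBucket("AppHangB1", "example.exe", "1.0.0.0", "45e0a1b2", "3f2a", "0"),
    HangBucket("AppHangXProcB1", "example.exe", "1.0.0.0", "45e0a1b2", "9c41", "2",
               "helper.exe", "2.1.0.0"),
]

# Count reports per bucket; the whole tuple is the grouping key.
buckets = Counter(reports)
for bucket, hits in buckets.items():
    print(hits, bucket.event, bucket.app_name, bucket.stack_hash)
```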

Most of these parameters are self-explanatory. The two that might not be are P4 and P5:

Stack Hash (P4) - hang reporting traverses the wait chain and, for the final thread in the chain before it jumps to a different process, generates a hash based on that thread’s stack back trace. P4 is a restricted hash, which means we chop the 128-bit MD5 hash down to 16 bits. This prevents any wild spraying of buckets, and when we studied it we were still achieving ~85% uniqueness.
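
The truncation idea can be sketched in a few lines of Python. This is a guess at the general shape and not the actual WER algorithm: how the frames are represented, how they are joined before hashing, and which 16 of the 128 bits are kept are all assumptions here, and the frame strings are invented.

```python
import hashlib


def restricted_stack_hash(frames):
    """Hash a stack back trace with MD5 and keep only 16 bits.

    `frames` is a list of strings such as "module!function+0x10". That
    representation, the newline join, and keeping the first two digest
    bytes are assumptions made for this sketch.
    """
    text = "\n".join(frames).encode("utf-8")
    digest = hashlib.md5(text).digest()       # 16 bytes = 128 bits
    return int.from_bytes(digest[:2], "big")  # keep 16 bits: 0..65535


# Hypothetical back trace of the last in-process thread in the wait chain.
stack = [
    "ntdll!ExampleWait+0x15",
    "kernel32!ExampleWaitForObject+0x43",
    "example!Worker::Run+0x2a",
]
p4 = restricted_stack_hash(stack)
print(f"P4 = {p4:04x}")
```

With 16 bits there are at most 65,536 distinct P4 values, which is what keeps the number of buckets bounded. The price is that two different stacks can occasionally collide on the same value.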

Type Code (P5) - this is a bit field based in part on the WCT_OBJECT_TYPEs found during wait chain traversal with GetThreadWaitChain (e.g., mutex, COM, etc.) and in part on a few other items, like whether there's a deadlock or whether the hang report came from End Task in Task Manager (more blog fodder).
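
A bit field like this is easy to picture with a flags enum. The Python sketch below is illustrative only: the flag names loosely follow the object types mentioned above, but the bit positions and the type_code helper are hypothetical and do not reflect the real P5 encoding.

```python
from enum import IntFlag


class TypeCode(IntFlag):
    """Hypothetical bit assignments; the real P5 layout is not shown here."""
    MUTEX = 0x001
    COM = 0x002
    CRITICAL_SECTION = 0x004
    SEND_MESSAGE = 0x008
    DEADLOCK = 0x100   # the wait chain contains a cycle
    END_TASK = 0x200   # report was triggered by End Task in Task Manager


def type_code(object_types, deadlock=False, end_task=False):
    """OR together one bit per object type seen, plus the extra conditions."""
    code = TypeCode(0)
    for object_type in object_types:
        code |= object_type
    if deadlock:
        code |= TypeCode.DEADLOCK
    if end_task:
        code |= TypeCode.END_TASK
    return code


# A chain that passed through a mutex and a COM call and ended in a deadlock.
p5 = type_code([TypeCode.MUTEX, TypeCode.COM], deadlock=True)
print(f"P5 = {int(p5):#x}")
```

Because each condition owns its own bit, a single integer records every kind of object seen along the chain, and a query such as "all hangs involving COM" becomes a simple mask test.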

As rich and wonderful as hang bucketing has become in Windows Vista, there are still edge cases (just as there are in crash bucketing) where a bucket does not uniquely identify a single bug.  In a future post I will discuss bucketing more generally and how we REALLY determine and quantify bugs using WER data at Microsoft.