Those of you signed up on the Winqual site are likely familiar with the "Event ID" and what it (ideally) represents – that is, a basic demarcation of unique software defects. You’re probably most familiar with crash-related Event IDs like “Crash 32-bit” and “Crash 64-bit”. But what about the other event types? Specifically “Hang” and “Hang XProc”? And by the way – what’s a crash in module “hungapp”?
In this and later posts I’m going to explain a little bit about error reporting and a little bit more about hangs – what they are, how they’re detected, reported, organized, counted, etc. Before getting into all that, however, we need a (very) brief history lesson.
Some of you might remember something like the following scene from the days of Windows NT 4 (and maybe you’ve seen similar scenes in more recent times):
This is a screenshot of CALC.EXE moved across an intentionally hung instance of NOTEPAD.EXE.
Without getting into too much nitty gritty, a thread that creates GUI elements, namely windows, has an implicit contract with the desktop window manager that it will service messages that arrive in its message queue... in a timely fashion. “Servicing” means retrieving and dispatching the messages (aka pumping messages). There is plenty of content on MSDN that covers the specific APIs and mechanisms used for this and plenty of content elsewhere about the interesting peculiarities of this area. What "timely fashion" means is somewhat a matter of debate among developers and users. To the window manager, it means 5 seconds by default. Our user research shows that this is actually quite generous.
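To make “pumping” a bit more concrete, here is a toy sketch in Python. It is not the Win32 message loop and none of these names are real APIs: MessageQueue, post, get, and pump are hypothetical stand-ins, and messages are modeled as plain strings. It only illustrates the retrieve-then-dispatch cycle described above.

```python
from collections import deque


class MessageQueue:
    """Toy stand-in for a GUI thread's message queue (hypothetical, not a Win32 API)."""

    def __init__(self):
        self._pending = deque()

    def post(self, message):
        """Add a message (here just a string such as "paint") to the queue."""
        self._pending.append(message)

    def get(self):
        """Retrieve the oldest pending message, or None if the queue is empty."""
        return self._pending.popleft() if self._pending else None

    def __len__(self):
        return len(self._pending)


def pump(queue, handlers):
    """Retrieve and dispatch every pending message; return how many were serviced."""
    serviced = 0
    while (message := queue.get()) is not None:
        # Messages without a registered handler are still retrieved, just ignored.
        handler = handlers.get(message, lambda m: None)
        handler(message)
        serviced += 1
    return serviced
```

A thread that keeps calling something like pump() keeps its queue empty. A thread that wanders off to do long-running work in the middle of a handler leaves everything posted in the meantime sitting in the queue.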
If a thread stops pumping messages from its message queue, bad things start to happen. In such cases, the thread is often off busy working on whatever task the user just “asked” it to do (e.g. opening a file or recalculating some total or talking to a web service over the internet) and there’s a good chance that it didn’t update its UI to indicate that it’s off doing this work; it almost certainly hasn’t shown any UI resulting from that work. But there is a worse and more fundamental problem – while it’s away, the messages it isn’t pumping could be mouse, keyboard, or touch input along with paint messages. Not pumping those messages would result in a window stuck on the user’s screen – the user wouldn’t be able to move, minimize, or close it and if they moved something in front of it, they’d get something like the jaggy mess in the screen shot above.
I say “would” because, beginning with Windows 2000, we added functionality to the window manager to handle these cases. The concept is actually quite simple – the window manager watches for when a thread has pending input in its queue but hasn’t serviced it for more than 5 seconds. When the situation is detected (aka IsHungAppWindow), the window manager does a presto chango maneuver whereby it hides the window whose thread isn’t pumping input messages and seamlessly shows a replacement window (known as a “ghost” window) with its client area filled with the real window’s last known good client area bits.
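The detection rule itself can be written down in a few lines. This is only a sketch of the rule as stated above, not the window manager’s actual implementation: ThreadInputState, is_hung, and the idea of tracking a single “oldest pending input” timestamp are assumptions made for illustration. The 5 second default is the one mentioned in the text.

```python
from dataclasses import dataclass
from typing import Optional

HUNG_TIMEOUT_SECONDS = 5.0  # default timeout described in the text


@dataclass
class ThreadInputState:
    """Hypothetical bookkeeping for one GUI thread (not a real Windows structure)."""

    # Time (in seconds) at which the oldest still-unserviced input arrived,
    # or None when no input is waiting in the queue.
    oldest_pending_input_at: Optional[float]


def is_hung(state, now, timeout=HUNG_TIMEOUT_SECONDS):
    """Return True if input has been pending for more than `timeout` seconds."""
    if state.oldest_pending_input_at is None:
        # No pending input: a busy thread with nothing waiting is not flagged.
        return False
    return (now - state.oldest_pending_input_at) > timeout
```

Note that under this rule a thread that is busy but has no input waiting is never flagged. It takes both conditions, pending input and no servicing for longer than the timeout, to trigger the ghost window.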
While the original thread isn’t responding, the window manager manages both windows in parallel, to the extent that the application doesn’t know this is happening. For example, IsWindowVisible() will actually return true for the hidden/unresponsive window. The only visual difference (initially) to the user is that the text “(Not Responding)” is appended to the ghost window’s title. Using the ghost window, the user can effectively move, minimize, and even close the unresponsive application window. When (if) the thread starts pumping messages again, the original window is re-shown and the ghost window goes away. By the way, this window manager feature is called (obviously) window ghosting - and naturally, we offer an API you should almost never use: DisableProcessWindowsGhosting.
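Here is a toy model of that lifecycle, again in Python and again purely illustrative. GhostedWindow and its methods are hypothetical names, and the real window manager tracks far more state than this. The sketch only captures the behavior described above: what the user sees changes while the window is ghosted, what the application is told does not.

```python
class GhostedWindow:
    """Toy model of one top-level window and its ghost (hypothetical names)."""

    def __init__(self, title, visible=True):
        self.title = title
        self.visible = visible  # visibility as the application last set it
        self.ghosted = False    # True while the ghost window is standing in

    def on_hang_detected(self):
        """The thread stopped servicing input: swap in the ghost window."""
        self.ghosted = True

    def on_pumping_resumed(self):
        """The thread is pumping again: re-show the original, drop the ghost."""
        self.ghosted = False

    def app_sees_visible(self):
        """What the application is told: ghosting does not change the answer."""
        return self.visible

    def title_on_screen(self):
        """What the user sees in the title bar."""
        if self.ghosted:
            return f"{self.title} (Not Responding)"
        return self.title
```

For example, a window titled "Untitled - Notepad" would show "Untitled - Notepad (Not Responding)" between on_hang_detected() and on_pumping_resumed(), while app_sees_visible() keeps returning True throughout.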
While we were working on Windows XP, we realized that it might be nice to know about these unresponsive applications. We had already started to collect data about crashes, which were clearly annoying to customers, and some schools of thought (and some research) indicated that unresponsive UI was even more annoying – some might even use words like “infuriating” (I know I do). So, we decided to wire up a ghost window’s close button to the infrastructure that sent crash reports back to Microsoft… and voilà, hang reporting was born.
[Part 1]