Disclaimer: I hesitated to post this because the topic is extremely complicated and deep and, as I point out in the article, lots of smart people have already written about aspects of this problem. Still, I thought I would share our specific problem and what can happen if you get this wrong in a real application (the VB compiler).

Last week, a couple of us on the team spent a fair bit of time on a bug that ended up being difficult to diagnose, but relatively simple to fix. As the title suggests, it has to do with COM, re-entrant calls to the STA, and message pumping.

The problem manifested itself as a hang. When we stopped the application in the debugger, the UI thread was blocked on a WaitForMultipleObjectsEx call. From experience, hangs in which the UI thread is blocked come from either serious design flaws or subtle message pumping issues. We ruled out a design flaw through code analysis and flow analysis; bugs that reproduce infrequently are typically the result of subtle issues, whereas design flaws tend to rear their heads pretty early in the product cycle. So we set about identifying sources of re-entrancy and message pumping problems in hopes of fixing the hang.

Before I get started, I want to go over a bit of background on COM threading models. Many smarter people have talked about this in blogs and books that I consult frequently, so please check those as well. Most applications today require the UI thread (or main thread) to enter a single-threaded apartment (STA). This is because many UI components require the calling thread to be in an STA in order to function properly. A lot of this has to do with the simpler (although not trivial) threading model that the STA provides. But the STA also needs to respond to COM requests (such as RPC calls from other apartments into the STA). If the UI thread is blocked, say waiting for user input, this could effectively block all COM requests. Enter the message queue. COM translates incoming requests into Windows messages and posts them to the message queue of the STA thread. Therefore it is the responsibility of the STA thread to pump Windows messages "whenever blocking calls are made" (I'll clarify and discuss this later).

Now consider what happens when you have a COM component in the STA that makes a blocking call. This could be the result of a call to another apartment that blocks, or the result of calling an API that calls one of the Wait functions. As a result of making this blocking call (say CoWaitForMultipleHandles), messages may be pumped automatically (by Windows or by the code itself). But this could process a message that another component sent to your component, and suddenly your component is called again. Now you have a re-entrancy problem, because your component is on the call stack twice: once for the original call that blocked, and a second time for the call that arrived because your blocking call pumped Windows messages. This is why STAs are evil; unless you design with this re-entrancy in mind, your application will be prone to random access violations (easy to debug and diagnose if you have a call stack), random hangs (harder to debug and diagnose), and other side effects that customers may see but that are impossible to reproduce (extremely difficult to debug and diagnose). But what if you don't pump messages when you make a blocking call? Well, you could be blocked waiting for, say, a UI event (waiting for the user to select an option, for example), and because you are not pumping messages, the UI does not respond and the user cannot interact with the application. You are now deadlocked.
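To make the "on the call stack twice" scenario concrete, here is a minimal sketch in portable C++. It is a toy model, not Win32 or COM code: ToyQueue, Component, DoWork, and RunReentrancyDemo are all hypothetical names invented for illustration, and PumpPending merely stands in for whatever a pumping wait does while it is blocked.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <utility>

// Toy stand-in for an STA thread's message queue (not the real Win32 queue).
struct ToyQueue {
    std::deque<std::function<void()>> messages;

    void Post(std::function<void()> m) { messages.push_back(std::move(m)); }

    // Dispatch everything pending, the way a pumping wait would.
    void PumpPending() {
        while (!messages.empty()) {
            std::function<void()> m = std::move(messages.front());
            messages.pop_front();
            m();  // may call back into a component that is already on the stack
        }
    }
};

// Hypothetical component living on the toy "STA" thread.
struct Component {
    ToyQueue& queue;
    int depth = 0;     // how many DoWork frames are on the stack right now
    int maxDepth = 0;  // deepest nesting observed so far

    explicit Component(ToyQueue& q) : queue(q) {}

    void DoWork() {
        ++depth;
        if (depth > maxDepth) maxDepth = depth;
        queue.PumpPending();  // the "blocking" call that pumps while it waits
        --depth;
    }
};

// Runs the scenario and reports the deepest nesting of DoWork.
int RunReentrancyDemo() {
    ToyQueue queue;
    Component c(queue);
    // Another component's call is already queued when we start to block.
    queue.Post([&c] { c.DoWork(); });
    c.DoWork();
    assert(c.depth == 0);  // every frame has unwound by the time we return
    return c.maxDepth;
}
```

In this model RunReentrancyDemo reports a nesting depth of 2: the outer DoWork is still running when the queued call re-enters it. Any state the outer call was halfway through updating would be visible to the inner call, which is where the random failures come from.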

It seems like a catch-22: pump messages and you may get re-entrancy, but don't pump messages and you may get deadlock. FYI: in this context, Chris Brumme prefers deadlocks because you can always get an accurate call stack. I totally agree. Problems caused by re-entrancy tend to surface long after the re-entrant code has run.

With this background, you can see one source of hangs: if the UI thread is blocked but not pumping messages, it may never process the message that would cause the event to be signaled, and you have a deadlock. This can happen if you use MsgWaitForMultipleObjects and forget to pump messages as required by the API, and indeed, this is related to what we saw. But the fix isn't as simple as calling CoWaitForMultipleHandles (which pumps messages for you, among other things) or pumping some messages manually when MsgWaitForMultipleObjects returns WAIT_OBJECT_0 + cEvents. It turns out that many applications (including Visual Studio) not only contain custom COM filters, they also have custom message filters and do special message processing in the application's main message pump. Take a look at your main application's message pumping loop; chances are that for a large application, there is some custom processing going on in there. There may be some filtering, there may be some message translating, etc. In any application where the main message pumping loop is customized (and by customized, I mean any logic beyond GetMessage/DispatchMessage), it becomes dangerous to pump your own messages blindly or to use APIs (such as CoWaitForMultipleHandles) that pump messages. The reason should be obvious: your custom filtering and dispatching code in your message pump won't execute!

Now, then, you have all the context for the bug we had to fix last week. In short, a scenario occurred where we blocked the UI thread but did not pump messages. We then switched to using CoWaitForMultipleHandles because it looked promising; it would block and pump messages when messages came in. But the kicker here was that we have custom message filtering and dispatching code, and this code was being skipped if we called CoWaitForMultipleHandles. In the end, we had to use MsgWaitForMultipleObjects and call the custom message handler when a message came in. Only when we did that did we get the right behavior.
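The difference between pumping blindly and routing a wait through the application's own message handling can be sketched with another toy model in portable C++. Nothing here is the VB compiler's or Visual Studio's actual code, and none of it is a Win32 API: ToyApp, CustomFilter, PumpOne, BlindPumpOne, WaitForEvent, and the message names are assumptions made up for illustration. PumpOne plays the role of a factored-out pump step that both the main loop and a MsgWaitForMultipleObjects-style wait would share; BlindPumpOne plays the role of a generic pumping API that knows nothing about the custom filter.

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <vector>

struct ToyMessage { std::string name; };

// Toy model of an application with a customized main message loop.
struct ToyApp {
    std::deque<ToyMessage> queue;         // pending messages
    std::vector<std::string> filtered;    // consumed by the custom filter
    std::vector<std::string> dispatched;  // reached the default dispatch
    bool eventSignaled = false;           // the "handle" a wait blocks on

    // Custom pre-dispatch logic; returns true if it handled the message.
    bool CustomFilter(const ToyMessage& m) {
        if (m.name == "accelerator") { filtered.push_back(m.name); return true; }
        return false;
    }

    void Dispatch(const ToyMessage& m) {
        dispatched.push_back(m.name);
        if (m.name == "work-done") eventSignaled = true;
    }

    // Factored-out pump step: the main loop and every wait share it.
    bool PumpOne() {
        if (queue.empty()) return false;
        ToyMessage m = queue.front();
        queue.pop_front();
        if (!CustomFilter(m)) Dispatch(m);
        return true;
    }

    // What a generic pumping API effectively does: dispatch, no custom filter.
    bool BlindPumpOne() {
        if (queue.empty()) return false;
        ToyMessage m = queue.front();
        queue.pop_front();
        Dispatch(m);
        return true;
    }

    // Pump until the event is signaled. Returns false if the queue drains
    // first (a real wait would block at that point instead of returning).
    bool WaitForEvent(bool useSharedPump) {
        while (!eventSignaled) {
            bool pumped = useSharedPump ? PumpOne() : BlindPumpOne();
            if (!pumped) return false;
        }
        return true;
    }
};

ToyApp MakeApp() {
    ToyApp app;
    app.queue.push_back(ToyMessage{"accelerator"});
    app.queue.push_back(ToyMessage{"work-done"});
    return app;
}

// Usage sketch: the shared pump keeps the custom filter in play.
void DemoSharedPump() {
    ToyApp app = MakeApp();
    bool ok = app.WaitForEvent(/*useSharedPump=*/true);
    assert(ok && app.filtered.size() == 1);
    (void)ok;
}
```

With the shared pump, the "accelerator" message goes through CustomFilter exactly as it would in the main loop. With the blind pump, the wait still completes, but the same message falls straight through to Dispatch and the filter never sees it, which is the toy version of the behavior described above.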

In short, be extremely careful when writing COM code with blocking UI operations. It's almost essential in these applications to factor out the message pumping so that it can be called in other locations that require pumping. In addition, you should almost never call APIs that pump messages for you (unless they pump specific messages that you don't handle, but even then it doesn't scale well; who knows how your message pumping will change in the future?).

I've only really scratched the surface of this huge topic. Many smarter people have written more about it. I'll try to keep writing about experiences I've had on my team, to hopefully make yours less painful :)
