Welcome to MSDN Blogs Sign in | Join | Help

Workflow Foundation: Migration from WF3 to WF4 - Notes and References

As you probably know, the .NET Framework 4.0 is going to introduce a new object model for the Workflow Foundation (WF) technology, in short called WF4, which will be added to the existing WF object model, in short called WF3. The WF3 object model uses the namespaces System.Workflow.*, whereas the WF4 object model uses the namespaces System.Activities.*.

Our product group published a Migration Guidance set of documents, available for download at http://go.microsoft.com/fwlink/?LinkID=153313.

 

I strongly encourage everyone working with WF to read those documents and to check the team blog at http://blogs.msdn.com/endpoint. In the following I report some quick takeaways, mainly coming from those documents and from questions we get from customers. Note that these notes are by no means meant to be a replacement to reading the whole set of documents.

  • .NET Framework and Development Environment: a very important aspect, not always well-understood, is the fact that WF3 is in .NET Framework 4.0. In other words, it is possible to run worflows using the WF3 object model on top of the CLR 4.0 and the .NET Framework 4.0 Class Library. The WF3 Designer runs in Visual Studio 10, so everything that could be done with WF3 under the CLR 2.0 and the .NET Framework 3.x can also be done with WF3 under the CLR 4.0. A consequence is that it is not necessary to use Visual Studio 9 in order to author/build/debug WF3 applications: Visual Studio 10 will works with WF3 
  • Rules: forward-chaining of rules is missing in WF4 (at least in its initial version). If forward-chaining is needed, you may consider using WF3 even for new development. This could be conveniently wrapped in a custom WF4 activity. The Policy40 SDK sample exemplifies this concept
  • State Machine Workflows: in my support activity on WF I have seen several customers using this type of workflow. There is no direct equivalent in WF4, but the new Flowchart activity can conveniently model some of the state machine workflows: typically, those where the flow is mostly sequential and "non-sequential" links betwen states are more the exception than the rule. For State Machine workflows that are not easily modeled with a Flowchart activity, staying with the WF3 object model is a valid option
  • Workflow Services: the term "Workflow Services" designates workflows that communicate with the outer world through WCF. In other words, they are the business logic of a WCF service. In WF3, workflow services used the ReceiveActivity activity in order to receive incoming WCF calls. In WF4 ReceiveActivity plays a different role, that of receiving a SOAP request. Therefore, a WF3 ReceiveActivity activity with the "One Way Operation" unchecked (request-reply MEP) is typically replaced, in WF4, by a pair of activities: Receive and SendReply. The WF4 object model is more flexible in modeling MEPs other than Request-Reply
Posted by Carlo Colombo | 0 Comments
Filed under:

A CLR COM Interop Issue

Hello again.

Today's case is pretty interesting, in my opinion, because it merges a number of different areas.

We'll start from the analysis of a memory corruption to go to Runtime Callable Wrappers in the .NET CLR to finish with COM objects lifetime considerations. If all that sounds interesting, or at least not too boring or scaring Smile, please stay with me as we start the analysis.

The Problem

The problem is reported as a 100% CPU usage in a customer-written, managed application. The problem happens, most of the times, at application shutdown, occasionally in other circumstances.

Dump Analysys - Understanding the reason for the high CPU usage

When an application takes 100% of CPU time (or 50%, or 25%, depending on the number of processors), it is usually an indication that one of its threads ran into an endless loop.

All we have to start with is a dump of the process taken at this time, so let's have a look at the dump first.

Note

When analyzing a 100% CPU issue, analysis of one single dump is not the best way to identify the problem. A set of dumps (for example 3 or 4 dumps taken at short interval, say 30 seconds) is much better because it makes it possible to inspect the call stack of the threads at different stages in the endless loop and therefore to possibly understand why we are staying endlessly in the loop. With a single dump, careful analysis of each of the stack frames, including its state (local variables and parameters) is required, and this can be time consuming. Even more sophisticated techniques can be used to identify endless loops, for example profiling. Note however that these techniques are sometimes impractical if the problem happens seldom, or if it only happens in production environments or on remote machines at customer's locations.

As usual, let's set the symbol path:

0:000> .sympath srv*c:\publicsymbols*http://msdl.microsoft.com/download/symbols
Symbol search path is: srv*c:\publicsymbols*http://msdl.microsoft.com/download/symbols
0:000> .reload

Then let's check the CPU usage for each thread with the !runaway command:

0:000> !runaway
 User Mode Time
  Thread       Time
   0:714       0 days 2:38:16.218
   1:b4c       0 days 0:00:00.000

Thread 0 is the main thread, and its CPU usage of more than 2 hours shows that this is the thread which is taking all the CPU time. Thread 1 is the thread that was injected by the debugger in order to break the process and take a dump and, as expected, it took practically no CPU time. Under normal circumstances the process would have many other threads but at this stage the process is shutting down and all the other threads have already terminated.

This is the call stack for the main thread:

0:000> kb200
ChildEBP RetAddr  Args to Child              
0012f91c 773f8417 001a0000 001adeb8 0012f9c4 ntdll!RtlpCoalesceFreeBlocks+0x4db
0012fa14 773f8652 001ade30 001ade38 001ade38 ntdll!RtlpFreeHeap+0x1e2
0012fa30 772fc56f 001a0000 00000000 001ade30 ntdll!RtlFreeHeap+0x14e
0012fa44 70fc4c39 001a0000 00000000 001ade38 kernel32!HeapFree+0x14
0012fa90 790aeebd 001ade38 00000001 00000000 msvcr80!free+0xcd
0012faa8 790aef38 79060000 75d3c590 00000001 mscorjit!_CRT_INIT+0x195
0012faec 790a5e16 79060000 773ce1c4 79060000 mscorjit!_CRT_INIT+0x27a
0012faf4 773ce1c4 79060000 00000000 00000001 mscorjit!_DllMainCRTStartup+0x1d
0012fb14 773ba8dd 790a5df9 79060000 00000000 ntdll!LdrpCallInitRoutine+0x14
0012fbb4 773ba85f 0012fbd0 772d3b69 00000000 ntdll!LdrShutdownProcess+0x1a9
0012fbbc 772d3b68 00000000 77e8f3b0 ffffffff ntdll!RtlExitUserProcess+0x64
0012fbd0 79fd9e8f 00000000 00000000 00000001 kernel32!ExitProcess+0x12
0012fdf8 79f7479c 00000000 0012fe3c 79f7394a mscorwks!SafeExitProcess+0x157
0012fe94 79004fab 00000000 79e70000 0012fef0 mscorwks!DisableRuntime+0xd1
0012fea4 70fc170b 00000000 70fc1715 00000000 mscoree!CorExitProcess+0x46
0012feac 70fc1715 00000000 70fc1a04 00000000 msvcr80!__crtCorExitProcess+0x25
0012feb4 70fc1a03 00000000 d8c61ab3 0041b120 msvcr80!__crtExitProcess+0x9
0012fef0 70fc1a4b 00000000 00000000 00000000 msvcr80!_cinit+0x101
0012ff00 0040ef5f 00000000 d8c868b0 00000000 msvcr80!exit+0xd
WARNING: Stack unwind information not available. Following frames may be wrong.
0012ff88 772f4911 7ffdf000 0012ffd4 773ce4b6 LXDeviceServer+0xef5f
0012ff94 773ce4b6 7ffdf000 58366917 00000000 kernel32!BaseThreadInitThunk+0xe
0012ffd4 773ce489 0040f0bb 7ffdf000 00000000 ntdll!__RtlUserThreadStart+0x23
0012ffec 00000000 0040f0bb 7ffdf000 00000000 ntdll!_RtlUserThreadStart+0x1b

The callstack above confirms that the process is shutting down. As part of shutdown of a managed application, the CLR is also shutting down.

We already talked about ntdll!RtlpCoalesceFreeBlocks in a previous post. In that case, an access violation was a consequence of a previous heap corruption. Let's try and understand what happens in this case.

In the previous post we also enphasized that some of the heap data structures changed in Vista / Windows Server 2008, so let's check which operating system this process was running on:

0:000> vertarget
Windows Server 2008/Windows Vista SP1 Version 6001 (Service Pack 1) MP (2 procs) Free x86 compatible
Product: WinNt, suite: SingleUserTS
kernel32.dll version: 6.0.6001.18000 (longhorn_rtm.080118-1840)

So this gives us the chance to look at the new heap data structures of Windows Vista and Windows Server 2008.

We are freeing memory at address 001a0000. As part of the free, the heap manager tries to coalesce this block with nearby (previous and / or following) blocks in an attempt to reduce heap fragmentation. While doing so, the heap manager seems to have entered the endless loop.

Upon further analysis, very similar to the one in the previous post, we can conclude that the cause of the problem is heap corruption. More precisely, a heap block that was previously freed and placed in the heap free list has been subsequently accessed by the program, overwriting the fields that chain the list (remember, when a block is freed the booking information is stored in the user-accessible part of the heap block). The corruption just so happened to create a circular list, thus causing ntdll!RtlpCoalesceFreeBlocks() to enter an endless loop while scanning the list items.

Looking for the source of the heap corruption

One important information coming from our customer is that not all the runs create the problem in the same way. So let's try and run the application under the debugger for some times and check what happens at different runs.

One such run causes a first-chance Access Violation (AV) with this call stack:

0:000> kb30
ChildEBP RetAddr  Args to Child              
0012efd0 79e7e432 0002ded0 002bd9b8 22b60a79 mscorwks!SafeReleaseHelper+0x77
0012f004 79f86f8f 0002ded0 002bd9b8 22b60a45 mscorwks!SafeRelease+0x2f
0012f038 79f86e9c 002bd9b8 00000001 002bd9bc mscorwks!IUnkEntry::Free+0x68
0012f04c 79f86e7a 22b60a01 00000001 002bd9b8 mscorwks!RCW::ReleaseAllInterfaces+0x18
0012f07c 79f86efc 002bd9b8 22b60ad1 00000001 mscorwks!RCW::ReleaseAllInterfacesCallBack+0xbd
0012f0ac 79f8787e 035ae598 0012f0e8 79f87846 mscorwks!RCW::Cleanup+0x22
0012f0b8 79f87846 002bda50 22b60a95 0012f104 mscorwks!RCWCleanupList::ReleaseRCWListRaw+0x16
0012f0e8 7a0bf7a6 002622b0 002622b0 0012f114 mscorwks!RCWCleanupList::ReleaseRCWListInCorrectCtx+0xdf
0012f0f8 76302fe0 035ae434 0012f118 0012f334 mscorwks!CtxEntry::EnterContextCallback+0x89
0012f114 7647333b 00268198 03f18e48 00000202 ole32!CRemoteUnknown::DoCallback+0x7a
0012f130 764e0841 76302fa5 0012f338 00000002 rpcrt4!Invoke+0x2a
0012f55c 764e0ff6 00262010 0026a740 00314f68 rpcrt4!NdrStubCall2+0x27b
0012f5ac 76409759 00262010 00314f68 0026a740 rpcrt4!CStdStubBuffer_Invoke+0xa0
0012f5f4 764096f3 00314f68 002f3764 00000000 ole32!SyncStubInvoke+0x3c
0012f640 76329d67 00314f68 0026a650 00262010 ole32!StubInvoke+0xb9
0012f71c 76329c5c 0026a740 00000000 00262010 ole32!CCtxComChnl::ContextInvoke+0xfa
0012f738 7632aab0 00314f68 00000001 00262010 ole32!MTAInvoke+0x1a
0012f764 7640961b 00314f68 00000001 00262010 ole32!STAInvoke+0x46
0012f798 76409498 d0908070 0026a740 00262010 ole32!AppInvoke+0xaa
0012f874 76409aa2 00314f10 00270328 00000000 ole32!ComInvokeWithLockAndIPID+0x32c
0012f89c 7632a8ea 00314f10 00000400 00265ee8 ole32!ComInvoke+0xc5
0012f8b0 7632a8a9 00314f10 0012f970 00000400 ole32!ThreadDispatch+0x23
0012f8f4 765cf8d2 00030058 00000400 0000babe ole32!ThreadWndProc+0x167
0012f920 765cf794 7632a80c 00030058 00000400 user32!InternalCallWinProc+0x23
0012f998 765d0008 00000000 7632a80c 00030058 user32!UserCallWinProcCheckWow+0x14b
0012f9fc 765d0060 7632a80c 00000000 0012fa3c user32!DispatchMessageWorker+0x322
0012fa0c 762fb0a6 0012fa20 00000000 0012fb08 user32!DispatchMessageW+0xf
0012fa3c 762f8a9d 00000102 00080008 762faff5 ole32!CCliModalLoop::PeekRPCAndDDEMessage+0x4c
0012fa6c 762f89d7 00080008 2286611c 00000102 ole32!CCliModalLoop::FindMessage+0x30
0012fac8 762f898f 80010115 80010115 0026abe8 ole32!CCliModalLoop::HandleWakeForMsg+0x3e
0012fadc 762f5384 0026abe8 00013880 0012fb04 ole32!CCliModalLoop::BlockFn+0x108
0012fb58 79f752f0 00000002 00013880 00000001 ole32!CoWaitForMultipleHandles+0xcd
0012fb78 79f75232 00000000 00013880 00000001 mscorwks!NT5WaitRoutine+0x51
0012fbe4 79f7519a 00000001 0026abe8 00000000 mscorwks!MsgWaitHelper+0xa5
0012fc04 79ed9808 00000001 0026abe8 00000000 mscorwks!Thread::DoAppropriateAptStateWait+0x28
0012fc88 79ed96c4 00000001 0026abe8 00000000 mscorwks!Thread::DoAppropriateWaitWorker+0x13c
0012fcd8 79ed9a62 00000001 0026abe8 00000000 mscorwks!Thread::DoAppropriateWait+0x40
0012fd34 79e78944 00013880 00000001 00000000 mscorwks!CLREvent::WaitEx+0xf7
0012fd48 79f7515d 00013880 00000001 00000000 mscorwks!CLREvent::Wait+0x17
0012fdc8 79f750b9 79f7509d 22b60785 00000000 mscorwks!WaitForEndOfShutdown_OneIteration+0x9f
0012fdcc 79f7509d 22b60785 00000000 00000000 mscorwks!WaitForEndOfShutdown+0x5
0012fdf8 79f7478c 00000eb0 0012fe3c 79f7394a mscorwks!EEShutDown+0xae
0012fe94 79004fab 00000000 79e70000 0012fef0 mscorwks!HandleExitProcessHelper+0x41
0012fea4 72f4170b 00000000 72f41715 00000000 mscoree!CorExitProcess+0x46
0012feac 72f41715 00000000 72f41a04 00000000 msvcr80!__crtCorExitProcess+0x25
0012feb4 72f41a03 00000000 22866778 0041b120 msvcr80!__crtExitProcess+0x9
0012fef0 72f41a4b 00000000 00000000 00000000 msvcr80!doexit+0xc1
0012ff00 0040ef5f 00000000 2285db91 00000000 msvcr80!exit+0xd

0:000> .exr -1
ExceptionAddress: 79e7e4a5 (mscorwks!SafeReleaseHelper+0x00000077)
   ExceptionCode: c0000005 (Access violation)
  ExceptionFlags: 00000000
NumberParameters: 2
   Parameter[0]: 00000000
   Parameter[1]: 00000008
Attempt to read from address 00000008

Note

A question I would expect at this stage Smile is: why doesn't this exception tear down the process? Or, in other words, who is handling this exception, thus preventing the process from terminating? It is the CLR itself which does that by wrapping the call to Release() on COM objects in an exception handler which basically ignores the exception and allows the program to continue. This has been done in order to be forgiving towards those COM objects that do not handle their lifetime properly and avoid that they can tear down the whole process.

RCW represents, in the CLR runtime, a Runtime Callable Wrapper, the .NET interface to a COM object. Before making a sense of the call stack above, it is helpful to underline some key points on the lifetime of COM objects and on how RCWs handle it:

  • COM objects are reference-counted: they keep track of how many clients are keeping a reference to them. They can safely delete themselves when the reference count drops to 0.
  • A RCW is just another client of a COM object. As such, it may keep one or more references to the COM object.
  • A RCW, like other managed objects, is not released deterministically, but through Garbage Collection (GC). It is at this stage, by default, that it calls Release() on the referenced COM object.
  • It is possible to force the release of the underlying COM object before the RCW undergoes GC by calling Marshal.ReleaseComObject().

Without looking into the undocumented RCW structure, we know from the call stack above that it is releasing a reference to a COM object at address 0002ded0. So we can check with our customer which type of COM Interoperability takes place in their application. It turns out their application uses COM Connectable Objects in managed form (this can be achieved by using the ComSourceInterfacesAttribute attribute). Some of their sinks are managed, some are native. So COM Interoperability takes place for native COM sinks. When a connectable object has one or more sinks to notify when an event occurs, it must obviously keep a reference to them. At the same time, the sink typically keeps a reference to the connectable object because it will sooner or later disconnect from it by calling IConnectionPoint::Unadvise. With COM Interop in-between, the overall picture is as follows:

 

 Drawing1

 

Now, we know that we get an AV when the RCW releases the underlying Sink (red arrow in the picture). This may indicate that there is something wrong with the lifetime of the Sink itself. Analysis of our customer source code revealed that this was, indeed, the case: the sink object, implemented in C++, was created with new and destroyed with delete. new was called immediately before handing a reference to the Connectable Object in IConnectionPoint::Advise(). In simplified code:

Sink* pSink = new Sink();
CComPtr<IUnknown> pSinkUnk;
pSink->QueryInterface(__uuidof(IUnknown), (void**) &pSinkUnk);
hr = pUnk.CoCreateInstance(__uuidof(Button));
if (FAILED(hr))
    return hr;
DWORD dwCookie;
hr = AtlAdvise(pUnk, pSinkUnk, __uuidof(ButtonEvents), &dwCookie);

delete was called after the sink unregistered with the connection point by calling IConnectionPoint::Unadvise(). In simplified code:

hr = AtlUnadvise(pUnk, __uuidof(ButtonEvents), dwCookie);
delete pSink;

Why COM Interoperability makes a difference

With reference to the last code snippet, when AtlUnadvise() is called (IConnectionPoint::Unadvise), the CLR deregisters the connection but the RCW does not release its reference to the sink. Since the Sink destroys itself at this stage, the RCW in the CLR is left with a pointer to a former COM object which is now released memory. When the CLR shuts down, or when GC takes place, the RCW tries to call Release() on the already-released sink.

At this point 2 things can happen:

1. The memory that was previously used by the sink is free. The attempt to call Release() on the sink generates an Access Violation but the CLR handles and ignores it (see Note above). Note that for heaps "freed" means reserved, uncommitted, memory, because the heap never releases reserved regions of memory.

2. The memory that was previously used by the sink is still allocated memory. This, in turn, can happen in 2 subcases:

                2.a: because the memory was placed in the heap free list, ready for reuse, instead of being released. The free list is a functionality of a heap manager

                2.b: because the memory has been reallocated for something else

The case of 100% CPU experienced by our customer is caused by case 2.a above: the call to Release() does not cause an access violation but it corrupts the structures of the free list in the heap manager. This causes the free list to create a circular reference, which causes an endless loop, hence 100% CPU, the next time the heap manager scans the list.

The case of access violation to which refers the second call stack is 1. above. They are just 2 different symptoms of the same problem. Note that the second symptom, unlike the first, allowed to find out the reason of the problem.

Fixing the Issue

With the findings above, fixing the issue is straightforward: the problem is caused by incorrect handling of the lifetime of a COM object, causing the CLR to access a piece of unmanaged memory that has already been released, therefore the fix consists in correctly handling the lifetime of COM objects by calling Release(). In simplified code:

CComPtr<IUnknown> pUnk;
CComPtr<IUnknown> pSinkUnk;
CComObject<Sink>::CreateInstance(&pSinkUnk);
hr = pUnk.CoCreateInstance(__uuidof(Button));
if (FAILED(hr))
    return hr;
DWORD dwCookie;
hr = AtlAdvise(pUnk, pSinkUnk, __uuidof(ButtonEvents), &dwCookie);
...
...
hr = AtlUnadvise(pUnk, __uuidof(ButtonEvents), dwCookie);
pSinkUnk.Release();

Note

It may be argued that the CLR concurs to creating the issue by not releasing the sink reference when the sink calls Unadvise(). It is certainly true that this behavior concurs to the problem: if it called Release() on the sink in its implementation of IConnectionPoint::Advise(), the bad lifetime management of the sink would not surface, because by the time the sink deletes itself after calling IConnectionPoint::Advise(), there wouldn't be any outstanding references to it. However, the CLR behavior is not incorrect: once a client acquires a reference to a COM object, it can keep it at its own discretion. So there is nothing, in the CLR behavior, that violates the COM rules. Nondeterministic release of COM references is also consistent with the nondeterministic release of managed objects through Garbage Collection.

Posted by Carlo Colombo | 2 Comments
Filed under: , , ,

Analyzing a user-mode exception and conditionally taking a kernel dump

Introduction

Hello again. It has been more than one month since the last post. I guess the Christmas vacation and the  local christmas sweet played a role in this delay Smile.

In the last few days I was faced with an interesting issue: a Windows Forms application abruptly terminates despite the fact that there are exception handlers wrapping the code. Needless to say Smile, the problem occurs in a production environment, rather seldom and at random times. Nonetheless, its impact is huge and so the problem needs to be sorted out.

When a process having exception handlers terminates, the most obvious reason is that exceptions occur within exception handlers themselves. It's not the only possibility, but certainly the first to check.

The first step we take is to attach the debugger to the process (through Adplus) and look at the log file it produces; this confirms the first theory: from some point, an access violation occurs every time the process calls into the windowing subsystem (user32.dll and then w32k.sys in the kernel): the call to the kernel does not crash (otherwise the entire system would crash, not only the process), but the kernel makes a callback to the user-mode code with an invalid address, which causes a user-mode access violation. And since exception handlers display a message box in order to notify the user of the exception, they also generate the exception because this is a call to the windowing subsystem.

A user-mode dump, by itself, is not enough to identify the problem, because the bad address originates from the kernel. In order to understand where this invalid address comes from, we need a kernel dump as well, taken approximately at the same time when the problem happens.

A kernel dump crashes the system, so we need to make sure we are taking the dump only when we run into the issue. Clear enough, but how can this be done? The idea is that we attach a debugger to the user-mode process and, when the "right condition" occurs, we cause the system to crash somehow, so as to take the dump. As usual when something seems complex, let's break it down into pieces.

Analyzing the user-mode exception

By looking at the adplus log of a previous run of the faulty application, we know that there are benign access violations that occur from time to time in the application. They have nothing to do with the problem and are successfully handled by the process.

Note

You may wonder how an Access Violation exception can be benign and therefore successfully handled by the code. Typically, benign access violations are those that dereference a NULL pointer. There are pieces of code that, instead of checking whether a pointer is NULL, try to dereference it and handle the exception. The Just-In-Time compiler in .NET does that as well: it sets up a handler for a native Access Violation exception, the handler creates a NullReferenceException managed object and throws it.

Obviously, we should not crash the system and take a kernel dump in case of a benign access violation. But this poses the problem of understanding which ones are benign and can be ignored by the debugger, allowing the application to handle them, from those that indicate we ran into the problem and should trigger the system crash.

One easy way to achieve this could be to rely on the debugging architecture, whereby debuggers are notified of exceptions at 2 stages:

- first-chance, before the exception handlers in the process are given a chance to handle the exception

- second-chance, only if the exception handlers in the debuggee did not handle the exception

The idea here is pretty simple: if we assume the application handles benign exceptions, we can simply configure the debugger to handle the exception second-chance: at that time we know that the application did not handle the exception, therefore the program is bound to terminate.

In our case, however, there is a complication: the code that does the exception handling is outside our and our customer's control, being it third-party, so we cannot change it. This means, unfortunately, that we cannot configure the debugger to catch the exception second-chance, because the debugger won't get notified.

So we need to handle the exception first-chance and, by analyzing the exception, understand whether it is one for which we should crash the system or if it is one which we should ignore and allow the application code to handle. The exception is the last exception event that waked up the debugger, which can be displayed with the .exr -1 command:

0:000> .exr -1
ExceptionAddress: 773d7dfe (ntdll!DbgBreakPoint)
   ExceptionCode: 80000003 (Break instruction exception)
  ExceptionFlags: 00000000
NumberParameters: 3
   Parameter[0]: 00000000
   Parameter[1]: 918d63c0
   Parameter[2]: 7741d094

The output of this command is essentially the content of the EXCEPTION_RECORD struct found in the WinNT.h header file:

typedef struct _EXCEPTION_RECORD {
    DWORD    ExceptionCode;
    DWORD ExceptionFlags;
    struct _EXCEPTION_RECORD *ExceptionRecord;
    PVOID ExceptionAddress;
    DWORD NumberParameters;
    ULONG_PTR ExceptionInformation[EXCEPTION_MAXIMUM_PARAMETERS];
    } EXCEPTION_RECORD;

We will distinguish whether an access violation is benign or otherwise based on the address of the data that is being dereferenced. If this address is 0, we consider it a benign AV. In order to accomodate the dereferencing of a member of a structure, we also consider benign an exception dereferencing data at addresses up to 4096. The choice of this particular value is certainly arbitrary, but probably reasonable (structs larger than 4KB are unusual).

In an access violation, the dereferenced address is the second parameter:

0:027> .exr -1
ExceptionAddress: 74a5849f (msdart!XxMpHeapFree+0x00000037)
   ExceptionCode: c0000005 (Access violation)
  ExceptionFlags: 00000000
NumberParameters: 2
   Parameter[0]: 00000000
   Parameter[1]: a0068421
Attempt to read from address a0068421

So we need to dynamically check the value of the second parameter. The ".exr -1" command is not suitable for that, since its output is failry complex and its format is not documented: we need a debugger command that returns just the arguments of the exception.

When there isn't a debugger command that does what we need, it si time to write a debugger extension.

The Debugger Extension

For generality, we will write a debugger extension, let's call it ExParams, that outputs the parameters of the exception event that caused the debugger to wake up. If the last event is not an exception (for example, if it is a breakpoint), the extension writes out an error message.

If the debugging tools are installed with the optional SDK, it will create an sdk subfolder with folders for libraries (lib subfolder), include files (inc sufolder) and samples (samples subfolder). If you need to quickly start writing a simple extension, I suggest to start from the "extcpp" sample. It has a sample command, ummods, that you can customize to fit your needs.

The logical steps we need to take in our extension are as follows:

  • Get the last event, the event that caused the debugger to wake up, by calling IDebugControl::GetLastEventInformation()
  • Check the type of the event: if it is not DEBUG_EVENT_EXCEPTION, return an error message
  • Cast the ExtensionInformation output argument to an EXCEPTION_RECORD64 data structure and write out the parameters

A simplified version with fixed-size arguments and with no error handling is reported here:

EXT_COMMAND(ExParams,
            "Outputs the parameters of an exception",
            "")
{
    ULONG  Type;
    ULONG  ProcessId;
    ULONG  ThreadId;
    ULONG  ExtraInformationSize = 2000;
    PVOID  ExtraInformation = new BYTE[ExtraInformationSize];
    ULONG  ExtraInformationUsed;
    ULONG  DescriptionSize = 2000;
    PSTR   Description = new char[DescriptionSize];
    ULONG  DescriptionUsed;

    HRESULT hr = m_Control->GetLastEventInformation(&Type, &ProcessId, &ThreadId, ExtraInformation, ExtraInformationSize, &ExtraInformationUsed,
                    Description, DescriptionSize, &DescriptionUsed);
    if (Type != DEBUG_EVENT_EXCEPTION)
    {
        Out("Sorry: last event is not an exception\n");
        return;
    }
    EXCEPTION_RECORD64* pRec = reinterpret_cast<EXCEPTION_RECORD64*> (ExtraInformation);
    for (DWORD l = 0; l < pRec->NumberParameters; l++)
    {
        char sParam[20];
        itoa(pRec->ExceptionInformation[l], sParam, 16);
        Out(sParam);
        Out("\n");
    }
}

Since the extension should execute quickly and does not use symbols, I did not implement command cancellation.

We compile and build the extension with either the WDK tools or with Visual Studio. If you use the latter, please pay attention to the dependency on the C++ runtime library: better to do a static link (/MT compiler option) so as to avoid having to deploy the C++ redistributable on the target machine.

We are now ready to use our brand new extension.

Creating the ADPlus Script

The debugger extension, when run, will output 2 parameters. We need to check the second one. To achieve that, the control flow tokens of the debugging tools come to the rescue. The .foreach and .if tokens can be used to parse the output of ExParams and get the 2nd argument.

Also, we need a way to cause a crash of the system. The utility NotMyFault.exe, available from download here, can do the trick. It dynamically loads the driver MyFault.sys, which then causes the crash. NotMyFault can be downloaded at http://download.sysinternals.com/Files/Notmyfault.zip.

Furthermore, we use the facilities of AdPlus in order to have a log file where the events are reported and where we can declaratively specify what we need the debug session to do. Without further comments (see the documentation), here is the config file:

<ADPlus>
    <PreCommands>
        <Cmd> .load DbgExt.dll;.expr /s MASM </Cmd>
    </PreCommands>
    <Exceptions>
        <Config>
                  <Code> AllExceptions </Code>
            <Actions1> Log;Stack;MiniDump  </Actions1> 
            <Actions2> Log;Stack;FullDump  </Actions2> 
            <ReturnAction1> GN </ReturnAction1>
            <ReturnAction2> Q </ReturnAction2>
        </Config>
        <Config>
                  <Code> av </Code>
            <Actions1> Log;Stack;FullDump  </Actions1> 
            <Actions2> Log;Stack;FullDump  </Actions2> 
            <CustomActions1>
                r $t1 = 0; .foreach (prm  { !ExParams } ) {.if ($t1 == 1 & prm &gt; 1000) {.shell NotMyFault.exe /crash} .else{r $t1 = $t1 + 1}};
            </CustomActions1>
            <ReturnAction1> GN </ReturnAction1>
            <ReturnAction2> Q </ReturnAction2>
        </Config>
    </Exceptions>
</ADPlus>
Notes
  • Since the kernel dump has to be the last thing we do on the machine (at that point, the machine crashes), we obviously need to take the user dump before crashing the machine. That's automatic in the script because the <CustomActions1> block is executed after the <Actions1> block.
  • Only administrators, by default, have the privilege of loading device drivers. This means that the debugger must be run with administrative privileges, otherwise NotMyFault.exe will not be able to crash the system.
  • You may have noticed the &gt; entity in the CustomActions1 element. This is the ">" character. The direct use of ">" in the XML would cause Adplus to interpret it as an element bracket and therefore to display an error message. By using &gt; we make sure that the ">" sign will not be interpreted by AdPlus and, instead, will be delivered to the debugger.
  • The command .expr /s MASM at the beginning of the debug session is meant to set the default expresssion evaluator for the debugger. The debugging tools documentation has more details.

Sorting out the details

The very last thing we need to care about is to make sure that the machine is properly configured to take a full memory dump when the system crashes. I think you'll agree with me that it would really be a shame to do all this work and then not have a dump just because of a machine configuration detail Confused.

The details are reported in the article Overview of memory dump file options for Windows Server 2003, Windows XP, and Windows 2000. For our case, being the target system a Windows XP machine, the things to check are the following:

  1. Make surethat the paging file is on the boot/system disk (usually c:)
  2. Check that the paging file is at least 10MB larger than the physical memory of the machine
  3. Ensure that the option to save the required type of memory dump is selected into Computer's properties, tab "advanced", "startup and recovery": "Complete Memory Dump"
  4. Make sure you have enough space for the dump to be saved in the specified location (equal to the size of physical memory plus 10MB).

Summary

In today's topic we have touched on several areas in order to achieve an advanced and not-so-common goal: automatically analyze, in the debugger, the details of an event (an exception, in our case) and, based on that, take a specific action (crash the system and take a kernel dump, in our case). This requirement is not so uncommon in a production environment or test environment where the problem happens very seldom. In these circumstances the debuggging needs to be automated as well.

We leveraged several tools and technologies in order to achieve that:

  • A debugger extension which dumps out the data we need
  • The control flow tokens in the debuggers of the " Debugging Tools for Windows" which parse and filter the output of a debugger extension
  • The NotMyFault.exe tool (or equivalent) which can be run to crash the system at a given point in time
  • The AdPlus tool to simplify the set of actions that we want the debugger to take in reaction to specific events
  • The configuration of the dump option on the machine

While the set of steps may seem complex at first sight, this is a one-time effort: reapplying the same principles and techniques, or adapting them to a similar need, is straightforward. And the ability to track down difficult and complex problems which seldom occur and have an impact on the production environment is a huge added value.

Why Carlos?

Some colleagues and friends pointed out that, after a couple of tough postings on, respectively, DCOM Communication and Heap Corruption, it may be time for a light one. Fair enough, we are approaching the winter holiday season so you may prefer something easier. For guys interested in deep technical matters: you're probably better off skipping this post since there is nothing technically interesting in it.

The URL for this blog, http://blogs.msdn.com/carlos, is qualified by the word carlos. Now, this is not my email alias within Microsoft, nor it is my first name (being it Carlo, without the trailing s). So what's the rationale behind it?

Well, in the 12+ years as a Support Engineer / Escalation Engineer at Microsoft, I have interacted with literally thousands of clients in the EMEA (Europe, Middle East and Africa) region, and with colleagues from all the continents except Antarctica. And very often, in return to an email that I sent to customers/colleagues, I got this greeting:

"Hi Carlos,

...."

At first, I was a bit annoyed by seeing my name mispelled. And I did not understand why it happened: I always finish my emails with my first name/last name, or my first name only, and the email header contains my first name and last name as well, this shouldn't leave much room for misunderstanding on what is my first name, should it?

Over time, as the number of repliers using the name Carlos increased, it became clearer and clearer to me that there was very little I could do: the Spanish Carlos was much more popular than my real name, the Italian Carlo, and as a consequence non-Italian writers used the former in place of the latter. I even wondered if some of them noted that I signed as "Carlo" and just thought: "This guy mispelled his own name, let's show in my answer that, notwithstanding his mistake, I can address him in the right way" Smile.

Admittedly, I am the one to blame for a couple of occurrences: I don't use pre-created e-mail signatures for outgoing messages (yes, those signatures that, with disclaimers, phone numbers, urls, alternative contacts etc, over time grow longer than the actual email message) and sometimes I just write "carlo" at the end of my email before hitting the "Alt+s" key sequence which sends the email in Outlook. It just so happens that, if "Alt" and "s" are not pressed exactly at the same time, the email is not sent and the s is written in the email message at the cursor position. If you are in a hurry, you do not pay attention to that: you just enter "Alt+s"  one more time, and the email goes out with the signature "carlos"!

Even written communications sometimes address me as Carlos: it is the case, for example, with a hotel chain to whose mailing list I subscribed (don't ask me how it happened, since I am not interested in these types of programs: it was probably when checking in at one of their hotels): for them, I am Carlos Colombo. A kind suggestion that they may start by spelling my name correctly did not have any effect so far and I am still Carlos to them.

So, when I came to choose a url for the MSDN Blog I just thought: if I am going to be addressed as Carlos anyway, maybe I should start myself. And here it came the URL for the Blog: http://blogs.msdn.com/carlos Smile.

I don't use quotations very often but I guess the popular saying "If you can't beat them, join them" applies here (BTW if you know of a more suitable saying, please let me know: I am not a Native English Speaker and I am definitely interested in learning the language better).

And you know what? Over time I came to like "Carlos" as much as I like my real name "Carlo" Smile.

Heap Corruption: A Case Study

Here we go again: it is snowing in Milan while I am publishing a new blog post Confused. Admittedly, this coincidence is getting more and more interesting, maybe meteorogists can spot a pattern here Smile.

Today's topic is a tough one: memory corruption.

Introduction

Memory corruption, in general, is one of the toughest issues to work with. For several reasons:

  • It is not immediate, for starting, to understand that a problem (endless loop, unexpected behavior, crash) is caused by a memory corruption.
  • Historically, user-mode processes with their own virtual address space and the separation of user-mode and kernel mode were meant to provide an isolated environment for code, so that bad code which, for example, could cause a memory corruption, was not able to adversely affect other code. On the other hand, the appearance of "host processes" like svchost.exe for services, dllhost.exe for COM+ applications and w3wp.exe for ASP.NET and Web Services, made again different components run in the same process. There are benefits to it, but the fact that different software shares a common address space means that, when a memory corruption occurs, the whole process is affected. Moreover, it may be difficult to determine which component is at fault.
  • The consequences of a memory corruption typically manifest themselves at a later time, when the corrupted area is read. At that time it is difficult, if not impossible, to backtrack to the source of the corruption.

In case you wonder, the additional isolation provided by .NET through AppDomains does not help in this case: if memory gets corrupted in an address space there is no way to recover.

Memory, memory, memory: is it really all alike?

Well no, it is not. The virtual address space can in turn be divided into:

  • Plain virtual memory, accessed directly through VirtualAlloc() and VirtualFree()
  • Stacks: each thread in a process is given a user-mode stack in the process' virtual address space. It's size can be determined through the /STACK linker option.
  • Code: this piece of memory contains code which is loaded from an executable module (dll or exe), including static/global variables
  • Heaps: a process starts life with an initial heap (called the process heap), whose size is determined by the /HEAP linker option. At runtime, any number of heaps can be created/destroyed with the HeapCreate()/HeapDestroy() functions. Within heaps, memory can be allocated/deallocated through HeapAlloc()/HeapFree()
  • Thread-Local Storage (TLS)
  • Managed heap: this is the memory used by the CLR to allocate objects and to provide its memory services like for example the Garbage Collector

It is important to understand a couple of things:

  • The high-level memory structures described above (Stacks, Code, Heaps, Managed Heap) are all layered on top of plain virtual memory. In other words, the system uses VirtualAlloc()/VirtualFree, or logical equivalents, to allocate memory for stacks, heaps and managed heaps. For example, the Heap Manager calls VirtualAlloc() to get a chunk of virtual memory (a Heap Segment). The Heap Manager then divides the segment into smaller, variable-size areas (Heap Blocks).
  • The managed heap is layered on top of plain virtual memory, not on top of operating system heaps. In other words, a "managed heap" is not an OS heap.

Inspecting memory with the debugger

So what happens if you would like to know to which area of memory a given address belongs? The !address debugger extension is your friend here. Let's see this in practise.

Let's attach the debugger to a running instance of notepad.exe (BTW: I am not sure how useful is notepad for other purposes Smile, but I definitely recommend it as a debugging target: easy to launch, no problem if you stop it in the debugger, no problem if you terminate it while debugging) and let's familiarize ourselves with the !address command:

0:000> kb
ChildEBP RetAddr  Args to Child              
0011fe10 768bf837 768bf86a 0011fe54 00000000 ntdll!KiFastSystemCallRet
0011fe14 768bf86a 0011fe54 00000000 00000000 USER32!NtUserGetMessage+0xc
0011fe30 002f1418 0011fe54 00000000 00000000 USER32!GetMessageW+0x33
0011fe70 002f195d 002f0000 00000000 0051213a notepad!WinMain+0xec
0011ff00 77a74911 7ffdc000 0011ff4c 778ee4b6 notepad!_initterm_e+0x1a1
0011ff0c 778ee4b6 7ffdc000 4b9dc4d5 00000000 kernel32!BaseThreadInitThunk+0xe
0011ff4c 778ee489 002f31ed 7ffdc000 00000000 ntdll!__RtlUserThreadStart+0x23
0011ff64 00000000 002f31ed 7ffdc000 00000000 ntdll!_RtlUserThreadStart+0x1b
0:000> !address 0011fe30
 ProcessParametrs 00511400 in range 00510000 0054b000
 Environment 005107e8 in range 00510000 0054b000
    000e0000 : 0010f000 - 00011000
                    Type     00020000 MEM_PRIVATE
                    Protect  00000004 PAGE_READWRITE
                    State    00001000 MEM_COMMIT
                    Usage    RegionUsageStack
                    Pid.Tid  17c.1de
00:000> !address 0051213a 
 ProcessParametrs 00511400 in range 00510000 0054b000
 Environment 005107e8 in range 00510000 0054b000
    00510000 : 00510000 - 0003b000
                    Type     00020000 MEM_PRIVATE
                    Protect  00000004 PAGE_READWRITE
                    State    00001000 MEM_COMMIT
                    Usage    RegionUsageHeap
                    Handle   00510000
0:000> !address 768bf86a 
 ProcessParametrs 00511400 in range 00510000 0054b000
 Environment 005107e8 in range 00510000 0054b000
    768a0000 : 768a1000 - 00068000
                    Type     01000000 MEM_IMAGE
                    Protect  00000020 PAGE_EXECUTE_READ
                    State    00001000 MEM_COMMIT
                    Usage    RegionUsageImage
                    FullPath C:\Windows\system32\USER32.dll
0:000> !address 4b9dc4d5 
 ProcessParametrs 00511400 in range 00510000 0054b000
 Environment 005107e8 in range 00510000 0054b000
    10011000 : 10011000 - 6325f000
                    Type     00000000 
                    Protect  00000001 PAGE_NOACCESS
                    State    00010000 MEM_FREE
                    Usage    RegionUsageFree

The first address 0011fe30 is the start of a frame on the call stack, and the !address command consequently reports this address being in a stack range, also reporting which thread owns that stack.

The second address 0051213a comes from a value on the stack, and it is not immediately clear where it points to. The !address command tells us that this is a heap address, and it also reports the handle for the owning heap. This handle can then be used as an argument to the !heap command in order to find out more about that heap.

The third address 768bf86a is stored as a return address in the call stack, so we would expect it to point to executable code for some loaded module. The !address command confirms this and it also reports which module (user32.dll in this case) contains that address.

Note: executable code does not necessarily fall in the in-memory image of loaded modules: this is the case, for instance, with .NET code, which is compiled just-in-time at runtime from Intermediate Language (IL) code to machine code.

Last, the value 4b9dc4d5, also found on the stack, does not point to allocated memory, and the !address command indicates this by displaying the usage RegionUsageFree.

A first look at the crash dump

Let's open the dump of the process at crash time and have a look.

The call stack:

0:016> kb50
ChildEBP RetAddr  Args to Child              
049aa218 7c827d0b 77e61d1e 000007ac 00000000 ntdll!KiFastSystemCallRet
049aa21c 77e61d1e 000007ac 00000000 049aa260 ntdll!NtWaitForSingleObject+0xc
049aa28c 77e61c8d 000007ac 0001d4c0 00000000 kernel32!WaitForSingleObjectEx+0xac
049aa2a0 6951163f 000007ac 0001d4c0 049ac350 kernel32!WaitForSingleObject+0x12
049aa308 69506136 049ae350 049ac350 00000088 faultrep!MyCallNamedPipe+0x15b
049ae764 69508b5c 049af858 049af38c 00000001 faultrep!StartManifestReport+0x1d5
049af5b0 77e7650f 049af858 00000001 c0000005 faultrep!ReportFault+0x3d2
049af80c 77bc3e74 049af858 00000000 00000000 kernel32!UnhandledExceptionFilter+0x494
049af82c 77bcb547 c0000005 049af858 77bc6cd5 msvcrt!_XcptFilter+0x178
049af838 77bc6cd5 049af860 00000000 049af860 msvcrt!_endthreadex+0xba
049af860 7c828752 049af944 049affa8 049af960 msvcrt!_except_handler3+0x61
049af884 7c828723 049af944 049affa8 049af960 ntdll!ExecuteHandler2+0x26
049af92c 7c82855e 049a5000 049af960 049af944 ntdll!ExecuteHandler+0x24
049af92c 7c82a754 049a5000 049af960 049af944 ntdll!KiUserExceptionDispatcher+0xe
049afc38 7c82a82b 00030000 00323030 049afd00 ntdll!RtlpCoalesceFreeBlocks+0x36e
049afd20 77bbcef6 00030000 00000000 04b0e060 ntdll!RtlFreeHeap+0x38e
049afd68 61494feb 04b0e060 04ce3320 00000001 msvcrt!free+0xc3
WARNING: Stack unwind information not available. Following frames may be wrong.
049afd80 61494fac 04b02d20 00000001 04ce83f0 oran9+0x14feb
049afd9c 61494f8f 04ce3320 00000001 029bb838 oran9+0x14fac
049afdb8 61494f8f 04ce83f0 00000001 04c49e80 oran9+0x14f8f
049afdd4 61494fac 029bb838 00000001 77bbce33 oran9+0x14f8f
049afdf0 614950d8 04c49e80 00000000 049afe14 oran9+0x14fac
049afe00 61401fda 04c49e80 028e56d0 00000000 oran9+0x150d8
049afe14 614015ef 04c05acc 04bd9a18 04bd9a00 oranl9+0x1fda
049afe30 614bee6d 04bd9a00 04bd99e0 02959b40 oranl9+0x15ef
049afe44 614bed0c 04bd9a18 04bd99e0 00000001 oran9+0x3ee6d
049afeb0 6148f77c ffffffff 00000000 00000000 oran9+0x3ed0c
049afedc 6071631e 03dbb60c 00000000 00000000 oran9+0xf77c
049afefc 606f3e92 03dbb568 03dbb300 03dbb528 oraclient9+0x11631e
049aff10 606aed21 03dbb568 049aff2c 606ad7f9 oraclient9+0xf3e92
049aff1c 606ad7f9 03dbb568 0278fc20 049aff38 oraclient9+0xaed21
049aff2c 027c2095 03dbb528 049aff50 4c9bcf5a oraclient9+0xad7f9
049aff38 4c9bcf5a 03dbb528 7739cf99 00000000 ociw32+0x2095
049aff50 4c9bd296 03dbb300 00000000 02983bc0 msorcl32!ServiceOCIWorkRequest+0x74
049aff84 77bcb530 03dbb300 00000000 00000000 msorcl32!OCIWorkerThreadFunc+0x57
049affb8 77e64829 00038198 00000000 00000000 msvcrt!_endthreadex+0xa3
049affec 00000000 77bcb4bc 00038198 00000000 kernel32!BaseThreadStart+0x34

tells us that an exception occurred while executing ntdll!RtlpCoalesceFreeBlocks (see ntdll!KiUserExceptionDispatcher executing on top of it). There wasn't an exception handler for it so we resorted to the unhandled exceptions filter (kernel32!UnhandledExceptionFilter), whose handling of the exception involved Windows Error Reporting (see faultrep!ReportFault on the stack), which took the dump. Then the process was terminated.

The first step is to restore the context of the exception. kernel32!UnhandledExceptionFilter takes an EXCEPTION_POINTERS argument, whose second argument is the CONTEXT structure with the context information of our exception.

0:016> dd 049af858 L2
049af858  049af944 049af960
0:016> .cxr 049af960
eax=04b0e070 ebx=00030000 ecx=00323030 edx=31203a72 esi=04b0e068 edi=04b0e058
eip=7c82a754 esp=049afc2c ebp=049afc38 iopl=0         nv up ei ng nz na pe cy
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00010287
ntdll!RtlpCoalesceFreeBlocks+0x36e:
7c82a754 8b09            mov     ecx,dword ptr [ecx]  ds:0023:00323030=????????
0:016> kb50
ChildEBP RetAddr  Args to Child              
049afc38 7c82a82b 00030000 00323030 049afd00 ntdll!RtlpCoalesceFreeBlocks+0x36e
049afd20 77bbcef6 00030000 00000000 04b0e060 ntdll!RtlFreeHeap+0x38e
049afd68 61494feb 04b0e060 04ce3320 00000001 msvcrt!free+0xc3
WARNING: Stack unwind information not available. Following frames may be wrong.
049afd80 61494fac 04b02d20 00000001 04ce83f0 oran9+0x14feb
049afd9c 61494f8f 04ce3320 00000001 029bb838 oran9+0x14fac
049afdb8 61494f8f 04ce83f0 00000001 04c49e80 oran9+0x14f8f
049afdd4 61494fac 029bb838 00000001 77bbce33 oran9+0x14f8f
049afdf0 614950d8 04c49e80 00000000 049afe14 oran9+0x14fac
049afe00 61401fda 04c49e80 028e56d0 00000000 oran9+0x150d8
049afe14 614015ef 04c05acc 04bd9a18 04bd9a00 oranl9+0x1fda
049afe30 614bee6d 04bd9a00 04bd99e0 02959b40 oranl9+0x15ef
049afe44 614bed0c 04bd9a18 04bd99e0 00000001 oran9+0x3ee6d
049afeb0 6148f77c ffffffff 00000000 00000000 oran9+0x3ed0c
049afedc 6071631e 03dbb60c 00000000 00000000 oran9+0xf77c
049afefc 606f3e92 03dbb568 03dbb300 03dbb528 oraclient9+0x11631e
049aff10 606aed21 03dbb568 049aff2c 606ad7f9 oraclient9+0xf3e92
049aff1c 606ad7f9 03dbb568 0278fc20 049aff38 oraclient9+0xaed21
049aff2c 027c2095 03dbb528 049aff50 4c9bcf5a oraclient9+0xad7f9
049aff38 4c9bcf5a 03dbb528 7739cf99 00000000 ociw32+0x2095
049aff50 4c9bd296 03dbb300 00000000 02983bc0 msorcl32!ServiceOCIWorkRequest+0x74
049aff84 77bcb530 03dbb300 00000000 00000000 msorcl32!OCIWorkerThreadFunc+0x57
049affb8 77e64829 00038198 00000000 00000000 msvcrt!_endthreadex+0xa3
049affec 00000000 77bcb4bc 00038198 00000000 kernel32!BaseThreadStart+0x34

The call stack above is the one which brought to the exception. Just forgot: which exception are we talking about? Here it is:

0:016> .exr 049af944 
ExceptionAddress: 7c82a754 (ntdll!RtlpCoalesceFreeBlocks+0x0000036e)
   ExceptionCode: c0000005 (Access violation)
  ExceptionFlags: 00000000
NumberParameters: 2
   Parameter[0]: 00000000
   Parameter[1]: 00323030
Attempt to read from address 00323030

The argument to msvcrt!free, 04b0e060, is the address that the application is freeing. msvcrt!free calls ntdll!RtlFreeHeap to do the real job, because the C Runtime Heap is implemented on top of the operating system heap. In order to understand what's happening up in our call stack, a bit of background information is needed.

A couple of relevant things about heaps

The first important thing we need to be aware of is that the data structures of the operating system heap changed in Vista and Windows Server 2008, so we need to check which operating system this process was running on. The ever-useful !vertarget command comes to the rescue:

0:016> vertarget
Windows Server 2003 Version 3790 (Service Pack 2) MP (8 procs) Free x86 compatible
Product: Server, suite: Enterprise TerminalServer SingleUserTS
Machine Name:
Debug session time: Wed Oct 15 11:44:02.000 2008 (GMT+1)
System Uptime: not available
Process Uptime: 0 days 5:01:05.000
  Kernel time: 0 days 0:06:49.000
  User time: 0 days 0:04:21.000

So this is Windows Server 2003 and we can forget about the new data structures introduced in later operating systems. Nonetheless, if you are interested, you can find more details in the book "Advanced Windows Debugging", authors Mario Hewardt and Daniel Pravat.

Another thing we need to know is that ntdll!RtlpCoalesceFreeBlocks(), showing up at the top of the call stack, is called when a block of heap memory is freed and the heap manager detects that there are adjacent blocks that are also free. In this case, the 2 or 3 adjacent blocks are merged into one, larger free block, so as to reduce heap fragmentation. The access violation occurring in ntdll!RtlpCoalesceFreeBlocks() therefore indicates that, while manipulating the heap data structures to merge bocks, we ran into a bad address. This in turn is an indication that some of those data structures bacame corrupted some time earlier.

So at this point we can conclude that the crash was caused by a corruption of a heap in the process. This, alone, may be enough to set up some standard troubleshooting steps, like for example enabling the page heap for the process.

However, as it is often the case in troubleshooting, the deeper we go in our analysis, the more we'll be able to devise an effective set of "next steps" to take. In some cases, a careful comparison of this in-depth analysis with the source code of the application can even allow to identify and fix the bug directly. So let's take the pain of looking into the details of the heap blocks.

Narrowing down the corruption

The above considerations on the structure of a process memory comes into play here. The address 04b0e060 that we are trying to free is part of a heap. The !address command, indeed, confirms that. If we want more details on what this address means in the heap, we need to switch to the !heap command. In particular, !heap with the -x option allows us to find out the information about the heap block that address belongs to:

0:016> !heap -x 04b0e060
List corrupted: (Blink->Flink = 00000000) != (Block = 04b0e060)
HEAP 00030000 (Seg 04a90000) At 04b0e058 Error: block list entry corrupted

Entry     User      Heap      Segment       Size  PrevSize  Unused    Flags
-----------------------------------------------------------------------------
04b0e058  04b0e060  00030000  04a90000        10        20         a  free last 

The fields above are basically those of the internal HEAP_BLOCK data structure. The block starts at address 04b0e058, 8 bytes are taken by the header of the block (the HEAP_BLOCK structure), so the address of the user memory is 04b0e060. Those fields will be relevant for our analysis:

  • Size: the size, in bytes, of this block. Note that, since the header takes 8 bytes, size 0x10 means that 8 bytes were allocated by the caller of HeapAlloc()
  • PrevSize: the size of the previous block. Note that, since blocks are layed out without holes in the heap, Size and PrevSize allows us to move from block to block. For example, we now know that the next block is at address 04b0e058+10 = 04b0e068 and that the previous block is at address 04b0e058-20 = 04b0e038
  • Flags: there are different flags for a heap block. free means that this block is marked as free. Free blocks are placed in the free list. The free list is a linked list of blocks that are available for reuse. Pointers to previous and next elements in the linked list are kept in the user part of the heap block, in this case starting at address 04b0e060. This is always possible because the user-allocated part of a heap block is at least 8 bytes and, once a block is free, that part is no longer used. This guarantees that we can always store 2 addresses there, which we can call the Blink and Flink of the free list.

We can also dump out the heap block header manually to figure out the offset of those fields in the HEAP_BLOCK structure:

0:016> dd 04b0e058 L2
04b0e058  00040002 030a10f2

So Size (0x2, expressed in 8-byte units) is at offset 0, PrevSize (0x4, again in 8-byte units) is at offset 2, Flags (0x10) is at offset 5 and Unused (0x0a) is at offset 6:

 

Drawing1

With this information in our hands, let's now try and understand the first 2 lines of the !heap -x output above. BLink->Flink means to go back to the previous entry in the free list and then follow its FLink. So let's manually dump out the BLink of our block:

0:016> dd 04b0e060 L1
04b0e060  00000000

So this is 0, which explains the message that came from the debugger. This does not mean, however, that this is a real corruption in the heap: the bookkeeping data structures of a heap may be inconsistent while they are being modified by the heap manager, because they are in a transient state. And here, since ntdll!RtlFreeHeap is executing, calling ntdll!RtlpCoalesceFreeBlocks, we are, indeed, modifying those data structures. In particular, the heap manager has already marked the entry as free (this is done in RtlHeapFree before calling RtlpCoalesceFreeBlocks), but its FLink and BLink have not been set yet (note that, if this block will be merged with a previous block, FLink and BLink won't be set at all).

So let's ignore that debugger message and let's progress in the search of the problem that caused the process crash. ntdll!RtlpCoalesceFreeBlocks looks at nearby blocks in order to check whether a block merge is possible so let's check whether the nearby heap blocks are healthy. Previous block:

0:016> !heap -x 04b0e038
List corrupted: (Blink->Flink = 00000000) != (Block = 04b0e060)
HEAP 00030000 (Seg 04a90000) At 04b0e058 Error: block list entry corrupted

Entry     User      Heap      Segment       Size  PrevSize  Unused    Flags
-----------------------------------------------------------------------------
04b0e038  04b0e040  00030000  04a90000        20        38         8  busy 

Apart for the usual message, the block appears a valid one. Note in particular that its size (20) matches the PrevSize of the following block. Following block:

0:016> !heap -x 04b0e068

List corrupted: (Blink->Flink = 00000000) != (Block = 04b0e060)
HEAP 00030000 (Seg 04a90000) At 04b0e058 Error: block list entry corrupted

Mmmhh..., no output for the next block. Could this be an indication that there is a problem with the next heap block? Since the debugger command was not of much use, let's dump out the block header manually, then interpret it by using the offsets that we figured out previously.

0:016> dd 04b0e068 L2
04b0e068  00020004 6f727245

So Size = 4 * 8 = 0x20, at offset 0, PrevSize = 2 * 8 = 0x10 at offset 2. These appear to be valid values. In particular, PrevSize matches with Size of the previous block.

The second byte of the block header, however, does not look good: Flags = 0x72 at offset 5 is not a valid combination of flags, and Unused = 0x72 at offset 6 is also invalid.

04b0e06c=04b0e068+4, therefore, seems to be the address where we first notice a corruption. In this case, it is useful to try and read the memory starting at that address in different formats, so as to detect possible patterns. In this particular case, we see that the bytes 0x45, 0x72, 0x72 and 0x6f that are at address 04b0e06c seem to fall in the range of valid ANSI characters so the first attempt is to read the memory as an ANSI string:

0:016> da 04b0e06c
04b0e06c  "Error: 1002"

Bingo!! We found a string where it should not have been, overwriting part of a heap block header. This block is next to the one we are freeing, so the crash occurs when ntdll!RtlpCoalesceFreeBlocks() inspects it to check if it can be merged with the previous one. It is also interesting to note that we came to this conclusion without the need to look into the code (disassembly) of ntdll!RtlFreeHeap or ntdll!RtlpCoalesceFreeBlocks.

Going further

The corruption is in a block next to the one we are freeing, so the call stack is not of much help because it refers to the freeing of a different block. This is one manifestation of the problem that I mentioned at the beginning of the article: the cause of the memory corruption occurred earlier and we are now only experiencing its symptoms. Backtracking to the source of the corruption is not easy. Nonetheless, let's have a look at the additional steps that we can take.

First, we can detect the extent of the corruption. Since the corrupted block size is 0x20, we can check whether the next block is valid:

0:016> dd 04b0e088 L2
04b0e088  00040004 030801e8

So yes, this appears to be a valid header (Size 0x20, PrevSize 0x20, matching the previous block's size, Flags = 1 and Unused = 8).

Second, we may analyze the application's code in search of possible issues with the way it handles errors (in particular, those coming from the data access layer, since "Error: 1002" comes from a database operation).

Should code analysis not be effective in identifying the problem, we may try to follow the chain of pointers in the process memory. The idea is that, in order to write to a memory address, you need to point to it in the first place. So chances are that, in the address space, we are still storing the address 04b0e06c somewhere. We can search for it with the s command:

s -d 0 L?80000000 04b0e06c

Yet another option is to search for the string itself (Error: 1002) in memory. This would cover the case where the string was copied from one place to another.

0:016> s -a 0 L?80000000 "Error: 1002"
04b0e06c  45 72 72 6f 72 3a 20 31-30 30 32 00 00 00 00 00  Error: 1002.....

We are not lucky in this case: the only instance we found is the one we already know of.

Note: for these searches to be effective, we would need a full dump. In this particular case the dump I was provided was a heap dump (.hdmp) taken by the error reporting tool. This dump contains heap information only, so the results of the searches are limited to heap memory.

Conclusions

The case study showed how to use information on the heap structure in order to identify the corruption which caused a process to terminate. Some of the cheap takeaways of the analysis above:

  • !heap -x is a very handy way to find the heap block an address belongs to and to check the integrity of the heap block
  • if !heap -x or other debugger command reports a corruption, that does not mean that the heap is corrupted: if any of the threads in the process are executing code in the heap manager, chances are that the heap data structures are in a transient, invalid state, which is not the result of any corruption
  • if !heap -x does not show a heap block because of a corruption, you can dump out its header manually
  • identifying the string (or other object) which corrupted the heap requires extra effort, but it usually pays off: with a bit of further analysis you may be able to find and fix the problem without resorting to more invasive troubleshooting tools
  • navigating the heap data structures to spot a corruption and get it resolved can be a big fun
Posted by Carlo Colombo | 1 Comments
Filed under: ,

Troubleshooting a DCOM issue: Case Study

Hi all,this is my first post after a LONG LONG time since I created the (so far empty) blog. I guess it is not by coincidence that last night it snowed in Milan Smile.

Let's get started with the case study at hand. In the Distributed Services team in Microsoft Developer Support we support technologies for developing distributed systems, including DCOM/COM+, Distributed Transactions, Windows Communication Foundation (WCF), Workflow Foundation (WF), CLR COM Interoperability, and more.

This case study refers to a DCOM problem which I have been handled recently. The purpose is not only to go into some details on how some pieces of DCOM work, but also to show a troubleshooting strategy for these types of issues.

Problem Definition

The problem can be summarized as follows: a client-server DCOM application works perfectly fine but, from time to time, an XML operation on the server fails with error -2147417856. This happens mainly when a long-running DCOM call is in progress from the client to the server, and only on some machines. Other machines running the same application are unaffected.

The operating system is Windows XP.

First Steps

First, HRESULT -2147417856 is hexadecimal 0x80010100 and symbolic RPC_E_SYS_CALL_FAILED, defined in the header file winerror.h:

// MessageId: RPC_E_SYS_CALL_FAILED
//
// MessageText:
//
// System call failed.
//
#define RPC_E_SYS_CALL_FAILED            _HRESULT_TYPEDEF_(0x80010100L)

 

The HRESULT's facility is 0x0001 or FACILITY_RPC, which means this error comes from the RPC layer. This tells us that we should focus on the lower-layers of the DCOM communication infrastructure instead of the application layer (customer code).

Second, we need to better understand what an "XML operation on the server fails" really means.  Upon further investigation it turns out that a method of a server-side DCOM object is receiving an IStream COM object from the client, and it is calling msxml's IXMLDOMDocument::load(), passing the IStream object as an argument. In pseudo-code:

HRESULT MyObj::Method1(IStream* pXml)
{
    IXMLDOMDocument* pDoc;
    // create the document object here
    // ...
    return pDoc->load(pXml);
}

It could therefore be that the problem is not so much in the msxml DOMDocument object, but instead in the IStream object which is being passed as an argument: the load() implementation will call methods on the IStream object in order to read XML from the stream and, most likely, will fail if any of those methods fails.

So let's modify the server's code implementation so that we make a preliminary call to the client's IStream object and check its return value:

HRESULT MyObj::Method1(IStream* pXml)
{
    IXMLDOMDocument* pDoc;
    // create the document object here
    // ...
    // make a method call to the IStream object and check its value
    HRESULT hr;
    if (FAILED(hr = pXml->Seek(0, STREAM_SEEK_SET, NULL))
    {
        // log the failure here
    }
    return pDoc->load(pXml);
}

With this change in place, the next occurrence of the problem tells us that, indeed, the Seek() method on the IStream object fails, returning RPC_E_SYS_CALL_FAILED. This is our current understanding of the issue:

Drawing1

 

Ok, I guess it is time to go back and understand a bit more about the client.

Going a bit deeper

IStream is a generic COM interface which has many implementations shipping out-of-the-box in COM, not mentioning that it could be implemented by any customer COM objects; so which object implements the IStream interface we are handing out as an argument in the COM call to the server?

It turns out the object implementing IStream is created on the client side by a call to CreateStreamOnHGlobal(), so this is an in-memory, native (unmanaged) stream. Also, the client always succeeds in creating the IStream object and in calling methods on it, it is only when the object is passed on to the server that, sometimes, method calls fail.

The big difference between method calls to the IStream object made from the client directly and method calls made from the server is that the first are intra-apartment method calls, whereas the second are cross-apartment method calls. Cross-apartment method calls always involve marshaling, the details of which depend on the type of COM apartments (for completeness sake, intra-apartment calls may also require marshaling; this is called lightweight marshaling because it does not involve a thread switch). The client DCOM apartment is a single-threaded apartment (STA), since it needs to handle the User Interface. DCOM relies on windows messages and windows message queues in order to dispatch DCOM calls to an STA, so the theory goes that the error condition could be due to a problem in this communication mechanism. Remember: RPC_E_SYS_CALL_FAILED is a FACILITY_RPC error so it makes sense to focus on the communication layer.

Ok, enough conjecturing for now, it's time to get our hands dirty with debugging and such.

First debugging

Let's attach the debugger to the client process and run another session, waiting for the problem to occur. One of the purposes of running a process under the debugger is to check for first-chance exceptions: these exceptions are raised in the process and handled, so they would go unnoticed in a normal run. However, since the debugger is notified of every exception occurring in the debuggee, irrespective of whether the process handles it or not, attaching a debugguer is a great way to check whether some handled exceptions occur in the process.

A run under the debugger shows that there are, indeed, some first-chance exceptions occurring in the client process, but they occur at times that are seemingly unrelated to the RPC_E_SYS_CALL_FAILED failure, in that we have not come yet to the point of calling our server and handing over the IStream object. So let's take a dump of the process when one such exception occurs and have a look.

This is the exception:

0:003> .exr -1
ExceptionAddress: 7c812a5b (kernel32!RaiseException+0x00000053)
   ExceptionCode: 80010100
  ExceptionFlags: 00000001
NumberParameters: 0

So the exception is exactly RPC_E_SYS_CALL_FAILED. It is being raised on thread 3, which has this call stack:

0:003> k200
ChildEBP RetAddr  
0f9ef4ac 77e624c2 kernel32!RaiseException+0x53
0f9ef4c4 77eb65f1 rpcrt4!RpcpRaiseException+0x24
0f9ef4d4 77eb4ead rpcrt4!NdrProxySendReceive+0x60
0f9ef8b0 77eb4e42 rpcrt4!NdrClientCall2+0x1fa
0f9ef8d0 77e4a453 rpcrt4!ObjectStublessClient+0x8b
0f9ef8e0 769b6191 rpcrt4!ObjectStubless+0xf
0f9ef968 76a1a7c6 ole32!CObjectContext::InternalContextCallback+0x122
0f9ef9b8 7a0bf7db ole32!CObjectContext::ContextCallback+0x85
0f9efa04 7a0c04b2 mscorwks!CtxEntry::EnterContextOle32BugAware+0x2b
0f9efb24 79ffb8a1 mscorwks!CtxEntry::EnterContext+0x322
0f9efb58 79ffb929 mscorwks!RCWCleanupList::ReleaseRCWListInCorrectCtx+0xc4
0f9efba8 79f8770b mscorwks!RCWCleanupList::CleanupAllWrappers+0xdb
0f9efbec 79ef32ee mscorwks!SyncBlockCache::CleanupSyncBlocks+0xec
0f9efdb0 79fb99b9 mscorwks!Thread::DoExtraWorkForFinalizer+0x40
0f9efdc0 79ef3207 mscorwks!WKS::GCHeap::FinalizerThreadWorker+0xca
0f9efdd4 79ef31a3 mscorwks!Thread::DoADCallBack+0x32a
0f9efe68 79ef30c3 mscorwks!Thread::ShouldChangeAbortToUnload+0xe3
0f9efea4 79fb9643 mscorwks!Thread::ShouldChangeAbortToUnload+0x30a
0f9efecc 79fb960d mscorwks!ManagedThreadBase_NoADTransition+0x32
0f9efedc 79fba09b mscorwks!ManagedThreadBase::FinalizerBase+0xd
0f9eff14 79f95a2e mscorwks!WKS::GCHeap::FinalizerThreadStart+0xbb
0f9effb4 7c80b683 mscorwks!Thread::intermediateThreadProc+0x49
0f9effec 00000000 kernel32!BaseThreadStart+0x37

 

Note from the line in blue that this is the GC Finalizer thread which is doing its cleanup. The line in red tells us that the finalizer is doing cleanup of the Runtime Callable Wrappers (RCWs), the managed wrappers for a COM object reference. Last, the line in green means that, as part of the cleanup of the RCW, the thread is making an RPC call to another thread. Since cleanup of an RCW basically means calling IUnknown::Release() on the underlying COM object, this means that the RCW that we are cleaning up has a reference to a COM object which lives in an apartment other than the one which hosts the GC finalizer.

Let's look at the threads in the process from a CLR perspective:

0:003> .loadby sos mscorwks
0:003> !threads
ThreadCount: 3
UnstartedThread: 0
BackgroundThread: 3
PendingThread: 0
DeadThread: 0
Hosted Runtime: no
                                      PreEmptive   GC Alloc           Lock
       ID OSID ThreadOBJ    State     GC       Context       Domain   Count APT Exception
   0    1  b08 000f52a8      4220 Enabled  00000000:00000000 000ee0c0     0 STA
   3    2  520 001057d0      b220 Enabled  00000000:00000000 000ee0c0     0 MTA (Finalizer)
   5    3  e50 000779a0   880b220 Enabled  00000000:00000000 000ee0c0     0 MTA (Threadpool Completion Port)

 

The output confirms that thread 3 is, indeed, running the GC Finalizer, and this thread is part of the process' Multi-Threaded Apartment (MTA). It also confirms that thread 0 in the client process is an STA, handling the user interface for the process. Indirectly, it also says that, if the COM object the finalizer is releasing lives in this same process, the apartment we are calling into has to be the STA hosted by thread 0, since there are no other apartments in this process (truth to be said, we would need to double-check this by looking at each and every thread in the process, not only to managed threads shown by the !threads extension). We can double-check this by looking at the RCW data structure. This step requires an in-depth look at the call stack. Let's unassemble the function in red in the call stack above up to the point where it calls mscorwks!RCWCleanupList::ReleaseRCWListInCorrectCtx():

0:003> u 79ffb920
mscorwks!RCWCleanupList::CleanupAllWrappers+0xd2:
79ffb920 8d45d8          lea     eax,[ebp-28h]
79ffb923 50              push    eax
79ffb924 e8b9bef8ff      call    mscorwks!RCWCleanupList::ReleaseRCWListInCorrectCtx (79f877e2)
79ffb929 8b4de0          mov     ecx,dword ptr [ebp-20h]
79ffb92c 3bcb            cmp     ecx,ebx
79ffb92e 740c            je      mscorwks!RCWCleanupList::CleanupAllWrappers+0xee (79ffb93c)
79ffb930 e8e579e7ff      call    mscorwks!CrstBase::Enter (79e7331a)
79ffb935 c745e401000000  mov     dword ptr [ebp-1Ch],1

 

So @ebp-28 holds the reference to the data structure  that we are passing to mscorwks!RCWCleanupList::ReleaseRCWListInCorrectCtx(). Being the ebp for this frame 0f9efba8, this address is

0:003> ?0f9efba8-28
Evaluate expression: 262077312 = 0f9efb80

In turn, this points to a data structure which has, at offset 8, the pointer to the RCW:

0:003> ? poi(poi(0f9efb80)+8)
Evaluate expression: 310740888 = 12858798

0x12858798, therefore, points to our RCW and the thread for the hosting STA is at offset 0x24 of the RCW data structure:

0:003> dd 12858798+74 L1
1285880c  000f52a8

Note that this is the managed object for thread 0, as reported by the !threads command above. This confirms that the RCW points to a COM object which lives in the STA (Thread 0) of this process.

Note: the offsets which we used here are specific of version 2.0 of the CLR, they are potentially subject to change.

So what does all this mean? Well, at this point we have evidence that cross-apartment calls to the STA of this process, at some point, fail. And this happens from at least 2 different "client" apartments: the Multi-Threaded Apartment (MTA) in the same process, or an apartment in a different process.

So we can come to a more refined formulation of the problem: at some point the main STA in the client process reaches a state in which it cannot receive DCOM method calls from other apartments. Bear with me and let's try to understand why.

Looking into the kernel

We know that COM method calls are delivered to STAs through window messages. More precisely, PostThreadMessage() is used to deliver the messages, so that the message is delivered asynchronously and the caller is not blocked.

So, for the method call to fail we would need PostThreadMessage() to fail. Let's look at what the documentation has to say on PostThreadMessage():

"There is a limit of 10,000 posted messages per message queue".

If we reach that limit, PostThreadMessage() fails. Since the message queues for each thread are in the kernel address space, we need to take and analyze a kernel dump in order to verify our theory. So let's configure the system so as to take a kernel dump when our application runs into the problem, and let's analyze the dump:

0: kd> !process 0 0

PROCESS 88be5020  SessionId: 0  Cid: 0d30    Peb: 7ffdf000  ParentCid: 0f30
    DirBase: 375f9000  ObjectTable: e4259840  HandleCount: 1446.
    Image: Test.exe 
0: kd> .process /r /p 88be5020
Implicit process is now 88be5020
0: kd> !process 88be5020
PROCESS 88be5020  SessionId: 0  Cid: 0d30    Peb: 7ffdf000  ParentCid: 0f30
    DirBase: 375f9000  ObjectTable: e4259840  HandleCount: 1446.
    Image: Tune.exe
    VadRoot 88b37608 Vads 1565 Clone 0 Private 161460. Modified 1581. Locked 0.
    DeviceMap e12178e8
    Token                             e4f81030
    ElapsedTime                       01:15:53.736
    UserTime                          00:01:20.953
    KernelTime                        00:02:07.781
    QuotaPoolUsage[PagedPool]         399032
    QuotaPoolUsage[NonPagedPool]      79756
    Working Set Sizes (now,min,max)  (180369, 50, 345) (721476KB, 200KB, 1380KB)
    PeakWorkingSetSize                180369
    VirtualSize                       1071 Mb
    PeakVirtualSize                   1071 Mb
    PageFaultCount                    478046
    MemoryPriority                    BACKGROUND
    BasePriority                      8
    CommitCharge                      181025
    DebugPort                         88ab7e00

        THREAD 891d6838  Cid 0d30.0d1c  Teb: 7ffde000 Win32Thread: e4ed52e0 RUNNING on processor 1 <= STA
...

Again we need a bit of internal information here. The posted-messages queue is at offset 0xd0 in the Win32Thread structure, and the number of messages in the queue is at offset 8 of the posted-messages queue structure:

0: kd> dd e4ed52e0+d0+8 L1
e4ed53b8  00002710
0: kd> ?00002710
Evaluate expression: 10000 = 00002710

Note: the offsets which we used here are specific of Windows XP and they are potentially subject to change.

So here we go: the message queue for posted messages of our STA is, indeed, full. This explains why window messages, including COM method calls, cannot be delivered to this thread. If you are curious, by now you'll already have come up with the next question: "Ok, the message queue is full but why did we end up in this situation?".

Fair enough, I agree this is a legitimate question, so let's try and find an explanation.

Inspecting the Windows Message Queue

Windows Message Queues are per-thread and they work like all the other queues: someone places items (window messages) in the queue, and someone else takes them away from the queue. If the rate of arrivals of items in the queue is higher than the rate of retrieval of items from the queue, it's only a matter of time before the queue fills up.

For Windows Messages, PeekMessage() without the PM_NOREMOVE flag and GetMessage() are the functions which remove messages from the message queue. So if the queue reaches its limits it means those functions are not being called fast enough, at least compared to the rate of message arrival.

Let's first of all find out which messages are left in the queue: the messages are in a list starting 8 bytes before the number of messages

0: kd> dd e4ed52e0+d0 L1
e4ed53b0  e4ddd110

Let's dump some of them:

0: kd> dl e4ed52e0+d0 30
e4ed53b0  e4ddd110 e1487008 00002710 00000000
e4ddd110  e43f2d70 00000000 00020300 00000113
e43f2d70  e4f93670 e4ddd110 00020300 00000113
e4f93670  e12a9878 e43f2d70 00020300 00000113
e12a9878  e4eefb90 e4f93670 00020300 00000113
e4eefb90  e4c18238 e12a9878 00020300 00000113
e4c18238  e5019370 e4eefb90 00010364 0000c105
e5019370  e12f1c30 e4c18238 00000000 0000c0fe
e12f1c30  e513f860 e5019370 00010364 0000c109
e513f860  e3fa7fd0 e12f1c30 00020300 00000113
e3fa7fd0  e1640050 e513f860 00010364 00000113
e1640050  e3bd8d68 e3fa7fd0 00020300 00000113
e3bd8d68  e5291230 e1640050 00020300 00000113
e5291230  e5417c90 e3bd8d68 00020300 00000113
e5417c90  e43cc900 e5291230 0003052e 00000113
e43cc900  e11eba90 e5417c90 00020300 00000113
e11eba90  e41b2730 e43cc900 00020300 00000113
e41b2730  e513f168 e11eba90 00020300 00000113
e513f168  e41b6c70 e41b2730 00020300 00000113
e41b6c70  e4d2ab08 e513f168 00020300 00000113
e4d2ab08  e52aeb88 e41b6c70 00020300 00000113
e52aeb88  e42d2468 e4d2ab08 00020300 00000113
e42d2468  e42394a8 e52aeb88 00010364 00000113
e42394a8  e5098700 e42d2468 00020300 00000113
e5098700  e3bb7e48 e42394a8 00020300 00000113
e3bb7e48  e5832350 e5098700 00020300 00000113
e5832350  e4eadbe8 e3bb7e48 00020300 00000113
e4eadbe8  e4dad2f0 e5832350 00020300 00000113
e4dad2f0  e4fd69a0 e4eadbe8 00020300 00000113
e4fd69a0  e4eee6c0 e4dad2f0 00020300 00000113
e4eee6c0  e3b9b300 e4fd69a0 00020300 00000113
e3b9b300  e542ac10 e4eee6c0 00020300 00000113
e542ac10  e43ef8f0 e3b9b300 00020300 00000113
e43ef8f0  e42fffd0 e542ac10 00010364 00000113
e42fffd0  e1211d40 e43ef8f0 00020300 00000113
e1211d40  e436c350 e42fffd0 00020300 00000113
e436c350  e4e08378 e1211d40 00020300 00000113
e4e08378  e15b1730 e436c350 00020300 00000113
e15b1730  e1485ba0 e4e08378 00020300 00000113
e1485ba0  e583c650 e15b1730 00020300 00000113
e583c650  e41ac758 e1485ba0 00020300 00000113
e41ac758  e42417d8 e583c650 00020300 00000113
e42417d8  e12b9410 e41ac758 00020300 00000113
e12b9410  e5859630 e42417d8 00020300 00000113
e5859630  e1249278 e12b9410 00010364 00000113
e1249278  e513e3e8 e5859630 00020300 00000113
e513e3e8  e53081f0 e1249278 00020300 00000113
e53081f0  e3bfe7a0 e513e3e8 00020300 00000113
...

So we can see that, statistically speaking, the vast majority of them are messages 0x113 and a quick lookup in the WinUser.h header file (it ships with Visual Studio and with the Platform SDK) shows that this is the WM_TIMER message:

#define WM_TIMER                0x0113

Afterthought

An afterthought that occurred to me is that there is an easier way to inspect the message queue: the STA can call PeekMessage() to retrieve messages from the queue and analyze them.

This would not be easy to do in this particular case because this was customer's code but in other circumstances this would be a much easier solution than taking a kernel dump and analyzing it. Another complication in this case is that, since the STA is making long-running calls to the server, it does not have the possibility to call PeekMessage() at this time. This is not completely true, because by implementing a custom message filter (more details in the next section) the client would still have a chance to inspect messages reaching the message queue. Moreover, it would still be possible for the client to wait for the long-running COM calls to complete before calling PeekMessage().

The client could use this code to inspect which messages are in their message queue:

while (PeekMessage(&msg, NULL, 0, 0, PM_REMOVE))
{
     // log the message somewhere
}

Note the PM_REMOVE flag: this is needed because, if we do not take the message out of the queue, we'll always be retrieving the first message at every iteration of the loop.

Back to our issue now: we still need to understand why those messages are not taken out of the message queue, don't we? So let's take our next step.

The COM Modal Loop

Let's go back to the user-mode dump and look at the call stack of the STA (Thread 0):

0:000> kb200
ChildEBP RetAddr  Args to Child              
00cdec60 7c94e9ab 7c8094e2 00000002 00cdec8c ntdll!KiFastSystemCallRet
00cdec64 7c8094e2 00000002 00cdec8c 00000001 ntdll!ZwWaitForMultipleObjects+0xc
00cded00 77cf95f9 00000002 00cded28 00000000 kernel32!WaitForMultipleObjectsEx+0x12c
00cded5c 769a2235 00000001 00cdeda4 000003e8 user32!RealMsgWaitForMultipleObjectsEx+0x13e
00cded84 769a235c 00cdeda4 000003e8 00cdedb4 ole32!CCliModalLoop::BlockFn+0x80
00cdedac 76a917a2 ffffffff 17162900 00cdeeac ole32!ModalLoop+0x5b
00cdedc8 76a911b6 00000000 00000000 00000000 ole32!ThreadSendReceive+0xa0
00cdede4 76a9109a 00cdeeac 17162900 00cdef08 ole32!CRpcChannelBuffer::SwitchAptAndDispatchCall+0x13d
00cdeec4 769a2409 17162900 00cdefd8 00cdefc8 ole32!CRpcChannelBuffer::SendReceive2+0xb9
00cdeee0 769a23b2 00cdefd8 00cdefc8 17162900 ole32!CCliModalLoop::SendReceive+0x1e
00cdef4c 769b65b5 17162900 00cdefd8 00cdefc8 ole32!CAptRpcChnl::SendReceive+0x6f
00cdefa0 77eb4db5 00000001 00cdefd8 00cdefc8 ole32!CCtxComChnl::SendReceive+0x91
00cdefbc 77eb4ead 1281fc7c 00cdf004 0600016e rpcrt4!NdrProxySendReceive+0x43
00cdf398 77eb4e42 1b363c00 1b363334 00cdf3d0 rpcrt4!NdrClientCall2+0x1fa
00cdf3b8 77e4a453 0000000c 00000019 00cdf49c rpcrt4!ObjectStublessClient+0x8b
00cdf3c8 79f82703 1281fc7c 00cdf3f0 7a132f50 rpcrt4!ObjectStubless+0xf
00cdf49c 0e90b88a 000f52a8 00cdf4f0 e0588eb1 mscorwks!CLRToCOMWorker+0x196
00cdf4d8 15dbedc0 101d0efc 101e2450 00000000 CLRStub[StubLinkStub]@e90b88a
... customer code here

The ole32!CCliModalLoop::BlockFn() function on the call stack means that we are in an STA and we are making an outgoing COM method call. This is expected, since we know by now that the client process is making long-running calls to the server: here we are just in the middle of one of them.

During outgoing method calls from an STA, COM cannot block: it has to enter a message loop which checks and possibly dispatches messages, similarly to what the usual message loop does in a user-interface thread. This is necessary, for example, when the called COM method makes a callback to the caller (this is, by the way, what happens when our server, upon receiving the IStream object as an argument to a method call, calls methods on it): the callback to the STA is also delivered via a windows message so there had better be someone retrieving messages from the message queue, otherwise a deadlock would occur.

On the other hand, the modal loop (as we call the message loop used by COM during outgoing method calls) cannot dispatch all the messsages, because this would cause the user interface for the calling thread to be fully responsive. In particular, this would allow a user to repeatedly click on a button, thus starting an operation multiple times before the first one completes, and other such weird and counter-intuitive things that need to be avoided if we want to maintain consistency and predictability of the application.

So what does exactly the COM modal loop do? Without going into much detail, the loop checks whether the message should be handled by first calling PeekMessage() with the PM_NOREMOVE flag; if it is determined that the message should be processed, COM calls PeekMessage() with the PM_REMOVE flag to actually take the message out of the queue and dispatch it for processing. Otherwise, the message stays in the message queue and it will be processed sometime later when, upon returning from the COM method call, the standard message loop will be used to retrieve and dispatch messages. Since context information may be needed in order to do proper message dipatching when the outgoing is in progress, the COM modal loop allows for a customization point, in the form of a COM object implementing the standard IMessageFilter interface: when a message is in the queue during an outgoing method call, the COM modal loop calls IMessageFilter::MessagePending() and allows the customization point to kick in.

Note: as an interesting aside, some libraries that deal with the user interface often provide their own implementation of IMessageFilter. For example, MFC provides its own which displays the famous "Server Busy" message box. See COleMessageFilter for more details. So, before deciding to implement your own message filter, you should also consider whether another message filter is already installed and evaluate pros and cons of replacing it.

So how does the COM modal loop deal with WM_TIMER messages? WM_TIMER messages are not extracted and dispatched by the COM modal loop, so they stay in the queue and this explains why we found so many there.

Next question, I bet, is: "Why does this happen here and not in other cases?".

Well, by analyzing the rate at which the WM_TIMER messages are sent (time member of the public MSG structure) to the queue, it can be seen that messages reach the queue every 20 milliseconds. A check in the operating system code showed that these messages are part of the IME architecture. This, by the way, also explains why, everything else being equal, this problem was experienced only on some machines: they were the machines with Japanese user interface, where IME is used.

Working to a solution

Let's recap what we have learnt so far: the client makes long-running DCOM method calls to a server from a user-interface thread, WM_TIMER messages posted to the message queue by the IME accumulate in the queue during this time and eventually fill up the thread's message queue, since the COM modal loop does not process them. When we reach this situation, method calls to the STA (either callbacks from the client or RCW cleanup from another apartment in the same process) cannot be dispatched to the STA and the caller receives the RPC_E_SYS_CALL_FAILED failure HRESULT. So the problem is caused by a combination of several factors:

A. the client apartment being an STA
B. the client making long-running calls to the server, keeping the UI non-responsive in the meantime
C. the usage of windows messages as a communication mechanism for COM STAs
D. the fact that the message queue for a thread is limited to 10000 messages
E. IME using frequent WM_TIMER messages

Let's assess which options we have.

A. is unavoidable, since this thread needs to handle the user-interface of the application. Fancy solutions involving the use of COM in the client only in the MTA should not even be considered here, since they would require a massive redesign (and testing), with no guarantee that we won't end up in a similar issue anyway (after all, we may need a communication mechanism between the COM object and the UI thread anyway).

C. cannot be changed, since it is part of the COM infrastructure.

D. can be addressed by increasing the size of message queues from the default 10000 to a higher value by means of the registry value USERPostMessageLimit, as reported in the PostThreadMessage() documentation. Though alluring at first sight, this change is not advised: it is a systemwide change, for starting, so it affects the machine as a whole. Also, which value should be used? If the messages are not processed, even a higher value may not be enough, since it all depends on how long the outgoing COM call will take to complete. Last: its change requires writing rights to the registry key, which means undue requirements for the setup of our customer's application.

E. could be addressed by gaining a better understanding of when and how the IME sends those WM_TIMER messages and to design the user interface for that thread in a way that those messages never reach the message queue for that thread. Remember: WM_TIMER messages reach the message queue of this particular thread because there is at least one window owned by this thread which is involved in IME, so chances are that we can either replace this window with another (another window class), or configure its properties and styles so that it will no longer receive the WM_TIMER messages. This solution is also unadvisable because not very robust: which messages reach the message queue is something outside the control of the thread itself, other components may send this or other types of messages; or, an implementation detail change in the future may change the number and type of messages which reach our message queue.

B. can be addressed in two ways:

B.1: register a customized IMessageFilter implementation so as to dispatch the WM_TIMER message, or to "eat"  it (throw it away). This approach has disadvantages as well: in our case, the client application was written with MFC which provides its own message filter implementation. But, in general, this is the most viable and flexible option.

B.2 make the outgoing COM method calls shorter, so as to prevent the message queue to fill up in this period of time. In the case at hand, it turned out that the client was actually not making one single method call to the server, but instead malking a series of calls in a tight loop, with this pseudo-code:

while(condition)
{
      hr = p->Call();
}

So the change was relatively easy: insert a message loop in-between outgoing calls, so as to empty the queue, along these lines:

while(condition)
{
      hr = p->Call();
      while (PeekMessage(&msg, NULL, 0, 0, PM_REMOVE))
      {
           TranslateMessage(&msg);
           DispatchMessage(&msg);
      }
}

 

Note that, since we should not block the thread if no message is available, the message loop uses PeekMessage(..., PM_REMOVE) instead of GetMessage().

This change alone effectively resolved the problem.

Note: how to implement the message loop above really depends on what you want to achieve. PeekMessage(&msg, NULL, 0, 0, PM_REMOVE) retrieves all the messages, which means the user interface for the program will be fully responsive "every now and then", that is, between method calls in the loop above. Another option is to retrieve and dispatch WM_TIMER messages only, with a call like this: PeekMessage(&msg, NULL, WM_TIMER, WM_TIMER, PM_REMOVE). The downside here is that the processing of WM_TIMER messages may generate yet other messages which, with the call above, would not be taken from the queue, so the end result may be an even quicker fill-up of the message queue. Or, we may determine that we can safely get rid of the WM_TIMER messages sent by IME during the outgoing method call, in which case we may implement a message loop where PeekMessage(&msg, NULL, WM_TIMER, WM_TIMER, PM_REMOVE) is not followed by a call to DispatchMessage() - this effectively "throws away" the WM_TIMER messages. If this seems too extreme (hey, what happens to other useful WM_TIMER messages?), we may further refine the call to PeekMessage() so as to filter with the handle of the window which is the destination of the IME WM_TIMER messages: PeekMessage(&msg, hWnd, WM_TIMER, WM_TIMER, PM_REMOVE). Also note that those same considerations apply if a customized message filter is implemented (approach B.1 above).

 

Last but not least, I would like to mention that I worked together with my colleague Andrea to the resolution of this case.

Posted by Carlo Colombo | 1 Comments
Filed under: , ,
 
Page view tracker