We're now on Twitter @ http://twitter.com/ntdebugging. If you want to chime in with feedback or suggestions for future articles feel free to drop us a "tweet". I thought this was would give us a better venue to interact with you guys.
Cheers,
Ron Stock
My name is Ryan Mangipano (ryanman) and I am a Sr. Support Escalation Engineer at Microsoft. Today’s blog will consist of a complete walkthrough of my recent analysis of a stop 0x50 along with the steps that led me to identify that this crash was caused by pool corruption. In this particular case, I found the answer before completing the actual debug.
To begin the analysis, I entered the !analyze –v command into the debugger and examined the output relating to the bugcheck. This is always my first step since it usually provides an overview of what type of problem occurred. In the output listed below, you can see that the system has bugchecked due to an attempt by kernel mode code to access the invalid address 0xffffff50.
PAGE_FAULT_IN_NONPAGED_AREA (50) Invalid system memory was referenced. This cannot be protected by try-except, it must be protected by a Probe. Typically the address is just plain bad or it is pointing at freed memory. Arguments: Arg1: ffffff50, memory referenced. Arg2: 00000000, value 0 = read operation, 1 = write operation. Arg3: 80846dd1, If non-zero, the instruction address which referenced the bad memory address. Arg4: 00000000, (reserved)
PAGE_FAULT_IN_NONPAGED_AREA (50)
Invalid system memory was referenced. This cannot be protected by try-except,
it must be protected by a Probe. Typically the address is just plain bad or it
is pointing at freed memory.
Arguments:
Arg1: ffffff50, memory referenced.
Arg2: 00000000, value 0 = read operation, 1 = write operation.
Arg3: 80846dd1, If non-zero, the instruction address which referenced the bad memory
address.
Arg4: 00000000, (reserved)
Before proceeding with our debugging, I would like to point out that invalid addresses such as the one listed above are often obtained when code subtracts values from a null pointer. To illustrate that this particular address 0xffffff50 could be obtained in such a manner, I have provided output from the .formats debugger command below. Notice that the hex vaule 0xffffff50 can be used to represent negative 176 as a decimal. It is a relatively common situation to arrive at a negative value by subtracting values from a null pointer (which is zero). Since addresses are actually unsigned, this is really just an invalid positive address.
0: kd> .formats ffffff50 Evaluate expression: Hex: ffffff50 Decimal: -176
0: kd> .formats ffffff50
Evaluate expression:
Hex: ffffff50
Decimal: -176
Let’s proceed with our investigation. The output of !analyze –v also provided us with the following:
TRAP_FRAME: f4009940 -- (.trap 0xfffffffff4009940)
This provided a trap frame address in the form of a DML (Debugger Markup Language) link that can be clicked on to quickly enter the command into the debugger.
Passing .trap the address of a trap frame will show the register state when the trap was generated. This will also set the context to the state when the trap was generated. This information was saved into the trap frame by the trap handler that executed after the trap occurred.
The trap frame address listed above could have instead been obtained by dumping out the stack using the command kv . !analyze –v is very good at displaying the correct trapframe, however I feel that it is a good practice to dump out the stack to verify that the trap frame you are entering is in fact the one that you are interested in. You can see in the output below that there are two trapframes listed. I am interested in the topmost one that matches the trap that led us to the bugcheck. I can see that analyze –v has listed the trapframe that we are interested in, so I entered it into the debugger.
0: kd> kv ChildEBP RetAddr Args to Child f40098d8 808692ab 00000050 ffffff50 00000000 nt!KeBugCheckEx+0x1b (FPO: [5,0,0]) f4009928 80836c2a 00000000 ffffff50 00000000 nt!MmAccessFault+0x813 (FPO: [4,10,4]) f4009928 80846dd1 00000000 ffffff50 00000000 nt!KiTrap0E+0xdc (FPO: [0,0] TrapFrame @ f4009940) f40099b0 809426b2 a9131e08 00000012 861ce318 nt!ObReferenceObjectSafe+0x3 (FPO: [0,0,0]) f40099c8 808544d9 861ce318 0007d960 00000000 nt!PsGetNextProcess+0x6c (FPO: [1,1,0]) f4009a58 8094292f 01210048 0007d960 00000000 nt!ExpGetProcessInformation+0x36d (FPO: [SEH]) f4009d4c 80833bef 00000005 01210048 0007d960 nt!NtQuerySystemInformation+0x11e0 (FPO: [SEH]) f4009d4c 7c8285ec 00000005 01210048 0007d960 nt!KiFastCallEntry+0xfc (FPO: [0,0] TrapFrame @ f4009d64) 0: kd> .trap 0xfffffffff4009940
0: kd> kv
ChildEBP RetAddr Args to Child
f40098d8 808692ab 00000050 ffffff50 00000000 nt!KeBugCheckEx+0x1b (FPO: [5,0,0])
f4009928 80836c2a 00000000 ffffff50 00000000 nt!MmAccessFault+0x813 (FPO: [4,10,4])
f4009928 80846dd1 00000000 ffffff50 00000000 nt!KiTrap0E+0xdc (FPO: [0,0] TrapFrame @ f4009940)
f40099b0 809426b2 a9131e08 00000012 861ce318 nt!ObReferenceObjectSafe+0x3 (FPO: [0,0,0])
f40099c8 808544d9 861ce318 0007d960 00000000 nt!PsGetNextProcess+0x6c (FPO: [1,1,0])
f4009a58 8094292f 01210048 0007d960 00000000 nt!ExpGetProcessInformation+0x36d (FPO: [SEH])
f4009d4c 80833bef 00000005 01210048 0007d960 nt!NtQuerySystemInformation+0x11e0 (FPO: [SEH])
f4009d4c 7c8285ec 00000005 01210048 0007d960 nt!KiFastCallEntry+0xfc (FPO: [0,0] TrapFrame @ f4009d64)
0: kd> .trap 0xfffffffff4009940
The output from the .trap command will be similar to the ouput from the r command. The r command will dump out the registers and the current instruction that caused the error. Listed below are the output of the two commands:
0: kd> .trap 0xfffffffff4009940 ErrCode = 00000000 eax=861ce300 ebx=808b5be8 ecx=ffffff68 edx=ffffff50 esi=8747c8d0 edi=00000000 eip=80846dd1 esp=f40099b4 ebp=f40099c8 iopl=0 nv up ei pl nz ac pe cy cs=0008 ss=0010 ds=0023 es=0023 fs=0030 gs=0000 efl=00010217 nt!ObReferenceObjectSafe+0x3: 80846dd1 8b0a mov ecx,dword ptr [edx] ds:0023:ffffff50=???????? 0: kd> r Last set context: eax=861ce300 ebx=808b5be8 ecx=ffffff68 edx=ffffff50 esi=8747c8d0 edi=00000000 eip=80846dd1 esp=f40099b4 ebp=f40099c8 iopl=0 nv up ei pl nz ac pe cy cs=0008 ss=0010 ds=0023 es=0023 fs=0030 gs=0000 efl=00010217 nt!ObReferenceObjectSafe+0x3: 80846dd1 8b0a mov ecx,dword ptr [edx] ds:0023:ffffff50=???????? ß This is the current instruction. We trapped on this instruction trying to dereference edx which contained ffffff50.
ErrCode = 00000000
eax=861ce300 ebx=808b5be8 ecx=ffffff68 edx=ffffff50 esi=8747c8d0 edi=00000000
eip=80846dd1 esp=f40099b4 ebp=f40099c8 iopl=0 nv up ei pl nz ac pe cy
cs=0008 ss=0010 ds=0023 es=0023 fs=0030 gs=0000 efl=00010217
nt!ObReferenceObjectSafe+0x3:
80846dd1 8b0a mov ecx,dword ptr [edx] ds:0023:ffffff50=????????
0: kd> r
Last set context:
80846dd1 8b0a mov ecx,dword ptr [edx] ds:0023:ffffff50=???????? ß This is the current instruction. We trapped on this instruction trying to dereference edx which contained ffffff50.
Next, we shall dump out the stack and try to get a feel for what happened based on the function names listed in the stack. Remember that the functions at the bottom of the stack called the functions above them. Here are the function names listed in the top four stack frames along with my annotation. In this case I read the stack starting at the bottom and working my way up instead of the traditional way of starting at the bad value and backtracing to the problem.
0: kd> kcL4
· nt!ObReferenceObjectSafe ß Finally, this function encountered the bad value and crashed. To observe this for yourself, please examine the output of the 'r' command listed above. You will notice that the current instruction at the time of the trap (the address in EIP) was unassembled. Above this instruction the function name and offset are listed as nt!ObReferenceObjectSafe+0x3. This means that the instruction we were executing when the trap occurred was located at offset 0x3 from the location that the symbol ObReferenceObjectSafe references.
· nt!PsGetNextProcess ß Based on the function name here, it appears that we were trying to traverse the linked list of active processes on the system.
· nt!ExpGetProcessInformation ß NtQuerySystemInformation() called this function. Based on the function name GetProcessInformation and the call listed above, this was obviously an attempt to get information about the processes running on the system
· nt!NtQuerySystemInformation ß We were trying to query system information
Normally, my next step would typically be to unassemble the current function with uf @$ip , locate the current instruction in the output by pasting the address in eip into the find diaglog box, and proceed to backtrace the source of the bad value until I had located the source. However, I have a bad habit of poking around for fun by dumping out various data while I am debugging. The function names on the stack seemed to indicate that we were working with the list of processes. The address we died on suggested that we may have encountered a null pointer. I couldn't resist pausing a moment to dump out the list of processes on the system to see if we had encountered a null pointer while traversing the list of processes on the system. After all, I've reviewed dumps before that were that simple. Even if I found a null pointer in the list, I would still have to perform the actual debug and walk the assembly to prove it and identify how we crashed. However, I felt it would be more exciting to work in the reverse of my normal methods. Besides, it only takes a moment once you know what you are doing.
Let me pause a moment to provide a little background information here which will later prove relevant in understanding my method of thinking and comprehending what went wrong. In the nt!_EPROCESS object there is a _LIST_ENTRY structure located at offset 0x98 called ActiveProcessLinks. This forms a linked list of all the active NT!_EPROCESS structures on the system. Let us first examine the _ LIST_ENTRY structure by reviewing the output below:
1: kd> dt nt!_LIST_ENTRY
+0x000 Flink : Ptr32 _LIST_ENTRY
+0x004 Blink : Ptr32 _LIST_ENTRY
As you can see above, a _LIST_ENTRY contains two pointers that create what is called a linked list. Blink represents a backwards link and Flink represents a forward link. A double linked list may be traversed in either direction by following the pointer to the next or previous entry in the list. Linked lists are used heavily in operating system programming. This particular linked list contains a pointer to the previous _EPROCESS object’s ActiveProcessLinks field (blink) and the next _EPROCESS object’s ActiveProcessLinks field (flink).. The following command will display the activeprocesslinks field of the nt!_EPROCESS structure.
0: kd> dt nt!_EPROCESS activeprocesslinks
+0x098 ActiveProcessLinks : _LIST_ENTRY ß the +0x098 lets us know that this field is located 0x98 from the start of each nt!_EPROCESS structure. All of these EPROCESS structures on the system are linked together using the flink/blink method. This allows us to enumerate all of the processes on the system by following the links in either direction.
The first entry of a linked list is commonly referred to as the list head. The list head is a _LIST_ENTRY structure that points to the first and last entry of the link list. Global variables are often used to store the location of the list head. Once you have the location of the head, you can then use it's flink/blink values to traverse the list in either direction. Nt!PsActiveProcessHead is a global variable that points to the beginning of the list of Active Processes that we discussed in the previous paragraph. If you traverse this list starting at the head, you will end up in the activeprocesslinks field of each process on the system. This will allow code to find each _EPROCESS structure on the system by simply traversing this list. If you are an true geek like me and are learning this for the first time, you are probably going to fire up your debugger and dump out the list of active processes on your own system before you reach the end of this article.
Please note that by following the flink/blink you will arrive at offset 0x098 from this structure, not the start of the structure. So the complete list consists of a _LIST_ENTRY field present in each EPROCESS structure + the _LIST_ENTRY located at the memory referenced by the global variable nt!PsActiveProcessHead. You can use the 'x' examine symbols command to display this variable. I included the wildcards below only to demonstrate the ability to use wildcards with these commands.
0: kd> x nt!*PsActiveProcessHead* 808b5be8 nt!PsActiveProcessHead = <no type information> 0: kd> dt nt!*PsActiveProcessHead* 808b5be8 ntkrnlmp!PsActiveProcessHead
0: kd> x nt!*PsActiveProcessHead*
808b5be8 nt!PsActiveProcessHead = <no type information>
0: kd> dt nt!*PsActiveProcessHead*
808b5be8 ntkrnlmp!PsActiveProcessHead
Back to our investigation. At the point where we left off, we had not yet dug into the code to identify exactly what happened. We instead took an educated guess and decided to dump out the list of processes on the system. If it is found that we have a null value, this may very well allow us to work in reverse of our normal debugging. Armed with the above background information, we shall now dump out the list. If we get lucky, this will expedite the review of this memory dump by telling us what may have gone wrong before we even start digging into the code.
The beginning of the list sounds like a good place to start traversing the entries. The value highlighted in the global variable above marks the address where the linked list starts. The following command will dump out this location and provide you with the flink and blink. The switch /c1 was used in order to limit the output to 1 column for easy blog annotation. For more information on the /c1 switch, type .hh dd into the debugger command interpreter and then press enter once the help file window appears.
0: kd> dd /c1 nt!PsActiveProcessHead L2 808b5be8 8a78c800 ß this is the forward link (flink) 808b5bec 85ddd510 ß this is the backwards link (blink)
0: kd> dd /c1 nt!PsActiveProcessHead L2
808b5be8 8a78c800 ß this is the forward link (flink)
808b5bec 85ddd510 ß this is the backwards link (blink)
Now that we have the address that the list starts at, we need a way to dump it out. Windbg provides many ways to automatically walk a linked list. You can use dt, !list, or the dl commands . In this case, let's use the command 'dl' due to its simplicity . More information about this command may be obtained from the help file that is included with windbg.
0: kd> dl 0x8a78c800 ff 2 ß Here is an example of how to use this command to traverse the linked list starting at address 0x8a78c800. The value ff represents the maximum number of links to display. The value 0x8a78c800 was obtained from the flink field _LIST_ENTRY present at the global variable listed above. The value 2 represents the number of values to dump from each address. Since this linked list contains a forward link (flink) and backwards link (blink), we should pass in a value of 2 to dump both addresses.
As displayed below, I decided to use the global nt!PsActiveProcessHead instead of specifying an address to the command dl. Here is the beginning of the output as displayed by the debugger.
0: kd> dl nt!PsActiveProcessHead ff 2 808b5be8 8a78c800 85ddd510 8a78c800 89af0370 808b5be8 888745a0 888492e0 89af0370 888492e0 89b17708 888745a0 89b17708 89b4e3d0 888492e0 ... … Ommitting the rest of the output
0: kd> dl nt!PsActiveProcessHead ff 2
808b5be8 8a78c800 85ddd510
8a78c800 89af0370 808b5be8
888745a0 888492e0 89af0370
888492e0 89b17708 888745a0
89b17708 89b4e3d0 888492e0
... … Ommitting the rest of the output
The plain output displayed above represents the type of output that you will see when entering the dl command into the debugger. In order to better illustrate the patterns and show the relationship between the flink and blink values, I have bolded, underlined, highlighted, and commented various portions of the output and listed it below. Note the annotations that demonstrate how each member of the linked list points to the next and previous item. You can follow the list all the way down until you see we do in fact end up at a zeroed out address.
808b5be8 8a78c800 85ddd510 <-- Notice that the dl command first displays address 808b5be8 which is the address referenced by the symbol information for the global variable nt!PsActiveProcessHead that we provided. The next two values are the flink and the blink. I have highlighted the flink above. Notice how this flink is pointing to the next memory location listed below.
8a78c800 89af0370 808b5be8 <-- In this line of output, notice that the dl command has followed the flink to address 8a78c800 (which I have again highlighted) and is dumping out the new flink and blink pair. I have underlined the blink. Notice that the blink points back to the address of the previous flink/blink pair.
89af0370 888745a0 8a78c800 Notice the patterns displayed, follow the output all the way to the bottom, and see if you notice anything that may be a problem.
888745a0 888492e0 89af0370 888492e0 89b17708 888745a0 89b17708 89b4e3d0 888492e0 89b4e3d0 8883fe20 89b17708 8883fe20 89ab73e8 89b4e3d0 89ab73e8 887d7650 8883fe20 887d7650 888906a8 89ab73e8 888906a8 89483408 887d7650 89483408 88839d40 888906a8
89b4e3d0 8883fe20 89b17708
8883fe20 89ab73e8 89b4e3d0
89ab73e8 887d7650 8883fe20
887d7650 888906a8 89ab73e8
888906a8 89483408 887d7650
89483408 88839d40 888906a8
… Ommitting output to reduce length. The links were all valid…
85ff6e20 8708be20 8663a278 8708be20 87176940 85ff6e20 87176940 861ce3b0 8708be20 861ce3b0 860010b8 87176940 860010b8 00000000 00000000 <-- Wait...This doesn't look correct. The blink should be 861ce3b0, not zero. The previous list entry's flink took us here to address 860010b8. It appears that our theory might just be correct. Observing this, I suspected that memory location 860010b8 has been overwritten with zeros. It is also possible, however, that the value in the flink above was incorrect causing us to end up at some random address that contained zeros. To know for sure I dumped the list backwards to see if the next flink/blink pair had a blink pointing to 860010b8.
85ff6e20 8708be20 8663a278
8708be20 87176940 85ff6e20
87176940 861ce3b0 8708be20
861ce3b0 860010b8 87176940
860010b8 00000000 00000000 <-- Wait...This doesn't look correct. The blink should be 861ce3b0, not zero. The previous list entry's flink took us here to address 860010b8. It appears that our theory might just be correct. Observing this, I suspected that memory location 860010b8 has been overwritten with zeros. It is also possible, however, that the value in the flink above was incorrect causing us to end up at some random address that contained zeros. To know for sure I dumped the list backwards to see if the next flink/blink pair had a blink pointing to 860010b8.
0: kd> dlb nt!PsActiveProcessHead ff 2ß This command dumps the list backwards. Below you can see the dlb command walking the linked list by starting at the list head and moving backwards (the list should make a circle). Don't be confused by the different patterns below. This is simply due to the fact that the list is being walked backward. The output displays the three value in the same order address flink blink.
0: kd> dlb nt!PsActiveProcessHead ff 2 808b5be8 8a78c800 85ddd510 85ddd510 808b5be8 85e58e20 85e58e20 85ddd510 8646ce20 8646ce20 85e58e20 85d92c08 85d92c08 8646ce20 8616ce20
0: kd> dlb nt!PsActiveProcessHead ff 2
85ddd510 808b5be8 85e58e20
85e58e20 85ddd510 8646ce20
8646ce20 85e58e20 85d92c08
85d92c08 8646ce20 8616ce20
872ba380 86e456a0 86c8da10 86c8da10 872ba380 8692f708 8692f708 86c8da10 86d630b8 86d630b8 8692f708 8673eae0 8673eae0 86d630b8 87223a98 87223a98 8673eae0 860010b8 860010b8 00000000 00000000 ß We arrived at the same address of 860010b8. So when walking the list both forward and backward we encounter pointers that take us to this address. This means that there should have been a valid flink/blink pair (which would be a valid nt!_EPROCESS activeprocesslinks _LIST_ENTRY ) at this address. This memory appears to have instead been zeroed out. It’s flink pointer should be 87223a98
872ba380 86e456a0 86c8da10
86c8da10 872ba380 8692f708
8692f708 86c8da10 86d630b8
86d630b8 8692f708 8673eae0
8673eae0 86d630b8 87223a98
87223a98 8673eae0 860010b8
860010b8 00000000 00000000 ß We arrived at the same address of 860010b8. So when walking the list both forward and backward we encounter pointers that take us to this address. This means that there should have been a valid flink/blink pair (which would be a valid nt!_EPROCESS activeprocesslinks _LIST_ENTRY ) at this address. This memory appears to have instead been zeroed out. It’s flink pointer should be 87223a98
When dumping the list out forward using dl, the line of output representing 860010b8 should have contained these values below, however it did not.
860010b8 87223a98 861ce3b0
So what happed? In dumping out the area around the address, you can see it has been zeroed
0: kd> dd 860010b8-200 860010b8+200 86000eb8 00000000 00000000 00000000 00000000 86000ec8 00000000 00000000 00000000 00000000 86000ed8 00000000 00000000 00000000 00000000 86000ee8 00000000 00000000 00000000 00000000 … Ommitting output to reduce length. Entire range contained all zeros… 86001088 00000000 00000000 00000000 00000000 86001098 00000000 00000000 00000000 00000000 860010a8 00000000 00000000 00000000 00000000 860010b8 00000000 00000000 00000000 00000000 860010c8 00000000 00000000 00000000 00000000 860010d8 00000000 00000000 00000000 00000000 860010e8 00000000 00000000 00000000 00000000 … Ommitting output to reduce length. Entire range contained all zeros… 86001288 00000000 00000000 00000000 00000000 86001298 00000000 00000000 00000000 00000000 860012a8 00000000 00000000 00000000 00000000 860012b8 00000000
0: kd> dd 860010b8-200 860010b8+200
86000eb8 00000000 00000000 00000000 00000000
86000ec8 00000000 00000000 00000000 00000000
86000ed8 00000000 00000000 00000000 00000000
86000ee8 00000000 00000000 00000000 00000000
… Ommitting output to reduce length. Entire range contained all zeros…
86001088 00000000 00000000 00000000 00000000
86001098 00000000 00000000 00000000 00000000
860010a8 00000000 00000000 00000000 00000000
860010b8 00000000 00000000 00000000 00000000
860010c8 00000000 00000000 00000000 00000000
860010d8 00000000 00000000 00000000 00000000
860010e8 00000000 00000000 00000000 00000000
86001288 00000000 00000000 00000000 00000000
86001298 00000000 00000000 00000000 00000000
860012a8 00000000 00000000 00000000 00000000
860012b8 00000000
Out of curiosity, I ran the !address command to see what range this address falls into.
0: kd> !address 860010b8
82b7e000 - 07c82000
Usage KernelSpaceUsageNonPagedPool
Next, we dump the pool information out. Note the corruption indication listed below.
0: kd> !pool 860010b8 Pool page 860010b8 region is Nonpaged pool 86001000 is not a valid large pool allocation, checking large session pool... 86001000 is freed (or corrupt) pool Bad allocation size @86001000, zero is invalid *** *** An error (or corruption) in the pool was detected; *** Attempting to diagnose the problem. *** *** Use !poolval 86001000 for more details. *** 0: kd> !poolval 86001000 Pool page 86001000 region is Nonpaged pool Validating Pool headers for pool page: 86001000 Pool page [ 86001000 ] is __inVALID. Analyzing linked list... [ 86001000 --> 86001310 (size = 0x310 bytes)]: Corrupt region Scanning for single bit errors... None found
0: kd> !pool 860010b8
Pool page 860010b8 region is Nonpaged pool
86001000 is not a valid large pool allocation, checking large session pool...
86001000 is freed (or corrupt) pool
Bad allocation size @86001000, zero is invalid
***
*** An error (or corruption) in the pool was detected;
*** Attempting to diagnose the problem.
*** Use !poolval 86001000 for more details.
0: kd> !poolval 86001000
Pool page 86001000 region is Nonpaged pool
Validating Pool headers for pool page: 86001000
Pool page [ 86001000 ] is __inVALID.
Analyzing linked list...
[ 86001000 --> 86001310 (size = 0x310 bytes)]: Corrupt region
Scanning for single bit errors...
None found
So now that we have opened the dump, set the trap frame, dumped the stack, then found the answer. Now, we need to do the actual debug of the assembly code to both prove our theory and determine how this caused us to die. Let's revisit the activeprocesslinks field that we discussed before.
0: kd> dt nt!_EPROCESS activeprocesslinks +0x098 ActiveProcessLinks : _LIST_ENTRY ß the linked list we were following is located 0x98 from the start of the nt!_EPROCESS. The previous _LIST_ENTRY provided us with the address of this field by in the next process object. To get the start of the actual process object, we need to subtract 0x98 from that value. For example, if you were passed in a null address instead of a valid pointer and subtracted 0x98 you would get:
+0x098 ActiveProcessLinks : _LIST_ENTRY ß the linked list we were following is located 0x98 from the start of the nt!_EPROCESS. The previous _LIST_ENTRY provided us with the address of this field by in the next process object. To get the start of the actual process object, we need to subtract 0x98 from that value. For example, if you were passed in a null address instead of a valid pointer and subtracted 0x98 you would get:
0: kd> ? 0x0-0x098 Evaluate expression: -152 = ffffff68 ß If we start with a null value and backup 0x98 we get ffffff68
0: kd> ? 0x0-0x098
Evaluate expression: -152 = ffffff68 ß If we start with a null value and backup 0x98 we get ffffff68
As you can see below, that very value happens to be present in ECX. ECX should instead be the address of our process object. However,since our liked list pointer was null, we instead were following an invalid address. It’s all downhill from here:
0: kd> recx ß What’s in ECX? Last set context: ecx=ffffff68 ß it’s the offset from the null pointer
0: kd> recx ß What’s in ECX?
ecx=ffffff68 ß it’s the offset from the null pointer
Let’s keep digging to verify this by looking back in the assembly and walking through what happened in the final moments of the crash. To do this, I unassembled backwards using the ub command. Examination of the output of this command (listed below but limited to only one command for blog simplicity) will show that the code took the invalid address ffffff68 that should have been a pointer to the process object and subtracted 0x18. This gave the us the invalid address ffffff50 that we died on. The . represents the current instruction pointer and the L1 tells us to only display 1 unassembled assembly language instruction.
0: kd> ub . L1 nt!ObReferenceObjectSafe 80846dce 8d51e8 lea edx,[ecx-18h] ß we took the process object pointer in ecx and subtracted 0x18 from it and put it into edx to dereference 0: kd> redx edx=ffffff50 0: kd> ? ffffff68-0x18 Evaluate expression: -176 = ffffff50
0: kd> ub . L1
nt!ObReferenceObjectSafe
80846dce 8d51e8 lea edx,[ecx-18h] ß we took the process object pointer in ecx and subtracted 0x18 from it and put it into edx to dereference
0: kd> redx
edx=ffffff50
0: kd> ? ffffff68-0x18
Evaluate expression: -176 = ffffff50
Next, we tried to dereference this invalid pointer. Note that @$ip is another way (just like .)to point to the instruction pointer.
0: kd> u @$ip L1 nt!ObReferenceObjectSafe+0x3: 80846dd1 8b0a mov ecx,dword ptr [edx] ß edx contains the invalid value
0: kd> u @$ip L1
80846dd1 8b0a mov ecx,dword ptr [edx] ß edx contains the invalid value
You can also use the r command to dump out all of the registers and display this failed instruction as shown below.
0: kd> r Last set context: eax=861ce300 ebx=808b5be8 ecx=ffffff68 edx=ffffff50 esi=8747c8d0 edi=00000000 eip=80846dd1 esp=f40099b4 ebp=f40099c8 iopl=0 nv up ei pl nz ac pe cy cs=0008 ss=0010 ds=0023 es=0023 fs=0030 gs=0000 efl=00010217 nt!ObReferenceObjectSafe+0x3: 80846dd1 8b0a mov ecx,dword ptr [edx] ds:0023:ffffff50=???????? ß we trapped trying to dereference the bad address that edx was directing us to
80846dd1 8b0a mov ecx,dword ptr [edx] ds:0023:ffffff50=???????? ß we trapped trying to dereference the bad address that edx was directing us to
So, why would we be backing up from this address? Each process object has an Object Header that precedes it. The following command ?? uses the current Expression Evaluator (C++ is the default) to run the sizeof() function against the nt!_OBJECT_HEADER symbol information and output the size of the an object header.
0: kd> ?? sizeof (nt!_OBJECT_HEADER) unsigned int 0x20
0: kd> ?? sizeof (nt!_OBJECT_HEADER)
unsigned int 0x20
You can dump out this header to see where the start of the process object is with the following command
0: kd> dt nt!_OBJECT_HEADER +0x000 PointerCount : Int4B ß the code was accessing the offset 0 of the Object Header which is the PointerCount field +0x004 HandleCount : Int4B +0x004 NextToFree : Ptr32 Void +0x008 Type : Ptr32 _OBJECT_TYPE +0x00c NameInfoOffset : UChar +0x00d HandleInfoOffset : UChar +0x00e QuotaInfoOffset : UChar +0x00f Flags : UChar +0x010 ObjectCreateInfo : Ptr32 _OBJECT_CREATE_INFORMATION +0x010 QuotaBlockCharged : Ptr32 Void +0x014 SecurityDescriptor : Ptr32 Void +0x018 Body : _QUAD ß The process object is located here at offset 0x18. This means that backing up 0x18 would put you at offset 0x0 of the object header which is the Pointer Count field. So let’s see if we can locate any stack data to validate this. Dumping the stack again, I observe the address of the process object. 0: kd> kvL2 ChildEBP RetAddr Args to Child f40099b0 809426b2 a9131e08 00000012 861ce318 nt!ObReferenceObjectSafe+0x3 (FPO: [0,0,0]) f40099c8 808544d9 861ce318 0007d960 00000000 nt!PsGetNextProcess+0x6c (FPO: [1,1,0]) Let’s dump out this process field. Wow! Look what we have here. The address of the zeroed list entry from above. 0: kd> dt nt!_EPROCESS 0x861ce318 ActiveProcessLinks +0x098 ActiveProcessLinks : _LIST_ENTRY [ 0x860010b8 - 0x87176940 ] 0: kd> dd 0x860010b8 L2 860010b8 00000000 00000000
0: kd> dt nt!_OBJECT_HEADER
+0x000 PointerCount : Int4B ß the code was accessing the offset 0 of the Object Header which is the PointerCount field
+0x004 HandleCount : Int4B
+0x004 NextToFree : Ptr32 Void
+0x008 Type : Ptr32 _OBJECT_TYPE
+0x00c NameInfoOffset : UChar
+0x00d HandleInfoOffset : UChar
+0x00e QuotaInfoOffset : UChar
+0x00f Flags : UChar
+0x010 ObjectCreateInfo : Ptr32 _OBJECT_CREATE_INFORMATION
+0x010 QuotaBlockCharged : Ptr32 Void
+0x014 SecurityDescriptor : Ptr32 Void
+0x018 Body : _QUAD ß The process object is located here at offset 0x18. This means that backing up 0x18 would put you at offset 0x0 of the object header which is the Pointer Count field.
So let’s see if we can locate any stack data to validate this. Dumping the stack again, I observe the address of the process object.
0: kd> kvL2
Let’s dump out this process field. Wow! Look what we have here. The address of the zeroed list entry from above.
0: kd> dt nt!_EPROCESS 0x861ce318 ActiveProcessLinks
+0x098 ActiveProcessLinks : _LIST_ENTRY [ 0x860010b8 - 0x87176940 ]
0: kd> dd 0x860010b8 L2
860010b8 00000000 00000000
So to summarize the problem, we crashed because a memory address that should have contained valid flink/blink pointers to a process object contained the invalid value 00000000. We expected this value to be pointing to a valid nt!_EPROCESS ActiveProcessLinks field instead of containing a null pointer (zero). After obtaining this null pointer, we then subtracted 0x098 which normally would have brought us to the start of the process object. Instead we ended up with invalid pointer 0xffffff68. Finally, we subtracted another 0x18 which should have brought us to the PointerCount field of the object header. This gave us address 0xffffff50 instead of a valid object header address. We then dereferenced this invalid address which led us to the bugcheck.
This crash was due to pool corruption, as shown in the output of the !pool command above. This blog has been a walkthrough of debugging a memory dump and identifying how pool corruption led up to a crash. Identification of the reason for the corruption is a separate matter. These types of issues are often difficult to troubleshoot due to the fact that any driver in kernel mode could have corrupted this memory region. To further complicate matters, the actual corruption of this memory may have taken place millions of CPU cycles before the time this code stumbled across the corrupted region leading up to the crash that caused this memory dump to be initiated. A memory dump is simply a snapshot of information relating to the state of a system during one particular moment in time.
In this case, in order to identify the source of the pool corruption we need to use Special Pool. Special Pool will use guard pages to catch a buffer overrun or underrun and should provide us with a dump that shows the code that causes the corruption. You can find more information on Special Pool in KB188831. Also, in some situations memory corruption can be caused by a driver overflowing on a DMA transfer causing corruption to physical pages rather than virtual pages as we see in typical pool corruption.
Usually, a debug such as this one would have started with analysis of the function at the top of the stack and worked down the stack reading the assembly to definitively identify the source of the bad pointer. However in this case, we were lucky enough to find the answer right away and then validate it with conventional debugging methods by reversing in reverse.
Hello, my name is Venkatesh Ganga. I’m a Senior Escalation Engineer on the Microsoft Platform Global Escalation Services team. In this blog I would like to talk about an interesting WMI issue I worked a few months ago. The idea behind this blog is to discuss how WMI works under the hood. In this article I show a few call stacks but the focus of the article is really to point out WMI concepts.
The issue was interesting. The customer encountered the error, "Provider cannot perform the requested operation" for all WMI queries to any provider on some of his Windows XP machines.
Figure 1:
Moreover, nothing happened when he clicked on other tabs like Logging, Backup/Restore, etc. Running WMIDIAG showed the message - “24 error(s) 0x80041024 - (WBEM_E_PROVIDER_NOT_CAPABLE) Provider cannot perform the requested operation“.
All the standard WMI troubleshooting was done before I was engaged, like rebuilding the WMI repository, checking DCOM permissions, etc. But nothing easily resolved the issue so I had the customer send a machine image to Microsoft for a local repro of the problem. Initially I didn’t know where to start with the issue because there were multiple symptoms. I decided to trace the execution of the inbox provider CIMWin32.dll by issuing the WMI query “Select * from Win32_Processor”, next find where it fails, and then go from there.
WMI maintains each provider’s physical implementation information such as the CLSID of the provider’s COM object, the hosting model, etc. in an instance of the __Win32Provider system class. Each provider can support one or more types mentioned below. Based on the type(s) the provider supports, WMI will create an instance of each supported type.
· __ClassProviderRegistration
· __EventConsumerProviderRegistration
· __EventProvideRegistration
· __InstanceProviderRegistration
· __MethodProviderRegistration
· __ObjectProviderRegistration
· __PropertyProviderRegisration
For example, looking at the Windows inbox .mof file for the Cimwin32 provider, we can quickly determine that the provider registers for two types - the Instance Provider, and the Method Provider as Figure 2 displays. The inbox mof files are usually stored in c:\Windows\System32\wbem
Figure 2:
In Memory at Runtime
Since our test query “select * from Win32_Processor” retrieves the instances of Win32_Processor class, WMI checks the CIMWin32’s __InstanceProviderRegistration object to see if the SupportsEnumeration property is set. While debugging the customer’s image, I found the runtime object WMI maintains for CIMWin32 provider, set to “False” for the ”SupportsEnumeration” property.
WMI maintains an in-memory structure for each provider and fills that structure out with information from the repository. The CreateInstanceEnumAsync checks the provider’s in-memory structure to determine whether it supports enumeration. From my debugging on the customer’s system, I determined the provider did not support enumeration.
Here's a snippet from my debugging-
0:003> kL ChildEBP RetAddr 0091f560 77e7a1ac wmiprvse!CInterceptor_IWbemSyncProvider::CreateInstanceEnumAsync+0x21 0091f588 77ef421a rpcrt4!Invoke+0x30 0091f998 77ef4bf3 rpcrt4!NdrStubCall2+0x297 0091f9f0 756bd7fe rpcrt4!CStdStubBuffer_Invoke+0xc6 0091fa04 77600c31 fastprox!CBaseStublet::Invoke+0x22 0091fa44 77600bdb ole32!SyncStubInvoke+0x33 0091fa8c 7750f237 ole32!StubInvoke+0xa7 0091fb64 7750f15c ole32!CCtxComChnl::ContextInvoke+0xe3 0091fb80 77600b11 ole32!MTAInvoke+0x1a 0091fbb0 776009bc ole32!AppInvoke+0x9c 0091fc84 77600715 ole32!ComInvokeWithLockAndIPID+0x2e0 0091fcd0 77e79c75 ole32!ThreadInvoke+0x1cd 0091fd04 77e79bda rpcrt4!DispatchToStubInC+0x38 0091fd58 77e79b06 rpcrt4!RPC_INTERFACE::DispatchToStubWorker+0x113 0091fd7c 77e89f9c rpcrt4!RPC_INTERFACE::DispatchToStub+0x84 0091fdbc 77e89fdd rpcrt4!RPC_INTERFACE::DispatchToStubWithObject+0xc0 0091fdfc 77e7be65 rpcrt4!LRPC_SCALL::DealWithRequestMessage+0x2cd 0091fe20 77e76794 rpcrt4!LRPC_ADDRESS::DealWithLRPCRequest+0x16d 0091ff80 00000000 rpcrt4!LRPC_ADDRESS::ReceiveLotsaCalls+0x28f
0:003> kL
ChildEBP RetAddr
0091f560 77e7a1ac wmiprvse!CInterceptor_IWbemSyncProvider::CreateInstanceEnumAsync+0x21
0091f588 77ef421a rpcrt4!Invoke+0x30
0091f998 77ef4bf3 rpcrt4!NdrStubCall2+0x297
0091f9f0 756bd7fe rpcrt4!CStdStubBuffer_Invoke+0xc6
0091fa04 77600c31 fastprox!CBaseStublet::Invoke+0x22
0091fa44 77600bdb ole32!SyncStubInvoke+0x33
0091fa8c 7750f237 ole32!StubInvoke+0xa7
0091fb64 7750f15c ole32!CCtxComChnl::ContextInvoke+0xe3
0091fb80 77600b11 ole32!MTAInvoke+0x1a
0091fbb0 776009bc ole32!AppInvoke+0x9c
0091fc84 77600715 ole32!ComInvokeWithLockAndIPID+0x2e0
0091fcd0 77e79c75 ole32!ThreadInvoke+0x1cd
0091fd04 77e79bda rpcrt4!DispatchToStubInC+0x38
0091fd58 77e79b06 rpcrt4!RPC_INTERFACE::DispatchToStubWorker+0x113
0091fd7c 77e89f9c rpcrt4!RPC_INTERFACE::DispatchToStub+0x84
0091fdbc 77e89fdd rpcrt4!RPC_INTERFACE::DispatchToStubWithObject+0xc0
0091fdfc 77e7be65 rpcrt4!LRPC_SCALL::DealWithRequestMessage+0x2cd
0091fe20 77e76794 rpcrt4!LRPC_ADDRESS::DealWithLRPCRequest+0x16d
0091ff80 00000000 rpcrt4!LRPC_ADDRESS::ReceiveLotsaCalls+0x28f
Repository
In contrast, when we checked the customer’s system, we determined the repository showed the provider’s “SupportEnumeration” property was set to true. Verification steps below:
1. Run wbemtest and connect to root\cimv2
2. Enter the following query-
select * from __InstanceProviderRegistration where provider="__Win32Provider.Name=\"CIMWin32\""
3. After running the query, you should see one object in the Query Result window. Double click it and look at the properties. On a working system you will see the SupportEnumeration property is set to “True” for the CimWin32 provider, as Figure 3 shows.
Figure 3:
Discrepancy Between the In-Memory and Repository
Now we know WMI is not initializing the provider correctly because there is a discrepancy between the in-memory objects it maintains, and the repository.
So I started debugging the CIMWin32 provider initialization where WMI fills the in-memory structure. During provider initialization, WMI makes a REFERENCES OF query to get all types CIMWin32 supports. On our customer’s machine, this query returned zero objects, whereas on a working system it returned two objects (one for __InstanceProviderRegistration and other for __MethodProviderRegistration). Here's a screenshot of this action.
"references of {__Win32Provider.Name="CimWin32"}"
Here is a stack of how we found the exact “REFERENCES OF“ query being issued.
0:007> kL ChildEBP RetAddr 0164fa38 59841a6c wbemcore!CWbemNamespace::ExecQuery 0164fa88 59841c41 wmiprvsd!CServerObject_ProviderRegistrationV1::QueryRepositoryUsingQuery+0x47 0164faac 59841d31 wmiprvsd!CServerObject_ProviderRegistrationV1::QueryRepository+0x42 0164fac8 5980f056 wmiprvsd!CServerObject_ProviderRegistrationV1::Load+0x33 0164fba8 762e9cfd wmiprvsd!CServerObject_BindingFactory::GetProvider+0x1f5 0164fc24 76301d1e wbemcore!CWbemNamespace::DynAux_ExecQueryExtendedAsync+0x81 0164fce0 76302912 wbemcore!CQueryEngine::ExecComplexQuery+0x1f4 0164fdec 763032bd wbemcore!CQueryEngine::ExecQlQuery+0x37 0164fe90 762fc769 wbemcore!CQueryEngine::ExecQuery+0x228 0164feac 762cef24 wbemcore!CAsyncReq_ExecQueryAsync::Execute+0x19 0164fed8 762ced4e wbemcore!CCoreQueue::pExecute+0x3c 0164ff08 762f25cb wbemcore!CCoreQueue::Execute+0x18 0164ff50 762cee89 wbemcore!CWbemQueue::Execute+0xf6 0164ff84 762cf055 wbemcore!CCoreQueue::ThreadMain+0x111 0164ffb4 7c80b683 wbemcore!CCoreQueue::_ThreadEntry+0x45 0164ffec 00000000 kernel32!BaseThreadStart+0x37 0:044> du 04f57424 04f57424 "references of {__Win32Provider.N" 04f57464 "ame="CIMWin32"}"
0:007> kL
0164fa38 59841a6c wbemcore!CWbemNamespace::ExecQuery
0164fa88 59841c41 wmiprvsd!CServerObject_ProviderRegistrationV1::QueryRepositoryUsingQuery+0x47
0164faac 59841d31 wmiprvsd!CServerObject_ProviderRegistrationV1::QueryRepository+0x42
0164fac8 5980f056 wmiprvsd!CServerObject_ProviderRegistrationV1::Load+0x33
0164fba8 762e9cfd wmiprvsd!CServerObject_BindingFactory::GetProvider+0x1f5
0164fc24 76301d1e wbemcore!CWbemNamespace::DynAux_ExecQueryExtendedAsync+0x81
0164fce0 76302912 wbemcore!CQueryEngine::ExecComplexQuery+0x1f4
0164fdec 763032bd wbemcore!CQueryEngine::ExecQlQuery+0x37
0164fe90 762fc769 wbemcore!CQueryEngine::ExecQuery+0x228
0164feac 762cef24 wbemcore!CAsyncReq_ExecQueryAsync::Execute+0x19
0164fed8 762ced4e wbemcore!CCoreQueue::pExecute+0x3c
0164ff08 762f25cb wbemcore!CCoreQueue::Execute+0x18
0164ff50 762cee89 wbemcore!CWbemQueue::Execute+0xf6
0164ff84 762cf055 wbemcore!CCoreQueue::ThreadMain+0x111
0164ffb4 7c80b683 wbemcore!CCoreQueue::_ThreadEntry+0x45
0164ffec 00000000 kernel32!BaseThreadStart+0x37
0:044> du 04f57424
04f57424 "references of {__Win32Provider.N"
04f57464 "ame="CIMWin32"}"
The repository driver, repdrvfs.dll, works on the query above. The references query failed with error 80041017 (WBEM_E_INVALID_QUERY) due to an apostrophe in the computer name. While parsing the query, the repository driver considers anything between the apostrophes, or quotes, as a name. This apostrophe in the computer name misleads WMI to look for a closing apostrophe, which is not there, subsequently the references query fails with error.
0:005> kLChildEBP RetAddr 05cbfec8 75214a74 repdrvfs!CNamespaceHandle::ExecReferencesQuery+0x1dd 05cbff88 75215518 repdrvfs!CNamespaceHandle::ExecQuerySink+0xc2 05cbffac 7520e11c repdrvfs!CExecQueryObject::Execute+0x28 05cbffb4 7c80b683 repdrvfs!A51FiberBase+0xd 05cbffec 7c82ffa9 kernel32!BaseThreadStart+0x37 05cbffec 00000000 kernel32!BaseFiberStart+0x17 0:005> du 00b56218 00b56218 "references of {\\COMPUTERNAME'User1"00b56258 "\ROOT\cimv2:__Win32Provider.Name"00b56298 "="CIMWin32"}" 0:005>r
0:005> kLChildEBP RetAddr
05cbfec8 75214a74 repdrvfs!CNamespaceHandle::ExecReferencesQuery+0x1dd
05cbff88 75215518 repdrvfs!CNamespaceHandle::ExecQuerySink+0xc2
05cbffac 7520e11c repdrvfs!CExecQueryObject::Execute+0x28
05cbffb4 7c80b683 repdrvfs!A51FiberBase+0xd
05cbffec 7c82ffa9 kernel32!BaseThreadStart+0x37
05cbffec 00000000 kernel32!BaseFiberStart+0x17
0:005> du 00b56218 00b56218 "references of {\\COMPUTERNAME'User1"00b56258 "\ROOT\cimv2:__Win32Provider.Name"00b56298 "="CIMWin32"}"
0:005>r
WMI makes a similar REFERENCES OF query for all the providers to get the types they support. But they all fail for the same reason. Hence, WMI queries to all the providers fail. We renamed the computer name without an apostrophe and rebooted the box. After the rename, WMI initialized properly, and the WMI queries started working again. An apostrophe is not a valid character for the computer name. Unfortunately, we never found out how the computer name got set with an apostrophe. If we try to set the computer name through GUI, it throws an error saying the computer name has invalid characters. Hopefully through this experience, you have a better understanding of how WMI implements providers to help you better troubleshoot your next WMI issue.
Here are the main takeaways from this article-
1. WMI maintains an instance of __Win32Provider system class for each provider. Based on the provider’s registration (see .mof file above) it maintains one or more registration system classes such as __InstanceProviderRegistration.
2. WMI uses the REFERENCES OF query to map the instance of __Win32Provider to the providers registration types (e.g. __InstanceProviderRegistration).
3. To view the various providers registered on your system, connect to root\Cimv2 and issue the following query: “Select * from __Win32Provider”.
4. And to get the types that each provider supports, issue the following query:references of {__Win32Provider.Name="<PROVIDER_NAME>"}
Hi, my name is Todd Webb, I am an Escalation Engineer from the Global Escalation Services OEM Team.
I recently worked a case where we were trying to troubleshoot an issue and the system would hang with USB debugging enabled. This was a problem since USB debugging was the only kernel debugging option available on this particular hardware setup. I needed a quick and easy way to find the source of the problem without using many of the standard methods. Basically I had to debug a debugger issue with no access to a debugger. I knew the problem was likely USB related, but had no other information. The method I used to troubleshoot this issue was to write some macros that would make a call to HalMakeBeep to generate audible indications of where we were in the code. The primary reason for using this method was I could quickly and easily move the location in the code where I made the calls as I narrowed down where the problem occurred.
The alternative would have been to add a bunch of logging and find some way to force a dump. My initial assessment was this would be more time consuming and less flexible then the beep code method. Others may have other methods to do this, but this just gives people another method of debugging driver start-up problems on systems that are not debuggable.
The main beep macro allows you to pass in a frequency and duration in milliseconds for a single beep. Then based on the example below you can extend the single beep macro to create a series of beeps.
Here is an example of the macros I used:
//
// beep macros for troubleshooting without debugger
#define EHCI_BEEP1(f, d) \
{ \
LARGE_INTEGER duration; \
duration.QuadPart = Int32x32To64(d, -10000); \
HalMakeBeep(f); \
KeDelayExecutionThread(KernelMode, FALSE, &duration); \
HalMakeBeep(0); \
duration.QuadPart = Int32x32To64(500, -10000); \
}
#define EHCI_BEEP2(f, d) \
EHCI_BEEP1(f, d); \
#define EHCI_BEEP3(f, d) \
EHCI_BEEP2(f, d); \
Here is an example of a long beep followed by three short beeps:
// debug - 1 long : 3 short
EHCI_BEEP1(600, 1000);
EHCI_BEEP3(600, 500);
Trey Nash here again, and I would like to discuss a scenario we are all too familiar with. You’ve worked your tail off for the past year, and for the past couple of months you even worked evenings and weekends. Management rewarded your team with two weeks off in appreciation of your efforts. But now that you’re back in the office, you’re hearing rumors percolating up from your tech support department that there are some cases where your application is crashing in the field for mysterious reasons. What do you do?
The application happens to have been built with an unhandled exception filter registered via the AppDomain.UnhandledException event. Therefore, you at least know that the application is failing with an InvalidCastException, but you cannot imagine why this is happening.
Wouldn’t it be nice if you could live debug on the affected system? Unless you work on-site for your customer or your software is on a laptop and your customer is willing to send it to you, then I doubt you will get this opportunity. What you need is a tool to capture the state of your application while it is failing. Then the customer could capture this information and send it to you.
Enter ADPlus. ADPlus is a free tool in the Debugging Tools for Windows package that scripts the CDB debugger allowing you to capture dumps for one or multiple processes on a system. It also offers the following advantages:
· ADPlus can monitor desktop applications, services, etc.
· ADPlus can monitor multiple processes on the system. When it collects a dump of those processes, it freezes and dumps them simultaneously. This is essential for tracking down problems with inter-process communications.
· ADPlus supports xcopy deployment meaning the customer does not need to install anything via Windows Installer, etc. This minimizes configuration changes on the machine, and that is music to customers’ ears.
Note: Although ADPlus is xcopy installable, you still have to install the Debugging Tools for Windows package via Windows Installer as that is the only way Microsoft distributes it. However, once you have installed Debugging Tools for Windows once, you can xcopy deploy ADPlus or the entire Debugging Tools for Windows package to another machine. In fact, during development, I find it extremely handy to check in the development tools into the source repository. Debugging Tools for Windows supports this by virtue of the fact that it is xcopy installable.For those of you that are savvy with Windows Installer, you can perform an admin install using the msi for Debugging Tools for Windows and that will allow you to extract the files without actually installing the package on the machine, for example:msiexec /a dbg_x86_6.11.1.404.msi
With all of that said, let’s see how ADPlus can help you diagnose problems with .NET applications.
In the rest of this post, I will reference the C# 3.0 sample application that I have put together to illustrate using ADPlus to capture an unhandled exception in a .NET application. The code is listed below:
using System;
using System.Linq;
using System.Runtime.Serialization;
class A
{
public void SaySomething() {
Console.WriteLine( "Yeah, Peter...." );
throw new BadDesignException();
class B : A
class C
class EntryPoint
static void Main() {
DoSomething();
static void DoSomething() {
Func<int, object> generatorFunc = (x) => {
if( x == 7 ) {
return new C();
} else {
return new B();
};
var collection = from i in Enumerable.Range( 0, 10 )
select generatorFunc(i);
// Let's iterate over each of the items in the collection
// ASSUMING THEY ARE ALL DERIVED FROM A !!!!
foreach( var item in collection ) {
A a = (A) item;
try {
a.SaySomething();
catch( BadDesignException ) {
// Swallow these here. The programmer chose to
// use exceptions for normal control flow which
// is *very* poor design.
public class BadDesignException : Exception
public BadDesignException() { }
public BadDesignException( string msg ) : base( msg ) { }
public BadDesignException( string msg, Exception x ) : base( msg, x ) { }
protected BadDesignException( SerializationInfo si,
StreamingContext ctx )
:base( si, ctx ) { }
You can compile this sample code easily by putting it in a file, such as test.cs, and then from either a Visual Studio Command Prompt or from a Windows SDK Command Shell, execute the following:
csc /debug+ test.cs
Note: The code is contrived and there are many bad things about this code from a design/coding standpoint, but that is intentional for the sake of illustration. For example, one may want to re-think introducing a collection that contains references to System.Object.The code above also uses features that are new to C# 3.0 (for brevity) including lambda expressions and LINQ. If you would like to become more familiar with them, visit the C# website on MSDN or reference one of the excellent books on C# such as Pro LINQ or Accelerated C# 2008.
The notable section of the code to focus on is the foreach loop within the EntryPoint.Main() method. In that foreach loop, we are iterating over a collection of objects where we are assuming that they all derive from type A. Thus the code attempts to cast all instances to a reference of type A and since I have intentionally put an instance of type C within that collection, it will fail at some point with an exception of type System.InvalidCastException.
Using ADPlus in crash mode, we can capture the exception in the customer’s environment. To illustrate this in action, let’s launch ADPlus in crash mode and monitor the test.exe example application built above by using the following command line:
adplus -crash -o c:\temp\test -FullOnFirst -sc c:\temp\test\test.exe
Note: I built and tested this code in a directory named c:\temp\test on my machine as I wrote this. Therefore, you will see references to it throughout this post. Incidentally, I have found ADPlus is a lot easier to use if you feed it fully qualified paths. If you get mysterious errors or behavior, and you are using relative paths with ADPlus, try fully qualifying the paths before you beat your head against the wall for too long trying to figure out what could be going wrong.
ADPlus is easier to launch if you add the directory where the Debugging Tools for Windows was installed to your PATH environment variable. The command line above assumes this has been done. Now, let me explain the command line options that I used above. I highly recommend that you become familiar with all of the ADPlus command line options.
· -crash launches ADPlus in crash mode. This is the mode you want to use if your application is failing because of an unhandled exception.
· -o c:\temp\test tells ADPlus that I want the output to be placed in c:\temp\test, which is the directory in which I built the test.exe application.
· -FullOnFirst is very important for managed applications.. This tells ADPlus to grab a full process dump on first chance exceptions. It is essential that you capture full dumps for managed applications, otherwise, all of the necessary data regarding the execution engine and the managed heap will be absent from the dump making it impossible to debug effectively.
· -sc c:\temp\test.exe is only one of the ways you can point ADPlus to an application to monitor. In this case, we’re instructing ADPlus to tell the debugger to explicitly launch the application. Had it been a service, or if the application you want to monitor is already running, we would probably have used –p to attach to its PID or –pn to attach to the process by name. Notice that I provided the full path to the application.
After you launch ADPlus, you are presented with the following dialog unless you use the –quiet option.
Once the application is finished executing, go to the directory that you specified in the ADPlus –o command line option and you should see a subdirectory named similarly to what you see in the previous dialog snapshot. For the instance of test.exe I just executed, that directory is named Crash_Mode__Date_05-11-2009__Time_21-12-54PM. Under that directory, there are quite a few files that I have listed below:
C:\temp\test\Crash_Mode__Date_05-11-2009__Time_21-12-54PM>dir /bADPlus_report.txtCDBScriptsPID-0__Spawned0__1st_chance_CPlusPlusEH__full_15ac_2009-05-11_21-13-01-332_0eac.dmpPID-0__Spawned0__1st_chance_NET_CLR__full_15ac_2009-05-11_21-12-55-482_0eac.dmpPID-0__Spawned0__1st_chance_NET_CLR__full_15ac_2009-05-11_21-12-56-527_0eac.dmpPID-0__Spawned0__1st_chance_NET_CLR__full_15ac_2009-05-11_21-12-57-370_0eac.dmpPID-0__Spawned0__1st_chance_NET_CLR__full_15ac_2009-05-11_21-12-58-103_0eac.dmpPID-0__Spawned0__1st_chance_NET_CLR__full_15ac_2009-05-11_21-12-58-867_0eac.dmpPID-0__Spawned0__1st_chance_NET_CLR__full_15ac_2009-05-11_21-12-59-663_0eac.dmpPID-0__Spawned0__1st_chance_NET_CLR__full_15ac_2009-05-11_21-13-00-505_0eac.dmpPID-0__Spawned0__1st_chance_NET_CLR__full_15ac_2009-05-11_21-13-02-689_0eac.dmpPID-0__Spawned0__1st_chance_Process_Shut_Down__full_15ac_2009-05-11_21-13-27-743_0eac.dmpPID-0__Spawned0__2nd_chance_NET_CLR__full_15ac_2009-05-11_21-13-04-140_0eac.dmpPID-0__Spawned0__Date_05-11-2009__Time_21-12-54PM.logProcess_List.txt
For each first chance exception, a dump file (.dmp) has been collected. Notice that the dump files can tend to be very large because of the –FullOnFirst option. You can subsequently load these dump files into either Windbg (or its variants) or the Visual Studio Debugger. Personally, I prefer Windbg because I can then load the SOS extension along with the SOSEX extension and dig into the state of the application and the CLR.
Using the ADPlus default configuration as we have above, you can see that ADPlus generated dumps for one first-chance C++ exception, eight first-chance CLR exceptions, one second-chance CLR exception and one dump collected during process shut down.
If you look in your Debugging Tools for Windows directory, you’ll notice that ADPlus is really just a very complex VBScript. It generates quite a bit of useful information along with any problem dump files. ADPlus_report.txt reports the configuration for ADPlus for this run. This is handy if you need to know what it will do for a specific type of exception.
Process_list.txt is generated from executing tlist.exe, another tool that comes with Debugging Tools for Windows.
And finally, the CDBScripts subdirectory contains a .cfg file which is the exact debugger script ADPlus generated and subsequently fed to CDB to get the job done. On my machine, when running against the sample application using the command line from the previous section, this file is named PID‑0__Spawned0.cfg. If you’re ever curious or need to know exactly what ADPlus instructed the debugger to do, this file is the source.
Note: The reason the name contains a zero PID is because we used the –sc option to launch the application. Had we used the –p or –pn option, the PID in the filename would not be zero.
I don’t recommend executing this debugger script in a live debugger as doing so could overwrite the data that you already have collected. Instead, if you need to couple ADPlus with live debugging, you should use the –gs option which I will describe shortly.
You’ll notice that in the previous run of ADPlus, quite a few dump files were generated and each one was fairly large. In reality, there are times where a first chance exception does not always indicate a fatal situation. In Accelerated C# 2008, I go into detail regarding how it is poor design practice to implement any sort of control flow or otherwise reasonably expected behavior using exceptions. This is because exceptions are supposed to be truly exceptional events! For more juicy details on how expensive exceptions are, I invite you to read an excellent blog post by Tess Ferrandez on the topic. At any rate, you may encounter situations where you get a lot of dumps for first chance exceptions you are not interested in as this example shows..
To alleviate this situation, you can create an ADPlus configuration file coupled with the !StopOnException command provided by the SOS extension to instruct ADPlus to filter out only the exceptions you’re interested in. To do this, I created a configuration file named filter.config with the following contents:
<ADPlus>
<!-- Configuring ADPlus to log only exceptions we're interested in -->
<Exceptions>
<Config>
<!-- This is for the CLR exception -->
<Code> clr </Code>
<Actions1> Log </Actions1>
<CustomActions1> .loadby sos mscorwks; !StopOnException System.InvalidCastException 1; j ($t1 = 1) '.dump /ma /u c:\dumps\InvalidCastException.dmp; gn' ; 'gn' </CustomActions1>
<ReturnAction1> VOID </ReturnAction1>
<Actions2> Log </Actions2>
</Config>
</Exceptions>
</ADPlus>
Note: The <CustomActions1> element above is meant to be in one line in the config file.
The <CustomActions1> element is the element of interest in this configuration file. This element allows you to specify which commands the debugger should execute on first chance exceptions. Within this element, you can put any valid debugger commands (except windbg GUI-related commands). If you need to execute multiple commands, as I have above, you simply separate them with semicolons. In the <CustomActions1> element shown above, I first load the SOS extension using the .loadby command. Then, I use the !StopOnException command in predicate mode to set the $t1 pseudo-register to 1 if the exception is of type System.InvalidCastException and 0 otherwise. Then, the following j command is used to create a full dump if $t1 is 1 and do nothing otherwise. The gn command in both paths of the j command tells the debugger to go unhandled so that the exception is propagated up rather than swallowed by the debugger. If you were to go handled, thus swallowing the exception, the exception handlers in the code would never see it and the debugger would alter the behavior of the application. And finally, note that the .dump command used to create the dump indicates the path where the dump will be stored. In this case, I am placing it in the c:\dumps directory on my machine.
Now you can provide this configuration file to ADPlus using the following command:
adplus -crash -o c:\temp\test -FullOnFirst -c c:\temp\test\filter.config -sc c:\temp\test\test.exe
Notice that there are fewer dumps collected. Along with the dump for the InvalidCastException, it also captures a C++ exception as well as when the application shuts down. If you open the C++ exception dump in the debugger and inspect the stack, it shows that the CLR is in the process of generating the InvalidCastException. The CLR catches the C++ exception and converts it into a managed exception. The C++ exception dump was generated because I left the –FullOnFirst option in the previous command line. You can eliminate the C++ exception dump by removing -FullOnFirst.
I already hinted at the –gs ADPlus command line option earlier in this post. This option allows you to create the debugger scripts ADPlus creates without actually running them in the debugger. For example, if you execute the following command (I executed mine from my c:\temp\test directory):
adplus -crash -o c:\temp\test -FullOnFirst -c c:\temp\test\filter.config -gs livedebug
You will notice that ADPlus does not actually launch any debugger or your application. Rather, it creates a subdirectory named livedebug. When you go into that directory, you’ll notice it looks similar in layout to the crash directories created in the previous demonstration. On my machine, I end up with the following two files:
C:\temp\test\livedebug\ADPlus_report.txtC:\temp\test\livedebug\CDBScripts\PID-0__livedebug.cfg
The PID-0__livedebug.cfg file is actually a debugger script file containing debugger commands. All we have to do now is launch our test application in the debugger and then execute this script. Within my c:\temp\test directory, I can launch the debugger using the following command:
windbg test.exe
Once inside the debugger, I can invoke the ADPlus debugger script by executing the following $$< command:
$$<C:\temp\test\livedebug\CDBScripts\PID-0__livedebug.cfg
Once you execute the $$< command, you will notice that the script takes over and performs the same actions as ADPlus does when run conventionally from the command line.
As a further experiment, edit the filter.config file and remove the gn commands. Now the debugger will wait for user input after executing the custom commands rather than continuing execution of the application. This could be handy if you want the opportunity to perform debugging by hand if ADPlus encounters a certain situation.
Throughout this post, I have been focusing on first chance exceptions for the sake of illustration. However, many times, you are only interested in the truly unhandled exceptions. In that case, you would want to capture only second chance exceptions. You certainly would want to capture first chance exceptions if you needed to capture the state at the exact point the exception was thrown and before the OS searches for any handlers. Additionally, if you suspect that the application you are debugging may be handling an exception inappropriately (maybe even blindly swallowing it), then you certainly want to catch first chance exceptions in that case.
In this blog post I have introduced you to ADPlus and the utility it provides when troubleshooting problems in the field. ADPlus lends itself well to capturing exceptional situations in the field since it requires no configuration changes on the affected machine thus making it an easy pill for your customers to swallow. You may also find this very useful when working with your Quality Assurance team during the development phase. For example, how many times have you encountered a situation where a problem only presents itself on a dusty old rarely-used machine in the back corner of a random QA engineer’s office? How many times, in such situations, have you felt like the only way to trouble shoot the problem effectively is to install the Visual Studio debugger and start working on that machine? Furthermore, what if the problem only happens on that dusty old machine about once a week. ADPlus can help you avoid that madness by providing an easy mechanism for capturing full process dumps on the troubled machine so you can then take those dumps to your trusty development machine for further debugging and analysis.
Have fun out there!
In this video Michael introduces the sample based profiling tool named XPERF. He also demonstrates how to use the tool to uncover high DPC’s, and Disk I/O’s. If you would like to see more XPERF blogs on ntdebugging please send your feedback using the email link at the top of the blog site. Feel free to leave comments too.
Location of the video: http://ittv.net/DesktopModules/UltraVideoGallery/uvg.swf?vId=419&portalId=0
BTW - I’m working on making a downloadable version for off-line viewing. Stay Tuned.
Link to Microsoft Windows Performance Toolkit
http://msdn.microsoft.com/en-us/performance/default.aspx
Link to XPerf version 4.5. This ships with the Win7 SDK below.
http://www.microsoft.com/downloads/details.aspx?displaylang=en&FamilyID=6db1f17f-5f1e-4e54-a331-c32285cdde0c
We look forward to your feedback!