Ntdebugging Blog

    Debugging Without a Debugger


    Hi, my name is Todd Webb. I am an Escalation Engineer on the Global Escalation Services OEM Team.

     

    I recently worked a case where we were trying to troubleshoot an issue, and the system would hang whenever USB debugging was enabled. This was a problem since USB debugging was the only kernel debugging option available on this particular hardware setup. I needed a quick and easy way to find the source of the problem without using many of the standard methods. Basically, I had to debug a debugger issue with no access to a debugger. I knew the problem was likely USB related, but had no other information. The method I used to troubleshoot this issue was to write some macros that call HalMakeBeep to generate audible indications of where we were in the code. The primary reason for using this method was that I could quickly and easily move the calls around in the code as I narrowed down where the problem occurred.

     

    The alternative would have been to add a bunch of logging and find some way to force a dump. My initial assessment was that this would be more time consuming and less flexible than the beep code method. Others may have their own ways of doing this, but this gives people another method of debugging driver start-up problems on systems that are not debuggable.

     

    The main beep macro allows you to pass in a frequency and a duration in milliseconds for a single beep. Based on the example below, you can then extend the single-beep macro to create a series of beeps.

     

    Here is an example of the macros I used:

     

    //
    // beep macros for troubleshooting without a debugger
    //

    // Beep at frequency f (Hz) for d milliseconds, then pause 500 ms so
    // consecutive beeps are distinguishable. Int32x32To64(d, -10000)
    // converts milliseconds into the negative (relative) 100-nanosecond
    // units that KeDelayExecutionThread expects.
    #define EHCI_BEEP1(f, d)                                              \
        {                                                                 \
            LARGE_INTEGER duration;                                       \
            duration.QuadPart = Int32x32To64(d, -10000);                  \
            HalMakeBeep(f);                                               \
            KeDelayExecutionThread(KernelMode, FALSE, &duration);         \
            HalMakeBeep(0);                                               \
            duration.QuadPart = Int32x32To64(500, -10000);                \
            KeDelayExecutionThread(KernelMode, FALSE, &duration);         \
        }

    // Two beeps.
    #define EHCI_BEEP2(f, d)                                              \
        {                                                                 \
            EHCI_BEEP1(f, d);                                             \
            EHCI_BEEP1(f, d);                                             \
        }

    // Three beeps.
    #define EHCI_BEEP3(f, d)                                              \
        {                                                                 \
            EHCI_BEEP2(f, d);                                             \
            EHCI_BEEP1(f, d);                                             \
        }

     

    Here is an example of a long beep followed by three short beeps:

     

        //
        // debug - 1 long : 3 short
        //
        EHCI_BEEP1(600, 1000);
        EHCI_BEEP3(600, 500);
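
    To show where such markers might go, here is a minimal sketch of an instrumented start-up routine. The routine and the call it brackets (InitializeControllerHardware) are invented for illustration; only the EHCI_BEEP macros come from above. Note that KeDelayExecutionThread requires IRQL == PASSIVE_LEVEL, so these markers belong in code that runs at passive level. If you hear the single beep but never the double beep, the hang is inside the bracketed call; as the search narrows, the markers simply move deeper into the suspect path.

    NTSTATUS InitializeControllerHardware(PDEVICE_OBJECT DeviceObject); // hypothetical suspect routine

    NTSTATUS
    MyStartDevice(PDEVICE_OBJECT DeviceObject)   // hypothetical start-up routine
    {
        NTSTATUS status;

        EHCI_BEEP1(600, 500);          // marker: reached the start of the routine

        status = InitializeControllerHardware(DeviceObject);

        EHCI_BEEP2(800, 250);          // marker: the suspect call returned

        if (!NT_SUCCESS(status)) {
            EHCI_BEEP3(400, 1000);     // marker: initialization failed
            return status;
        }

        return STATUS_SUCCESS;
    }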

     


    Reversing in Reverse: Linked-List Pool Corruption, a Complete Walkthrough (Part 1)


    My name is Ryan Mangipano (ryanman) and I am a Sr. Support Escalation Engineer at Microsoft. Today’s blog will consist of a complete walkthrough of my recent analysis of a stop 0x50, along with the steps that led me to identify that this crash was caused by pool corruption. In this particular case, I found the answer before completing the actual debug.

     

    To begin the analysis, I entered the !analyze -v command into the debugger and examined the output relating to the bugcheck. This is always my first step since it usually provides an overview of what type of problem occurred. In the output listed below, you can see that the system has bugchecked due to an attempt by kernel mode code to access the invalid address 0xffffff50.

     

     PAGE_FAULT_IN_NONPAGED_AREA (50)

    Invalid system memory was referenced.  This cannot be protected by try-except,

    it must be protected by a Probe.  Typically the address is just plain bad or it

    is pointing at freed memory.

    Arguments:

    Arg1: ffffff50, memory referenced.

    Arg2: 00000000, value 0 = read operation, 1 = write operation.

    Arg3: 80846dd1, If non-zero, the instruction address which referenced the bad memory

                    address.

    Arg4: 00000000, (reserved)

     

    Before proceeding with our debugging, I would like to point out that invalid addresses such as the one listed above are often obtained when code subtracts values from a null pointer. To illustrate that this particular address 0xffffff50 could be obtained in such a manner, I have provided output from the .formats debugger command below. Notice that the hex value 0xffffff50 can represent negative 176 in decimal. It is relatively common to arrive at a negative value by subtracting from a null pointer (which is zero). Since addresses are actually unsigned, this is really just an invalid positive address.

     

    0: kd> .formats ffffff50

    Evaluate expression:

      Hex:     ffffff50

      Decimal: -176
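
    As a quick aside, the same arithmetic is easy to demonstrate in user-mode C. This little program is purely illustrative and not part of the debug session:

    #include <stdio.h>

    int main(void)
    {
        unsigned int addr = 0xffffff50;

        /* Interpreted as a signed 32-bit value, 0xffffff50 is -176... */
        printf("%d\n", (int)addr);        /* prints -176 */

        /* ...which is what subtracting 176 from a zero (null) pointer
           produces in 32-bit two's complement arithmetic. */
        printf("0x%x\n", 0u - 176u);      /* prints 0xffffff50 */

        return 0;
    }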

     

     

    Let’s proceed with our investigation. The output of !analyze -v also provided us with the following:

     

    TRAP_FRAME:  f4009940 -- (.trap 0xfffffffff4009940)

     

     This provided a trap frame address in the form of a DML (Debugger Markup Language) link that can be clicked on to quickly enter the command into the debugger.

     

    Passing .trap the address of a trap frame will show the register state when the trap was generated.  This will also set the context to the state when the trap was generated.  This information was saved into the trap frame by the trap handler that executed after  the trap occurred.

     

     The trap frame address listed above could instead have been obtained by dumping out the stack using the kv command. !analyze -v is very good at displaying the correct trap frame; however, I feel it is good practice to dump out the stack to verify that the trap frame you are entering is in fact the one you are interested in. You can see in the output below that there are two trap frames listed. I am interested in the topmost one, which matches the trap that led us to the bugcheck. I can see that !analyze -v has listed the trap frame we are interested in, so I entered it into the debugger.

     

    0: kd> kv

    ChildEBP RetAddr  Args to Child             

    f40098d8 808692ab 00000050 ffffff50 00000000 nt!KeBugCheckEx+0x1b (FPO: [5,0,0])

    f4009928 80836c2a 00000000 ffffff50 00000000 nt!MmAccessFault+0x813 (FPO: [4,10,4])

    f4009928 80846dd1 00000000 ffffff50 00000000 nt!KiTrap0E+0xdc (FPO: [0,0] TrapFrame @ f4009940)

    f40099b0 809426b2 a9131e08 00000012 861ce318 nt!ObReferenceObjectSafe+0x3 (FPO: [0,0,0])

    f40099c8 808544d9 861ce318 0007d960 00000000 nt!PsGetNextProcess+0x6c (FPO: [1,1,0])

    f4009a58 8094292f 01210048 0007d960 00000000 nt!ExpGetProcessInformation+0x36d (FPO: [SEH])

    f4009d4c 80833bef 00000005 01210048 0007d960 nt!NtQuerySystemInformation+0x11e0 (FPO: [SEH])

    f4009d4c 7c8285ec 00000005 01210048 0007d960 nt!KiFastCallEntry+0xfc (FPO: [0,0] TrapFrame @ f4009d64)

     

    0: kd> .trap 0xfffffffff4009940

     

    The output from the .trap command will be similar to the output from the r command. The r command dumps out the registers and the current instruction that caused the error. Listed below is the output of the two commands:

     

    0: kd> .trap 0xfffffffff4009940

    ErrCode = 00000000

    eax=861ce300 ebx=808b5be8 ecx=ffffff68 edx=ffffff50 esi=8747c8d0 edi=00000000

    eip=80846dd1 esp=f40099b4 ebp=f40099c8 iopl=0         nv up ei pl nz ac pe cy

    cs=0008  ss=0010  ds=0023  es=0023  fs=0030  gs=0000             efl=00010217

    nt!ObReferenceObjectSafe+0x3:

    80846dd1 8b0a            mov     ecx,dword ptr [edx]  ds:0023:ffffff50=????????

     

    0: kd> r

    Last set context:

    eax=861ce300 ebx=808b5be8 ecx=ffffff68 edx=ffffff50 esi=8747c8d0 edi=00000000

    eip=80846dd1 esp=f40099b4 ebp=f40099c8 iopl=0         nv up ei pl nz ac pe cy

    cs=0008  ss=0010  ds=0023  es=0023  fs=0030  gs=0000             efl=00010217

    nt!ObReferenceObjectSafe+0x3:

    80846dd1 8b0a            mov     ecx,dword ptr [edx]  ds:0023:ffffff50=????????  <-- This is the current instruction. We trapped on this instruction trying to dereference edx, which contained ffffff50.

     

    Next, we shall dump out the stack and try to get a feel for what happened based on the function names listed in the stack. Remember that the functions at the bottom of the stack called the functions above them. Here are the function names from the top four stack frames, along with my annotations. In this case I read the stack starting at the bottom and working my way up, instead of the traditional way of starting at the bad value and backtracing to the problem.

     

    0: kd> kcL4

    ·         nt!ObReferenceObjectSafe <-- Finally, this function encountered the bad value and crashed. To observe this for yourself, please examine the output of the 'r' command listed above. You will notice that the current instruction at the time of the trap (the address in EIP) was unassembled. Above this instruction the function name and offset are listed as nt!ObReferenceObjectSafe+0x3. This means that the instruction we were executing when the trap occurred was located at offset 0x3 from the location that the symbol ObReferenceObjectSafe references.

    ·         nt!PsGetNextProcess <-- Based on the function name here, it appears that we were trying to traverse the linked list of active processes on the system.

    ·         nt!ExpGetProcessInformation <-- NtQuerySystemInformation() called this function. Based on the function name GetProcessInformation and the call listed above, this was clearly an attempt to get information about the processes running on the system.

    ·         nt!NtQuerySystemInformation <-- We were trying to query system information.

     

    Normally, my next step would be to unassemble the current function with uf @$ip, locate the current instruction in the output by pasting the address in eip into the find dialog box, and backtrace until I had located the source of the bad value. However, I have a habit of poking around for fun by dumping out various data while I am debugging. The function names on the stack seemed to indicate that we were working with the list of processes, and the address we died on suggested that we may have encountered a null pointer. I couldn't resist pausing a moment to dump out the list of processes to see if we had hit a null pointer while traversing it. After all, I've reviewed dumps before that were that simple. Even if I found a null pointer in the list, I would still have to perform the actual debug and walk the assembly to prove it and identify how we crashed. However, I felt it would be more exciting to work in the reverse of my normal methods. Besides, it only takes a moment once you know what you are doing.

     

    Let me pause a moment to provide a little background information that will later prove relevant to understanding what went wrong. In the nt!_EPROCESS object there is a _LIST_ENTRY structure located at offset 0x98 called ActiveProcessLinks. This forms a linked list of all the active nt!_EPROCESS structures on the system. Let us first examine the _LIST_ENTRY structure by reviewing the output below:

     

    1: kd> dt nt!_LIST_ENTRY

       +0x000 Flink            : Ptr32 _LIST_ENTRY

       +0x004 Blink            : Ptr32 _LIST_ENTRY

     

    As you can see above, a _LIST_ENTRY contains two pointers that create what is called a doubly linked list. Blink represents a backward link and Flink represents a forward link. A doubly linked list may be traversed in either direction by following the pointer to the next or previous entry in the list. Linked lists are used heavily in operating system programming. This particular linked list contains a pointer to the previous _EPROCESS object’s ActiveProcessLinks field (blink) and the next _EPROCESS object’s ActiveProcessLinks field (flink). The following command will display the ActiveProcessLinks field of the nt!_EPROCESS structure.

    0: kd> dt nt!_EPROCESS activeprocesslinks

   +0x098 ActiveProcessLinks : _LIST_ENTRY  <-- the +0x098 lets us know that this field is located 0x98 bytes from the start of each nt!_EPROCESS structure. All of the _EPROCESS structures on the system are linked together using the flink/blink method. This allows us to enumerate all of the processes on the system by following the links in either direction.
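
    As an aside, here is a rough C sketch of how kernel-style code typically walks such a list. This is illustrative only (MY_ENTRY and its layout are invented, and this is not the Windows source), but it shows the CONTAINING_RECORD idiom that converts a pointer to an embedded _LIST_ENTRY back into a pointer to the start of the containing structure:

    //
    // Illustrative only: MY_ENTRY stands in for a structure like
    // _EPROCESS that embeds a LIST_ENTRY at some offset, the way
    // ActiveProcessLinks sits at offset 0x98.
    //
    typedef struct _MY_ENTRY {
        ULONG      Data;        // ...fields preceding the links...
        LIST_ENTRY Links;       // analogous to ActiveProcessLinks
    } MY_ENTRY, *PMY_ENTRY;

    VOID WalkList(PLIST_ENTRY ListHead)
    {
        PLIST_ENTRY entry;

        for (entry = ListHead->Flink; entry != ListHead; entry = entry->Flink) {
            //
            // entry points at the embedded Links field, not at the start of
            // MY_ENTRY, so CONTAINING_RECORD subtracts the field's offset --
            // the same "address minus 0x98" idea discussed below for _EPROCESS.
            //
            PMY_ENTRY item = CONTAINING_RECORD(entry, MY_ENTRY, Links);
            UNREFERENCED_PARAMETER(item);   // ...use item here...
        }
    }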

     

    The first entry of a linked list is commonly referred to as the list head. The list head is a _LIST_ENTRY structure that points to the first and last entries of the linked list. Global variables are often used to store the location of the list head. Once you have the location of the head, you can then use its flink/blink values to traverse the list in either direction. nt!PsActiveProcessHead is a global variable that points to the beginning of the list of active processes that we discussed in the previous paragraph. If you traverse this list starting at the head, you will land on the ActiveProcessLinks field of each process on the system. This allows code to find each _EPROCESS structure on the system by simply traversing this list. If you are a true geek like me and are learning this for the first time, you are probably going to fire up your debugger and dump out the list of active processes on your own system before you reach the end of this article.

    Please note that by following the flink/blink you will arrive at offset 0x098 within the structure, not at its start. So the complete list consists of the _LIST_ENTRY field present in each _EPROCESS structure plus the _LIST_ENTRY located at the memory referenced by the global variable nt!PsActiveProcessHead. You can use the 'x' examine symbols command to display this variable. I included the wildcards below only to demonstrate the ability to use wildcards with these commands.

     

    0: kd> x nt!*PsActiveProcessHead*

    808b5be8 nt!PsActiveProcessHead = <no type information>

    0: kd> dt nt!*PsActiveProcessHead*

    808b5be8  ntkrnlmp!PsActiveProcessHead

     

    Back to our investigation. At the point where we left off, we had not yet dug into the code to identify exactly what happened. We instead took an educated guess and decided to dump out the list of processes on the system. If we find a null value, this may very well allow us to work in reverse of our normal debugging. Armed with the background information above, we shall now dump out the list. If we get lucky, this will expedite the review of this memory dump by telling us what may have gone wrong before we even start digging into the code.

     

    The beginning of the list sounds like a good place to start traversing the entries. The value stored in the global variable above marks the address where the linked list starts. The following command will dump out this location and provide you with the flink and blink. The /c1 switch was used in order to limit the output to one column for easy blog annotation. For more information on the /c1 switch, type .hh dd into the debugger command window.

     

    0: kd> dd /c1 nt!PsActiveProcessHead L2

    808b5be8  8a78c800  <-- this is the forward link (flink)
    808b5bec  85ddd510  <-- this is the backwards link (blink)

     

    Now that we have the address where the list starts, we need a way to dump it out. Windbg provides several ways to automatically walk a linked list: you can use the dt, !list, or dl commands. In this case, let's use dl due to its simplicity. More information about this command may be obtained from the help file that is included with windbg.

     

    0: kd> dl 0x8a78c800 ff 2   <-- Here is an example of how to use this command to traverse the linked list starting at address 0x8a78c800. The value ff represents the maximum number of links to display. The value 0x8a78c800 was obtained from the flink field of the _LIST_ENTRY present at the global variable listed above. The value 2 represents the number of values to dump from each address. Since this linked list contains a forward link (flink) and a backwards link (blink), we pass in a value of 2 to dump both addresses.

     

    As displayed below, I decided to use the global nt!PsActiveProcessHead instead of specifying an address to the command dl.  Here is the beginning of the output as displayed by the debugger.

     

    0: kd> dl nt!PsActiveProcessHead ff 2

    808b5be8  8a78c800 85ddd510 

    8a78c800  89af0370 808b5be8  

    888745a0  888492e0 89af0370

    888492e0  89b17708 888745a0

    89b17708  89b4e3d0 888492e0

    … Omitting the rest of the output

     

    The plain output displayed above represents what you will see when entering the dl command into the debugger. In order to better illustrate the patterns and show the relationship between the flink and blink values, I have annotated various portions of the output and listed it below. Note the annotations that demonstrate how each member of the linked list points to the next and previous items. You can follow the list all the way down until you see that we do in fact end up at a zeroed-out address.

     

    0: kd> dl nt!PsActiveProcessHead ff 2

    808b5be8  8a78c800 85ddd510  <-- Notice that the dl command first displays address 808b5be8, which is the address referenced by the symbol information for the global variable nt!PsActiveProcessHead that we provided. The next two values are the flink and the blink. Notice how this flink points to the next memory location listed below.

     

    8a78c800  89af0370 808b5be8   <-- In this line of output, notice that the dl command has followed the flink to address 8a78c800 and is dumping out the new flink and blink pair. Notice that the blink points back to the address of the previous flink/blink pair.

     

    89af0370  888745a0 8a78c800  <-- Notice the patterns displayed, follow the output all the way to the bottom, and see if you notice anything that may be a problem.

     

    888745a0  888492e0 89af0370

    888492e0  89b17708 888745a0

    89b17708  89b4e3d0 888492e0

    89b4e3d0  8883fe20 89b17708

    8883fe20  89ab73e8 89b4e3d0

    89ab73e8  887d7650 8883fe20

    887d7650  888906a8 89ab73e8

    888906a8  89483408 887d7650

    89483408  88839d40 888906a8

    … Omitting output to reduce length. The links were all valid…

     

    85ff6e20  8708be20 8663a278

    8708be20  87176940 85ff6e20

    87176940  861ce3b0 8708be20

    861ce3b0  860010b8 87176940

    860010b8  00000000 00000000   <-- Wait... this doesn't look correct. The blink should be 861ce3b0, not zero. The previous list entry's flink took us here to address 860010b8. It appears that our theory might just be correct. Observing this, I suspected that memory location 860010b8 had been overwritten with zeros. It is also possible, however, that the flink value above was incorrect, causing us to end up at some random address that contained zeros. To know for sure, I dumped the list backwards to see if the next flink/blink pair had a blink pointing to 860010b8.

     

    0: kd> dlb nt!PsActiveProcessHead ff 2  <-- This command dumps the list backwards. Below you can see the dlb command walking the linked list by starting at the list head and moving backwards (the list should make a circle). Don't be confused by the different patterns below; this is simply because the list is being walked backwards. The output displays the three values in the same order: address, flink, blink.

     

    0: kd> dlb nt!PsActiveProcessHead ff 2

    808b5be8  8a78c800 85ddd510 

    85ddd510  808b5be8 85e58e20

    85e58e20  85ddd510 8646ce20

    8646ce20  85e58e20 85d92c08

    85d92c08  8646ce20 8616ce20

    … Omitting output to reduce length. The links were all valid…

    872ba380  86e456a0 86c8da10

    86c8da10  872ba380 8692f708

    8692f708  86c8da10 86d630b8

    86d630b8  8692f708 8673eae0

    8673eae0  86d630b8 87223a98

    87223a98  8673eae0 860010b8  

    860010b8  00000000 00000000    <-- We arrived at the same address, 860010b8. So when walking the list both forward and backward we encounter pointers that take us to this address. This means that there should have been a valid flink/blink pair (which would be a valid nt!_EPROCESS ActiveProcessLinks _LIST_ENTRY) at this address. This memory appears to have instead been zeroed out. Its flink pointer should have been 87223a98.

     

    When dumping the list out forward using dl, the line of output representing 860010b8 should have contained the values below; however, it did not.

    860010b8  87223a98  861ce3b0 

     

    So what happened? Dumping out the area around the address, you can see it has been zeroed:

     

    0: kd> dd 860010b8-200 860010b8+200

    86000eb8  00000000 00000000 00000000 00000000

    86000ec8  00000000 00000000 00000000 00000000

    86000ed8  00000000 00000000 00000000 00000000

    86000ee8  00000000 00000000 00000000 00000000

    … Omitting output to reduce length. Entire range contained all zeros…

    86001088  00000000 00000000 00000000 00000000

    86001098  00000000 00000000 00000000 00000000

    860010a8  00000000 00000000 00000000 00000000

    860010b8  00000000 00000000 00000000 00000000

    860010c8  00000000 00000000 00000000 00000000

    860010d8  00000000 00000000 00000000 00000000

    860010e8  00000000 00000000 00000000 00000000

    … Omitting output to reduce length. Entire range contained all zeros…

    86001288  00000000 00000000 00000000 00000000

    86001298  00000000 00000000 00000000 00000000

    860012a8  00000000 00000000 00000000 00000000

    860012b8  00000000

     

    Out of curiosity, I ran the !address command to see what range this address falls into.

    0: kd> !address 860010b8

      82b7e000 - 07c82000                          

              Usage       KernelSpaceUsageNonPagedPool

     

    Next, we dump the pool information out. Note the corruption indication listed below.

     

    0: kd> !pool 860010b8

    Pool page 860010b8 region is Nonpaged pool

    86001000 is not a valid large pool allocation, checking large session pool...

    86001000 is freed (or corrupt) pool

    Bad allocation size @86001000, zero is invalid

    ***

    *** An error (or corruption) in the pool was detected;

    *** Attempting to diagnose the problem.

    ***

    *** Use !poolval 86001000 for more details.

    ***

     

    0: kd> !poolval 86001000

    Pool page 86001000 region is Nonpaged pool

    Validating Pool headers for pool page: 86001000

    Pool page [ 86001000 ] is __inVALID.

    Analyzing linked list...

    [ 86001000 --> 86001310 (size = 0x310 bytes)]: Corrupt region

    Scanning for single bit errors...

    None found

     

    So now we have opened the dump, set the trap frame, dumped the stack, and found the answer. Next, we need to do the actual debug of the assembly code to both prove our theory and determine how this caused us to die. Let's revisit the ActiveProcessLinks field that we discussed before.

     

    0: kd> dt nt!_EPROCESS activeprocesslinks

   +0x098 ActiveProcessLinks : _LIST_ENTRY  <-- the linked list we were following is located 0x98 from the start of the nt!_EPROCESS. The previous _LIST_ENTRY provided us with the address of this field in the next process object. To get to the start of the actual process object, we need to subtract 0x98 from that value. For example, if you were passed a null address instead of a valid pointer and subtracted 0x98, you would get:

     

    0: kd> ? 0x0-0x098 

    Evaluate expression: -152 = ffffff68  <-- If we start with a null value and back up 0x98 we get ffffff68

     

    As you can see below, that very value happens to be present in ECX. ECX should contain the address of our process object; however, since our linked list pointer was null, we were instead following an invalid address. It’s all downhill from here:

     

    0: kd> recx  <-- What’s in ECX?

    Last set context:

    ecx=ffffff68   <-- it’s the offset from the null pointer

     

    Let’s keep digging to verify this by looking back at the assembly and walking through what happened in the final moments of the crash. To do this, I unassembled backwards using the ub command. Examination of the output of this command (listed below, but limited to one instruction for blog simplicity) shows that the code took the invalid address ffffff68, which should have been a pointer to the process object, and subtracted 0x18. This gave us the invalid address ffffff50 that we died on. The . represents the current instruction pointer, and L1 tells the debugger to display only one unassembled instruction.

     

    0: kd> ub . L1

    nt!ObReferenceObjectSafe

    80846dce 8d51e8          lea     edx,[ecx-18h]   <-- we took the process object pointer in ecx, subtracted 0x18 from it, and put the result into edx to dereference

     

    0: kd> redx

    edx=ffffff50

     

    0: kd> ? ffffff68-0x18

    Evaluate expression: -176 = ffffff50

     

    Next, we tried to dereference this invalid pointer. Note that @$ip is another way (just like .) to refer to the instruction pointer.

     

    0: kd> u @$ip L1

    nt!ObReferenceObjectSafe+0x3:

    80846dd1 8b0a            mov     ecx,dword ptr [edx]  <-- edx contains the invalid value

     

    You can also use the r command to dump out all of the registers and display this failed instruction as shown below.

     

    0: kd> r

    Last set context:

    eax=861ce300 ebx=808b5be8 ecx=ffffff68 edx=ffffff50 esi=8747c8d0 edi=00000000

    eip=80846dd1 esp=f40099b4 ebp=f40099c8 iopl=0         nv up ei pl nz ac pe cy

    cs=0008  ss=0010  ds=0023  es=0023  fs=0030  gs=0000             efl=00010217

    nt!ObReferenceObjectSafe+0x3:

    80846dd1 8b0a            mov     ecx,dword ptr [edx]  ds:0023:ffffff50=????????  <-- we trapped trying to dereference the bad address that edx was directing us to

     

    So, why would we be backing up from this address? Each process object has an Object Header that precedes it. The following ?? command uses the current expression evaluator (C++ is the default) to apply the sizeof() operator to the nt!_OBJECT_HEADER type and output the size of an object header.

     

    0: kd> ?? sizeof (nt!_OBJECT_HEADER)

    unsigned int 0x20

     

    You can dump out this header to see where the start of the process object is with the following command:

     

    0: kd> dt nt!_OBJECT_HEADER

   +0x000 PointerCount     : Int4B  <-- the code was accessing offset 0 of the Object Header, which is the PointerCount field

       +0x004 HandleCount      : Int4B

       +0x004 NextToFree       : Ptr32 Void

       +0x008 Type             : Ptr32 _OBJECT_TYPE

       +0x00c NameInfoOffset   : UChar

       +0x00d HandleInfoOffset : UChar

       +0x00e QuotaInfoOffset  : UChar

       +0x00f Flags            : UChar

       +0x010 ObjectCreateInfo : Ptr32 _OBJECT_CREATE_INFORMATION

       +0x010 QuotaBlockCharged : Ptr32 Void

       +0x014 SecurityDescriptor : Ptr32 Void

   +0x018 Body             : _QUAD  <-- The process object is located here at offset 0x18. This means that backing up 0x18 puts you at offset 0x0 of the object header, which is the PointerCount field.
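
    Purely as an illustration of that layout, here is a hedged C sketch. MY_OBJECT_HEADER is a simplified stand-in built from the 32-bit dt output above (it is not the real kernel definition), showing how code recovers the header, and hence PointerCount, from an object body pointer:

    //
    // Simplified 32-bit sketch of the header layout dumped above.
    //
    typedef struct _MY_OBJECT_HEADER {
        LONG       PointerCount;        // +0x000 -- the field being accessed
        LONG       HandleCount;         // +0x004
        PVOID      Type;                // +0x008
        UCHAR      NameInfoOffset;      // +0x00c
        UCHAR      HandleInfoOffset;    // +0x00d
        UCHAR      QuotaInfoOffset;     // +0x00e
        UCHAR      Flags;               // +0x00f
        PVOID      ObjectCreateInfo;    // +0x010
        PVOID      SecurityDescriptor;  // +0x014
        ULONGLONG  Body;                // +0x018 -- the object itself starts here
    } MY_OBJECT_HEADER, *PMY_OBJECT_HEADER;

    PMY_OBJECT_HEADER HeaderFromObject(PVOID Object)
    {
        //
        // CONTAINING_RECORD subtracts the offset of Body (0x18) -- the same
        // "lea edx,[ecx-18h]" we saw in the crashing code. With the
        // NULL-derived object pointer 0xffffff68, that subtraction yields
        // 0xffffff50, exactly the address we faulted on.
        //
        return CONTAINING_RECORD(Object, MY_OBJECT_HEADER, Body);
    }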

     

    So let’s see if we can locate any stack data to validate this. Dumping the stack again, I observe the address of the process object.

     

    0: kd> kvL2

    ChildEBP RetAddr  Args to Child             

    f40099b0 809426b2 a9131e08 00000012 861ce318 nt!ObReferenceObjectSafe+0x3 (FPO: [0,0,0])

    f40099c8 808544d9 861ce318 0007d960 00000000 nt!PsGetNextProcess+0x6c (FPO: [1,1,0])

     

    Let’s dump out this process object’s ActiveProcessLinks field. Wow! Look what we have here: the address of the zeroed list entry from above.

    0: kd> dt nt!_EPROCESS 0x861ce318 ActiveProcessLinks

       +0x098 ActiveProcessLinks : _LIST_ENTRY [ 0x860010b8 - 0x87176940 ]

     

    0: kd> dd 0x860010b8 L2

    860010b8  00000000 00000000

     

    So to summarize the problem, we crashed because a memory address that should have contained valid flink/blink pointers to a process object contained the invalid value 00000000.  We expected this value to be pointing to a valid nt!_EPROCESS  ActiveProcessLinks field instead of containing a null pointer (zero). After obtaining this null pointer, we then subtracted 0x098 which normally would have brought us to the start of the process object. Instead we ended up with invalid pointer 0xffffff68. Finally, we subtracted another 0x18 which should have brought us to the PointerCount  field of the object header. This gave us address 0xffffff50 instead of a valid object header address. We then dereferenced this invalid address which led us to the bugcheck.

     

    This crash was due to pool corruption, as shown in the output of the !pool command above.  This blog has been a walkthrough of debugging a memory dump and identifying how pool corruption led up to a crash. Identification of the reason for the corruption is a separate matter. These types of issues are often difficult to troubleshoot due to the fact that any driver in kernel mode could have corrupted this memory region. To further complicate matters, the actual corruption of this memory may have taken place millions of CPU cycles before the time this code stumbled across the corrupted region leading up to the crash that caused this memory dump to be initiated. A memory dump is simply a snapshot of information relating to the state of a system during one particular moment in time.  

     

    In this case, in order to identify the source of the pool corruption we need to use Special Pool. Special Pool uses guard pages to catch a buffer overrun or underrun and should provide us with a dump that shows the code that causes the corruption. You can find more information on Special Pool in KB188831. Also, in some situations memory corruption can be caused by a driver overflowing a DMA transfer, corrupting physical pages rather than the virtual pages we see in typical pool corruption.
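
    As a hedged example, one common way to turn Special Pool on for a single suspect driver is through Driver Verifier, where flag 0x1 selects Special Pool (the driver name below is a placeholder, and a reboot is required for the setting to take effect):

    verifier /flags 0x1 /driver mydriver.sys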

                   

    Usually, a debug such as this one would have started with analysis of the function at the top of the stack and worked down the stack, reading the assembly to definitively identify the source of the bad pointer. However, in this case we were lucky enough to find the answer right away and then validate it with conventional debugging methods by reversing in reverse.


    Capturing ADPlus CLR crashes


    Trey Nash here again, and I would like to discuss a scenario we are all too familiar with.  You’ve worked your tail off for the past year, and for the past couple of months you even worked evenings and weekends.  Management rewarded your team with two weeks off in appreciation of your efforts.  But now that you’re back in the office, you’re hearing rumors percolating up from your tech support department that there are some cases where your application is crashing in the field for mysterious reasons.  What do you do?

    The application happens to have been built with an unhandled exception filter registered via the AppDomain.UnhandledException event.  Therefore, you at least know that the application is failing with an InvalidCastException, but you cannot imagine why this is happening.

    Wouldn’t it be nice if you could live debug on the affected system?  Unless you work on-site for your customer or your software is on a laptop and your customer is willing to send it to you, then I doubt you will get this opportunity.  What you need is a tool to capture the state of your application while it is failing.  Then the customer could capture this information and send it to you.

    Enter ADPlus. ADPlus is a free tool in the Debugging Tools for Windows package that scripts the CDB debugger, allowing you to capture dumps of one or multiple processes on a system. It also offers the following advantages:

    ·         ADPlus can monitor desktop applications, services, etc.

    ·         ADPlus can monitor multiple processes on the system.  When it collects a dump of those processes, it freezes and dumps them simultaneously.  This is essential for tracking down problems with inter-process communications.

    ·         ADPlus supports xcopy deployment meaning the customer does not need to install anything via Windows Installer, etc.  This minimizes configuration changes on the machine, and that is music to customers’ ears.  

     

    Note:    Although ADPlus is xcopy installable, you still have to install the Debugging Tools for Windows package via Windows Installer, as that is the only way Microsoft distributes it. However, once you have installed Debugging Tools for Windows once, you can xcopy deploy ADPlus or the entire Debugging Tools for Windows package to another machine. In fact, during development, I find it extremely handy to check the debugging tools into the source repository. Debugging Tools for Windows supports this by virtue of the fact that it is xcopy installable.

    For those of you who are savvy with Windows Installer, you can perform an admin install using the msi for Debugging Tools for Windows, which will allow you to extract the files without actually installing the package on the machine, for example:

    msiexec /a dbg_x86_6.11.1.404.msi

     

    With all of that said, let’s see how ADPlus can help you diagnose problems with .NET applications.

    The Sample Application

    In the rest of this post, I will reference the C# 3.0 sample application that I have put together to illustrate using ADPlus to capture an unhandled exception in a .NET application.  The code is listed below:

    using System;
    using System.Linq;
    using System.Runtime.Serialization;

    class A
    {
        public void SaySomething() {
            Console.WriteLine( "Yeah, Peter...." );
            throw new BadDesignException();
        }
    }

    class B : A
    {
    }

    class C
    {
    }

    class EntryPoint
    {
        static void Main() {
            DoSomething();
        }

        static void DoSomething() {
            Func<int, object> generatorFunc = (x) => {
                if( x == 7 ) {
                    return new C();
                } else {
                    return new B();
                }
            };

            var collection = from i in Enumerable.Range( 0, 10 )
                             select generatorFunc(i);

            // Let's iterate over each of the items in the collection
            //
            // ASSUMING THEY ARE ALL DERIVED FROM A !!!!
            foreach( var item in collection ) {
                A a = (A) item;
                try {
                    a.SaySomething();
                }
                catch( BadDesignException ) {
                    // Swallow these here. The programmer chose to
                    // use exceptions for normal control flow which
                    // is *very* poor design.
                }
            }
        }
    }

    public class BadDesignException : Exception
    {
        public BadDesignException() { }
        public BadDesignException( string msg ) : base( msg ) { }
        public BadDesignException( string msg, Exception x ) : base( msg, x ) { }
        protected BadDesignException( SerializationInfo si,
                                      StreamingContext ctx )
            :base( si, ctx ) { }
    }

     

    You can compile this sample code easily by putting it in a file, such as test.cs, and then from either a Visual Studio Command Prompt or from a Windows SDK Command Shell, execute the following:

    csc /debug+ test.cs

     

    Note:    The code is contrived and there are many bad things about this code from a design/coding standpoint, but that is intentional for the sake of illustration.  For example, one may want to re-think introducing a collection that contains references to System.Object.

    The code above also uses features that are new to C# 3.0 (for brevity), including lambda expressions and LINQ. If you would like to become more familiar with them, visit the C# website on MSDN or reference one of the excellent books on C#, such as Pro LINQ or Accelerated C# 2008.

     

    The notable section of the code to focus on is the foreach loop within the EntryPoint.DoSomething() method, which Main() calls. In that foreach loop, we iterate over a collection of objects assuming that they all derive from type A. Thus the code attempts to cast each instance to a reference of type A, and since I have intentionally put an instance of type C within that collection, it will fail at some point with an exception of type System.InvalidCastException.

    Capturing the Problem

    Using ADPlus in crash mode, we can capture the exception in the customer’s environment.  To illustrate this in action, let’s launch ADPlus in crash mode and monitor the test.exe example application built above by using the following command line:

    adplus -crash -o c:\temp\test -FullOnFirst -sc c:\temp\test\test.exe

     

    Note:    I built and tested this code in a directory named c:\temp\test on my machine as I wrote this.  Therefore, you will see references to it throughout this post.  Incidentally, I have found ADPlus is a lot easier to use if you feed it fully qualified paths.  If you get mysterious errors or behavior, and you are using relative paths with ADPlus, try fully qualifying the paths before you beat your head against the wall for too long trying to figure out what could be going wrong.

     

    ADPlus is easier to launch if you add the directory where the Debugging Tools for Windows was installed to your PATH environment variable.  The command line above assumes this has been done.  Now, let me explain the command line options that I used above.  I highly recommend that you become familiar with all of the ADPlus command line options.

    ·         -crash launches ADPlus in crash mode.  This is the mode you want to use if your application is failing because of an unhandled exception.

    ·         -o c:\temp\test tells ADPlus that I want the output to be placed in c:\temp\test, which is the directory in which I built the test.exe application.

    ·         -FullOnFirst is very important for managed applications. This tells ADPlus to grab a full process dump on first chance exceptions. It is essential that you capture full dumps for managed applications; otherwise, all of the necessary data regarding the execution engine and the managed heap will be absent from the dump, making it impossible to debug effectively.

    ·         -sc c:\temp\test\test.exe is only one of the ways you can point ADPlus at an application to monitor. In this case, we’re instructing ADPlus to tell the debugger to explicitly launch the application. Had it been a service, or if the application you want to monitor were already running, we would probably have used -p to attach to its PID or -pn to attach to the process by name. Notice that I provided the full path to the application.

    After you launch ADPlus, you are presented with a confirmation dialog (not shown here) unless you use the -quiet option.

     

    Once the application has finished executing, go to the directory that you specified in the ADPlus -o command line option and you should see a subdirectory named for the mode, date, and time of the run. For the instance of test.exe I just executed, that directory is named Crash_Mode__Date_05-11-2009__Time_21-12-54PM. Under that directory there are quite a few files, which I have listed below:

    C:\temp\test\Crash_Mode__Date_05-11-2009__Time_21-12-54PM>dir /b
    ADPlus_report.txt
    CDBScripts
    PID-0__Spawned0__1st_chance_CPlusPlusEH__full_15ac_2009-05-11_21-13-01-332_0eac.dmp
    PID-0__Spawned0__1st_chance_NET_CLR__full_15ac_2009-05-11_21-12-55-482_0eac.dmp
    PID-0__Spawned0__1st_chance_NET_CLR__full_15ac_2009-05-11_21-12-56-527_0eac.dmp
    PID-0__Spawned0__1st_chance_NET_CLR__full_15ac_2009-05-11_21-12-57-370_0eac.dmp
    PID-0__Spawned0__1st_chance_NET_CLR__full_15ac_2009-05-11_21-12-58-103_0eac.dmp
    PID-0__Spawned0__1st_chance_NET_CLR__full_15ac_2009-05-11_21-12-58-867_0eac.dmp
    PID-0__Spawned0__1st_chance_NET_CLR__full_15ac_2009-05-11_21-12-59-663_0eac.dmp
    PID-0__Spawned0__1st_chance_NET_CLR__full_15ac_2009-05-11_21-13-00-505_0eac.dmp
    PID-0__Spawned0__1st_chance_NET_CLR__full_15ac_2009-05-11_21-13-02-689_0eac.dmp
    PID-0__Spawned0__1st_chance_Process_Shut_Down__full_15ac_2009-05-11_21-13-27-743_0eac.dmp
    PID-0__Spawned0__2nd_chance_NET_CLR__full_15ac_2009-05-11_21-13-04-140_0eac.dmp
    PID-0__Spawned0__Date_05-11-2009__Time_21-12-54PM.log
    Process_List.txt

    For each first chance exception, a dump file (.dmp) has been collected. Notice that the dump files tend to be very large because of the -FullOnFirst option. You can subsequently load these dump files into either Windbg (or its variants) or the Visual Studio debugger. Personally, I prefer Windbg because I can then load the SOS extension along with the SOSEX extension and dig into the state of the application and the CLR.

    Using the ADPlus default configuration as we have above, you can see that ADPlus generated dumps for one first-chance C++ exception, eight first-chance CLR exceptions, one second-chance CLR exception, and one dump collected during process shutdown.

    What is Included in the Output

    If you look in your Debugging Tools for Windows directory, you’ll notice that ADPlus is really just a very complex VBScript.  It generates quite a bit of useful information along with any problem dump files.  ADPlus_report.txt reports the configuration for ADPlus for this run.  This is handy if you need to know what it will do for a specific type of exception.

    Process_list.txt is generated from executing tlist.exe, another tool that comes with Debugging Tools for Windows.

    And finally, the CDBScripts subdirectory contains a .cfg file, which is the exact debugger script ADPlus generated and subsequently fed to CDB to get the job done. On my machine, when running against the sample application using the command line from the previous section, this file is named PID-0__Spawned0.cfg. If you’re ever curious or need to know exactly what ADPlus instructed the debugger to do, this file is the source.

     

    Note:    The reason the name contains a zero PID is because we used the -sc option to launch the application. Had we used the -p or -pn option, the PID in the filename would not be zero.

     

    I don’t recommend executing this debugger script in a live debugger, as doing so could overwrite the data that you have already collected. Instead, if you need to couple ADPlus with live debugging, you should use the -gs option, which I will describe shortly.

    Pinpointing a Specific Exception

    You’ll notice that in the previous run of ADPlus, quite a few dump files were generated and each one was fairly large. In reality, a first chance exception does not always indicate a fatal situation. In Accelerated C# 2008, I go into detail regarding how it is poor design practice to implement any sort of control flow, or otherwise reasonably expected behavior, using exceptions. This is because exceptions are supposed to be truly exceptional events! For more juicy details on how expensive exceptions are, I invite you to read an excellent blog post by Tess Ferrandez on the topic. At any rate, you may encounter situations where you get a lot of dumps for first chance exceptions you are not interested in, as this example shows.

    To alleviate this situation, you can create an ADPlus configuration file coupled with the !StopOnException command provided by the SOS extension to instruct ADPlus to filter out only the exceptions you’re interested in.  To do this, I created a configuration file named filter.config with the following contents:

    <ADPlus>

        <!-- Configuring ADPlus to log only exceptions we're interested in -->

      <Exceptions>
         <Config>
            <!-- This is for the CLR exception -->
           <Code> clr </Code>
           <Actions1> Log </Actions1>
           <CustomActions1> .loadby sos mscorwks; !StopOnException System.InvalidCastException 1; j ($t1 = 1) '.dump /ma /u c:\dumps\InvalidCastException.dmp; gn' ; 'gn'  </CustomActions1>
           <ReturnAction1> VOID  </ReturnAction1>
           <Actions2> Log </Actions2>
         </Config>
      </Exceptions>

    </ADPlus>

     

     

    Note:    The <CustomActions1> element above is meant to be in one line in the config file.

     

    The <CustomActions1> element is the element of interest in this configuration file.  This element allows you to specify which commands the debugger should execute on first chance exceptions.  Within this element, you can put any valid debugger commands (except windbg GUI-related commands).  If you need to execute multiple commands, as I have above, you simply separate them with semicolons.  In the <CustomActions1> element shown above, I first load the SOS extension using the .loadby command.  Then, I use the !StopOnException command in predicate mode to set the $t1 pseudo-register to 1 if the exception is of type System.InvalidCastException and 0 otherwise.  Then, the following j command is used to create a full dump if $t1 is 1 and do nothing otherwise.  The gn command in both paths of the j command tells the debugger to go unhandled so that the exception is propagated up rather than swallowed by the debugger.  If you were to go handled, thus swallowing the exception, the exception handlers in the code would never see it and the debugger would alter the behavior of the application.  And finally, note that the .dump command used to create the dump indicates the path where the dump will be stored.  In this case, I am placing it in the c:\dumps directory on my machine.

    Now you can provide this configuration file to ADPlus using the following command:

    adplus -crash -o c:\temp\test -FullOnFirst -c c:\temp\test\filter.config -sc c:\temp\test\test.exe

    Notice that there are now fewer dumps collected. Along with the dump for the InvalidCastException, ADPlus also captures a C++ exception dump as well as a dump when the application shuts down. If you open the C++ exception dump in the debugger and inspect the stack, it shows that the CLR is in the process of generating the InvalidCastException; the CLR catches the C++ exception and converts it into a managed exception. The C++ exception dump was generated because I left the -FullOnFirst option in the previous command line. You can eliminate it by removing -FullOnFirst.

    Using ADPlus during Live Debugging

    I already hinted at the -gs ADPlus command line option earlier in this post. This option allows you to generate the debugger scripts ADPlus would use without actually running them in the debugger. For example, if you execute the following command (I executed mine from my c:\temp\test directory):

    adplus -crash -o c:\temp\test -FullOnFirst -c c:\temp\test\filter.config -gs livedebug

    You will notice that ADPlus does not actually launch any debugger or your application.  Rather, it creates a subdirectory named livedebug.  When you go into that directory, you’ll notice it looks similar in layout to the crash directories created in the previous demonstration.  On my machine, I end up with the following two files:

    C:\temp\test\livedebug\ADPlus_report.txt
    C:\temp\test\livedebug\CDBScripts\PID-0__livedebug.cfg

    The PID-0__livedebug.cfg file is actually a debugger script file containing debugger commands.  All we have to do now is launch our test application in the debugger and then execute this script.  Within my c:\temp\test directory, I can launch the debugger using the following command:

    windbg test.exe

    Once inside the debugger, I can invoke the ADPlus debugger script by executing the following $$< command:

    $$<C:\temp\test\livedebug\CDBScripts\PID-0__livedebug.cfg

    Once you execute the $$< command, you will notice that the script takes over and performs the same actions as ADPlus does when run conventionally from the command line.

    As a further experiment, edit the filter.config file and remove the gn commands.  Now the debugger will wait for user input after executing the custom commands rather than continuing execution of the application.  This could be handy if you want the opportunity to perform debugging by hand if ADPlus encounters a certain situation.

    First Chance vs. Second Chance Exceptions

    Throughout this post, I have been focusing on first chance exceptions for the sake of illustration.  However, many times, you are only interested in the truly unhandled exceptions.  In that case, you would want to capture only second chance exceptions.  You certainly would want to capture first chance exceptions if you needed to capture the state at the exact point the exception was thrown and before the OS searches for any handlers.  Additionally, if you suspect that the application you are debugging may be handling an exception inappropriately (maybe even blindly swallowing it), then you certainly want to catch first chance exceptions in that case.

    Conclusion

    In this blog post I have introduced you to ADPlus and the utility it provides when troubleshooting problems in the field. ADPlus lends itself well to capturing exceptional situations in the field since it requires no configuration changes on the affected machine, making it an easy pill for your customers to swallow. You may also find it very useful when working with your Quality Assurance team during the development phase. For example, how many times have you encountered a situation where a problem only presents itself on a dusty old rarely-used machine in the back corner of a random QA engineer’s office? How many times, in such situations, have you felt like the only way to troubleshoot the problem effectively is to install the Visual Studio debugger and start working on that machine? Furthermore, what if the problem only happens on that dusty old machine about once a week? ADPlus can help you avoid that madness by providing an easy mechanism for capturing full process dumps on the troubled machine so you can then take those dumps to your trusty development machine for further debugging and analysis.

    Have fun out there!

     


    XPERF DEMO: Michael Morales demonstrates the Windows Profiling Tool


    In this video, Michael introduces the sample-based profiling tool named XPERF. He also demonstrates how to use the tool to uncover high DPCs and Disk I/O. If you would like to see more XPERF blogs on ntdebugging, please send your feedback using the email link at the top of the blog site. Feel free to leave comments too.

    Location of the video: http://ittv.net/DesktopModules/UltraVideoGallery/uvg.swf?vId=419&portalId=0

    BTW - I’m working on making a downloadable version for off-line viewing. Stay tuned.

    Link to Microsoft Windows Performance Toolkit

    http://msdn.microsoft.com/en-us/performance/default.aspx

     

    Link to XPerf version 4.5. This ships with the Win7 SDK below.

    http://www.microsoft.com/downloads/details.aspx?displaylang=en&FamilyID=6db1f17f-5f1e-4e54-a331-c32285cdde0c

     

    We look forward to your feedback!

    Ron Stock


    WMI: "REFERENCES OF" query failure during Provider startup could be disastrous


    Hello, my name is Venkatesh Ganga. I’m a Senior Escalation Engineer on the Microsoft Platform Global Escalation Services team. In this blog I would like to talk about an interesting WMI issue I worked on a few months ago. The idea behind this blog is to discuss how WMI works under the hood. In this article I show a few call stacks, but the focus is really to point out WMI concepts.

     

    The issue was interesting. The customer encountered the error "Provider cannot perform the requested operation" for all WMI queries to any provider on some of his Windows XP machines. Also, the WMI Control (wmimgmt.msc) properties showed the following:

     

    Figure 1: [WMI Control properties dialog showing the failure; image not available]

     

    Moreover, nothing happened when he clicked on other tabs like Logging, Backup/Restore, etc. Running WMIDIAG showed the message “24 error(s) 0x80041024 - (WBEM_E_PROVIDER_NOT_CAPABLE) Provider cannot perform the requested operation”.

    All the standard WMI troubleshooting had been done before I was engaged, like rebuilding the WMI repository, checking DCOM permissions, etc. Nothing easily resolved the issue, so I had the customer send a machine image to Microsoft for a local repro of the problem. Initially I didn’t know where to start because there were multiple symptoms. I decided to trace the execution of the inbox provider CIMWin32.dll by issuing the WMI query “Select * from Win32_Processor”, find where it fails, and go from there.

    WMI maintains each provider’s physical implementation information, such as the CLSID of the provider’s COM object, the hosting model, etc., in an instance of the __Win32Provider system class. Each provider can support one or more of the types mentioned below. Based on the type(s) the provider supports, WMI will create an instance of each supported registration type.

    ·         __ClassProviderRegistration

    ·         __EventConsumerProviderRegistration

    ·         __EventProviderRegistration

    ·         __InstanceProviderRegistration

    ·         __MethodProviderRegistration

    ·         __ObjectProviderRegistration

    ·         __PropertyProviderRegistration

    For example, looking at the Windows inbox .mof file for the CIMWin32 provider, we can quickly determine that the provider registers for two types, the Instance Provider and the Method Provider, as Figure 2 displays. The inbox .mof files are usually stored in c:\Windows\System32\wbem.

     

    Figure 2: [CIMWin32 provider registration in the .mof file, showing the Instance and Method provider registrations; image not available]

     

    In Memory at Runtime 

    Since our test query “select * from Win32_Processor” retrieves the instances of the Win32_Processor class, WMI checks CIMWin32’s __InstanceProviderRegistration object to see if the SupportsEnumeration property is set. While debugging the customer’s image, I found that the runtime object WMI maintains for the CIMWin32 provider had the SupportsEnumeration property set to “False”.

     WMI maintains an in-memory structure for each provider and fills that structure with information from the repository. CreateInstanceEnumAsync checks the provider’s in-memory structure to determine whether it supports enumeration. From my debugging on the customer’s system, I determined the provider did not support enumeration.


    Here's a snippet from my debugging:

    0:003> kL

    ChildEBP RetAddr

    0091f560 77e7a1ac  wmiprvse!CInterceptor_IWbemSyncProvider::CreateInstanceEnumAsync+0x21

    0091f588 77ef421a  rpcrt4!Invoke+0x30

    0091f998 77ef4bf3  rpcrt4!NdrStubCall2+0x297

    0091f9f0 756bd7fe  rpcrt4!CStdStubBuffer_Invoke+0xc6

    0091fa04 77600c31  fastprox!CBaseStublet::Invoke+0x22

    0091fa44 77600bdb  ole32!SyncStubInvoke+0x33

    0091fa8c 7750f237  ole32!StubInvoke+0xa7

    0091fb64 7750f15c  ole32!CCtxComChnl::ContextInvoke+0xe3

    0091fb80 77600b11  ole32!MTAInvoke+0x1a

    0091fbb0 776009bc  ole32!AppInvoke+0x9c

    0091fc84 77600715  ole32!ComInvokeWithLockAndIPID+0x2e0

    0091fcd0 77e79c75  ole32!ThreadInvoke+0x1cd

    0091fd04 77e79bda  rpcrt4!DispatchToStubInC+0x38

    0091fd58 77e79b06  rpcrt4!RPC_INTERFACE::DispatchToStubWorker+0x113

    0091fd7c 77e89f9c  rpcrt4!RPC_INTERFACE::DispatchToStub+0x84

    0091fdbc 77e89fdd  rpcrt4!RPC_INTERFACE::DispatchToStubWithObject+0xc0

    0091fdfc 77e7be65  rpcrt4!LRPC_SCALL::DealWithRequestMessage+0x2cd

    0091fe20 77e76794  rpcrt4!LRPC_ADDRESS::DealWithLRPCRequest+0x16d

    0091ff80 00000000  rpcrt4!LRPC_ADDRESS::ReceiveLotsaCalls+0x28f

    Repository 

    In contrast, when we checked the repository on the customer’s system, we determined that it showed the provider’s SupportsEnumeration property set to True. Verification steps below:

    1.       Run wbemtest and connect to root\cimv2

    2.       Enter the following query:

                                    select * from __InstanceProviderRegistration where provider="__Win32Provider.Name=\"CIMWin32\""

    3.       After running the query, you should see one object in the Query Result window. Double-click it and look at the properties. On a working system you will see the SupportsEnumeration property is set to “True” for the CIMWin32 provider, as Figure 3 shows.

     

    Figure 3: [wbemtest showing SupportsEnumeration set to True for the CIMWin32 provider; image not available]

     

    Discrepancy Between the In-Memory and Repository

    Now we know WMI is not initializing the provider correctly because there is a discrepancy between the in-memory objects it maintains, and the repository.

    So I started debugging the CIMWin32 provider initialization, where WMI fills in the in-memory structure. During provider initialization, WMI makes a REFERENCES OF query to get all the types CIMWin32 supports. On our customer’s machine, this query returned zero objects, whereas on a working system it returned two objects (one for __InstanceProviderRegistration and the other for __MethodProviderRegistration). Here is the query:

    "references of {__Win32Provider.Name="CimWin32"}"


    Here is the stack showing how we found the exact “REFERENCES OF” query being issued.

    0:007> kL

    ChildEBP RetAddr 

    0164fa38 59841a6c wbemcore!CWbemNamespace::ExecQuery

    0164fa88 59841c41 wmiprvsd!CServerObject_ProviderRegistrationV1::QueryRepositoryUsingQuery+0x47

    0164faac 59841d31 wmiprvsd!CServerObject_ProviderRegistrationV1::QueryRepository+0x42

    0164fac8 5980f056 wmiprvsd!CServerObject_ProviderRegistrationV1::Load+0x33

    0164fba8 762e9cfd wmiprvsd!CServerObject_BindingFactory::GetProvider+0x1f5

    0164fc24 76301d1e wbemcore!CWbemNamespace::DynAux_ExecQueryExtendedAsync+0x81

    0164fce0 76302912 wbemcore!CQueryEngine::ExecComplexQuery+0x1f4

    0164fdec 763032bd wbemcore!CQueryEngine::ExecQlQuery+0x37

    0164fe90 762fc769 wbemcore!CQueryEngine::ExecQuery+0x228

    0164feac 762cef24 wbemcore!CAsyncReq_ExecQueryAsync::Execute+0x19

    0164fed8 762ced4e wbemcore!CCoreQueue::pExecute+0x3c

    0164ff08 762f25cb wbemcore!CCoreQueue::Execute+0x18

    0164ff50 762cee89 wbemcore!CWbemQueue::Execute+0xf6

    0164ff84 762cf055 wbemcore!CCoreQueue::ThreadMain+0x111

    0164ffb4 7c80b683 wbemcore!CCoreQueue::_ThreadEntry+0x45

    0164ffec 00000000 kernel32!BaseThreadStart+0x37

    0:044> du 04f57424

    04f57424  "references of {__Win32Provider.N"

    04f57464  "ame="CIMWin32"}"

    The repository driver, repdrvfs.dll, processes the query above. The references query failed with error 80041017 (WBEM_E_INVALID_QUERY) due to an apostrophe in the computer name. While parsing the query, the repository driver considers anything between apostrophes, or quotes, to be a name. The apostrophe in the computer name misleads WMI into looking for a closing apostrophe, which is not there, so the references query fails.


    0:005> kL
    ChildEBP RetAddr

    05cbfec8 75214a74 repdrvfs!CNamespaceHandle::ExecReferencesQuery+0x1dd

    05cbff88 75215518 repdrvfs!CNamespaceHandle::ExecQuerySink+0xc2

    05cbffac 7520e11c repdrvfs!CExecQueryObject::Execute+0x28

    05cbffb4 7c80b683 repdrvfs!A51FiberBase+0xd

    05cbffec 7c82ffa9 kernel32!BaseThreadStart+0x37

    05cbffec 00000000 kernel32!BaseFiberStart+0x17

     

    0:005> du 00b56218
    00b56218 "references of {\\COMPUTERNAME'User1"
    00b56258 "\ROOT\cimv2:__Win32Provider.Name"
    00b56298 "="CIMWin32"}"



    WMI makes a similar REFERENCES OF query for every provider to get the types it supports, and they all fail for the same reason. Hence, WMI queries to all the providers fail. We renamed the computer to remove the apostrophe and rebooted the box. After the rename, WMI initialized properly and the WMI queries started working again. An apostrophe is not a valid character in a computer name; unfortunately, we never found out how the computer name got set with one. If you try to set such a computer name through the GUI, it throws an error saying the computer name has invalid characters. Hopefully, through this experience, you have a better understanding of how WMI implements providers, to help you better troubleshoot your next WMI issue.

    Here are the main takeaways from this article:

    1.       WMI maintains an instance of the __Win32Provider system class for each provider. Based on the provider’s registration (see the .mof file above), it maintains one or more registration system class instances such as __InstanceProviderRegistration.

    2.       WMI uses the REFERENCES OF query to map the instance of __Win32Provider to the provider’s registration types (e.g. __InstanceProviderRegistration).

    3.       To view the various providers registered on your system, connect to root\Cimv2 and issue the following query: “Select * from __Win32Provider”.

    4.       And to get the types that each provider supports, issue the following query: references of {__Win32Provider.Name="<PROVIDER_NAME>"}

     


    Ntdebugging on Twitter


    We're now on Twitter @ http://twitter.com/ntdebugging. If you want to chime in with feedback or suggestions for future articles, feel free to drop us a "tweet". I thought this would give us a better venue to interact with you guys.

     Cheers,

    Ron Stock
