• Ntdebugging Blog

    Reversing in Reverse, Part 2: More Linked-List Pool Corruption

    • 3 Comments

    Hello - It's Ryan again with the second installment of my list corruption walkthrough. The previous blog post is here - Reversing in Reverse: Linked-List Pool Corruption, a Complete Walkthrough (Part 1) 

    In part one we walked through the analysis of a memory.dmp collected during a bugcheck caused by pool corruption. The post also discussed doubly linked lists and demonstrated an unconventional order of debugging steps in which we did not begin our examination with the backtrace of the bad pointer value.

    Today’s continuation will consist of another crash dump 'debugging walkthrough' explaining a lot of the typical commands used for debugging. This one involves pool corruption affecting a linked list which led us to a kernel-mode crash.  I will also discuss the removal of an entry from the head of a linked list.  As in the previous post, I shall provide demonstrations of the topics within the debugger in an attempt to relate the information to a real-world problem. Despite the title, we won't be working in reverse today.

     As is typical, I began this windbg session with a quick  !analyze -v to get a feel for what went wrong.

    IRQL_NOT_LESS_OR_EQUAL (a)   An attempt was made to access a pageable (or completely invalid) address at an interrupt request level (IRQL) that is too high.  This is usually caused by drivers using improper addresses. If a kernel debugger is available get the stack backtrace.
    Arguments:
    Arg1: 00000004, memory referenced  <--
    Looks like a BAD pointer
    Arg2: d0000002, IRQL
    Arg3: 00000001, bitfield :
                            bit 0 : value 0 = read operation, 1 = write operation
                            bit 3 : value 0 = not an execute operation, 1 = execute operation (only on chips which support this level of status)
    Arg4: 808436e8, address which referenced memory
    TRAP_FRAME:  f7906cdc -- (.trap 0xfffffffff7906cdc)

    The output from this command lets us know that the code trapped accessing the invalid address 0x00000004. When we crash attempting to access low addresses like this, it is almost always due to dereferencing a null pointer. Such invalid address values are often obtained by adding some value to the pointer in an attempt to access fields in the structure that the pointer should have been referencing.  Recall in part one of this blog, the invalid address was also close to zero. The invalid address in the previous blog’s memory.dmp was obtained by subtracting negative values from a null pointer.   For example, we may find in our analysis today that we attempted to access a structure field that was located at offset 0x4 within the structure.  However, as I mentioned in part one, there are other ways that we could have ended up this value besides a NULL pointer. Some examples would include a hardware problem, following a corrupted pointer to a location that had some bad value, or some code improperly writing some value over a pointer.  Let’s start our analysis and identify what happened.

    Dumping the stack without setting the trap frame shows the code executing within the PageFault Trap Handler

    0: kd> kC

    nt!_KiTrap0E   ß We were in the handler for Trap 0xE

    nt!MiRemoveUnusedSegments

    nt!MiDereferenceSegmentThread

    nt!PspSystemThreadStartup

    nt!KiThreadStartup

     

    0: kd> .formats 0xE 

    Evaluate expression:

      Hex:     0000000e

      Decimal: 14  ßTrap number 14 which per the x86 Intel Manuals is a PageFault Trap

     

    0: kd> rcr2    ßPageFault trap results in the address that triggered the trap being stored in the CR2 register 

    Last set context:

    cr2=00000004  ß Here is the invalid address

     

    Next, we’ll set the trap frame 

    0: kd> .trap 0xfffffffff7906cdc

    After setting the trap frame, I looked at the stack again just to get a quick overview of what calls were made in this thread:

    0: kd> kC

    nt!MiRemoveUnusedSegments  <-- This is the function executing when the trap occurred.  Take notice of the function name.

    nt!MiDereferenceSegmentThread

    nt!PspSystemThreadStartup

    nt!KiThreadStartup

     

    Next, I went straight to the assembly to explore what happened. I’ll begin by examining the faulting instruction.

    0: kd> r

    Last set context:

    eax=00000000 ebx=80a5e540 ecx=808ab4a8 edx=00000000 esi=86f4c658 edi=80a5e4d0

    eip=808436e8 esp=f7906d50 ebp=f7906d90 iopl=0         nv up ei ng nz ac po cy

    cs=0008  ss=0010  ds=0023  es=0023  fs=0030  gs=0000             efl=00010293

    nt!MiRemoveUnusedSegments+0x716:

    808436e8 c74004a8b48a80  mov     dword ptr [eax+4],offset nt!MmUnusedSegmentList (808ab4a8) ds:0023:00000004=????????

     

    As you can see, the trap was triggered by attempting to add the value four to the null value in EAX. Also, I would like to point out that parameter number three of the bugcheck provides supporting information which is sometimes important in developing a clear understanding of what occurred: 

    0: kd> .bugcheck

    Bugcheck code 0000000A

    Arguments 00000004 d0000002 00000001 808436e8

     

    This is illustrated by the output of !analyze –v

    Arguments:

    Arg1: 00000004, memory referenced

    Arg2: d0000002, IRQL

    Arg3: 00000001, bitfield :

          bit 0 : value 0 = read operation, 1 = write operation

          bit 3 : value 0 = not an execute operation, 1 = execute operation (only on chips which support this level of status)

    Arg4: 808436e8, address which referenced memory

     

    0: kd> .formats 0x1

    Evaluate expression:

      Binary:  00000000 00000000 00000000 00000001   ßWrite Operation, Not an execute operation.

     

    If you do not need to see the entire output of !analyze –v, you can get a very abbreviated version of the !analyze–v output by using !analyze –a as displayed below.

    0: kd> !analyze -a

    ****************************

    Bugcheck Analysis                              

    ****************************

    Use !analyze -v to get detailed debugging information.

    BugCheck A, {4, d0000002, 1, 808436e8}

    Probably caused by : memory_corruption ( nt!MiRemoveUnusedSegments+716 )

    Followup: MachineOwner

     

    You can get further information about this bugcheck by typing .hh bug check 0xA into the debugger

    windbg> .hh bug check 0xA

    This will bring up the following screen which displays helpful information.

     Help File

    Now that we understand that this crash was due to a NULL value present in EAX, the next goal of this postmortem memory.dmp debug session becomes identification of the source of this NULL value. This task can often prove extremely challenging due to the fact that it sometimes involves backtracing through several highly optimized functions full of jumps without symbols. Other times, we may be at the beginning of a very simple function for which we have private symbols. Let's see how we luck out today.

     I’ll proceed by disassembling the code around the point of the trap. Here is my favorite command for accomplishing this. 

    0: kd> ub @$ip L4;u . L3;r$ip

    nt!MiRemoveUnusedSegments+0x70a

    808436dc 8bf0            mov     esi,eax       

     808436de 8b06            mov     eax,dword ptr [esi]    

    The first  two instructions (highlighted)  together accomplish the following “Overwrite the value in EAX with the data that EAX is presently pointing to”

    808436e0 a3a8b48a80      mov     dword ptr [nt!MmUnusedSegmentList (808ab4a8)],eax

    808436e5 83c6fc          add     esi,0FFFFFFFCh

    nt!MiRemoveUnusedSegments+0x716

    808436e8 c74004a8b48a80  mov     dword ptr [eax+4],offset nt!MmUnusedSegmentList (808ab4a8)    ßHowever EAX was unexpectedly NULL here.

    808436ef ff0d8cb48a80    dec     dword ptr [nt!MmUnusedSegmentCount (808ab48c)]

    808436f5 33c9            xor     ecx,ecx

     Last set context:

    $ip=808436e8

     

    0: kd> reax

    Last set context:

    eax=00000000

     

    In reviewing the code above, I observed that we had just moved this value from EAX to nt!MmUnusedSegmentList two instructions ago.  To summarize, the code moved a value (that obviously wasn’t expected to be NULL)  to nt!MmUnusedSegmentList and then dereferenced this value (plus four) which of course caused us to crash because 0x4 is not a valid address.

     Let's take a quick look at nt!MmUnusedSegmentList

    0: kd> x nt!MmUnusedSegmentList

       808ab4a8 nt!MmUnusedSegmentList = <no type information>

     

    0: kd> dt 808ab4a8 nt!_LIST_ENTRY

         [ 0x0 - 0x86f05234 ]

           +0x000 Flink            : (null)

                +0x004 Blink            : 0x86f05234 _LIST_ENTRY [ 0x808ab4a8 - 0x8724392c ] 

     

    MmUnusedSegmentList is the head of a doubly-linked list. You can see that the value that we just moved to this list head's Flink(forward link) entry is NULL.  As covered in part one of this blog, a Flink is a pointer to the next _LIST_ENTRY in the list. A Blink (backwards link) is an entry to the previous _LIST_ENTRY in the list . In this case we moved a NULL Flink from EAX into the list head (MmUnusedSegmentList->Flink) and crashed trying to dereference offset 4 from this same NULL value. The flink of the list head should contain a pointer to the first _LIST_ENTRY. The flink and blink should never contain NULL values. If the list is empty, both entries will be pointing to the list head ListHead->Flink and ListHead->Blink will both contain a pointer to the ListHead itself.

    The symbols have provided us with some clues to what the assembly was doing. Based on the name of this list head MmUnusedSegmentList, the fact that the code seems to be decrementing MmUnusedSegmentCount, and the function name MiRemoveUnusedSegments, it appears obvious that we trying to remove an entry from a list of unused segments. Also, I can tell that we are in a memory manager function by the Mm function prefix. You can find a list of Commonly Used Prefixes in chapter two of Mark Russinovich's book "Windows Internals 4th Edition".

     It is a common operation when working with a linked list to remove an entry from the beginning of a doubly linked list of LIST_ENTRY structures.  I’ll explain this by providing a simplified fictitious example. Let's pretend that we have a doubly linked list that is used in keeping track of unused Widgets.  Let’s also pretend that we have a function that code can call as follows:

    pWidget = giveMeAWidget ()

    This function first finds the head for the list of unused widgets which we are going to stored in a global variable called UnusedWidgetListHead. It then finds an available widget to return by following UnusedWidgetListHead->Flink. Recall from part one of this blog, that unless the Linked List was @ offset 0x0 from the start of the WIDGET structure, the code would have to subtract the offset of the LIST_ENTRY to reach the base of the actual WIDGET structure.  Before returning a pointer to the widget back to the calling code, it will be necessary to remove this widget from the doubly linked list of unused widgets since this widget will now be in use.  Ignoring possible synchronization requirements, the process of removing this first entry from the list would typically involve code that sets ListHead->Flink to point to the second entry in the list instead of the first entry. It would also be necessary to update the second widget’s blink member to point to the ListHead instead of the first entry that it pointed to before. This would now make entry number two the first widget therefore removing entry number one from the list.

     Before 

    After

    While  reviewing the code, I noticed that a pattern in the assembly that matched the type of operation that I described above where we are removing the first entry from the list.  Let's review the assembly again, this time in more detail. We will break down the instructions to identify what transpired.  You can use . or @$ip to represent the current instruction pointer. In this case, we I’ll use the pseudo-register @$ip.

     

    0: kd> ub @$ip L4;u @$ip L1

    nt!MiRemoveUnusedSegments+0x70a

    808436dc 8bf0            mov     esi,eax               This is copying the ListHead->Flink to esi. The code uses the ListHead flink to find ListMember1->Flink (Not an actual name, I am simply referring to the _LIST_ENTRY as ListMember# just to demonstrate what is taking place). Why does it need ListMember1->Flink? Because it points to ListMember2 which it needs to convert to the new ListMember1 in the manner described above. This is very important for our debugging since it may be able to obtain the value of Member1 from esi.

    808436de 8b06            mov     eax,dword ptr [esi]    follow this flink that it just moved by dereferencing it to obtain ListMember1->Flink and place it in eax. This should be a pointer to ListMember2, however it was null.

    808436e0 a3a8b48a80      mov     dword ptr [nt!MmUnusedSegmentList (808ab4a8)],eax  move this null value to ListHead->Flink. This operation should be setting the list head to point to Member2, therefore converting it to Member1. However since Member1->Flink was null, it now contains a NULL value.

    808436e5 83c6fc          add     esi,0FFFFFFFCh    Earlier I mentioned that the code might be able to obtain the address of Member1 from ESI since esi had the ListHead->flink pointing to it, however ESI was modified here. This means that it can't rely on the value of esi. The code added 0x0ffffffC here;  however,  this was really just a compiler optimized (fancy) way of subtracting four.   It doesn't care why that it's subtracting four from this value since the code crashed before ever using address-4, so I won't be investigating that today. However, I suspect we were subtracting the offset of the following field.

    0: kd> dt nt!_CONTROL_AREA DereferenceList

       +0x004 DereferenceList : _LIST_ENTRY

    Instead of digging into that, we simply want to identify the value of the list member. To obtain our value here, we will perform the reverse of the addition operation and simply subtract 0x0FFFFFFFC as a fancy way to add four :                

     

    0: kd> resi

    Last set context:

    esi=86f4c658

     

    We effectively added 4 to get the original value of esi

     

    0: kd> ? 86f4c658-0x0FFFFFFFC

    Evaluate expression: -2030778788 = 86f4c65c

     

    So this address should contain a NULL flink

     

    0: kd> dd 86f4c65c L2        

    86f4c65c  00000000 808ab4a8

    And there it is (the underlined value above) ....

    And as expected in the output above, the flink was null. Also note that the blink (highlighted) was in fact pointing to 808ab4a8, which is the list head. So this does appear to be the address of the original member1. If you can't recall the address of the list head, don't scroll up in the debugger text (or this blog text), we can prove this quickly as follows;

    0: kd> dd nt!MmUnusedSegmentList L1

    808ab4a8  00000000

     

    Let’s dump the address out:

    0: kd> !address 86f4c65c 

      83041000 - 07fbf000                           

              Usage       KernelSpaceUsageNonPagedPool

    0: kd> !pool 86f4c65c 

    Pool page 86f4c65c region is Nonpaged pool

     86f4c000 size:   98 previous size:    0  (Allocated)  File (Protected)

    86f4c098 is not a valid large pool allocation, checking large session pool...

    bf7f4000: Unable to get contents of pool block

     

    So we are dealing with pool corruption. Now that we have followed the NULL value in the corrupt pool and loaded EAX with zero, lets proceed with the inspection of the next instruction which caused the trap leading to the bugcheck:

     

    nt!MiRemoveUnusedSegments+0x716 808436e8

    c74004a8b48a80  mov     dword ptr [eax+4],offset nt!MmUnusedSegmentList (808ab4a8)    This instruction was attempting to place the address of the List Head into eax+4.  Why? Well, EAX has the new Member1. If we add four to this value, this would bring us to Member1->Blink which should be pointing to the list head. Had the new Member1 actually been Member2 instead of NULL, then we would be overwriting the pointer to Member1 with a pointer to the list head. However, the pointer had been zeroed out and that brought us to address 4. Next, we trapped and the system bugchecked. As discussed earlier, we died on a write operation attempting to write a value to the address referenced by [EAX+4]

    For completeness, let’s review what the next two instructions would have executed had we not trapped

    08436ef ff0d8cb48a80    dec     dword ptr [nt!MmUnusedSegmentCount (808ab48c)] If the instruction above would not have trapped, then we would have decremented the count. However we did not make it this far. If you remove an item from a linked list, you should decrement the value of any variable tracking the number of items in such list. This instruction would have accomplished this.

    808436f5 33c9            xor     ecx,ecx          This would have simply zeroed out ecx.

    So this crash was in fact due to a null pointer. More specifically, it was caused by a null flink.  Let's dump the linked list located at nt!MmUnusedSegmentCount  using the dlb just to see what happens. We won't be able to go forward since the flink is null; however, we should be able to dump the linked list going backwards.  If the list loops back on itself or if a null pointer is encountered, the dl command will stop traversing the list.

    First, I would like to know how large this list is since the dl command accepts a value that limits its length.

    The following global should tell us how many member are in this list.

    0: kd> x nt!*UnusedSegmentCount*

    808ab48c nt!MmUnusedSegmentCount = <no type information>

     

    0: kd> dd 808ab48c L1

    808ab48c  0000e7ce 

     

    0: kd> .formats 0xe7ce

    Evaluate expression:

      Hex:     0000e7ce

      Decimal: 59342

     

    This is a huge list, so let’s see if we can traverse it. Based on the size above, I dumped out the list using ffff for the limit.

    0: kd> dlb nt!MmUnusedSegmentList ffff 2

    808ab4a8  00000000 86f05234

    86f05234  808ab4a8 8724392c

    8724392c  86f05234 8867fd94

    8867fd94  8724392c 877d600c

    877d600c  8867fd94 87849944

    .....

    .....

    (omitting lengthy unneeded output, all links were valid)

    .....

    .....

    86ecbaac  86f04354 86f95664

    86f95664  86ecbaac 86f4c28c

    86f4c28c  00000000 00000000  <-- After over a minute of output, we have a null pointer, but it's at a different address.

     

    First Address we found to be corrupt:         86f4c65c 

    Second Address we found to be corrupt:    86f4c28c 

    Now that we have located the address above which appear to be incorrectly zeroed out. Let’s dump out these areas in an attempt to get more information.   It would be great if we could verify if the pool that we are using is corrupt. It would be even better if we  some pointers, text, or other clues that may lead us closer to the problem.  Let's dump the two addresses out in various ways. I’ll start by examining the pool for corruption.

     The !pool extension as used below displays information about a specific pool allocation

     

    0: kd> !pool 86f4c65c;!pool 86f4c28c

    Pool page 86f4c65c region is Nonpaged pool

     86f4c000 size:   98 previous size:    0  (Allocated)  File (Protected)

    86f4c098 is not a valid large pool allocation, checking large session pool...

    bf7f4000: Unable to get contents of pool block

    Pool page 86f4c28c region is Nonpaged pool

     86f4c000 size:   98 previous size:    0  (Allocated)  File (Protected)

    86f4c098 is not a valid large pool allocation, checking large session pool...

    bf7f4000: Unable to get contents of pool block

     

    The !poolval extension analyzes the headers for a pool page and diagnoses any possible corruption.

     

    0: kd> !poolval 86f4c65c;!poolval 86f4c28c

    Pool page 86f4c65c region is Nonpaged pool

    Validating Pool headers for pool page: 86f4c65c

    Pool page [ 86f4c000 ] is __inVALID.

    Analyzing linked list...

    [ 86f4c000 --> 86f4c6c0 (size = 0x6c0 bytes)]: Corrupt region

    Scanning for single bit errors...

    None found

    Pool page 86f4c28c region is Nonpaged pool

    Validating Pool headers for pool page: 86f4c28c

    Pool page [ 86f4c000 ] is __inVALID.

    Analyzing linked list...

    [ 86f4c000 --> 86f4c6c0 (size = 0x6c0 bytes)]: Corrupt region

    Scanning for single bit errors...

    None found

     

    We can also dump out the memory around the addresses in question using various commands.  As discussed previously, we are seeking clues in text format, pointers, etc.  The output from the following command shows that the memory is mostly zeroed.  One value appears to be present. Perhaps the entire region was overwritten and then this value was updated. Dd dumps the raw data as dwords. Dc will dump the data as type char.

     

    0: kd> dd 86f4c65c-100 86f4c65c+100;dd 86f4c28c-100 86f4c28c+100

    ….<omitting zeros>

    86f4c63c  00000000 00000000 00000000 00000000

    86f4c64c  00000000 00000000 00000000 00000000

    86f4c65c  00000000 808ab4a8 00000000 00000000

    86f4c66c  00000000 00000000 00000000 00000000

    86f4c67c  00000000 00000000 00000000 00000000

    86f4c68c  00000000 00000000 00000000

    ….<omitting zeros>

    86f4c26c  00000000 00000000 00000000 00000000

    86f4c27c  00000000 00000000 00000000 00000000

    86f4c28c  00000000 00000000 00000000 00000000

    86f4c29c  00000000 00000000 00000000 00000000

    ….<omitting zeros>

     

          0: kd> dc 86f4c28c-1000 86f4c65c

          <ommitting>

    86f4c62c  00000000 00000000 00000000 00000000  ................

    86f4c63c  00000000 00000000 00000000 00000000  ................

    86f4c64c  00000000 00000000 00000000 00000000  ................

    86f4c65c  00000000                            ....

    0: kd> dc

    86f4c660  808ab4a8

     

    We can see from the command below that we are in fact dealing with a NonPagedPool address range.

    0: kd> !address 86f4c65c;!address 86f4c28c

      83041000 - 07fbf000                          

              Usage       KernelSpaceUsageNonPagedPool

      83041000 - 07fbf000                           

              Usage       KernelSpaceUsageNonPagedPool

     

    0: kd> dps 86f4c65c-8 86f4c65c+8

    86f4c654  00000000

    86f4c658  00000000

    86f4c65c  00000000

    86f4c660  808ab4a8 nt!MmUnusedSegmentList

    86f4c664  00000000 

     

    Just as in the previous dump, in order to identify the source of the pool corruption we need to use Special Pool.  Special Pool will use guard pages to catch a buffer overrun or underrun and should provide us with a dump that shows the code that causes the corruption.  You can find more information on Special Pool in KB188831.  Also, in some situations memory corruption can be caused by a driver overflowing on a DMA transfer causing corruption to physical pages rather than virtual pages as we see in typical pool. For more information on linked lists and list heads refer to the following MSDN article:

    Singly and Doubly Linked Lists- http://msdn.microsoft.com/en-us/library/aa489548.aspx

    For an example of a function that removes the first item of a linked list: http://msdn.microsoft.com/en-us/library/ms804330.aspx

    *RemoveHeadList()  -  The RemoveHeadList routine removes an entry from the beginning of a doubly linked list of LIST_ENTRY structures.

    Share this post :

  • Ntdebugging Blog

    Who is restarting my server?

    • 1 Comments

    Hello - This is Omer and I recently came across a case where the customer reported that they could not reboot into safe mode using their custom image. Whenever they booted into safe mode, the machine would get to the logon screen, wait for 5 seconds and then reboot regardless of any user input. Nothing was being logged in the event logs either, so it was very strange.

    At first it looked like the machine was going through a power cycle, since the shutdown was so quick (we would not see the usual shutdown messages like “Shutting down Services”, etc.). I thought maybe there was some issue with the hardware, but the customer reported that they had the same issue on every machine, regardless of the hardware vendor.

    To figure this out, I attached a kernel debugger to the machine, and broke in to make sure the connection was good. I then let the machine go, and it got to the logon screen. Sure enough, after 5 seconds the machine rebooted. I thought that I would run into some kind of exception, and the debugger would break, however nothing of the sort happened. The only message that I got was that the following

    Shutdown occurred at (Fri Jun 26 17:27:12.714 2009 (GMT-7))...unloading all symbol tables.

    Very strange! The OS disconnected the debugger gracefully. I did a quick source code review and found that one of the places that we disconnect the debugger was in the system shutdown path. Maybe the OS was shutting down gracefully, but since it happened so fast, it looked like a power cycle. To test my theory, I put a breakpoint on nt!NtShutdownSystem to see if it was being called, and find the caller as well. Rebooted the machine, and let it rip.

    nt!NtShutdownSystem()

    nt!KiSystemServiceCopyEnd()+0x13

    ntdll!ZwShutdownSystem(void)+0xa

    services!ScRevertToLastKnownGood()+0x1af

    services!ScStartMarkedServices()+0x154

    services!ScStartServiceAndDependencies()+0x43d

    services!ScAutoStartServices()+0x225

    services!SvcctrlMain()+0xa75

    services!main()+0x31

    services!__mainCRTStartup()+0x13d

    kernel32!BaseThreadInitThunk()+0xd

    ntdll!RtlUserThreadStart()+0x1d

     

    Voila! Services.exe is shutting down the system. Probably some service is not starting, which is then somehow causing the machine to shutdown. From the stack, I was able to figure out which service was not starting. Based on the service record, it was some third party remote assistance service.

    But, how could this non-critical service not starting successfully, cause the Service Control Manager to reboot the machine? And what is that stack frame about reverting to last known good (services!ScRevertToLastKnownGood()+0x1af) doing on the stack?

    Looking at the service record, I found that the SCM returned an error code 0x43c. This can be translated to ERROR_NOT_SAFEBOOT_SERVICE(This service cannot be started in Safe Mode). Also, the ErrorControl value for this service value was set to 0x2, which meant that if the service was not started successfully, the system needs to revert to the last known good configuration and reboot. However if the system was already using last known good, then it should just continue the boot process and log the error.

    Error                      Control                 Meaning

    Level

    0x3                         (Critical)               Fail the attempted system startup.

                                                                    If the startup is not using the

                                                                    LastKnownGood control set, switch to

                                                                    LastKnownGood. If the startup attempt

                                                                    is using LastKnownGood, run a bug-check

                                                                    routine.

    0x2                         (Severe)               If the startup is not using the

                                                                    LastKnownGood control set, switch to

                                                                    LastKnownGood. If the startup attempt

                                                                    is using LastKnownGood, continue on

                                                                    in case of error.

    0x1                         (Normal)             If the driver fails to load or initialize,

                                                                    startup should proceed, but display a

                                                                    warning.

    0x0                         (Ignore)                               If the driver fails to load or initialize,

                                                                    start up proceeds. No warning is displayed.

     

    Because the service’s ErrorControl value is set to 0x2, the machine would revert to the last known good configuration and silently reboot. I booted the machine normally, and changed the ErrorControl value in the registry.

    I also had to change the value in the other ControlSets, since they were identical to the current control set. This also explains why the machine kept rebooting every time, the value in the Last Known Good Configuration was also set incorrectly. L

    I rebooted the machine and was able to boot into safe mode normally. Hence, the mystery of the silent reboots was solved.

    Share this post :

  • Ntdebugging Blog

    Debugging a Bugcheck 0xF4

    • 1 Comments

    My name is Ryan Mangipano (ryanman) and I am a Sr. Support Escalation Engineer at Microsoft.  Today’s blog will be a quick walkthrough of the analysis of a bugcheck 0xF4 and how I determined that the action plan going forward should consist of enabling pool tagging on this system.

     

    I began my review with !analyze –v. From the output I can see that a process required for the system to function properly unexpectedly exited or was terminated. The goal of this debugging session will be to determine what failed and why.

     

    0: kd> !analyze -v

    *******************************************************************************

    *                                                                             *

    *                        Bugcheck Analysis                                    *

    *                                                                             *

    *******************************************************************************

     

    CRITICAL_OBJECT_TERMINATION (f4)

    A process or thread crucial to system operation has unexpectedly exited or been

    terminated.

    Several processes and threads are necessary for the operation of the

    system; when they are terminated (for any reason), the system can no

    longer function.

    Arguments:

    Arg1: 00000003, Process                                A value of 0x3 in this parameter indicated that it was a process that terminated, not a thread

    Arg2: 8a03ada0, Terminating object        This value is a pointer to the _EPROCESS object that terminated

    Arg3: 8a03af14, Process image file name              Process Name

    Arg4: 805d1204, Explanatory message (ascii)         text message about the problem

     

    We shall begin by dumping out all the parameters of the bugcheck. Let’s dump out the "Terminating Object" below

    0: kd> !object 8a03ada0

    Object: 8a03ada0  Type: (8a490900) Process

        ObjectHeader: 8a03ad88 (old version)

        HandleCount: 3  PointerCount: 228

     

    First, let’s dump out the process image file name from the bugcheck parameter 3 above.

    0: kd> dc 8a03af14

    8a03af14  73727363 78652e73 00000065 00000000  csrss.exe

     

     

    0: kd> dt _EPROCESS 8a03ada0 imageFileName

    CSRSRV!_EPROCESS

       +0x174 ImageFileName : [16]  "csrss.exe"

     

    Notice that if we add the base of the _EPROCESS object (8a03ada0- Parameter 2)to the offset of the imageFileName field (+0x174) we get parameter 3. The imageFileName field.

    0: kd> ? 8a03ada0+0x174

    Evaluate expression: -1979470060 = 8a03af14

     

    0: kd> dc 8a03af14

    8a03af14  73727363 78652e73 00000065 00000000  csrss.exe.......

     

    Let’s dump out the ASCII message from parameter number 4

    0: kd> dc 805d1204

    805d1204  6d726554 74616e69 20676e69 74697263  Terminating crit

    805d1214  6c616369 6f727020 73736563 25783020  ical process

     

    Let’s review the debugger help file for more information. We can see that this bugcheck occurs when a critical process or thread terminates. “Several processes and threads are necessary for the operation of the system. When they are terminated for any reason, the system can no longer function. “

    0: kd> .hh bug check 0xf4

     

     

     

     

     

     

    Next, we need to determine why this process terminated. !analyze –v also provided us with an exception record which provides us with an error code:

     

    PROCESS_NAME:  csrss.exe

     

    EXCEPTION_RECORD:  9a85e9d8 -- (.exr 0xffffffff9a85e9d8)

    ExceptionAddress: 7c92c375 (ntdll!RtlFindMessage+0x0000007c)

       ExceptionCode: c0000006 (In-page I/O error)

      ExceptionFlags: 00000000

    NumberParameters: 3

       Parameter[0]: 00000000

       Parameter[1]: 7c99c3d8

       Parameter[2]: c000009a

    Inpage operation failed at 7c99c3d8, due to I/O error c000009a

     

    EXCEPTION_CODE: (NTSTATUS) 0xc0000006 - The instruction at 0x%p referenced memory at 0x%p. The required data was not placed into memory because of an I/O error status of 0x%x.

     

    Since we have an error code, let’s investigate that error code. We can quickly perform this operation from within the debugger using the !error command

     

    0: kd> !error c000009a

    Error code: (NTSTATUS) 0xc000009a (3221225626) - Insufficient system resources exist to complete the API.

     

    0: kd > .hh !error

     

     

    Let’s check the output of the !vm command

     

    0: kd> !vm 2

     

    *** Virtual Memory Usage ***

                    Physical Memory:      760875 (   3043500 Kb)

                    Page File: \??\C:\pagefile.sys

                      Current:   4190208 Kb  Free Space:   4156380 Kb

                      Minimum:   4190208 Kb  Maximum:      4190208 Kb

                    Available Pages:      579241 (   2316964 Kb)

                    ResAvail Pages:       673481 (   2693924 Kb)

                    Locked IO Pages:          69 (       276 Kb)

                    Free System PTEs:     115226 (    460904 Kb)

                    Free NP PTEs:              0 (         0 Kb)

                    Free Special NP:           0 (         0 Kb)

                    Modified Pages:          221 (       884 Kb)

                    Modified PF Pages:       219 (       876 Kb)

                    NonPagedPool Usage:    65534 (    262136 Kb)

                    NonPagedPool Max:      65536 (    262144 Kb)

                    ********** Excessive NonPaged Pool Usage *****

                    PagedPool 0 Usage:     24167 (     96668 Kb)

                    PagedPool 1 Usage:       967 (      3868 Kb)

                    PagedPool 2 Usage:       967 (      3868 Kb)

                    PagedPool 3 Usage:       984 (      3936 Kb)

                    PagedPool 4 Usage:       977 (      3908 Kb)

                    PagedPool Usage:       28062 (    112248 Kb)

                    PagedPool Maximum:     92160 (    368640 Kb)

     

                    ********** 2075 pool allocations have failed **********

     

                    Session Commit:         1562 (      6248 Kb)

                    Shared Commit:          2526 (     10104 Kb)

                    Special Pool:              0 (         0 Kb)

                    Shared Process:         4821 (     19284 Kb)

                    PagedPool Commit:      28062 (    112248 Kb)

                    Driver Commit:          5138 (     20552 Kb)

                    Committed pages:      153449 (    613796 Kb)

                    Commit limit:        1767229 (   7068916 Kb)

     

    0: kd> !poolused

    unable to get PoolTrackTable - pool tagging is disabled, enable it to use this command

    Use gflags.exe and check the box that says "Enable pool tagging".

     

    The output above has informed us that pool tagging is disabled. Let’s demonstrate how you can verify that it is disabled:

     

    0: kd> dd nt!NtGlobalFlag L1

    805597ec  00000000

     

    0: kd> !gflag

    Current NtGlobalFlag contents: 0x00000000

     

    Let’s explore the debugging help file entry on the !poolused command

     

    0: kd > .hh !poolused

     

     

     

    Reading the text above, we are informed that “Pool tagging is permanently enabled on Windows Server 2003 and later versions of Windows. On Windows XP and earlier versions of Windows, you must enable pool tagging by using Gflags.”

     

     

    Using the vertarget command, I can see that this system was running Windows XP.

    0: kd> vertarget

    Windows XP Kernel Version 2600 (Service Pack 2) MP (2 procs) Free x86 compatible

     

     

     

    0: kd > .hh !gflag

     

     

     

    By reviewing the help file entry for the !gflag extension, I was able to determine that if pooltagging was set, the following bit would have been set:

     

    0x400 "ptg"         Enable pool tagging.

     

    0: kd> .formats 0x400

    Evaluate expression:

    ….

      Binary:  00000000 00000000 00000100 00000000                0x00000400

     

    Gflags is included in the Debugging Tools for Windows. The screenshot below is from a Windows 7 system. Notice that Pool Tagging is enabled permanently as described above.

     

     

     

    Summary: This system bugchecked when the critical process csrss.exe failed an I/O operation due to insufficient non-paged pool. For an action plan, we recommended the use of gflags to enable pool tagging in order to obtain more information about pool consumption.

     

     

    Share this post :

  • Ntdebugging Blog

    Debug Nugget: DumpConfigurator Utility

    • 0 Comments

    Hi - my name is Naresh and I am a Sr. Escalation Engineer on the Microsoft GES platforms team. Today I'm discussing a simple, yet powerful GUI tool used to configure a Windows system locally or remotely for a memory dump. The name of the tool is DumpConfigurator.hta and it can be accessed from CodePlex.  Check out the following Microsoft KB article which references the use of the tool.

    969028  How to generate a kernel or a complete memory dump file in Windows Server 2008

    http://support.microsoft.com/default.aspx?scid=kb;EN-US;969028

     

    The tool can be used with all currently supported versions of the Windows Operating System. Once you download it, launch it with Administrator privileges to get the following UI: 

    GUI

    The GUI is self-explanatory and all the settings can be edited and saved by clicking Save Settings. The system will have to be rebooted for the settings to take effect.
    NOTE: Read the Warranty Disclaimer for the tool before use:)

     

    Share this post :
Page 1 of 1 (4 items)