My name is Ryan Mangipano (ryanman) and I am a Sr. Support Escalation Engineer at Microsoft. Today’s blog will be a quick walkthrough of the analysis of a bugcheck 0xF4 and how I determined that the action plan going forward should consist of enabling pool tagging on this system.
I began my review with !analyze -v. From the output I can see that a process required for the system to function properly unexpectedly exited or was terminated. The goal of this debugging session will be to determine what failed and why.
0: kd> !analyze -v
*******************************************************************************
* *
* Bugcheck Analysis *
CRITICAL_OBJECT_TERMINATION (f4)
A process or thread crucial to system operation has unexpectedly exited or been
terminated.
Several processes and threads are necessary for the operation of the
system; when they are terminated (for any reason), the system can no
longer function.
Arguments:
Arg1: 00000003, Process <-- A value of 0x3 in this parameter indicates that a process terminated, not a thread
Arg2: 8a03ada0, Terminating object <-- Pointer to the _EPROCESS object that terminated
Arg3: 8a03af14, Process image file name <-- Process name
Arg4: 805d1204, Explanatory message (ascii) <-- Text message about the problem
We shall begin by dumping out all the parameters of the bugcheck. Let’s dump out the "Terminating Object" below
0: kd> !object 8a03ada0
Object: 8a03ada0 Type: (8a490900) Process
ObjectHeader: 8a03ad88 (old version)
HandleCount: 3 PointerCount: 228
Next, let’s dump out the process image file name from bugcheck parameter 3 above.
0: kd> dc 8a03af14
8a03af14 73727363 78652e73 00000065 00000000 csrss.exe
0: kd> dt _EPROCESS 8a03ada0 imageFileName
CSRSRV!_EPROCESS
+0x174 ImageFileName : [16] "csrss.exe"
Notice that if we add the offset of the ImageFileName field (+0x174) to the base of the _EPROCESS object (8a03ada0, parameter 2), we get parameter 3, the address of the ImageFileName field.
0: kd> ? 8a03ada0+0x174
Evaluate expression: -1979470060 = 8a03af14
8a03af14 73727363 78652e73 00000065 00000000 csrss.exe.......
Let’s dump out the ASCII message from parameter number 4
0: kd> dc 805d1204
805d1204 6d726554 74616e69 20676e69 74697263 Terminating crit
805d1214 6c616369 6f727020 73736563 25783020 ical process
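The dc output above prints each dword in hex with its ASCII rendering beside it. As a minimal sketch of how those little-endian dwords map back to text, the helper below unpacks them byte by byte. The function name is hypothetical and is not part of any debugger API.

```c
#include <stddef.h>
#include <stdint.h>

/* Unpack little-endian dwords (as printed by dc or dd) into a C string.
   Stops at the first zero byte. out must hold count * 4 + 1 bytes. */
static void dwords_to_ascii(const uint32_t *dwords, size_t count, char *out)
{
    size_t n = 0;
    for (size_t i = 0; i < count; i++) {
        /* The lowest byte of each dword sits at the lowest address. */
        for (int shift = 0; shift < 32; shift += 8) {
            char c = (char)((dwords[i] >> shift) & 0xFF);
            if (c == '\0') {
                out[n] = '\0';
                return;
            }
            out[n++] = c;
        }
    }
    out[n] = '\0';
}
```

Feeding it the three dwords from parameter 3 (73727363 78652e73 00000065) yields the image name, and the first row of parameter 4 yields the start of the explanatory message.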
Let’s review the debugger help file for more information. We can see that this bugcheck occurs when a critical process or thread terminates: “Several processes and threads are necessary for the operation of the system. When they are terminated for any reason, the system can no longer function.”
0: kd> .hh bug check 0xf4
Next, we need to determine why this process terminated. !analyze -v also provided us with an exception record, which gives us an error code:
PROCESS_NAME: csrss.exe
EXCEPTION_RECORD: 9a85e9d8 -- (.exr 0xffffffff9a85e9d8)
ExceptionAddress: 7c92c375 (ntdll!RtlFindMessage+0x0000007c)
ExceptionCode: c0000006 (In-page I/O error)
ExceptionFlags: 00000000
NumberParameters: 3
Parameter[0]: 00000000
Parameter[1]: 7c99c3d8
Parameter[2]: c000009a
Inpage operation failed at 7c99c3d8, due to I/O error c000009a
EXCEPTION_CODE: (NTSTATUS) 0xc0000006 - The instruction at 0x%p referenced memory at 0x%p. The required data was not placed into memory because of an I/O error status of 0x%x.
The exception record also carries a second status code, c000009a, which is the I/O error behind the in-page failure. Let’s investigate that code. We can quickly look it up from within the debugger using the !error command.
0: kd> !error c000009a
Error code: (NTSTATUS) 0xc000009a (3221225626) - Insufficient system resources exist to complete the API.
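Both status codes seen so far start with 0xC, which is how an NTSTATUS value marks an error. The sketch below splits a status into its severity, facility, and code fields based on the documented NTSTATUS bit layout. The helper names are my own, chosen for illustration.

```c
#include <stdint.h>

/* NTSTATUS layout: bits 31-30 severity, bit 29 customer flag,
   bit 28 reserved, bits 27-16 facility, bits 15-0 code.
   Severity 3 means error, 2 warning, 1 informational, 0 success. */
static unsigned ntstatus_severity(uint32_t status)
{
    return status >> 30;
}

static unsigned ntstatus_facility(uint32_t status)
{
    return (status >> 16) & 0xFFFu;
}

static unsigned ntstatus_code(uint32_t status)
{
    return status & 0xFFFFu;
}
```

For 0xc000009a this gives severity 3 (error), facility 0, and code 0x9a. The decimal value that !error prints in parentheses is simply the same 32 bits read as an unsigned number.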
0: kd> .hh !error
Insufficient system resources often points toward memory or pool exhaustion, so let’s check the output of the !vm command.
0: kd> !vm 2
*** Virtual Memory Usage ***
Physical Memory: 760875 ( 3043500 Kb)
Page File: \??\C:\pagefile.sys
Current: 4190208 Kb Free Space: 4156380 Kb
Minimum: 4190208 Kb Maximum: 4190208 Kb
Available Pages: 579241 ( 2316964 Kb)
ResAvail Pages: 673481 ( 2693924 Kb)
Locked IO Pages: 69 ( 276 Kb)
Free System PTEs: 115226 ( 460904 Kb)
Free NP PTEs: 0 ( 0 Kb)
Free Special NP: 0 ( 0 Kb)
Modified Pages: 221 ( 884 Kb)
Modified PF Pages: 219 ( 876 Kb)
NonPagedPool Usage: 65534 ( 262136 Kb)
NonPagedPool Max: 65536 ( 262144 Kb)
********** Excessive NonPaged Pool Usage *****
PagedPool 0 Usage: 24167 ( 96668 Kb)
PagedPool 1 Usage: 967 ( 3868 Kb)
PagedPool 2 Usage: 967 ( 3868 Kb)
PagedPool 3 Usage: 984 ( 3936 Kb)
PagedPool 4 Usage: 977 ( 3908 Kb)
PagedPool Usage: 28062 ( 112248 Kb)
PagedPool Maximum: 92160 ( 368640 Kb)
********** 2075 pool allocations have failed **********
Session Commit: 1562 ( 6248 Kb)
Shared Commit: 2526 ( 10104 Kb)
Special Pool: 0 ( 0 Kb)
Shared Process: 4821 ( 19284 Kb)
PagedPool Commit: 28062 ( 112248 Kb)
Driver Commit: 5138 ( 20552 Kb)
Committed pages: 153449 ( 613796 Kb)
Commit limit: 1767229 ( 7068916 Kb)
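The lines that stand out in this output are NonPagedPool Usage against NonPagedPool Max, together with the warning about failed pool allocations. Each figure is given in pages and in Kb; on this x86 system a page is 4 Kb. As a rough sketch (the helper names are hypothetical), the arithmetic behind those columns looks like this:

```c
#include <stdint.h>

#define PAGE_SIZE_KB 4u  /* x86 small page: 4 Kb */

/* Convert a page count from !vm into kilobytes. */
static uint32_t pages_to_kb(uint32_t pages)
{
    return pages * PAGE_SIZE_KB;
}

/* Pages still available before the pool reaches its maximum. */
static uint32_t pool_pages_remaining(uint32_t usage_pages, uint32_t max_pages)
{
    return max_pages > usage_pages ? max_pages - usage_pages : 0;
}
```

With the values above, 65534 pages of usage against a maximum of 65536 leaves only 2 pages (8 Kb) of nonpaged pool.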
0: kd> !poolused
unable to get PoolTrackTable - pool tagging is disabled, enable it to use this command
Use gflags.exe and check the box that says "Enable pool tagging".
The output above has informed us that pool tagging is disabled. Let’s demonstrate how you can verify that it is disabled:
0: kd> dd nt!NtGlobalFlag L1
805597ec 00000000
0: kd> !gflag
Current NtGlobalFlag contents: 0x00000000
Let’s explore the debugging help file entry on the !poolused command
0: kd> .hh !poolused
Reading the help entry, we are informed that “Pool tagging is permanently enabled on Windows Server 2003 and later versions of Windows. On Windows XP and earlier versions of Windows, you must enable pool tagging by using Gflags.”
Using the vertarget command, I can see that this system was running Windows XP.
0: kd> vertarget
Windows XP Kernel Version 2600 (Service Pack 2) MP (2 procs) Free x86 compatible
0: kd> .hh !gflag
By reviewing the help file entry for the !gflag extension, I was able to determine that if pool tagging were enabled, the following bit would have been set:
0x400 "ptg" Enable pool tagging.
0: kd> .formats 0x400
Evaluate expression:
….
Binary: 00000000 00000000 00000100 00000000 0x00000400
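Checking for pool tagging therefore comes down to testing a single bit of NtGlobalFlag. Here is a minimal sketch; the constant name is my own label for the 0x400 "ptg" value quoted from the help file, not a name taken from this dump.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit value from the !gflag help text quoted above ("ptg").
   The macro name is illustrative. */
#define GFLAG_POOL_TAGGING 0x400u

/* True when the pool tagging bit is set in an NtGlobalFlag value. */
static bool pool_tagging_enabled(uint32_t nt_global_flag)
{
    return (nt_global_flag & GFLAG_POOL_TAGGING) != 0;
}
```

The NtGlobalFlag value in this dump was 0x00000000, so the test comes back false, which agrees with the message from !poolused.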
Gflags is included in the Debugging Tools for Windows. The screenshot below is from a Windows 7 system. Notice that Pool Tagging is enabled permanently as described above.
Summary: This system bugchecked when the critical process csrss.exe failed an I/O operation due to insufficient non-paged pool. For an action plan, we recommended the use of gflags to enable pool tagging in order to obtain more information about pool consumption.
Hi - my name is Naresh and I am a Sr. Escalation Engineer on the Microsoft GES platforms team. Today I'm discussing a simple, yet powerful GUI tool used to configure a Windows system locally or remotely for a memory dump. The name of the tool is DumpConfigurator.hta and it can be accessed from CodePlex. Check out the following Microsoft KB article which references the use of the tool.
969028 How to generate a kernel or a complete memory dump file in Windows Server 2008
http://support.microsoft.com/default.aspx?scid=kb;EN-US;969028
The tool can be used with all currently supported versions of the Windows Operating System. Once you download it, launch it with Administrator privileges to get the following UI:
The GUI is self-explanatory and all the settings can be edited and saved by clicking Save Settings. The system will have to be rebooted for the settings to take effect. NOTE: Read the Warranty Disclaimer for the tool before use :)
Hello - This is Omer and I recently came across a case where the customer reported that they could not reboot into safe mode using their custom image. Whenever they booted into safe mode, the machine would get to the logon screen, wait for 5 seconds and then reboot regardless of any user input. Nothing was being logged in the event logs either, so it was very strange.
At first it looked like the machine was going through a power cycle, since the shutdown was so quick (we would not see the usual shutdown messages like “Shutting down Services”, etc.). I thought maybe there was some issue with the hardware, but the customer reported that they had the same issue on every machine, regardless of the hardware vendor.
To figure this out, I attached a kernel debugger to the machine, and broke in to make sure the connection was good. I then let the machine go, and it got to the logon screen. Sure enough, after 5 seconds the machine rebooted. I thought that I would run into some kind of exception and the debugger would break; however, nothing of the sort happened. The only message that I got was the following:
Shutdown occurred at (Fri Jun 26 17:27:12.714 2009 (GMT-7))...unloading all symbol tables.
Very strange! The OS disconnected the debugger gracefully. I did a quick source code review and found that one of the places where we disconnect the debugger is in the system shutdown path. Maybe the OS was shutting down gracefully, but since it happened so fast, it looked like a power cycle. To test my theory, I put a breakpoint on nt!NtShutdownSystem to see if it was being called, and to find the caller as well. I rebooted the machine and let it rip.
nt!NtShutdownSystem()
nt!KiSystemServiceCopyEnd()+0x13
ntdll!ZwShutdownSystem(void)+0xa
services!ScRevertToLastKnownGood()+0x1af
services!ScStartMarkedServices()+0x154
services!ScStartServiceAndDependencies()+0x43d
services!ScAutoStartServices()+0x225
services!SvcctrlMain()+0xa75
services!main()+0x31
services!__mainCRTStartup()+0x13d
kernel32!BaseThreadInitThunk()+0xd
ntdll!RtlUserThreadStart()+0x1d
Voila! Services.exe is shutting down the system. Probably some service is not starting, which is then somehow causing the machine to shut down. From the stack, I was able to figure out which service was not starting. Based on the service record, it was a third-party remote assistance service.
But how could the failure of this non-critical service to start cause the Service Control Manager to reboot the machine? And what is that stack frame about reverting to last known good (services!ScRevertToLastKnownGood()+0x1af) doing on the stack?
Looking at the service record, I found that the SCM returned error code 0x43c. This translates to ERROR_NOT_SAFEBOOT_SERVICE (This service cannot be started in Safe Mode). Also, the ErrorControl value for this service was set to 0x2, which means that if the service does not start successfully, the system needs to revert to the last known good configuration and reboot. However, if the system is already using last known good, then it should just continue the boot process and log the error.
Error Control   Meaning
Level
0x3 (Critical)  Fail the attempted system startup.
                If the startup is not using the
                LastKnownGood control set, switch to
                LastKnownGood. If the startup attempt
                is using LastKnownGood, run a bug-check
                routine.
0x2 (Severe)    If the startup is not using the
                LastKnownGood control set, switch to
                LastKnownGood. If the startup attempt
                is using LastKnownGood, continue on
                in case of error.
0x1 (Normal)    If the driver fails to load or initialize,
                startup should proceed, but display a
                warning.
0x0 (Ignore)    If the driver fails to load or initialize,
                start up proceeds. No warning is displayed.
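The table can be read as a small decision function. The sketch below is only an illustration of the table, not the Service Control Manager's actual code; the enum and function names are made up for this example, and only the numeric ErrorControl values come from the table.

```c
#include <stdbool.h>

/* Illustrative names; the numeric values come from the table above. */
enum error_control {
    EC_IGNORE = 0,
    EC_NORMAL = 1,
    EC_SEVERE = 2,
    EC_CRITICAL = 3
};

enum boot_action {
    ACTION_CONTINUE,              /* keep booting */
    ACTION_CONTINUE_WITH_WARNING, /* keep booting, display a warning */
    ACTION_REVERT_TO_LKG,         /* switch to LastKnownGood */
    ACTION_BUGCHECK               /* run a bug-check routine */
};

/* What the table says should happen when a service or driver fails to start. */
static enum boot_action on_start_failure(enum error_control level, bool using_lkg)
{
    switch (level) {
    case EC_CRITICAL:
        return using_lkg ? ACTION_BUGCHECK : ACTION_REVERT_TO_LKG;
    case EC_SEVERE:
        return using_lkg ? ACTION_CONTINUE : ACTION_REVERT_TO_LKG;
    case EC_NORMAL:
        return ACTION_CONTINUE_WITH_WARNING;
    case EC_IGNORE:
    default:
        return ACTION_CONTINUE;
    }
}
```

For the case at hand, a level of 0x2 on a boot that is not yet using LastKnownGood maps to the revert action, which matches the ScRevertToLastKnownGood frame on the stack.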
Because the service’s ErrorControl value is set to 0x2, the machine would revert to the last known good configuration and silently reboot. I booted the machine normally, and changed the ErrorControl value in the registry.
I also had to change the value in the other ControlSets, since they were identical to the current control set. This also explains why the machine kept rebooting every time: the value in the Last Known Good configuration was also set incorrectly.
I rebooted the machine and was able to boot into safe mode normally. Hence, the mystery of the silent reboots was solved.
Hello - It's Ryan again with the second installment of my list corruption walkthrough. The previous blog post is here -
In part one we walked through the analysis of a memory.dmp collected during a bugcheck caused by pool corruption. The post also discussed doubly linked lists and demonstrated an unconventional order of debugging steps in which we did not begin our examination with the backtrace of the bad pointer value.
Today’s continuation will consist of another crash dump 'debugging walkthrough' explaining a lot of the typical commands used for debugging. This one involves pool corruption affecting a linked list which led us to a kernel-mode crash. I will also discuss the removal of an entry from the head of a linked list. As in the previous post, I shall provide demonstrations of the topics within the debugger in an attempt to relate the information to a real-world problem. Despite the title, we won't be working in reverse today.
As is typical, I began this windbg session with a quick !analyze -v to get a feel for what went wrong.
IRQL_NOT_LESS_OR_EQUAL (a)
An attempt was made to access a pageable (or completely invalid) address at an
interrupt request level (IRQL) that is too high. This is usually
caused by drivers using improper addresses.
If a kernel debugger is available get the stack backtrace.
Arguments:
Arg1: 00000004, memory referenced <-- Looks like a BAD pointer
Arg2: d0000002, IRQL
Arg3: 00000001, bitfield :
bit 0 : value 0 = read operation, 1 = write operation
bit 3 : value 0 = not an execute operation, 1 = execute operation (only on chips which support this level of status)
Arg4: 808436e8, address which referenced memory
TRAP_FRAME: f7906cdc -- (.trap 0xfffffffff7906cdc)
The output from this command lets us know that the code trapped accessing the invalid address 0x00000004. When we crash attempting to access low addresses like this, it is almost always due to dereferencing a null pointer. Such invalid address values are often obtained by adding some value to the pointer in an attempt to access fields in the structure that the pointer should have been referencing. For example, we may find in our analysis today that we attempted to access a structure field located at offset 0x4 within the structure. Recall that in part one of this blog, the invalid address was also close to zero. The invalid address in the previous blog’s memory.dmp was obtained by subtracting negative values from a null pointer. However, as I mentioned in part one, there are other ways that we could have ended up with this value besides a NULL pointer. Some examples would include a hardware problem, following a corrupted pointer to a location that had some bad value, or some code improperly writing some value over a pointer. Let’s start our analysis and identify what happened.
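To make the "null pointer plus field offset" idea concrete, here is a small sketch using a 32-bit look-alike of a two-pointer list entry. The structure and helper names are hypothetical; the only assumption is that the second pointer sits at offset 0x4, as it does in a 32-bit _LIST_ENTRY.

```c
#include <stddef.h>
#include <stdint.h>

/* 32-bit look-alike of a list entry: pointers are stored as uint32_t so the
   layout matches the x86 target regardless of the host machine. */
struct list_entry32 {
    uint32_t flink;  /* offset 0x0 */
    uint32_t blink;  /* offset 0x4 */
};

/* Address the CPU would touch when code accesses a field through `base`. */
static uint32_t field_access_address(uint32_t base, size_t field_offset)
{
    return base + (uint32_t)field_offset;
}
```

With a base of zero and the offset of the second field, the result is 0x4, the same address reported as "memory referenced" in this bugcheck.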
Dumping the stack without setting the trap frame shows the code executing within the page fault trap handler.
0: kd> kC
nt!_KiTrap0E <-- We were in the handler for Trap 0xE
nt!MiRemoveUnusedSegments
nt!MiDereferenceSegmentThread
nt!PspSystemThreadStartup
nt!KiThreadStartup
0: kd> .formats 0xE
Hex: 0000000e
Decimal: 14 <-- Trap number 14, which per the x86 Intel manuals is a page fault trap
0: kd> rcr2 <-- A page fault trap results in the address that triggered the trap being stored in the CR2 register
Last set context:
cr2=00000004 <-- Here is the invalid address
Next, we’ll set the trap frame
0: kd> .trap 0xfffffffff7906cdc
After setting the trap frame, I looked at the stack again just to get a quick overview of what calls were made in this thread:
nt!MiRemoveUnusedSegments <-- This is the function executing when the trap occurred. Take notice of the function name.
Next, I went straight to the assembly to explore what happened. I’ll begin by examining the faulting instruction.
0: kd> r
eax=00000000 ebx=80a5e540 ecx=808ab4a8 edx=00000000 esi=86f4c658 edi=80a5e4d0
eip=808436e8 esp=f7906d50 ebp=f7906d90 iopl=0 nv up ei ng nz ac po cy
cs=0008 ss=0010 ds=0023 es=0023 fs=0030 gs=0000 efl=00010293
nt!MiRemoveUnusedSegments+0x716:
808436e8 c74004a8b48a80 mov dword ptr [eax+4],offset nt!MmUnusedSegmentList (808ab4a8) ds:0023:00000004=????????
As you can see, the trap was triggered by an attempt to write to the address formed by adding four to the null value in EAX. Also, I would like to point out that parameter number three of the bugcheck provides supporting information which is sometimes important in developing a clear understanding of what occurred:
0: kd> .bugcheck
Bugcheck code 0000000A
Arguments 00000004 d0000002 00000001 808436e8
This is illustrated by the output of !analyze -v:
Arg1: 00000004, memory referenced
Arg2: d0000002, IRQL
Arg3: 00000001, bitfield :
bit 0 : value 0 = read operation, 1 = write operation
bit 3 : value 0 = not an execute operation, 1 = execute operation (only on chips which support this level of status)
Arg4: 808436e8, address which referenced memory
0: kd> .formats 0x1
Binary: 00000000 00000000 00000000 00000001 <-- Write operation, not an execute operation.
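The same bitfield can be decoded in code. This is a minimal sketch using only the two bit meanings listed in the !analyze output; the helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit meanings as listed in the !analyze -v output for bugcheck 0xA, Arg3. */
static bool was_write_operation(uint32_t arg3)
{
    return (arg3 & 0x1u) != 0;   /* bit 0: 0 = read, 1 = write */
}

static bool was_execute_operation(uint32_t arg3)
{
    return (arg3 & 0x8u) != 0;   /* bit 3: 0 = not execute, 1 = execute */
}
```

For Arg3 = 00000001 the first helper returns true and the second returns false, which matches the faulting mov instruction writing to [eax+4].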
If you do not need to see the entire output of !analyze -v, you can get a very abbreviated version by using !analyze -a as displayed below.
0: kd> !analyze -a
****************************
Bugcheck Analysis
Use !analyze -v to get detailed debugging information.
BugCheck A, {4, d0000002, 1, 808436e8}
Probably caused by : memory_corruption ( nt!MiRemoveUnusedSegments+716 )
Followup: MachineOwner
You can get further information about this bugcheck by typing .hh bug check 0xA into the debugger
windbg> .hh bug check 0xA
This will bring up the following screen which displays helpful information.
Now that we understand that this crash was due to a NULL value present in EAX, the next goal of this postmortem memory.dmp debug session becomes identification of the source of this NULL value. This task can often prove extremely challenging due to the fact that it sometimes involves backtracing through several highly optimized functions full of jumps without symbols. Other times, we may be at the beginning of a very simple function for which we have private symbols. Let's see how we luck out today.
I’ll proceed by disassembling the code around the point of the trap. Here is my favorite command for accomplishing this.
0: kd> ub @$ip L4;u . L3;r$ip
nt!MiRemoveUnusedSegments+0x70a
808436dc 8bf0 mov esi,eax
808436de 8b06 mov eax,dword ptr [esi]
The first two instructions above together accomplish the following: “Overwrite the value in EAX with the data that EAX is presently pointing to.”
808436e0 a3a8b48a80 mov dword ptr [nt!MmUnusedSegmentList (808ab4a8)],eax
808436e5 83c6fc add esi,0FFFFFFFCh
nt!MiRemoveUnusedSegments+0x716
808436e8 c74004a8b48a80 mov dword ptr [eax+4],offset nt!MmUnusedSegmentList (808ab4a8) <-- However, EAX was unexpectedly NULL here.
808436ef ff0d8cb48a80 dec dword ptr [nt!MmUnusedSegmentCount (808ab48c)]
808436f5 33c9 xor ecx,ecx
$ip=808436e8
0: kd> reax
eax=00000000
In reviewing the code above, I observed that we had just moved this value from EAX to nt!MmUnusedSegmentList two instructions ago. To summarize, the code moved a value (that obviously wasn’t expected to be NULL) to nt!MmUnusedSegmentList and then dereferenced this value (plus four) which of course caused us to crash because 0x4 is not a valid address.
Let's take a quick look at nt!MmUnusedSegmentList
0: kd> x nt!MmUnusedSegmentList
808ab4a8 nt!MmUnusedSegmentList = <no type information>
0: kd> dt 808ab4a8 nt!_LIST_ENTRY
[ 0x0 - 0x86f05234 ]
+0x000 Flink : (null)
+0x004 Blink : 0x86f05234 _LIST_ENTRY [ 0x808ab4a8 - 0x8724392c ]
MmUnusedSegmentList is the head of a doubly-linked list. You can see that the value that we just moved to this list head's Flink (forward link) entry is NULL. As covered in part one of this blog, a Flink is a pointer to the next _LIST_ENTRY in the list. A Blink (backward link) is a pointer to the previous _LIST_ENTRY in the list. In this case we moved a NULL Flink from EAX into the list head (MmUnusedSegmentList->Flink) and crashed trying to dereference offset 4 from this same NULL value. The Flink of the list head should contain a pointer to the first _LIST_ENTRY. The Flink and Blink should never contain NULL values. If the list is empty, both entries will be pointing to the list head: ListHead->Flink and ListHead->Blink will both contain a pointer to the ListHead itself.
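The rules in the paragraph above can be written down as a few lines of C. This is a simplified user-mode sketch with hypothetical names; it mirrors the shape of a LIST_ENTRY but is not the kernel's implementation.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct list_entry {
    struct list_entry *flink;  /* next entry */
    struct list_entry *blink;  /* previous entry */
} list_entry;

/* An empty list: both links of the head point back at the head itself. */
static void init_list_head(list_entry *head)
{
    head->flink = head;
    head->blink = head;
}

static bool is_list_empty(const list_entry *head)
{
    return head->flink == head;
}

static void insert_tail(list_entry *head, list_entry *entry)
{
    list_entry *last = head->blink;
    entry->flink = head;
    entry->blink = last;
    last->flink = entry;
    head->blink = entry;
}

/* In a healthy list no link is NULL and each neighbor points back at us. */
static bool links_consistent(const list_entry *e)
{
    return e->flink != NULL && e->blink != NULL &&
           e->flink->blink == e && e->blink->flink == e;
}
```

A list head whose Flink is NULL, like the one in this dump, fails the consistency check immediately.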
The symbols have provided us with some clues to what the assembly was doing. Based on the name of this list head MmUnusedSegmentList, the fact that the code seems to be decrementing MmUnusedSegmentCount, and the function name MiRemoveUnusedSegments, it appears obvious that we are trying to remove an entry from a list of unused segments. Also, I can tell that we are in a memory manager function by the Mi prefix on the function name (the Mm and Mi prefixes both denote the memory manager). You can find a list of Commonly Used Prefixes in chapter two of Mark Russinovich's book "Windows Internals 4th Edition".
It is a common operation when working with a linked list to remove an entry from the beginning of a doubly linked list of LIST_ENTRY structures. I’ll explain this by providing a simplified fictitious example. Let's pretend that we have a doubly linked list that is used in keeping track of unused Widgets. Let’s also pretend that we have a function that code can call as follows:
pWidget = giveMeAWidget ()
This function first finds the head for the list of unused widgets, which we are going to store in a global variable called UnusedWidgetListHead. It then finds an available widget to return by following UnusedWidgetListHead->Flink. Recall from part one of this blog that unless the LIST_ENTRY was at offset 0x0 from the start of the WIDGET structure, the code would have to subtract the offset of the LIST_ENTRY to reach the base of the actual WIDGET structure. Before returning a pointer to the widget back to the calling code, it will be necessary to remove this widget from the doubly linked list of unused widgets since this widget will now be in use. Ignoring possible synchronization requirements, the process of removing this first entry from the list would typically involve code that sets ListHead->Flink to point to the second entry in the list instead of the first entry. It would also be necessary to update the second widget’s Blink member to point to the ListHead instead of the first entry that it pointed to before. This would now make entry number two the first widget, therefore removing entry number one from the list.
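Here is how the fictitious widget example might look in C. Everything in it is hypothetical (the WIDGET layout, the helper that refills the list, and the lack of locking); it is only meant to show the two pointer updates and the offset subtraction described above.

```c
#include <stddef.h>

typedef struct list_entry {
    struct list_entry *flink;
    struct list_entry *blink;
} list_entry;

/* Hypothetical WIDGET: the list links are deliberately NOT at offset 0. */
typedef struct widget {
    int id;
    list_entry unused_links;
} widget;

/* Empty list: the head points at itself in both directions. */
static list_entry UnusedWidgetListHead = { &UnusedWidgetListHead, &UnusedWidgetListHead };

/* Append a widget to the tail of the unused list. */
static void release_widget(widget *w)
{
    list_entry *entry = &w->unused_links;
    list_entry *last = UnusedWidgetListHead.blink;
    entry->flink = &UnusedWidgetListHead;
    entry->blink = last;
    last->flink = entry;
    UnusedWidgetListHead.blink = entry;
}

/* Remove the first unused widget and return it, or NULL if none are left. */
static widget *giveMeAWidget(void)
{
    list_entry *first = UnusedWidgetListHead.flink;
    if (first == &UnusedWidgetListHead)
        return NULL;                        /* empty list */
    list_entry *second = first->flink;
    UnusedWidgetListHead.flink = second;    /* head now points at entry two */
    second->blink = &UnusedWidgetListHead;  /* entry two points back at head */
    /* Step back from the embedded links to the start of the WIDGET. */
    return (widget *)((char *)first - offsetof(widget, unused_links));
}
```

Note that the removal trusts first->flink completely. If that pointer has been zeroed, the write to second->blink lands at a tiny address, which is the same kind of failure we are about to see in the assembly.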
While reviewing the code, I noticed a pattern in the assembly that matched the type of operation that I described above, where we are removing the first entry from the list. Let's review the assembly again, this time in more detail. We will break down the instructions to identify what transpired. You can use . or @$ip to represent the current instruction pointer. In this case, I’ll use the pseudo-register @$ip.
0: kd> ub @$ip L4;u @$ip L1
808436dc 8bf0 mov esi,eax
This copies ListHead->Flink to esi. The code uses the ListHead flink to find ListMember1->Flink (not an actual name; I am simply referring to each _LIST_ENTRY as ListMember# to demonstrate what is taking place). Why does it need ListMember1->Flink? Because it points to ListMember2, which the code needs to convert into the new ListMember1 in the manner described above. This is very important for our debugging since we may be able to obtain the address of Member1 from esi.
808436de 8b06 mov eax,dword ptr [esi]
This follows the flink that was just copied, dereferencing it to obtain ListMember1->Flink and place it in eax. This should be a pointer to ListMember2; however, it was null.
808436e0 a3a8b48a80 mov dword ptr [nt!MmUnusedSegmentList (808ab4a8)],eax
This moves the null value to ListHead->Flink. The operation should be setting the list head to point to Member2, therefore converting it to Member1. However, since Member1->Flink was null, the list head now contains a NULL value.
808436e5 83c6fc add esi,0FFFFFFFCh
Earlier I mentioned that we might be able to obtain the address of Member1 from esi, since esi held the ListHead->Flink pointing to it; however, esi was modified here, so we can't rely on its current value. The code added 0xFFFFFFFC; this is really just a compiler-optimized (fancy) way of subtracting four. I don't need to know why the code subtracts four from this value, since it crashed before ever using address-4, so I won't be investigating that today. However, I suspect it was subtracting the offset of the following field.
0: kd> dt nt!_CONTROL_AREA DereferenceList
+0x004 DereferenceList : _LIST_ENTRY
Instead of digging into that, we simply want to identify the value of the list member. To obtain our value here, we will perform the reverse of the addition operation and simply subtract 0xFFFFFFFC as a fancy way to add four:
0: kd> resi
esi=86f4c658
We effectively added 4 to get the original value of esi
0: kd> ? 86f4c658-0x0FFFFFFFC
Evaluate expression: -2030778788 = 86f4c65c
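The reason this trick works is that 32-bit arithmetic wraps around at 2^32, so adding 0xFFFFFFFC and subtracting 4 produce the same bits. A tiny sketch (with hypothetical helper names) shows both directions using the values from this dump:

```c
#include <stdint.h>

/* Unsigned 32-bit arithmetic wraps modulo 2^32, just like the x86 registers. */
static uint32_t add32(uint32_t a, uint32_t b)
{
    return a + b;
}

static uint32_t sub32(uint32_t a, uint32_t b)
{
    return a - b;
}
```

Adding 0xFFFFFFFC to 86f4c65c gives 86f4c658 (what the add esi instruction did), and subtracting 0xFFFFFFFC from 86f4c658 gives 86f4c65c back (what we just did in the debugger).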
So this address should contain a NULL flink
0: kd> dd 86f4c65c L2
86f4c65c 00000000 808ab4a8
And there it is (the first dword above).
As expected in the output above, the flink was null. Also note that the blink (the second dword) was in fact pointing to 808ab4a8, which is the list head. So this does appear to be the address of the original Member1. If you can't recall the address of the list head, don't scroll up in the debugger text (or this blog text); we can confirm it quickly as follows:
0: kd> dd nt!MmUnusedSegmentList L1
808ab4a8 00000000
Let’s dump the address out:
0: kd> !address 86f4c65c
83041000 - 07fbf000
Usage KernelSpaceUsageNonPagedPool
0: kd> !pool 86f4c65c
Pool page 86f4c65c region is Nonpaged pool
86f4c000 size: 98 previous size: 0 (Allocated) File (Protected)
86f4c098 is not a valid large pool allocation, checking large session pool...
bf7f4000: Unable to get contents of pool block
So we are dealing with pool corruption. Now that we have followed the NULL value in the corrupt pool and loaded EAX with zero, let’s proceed with the inspection of the next instruction, which caused the trap leading to the bugcheck:
nt!MiRemoveUnusedSegments+0x716
808436e8 c74004a8b48a80 mov dword ptr [eax+4],offset nt!MmUnusedSegmentList (808ab4a8)
This instruction was attempting to place the address of the list head into the location at eax+4. Why? Well, EAX has the new Member1. If we add four to this value, this would bring us to Member1->Blink, which should be pointing to the list head. Had the new Member1 actually been Member2 instead of NULL, then we would be overwriting the pointer to Member1 with a pointer to the list head. However, the pointer had been zeroed out and that brought us to address 4. Next, we trapped and the system bugchecked. As discussed earlier, we died on a write operation attempting to write a value to the address referenced by [EAX+4].
For completeness, let’s review what the next two instructions would have executed had we not trapped
808436ef ff0d8cb48a80 dec dword ptr [nt!MmUnusedSegmentCount (808ab48c)]
Had the instruction above not trapped, we would have decremented the count. However, we did not make it this far. If you remove an item from a linked list, you should decrement the value of any variable tracking the number of items in that list. This instruction would have accomplished this.
808436f5 33c9 xor ecx,ecx
This would have simply zeroed out ecx.
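Putting the instructions together, the sequence is a head removal followed by a count decrement. The sketch below mirrors those steps in C but adds link validation before any write, so a zeroed entry is reported instead of causing a trap. It is an illustration with hypothetical names, not the memory manager's code, and the register comments are only a loose mapping to the disassembly.

```c
#include <stddef.h>

typedef struct list_entry {
    struct list_entry *flink;
    struct list_entry *blink;
} list_entry;

typedef enum { UNLINK_OK, UNLINK_EMPTY, UNLINK_CORRUPT } unlink_status;

/* Remove the first entry of the list, validating the links before
   writing through them. On success *removed receives the old first entry. */
static unlink_status remove_head_checked(list_entry *head, long *count,
                                         list_entry **removed)
{
    list_entry *first = head->flink;        /* esi = ListHead->Flink */
    if (first == NULL)
        return UNLINK_CORRUPT;
    if (first == head)
        return UNLINK_EMPTY;
    list_entry *second = first->flink;      /* eax = Member1->Flink */
    if (second == NULL || first->blink != head || second->blink != first)
        return UNLINK_CORRUPT;              /* the dump had second == NULL */
    head->flink = second;                   /* ListHead->Flink = Member2 */
    second->blink = head;                   /* Member2->Blink = ListHead */
    *count -= 1;                            /* one fewer entry in the list */
    *removed = first;
    return UNLINK_OK;
}
```

With the entry found in this dump (Flink zeroed, Blink still pointing at the list head), the function would return UNLINK_CORRUPT and leave both the list head and the count untouched.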
So this crash was in fact due to a null pointer. More specifically, it was caused by a null flink. Let's dump the linked list located at nt!MmUnusedSegmentList using the dlb command just to see what happens. We won't be able to go forward since the flink is null; however, we should be able to dump the linked list going backwards. If the list loops back on itself or if a null pointer is encountered, the dl command will stop traversing the list.
First, I would like to know how large this list is since the dl command accepts a value that limits its length.
The following global should tell us how many members are in this list.
0: kd> x nt!*UnusedSegmentCount*
808ab48c nt!MmUnusedSegmentCount = <no type information>
0: kd> dd 808ab48c L1
808ab48c 0000e7ce
0: kd> .formats 0xe7ce
Hex: 0000e7ce
Decimal: 59342
This is a huge list, so let’s see if we can traverse it. Based on the size above, I dumped out the list using ffff for the limit.
0: kd> dlb nt!MmUnusedSegmentList ffff 2
808ab4a8 00000000 86f05234
86f05234 808ab4a8 8724392c
8724392c 86f05234 8867fd94
8867fd94 8724392c 877d600c
877d600c 8867fd94 87849944
.....
(omitting lengthy unneeded output, all links were valid)
86ecbaac 86f04354 86f95664
86f95664 86ecbaac 86f4c28c
86f4c28c 00000000 00000000 <-- After over a minute of output, we have a null pointer, but it's at a different address.
First Address we found to be corrupt: 86f4c65c
Second Address we found to be corrupt: 86f4c28c
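The backward walk that dlb performed can be sketched in a few lines of C. This is a simplified illustration with hypothetical names, following the stopping rules described earlier: the walk ends when it returns to the head, meets a NULL link, or reaches the entry limit.

```c
#include <stddef.h>

typedef struct list_entry {
    struct list_entry *flink;
    struct list_entry *blink;
} list_entry;

typedef enum { WALK_COMPLETE, WALK_HIT_NULL, WALK_HIT_LIMIT } walk_result;

/* Follow blink pointers starting from the head. *visited receives the
   number of entries seen, not counting the head itself. */
static walk_result walk_backward(const list_entry *head, size_t max_entries,
                                 size_t *visited)
{
    const list_entry *cur = head->blink;
    *visited = 0;
    for (;;) {
        if (cur == NULL)
            return WALK_HIT_NULL;      /* a zeroed link ends the walk */
        if (cur == head)
            return WALK_COMPLETE;      /* looped back to the list head */
        if (*visited == max_entries)
            return WALK_HIT_LIMIT;
        (*visited)++;
        cur = cur->blink;
    }
}
```

On an intact list the walk would end back at the head after visiting every member. In this dump it ended on a NULL blink instead, at the second zeroed entry.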
Now that we have located the addresses above, which appear to be incorrectly zeroed out, let’s dump out these areas in an attempt to get more information. It would be great if we could verify whether the pool that we are using is corrupt. It would be even better if we found some pointers, text, or other clues that may lead us closer to the problem. Let's dump the two addresses out in various ways. I’ll start by examining the pool for corruption.
The !pool extension as used below displays information about a specific pool allocation
0: kd> !pool 86f4c65c;!pool 86f4c28c
Pool page 86f4c28c region is Nonpaged pool
The !poolval extension analyzes the headers for a pool page and diagnoses any possible corruption.
0: kd> !poolval 86f4c65c;!poolval 86f4c28c
Validating Pool headers for pool page: 86f4c65c
Pool page [ 86f4c000 ] is __inVALID.
Analyzing linked list...
[ 86f4c000 --> 86f4c6c0 (size = 0x6c0 bytes)]: Corrupt region
Scanning for single bit errors...
None found
Validating Pool headers for pool page: 86f4c28c
We can also dump out the memory around the addresses in question using various commands. As discussed previously, we are seeking clues in text format, pointers, etc. The output from the following command shows that the memory is mostly zeroed. One value appears to be present. Perhaps the entire region was overwritten and then this value was updated. dd dumps the raw data as dwords. dc dumps the same dwords along with their ASCII characters.
0: kd> dd 86f4c65c-100 86f4c65c+100;dd 86f4c28c-100 86f4c28c+100
….<omitting zeros>
86f4c63c 00000000 00000000 00000000 00000000
86f4c64c 00000000 00000000 00000000 00000000
86f4c65c 00000000 808ab4a8 00000000 00000000
86f4c66c 00000000 00000000 00000000 00000000
86f4c67c 00000000 00000000 00000000 00000000
86f4c68c 00000000 00000000 00000000
86f4c26c 00000000 00000000 00000000 00000000
86f4c27c 00000000 00000000 00000000 00000000
86f4c28c 00000000 00000000 00000000 00000000
86f4c29c 00000000 00000000 00000000 00000000
0: kd> dc 86f4c28c-1000 86f4c65c
<omitting>
86f4c62c 00000000 00000000 00000000 00000000 ................
86f4c63c 00000000 00000000 00000000 00000000 ................
86f4c64c 00000000 00000000 00000000 00000000 ................
86f4c65c 00000000 ....
0: kd> dc
86f4c660 808ab4a8
We can see from the command below that we are in fact dealing with a NonPagedPool address range.
0: kd> !address 86f4c65c;!address 86f4c28c
0: kd> dps 86f4c65c-8 86f4c65c+8
86f4c654 00000000
86f4c658 00000000
86f4c65c 00000000
86f4c660 808ab4a8 nt!MmUnusedSegmentList
86f4c664 00000000
Just as in the previous dump, in order to identify the source of the pool corruption we need to use Special Pool. Special Pool will use guard pages to catch a buffer overrun or underrun and should provide us with a dump that shows the code that causes the corruption. You can find more information on Special Pool in KB188831. Also, in some situations memory corruption can be caused by a driver overflowing on a DMA transfer causing corruption to physical pages rather than virtual pages as we see in typical pool. For more information on linked lists and list heads refer to the following MSDN article:
Singly and Doubly Linked Lists- http://msdn.microsoft.com/en-us/library/aa489548.aspx
For an example of a function that removes the first item of a linked list: http://msdn.microsoft.com/en-us/library/ms804330.aspx
*RemoveHeadList() - The RemoveHeadList routine removes an entry from the beginning of a doubly linked list of LIST_ENTRY structures.