Before we can discuss pool corruption we must understand what pool is. Pool is kernel-mode memory used as storage space by drivers. Pool is organized much like a notepad you might use when taking notes from a lecture or a book: some notes are one line, others are many lines, and many different notes share the same page.
Memory is likewise organized into pages; a page is typically 4 KB. The Windows memory manager breaks this 4 KB page into smaller blocks. One block may be as small as 8 bytes or much larger. Each of these blocks sits side by side with the other blocks in the page.
The !pool command can be used to see the pool blocks stored in a page.
kd> !pool fffffa8003f42000
Pool page fffffa8003f42000 region is Nonpaged pool
*fffffa8003f42000 size: 410 previous size: 0 (Free) *Irp
Pooltag Irp : Io, IRP packets
fffffa8003f42410 size: 40 previous size: 410 (Allocated) MmSe
fffffa8003f42450 size: 150 previous size: 40 (Allocated) File
fffffa8003f425a0 size: 80 previous size: 150 (Allocated) Even
fffffa8003f42620 size: c0 previous size: 80 (Allocated) EtwR
fffffa8003f426e0 size: d0 previous size: c0 (Allocated) CcBc
fffffa8003f427b0 size: d0 previous size: d0 (Allocated) CcBc
fffffa8003f42880 size: 20 previous size: d0 (Free) Free
fffffa8003f428a0 size: d0 previous size: 20 (Allocated) Wait
fffffa8003f42970 size: 80 previous size: d0 (Allocated) CM44
fffffa8003f429f0 size: 80 previous size: 80 (Allocated) Even
fffffa8003f42a70 size: 80 previous size: 80 (Allocated) Even
fffffa8003f42af0 size: d0 previous size: 80 (Allocated) Wait
fffffa8003f42bc0 size: 80 previous size: d0 (Allocated) CM44
fffffa8003f42c40 size: d0 previous size: 80 (Allocated) Wait
fffffa8003f42d10 size: 230 previous size: d0 (Allocated) ALPC
fffffa8003f42f40 size: c0 previous size: 230 (Allocated) EtwR
Because many pool allocations are stored in the same page, it is critical that every driver use only the space it has allocated. If DriverA uses more space than it allocated, it will write into the next driver's space (DriverB's) and corrupt DriverB's data. This overwrite into the next driver's space is called a buffer overflow. Later, either the memory manager or DriverB will attempt to use this corrupted memory and will encounter unexpected data, which typically results in a blue screen.
The NotMyFault application from Sysinternals has an option to force a buffer overflow. This can be used to demonstrate pool corruption. Choosing the “Buffer overflow” option and clicking “Crash” will cause a buffer overflow in pool. The system may not immediately blue screen after clicking the Crash button. The system will remain stable until something attempts to use the corrupted memory. Using the system will often eventually result in a blue screen.
Often pool corruption appears as a stop 0x19 BAD_POOL_HEADER or a stop 0xC2 BAD_POOL_CALLER. These stop codes make it easy to determine that pool corruption is involved in the crash. However, the effects of accessing unexpected memory vary widely, so pool corruption can lead to many different types of bugchecks.
As with any blue screen dump analysis the best place to start is with !analyze -v. This command will display the stop code and parameters, and do some basic interpretation of the crash.
kd> !analyze -v
* Bugcheck Analysis *
An exception happened while executing a system service routine.
Arg1: 00000000c0000005, Exception code that caused the bugcheck
Arg2: fffff8009267244a, Address of the instruction which caused the bugcheck
Arg3: fffff88004763560, Address of the context record for the exception that caused the bugcheck
Arg4: 0000000000000000, zero.
In my example the bugcheck was a stop 0x3B SYSTEM_SERVICE_EXCEPTION. The first parameter of this stop code is c0000005, which is a status code for an access violation. An access violation is an attempt to access invalid memory (this error is not related to permissions). Status codes can be looked up in the WDK header ntstatus.h.
The !analyze -v command also provides a helpful shortcut to get into the context of the failure.
CONTEXT: fffff88004763560 -- (.cxr 0xfffff88004763560;r)
Running this command shows us the registers at the time of the crash.
kd> .cxr 0xfffff88004763560
rax=4f4f4f4f4f4f4f4f rbx=fffff80092690460 rcx=fffff800926fbc60
rdx=0000000000000000 rsi=0000000000001000 rdi=0000000000000000
rip=fffff8009267244a rsp=fffff88004763f60 rbp=fffff8009268fb40
r8=fffffa8001a1b820 r9=0000000000000001 r10=fffff800926fbc60
r11=0000000000000011 r12=0000000000000000 r13=fffff8009268fb48
iopl=0 nv up ei pl nz na po nc
cs=0010 ss=0018 ds=002b es=002b fs=0053 gs=002b efl=00010206
fffff800`9267244a 4c8b4808 mov r9,qword ptr [rax+8] ds:002b:4f4f4f4f`4f4f4f57=????????????????
From the above output we can see that the crash occurred in ExAllocatePoolWithTag, which is a good indication that the crash is due to pool corruption. Often an engineer looking at a dump will stop at this point and conclude that the crash was caused by corruption; however, we can go further.
The instruction we failed on was dereferencing rax+8. The rax register contains 4f4f4f4f4f4f4f4f, which does not fit the canonical form required for pointers on x64 systems: bits 48 through 63 must all match bit 47. This tells us the system crashed because the data in rax is expected to be a pointer but is not one.
To determine why rax does not contain the expected data we must examine the instructions prior to where the failure occurred.
kd> ub .
nt!KzAcquireQueuedSpinLock [inlined in nt!ExAllocatePoolWithTag+0x421]:
fffff800`92672429 488d542440 lea rdx,[rsp+40h]
fffff800`9267242e 49875500 xchg rdx,qword ptr [r13]
fffff800`92672432 4885d2 test rdx,rdx
fffff800`92672435 0f85c3030000 jne nt!ExAllocatePoolWithTag+0x7ec (fffff800`926727fe)
fffff800`9267243b 48391b cmp qword ptr [rbx],rbx
fffff800`9267243e 0f8464060000 je nt!ExAllocatePoolWithTag+0xa94 (fffff800`92672aa8)
fffff800`92672444 4c8b03 mov r8,qword ptr [rbx]
fffff800`92672447 498b00 mov rax,qword ptr [r8]
The assembly shows that rax originated from the data pointed to by r8. The .cxr command we ran earlier shows that r8 is fffffa8001a1b820. If we examine the data at fffffa8001a1b820 we see that it matches the contents of rax, which confirms this memory is the source of the unexpected data in rax.
kd> dq fffffa8001a1b820 l1
fffffa80`01a1b820  4f4f4f4f`4f4f4f4f
To determine if this unexpected data is caused by pool corruption we can use the !pool command.
kd> !pool fffffa8001a1b820
Pool page fffffa8001a1b820 region is Nonpaged pool
fffffa8001a1b000 size: 810 previous size: 0 (Allocated) None
fffffa8001a1b810 doesn't look like a valid small pool allocation, checking to see
if the entire page is actually part of a large page allocation...
fffffa8001a1b810 is not a valid large pool allocation, checking large session pool...
fffffa8001a1b810 is freed (or corrupt) pool
Bad previous allocation size @fffffa8001a1b810, last size was 81
*** An error (or corruption) in the pool was detected;
*** Attempting to diagnose the problem.
*** Use !poolval fffffa8001a1b000 for more details.
Pool page [ fffffa8001a1b000 ] is __inVALID.
Analyzing linked list...
[ fffffa8001a1b000 --> fffffa8001a1b010 (size = 0x10 bytes)]: Corrupt region
Scanning for single bit errors...
The above output does not look like the output of the !pool command we ran earlier. This is because corruption of the pool header prevented the command from walking the chain of allocations.
The above output shows that there is an allocation at fffffa8001a1b000 of size 810. If we look at this memory we should see a pool header. Instead what we see is a pattern of 4f4f4f4f`4f4f4f4f.
kd> dq fffffa8001a1b000 + 810
fffffa80`01a1b810 4f4f4f4f`4f4f4f4f 4f4f4f4f`4f4f4f4f
fffffa80`01a1b820 4f4f4f4f`4f4f4f4f 4f4f4f4f`4f4f4f4f
fffffa80`01a1b830 4f4f4f4f`4f4f4f4f 00574f4c`46524556
fffffa80`01a1b840 00000000`00000000 00000000`00000000
fffffa80`01a1b850 00000000`00000000 00000000`00000000
fffffa80`01a1b860 00000000`00000000 00000000`00000000
fffffa80`01a1b870 00000000`00000000 00000000`00000000
fffffa80`01a1b880 00000000`00000000 00000000`00000000
At this point we can be confident that the system crashed because of pool corruption.
Because the corruption occurred in the past, and a dump is a snapshot of the current state of the system, there is no concrete evidence to indicate how the memory came to be corrupted. It is possible that the driver that allocated the pool block immediately preceding the corruption is the one that wrote past the end of its allocation and caused this corruption. This pool block is marked with the tag “None”; we can search memory for this tag to determine which drivers use it.
kd> !for_each_module s -a @#Base @#End "None"
fffff800`92411bc2 4e 6f 6e 65 e9 45 04 26-00 90 90 90 90 90 90 90 None.E.&........
kd> u fffff800`92411bc2-1
fffff800`92411bc1 b84e6f6e65 mov eax,656E6F4Eh
fffff800`92411bc6 e945042600 jmp nt!ExAllocatePoolWithTag (fffff800`92672010)
fffff800`92411bcb 90 nop
The file Pooltag.txt lists the pool tags used for pool allocations by kernel-mode components and drivers supplied with Windows, the associated file or component (if known), and the name of the component. Pooltag.txt is installed with Debugging Tools for Windows (in the triage folder) and with the Windows WDK (in \tools\other\platform\poolmon). Pooltag.txt shows the following for this tag:
None - <unknown> - call to ExAllocatePool
Unfortunately, we find that this tag is used whenever a driver calls ExAllocatePool, an API that does not take a tag parameter. This does not allow us to determine which driver allocated the block prior to the corruption. Even if we could tie the tag back to a driver, it may not be sufficient to conclude that the driver using this tag is the one that corrupted the memory.
The next step should be to enable special pool and hope to catch the corruptor in the act. We will discuss special pool in our next article.
This is a really nice article, thanks!
Thanks, I was just thinking about a presentation on pool corruption for my colleagues, as the number of crashes related to this topic is quite high. Now I can point everyone here :) Enabling special pool in verifier is a bit of a cumbersome process, and not all customers are willing to wait for the next crash. So in many cases we go for possible driver updates as a first step and enable verifier only in case of reoccurrence.
[In the next article we will cover several different techniques to enable special pool. There is a command line option for verifier which is less cumbersome and easier to provide to customers without deep technical experience.]
Wow!!! really cool one .. waiting for the next article :)
Super Article. thank you so much !
Nice article, thanks !
I can't wait for your next article, which will probably help me a lot with understanding a crashing VM I got (I suspect a malware).
nice and well written.
Is it possible to figure out the culprit driver, in our case myfault.sys? When I ran the tool I got different BSODs; on analyzing them I am able to figure out that the issue is caused by pool corruption, but not able to figure out the exact driver.
[Hi Ram. We just released part 2 of this article, it should answer your question. http://blogs.msdn.com/b/ntdebugging/archive/2013/08/22/understanding-pool-corruption-part-2-special-pool-for-buffer-overruns.aspx]