This is Ron Stock from the Global Escalation Services team and I recently worked with a customer to determine which misbehaving driver was crashing their critical server. This particular crash was a STOP 0x00000020 which maps to KERNEL_APC_PENDING_DURING_EXIT.
The KERNEL_APC_PENDING_DURING_EXIT bugcheck type indicates the APC disable count for a thread was not equal to zero when the thread exited. The APC disable count is a field in the _KTHREAD structure and it is decremented when drivers disable APCs by calling functions such as KeEnterCriticalRegion, FsRtlEnterFileSystem or by acquiring a mutex. Disabling APC delivery to a thread is the equivalent of hanging the “Do Not Disturb” sign on your door. When drivers need to perform a critical operation they ‘hang the sign on the door’ to prevent interruption from APCs. When the same driver fails to ‘take the sign off the door’ by calling KeLeaveCriticalRegion, FsRtlExitFileSystem or KeReleaseMutex, the APC disable count is never incremented back to its original value. This forgetful behavior causes a bugcheck because the APC disable count is checked when the thread is exiting. The OS expects this value to be zero on thread exit.
In my case the value was 0xffff (negative 1) indicating a driver had forgot to remove the ‘Do Not Disturb’ sign.
kd> !analyze –v
KERNEL_APC_PENDING_DURING_EXIT (20)
The key data item is the thread's APC disable count.
If this is non-zero, then this is the source of the problem.
Arguments:
Arg1: 0000000000000000, The address of the APC found pending during exit.
Arg2: 000000000000ffff, The thread's APC disable count
Arg3: 0000000000000000, The current IRQL
Arg4: 0000000000000001
Because the value is decremented earlier in time the current call stack is not particularly useful. It merely shows the thread exiting under normal conditions.
0: kd> !thread -1 e
THREAD fffffa8049f04b50 Cid 0004.0998 Teb: 0000000000000000 Win32Thread: 0000000000000000 RUNNING on processor 0
Not impersonating
DeviceMap fffff8a000007ee0
Owning Process fffffa8048cad9e0 Image: System
Attached Process N/A Image: N/A
Wait Start TickCount 11503325 Ticks: 0
Context Switch Count 185715 IdealProcessor: 0
UserTime 00:00:00.000
KernelTime 00:00:06.078
Win32 Start Address srv2!SrvProcWorkerThread(0xfffff88003c4b400)
Stack Init fffff88005078db0 Current fffff880050789b0
Base fffff88005079000 Limit fffff88005073000 Call 0
Priority 15 BasePriority 15 UnusualBoost 0 ForegroundBoost 0 IoPriority 2 PagePriority 5
Child-SP RetAddr Call Site
fffff880`05078b08 fffff800`01984bd9 nt!KeBugCheckEx
fffff880`05078b10 fffff800`019a1a3d nt!PspExitThread+0xffffffff`fffe3ae9
fffff880`05078c10 fffff800`0195bc8a nt!PspTerminateThreadByPointer+0x4d
fffff880`05078c60 fffff880`03c56769 nt!PsTerminateSystemThread+0x22
fffff880`05078c90 fffff880`03c4b5b6 srv2!SrvProcTerminateWorkerThreadInternal+0x99
fffff880`05078cc0 fffff800`01966e5a srv2!SrvProcWorkerThread+0x1b6
fffff880`05078d40 fffff800`016c0d26 nt!PspSystemThreadStartup+0x5a
fffff880`05078d80 00000000`00000000 nt!KxStartSystemThread+0x16
Driver Verifier is the ideal tool for this type of bugcheck. It has a feature called Critical Region logging which tracks the call stack and KTRHEAD value for each call to either KeEnterCriticalRegion() and KeLeaveCriticalRegion(). I had the customer enable this logging by selecting the “Miscellaneous checks” option in Driver Verifier using these steps-
After running through the steps above, we gathered another STOP 0x00000020 dump. I confirmed the “Miscellaneous checks” option was enabled by using the !verifier command
0: kd> !verifier
Verify Level 800 ... enabled options are:
Miscellaneous checks enabled
The stack in this new dump was in the same SrvProcWorkerThread thread exit path so we had a consistent pattern. The thread with the negative APC Disable count was fffffa804b5be040.
THREAD fffffa804b5be040 Cid 0004.082c Teb: 0000000000000000 Win32Thread: 0000000000000000 RUNNING on processor 0
Wait Start TickCount 4458237 Ticks: 0
Context Switch Count 36067 IdealProcessor: 0
KernelTime 00:00:01.218
Win32 Start Address srv2!SrvProcWorkerThread(0xfffff88004827400)
Stack Init fffff88005cc6db0 Current fffff88005cc69b0
Base fffff88005cc7000 Limit fffff88005cc1000 Call 0
fffff880`05cc6b08 fffff800`0198dbd9 nt!KeBugCheckEx
fffff880`05cc6b10 fffff800`019aaa3d nt!PspExitThread+0xffffffff`fffe3ae9
fffff880`05cc6c10 fffff800`01964c8a nt!PspTerminateThreadByPointer+0x4d
fffff880`05cc6c60 fffff880`048326d9 nt!PsTerminateSystemThread+0x22
fffff880`05cc6c90 fffff880`048275b6 srv2!SrvProcTerminateWorkerThreadInternal+0x99
fffff880`05cc6cc0 fffff800`0196fe5a srv2!SrvProcWorkerThread+0x1b6
fffff880`05cc6d40 fffff800`016c9d26 nt!PspSystemThreadStartup+0x5a
fffff880`05cc6d80 00000000`00000000 nt!KxStartSystemThread+0x16
I dumped the Critical Region log by using the !verifier 200 command. The Critical Region log has enough room for 128 stacks. After dumping the log, the first thing to do is to find the KTHREAD value of the thread with the non-zero APC disable count. Unfortunately in my case thread fffffa804b5be040 didn’t appear in the log. In fact all 128 stacks had a driver named Suspect.sys calling KeEnterCriticalRegion or KeLeaveCriticalRegion. Note: To protect our vendor friends, I renamed the actual sys file in this article to suspect.sys.
The customer disabled the suspect.sys driver hoping this was the driver forgetting to re-enable APCs. If nothing else, this would perhaps remove the noisy suspect.sys from the log in the next dump.
0: kd> !verifier 200
Enter/Leave Critical Region log:
There are up to 0x80 entries in the log.
Displaying all the log entries.
======================================================================
Thread fffffa8048ce4b50
fffff80001b74293 nt!VerifierKeLeaveCriticalRegion+0xc3
fffff8800100aafa Suspect.sys+0xaafa
fffff88001001e30 Suspect.sys+0x1e30
fffff80001abc68c nt!IopLoadUnloadDriver+0x1c
fffff800016e1641 nt!ExpWorkerThread+0x111
fffff8000196ee5a nt!PspSystemThreadStartup+0x5a
fffff800016c8d26 nt!KiStartSystemThread+0x16
fffff80001b6b0a2 nt!VerifierKeEnterCriticalRegion+0x92
fffff880010062a3 Suspect.sys+0x62a3
fffff8800100a7e2 Suspect.sys+0xa7e2
Unfortunately, the system continued to crash and in the next dump the critical region log was completely empty. My guess is the complier was optimizing the KeEnterCriticalRegion and KeLeaveCriticalRegion calls in the driver, causing them to be inlined and skipping the call to VerifierKeLeaveCriticalRegion/VerifierKeEnterCriticalRegion. I needed another attack plan.
There is another Verifier option called I/O Verification and it works in a similar way to the steps below. Please note that this functionality is not documented and may be subject to change at any time.
I had the customer enable I/O Verification using these steps-
As we expected, the machine crashed again because of the APC Disable issue. Because we enabled I/O Verification, the bugcheck type changed to DRIVER_VERIFIER_DETECTED_VIOLATION and now we have a smoking gun.
I used the !verifier command to review the Verifier settings. The output below shows “I\O subsystem checking enabled” which confirms I/O Verification was been enabled.
Verify Level 810 ... enabled options are:
Io subsystem checking enabled
The parameters to KeBugcheck reconfirmed the APC disable count was -1 (ffff). And this time we have an additional breadcrumb, the driver dispatch routine address.
DRIVER_VERIFIER_DETECTED_VIOLATION (c4)
A device driver attempting to corrupt the system has been caught. This is
because the driver was specified in the registry as being suspect (by the
administrator) and the kernel has enabled substantial checking of this driver.
If the driver attempts to corrupt the system, bugchecks 0xC4, 0xC1 and 0xA will
be among the most commonly seen crashes.
Arg1: 00000000000000c5, Thread APC disable count changed by driver dispatch routine.
Arg2: fffff88001345610, Driver dispatch routine address.
Arg3: 000000000000ffff, Current thread APC disable count.
Arg4: 0000000000000000, Thread APC disable count before calling driver dispatch routine.
The APC disable count is decremented each time a driver calls
KeEnterCriticalRegion, FsRtlEnterFileSystem, or acquires a mutex. The APC
disable count is incremented each time a driver calls KeLeaveCriticalRegion,
FsRtlExitFileSystem, or KeReleaseMutex. Since these calls should always be in
pairs, this value should be zero when a thread exits. A negative value
indicates that a driver has disabled APC calls without re-enabling them. A
positive value indicates that the reverse is true.
Notice the Verifier functions on the call stack which we I leveraged for the “saved state” information I discussed above in the I/O Verification architecture.
0: kd> kn
# Child-SP RetAddr Call Site
00 fffff880`05db1b08 fffff800`0174a9c0 nt!KeBugCheckEx
01 fffff880`05db1b10 fffff800`01b66b4ent!VfBugCheckNoStackUsage+0x30
02 fffff880`05db1b50 fffff800`01b6cc2e nt!VfAfterCallDriver+0x33e
03 fffff880`05db1ba0 fffff880`04054756 nt!IovCallDriver+0x57e
04 fffff880`05db1c00 fffff880`0404b7b0 srv2!Smb2ExecuteRead+0x9a6
05 fffff880`05db1c80 fffff880`0404b6fb srv2!SrvProcessPacket+0xa0
06 fffff880`05db1cc0 fffff800`01960e5a srv2!SrvProcWorkerThread+0x2fb
07 fffff880`05db1d40 fffff800`016bad26nt!PspSystemThreadStartup+0x5a
08 fffff880`05db1d80 00000000`00000000nt!KiStartSystemThread+0x16
Next I dumped the driver dispatch routine noted in the KeBugCheckoutput above using the ln command. This points to fltmgr!FltpDispatch which tells me we have a filter manager minifilter driver making calls to disable APCs but rudely forgetting to re-enable them. As I noted above we save the state info before the call to IoCallDriver().
0: kd> ln fffff88001345610
(fffff880`01345610) fltmgr!FltpDispatch | (fffff880`01345710) fltmgr!FltReleasePushLock
Exact matches:
fltmgr!FltpDispatch (<no parameter info>)
Now the goal was to determine which minifilter is leaving the “Do Not Disturb” sign on the door and forgetting to remove it. We can find this using the fltmgr device object. The “saved state” structure is passed to VfAfterCallDriver as the first parameter so I switched to the VfAfterCallDriverframe (second frame) to dig it out. I used the /r flag to show the original values of the registers for this frame.
0: kd> .frame /r 2
rax=0000000000000000 rbx=fffffa804b729790 rcx=00000000000000c4
rdx=00000000000000c5 rsi=fffffa804a3b0000 rdi=fffff8000183ce80
rip=fffff80001b66b4e rsp=fffff88005db1b50 rbp=fffffa804de8c290
r8=fffff88001345610 r9=000000000000ffff r10=fffff80001b7a640
r11=0000000000000000 r12=000000004de8c290 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0 nv up ei ng nz na pe nc
cs=0010 ss=0018 ds=002b es=002b fs=0053 gs=002b efl=00000282
nt!VfAfterCallDriver+0x33e:
fffff800`01b66b4e cc int 3
Parameter 1 on x64 is always passed via rcx. I dumped the assembly for VfAfterCallDriver and confirmed the value in rcx (the base of the saved state structure) is moved to rbx.
0: kd> u nt!VfAfterCallDriver
nt!VfAfterCallDriver:
fffff800`01b66810 48895c2410 mov qword ptr[rsp+10h],rbx
fffff800`01b66815 48896c2418 mov qword ptr[rsp+18h],rbp
fffff800`01b6681a 4889742420 mov qword ptr[rsp+20h],rsi
fffff800`01b6681f 57 push rdi
fffff800`01b66820 4154 push r12
fffff800`01b66822 4155 push r13
fffff800`01b66824 4883ec30 sub rsp,30h
fffff800`01b66828 488bfa mov rdi,rdx
fffff800`01b6682b 488bd9 mov rbx,rcx
The device object is stored in the save state information at offset 0xa0.
0: kd> ? fffffa804b729790 + 0xa0
Evaluate expression: -6046048151504 = fffffa80`4b729830
0: kd> dq fffffa80`4b729830 l1
fffffa80`4b729830 fffffa80`4a3b0060
0: kd> !devobj fffffa80`4a3b0060
Device object (fffffa804a3b0060) is for:
\FileSystem\FltMgr DriverObject fffffa80491fb7c0
Current Irp 00000000 RefCount 0 Type 00000008 Flags 00040000
DevExt fffffa804a3b01b0 DevObjExt fffffa804a3b0208
ExtensionFlags (0x80000800) DOE_DEFAULT_SD_PRESENT, DOE_DESIGNATED_FDO
Characteristics (0000000000)
AttachedTo (Lower) fffffa804a3b1030 \FileSystem\Ntfs
Device queue is not busy.
0: kd> !devstack fffffa80`4a3b0060
!DevObj !DrvObj !DevExt ObjectName
> fffffa804a3b0060 \FileSystem\FltMgr fffffa804a3b01b0
fffffa804a3b1030 \FileSystem\Ntfs fffffa804a3b1180
As http://msdn.microsoft.com/en-us/library/ff541610(v=vs.85).aspxexplains – “The filter manager is installed with Windows, but it becomes active only when a minifilter driver is loaded. The filter manager attaches to the file system stack for a target volume. A minifilter driver attaches to the file system stack indirectly, by registering with the filter manager for the I/O operations the minifilter driver chooses to filter.”
Using the power of the fltkd extension, I dumped the volume information associated with this device object. From the output below, we can extract the name of the filter attached to the volume. The culprit is named BadDriver.sys. The customer removed the driver and the problem went away long enough for the vendor to create an update for BadDriver.sys. Happy Ending!
0: kd> !fltkd.volume fffffa80`4a3b0060
FLT_VOLUME: fffffa804a3b0800 "\Device\HarddiskVolume3"
FLT_OBJECT: fffffa804a3b0800 [04000000] Volume
RundownRef : 0x000000000000020a (261)
PointerCount : 0x00000001
PrimaryLink : [fffffa804ae06810-fffffa804a2b16f0]
Frame : fffffa8049fcd420 "Frame 0"
Flags : [00000064] SetupNotifyCalledEnableNameCaching FilterAttached
FileSystemType : [00000002] FLT_FSTYPE_NTFS
VolumeLink : [fffffa804ae06810-fffffa804a2b16f0]
DeviceObject : fffffa804a3b0060
DiskDeviceObject : fffffa804a1c0350
FrameZeroVolume : fffffa804a3b0800
VolumeInNextFrame : 0000000000000000
Guid : "\??\Volume{552791b0-455d-11de-b7b9-00145eed6acc}"
CDODeviceName : "\Ntfs"
CDODriverName : "\FileSystem\Ntfs"
TargetedOpenCount : 258
Callbacks : (fffffa804a3b0910)
ContextLock : (fffffa804a3b0cf8)
VolumeContexts : (fffffa804a3b0d00) Count=0
StreamListCtrls : (fffffa804a3b0d08) rCount=2871
FileListCtrls : (fffffa804a3b0d88) rCount=0
NameCacheCtrl : (fffffa804a3b0e08)
InstanceList : (fffffa804a3b0890)
FLT_INSTANCE: fffffa804b5b1010 "BadDriver.sys Instance" "189600"
Recently I came across an instance where my debugger did not do what I wanted. Rarely do computers disobey me, but this one was unusually stubborn. There was no other option; I had to bend the debugger to my will.
There are many ways to make a computer program do what you want. If you have the source code you can rewrite and recompile the program. If you have a hex editor you can edit the code of the binary. A shim can be used to modify a program at runtime. In this instance I was in a hurry and I was ok with a temporary solution, so I used a debugger to change the execution of the debugger while it ran.
Debug a debugger? Can you do such a thing? Of course you can.
In this example a memory dump was captured of a system and I was asked to determine if the system had run out of desktop heap. Usually the !dskheap command can be used to determine how much heap has been used by each desktop. Unfortunately, this command failed me.
23: kd> !dskheap
Error Reading TotalFreeSize from nt!_HEAP @ fffffa8019c65c00
Failed to GetHeapInfo for desktop @fffffa8019c65c00
EnumDsktps failed on Winsta: 19c4f090FillWinstaArray failed
The error indicates that the command couldn’t read from the _HEAP structure at fffffa8019c65c00 for desktop fffffa8019c65c00. Further investigation found that reason I got this error is that the heap for the desktop in question is not valid memory. Because the memory is described by a prototype PTE I assume that the heap has not been initialized (Note: See Windows Internals’ Memory Management chapter for more information about proto PTEs).
23: kd> dt win32k!tagDESKTOP fffffa8019c65c00
+0x000 dwSessionId : 0
+0x008 pDeskInfo : 0xfffff900`c05e0a70 tagDESKTOPINFO
+0x010 pDispInfo : 0xfffff900`c0581e50 tagDISPLAYINFO
+0x018 rpdeskNext : 0xfffffa80`19c6ef20 tagDESKTOP
+0x020 rpwinstaParent : 0xfffffa80`19c4f090 tagWINDOWSTATION
+0x028 dwDTFlags : 0x110
+0x030 dwDesktopId : 0x19c65c00`00000003
+0x038 spmenuSys : (null)
+0x040 spmenuDialogSys : (null)
+0x048 spmenuHScroll : (null)
+0x050 spmenuVScroll : (null)
+0x058 spwndForeground : (null)
+0x060 spwndTray : (null)
+0x068 spwndMessage : 0xfffff900`c05e0d90 tagWND
+0x070 spwndTooltip : 0xfffff900`c05e0fa0 tagWND
+0x078 hsectionDesktop : 0xfffff8a0`00ef08e0 Void
+0x080 pheapDesktop : 0xfffff900`c05e0000 tagWIN32HEAP
+0x088 ulHeapSize : 0x18000
+0x090 cciConsole : _CONSOLE_CARET_INFO
+0x0a8 PtiList : _LIST_ENTRY [ 0xfffffa80`19c65ca8 - 0xfffffa80`19c65ca8 ]
+0x0b8 spwndTrack : (null)
+0x0c0 htEx : 0n0
+0x0c4 rcMouseHover : tagRECT
+0x0d4 dwMouseHoverTime : 0
+0x0d8 pMagInputTransform : (null)
23: kd> dd 0xfffff900`c05e0000
fffff900`c05e0000 ???????? ???????? ???????? ????????
fffff900`c05e0010 ???????? ???????? ???????? ????????
fffff900`c05e0020 ???????? ???????? ???????? ????????
fffff900`c05e0030 ???????? ???????? ???????? ????????
fffff900`c05e0040 ???????? ???????? ???????? ????????
fffff900`c05e0050 ???????? ???????? ???????? ????????
fffff900`c05e0060 ???????? ???????? ???????? ????????
fffff900`c05e0070 ???????? ???????? ???????? ????????
23: kd> !pte fffff900`c05e0000
VA fffff900c05e0000
PXE at FFFFF6FB7DBEDF90 PPE at FFFFF6FB7DBF2018 PDE at FFFFF6FB7E403010 PTE at FFFFF6FC80602F00
contains 000000076245C863 contains 0000000762569863 contains 000000045FA17863 contains F8A000F4F9780400
pfn 76245c ---DA--KWEV pfn 762569 ---DA--KWEV pfn45fa17 ---DA--KWEV not valid
Proto: FFFFF8A000F4F978
There are many desktops in this session and I wanted to know about the usage of the other desktops, but the !dskheap command stopped after just one error. I needed to force it to continue after this error, so I launched a debugger to debug my debugger. There is a command to do this, just run .dbgdbg.
23: kd> .dbgdbg
Debugger spawned, connect with
"-remotenpipe:icfenable,pipe=cdb_pipe,server=NINJA007"
For clarity I will call the original debugger where I ran !dskheap debugger1, and the new debugger spawned by .dbgdbg debugger2 .
Before switching to debugger2 I need to know what I am going to debug. The error message gives a hint about where to set a breakpoint, I am looking for a failure from GetHeapInfo.
I need to know which module GetHeapInfo is in, the .extmatch match command indicates which module contains the !dskheap command.
23: kd> .extmatch dskheap
!kdexts.dskheap
Switching to debugger2 I set a breakpoint on kdexts!GetHeapInfo. Use Ctrl+C to trigger a debug break in cdb (this is the same as a Ctrl+Break in windbg).
0:004> bp kdexts!GetHeapInfo
0:004> g
Switch back to debugger1 and run the !dskheap command.
In debugger2 I have hit the breakpoint.
Breakpoint 0 hit
kdexts!GetHeapInfo:
000007f9`4237b9b0 4055 push rbp
The error says GetHeapInfo failed, so I am interested in what this function returns. To see what GetHeapInfo returns I go up one level in the stack and set a new breakpoint on the code just after it returns. This new breakpoint will also dump the return value of GetHeapInfo (return values are always in the rax register).
0:000> gu
kdexts!EnumDsktps+0x197:
000007f9`4237b483 4885c0 test rax,rax
0:000> r rax
rax=0000000000000000
0:000> bc *
0:000> bp 000007f9`4237b483 "r rax"
0:000> g
The next time the breakpoint hit the return value was 1, which in this instance means GetHeapInfo failed. This is where I exerted my control over the computer: I forced the return value to 0.
rax=0000000000000001
0:000> r rax=0
I ran through the other breakpoints and changed rax as necessary.
Everything was going well, until the computer defied me again. The !dskheap output computes the percentage of heap usage by dividing the bytes used by the size of the heap. Unfortunately, the size of the heap was left at 0 for the two heaps where I changed the return value. It is well known that only Chuck Norris can divide by zero; to prevent a roundhouse kick to your computer the processor generates an exception.
(2d0.928): Integer divide-by-zero - code c0000094 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
kdexts!DisplayInfo+0x2ee:
000007f9`4237b90e 49f7f3 div rax,r11
0:000> r r11
r11=0000000000000000
Fortunately debugger1 handles the divide by zero exception and it is easy to run !dskheap again.
Back in debugger2 I set a new breakpoint on the div instruction that outputs the divisor. When the divisor (r11) is 0 I changed it to a non-zero value to avoid the wrath of Mr. Norris.
0:000> bp 000007f9`4237b90e
0:000> bp 000007f9`4237b90e "r r11"
breakpoint 1 redefined
r11=0000000000033333
0:000> r r11=1
r11=00000000000007ae
r11=0000000000013333
Finally, back in debugger1 I now have complete output for !dskheap. After a few strategic modifications of the program’s execution I got it to output the data I wanted. As it turns out we aren’t out of desktop heap after all.
Error Reading TotalFreeSize from nt!_HEAP @ fffffa801a53ea30
Winstation\Desktop Heap Size(KB) Used Rate(%)
------------------------------------------------------------
WinSta0\Default 20480 0%
WinSta0\Disconnect 0 0%
WinSta0\Winlogon 192 2%
Service-0x0-3e7$\Default 7680 1%
Service-0x0-3e4$\Default 7680 0%
Service-0x0-3e5$\Default 7680 0%
Service-0x0-26f46a$\Default 0 0%
Service-0x0-2706f2$\Default 7680 0%
------------------------------------------------------
Total Desktop: ( 51392 KB - 8 desktops)
Session ID: 0
============================================================