Leaving the Do Not Disturb Sign on the Door Will Cause the KERNEL_APC_PENDING_DURING_EXIT Bugcheck


This is Ron Stock from the Global Escalation Services team. I recently worked with a customer to determine which misbehaving driver was crashing their critical server. This particular crash was a STOP 0x00000020, which maps to KERNEL_APC_PENDING_DURING_EXIT.

 

The KERNEL_APC_PENDING_DURING_EXIT bugcheck type indicates the APC disable count for a thread was not equal to zero when the thread exited. The APC disable count is a field in the _KTHREAD structure, and it is decremented when drivers disable APCs by calling functions such as KeEnterCriticalRegion or FsRtlEnterFileSystem, or by acquiring a mutex. Disabling APC delivery to a thread is the equivalent of hanging the “Do Not Disturb” sign on your door. When drivers need to perform a critical operation they ‘hang the sign on the door’ to prevent interruption from APCs. When the same driver fails to ‘take the sign off the door’ by calling KeLeaveCriticalRegion, FsRtlExitFileSystem or KeReleaseMutex, the APC disable count is never incremented back to its original value. This forgetful behavior causes a bugcheck because the APC disable count is checked when the thread is exiting. The OS expects this value to be zero on thread exit.
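The bookkeeping described above can be sketched as a toy model. This is only an illustration in Python, not kernel code: the ThreadModel class and the two driver functions are hypothetical names, and the real count lives in _KTHREAD and is maintained by the kernel routines named above.

```python
class ThreadModel:
    """Toy stand-in for a thread's APC disable count (not the real _KTHREAD)."""

    def __init__(self):
        self.apc_disable_count = 0

    def enter_critical_region(self):
        # Hanging the sign on the door: the count goes down by one.
        self.apc_disable_count -= 1

    def leave_critical_region(self):
        # Taking the sign off the door: the count goes back up by one.
        self.apc_disable_count += 1

    def exit_thread(self):
        # The check made at thread exit: the count must be back to zero.
        if self.apc_disable_count != 0:
            raise RuntimeError(
                "KERNEL_APC_PENDING_DURING_EXIT: APC disable count = %d"
                % self.apc_disable_count
            )


def well_behaved_driver(thread):
    thread.enter_critical_region()
    thread.leave_critical_region()


def forgetful_driver(thread):
    thread.enter_critical_region()
    # No matching leave_critical_region(): the sign stays on the door.


good = ThreadModel()
well_behaved_driver(good)
good.exit_thread()  # passes the check, the count is zero again

bad = ThreadModel()
forgetful_driver(bad)
try:
    bad.exit_thread()
except RuntimeError as err:
    print(err)
```

Note that the failure surfaces at thread exit, long after the forgetful call returned, which is why the crashing stack rarely points at the guilty driver.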

 

In my case the value was 0xffff (negative 1), indicating a driver had forgotten to remove the ‘Do Not Disturb’ sign.
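To see why 0xffff reads as negative 1, here is a small sketch that treats the value as a 16-bit two's-complement number. The 16-bit width is an assumption inferred from the value shown in the dump, and as_signed_16 is a hypothetical helper name.

```python
def as_signed_16(value):
    """Interpret the low 16 bits of value as a two's-complement integer."""
    value &= 0xFFFF
    # If the top bit of the 16-bit field is set, the number is negative.
    return value - 0x10000 if value & 0x8000 else value


arg2 = 0x000000000000FFFF  # Arg2 from the bugcheck, zero-extended to 64 bits
print(as_signed_16(arg2))
```

Under the same assumption, a value of 0xfffe would mean two unbalanced calls.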

 

kd> !analyze -v

 

KERNEL_APC_PENDING_DURING_EXIT (20)

The key data item is the thread's APC disable count.

If this is non-zero, then this is the source of the problem.

Arguments:

Arg1: 0000000000000000, The address of the APC found pending during exit.

Arg2: 000000000000ffff, The thread's APC disable count

Arg3: 0000000000000000, The current IRQL

Arg4: 0000000000000001

 

Because the value is decremented earlier in time the current call stack is not particularly useful. It merely shows the thread exiting under normal conditions.

 

0: kd> !thread -1 e

THREAD fffffa8049f04b50  Cid 0004.0998  Teb: 0000000000000000 Win32Thread: 0000000000000000 RUNNING on processor 0

Not impersonating

DeviceMap                 fffff8a000007ee0

Owning Process            fffffa8048cad9e0       Image:         System

Attached Process          N/A            Image:         N/A

Wait Start TickCount      11503325       Ticks: 0

Context Switch Count      185715         IdealProcessor: 0

UserTime                  00:00:00.000

KernelTime                00:00:06.078

Win32 Start Address srv2!SrvProcWorkerThread(0xfffff88003c4b400)

Stack Init fffff88005078db0 Current fffff880050789b0

Base fffff88005079000 Limit fffff88005073000 Call 0

Priority 15 BasePriority 15 UnusualBoost 0 ForegroundBoost 0 IoPriority 2 PagePriority 5

Child-SP          RetAddr           Call Site

fffff880`05078b08 fffff800`01984bd9 nt!KeBugCheckEx

fffff880`05078b10 fffff800`019a1a3d nt!PspExitThread+0xffffffff`fffe3ae9

fffff880`05078c10 fffff800`0195bc8a nt!PspTerminateThreadByPointer+0x4d

fffff880`05078c60 fffff880`03c56769 nt!PsTerminateSystemThread+0x22

fffff880`05078c90 fffff880`03c4b5b6 srv2!SrvProcTerminateWorkerThreadInternal+0x99

fffff880`05078cc0 fffff800`01966e5a srv2!SrvProcWorkerThread+0x1b6

fffff880`05078d40 fffff800`016c0d26 nt!PspSystemThreadStartup+0x5a

fffff880`05078d80 00000000`00000000 nt!KxStartSystemThread+0x16

 

Driver Verifier is the ideal tool for this type of bugcheck. It has a feature called Critical Region logging, which tracks the call stack and KTHREAD value for each call to KeEnterCriticalRegion() or KeLeaveCriticalRegion(). I had the customer enable this logging by selecting the “Miscellaneous checks” option in Driver Verifier using these steps:

  • Run Verifier.exe
  • Select “Create custom settings (For code developers)”
  • Select individual settings from a full list
  • Select Miscellaneous checks
  • Select Driver Names from a list
  • Manually choose all of the third-party drivers.
  • Reboot after making the changes. 

 

After running through the steps above, we gathered another STOP 0x00000020 dump. I confirmed the “Miscellaneous checks” option was enabled by using the !verifier command.

 

0: kd> !verifier

 

Verify Level 800 ... enabled options are:

      Miscellaneous checks enabled

 

The stack in this new dump was in the same SrvProcWorkerThread thread exit path so we had a consistent pattern. The thread with the negative APC Disable count was fffffa804b5be040.

 

0: kd> !thread -1 e

THREAD fffffa804b5be040 Cid 0004.082c  Teb: 0000000000000000 Win32Thread: 0000000000000000 RUNNING on processor 0

Not impersonating

DeviceMap                 fffff8a000007ee0

Owning Process            fffffa8048cad9e0       Image:         System

Attached Process          N/A            Image:         N/A

Wait Start TickCount      4458237        Ticks: 0

Context Switch Count      36067          IdealProcessor: 0            

UserTime                  00:00:00.000

KernelTime                00:00:01.218

Win32 Start Address srv2!SrvProcWorkerThread(0xfffff88004827400)

Stack Init fffff88005cc6db0 Current fffff88005cc69b0

Base fffff88005cc7000 Limit fffff88005cc1000 Call 0

Priority 15 BasePriority 15 UnusualBoost 0 ForegroundBoost 0 IoPriority 2 PagePriority 5

Child-SP          RetAddr           Call Site

fffff880`05cc6b08 fffff800`0198dbd9 nt!KeBugCheckEx

fffff880`05cc6b10 fffff800`019aaa3d nt!PspExitThread+0xffffffff`fffe3ae9

fffff880`05cc6c10 fffff800`01964c8a nt!PspTerminateThreadByPointer+0x4d

fffff880`05cc6c60 fffff880`048326d9 nt!PsTerminateSystemThread+0x22

fffff880`05cc6c90 fffff880`048275b6 srv2!SrvProcTerminateWorkerThreadInternal+0x99

fffff880`05cc6cc0 fffff800`0196fe5a srv2!SrvProcWorkerThread+0x1b6

fffff880`05cc6d40 fffff800`016c9d26 nt!PspSystemThreadStartup+0x5a

fffff880`05cc6d80 00000000`00000000 nt!KxStartSystemThread+0x16

 

I dumped the Critical Region log by using the !verifier 200 command. The Critical Region log has enough room for 128 stacks. After dumping the log, the first thing to do is to find the KTHREAD value of the thread with the non-zero APC disable count. Unfortunately in my case thread fffffa804b5be040 didn’t appear in the log. In fact all 128 stacks had a driver named Suspect.sys calling KeEnterCriticalRegion or KeLeaveCriticalRegion.  Note: To protect our vendor friends, I renamed the actual sys file in this article to suspect.sys.

 

The customer disabled the suspect.sys driver hoping this was the driver forgetting to re-enable APCs. If nothing else, this would perhaps remove the noisy suspect.sys from the log in the next dump.

 

0: kd> !verifier 200

 

Enter/Leave Critical Region log:

There are up to 0x80 entries in the log.

 

Displaying all the log entries.

 

======================================================================

Thread fffffa8048ce4b50

fffff80001b74293 nt!VerifierKeLeaveCriticalRegion+0xc3

fffff8800100aafa Suspect.sys+0xaafa

fffff88001001e30 Suspect.sys+0x1e30

fffff80001abc68c nt!IopLoadUnloadDriver+0x1c

fffff800016e1641 nt!ExpWorkerThread+0x111

fffff8000196ee5a nt!PspSystemThreadStartup+0x5a

fffff800016c8d26 nt!KiStartSystemThread+0x16

======================================================================

Thread fffffa8048ce4b50

fffff80001b6b0a2 nt!VerifierKeEnterCriticalRegion+0x92

fffff880010062a3 Suspect.sys+0x62a3

fffff8800100a7e2 Suspect.sys+0xa7e2

fffff88001001e30 Suspect.sys+0x1e30

fffff80001abc68c nt!IopLoadUnloadDriver+0x1c

fffff800016e1641 nt!ExpWorkerThread+0x111

fffff8000196ee5a nt!PspSystemThreadStartup+0x5a
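A bounded log explains how the thread of interest can be missing. The sketch below assumes the log behaves like a fixed-size buffer that discards its oldest entries once full; that eviction behavior is my assumption and is not confirmed by the dump. The record and entries_for helpers are hypothetical names.

```python
from collections import deque

LOG_CAPACITY = 0x80  # "There are up to 0x80 entries in the log."


def record(log, thread, call):
    """Append one (thread, call) entry; a full deque drops its oldest entry."""
    log.append((thread, call))


def entries_for(log, thread):
    """Return the entries still in the log for a given thread."""
    return [entry for entry in log if entry[0] == thread]


log = deque(maxlen=LOG_CAPACITY)

# The thread we care about makes one unbalanced call early on.
record(log, 0xfffffa804b5be040, "KeEnterCriticalRegion")

# A chatty driver then logs balanced enter/leave pairs on another thread.
for _ in range(LOG_CAPACITY // 2):
    record(log, 0xfffffa8048ce4b50, "KeEnterCriticalRegion")
    record(log, 0xfffffa8048ce4b50, "KeLeaveCriticalRegion")

print(len(log), entries_for(log, 0xfffffa804b5be040))
```

In this model, a driver that enters and leaves critical regions constantly fills all 128 slots and pushes out the one entry that mattered.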

 

Unfortunately, the system continued to crash and in the next dump the critical region log was completely empty. My guess is the compiler was optimizing the KeEnterCriticalRegion and KeLeaveCriticalRegion calls in the driver, causing them to be inlined and skipping the call to VerifierKeLeaveCriticalRegion/VerifierKeEnterCriticalRegion. I needed another attack plan.

 

There is another Verifier option called I/O Verification, and it works roughly as described in the steps below.  Please note that this functionality is not documented and may be subject to change at any time.

  1. A call to IoCallDriver() is made to send an IO packet to a driver associated with a device.
  2. Verifier hooks the call.
  3. Verifier creates a structure to record state info.
  4. Verifier fills in the structure with data including the thread’s APC Disable Count.
  5. Next Verifier calls the normal IoCallDriver() routine to “continue” the call made in step 1.
  6. The driver does its work (disables and re-enables APCs as needed)
  7. The call to IoCallDriver() returns when the driver is finished.
  8. Verifier checks the real APC count in the thread. Next it compares the value to the recorded value in the structure from step 4. If the two values do not match, then Verifier crashes the machine so we can pull the bad driver out of the dump.
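The steps above can be sketched as a wrapper around the driver call. This is a simplified Python model, not Verifier's implementation: Thread, verified_call_driver, and the two dispatch functions are hypothetical names, and the real saved state structure is undocumented.

```python
class VerifierViolation(Exception):
    """Stands in for the DRIVER_VERIFIER_DETECTED_VIOLATION bugcheck."""


class Thread:
    def __init__(self):
        self.apc_disable_count = 0


def verified_call_driver(thread, dispatch_routine):
    # Steps 3-4: record the APC disable count before the driver runs.
    saved_count = thread.apc_disable_count
    # Steps 5-7: continue the original call and let the driver do its work.
    status = dispatch_routine(thread)
    # Step 8: compare the live count with the recorded one.
    if thread.apc_disable_count != saved_count:
        raise VerifierViolation(
            "%s changed the APC disable count from %d to %d"
            % (dispatch_routine.__name__, saved_count, thread.apc_disable_count)
        )
    return status


def balanced_dispatch(thread):
    thread.apc_disable_count -= 1  # disable APCs
    thread.apc_disable_count += 1  # re-enable APCs
    return "STATUS_SUCCESS"


def leaky_dispatch(thread):
    thread.apc_disable_count -= 1  # disable APCs and never re-enable them
    return "STATUS_SUCCESS"


print(verified_call_driver(Thread(), balanced_dispatch))
try:
    verified_call_driver(Thread(), leaky_dispatch)
except VerifierViolation as err:
    print(err)
```

The value of this approach is timing: the mismatch is caught as soon as the dispatch routine returns, while the offending routine can still be named, instead of at thread exit.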

 

I had the customer enable I/O Verification using these steps:

  • Run Verifier.exe
  • Select “Create custom settings (For code developers)”
  • Select individual settings from a full list
  • Select I/O Verification
  • Select Driver Names from a list
  • Manually choose all of the third-party drivers.
  • Reboot after making the changes. 

 

As we expected, the machine crashed again because of the APC Disable issue.  Because we enabled I/O Verification, the bugcheck type changed to DRIVER_VERIFIER_DETECTED_VIOLATION and now we have a smoking gun.

 

I used the !verifier command to review the Verifier settings. The output below shows “Io subsystem checking enabled”, which confirms I/O Verification had been enabled.

 

0: kd> !verifier

 

Verify Level 810 ... enabled options are:

      Io subsystem checking enabled

 

The bugcheck parameters reconfirmed the APC disable count was -1 (ffff). This time we have an additional breadcrumb: the driver dispatch routine address.

 

DRIVER_VERIFIER_DETECTED_VIOLATION (c4)

A device driver attempting to corrupt the system has been caught.  This is

because the driver was specified in the registry as being suspect (by the

administrator) and the kernel has enabled substantial checking of this driver.

If the driver attempts to corrupt the system, bugchecks 0xC4, 0xC1 and 0xA will

be among the most commonly seen crashes.

Arguments:

Arg1: 00000000000000c5, Thread APC disable count changed by driver dispatch routine.

Arg2: fffff88001345610, Driver dispatch routine address.

Arg3: 000000000000ffff, Current thread APC disable count.

Arg4: 0000000000000000, Thread APC disable count before calling driver dispatch routine.

      The APC disable count is decremented each time a driver calls

      KeEnterCriticalRegion, FsRtlEnterFileSystem, or acquires a mutex. The APC

      disable count is incremented each time a driver calls KeLeaveCriticalRegion,

      FsRtlExitFileSystem, or KeReleaseMutex. Since these calls should always be in

      pairs, this value should be zero when a thread exits. A negative value

      indicates that a driver has disabled APC calls without re-enabling them. A

      positive value indicates that the reverse is true.

 

Notice the Verifier functions on the call stack, which I leveraged for the “saved state” information I discussed above in the I/O Verification architecture.

 

0: kd> kn

# Child-SP          RetAddr           Call Site

00 fffff880`05db1b08 fffff800`0174a9c0 nt!KeBugCheckEx

01 fffff880`05db1b10 fffff800`01b66b4e nt!VfBugCheckNoStackUsage+0x30

02 fffff880`05db1b50 fffff800`01b6cc2e nt!VfAfterCallDriver+0x33e

03 fffff880`05db1ba0 fffff880`04054756 nt!IovCallDriver+0x57e

04 fffff880`05db1c00 fffff880`0404b7b0 srv2!Smb2ExecuteRead+0x9a6

05 fffff880`05db1c80 fffff880`0404b6fb srv2!SrvProcessPacket+0xa0

06 fffff880`05db1cc0 fffff800`01960e5a srv2!SrvProcWorkerThread+0x2fb

07 fffff880`05db1d40 fffff800`016bad26 nt!PspSystemThreadStartup+0x5a

08 fffff880`05db1d80 00000000`00000000 nt!KiStartSystemThread+0x16

 

Next I resolved the driver dispatch routine address noted in the bugcheck output above using the ln command. It points to fltmgr!FltpDispatch, which tells me we have a filter manager minifilter driver making calls to disable APCs but rudely forgetting to re-enable them. As I noted above, Verifier saves the state info before the call to IoCallDriver().

 

0: kd> ln fffff88001345610

(fffff880`01345610)   fltmgr!FltpDispatch   |  (fffff880`01345710)   fltmgr!FltReleasePushLock

Exact matches:

    fltmgr!FltpDispatch (<no parameter info>)

 

Now the goal was to determine which minifilter is leaving the “Do Not Disturb” sign on the door and forgetting to remove it. We can find this using the fltmgr device object.  The “saved state” structure is passed to VfAfterCallDriver as the first parameter, so I switched to the VfAfterCallDriver frame (frame 02 in the stack above) to dig it out. I used the /r flag to show the register values for this frame.

 

0: kd> .frame /r 2

02 fffff880`05db1b50 fffff800`01b6cc2e nt!VfAfterCallDriver+0x33e

rax=0000000000000000 rbx=fffffa804b729790 rcx=00000000000000c4

rdx=00000000000000c5 rsi=fffffa804a3b0000 rdi=fffff8000183ce80

rip=fffff80001b66b4e rsp=fffff88005db1b50 rbp=fffffa804de8c290

r8=fffff88001345610   r9=000000000000ffff r10=fffff80001b7a640

r11=0000000000000000 r12=000000004de8c290 r13=0000000000000000

r14=0000000000000000 r15=0000000000000000

iopl=0         nv up ei ng nz na pe nc

cs=0010  ss=0018  ds=002b  es=002b  fs=0053  gs=002b             efl=00000282

nt!VfAfterCallDriver+0x33e:

fffff800`01b66b4e cc              int     3

 

On x64 Windows the first integer or pointer parameter is passed in rcx. I dumped the assembly for VfAfterCallDriver and confirmed the value in rcx (the base of the saved state structure) is moved to rbx.

 

0: kd> u nt!VfAfterCallDriver

nt!VfAfterCallDriver:

fffff800`01b66810 48895c2410      mov     qword ptr[rsp+10h],rbx

fffff800`01b66815 48896c2418      mov     qword ptr[rsp+18h],rbp

fffff800`01b6681a 4889742420      mov     qword ptr[rsp+20h],rsi

fffff800`01b6681f 57              push    rdi

fffff800`01b66820 4154            push    r12

fffff800`01b66822 4155            push    r13

fffff800`01b66824 4883ec30        sub     rsp,30h

fffff800`01b66828 488bfa          mov     rdi,rdx

fffff800`01b6682b 488bd9          mov     rbx,rcx

 

The device object is stored in the saved state information at offset 0xa0.

 

0: kd> ? fffffa804b729790 + 0xa0

Evaluate expression: -6046048151504 = fffffa80`4b729830

 

0: kd> dq fffffa80`4b729830 l1

fffffa80`4b729830  fffffa80`4a3b0060
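The arithmetic behind the ? command can be reproduced with a short sketch. The 0xa0 offset is specific to this dump and undocumented, so treat it as an assumption on any other build; to_signed_64 and windbg_hex are hypothetical helper names.

```python
SAVED_STATE = 0xfffffa804b729790  # rbx in the VfAfterCallDriver frame
DEVICE_OBJECT_OFFSET = 0xa0       # offset used in this dump; undocumented


def to_signed_64(value):
    """Interpret a 64-bit value as two's complement, as the debugger prints it."""
    value &= 0xFFFFFFFFFFFFFFFF
    return value - (1 << 64) if value >> 63 else value


def windbg_hex(value):
    """Format a 64-bit value the way the debugger does, split by a backtick."""
    text = "%016x" % value
    return text[:8] + "`" + text[8:]


field_address = SAVED_STATE + DEVICE_OBJECT_OFFSET
print("Evaluate expression:", to_signed_64(field_address), "=", windbg_hex(field_address))
```

The large negative decimal is the same kernel address read as a signed 64-bit number, which is why it looks alarming but is harmless.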

 

0: kd> !devobj fffffa80`4a3b0060

Device object (fffffa804a3b0060) is for:

  \FileSystem\FltMgr DriverObject fffffa80491fb7c0

Current Irp 00000000 RefCount 0 Type 00000008 Flags 00040000

DevExt fffffa804a3b01b0 DevObjExt fffffa804a3b0208

ExtensionFlags (0x80000800)  DOE_DEFAULT_SD_PRESENT, DOE_DESIGNATED_FDO

Characteristics (0000000000) 

AttachedTo (Lower) fffffa804a3b1030 \FileSystem\Ntfs

Device queue is not busy.

 

0: kd> !devstack fffffa80`4a3b0060

  !DevObj   !DrvObj            !DevExt   ObjectName

> fffffa804a3b0060  \FileSystem\FltMgr fffffa804a3b01b0 

  fffffa804a3b1030  \FileSystem\Ntfs   fffffa804a3b1180 

 

As http://msdn.microsoft.com/en-us/library/ff541610(v=vs.85).aspx explains, “The filter manager is installed with Windows, but it becomes active only when a minifilter driver is loaded. The filter manager attaches to the file system stack for a target volume. A minifilter driver attaches to the file system stack indirectly, by registering with the filter manager for the I/O operations the minifilter driver chooses to filter.”

 

Using the power of the fltkd extension, I dumped the volume information associated with this device object. From the output below, we can extract the name of the filter attached to the volume. The culprit is named BadDriver.sys. The customer removed the driver and the problem went away long enough for the vendor to create an update for BadDriver.sys. Happy Ending!

 

0: kd> !fltkd.volume fffffa80`4a3b0060

 

FLT_VOLUME: fffffa804a3b0800 "\Device\HarddiskVolume3"

   FLT_OBJECT: fffffa804a3b0800  [04000000] Volume

      RundownRef               : 0x000000000000020a (261)

      PointerCount             : 0x00000001

      PrimaryLink              : [fffffa804ae06810-fffffa804a2b16f0]

   Frame                    : fffffa8049fcd420 "Frame 0"

   Flags                    : [00000064] SetupNotifyCalled EnableNameCaching FilterAttached

   FileSystemType           : [00000002] FLT_FSTYPE_NTFS

   VolumeLink               : [fffffa804ae06810-fffffa804a2b16f0]

   DeviceObject             : fffffa804a3b0060

   DiskDeviceObject         : fffffa804a1c0350

   FrameZeroVolume          : fffffa804a3b0800

   VolumeInNextFrame        : 0000000000000000

   Guid                     : "\??\Volume{552791b0-455d-11de-b7b9-00145eed6acc}"

   CDODeviceName            : "\Ntfs"

   CDODriverName            : "\FileSystem\Ntfs"

   TargetedOpenCount        : 258

   Callbacks                : (fffffa804a3b0910)

   ContextLock              : (fffffa804a3b0cf8)

   VolumeContexts           : (fffffa804a3b0d00)  Count=0

   StreamListCtrls          : (fffffa804a3b0d08)  rCount=2871

   FileListCtrls            : (fffffa804a3b0d88)  rCount=0

   NameCacheCtrl            : (fffffa804a3b0e08)

   InstanceList             : (fffffa804a3b0890)

      FLT_INSTANCE: fffffa804b5b1010 "BadDriver.sys Instance" "189600"
