Debugging a CLOCK_WATCHDOG_TIMEOUT Bugcheck

Debugging a CLOCK_WATCHDOG_TIMEOUT Bugcheck

Rate This
  • Comments 0

Hi debuggers, Andrew Richards here for my first NT Debugging post. I thought I’d share a recent case that used a lot of discovery techniques to uncover the details of what was going on. Most bugchecks give you the information you need as arguments, but in the case of bugcheck 0x101, I had to go digging for a trap frame, the thread stack, look at the disassembly of the running threads, and lots of other goodies.

 

As usual, the first thing I did was run !analyze -v to get a more detailed explanation of what a bugcheck 0x101 “CLOCK_WATCHDOG_TIMEOUT” is.

 

0: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

CLOCK_WATCHDOG_TIMEOUT (101)
An expected clock interrupt was not received on a secondary processor in an
MP system within the allocated interval. This indicates that the specified
processor is hung and not processing interrupts.
Arguments:
Arg1: 00000031, Clock interrupt time out interval in nominal clock ticks.
Arg2: 00000000, 0.
Arg3: 87337120, The PRCB address of the hung processor.
Arg4: 00000003, 0.

 

A bugcheck 0x101 occurs when the Clock interrupt (IRQL #28) has not been processed by each processor within a timeout.  The Clock interrupt is quite high in the IRQL table for x86; only the Inter-Processor Interrupt (IPI), Power-Fail and High interrupts are higher.

 

31

High

30

Power Fail

29

Inter-processor Interrupt

28

Clock

27

Profile/Synch

26

Device n

 

5

CMCI

4

3

Device 1

2

DPC/Dispatch

1

APC

0

Passive

 

The immediate hypothesis was that one of the processors was stuck processing an IPI (a common interrupt), causing it to be above the Clock IRQ level.

The questions were:

1.  At what IRQL was the problem processor?  Was it at IPI?

2.  Why was it stuck?

3.  Who did it?

 

This was a 4 core box, so I ran !prcb four times to view the Processor Resource control Block (PRCB) of each processor. Processor #3 matched the PRCB specified in the bugcheck’s 3rd argument (Arg3 = 87337120). The first question was answered; Processor #3 was the culprit. Interestingly, the Arg4 value in the bugcheck has a value of 3.  Per the documentation it should be 0, but it seems that the processor number is provided.

 

0: kd> !prcb 0
PRCB for Processor 0 at 82b34d20:
Current IRQL -- 28
Threads--  Current 82b3e380 Next 00000000 Idle 82b3e380
Processor Index 0 Number (0, 0) GroupSetMember 1
Interrupt Count -- 000475e1
Times -- Dpc    000008d6 Interrupt 0000007b
         Kernel 0001920f User      00003b0b

0: kd> !prcb 1
PRCB for Processor 1 at 807c7120:
Current IRQL -- 0
Threads--  Current 9bb65d48 Next 9bf52c10 Idle 807cc800
Processor Index 1 Number (0, 1) GroupSetMember 2
Interrupt Count -- 00030ae8
Times -- Dpc    000005ca Interrupt 00000098
         Kernel 00017e4a User      00004eab

0: kd> !prcb 2
PRCB for Processor 2 at 87300120:
Current IRQL -- 0
Threads--  Current 9b41b6e8 Next 00000000 Idle 87305800
Processor Index 2 Number (0, 2) GroupSetMember 4
Interrupt Count -- 0002ab35
Times -- Dpc    00000568 Interrupt 000000ac
         Kernel 0001a788 User      00002565

0: kd> !prcb 3
PRCB for Processor 3 at 87337120:
Current IRQL -- 0
Threads--  Current 8aaa17c8 Next 00000000 Idle 8733c800
Processor Index 3 Number (0, 3) GroupSetMember 8
Interrupt Count -- 00026c0d
Times -- Dpc    00000620 Interrupt 0000008b
         Kernel 0001ac65 User      00001e33

 

IRQL

The Processor Context Records reported that Processors #0, #1 and #2 were at IRQL 31 (1f - HIGH), and Processor #3 was at IRQL 27 (1b - SYNCH). Having a value of 31 is expected at bugcheck as that is how the bugcheck gains control of the processor to gather the current context. It was strange that Processor #3 was different and that is matched the processor mentioned in the bugcheck

0: kd> !pcr 0
KPCR for Processor 0 at 82b34c00:
...
                Irql: 0000001f
...

0: kd> !pcr 1
KPCR for Processor 1 at 807c7000:
...
                Irql: 0000001f
...

0: kd> !pcr 2
KPCR for Processor 2 at 87300000:
...
                Irql: 0000001f
...

0: kd> !pcr 3
KPCR for Processor 3 at 87337000:
...
                Irql: 0000001b
...

 

The next step was to look at the stacks of the processors to see what the threads were all involved in.

 

Processor #0

To determine the stack of the Processor #0’s thread before the bugcheck, the trap frame needed to be found. The trap frame is stored immediately above the interrupt handler. To find that, I looked at the Interrupt Descriptor Table to find out the name of the handler for clock interrupt and then I searched for that symbol on the stack.

 

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
==> Processor #0
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
0: kd> knL
# ChildEBP RetAddr 
00 82b31674 82a84a6f nt!KeBugCheckEx+0x1e
01 82b316b0 82a840be nt!KeAccumulateTicks+0x242
02 82b316f0 82a83f6b nt!KeUpdateRunTime+0x145
03 82b3174c 82a88c17 nt!KeUpdateSystemTime+0x613
04 82b3174c 82a85e79 nt!KeUpdateSystemTimeAssist+0x13
05 82b317e0 82abfa17 nt!KiIpiSendPacket+0xdd
06 82b31820 82af0866 nt!KeFlushSingleTb+0x136
07 82b3190c 82b2ab90 nt!MmFreeSpecialPool+0x2b4
08 82b31970 82d4a06e nt!ExFreePoolWithTag+0xd6
09 82b31980 82d3fab2 nt!ViCtxFreeIsrContext+0xf
0a 82b31998 82d3a1c0 nt!VfIoFreeIrp+0xd3
0b 82b319a8 8ced5986 nt!IovFreeIrpPrivate+0x47
WARNING: Stack unwind information not available. Following frames may be wrong.
0c 82b319c4 82d3acd4 irsir+0x2986
0d 82b319f4 82a81933 nt!IovpLocalCompletionRoutine+0x14b
0e 82b31a3c 82d3ab64 nt!IopfCompleteRequest+0x128
0f 82b31aa4 8b7c6abd nt!IovCompleteRequest+0x133
10 82b31ab4 8b7c6461 serial!SerialGetNextIrpLocked+0x61
11 82b31ad8 8b7c7567 serial!SerialGetNextIrp+0x27
12 82b31b00 8b7b9eb7 serial!SerialTryToCompleteCurrent+0x7a
13 82b31b38 82a83039 serial!SerialReadTimeout+0x68
14 82b31b7c 82a82fdd nt!KiProcessTimerDpcTable+0x50
15 82b31c68 82a82e9a nt!KiProcessExpiredTimerList+0x101
16 82b31cdc 82a8100e nt!KiTimerExpiration+0x25c
17 82b31d20 82a80e38 nt!KiRetireDpcList+0xcb
18 82b31d24 00000000 nt!KiIdleLoop+0x38

 

0: kd> !irql
Debugger saved IRQL for processor 0x0 -- 28 (CLOCK2_LEVEL)

 

0: kd> !idt

Dumping IDT:

37:   82e35104 hal!PicSpuriousService37

51:   8aba1558 serial!SerialCIsrSw (KINTERRUPT 8aba1500)

71:   89003cd8 i8042prt!I8042KeyboardInterruptService (KINTERRUPT 89003c80)

72:   8aba17d8 USBPORT!USBPORT_InterruptService (KINTERRUPT 8aba1780)

82:   89123058 ataport!IdePortInterrupt (KINTERRUPT 89123000)

               ataport!IdePortInterrupt (KINTERRUPT 89123a00)

               ataport!IdePortInterrupt (KINTERRUPT 89123780)

               ataport!IdePortInterrupt (KINTERRUPT 89123500)

92:   8aba1058 Impcd+0x8540 (KINTERRUPT 8aba1000)

a0:   8aba1a58 ndis!ndisMiniportMessageIsr (KINTERRUPT 8aba1a00)

a2:   8aba1cd8 USBPORT!USBPORT_InterruptService (KINTERRUPT 8aba1c80)

b0:   891232d8 ndis!ndisMiniportMessageIsr (KINTERRUPT 89123280)

b1:   89123cd8 ACPI!ACPIInterruptServiceRoutine (KINTERRUPT 89123c80)

b2:   8aba12d8 serial!SerialCIsrSw (KINTERRUPT 8aba1280)

c1:   82e353f4 hal!HalpBroadcastCallService

d1:   82e1d634 hal!HalpHpetClockInterrupt

d2:   82e1d898 hal!HalpHpetRolloverInterrupt

df:   82e351dc hal!HalpApicRebootService

e1:   82e35958 hal!HalpIpiHandler

e3:   82e356f8 hal!HalpLocalApicErrorService

fd:   82e35f2c hal!HalpProfileInterrupt

fe:   82e361a8 hal!HalpPerfInterrupt

 

The search was between the nt!KiIpiSendPacket call (I just chose it as it didn’t seem to be bugcheck related) and the current stack pointer.  Using the dereferenced pointer (poi) and some maths, the trap frame location is retrieved.

 

0: kd> dps @esp 82b317e0
...
82b31744  82b31760 nt!KiDoubleFaultStack+0x2760 <<<<< This is not a real symbol; it actually is a part of the trap frame
82b31748  82e1d72a hal!HalpHpetClockInterrupt+0xf6
82b3174c  82b31760 nt!KiDoubleFaultStack+0x2760
82b31750  82a88c17 nt!KeUpdateSystemTimeAssist+0x13
...

 

0: kd> .trap poi(82b31748-4)
ErrCode = 00000000
eax=87300120 ebx=841882dc ecx=4cdfc4c4 edx=82b34d20 esi=807c7120 edi=82b738c4
eip=82a85e79 esp=82b317d4 ebp=82b31820 iopl=0         nv up ei pl nz na po nc
cs=0008  ss=0010  ds=0000  es=dea0  fs=040f  gs=0008             efl=00000202
nt!KiIpiSendPacket+0xdd:
82a85e79 f390            pause

 

So you might be asking yourself, can this be done an easier way?  The answer is definitely Yes. You just need to use kv instead; it adds the trap frame information on the end of the line.

 

0: kd> kv

ChildEBP RetAddr  Args to Child             

82b31674 82a84a6f 00000101 00000031 00000000 nt!KeBugCheckEx+0x1e

82b316b0 82a840be 00026161 00000000 0001cd00 nt!KeAccumulateTicks+0x242

82b316f0 82a83f6b 82a85e79 807c7120 00000000 nt!KeUpdateRunTime+0x145

82b3174c 82a88c17 ffffff1b ffffff1b 000000d1 nt!KeUpdateSystemTime+0x613

82b3174c 82a85e79 ffffff1b ffffff1b 000000d1 nt!KeUpdateSystemTimeAssist+0x13 (FPO: [0,2] TrapFrame @ 82b31760)

82b317e0 82abfa17 00000001 00000000 82a3cbe1 nt!KiIpiSendPacket+0xdd (FPO: [6,2,0])

82b31820 82af0866 a33d6f20 00000001 a33d00cf nt!KeFlushSingleTb+0x136

82b3190c 82b2ab90 a33d6f20 8a2096d8 8a2096d8 nt!MmFreeSpecialPool+0x2b4

82b31970 82d4a06e a33d6f20 00000000 82b31998 nt!ExFreePoolWithTag+0xd6

82b31980 82d3fab2 a33d6f20 a33d6f20 a33d6f20 nt!ViIrpFree+0xf

82b31998 82d3a1c0 8ced5986 905cadb0 82b319c4 nt!VfIoFreeIrp+0xd3

82b319a8 8ced5986 a33d6f20 8ced58f0 82b31a6c nt!IovFreeIrpPrivate+0x47

WARNING: Stack unwind information not available. Following frames may be wrong.

82b319c4 82d3acd4 00000000 a33d6f20 905cadb0 irsir+0x2986

82b319f4 82a81933 00000000 a33d6f20 82b31a6c nt!IovpLocalCompletionRoutine+0x14b

82b31a3c 82d3ab64 a33d6f20 8ab151ac 8ab150f0 nt!IopfCompleteRequest+0x128

82b31aa4 8b7c6abd 00000000 8ab151ac 82b31ad8 nt!IovCompleteRequest+0x133

82b31ab4 8b7c6461 8ab151ac 8ab1518c 82b31b0c serial!SerialGetNextIrpLocked+0x61 (FPO: [Non-Fpo])

82b31ad8 8b7c7567 8ab151ac 8ab1518c 82b31b0c serial!SerialGetNextIrp+0x27 (FPO: [Non-Fpo])

82b31b00 8b7b9eb7 8ab150f0 00000000 8ab15002 serial!SerialTryToCompleteCurrent+0x7a (FPO: [Non-Fpo])

82b31b38 82a83039 8ab15314 8ab15002 3c171d26 serial!SerialReadTimeout+0x68 (FPO: [Non-Fpo])

82b31b7c 82a82fdd 82b34d20 82b31ca8 00000001 nt!KiProcessTimerDpcTable+0x50

82b31c68 82a82e9a 82b34d20 82b31ca8 00000000 nt!KiProcessExpiredTimerList+0x101

82b31cdc 82a8100e 0001cacf 9bb374c0 82b3e380 nt!KiTimerExpiration+0x25c

82b31d20 82a80e38 00000000 0000000e 00000000 nt!KiRetireDpcList+0xcb

82b31d24 00000000 0000000e 00000000 00000000 nt!KiIdleLoop+0x38 (FPO: [0,0,0])

 

By using the stack based maths, or the value provide by kv, the trap frame address is used to set the context (.trap <addr>) to the code running before the interrupt.  These are the stored registers and the stack at the time of the interrupt.

 

0: kd> .trap 82b31760

ErrCode = 00000000

eax=87300120 ebx=841882dc ecx=4cdfc4c4 edx=82b34d20 esi=807c7120 edi=82b738c4

eip=82a85e79 esp=82b317d4 ebp=82b31820 iopl=0         nv up ei pl nz na po nc

cs=0008  ss=0010  ds=0000  es=dea0  fs=040f  gs=0008             efl=00000202

nt!KiIpiSendPacket+0xdd:

82a85e79 f390            pause

 

0: kd> knL
  *** Stack trace for last set context - .thread/.cxr resets it
# ChildEBP RetAddr 
00 82b317e0 82abfa17 nt!KiIpiSendPacket+0xdd
01 82b31820 82af0866 nt!KeFlushSingleTb+0x136
02 82b3190c 82b2ab90 nt!MmFreeSpecialPool+0x2b4
03 82b31970 82d4a06e nt!ExFreePoolWithTag+0xd6
04 82b31980 82d3fab2 nt!ViCtxFreeIsrContext+0xf
05 82b31998 82d3a1c0 nt!VfIoFreeIrp+0xd3
06 82b319a8 8ced5986 nt!IovFreeIrpPrivate+0x47
WARNING: Stack unwind information not available. Following frames may be wrong.
07 82b319c4 82d3acd4 irsir+0x2986
08 82b319f4 82a81933 nt!IovpLocalCompletionRoutine+0x14b
09 82b31a3c 82d3ab64 nt!IopfCompleteRequest+0x128
0a 82b31aa4 8b7c6abd nt!IovCompleteRequest+0x133
0b 82b31ab4 8b7c6461 serial!SerialGetNextIrpLocked+0x61
0c 82b31ad8 8b7c7567 serial!SerialGetNextIrp+0x27
0d 82b31b00 8b7b9eb7 serial!SerialTryToCompleteCurrent+0x7a
0e 82b31b38 82a83039 serial!SerialReadTimeout+0x68
0f 82b31b7c 82a82fdd nt!KiProcessTimerDpcTable+0x50
10 82b31c68 82a82e9a nt!KiProcessExpiredTimerList+0x101
11 82b31cdc 82a8100e nt!KiTimerExpiration+0x25c
12 82b31d20 82a80e38 nt!KiRetireDpcList+0xcb
13 82b31d24 00000000 nt!KiIdleLoop+0x38

 

Dissassembling the first few instructions reveals a jump (jmp) that is back up in the nt!KiIpiSendPacket function. Using the jmp location and the instruction after the jmp as the bound, we can disassemble the loop. At the time of the bugcheck, the thread was executing a pause (a CPU based delay), and seemingly doing this in a loop while waiting for to release it.

 

0: kd> u @eip
nt!KiIpiSendPacket+0xdd:
82a85e79 f390            pause  <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< @eip
82a85e7b eb9e            jmp     nt!KiIpiSendPacket+0x7f (82a85e1b)
82a85e7d 8d4900          lea     ecx,[ecx]

...

0: kd> u 82a85e1b 82a85e7d
nt!KiIpiSendPacket+0x7f:
82a85e1b 8b86a4180000    mov     eax,dword ptr [esi+18A4h]
82a85e21 0bc0            or      eax,eax <<<<<< Checking if value is nonzero
82a85e23 7538            jne     nt!KiIpiSendPacket+0xc1 (82a85e5d) <<<<<< Take jmp to stay in loop
82a85e25 f00fb196a4180000 lock cmpxchg dword ptr [esi+18A4h],edx
82a85e2d 75ec            jne     nt!KiIpiSendPacket+0x7f (82a85e1b)
82a85e2f 59              pop     ecx
82a85e30 d1e9            shr     ecx,1
82a85e32 8d7f04          lea     edi,[edi+4]
82a85e35 72df            jb      nt!KiIpiSendPacket+0x7a (82a85e16)
82a85e37 75f7            jne     nt!KiIpiSendPacket+0x94 (82a85e30)
82a85e39 59              pop     ecx
82a85e3a 64890d8c190000  mov     dword ptr fs:[198Ch],ecx
82a85e41 8b4c240c        mov     ecx,dword ptr [esp+0Ch]
82a85e45 8b542410        mov     edx,dword ptr [esp+10h]
82a85e49 64ff0590360000  inc     dword ptr fs:[3690h]
82a85e50 52              push    edx
82a85e51 51              push    ecx
82a85e52 ff15a4a0a082    call    dword ptr [nt!_imp__HalRequestIpi (82a0a0a4)]
82a85e58 5f              pop     edi
82a85e59 5e              pop     esi
82a85e5a c21800          ret     18h
82a85e5d 41              inc     ecx
82a85e5e 850d7c3ab782    test    dword ptr [nt!HvlLongSpinCountMask (82b73a7c)],ecx
82a85e64 7513            jne     nt!KiIpiSendPacket+0xdd (82a85e79)
82a85e66 f605783ab78240  test    byte ptr [nt!HvlEnlightenments (82b73a78)],40h <<<<<< Don’t spin if you’re an enlightened VM, just pause
82a85e6d 740a            je      nt!KiIpiSendPacket+0xdd (82a85e79)
82a85e6f 52              push    edx
82a85e70 51              push    ecx
82a85e71 51              push    ecx
82a85e72 e8c8b60500      call    nt!HvlNotifyLongSpinWait (82ae153f)
82a85e77 59              pop     ecx
82a85e78 5a              pop     edx
82a85e79 f390            pause  <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< @eip
82a85e7b eb9e            jmp     nt!KiIpiSendPacket+0x7f (82a85e1b)

 

The summary so far: Since processor #0 was the thread that created the bugcheck, it must have been interrupted by a Clock interrupt so as to trigger a CLOCK_WATCHDOG_TIMEOUT bugcheck. It is not surprising then that the value of !irql is CLOCK2.

The existence of the nt!KiIpiSendPacket function lends weight to the thought that this was the creator of an IPI and is not one of the processors that wasn’t processing IPIs.

 

Processor #1

Using the same technique as Processor #0, the disassembly for the loop on Processor #1 is determined. Processor #1 is in a tight loop within the nt!KeFlushMultipleRangeTb function, an ancestor function of nt!ExAllocatePoolWithTag, a memory related operation.  This is an interesting function to see, considering that nt!ExFreePoolWithTag, the counterpart, is on Processor #0’s stack.

 

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
==> Processor #1
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
1: kd> knL
# ChildEBP RetAddr 
00 95bfc7bc 82abb217 nt!KeFlushMultipleRangeTb+0x2d3
01 95bfc89c 82aa7d11 nt!MiFlushTbAsNeeded+0x12e
02 95bfc8dc 82b29487 nt!MiAllocatePagedPoolPages+0x567
03 95bfc940 82aa4674 nt!MiAllocatePoolPages+0x1f
04 95bfc998 82b2a132 nt!ExpAllocateBigPool+0xa6
05 95bfc9fc 82aab6b1 nt!ExAllocatePoolWithTag+0x12d
06 95bfca20 82c4cc62 nt!ExAllocatePoolWithTagPriority+0x196
07 95bfca78 82c7a662 nt!IopQueryNameInternal+0x60
08 95bfca98 82c57d88 nt!IopQueryName+0x1b
09 95bfcb1c 82c71a50 nt!ObpQueryNameString+0x7f
0a 95bfcb38 82c75e12 nt!ObQueryNameString+0x18
0b 95bfcc14 82c65788 nt!EtwTraceProcess+0xa2
0c 95bfcc38 82c73625 nt!PspExitProcess+0x37
0d 95bfccb4 82c87051 nt!PspExitThread+0x59a
0e 95bfcccc 82aba8c0 nt!PsExitSpecialApc+0x22
0f 95bfcd1c 82a472a4 nt!KiDeliverApc+0x28b
10 95bfcd1c 77556fc0 nt!KiServiceExit+0x64
WARNING: Frame IP not in any known module. Following frames may be wrong.
11 014ef918 00000000 0x77556fc0

 

1: kd> !irql

Debugger saved IRQL for processor 0x1 -- 0 (LOW_LEVEL)  <<<<<< Windows Internals 4th Edition notes that IRQL may not be saved; this explains the 0

 

1: kd> u @eip
nt!KeFlushMultipleRangeTb+0x2d3:
82a40c31 f390            pause <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< @eip
82a40c33 8b07            mov     eax,dword ptr [edi]
82a40c35 85c0            test    eax,eax
82a40c37 75de            jne     nt!KeFlushMultipleRangeTb+0x2b9 (82a40c17)
82a40c39 8a4d0f          mov     cl,byte ptr [ebp+0Fh]
82a40c3c ff1558a1a082    call    dword ptr [nt!_imp_KfLowerIrql (82a0a158)]
82a40c42 5f              pop     edi
82a40c43 5e              pop     esi

 

1: kd> u 82a40c17 82a40c39
nt!KeFlushMultipleRangeTb+0x2b9:
82a40c17 46              inc     esi
82a40c18 85357c3ab782    test    dword ptr [nt!HvlLongSpinCountMask (82b73a7c)],esi
82a40c1e 7511            jne     nt!KeFlushMultipleRangeTb+0x2d3 (82a40c31)
82a40c20 f605783ab78240  test    byte ptr [nt!HvlEnlightenments (82b73a78)],40h
82a40c27 7408            je      nt!KeFlushMultipleRangeTb+0x2d3 (82a40c31)
82a40c29 56              push    esi
82a40c2a e810090a00      call    nt!HvlNotifyLongSpinWait (82ae153f)
82a40c2f eb02            jmp     nt!KeFlushMultipleRangeTb+0x2d5 (82a40c33)
82a40c31 f390            pause <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< @eip
82a40c33 8b07            mov     eax,dword ptr [edi]
82a40c35 85c0            test    eax,eax
82a40c37 75de            jne     nt!KeFlushMultipleRangeTb+0x2b9 (82a40c17)

 

This function tests a variable at edi in each loop. While the signal is not set, the thread goes around loop, eventually executes a pause instruction and then tries the test again. This thread seems to be waiting on someone to set a flag.

 

Processor #2

Processor #2 is also trying to send an IPI using nt!KiIpiSendPacket. It was caught in the same function that Processor #0 is in.  In this case, it is executing the jmp instruction that is one instruction after the pause which Processor #0 is executing.

 

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
==> Processor #2
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
2: kd> knL
# ChildEBP RetAddr 
00 a0c4fac0 82a40bcd nt!KiIpiSendPacket+0xdf
01 a0c4fafc 82b29431 nt!KeFlushMultipleRangeTb+0x26f
02 a0c4fbe8 82b2aef1 nt!MiFreePoolPages+0x42c
03 a0c4fc50 82aa6b37 nt!ExFreePoolWithTag+0x436
04 a0c4fc64 82c3745d nt!MmFreeAccessPfnBuffer+0x2f
05 a0c4fcc0 82c4b83a nt!PfpFlushBuffers+0x2ba
06 a0c4fd50 82c12f5e nt!PfTLoggingWorker+0xaa
07 a0c4fd90 82aba219 nt!PspSystemThreadStartup+0x9e
08 00000000 00000000 nt!KiThreadStartup+0x19

 

2: kd> !irql

Debugger saved IRQL for processor 0x2 -- 0 (LOW_LEVEL)  <<<<<< Windows Internals 4th Edition notes that IRQL may not be saved; this explains the 0

 

2: kd> u @eip
nt!KiIpiSendPacket+0xdf:
82a85e7b eb9e            jmp     nt!KiIpiSendPacket+0x7f (82a85e1b) <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< @eip
82a85e7d 8d4900          lea     ecx,[ecx]
...

 

Summary so far

So at this point we can say (without any real knowledge of how IPI handling is achieved) that:

 

Processor #0 is waiting for Processor #1, #2 and/or #3 to respond to its IPI

Processor #0 is running nt!ExFreePoolWithTag

Processor #0 is reported to be at IRQL 28 (CLOCK2_LEVEL)

 

Processor #1 is waiting for a flag to bet set; probably an IPI flag

Processor #0 is running nt!ExAllocatePoolWithTag

Processor #1 is reported to be at IRQL 0 (LOW_LEVEL); probably incorrect

 

Processor #2 is waiting for Processor #0, #1 and/or #3 to respond to its IPI

Processor #0 is running nt!ExFreePoolWithTag

Processor #2 is reported to be at IRQL 0 (LOW_LEVEL); probably incorrect

 

Processor #3 has been tagged by bugcheck as being the cause

Processor #3 hasn’t been looked at yet

 

Processor #3

Analyzing Processor #3 was problematic. The Processor Context Record (PCR) was not completely captured in the dump. This happens when the bugcheck thread is unable to interrupt the target processor to gather the context – usually because it is running at the HIGH (31) IRQL.

When the context is missing, all of the registers are treated as being zero. The zero value instruction pointer causes the stack to be incorrect when calling knL and various other commands.

 

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
==> Processor #3
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
1: kd> ~3
WARNING: Process directory table base 7B59C400 doesn't match CR3 00185000
WARNING: Process directory table base 7B59C400 doesn't match CR3 00185000

 

3: kd> r
eax=00000000 ebx=00000000 ecx=00000000 edx=00000000 esi=00000000 edi=00000000
eip=00000000 esp=00000000 ebp=00000000 iopl=0         nv up di pl nz na po nc
cs=0000  ss=0000  ds=0000  es=0000  fs=0000  gs=0000             efl=00000000
00000000 ??              ???

 

3: kd> knL
# ChildEBP RetAddr 
WARNING: Frame IP not in any known module. Following frames may be wrong.
00 00000000 00000000 0x0

 

To determine the thread running on the processor, the !thread command was used. The value for the _ETHREAD (8aaa17c8) comes from the PCR that Windows maintains as part of scheduling. Once again, like the knL command, because the context was missing, the stack was not shown.

 

3: kd> !pcr
KPCR for Processor 3 at 87337000:
...
       CurrentThread:
8aaa17c8
...

 

3: kd> !thread
THREAD
8aaa17c8  Cid 0454.08ec  Teb: 00000000 Win32Thread: 00000000 RUNNING on processor 3
IRP List:
    8f8c6f68: (0006,0094) Flags: 40000404  Mdl: 00000000
Not impersonating
DeviceMap                 89970d50
Owning Process            89f93ac0       Image:         ZyxApp.exe
Attached Process          N/A            Image:         N/A
Wait Start TickCount      117454         Ticks: 589 (0:00:00:09.188)
Context Switch Count      37             IdealProcessor: 3            
UserTime                  00:00:00.000
KernelTime                00:00:00.000
Win32 Start Address 0x7753fd0f
Stack Init 91aecfd0 Current 91aeca78 Base
91aed000 Limit 91aea000 Call 0
Priority 10 BasePriority 8 UnusualBoost 0 ForegroundBoost 0 IoPriority 2 PagePriority 5
ChildEBP RetAddr  Args to Child             
00000000 00000000 00000000 00000000 00000000 0x0

 

The important piece of information in the !thread output was the stack limits. These values allow me to do a search of the raw stack for known symbols. I used dps <limit> <base>. What I was looking for was the first symbol after the 0xffffffff entries which has a value above it that points to a location within this stack. In this case, is it the hal!KfLowerIrql symbol that has a value (base pointer) above it. This symbol is the deepest function that the stack has got to; it doesn’t necessarily mean that the stack is at this depth now.

 

3: kd> dps 91aea000 91aed000
91aea000  ffffffff
91aea004  ffffffff

91aec85c  ffffffff
91aec860  ffffffff
91aec864  ffffffff
91aec868  ffffffff
91aec86c  ffffffff
91aec870  ffffffff
91aec874  0001187f
91aec878  00000010
91aec87c  00011a1f
91aec880  ffffffff
91aec884  0001187f
91aec888  82b7561f nt!MmSystemSpaceLock+0x1f
91aec88c  91aec89c
91aec890  82e20ba9 hal!KfLowerIrql+0x61
91aec894  00011a00
91aec898  82b7561f nt!MmSystemSpaceLock+0x1f
91aec89c  91aec95c
91aec8a0  82e1e92d hal!KeReleaseQueuedSpinLock+0x2d
91aec8a4  82a3bdc7 nt!MiReturnNonPagedPoolVa+0x1d4
91aec8a8  00000001

91aec8ac  00000000

91aec8b0  83cac9e8

91aec8b4  ffffffff

91aec8b8  ffffffff

91aec8bc  ffffffff

91aec8c0  ffffffff

91aec8c4  ffffffff

91aec8c8  ffffffff

91aec8cc  ffffffff

91aec8d0  82e21cee hal!HalpLegacyApicWriteIcr+0xa

91aec8d4  91aec8f0

91aec8d8  82e2aea0 hal!HalpRequestIpiSpecifyVector+0x40

91aec8dc  00000000

91aec8e0  83cac9e8

91aec8e4  82b738cc nt!KiProcessorBlock+0xc

91aec8e8  87300120

91aec8ec  00000202

91aec8f0  91aec914

91aec8f4  82aa6e36 nt!MiInsertPageInFreeOrZeroedList+0x25b

91aec8f8  83cac9e8

91aec8fc  00000000

91aec900  0000001f

91aec904  00000001

91aec908  83cac9e8

91aec90c  00000003

91aec910  00000007

91aec914  00000fff

91aec918  00000003

91aec91c  82a40bcd nt!KeFlushMultipleRangeTb+0x26f

91aec920  00000001

91aec924  00000000

91aec928  82a9fa43 nt!KiFlushTargetMultipleRangeTb

91aec92c  00000000

91aec930  91aec9bc

91aec934  91aec9b8

91aec938  00000001

91aec93c  00000000

91aec940  83e9cc04

91aec944  00018736

91aec948  8733a480

91aec94c  00000003

91aec950  91aec9bc

91aec954  91aeca40

91aec958  82b29431 nt!MiFreePoolPages+0x42c

91aec95c  91aec9b8

91aec960  1f000000

91aec964  91aec9b8

91aec968  00000000

...

 

Using the first location found, I used the stack variables to build up a k= command. With an x86 stack, if you select the stack location above the symbol as the base pointer and stack pointer, and the symbol’s address as the instruction pointer, you’ll get a reconstructed stack.

 

3: kd> k= 91aec89c 91aec88c 82e20ba9

ChildEBP RetAddr 

91aec89c 82e1e92d hal!KfLowerIrql+0x61

91aec8a0 82a3bdc7 hal!KeReleaseQueuedSpinLock+0x2d

91aec9a8 82aab049 nt!MiReturnNonPagedPoolVa+0x1d4

91aec9c0 82ab1685 nt!FindNodeOrParent+0x2091aec9d8 82abd820 nt!RtlLookupElementGenericTableFullAvl+0x16

91aeca0c 82e20ba9 nt!RtlLookupElementGenericTableAvl+0x18

91aeca1c 82b199d2 hal!KfLowerIrql+0x61

91aeca24 82d41ad1 nt!VfAvlUnlockShared+0x15

91aeca40 82b2aef1 nt!VfRemLockDeleteMemoryRange+0x5c

91aecaa8 963a55c5 nt!ExFreePoolWithTag+0x436

91aecac8 963a61d4 Zyx+0x5c5
91aecb00 82d3a6c3 Zyx+0x11d4
91aecb24 82a4054a nt!IovCallDriver+0x258
91aecb38 82c3b975 nt!IofCallDriver+0x1b
91aecb7c 82c2c591 nt!IopDeleteFile+0x10c
91aecb94 82a81d60 nt!ObpRemoveObjectRoutine+0x59
91aecba8 82a81cd0 nt!ObfDereferenceObjectWithTag+0x88
91aecbb0 82c4f308 nt!ObfDereferenceObject+0xd
91aecbf0 82c7dba9 nt!ObpCloseHandleTableEntry+0x21d
91aecc20 82c65f86 nt!ExSweepHandleTable+0x5f
91aecc40 82c73666 nt!ObKillProcess+0x54
91aeccb4 82c87051 nt!PspExitThread+0x5db
91aecccc 82aba8c0 nt!PsExitSpecialApc+0x22
91aecd1c 82a472a4 nt!KiDeliverApc+0x28b
91aecd1c 775570b4 nt!KiServiceExit+0x64
0147ff88 00000000 0x775570b4

 

In this case, because nt!KiFlushTargetMultipleRangeTb was present on Processor #1 as well as being in the Processor #3 raw stack, I found the first symbol lower than this symbol on the (raw) stack that had a valid base pointer above it.  The stack was thus built around the nt!MiFreePoolPages function instead.

 

91aec94c  00000003

91aec950  91aec9bc

91aec954  91aeca40

91aec958  82b29431 nt!MiFreePoolPages+0x42c

91aec95c  91aec9b8

91aec960  1f000000

91aec964  91aec9b8

91aec968  00000000

 

3: kd> k= 91aeca40 91aec954 82b29431

ChildEBP RetAddr 

91aeca40 82b2aef1 nt!MiFreePoolPages+0x42c

91aecaa8 963a55c5 nt!ExFreePoolWithTag+0x436

WARNING: Stack unwind information not available. Following frames may be wrong.

91aecac8 963a61d4 Zyx+0x5c5

91aecb00 82d3a6c3 Zyx+0x11d4

91aecb24 82a4054a nt!IovCallDriver+0x258

91aecb38 82c3b975 nt!IofCallDriver+0x1b

91aecb7c 82c2c591 nt!IopDeleteFile+0x10c

91aecb94 82a81d60 nt!ObpRemoveObjectRoutine+0x59

91aecba8 82a81cd0 nt!ObfDereferenceObjectWithTag+0x88

91aecbb0 82c4f308 nt!ObfDereferenceObject+0xd

91aecbf0 82c7dba9 nt!ObpCloseHandleTableEntry+0x21d

91aecc20 82c65f86 nt!ExSweepHandleTable+0x5f

91aecc40 82c73666 nt!ObKillProcess+0x54

91aeccb4 82c87051 nt!PspExitThread+0x5db

91aecccc 82aba8c0 nt!PsExitSpecialApc+0x22

91aecd1c 82a472a4 nt!KiDeliverApc+0x28b

91aecd1c 775570b4 nt!KiServiceExit+0x64

0147ff88 00000000 0x775570b4

 

It’s very hard to get much further this stack since the exact registers are not known. In particular, it is hard to determine what function above nt!MiFreePoolPages the thread is executing at the moment.  As with the Processor #1 investigation, it is interesting to note that Processor #3 is also involved in a memory operation; specifically, it is doing a nt!ExFreePoolWithTag much like Processor #0 is.

 

Inter-Processor Interrupts

The only time an IPI (interrupt) is not processed immediately is when the target processor is at IPI IRQL or higher. The most common example being when it is already processing an IPI (interrupts of the same level cannot interrupt the handler for the same IRQL). In this case, the interrupt has to be queued until the interrupt mask allows its arrival. Usually, this design allows only one IPI to be processed at any one time.

 

A deadlock (like) condition can arise though if an IPI is issued to a processor that is at a higher IRQL, and this processor (thread) attempts to send an IPI. The IPI logic blocks the send if there is an outstanding IPI to complete on the processor. The assumption being that interrupt queuing should be avoided as there is a probability of loss if the queue overflows.

 

Looking in the Windows Internals book, there is a single sentence that says “Each interrupt level has a specific purpose. For example, the kernel issues an interprocessor interrupt (IPI) to request that another processor perform an action, such as dispatching a particular thread for execution or updating its translation look-aside buffer cache.”.  This is very interesting as the translation look-aside buffer is part of the memory manager, and the memory operations are being undertaken on all the processors.

 

3rd Party Driver

Instead of pulling my hair out combing through the threads on the system, the IPI code, the Memory Manager code or bugcheck code, I decided to look at the 3rd party driver in processor #3 to see if it was changing the IRQL.

 

The first step was to find the bound of the Zyx+0x5c5 function.  The end address of the assembler is easy to determine, it is Zyx+0x5c5.  The question is, how big is the function? To work that out, you look at the assembler of the caller.  The caller’s assembler will point to the starting instruction in the Zyx+0x5c5 function.

 

3: kd> k= 91aeca40 91aec954 82b29431

ChildEBP RetAddr 

91aeca40 82b2aef1 nt!MiFreePoolPages+0x42c

91aecaa8 963a55c5 nt!ExFreePoolWithTag+0x436

WARNING: Stack unwind information not available. Following frames may be wrong.

91aecac8 963a61d4 Zyx+0x5c5

91aecb00 82d3a6c3 Zyx+0x11d4

91aecb24 82a4054a nt!IovCallDriver+0x258

91aecb38 82c3b975 nt!IofCallDriver+0x1b

91aecb7c 82c2c591 nt!IopDeleteFile+0x10c

...

 

3: kd> ub Zyx+0x11d4
Zyx+0x11b7:
963a61b7 ff7008          push    dword ptr [eax+8]
963a61ba 52              push    edx
963a61bb 6a01            push    1
963a61bd ff7018          push    dword ptr [eax+18h]
963a61c0 e827fdffff      call    Zyx+0xeec (963a5eec)
963a61c5 e9f1000000      jmp     Zyx+0x12bb (963a62bb)
963a61ca e8d3fcffff      call    Zyx+0xea2 (963a5ea2)
963a61cf e89ef3ffff      call    Zyx+0x572 (963a5572)

 

3: kd> u 963a5572 Zyx+0x5c5
Zyx+0x572:
963a5572 8bff            mov     edi,edi
963a5574 55              push    ebp
963a5575 8bec            mov     ebp,esp
963a5577 51              push    ecx
963a5578 53              push    ebx
963a5579 56              push    esi
963a557a 57              push    edi
963a557b b11f            mov     cl,1Fh <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< 0x1F = 0n31 (HIGH)
963a557d ff15006b3a96    call    dword ptr [Zyx+0x1b00 (963a6b00)] <<<<<<<< hal!KfRaiseIrql (via lookup)
963a5583 bb14813a96      mov     ebx,offset Zyx+0x3114 (963a8114)
963a5588 8bcb            mov     ecx,ebx
963a558a 8845ff          mov     byte ptr [ebp-1],al
963a558d e8c2100000      call    Zyx+0x1654 (963a6654) <<<<<<<<<<<<<<<< [Unresolved]
963a5592 a1746d3a96      mov     eax,dword ptr [Zyx+0x1d74 (963a6d74)]
963a5597 85c0            test    eax,eax
963a5599 8b35586b3a96    mov     esi,dword ptr [Zyx+0x1b58 (963a6b58)] <<<< nt!ExFreePoolWithTag (via lookup)
963a559f 7413            je      Zyx+0x5b4 (963a55b4)
963a55a1 8b7808          mov     edi,dword ptr [eax+8]
963a55a4 6a00            push    0
963a55a6 50              push    eax
963a55a7 ffd6            call    esi  <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< nt!ExFreePoolWithTag (via 963a5599)
963a55a9 85ff            test    edi,edi
963a55ab 8bc7            mov     eax,edi
963a55ad a3746d3a96      mov     dword ptr [Zyx+0x1d74 (963a6d74)],eax
963a55b2 75ed            jne     Zyx+0x5a1 (963a55a1)
963a55b4 a1a06d3a96      mov     eax,dword ptr [Zyx+0x1da0 (963a6da0)]
963a55b9 85c0            test    eax,eax
963a55bb 7413            je      Zyx+0x5d0 (963a55d0)
963a55bd 8b7808          mov     edi,dword ptr [eax+8]
963a55c0 6a00            push    0
963a55c2 50              push    eax
963a55c3 ffd6            call    esi <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< nt!ExFreePoolWithTag (via 963a5599)

 

3: kd> ln poi(963a6b00)
(82e20844)   hal!KfRaiseIrql   |  (82e208c6)   hal!HalpDispatchSoftwareInterrupt
Exact matches:

 

3: kd> ln poi(963a6b58)
(82b2aaba)   nt!ExFreePoolWithTag   |  (82b2b660)   nt!ExDeferredFreePool
Exact matches:

 

The next step was to work out what functions the Zyx+0x5c5 function called, and with what parameters. The assembler didn’t make this immediately obvious. The address of the function calls were lookups. This is common when the address is imported by the code, instead of being bound. To get the function name of the call, I dereferenced the pointer (poi(<addr>) and passed the address to ln so that it listed the nearest symbols.

 

The first function confirmed some the theory. A call to hal!KfRaiseIrql was being made with a IRQL of HIGH (31). And this was prior to a code path that called functions.

 

The value of the @esi register was determined to be nt!ExFreePoolWithTag. This matched the function name in the reconstructed stack (nt!ExFreePoolWithTag+0x436) and confirmed that the stack was reconstructed in this area correctly.

 

Conclusion

The Zyx function raised IRQL to HIGH_LEVEL before calling nt!ExFreePoolWithTag. This function caused a Translation Look-aside Buffer (TLB) flush via nt!KiFlushTargetMultipleRangeTb) to occur. This in turn caused an IPI notification to the other processors (via nt!KiIpiSendPacket) to indicate the release of the TLB memory.

 

The Processor #0 IPI was stuck as incomplete because Processor #3 could not be interrupted, and Processor #3 did not lower its IRQL as it was waiting to send its own IPI upon completion of the Processor #0 IPI – thus a deadlock was formed.

 

I contacted the driver’s author and organized a new version that used the SYNCH (27) IRQL (the level immediately under IPI). The new build was sent to the customer and the immediate bugcheck that they were observing in their test environment disappeared.  Although this is not an ideal solution, it was sufficient for the time being.  An ideal solution would be to follow the documentation for ExFreePoolWithTag and not make the call with an IRQL greater than DISPATCH_LEVEL.

 

I hope this case has provided a good foundation in how to recover and navigate through stacks without symbols or context. Once you understand the mechanics of calling conventions, the values needed to bound a disassembling operation and/or a stack reassemble are quite easy to determine.

 

Leave a Comment
  • Please add 6 and 5 and type the answer here:
  • Post