Hi, I'm Manish from Global Escalation Services. I would like to present a multiple random bug check issue, which was caused by malicious code (trojan) running on the machine. This is the walkthrough of how we found the virus on the server.
In this particular dump, the machine crashed with Bugcheck 0xA (IRQL_NOT_LESS_OR_EQUAL) because we got a Page Fault at dispatch level, also known as DPC Level (IRQL 2). Windows system architecture governs that we cannot have a page fault at dispatch level because paging requires I/O, I/O requires a wait, and we cannot wait while the IRQL is above dispatch level. So when this anomaly happens Windows will intentionally crash the machine. We trapped in CcMapData because we touched the address c226d800 which was not valid because it’s paged out. If we look at implementation of CcMapData on MSDN http://msdn.microsoft.com/en-us/library/windows/hardware/ff539155(v=vs.85).aspx we see that this function can only be called below Dispatch Level, but the current CPU IRQL is 2.
So how did this happen? The most likely possibility is that some driver on the stack raised the IRQL by calling KeRaiseIrql and then forgot to lower it by calling KeLowerIrql. There are many drivers on this stack and anyone could be the culprit. It is difficult to track when one of these may have done it as it’s long gone.
We could have enabled IRQL Checking using Driver Verifier to find the culprit but we have to enable it on all drivers listed in the stack (and possibly others), which could cause some performance issues. Also this was not the only crash we were seeing; there were various stop codes. So before going the verifier route I decided to dig more into this dump. I noticed there is an address on the stack for which the module name is not getting resolved. This looked odd (suspicious) as most of the time the debugger does a great job of finding the module. I started investigating what is this module. It turned out to be a Trojan “TrojanDropper:Win32/Sirefef.B”. http://www.microsoft.com/security/portal/Threat/Encyclopedia/Entry.aspx?Name=TrojanDropper%3AWin32%2FSirefef.B
0: kd> !analyze -v
* Bugcheck Analysis *
An attempt was made to access a pageable (or completely invalid) address at an
interrupt request level (IRQL) that is too high. This is usually
caused by drivers using improper addresses.
If a kernel debugger is available get the stack backtrace.
Arg1: c226d800, memory referenced
Arg2: d0000002, IRQL
Arg3: 00000000, bitfield :
bit 0 : value 0 = read operation, 1 = write operation
bit 3 : value 0 = not an execute operation, 1 = execute operation (only on chips which support this level of status)
Arg4: 808b64a6, address which referenced memory
0: kd> kv
ChildEBP RetAddr Args to Child
f78ae41c 808b64a6 badb0d00 00000000 00000001 nt!KiTrap0E+0x2a7 (FPO: [0,0] TrapFrame @ f78ae41c)
f78ae4cc f71a6f2d 8b22d520 f78ae4fc 00000400 nt!CcMapData+0x8c (FPO: [Non-Fpo])
f78ae4ec f71a4494 f78ae7ec 8b64c150 01c6d800 Ntfs!NtfsMapStream+0x4b (FPO: [Non-Fpo])
f78ae560 f71a6df0 f78ae7ec 8b3a7100 e7c50ce0 Ntfs!NtfsReadMftRecord+0x86 (FPO: [Non-Fpo])
f78ae598 f71a6fac f78ae7ec 8b3a7100 e7c50ce0 Ntfs!NtfsReadFileRecord+0x7a (FPO: [Non-Fpo])
f78ae5d0 f718e46d f78ae7ec e7c50cd8 e7c50ce0 Ntfs!NtfsLookupInFileRecord+0x37 (FPO: [Non-Fpo])
f78ae67c f718e541 f78ae7ec e7c50cd8 f718e2cb Ntfs!NtfsWalkUpTree+0xbe (FPO: [Non-Fpo])
f78ae6d8 f718e263 f78ae7ec e7c50cd8 00000000 Ntfs!NtfsBuildNormalizedName+0x44 (FPO: [Non-Fpo])
f78ae704 f7196c2e f78ae7ec 8801d600 e7c50da0 Ntfs!NtfsQueryNameInfo+0x4b (FPO: [Non-Fpo])
f78ae774 f71a0ff6 f78ae7ec 88081cb0 8b5f9260 Ntfs!NtfsCommonQueryInformation+0x291 (FPO: [Non-Fpo])
f78ae7d8 f71a102f f78ae7ec 88081cb0 00000001 Ntfs!NtfsFsdDispatchSwitch+0x12a (FPO: [Non-Fpo])
f78ae8f4 8081df85 8b3a7020 88081cb0 88081cb0 Ntfs!NtfsFsdDispatchWait+0x1c (FPO: [Non-Fpo])
f78ae908 f721fd28 8b2fd220 8b60d308 8b54b020 nt!IofCallDriver+0x45 (FPO: [Non-Fpo])
f78ae934 8081df85 8b5f9260 88081cb0 88081cb0 fltMgr!FltpDispatch+0x152 (FPO: [Non-Fpo])
f78ae948 f721fd28 88081ed0 8b60d308 c000000d nt!IofCallDriver+0x45 (FPO: [Non-Fpo])
f78ae974 8081df85 8b54b020 88081cb0 88081cb0 fltMgr!FltpDispatch+0x152 (FPO: [Non-Fpo])
f78ae988 f5c62bec 88081cb0 8b54a980 88081cb0 nt!IofCallDriver+0x45 (FPO: [Non-Fpo])
WARNING: Stack unwind information not available. Following frames may be wrong.
f78ae99c f5c5e3ee 8b03b690 00000000 f78ae9c0 CtxSbx+0x5bec
f78ae9ac 8081df85 8b03b690 88081cb0 8b0528e0 CtxSbx+0x13ee
f78ae9c0 f61074e1 8b0528e0 8801d600 f78aea00nt!IofCallDriver+0x45 (FPO: [Non-Fpo])
f78ae9e4 f61075d0 8b0a9b80 00081cb0 f78aea08 CtxAltStr+0x44e1
f78ae9f4 8081df85 8b0a9b80 88081cb0 88081ed0 CtxAltStr+0x45d0
f78aea08 8b5cfc89 00000000 87ef0000 87ef9002 nt!IofCallDriver+0x45 (FPO: [Non-Fpo])
f78aec90 8081df85 8b5ef610 8b08c110 8b08c2a8 0x8b5cfc89 <------------ This looks odd what is this module?
f78aeca4 f7241607 8b08c2a8 00000000 f78aece8nt!IofCallDriver+0x45 (FPO: [Non-Fpo])
f78aecb4 f72412b2 8b08c2a8 8b392b70 87f11974 CLASSPNP!SubmitTransferPacket+0xbb (FPO: [Non-Fpo])
f78aece8 f7241533 00000000 00000e00 87f11808 CLASSPNP!ServiceTransferRequest+0x1e4 (FPO: [Non-Fpo])
f78aed0c 8081df85 8b392ab8 00000000 8b2a2670 CLASSPNP!ClassReadWrite+0x159 (FPO: [Non-Fpo])
f78aed20 f74c80cf 8b2b7d80 87f11998 f78aed44 nt!IofCallDriver+0x45 (FPO: [Non-Fpo])
f78aed30 8081df85 8b6e4020 87f11808 87f119bc PartMgr!PmReadWrite+0x95 (FPO: [Non-Fpo])
f78aed44 f7317053 87f119d8 8b6e8148 882b9888 nt!IofCallDriver+0x45 (FPO: [Non-Fpo])
f78aed60 8081df85 8b2b7cc8 87f11808 87f119fc ftdisk!FtDiskReadWrite+0x1a9 (FPO: [Non-Fpo])
f78aed74 f72cf4f5 885360cc 88536098 87f26810 nt!IofCallDriver+0x45 (FPO: [Non-Fpo])
f78aed90 f72d1517 87f11808 885360cc 8b60c470 volsnap!VspDecrementIrpRefCount+0x14f (FPO: [Non-Fpo])
f78aeda8 f72c0398 87f26810 87f26810 88536098 volsnap!VspWriteVolumePhase22+0x3d (FPO: [Non-Fpo])
f78aee30 f72d15d3 00f26810 885360cc 00000000 volsnap!VspAcquireNonPagedResource+0xc6 (FPO: [Non-Fpo])
f78aee50 8081e123 00000000 87f215e8 88536098 volsnap!VspWriteVolumePhase2+0x59 (FPO: [Non-Fpo])
f78aee80 f7241829 f78aeeb0 f72413ec 8b392ab8 nt!IopfCompleteRequest+0xcd (FPO: [Non-Fpo])
f78aee88 f72413ec 8b392ab8 87f215e8 00000001 CLASSPNP!ClassCompleteRequest+0x11 (FPO: [Non-Fpo])
f78aeeb0 8081e123 00000000 8b08c428 8b08c5c0 CLASSPNP!TransferPktComplete+0x1fd (FPO: [Non-Fpo])
f78aeee0 f7266545 8b6e30e8 8b08c428 f78aef24 nt!IopfCompleteRequest+0xcd (FPO: [Non-Fpo])
f78aeef0 f7265a8a 87f856a8 00000001 00000000 SCSIPORT!SpCompleteRequest+0x5e (FPO: [Non-Fpo])
f78aef24 f7265130 8b6e30e8 87f856a8 f78aef9b SCSIPORT!SpProcessCompletedRequest+0x6a7 (FPO: [Non-Fpo])
f78aef9c 8083211c 8b6e30a4 8b6e3030 00000000 SCSIPORT!ScsiPortCompletionDpc+0x2bd (FPO: [Non-Fpo])
f78aeff4 8088dba7 f3f4f92c 00000000 00000000 nt!KiRetireDpcList+0xca (FPO: [Non-Fpo])
f3f4f948 80a603d9 ffdffa02 f3f4f980 f3f4f980 nt!KiDispatchInterrupt+0x37 (FPO: [Uses EBP] [0,0,1])
f3f4f964 80a60577 8b6c226c f3f4f980 8088d91d hal!HalpCheckForSoftwareInterrupt+0x81 (FPO: [Non-Fpo])
f3f4f970 8088d91d 8b304c00 000001a3 f3f4fa04 hal!HalEndSystemInterrupt+0x67 (FPO: [Non-Fpo])
f3f4f970 809395e7 8b304c00 000001a3 f3f4fa04 nt!KiInterruptDispatch+0x5d (FPO: [0,2] TrapFrame @ f3f4f980)
f3f4fa04 80939c37 88139178 88028301 00000008 nt!ObpAllocateObject+0x199 (FPO: [Non-Fpo])
f3f4fa38 808f8d28 00000000 8b76aad0 f3f4fa70 nt!ObCreateObject+0x129 (FPO: [Non-Fpo])
f3f4fb44 80937a40 8b6afd10 00000000 88028398 nt!IopParseDevice+0x710 (FPO: [Non-Fpo])
f3f4fbc4 80933b74 00000000 f3f4fc04 00000042 nt!ObpLookupObjectName+0x5b0 (FPO: [Non-Fpo])
f3f4fc18 808eaee7 00000000 00000000 dffa7c01 nt!ObOpenObjectByName+0xea (FPO: [Non-Fpo])
f3f4fc94 808ec181 077ef198 c0140000 077ef14c nt!IopCreateFile+0x447 (FPO: [Non-Fpo])
f3f4fcf0 808eec10 077ef198 c0140000 077ef14c nt!IoCreateFile+0xa3 (FPO: [Non-Fpo])
f3f4fd30 808897ec 077ef198 c0140000 077ef14c nt!NtCreateFile+0x30 (FPO: [Non-Fpo])
f3f4fd30 7c82847c 077ef198 c0140000 077ef14c nt!KiFastCallEntry+0xfc (FPO: [0,0] TrapFrame @ f3f4fd64)
077ef22c 00000000 00000000 00000000 00000000 0x7c82847c
The current CPU IRQL is 2, at which pagefaults cannot be serviced, hence windows crashed the machine.
0: kd> !irql
Debugger saved IRQL for processor 0x0 -- 2 (DISPATCH_LEVEL)
Un-assembling from the return address of this unknown driver, to try to confirm that this is really driver code and that it really does belong in this call stack:
0: kd> ub 8b5cfc89
8b5cfc62 8bd7 mov edx,edi
8b5cfc64 c60605 mov byte ptr [esi],5
8b5cfc67 895e18 mov dword ptr [esi+18h],ebx
8b5cfc6a c7460400020000 mov dword ptr [esi+4],200h
8b5cfc71 c7460809000000 mov dword ptr [esi+8],9
8b5cfc78 c7461c70fb5c8b mov dword ptr [esi+1Ch],8B5CFB70h
8b5cfc7f c64603e0 mov byte ptr [esi+3],0E0h
8b5cfc83 ff158c805d8b call dword ptr ds:[8B5D808Ch]
Sure does call IofCallDriver
0: kd> dps 8B5D808Ch l1
8b5d808c 8081df40 nt!IofCallDriver
So what is this driver? In Windows every image starts with a PE header which starts with letters “MZ”. So I started looking for PE header. I took the offset of the module on the stack and went back to its page boundary, then I started moving 1 page back at a time looking for the PE header.
0: kd> dc 8b5cf000
8b5cf000 0010b9f2 f02b0000 3b063c8b 83137538 ......+..<.;8u..
8b5cf010 c08304e9 04f98304 b85fee73 00000001 ........s._.....
8b5cf020 9cb8c35e b98b5d86 00000010 ff8bd02b ^....]......+...
8b5cf030 3b02348b 83137530 c08304e9 04f98304 .4.;0u..........
8b5cf040 b85fee73 00000001 335fc35e ccc35ec0 s._.....^._3.^..
8b5cf050 83ec8b55 78a04cec 538b5da4 c0b60f56 U....L.x.].SV...
8b5cf060 827ae857 ff330000 33f46589 f845c7db W.z...3..e.3..E.
8b5cf070 00000400 8b084d8b 52510c55 50b4458d .....M..U.QR.E.P
0: kd> dc 8b5cf000-1000
8b5ce000 01c73024 00000000 c70cc483 00800002 $0..............
8b5ce010 845e5f00 b10874db 0815ff01 8b8b5d80 ._^..t.......]..
8b5ce020 6a0c2444 15ff5000 8b5d8028 14c25b5d D$.j.P..(.].][..
8b5ce030 cccccc00 cccccccc cccccccc cccccccc ................
8b5ce040 6a306a56 2415ff00 8b8b5d80 74f685f0 Vj0j...$.].....t
8b5ce050 6a006a53 68026a01 8b5d0892 5d08ae68 Sj.j.j.h..].h..]
8b5ce060 ff006a8b 5d80ac15 ff56508b 5d80a815 .j.....].PV....]
8b5ce070 000d8b8b ff8b5da2 5d80b015 6a006a8b .....].....].j.j
0: kd> dc 8b5cf000-1000*2
8b5cd000 0689c033 89044689 46890846 1046890c 3....F..F..F..F.
8b5cd010 89144689 ec831846 1c46891c 57204689 .F..F.....F..F W
8b5cd020 85244689 8d0574f6 02eb284e c033c933 .F$..t..N(..3.3.
8b5cd030 41890189 08418904 8b34468b 4e891048 ...A..A..F4.H..N
8b5cd040 14508b10 8b145689 4e891848 1c508b18 ..P..V..H..N..P.
8b5cd050 8b1c5689 4e891848 1c508b20 2424448d .V..H..N .P..D$$
8b5cd060 244c8d50 186a510c 8b245689 6a302454 P.L$.Qj..V$.T$0j
8b5cd070 46c65205 46c70005 00000008 0c46c700 .R.F...F......F.
Finally I found the PE header for this image.
0: kd> dc 8b5cf000-1000*3
8b5cc000 00905a4d 00000003 00000004 0000ffff MZ..............
8b5cc010 000000b8 00000000 00000040 00000000 ........@.......
8b5cc020 00000000 00000000 00000000 00000000 ................
8b5cc030 00000000 00000000 00000000 000000d0 ................
8b5cc040 0eba1f0e cd09b400 4c01b821 685421cd ........!..L.!Th
8b5cc050 70207369 72676f72 63206d61 6f6e6e61 is program canno
8b5cc060 65622074 6e757220 206e6920 20534f44 t be run in DOS
8b5cc070 65646f6d 0a0d0d2e 00000024 00000000 mode....$.......
Using the built-in debugger extension !dh I dumped the header of this image to find the name. Unfortunately there is no name for this image. This address 8b5cc000 is not in the loaded module list, which raised further suspicion. Either it is hiding its load address or this driver was not loaded by standard loading mechanism. This module date shows it’s pretty recent build.
0: kd> !dh 8b5cc000
File Type: DLL
FILE HEADER VALUES
14C machine (i386)
4 number of sections
4EA3461E time date stamp Sun Oct 23 04:09:26 2011
0 file pointer to symbol table
0 number of symbols
E0 size of optional header
32 bit word machine
OPTIONAL HEADER VALUES
10B magic #
9.00 linker version
A400 size of code
2000 size of initialized data
0 size of uninitialized data
46C0 address of entry point
1000 base of code
----- new -----
10000000 image base
1000 section alignment
200 file alignment
1 subsystem (Native)
5.00 operating system version
0.00 image version
5.00 subsystem version
10000 size of image
400 size of headers
00100000 size of stack reserve
00001000 size of stack commit
00100000 size of heap reserve
00001000 size of heap commit
0 DLL characteristics
0 [ 0] address [size] of Export Directory
C91C [ 3C] address [size] of Import Directory
0 [ 0] address [size] of Resource Directory
0 [ 0] address [size] of Exception Directory
0 [ 0] address [size] of Security Directory
F000 [ 3FC] address [size] of Base Relocation Directory
0 [ 0] address [size] of Debug Directory
0 [ 0] address [size] of Description Directory
0 [ 0] address [size] of Special Directory
0 [ 0] address [size] of Thread Storage Directory
0 [ 0] address [size] of Load Configuration Directory
0 [ 0] address [size] of Bound Import Directory
C000 [ 1A8] address [size] of Import Address Table Directory
0 [ 0] address [size] of Delay Import Directory
0 [ 0] address [size] of COR20 Header Directory
0 [ 0] address [size] of Reserved Directory
SECTION HEADER #1
A354 virtual size
1000 virtual address
A400 size of raw data
400 file pointer to raw data
0 file pointer to relocation table
0 file pointer to line numbers
0 number of relocations
0 number of line numbers
(no align specified)
SECTION HEADER #2
13AC virtual size
C000 virtual address
1400 size of raw data
A800 file pointer to raw data
SECTION HEADER #3
4B0 virtual size
E000 virtual address
200 size of raw data
BC00 file pointer to raw data
SECTION HEADER #4
576 virtual size
F000 virtual address
600 size of raw data
BE00 file pointer to raw data
0: kd> !lmi 8b5cc000
Loaded Module Info: [8b5cc000]
ffffffff8b5cc000 is not a valid address
0: kd> lmvm 8b5cc000
start end module name
Checking the import table it does have calls to raise and lower the IRQL. We get to the import table by taking the base address plus the offset to the Import Address Table Directory. The below output is just a snippet of the whole table.
0: kd> dps 8b5cc000+c000
8b5d8000 80a603f4 hal!KfLowerIrql
8b5d8004 80a5ff00 hal!KeGetCurrentIrql
8b5d8008 80a600b4 hal!KfRaiseIrql
Dumping the entire image contents in memory to find more clues about this driver. Again, I am only showing snippets of the whole output.
0: kd> dc 8b5cc000 L?10000/4
8b5cc070 65646f6d 0a0d0d2e 00000024 00000000 mode....$.......
8b5d5e20 76000000 66697265 77252079 00000a5a ...verify %wZ...
8b5d5e30 31000000 35343332 39383736 33323130 ...1234567890123
8b5d5e40 37363534 31303938 35343332 39383736 4567890123456789
8b5d5e50 33323130 37363534 31303938 35343332 0123456789012345
8b5d5e60 39383736 33323130 37363534 31303938 6789012345678901
8b5d5e70 00343332 66000000 646e756f 67697320 234....found sig
8b5d5e80 7574616e 3d206572 0a752520 b4000000 nature = %u.....
Interestingly this image has other images (modules) embedded in it. We can see other PE headers, which again have no name. This looks highly suspicious and resembles behavior used by malicious software.
8b5d0970 f775c085 4848c78b 4dc35f5e 0300905a ..u...HH^_.MZ...
8b5d0980 04000000 ff000000 b80000ff 00000000 ................
8b5d0990 40000000 00000000 00000000 00000000 ...@............
8b5d09a0 00000000 00000000 00000000 00000000 ................
8b5d09b0 00000000 c8000000 0e000000 000eba1f ................
8b5d09c0 21cd09b4 cd4c01b8 69685421 72702073 ...!..L.!This pr
8b5d09d0 6172676f 6163206d 746f6e6e 20656220 ogram cannot be
8b5d09e0 206e7572 44206e69 6d20534f 2e65646f run in DOS mode. 8b5d2200 4d00004e 0300905a 04000000 ff000000 N..MZ...........
8b5d2210 b80000ff 00000000 40000000 00000000 ...........@....
8b5d2220 00000000 00000000 00000000 00000000 ................
8b5d2230 00000000 00000000 00000000 f0000000 ................
8b5d2240 0e000000 000eba1f 21cd09b4 cd4c01b8 ...........!..L.
8b5d2250 69685421 72702073 6172676f 6163206d !This program ca
8b5d2260 746f6e6e 20656220 206e7572 44206e69 nnot be run in D
8b5d2270 6d20534f 2e65646f 240a0d0d 00000000 OS mode....$....
After ensuring my AV definitions were up to date, I decided to dump this memory contents into a file onto my machine. The moment contents were written my AV Microsoft Forefront Endpoint Protection caught a Trojan in this file.
0: kd> .writemem c:\temp\drv.sys 8b5cc000 L?10000 Writing 10000 bytes................................
Every Wednesday (usually) we post a debug tip to our twitter page at https://twitter.com/#!/ntdebugging. This blog is an archive of these tips to allow our readers to find this information easily. Periodically we post an updated blog with the current archive. Follow us on twitter if you want to see the new tips as we post them.
The goal of these tips is to share debug commands, and forms of commands (parameters, flags, etc) that we in Platforms Global Escalation Services find useful. I hope you can add these commands to your toolkit and they will help you debug more efficiently.
!thread/!process [address] e - on x64 will not show you the meaningless Args to Child information.
.frame /c [FrameNumber] - sets context to specificied stack frame. On x64 provides more reliable register information than .trap.
kn - Dumps call stack with frame numbers, easier than counting stacks for .frame.
.frame /r [FrameNumber] - same as .frame /c, but shows registers without changing context.
Note: With .frame /c or /r you can only trust the nonvolatile registers. See http://bit.ly/dik4OR for vol/nonvol regs.
k=rbp rip FrameCount - Dumps call stack starting at rbp/rip on x64. Useful when the stack is corrupt.
.process/.thread /p /r [address] - sets new process context, sets .cache forcedecodeuser, and reloads user symbols.
!process [address] 17 - Sets the context for this command, avoids the need for .process to see user stacks. Try !process 0 17
~~[ThreadID]s - Changes threads in user mode. Use Thread ID number from output such as !locks. Ex: ~~[1bd4]s
runas /netonly /u:<account> windbg.exe - Launch windbg with domain account. Use when dbg computer isn't in domain and symbol server is.
!heap -p -a <address> - Shows information about the heap block containing <address>, even if you aren't using pageheap.
ub - Unassembles starting at a location prior to your address. Accepts l<number> to specify how many instructions to go back. ub . l20
!stacks 2 [FilterString] - Finds kernel mode call stacks that contain the FilterString in a symbol.
!thread [address] 17 (or 1e on x64) - Sets context for this command, avoids the need for .thread/.process for user stacks.
.hh [Text] - Opens the debugger help. [Text] is the topic to lookup in the index. Example: .hh !pte
?? can dump structs using C++ style expressions. Ex: ??((nt!_KTHREAD*)(0xfffffa800ea43bb0))->ApcState
bp /t EThread - Sets a kernel mode breakpoint that only triggers when hit in the context of this thread.
bp /p EProcess - Sets a kernel mode breakpoint that only triggers when hit in the context of this process.
gc - If you run 'p' and hit a breakpoint, gc takes you where p would have gone if you had not hit the bp.
gu - Go until the current function returns. Effectively this unwinds one stack frame. #windbg
pc - Steps through until the next 'call' instruction. Combine with other commands to find who returned your error> pc;p;r eax
pt - Steps through until the next 'ret' instruction. Similar to gu, but pt stops on the ret and gu stops after the ret.
.ignore_missing_pages 1 - supresses the error: "Page 2a49 not present in the dump file. Type ".hh dbgerr004" for details"
.exr -1 shows the most recent exception. Useful in user dumps of crashes, especially for no execute crashes (NX/DEP).
wt - Trace calls until they return to the current address. More useful with -or to get return values. Use -l for depth.
.thread /w - Changes to the WOW64 32-bit context from 64-bit kernel mode. Wow64exts doesn't work in kernel mode.
??sizeof(structure) - Gets the size of a structure, it's easier than counting.
sxe ld:module.dll - Enables an exception which will break into the debugger when module.dll is loaded.
vertarget - Shows OS version of the debug target. Also shows machine name, uptime, and session time (when the dump was taken).
!vm 1 - In a kernel debugger, shows basic information about memory usage. Available, committed, pagefile, pool, sysptes, etc.
.time - Shows session time (when dump was taken) and system uptime. In user mode shows process uptime, kernel/user time.
ba w size [address] - Break on write access only. Replace size with the num bytes you want to watch. Ex: ba w 4 005d5f10
.process -i <address> - Make the process active and break into the debugger. Use in live kernel debugs to get into process context.
.reload /f /o - Overwrites cached files in your downstream symbol store. Useful when your store has corrupt pdbs.
->* - Use with dt to dump pointers. Example: dt _EPROCESS [Address] ObjectTable->*
!for_each_module s -a @#Base @#End "PTag" - Find the drivers using pool tag "PTag".
.unload [DllName] - Unloads the debug extension you didn't intend to load. Omit DllName to unload the last dll loaded.
!exqueue dumps the executive worker queues. Use flags 7f to dump the worker threads and the queues.
lmvm <module> - Dumps information about the module. Remember to use <module> and not <module.dll>.
!thread -t TID - Dump a thread using thread ID rather than thread address. Useful when working with a critical section.
!list - Walks a linked list and displays informatino for each element in a list. See blog later today for an example.
.time -h # - Shows the debug session time using the timezone offset of #. Ex: .time -h 0 shows when a dump was taken in UTC.
!session - Lists all of the user session IDs. A quick way to list the active sessions from a dump of a terminal server.
!session -s SessionID - Changes the current session context to SessionID. Useful when looking at GDI, or other per session data.
| ProcNum s - Switches to process number ProcNum. Use when debugging multiple dumps, or processes, in one windbg.
!! - Launches a shell process and redirects its output to the debugger. The same as .shell, but "bang bang" sounds cooler.
uf Function - Dumps the assembly for Function (name or address). Useful for optimized code that is not contiguous in memory.
uf /c Function - Shows all of the calls made by Function (can be function name or address).
!wow64exts.sw - switches between x64 and x86 contexts. Often used to reverse .thread /w context switch.
Hello debuggers, the debug ninja is back again. Sometimes we have a scenario where a process is using a lot of memory, and the only data we are able to get at the moment is a user dump. Ordinarily data from tools such as umdh or xperf would be preferable because they provide memory usage data over a period of time and can include call stack information. However, umdh requires restarting the process (which loses the state of high memory usage), and xperf requires the installation of the Windows Performance Toolkit which may not always be an immediate option.
When we have such a dump we may not be able to specifically identify what piece of code is generating the high memory usage, but we may be able to narrow the scope of troubleshooting to a specific dll.
The first thing we need to do is identify what type of memory is using most of the address space. The debugger command !address –summary allows us to do this:
0:000> !address -summary
--- Usage Summary ---------------- RgnCount ----------- Total Size -------- %ofBusy %ofTotal
Free 489 7fe`6ff5a000 ( 7.994 Tb) 99.92%
Heap 9094 1`75ed1000 ( 5.843 Gb) 93.47% 0.07%
<unknown> 275 0`12e41000 ( 302.254 Mb) 4.72% 0.00%
Image 937 0`05a6a000 ( 90.414 Mb) 1.41% 0.00%
Stack 138 0`01700000 ( 23.000 Mb) 0.36% 0.00%
Other 14 0`001bd000 ( 1.738 Mb) 0.03% 0.00%
TEB 46 0`0005c000 ( 368.000 kb) 0.01% 0.00%
PEB 1 0`00001000 ( 4.000 kb) 0.00% 0.00%
From this example we can see that most of the memory is used by heap. A process will usually have multiple heaps, each created by a call to HeapCreate. We can examine the size of each of these heaps with !heap –s:
0:000> !heap -s
LFH Key : 0x0000006c1104d280
Termination on corruption : ENABLED
Heap Flags Reserv Commit Virt Free List UCR Virt Lock Fast
(k) (k) (k) (k) length blocks cont. heap
0000000000100000 00000002 16384 12824 16384 1180 254 5 0 3 LFH
0000000000010000 00008000 64 4 64 1 1 1 0 0
00000000003d0000 00001002 1088 708 1088 121 20 2 0 0 LFH
0000000003080000 00001002 1536 700 1536 4 4 2 0 0 LFH
00000000033a0000 00001002 5229696 1377584 5229696 414244 4039 3059 0 2c LFH
External fragmentation 30 % (4039 free blocks)
Virtual address fragmentation 73 % (3059 uncommited ranges)
0000000003380000 00001002 64 8 64 3 1 1 0 0
0000000003600000 00001002 512 56 512 3 1 1 0 0
0000000003c20000 00001002 512 8 512 3 1 1 0 0
0000000003220000 00001002 512 8 512 3 1 1 0 0
0000000003e50000 00001002 512 8 512 3 1 1 0 0
0000000003d00000 00001002 512 148 512 5 3 1 0 0 LFH
From the above output we can see that most of the memory is being used by heap 00000000033a0000.
At this point we need to try to identify what this heap is used for. A brute force method to do this is to search memory with the ‘s’ command.
0:000> s -q 0 l?7fffffffffffffff 00000000033a0000
000007fe`f21810a0 00000000`033a0000 00000000`00000001
The output of the ‘s’ command may be verbose. You will need to manually examine the addresses where ‘s’ finds hits. Most of these addresses will probably be in heap memory, we are looking for an address that matches a module. I snipped the above output to just show the relevant hit, an address inside of a loaded module.
The search of memory revealed that the heap 00000000033a0000 is used by the module useheap.dll, specifically it is part of the global ‘Blob’ class.
0:000> ln 000007fe`f21810a0
At this point we don’t know specifically what code in useheap.dll has allocated a lot of heap, however we have significantly narrowed the scope of the problem. We can now determine if there is a known issue with heap usage in useheap.dll that is addressed in a later version. We may also know from experience that this module uses a lot of memory under specific circumstances, such as a high volume of work sent to this service.
I hope this example helps the next time you have high memory usage and only have a user dump to troubleshoot with. Good luck!
Last time, we discussed how applications place data on the clipboard, and how to access that data using the debugger. Today, we'll take a look at how an application can monitor the clipboard for changes. Understanding this is important because it is a place where Windows allows 3rd-party code to "hook" into the system. If you experience unexpected behavior with copying and pasting, a program using these hooks may be misbehaving. We’ll start by covering the hooking mechanisms for clipboard, and then review how to identify what applications, if any, are using these hooks in the debugger.
There are three ways to monitor the clipboard for changes - clipboard viewers, clipboard format listeners, and querying the clipboard sequence number. We will focus on the first two as these allow an application to register for notifications whenever the clipboard is updated. The third method simply allows an application to check and see if a change has occurred, and should not be used in a polling loop.
The Clipboard Viewer functionality has been around since Windows 2000, if not earlier. The way it works is pretty simple - an application interested in receiving clipboard change notifications calls SetClipboardViewer and passes a handle to its window. Windows then stores that handle in a per-session win32k global, and anytime the clipboard is changed Windows sends a WM_DRAWCLIPBOARD message to the registered window.
Of course, multiple applications are allowed to register their windows as clipboard viewers - so how does Windows handle that? Well, if an application calls SetClipboardViewer and another clipboard viewer was already registered, Windows returns the handle value of the previous viewer's window to the new viewer. It is then the responsibility of the new viewer to call SendMessage every time it receives a WM_DRAWCLIPBOARD to notify the next viewer in the chain. Each clipboard viewer also needs to handle the WM_CHANGECBCHAIN message, which notifies all viewers when one of the viewers in the chain is removed, and specifies what the next viewer in the chain is. This allows the chain to be maintained.
An obvious problem with this design is it relies on each clipboard viewer application to behave correctly, not to terminate unexpectedly, and to generally be a good citizen. If any viewer decided not to be friendly, it could simply skip notifying the next viewer in line about an update, rendering the next viewer and all subsequent viewers impotent.
To address these problems, the Clipboard Format Listener mechanism was added in Windows Vista. This works in much the same way as the clipboard viewer functionality except in this case Windows maintains the list of listeners, instead of depending on each application to preserve a chain.
If an application wishes to become a clipboard format listener,it calls the AddClipboardFormatListener function and passes in a handle to its window. After that, its window message handler will receive WM_CLIPBOARDUPDATE messages. When the application is ready to exit or no longer wishes to receive notifications, it can call RemoveClipboardFormatListener.
Now that we've covered the ways to register a viewer/listener, let's take a look at how to identify them using the debugger. First, you'll need to identify a process in the session you are interested in checking for clipboard monitors. It can be any win32 process in that session -we just need to use it to locate a pointer to the Window Station. In this case, I'll use the Notepad window I used in part 1:
kd> !process 0 0 notepad.exe
SessionId: 1 Cid: 0374 Peb: 7fffffd8000 ParentCid: 0814
DirBase: 1867e000 ObjectTable: fffff9803d28ef90 HandleCount: 52.
If you are doing this in a live kernel debug, you'll need to change context into the process interactively (using .process /I< address> then hit g and wait for the debugger to break back in). Now DT the process address as an _EPROCESS and look for the Win32Process field:
kd> dt _EPROCESS fffff980366ecb30 Win32Process
+0x258 Win32Process : 0xfffff900`c18c0ce0 Void
Now DT the Win32Process address as a win32k!tagPROCESSINFO and identify the rpwinsta value:
kd> dt win32k!tagPROCESSINFO 0xfffff900`c18c0ce0 rpwinsta
+0x258 rpwinsta : 0xfffff980`0be2af60 tagWINDOWSTATION
This is our Window Station. Dump it using dt:
kd> dt 0xfffff980`0be2af60 tagWINDOWSTATION
+0x000 dwSessionId : 1
+0x008 rpwinstaNext : (null)
+0x010 rpdeskList : 0xfffff980`0c5e2f20 tagDESKTOP
+0x018 pTerm : 0xfffff960`002f5560 tagTERMINAL
+0x020 dwWSF_Flags : 0
+0x028 spklList : 0xfffff900`c192cf80 tagKL
+0x030 ptiClipLock : (null)
+0x038 ptiDrawingClipboard: (null)
+0x040 spwndClipOpen : (null)
+0x048 spwndClipViewer : 0xfffff900`c1a4ca70 tagWND
+0x050 spwndClipOwner : 0xfffff900`c1a3ef70 tagWND
+0x058 pClipBase : 0xfffff900`c5512fa0 tagCLIP
+0x060 cNumClipFormats : 4
+0x064 iClipSerialNumber : 0x16
+0x068 iClipSequenceNumber : 0xc1
+0x070 spwndClipboardListener : 0xfffff900`c1a53440 tagWND
+0x078 pGlobalAtomTable: 0xfffff980`0bd56c70 Void
+0x080 luidEndSession : _LUID
+0x088 luidUser : _LUID
+0x090 psidUser : 0xfffff900`c402afe0 Void
Note the spwndClipViewer, spwndClipboardListener, and spwndClipOwnerfields. spwndClipViewer is the most-recently-registered window in the clipboard viewer chain. Similarly, spwndClipboardListener is the most recent listener in our Clipboard Format Listener list. spwndClipOwner is the window that set the content in the clipboard.
Given the window, it is just a few steps to determine the process. This would work forspwndClipViewer, spwndClipboardListener, and spwndClipOwner. First, dt the value as a tagWND. We'll use the spwndClipViewer for this demonstration:
kd> dt 0xfffff900`c1a4ca70 tagWND
+0x000 head : _THRDESKHEAD
+0x028 state : 0x40020008
+0x028 bHasMeun : 0y0
+0x028 bHasVerticalScrollbar : 0y0
We only care about the head - so since it is at offset 0, dt the same address as a _THRDESKHEAD:
kd> dt 0xfffff900`c1a4ca70 _THRDESKHEAD
+0x000 h : 0x00000000`000102ae Void
+0x008 cLockObj : 6
+0x010 pti : 0xfffff900`c4f26c20tagTHREADINFO
+0x018 rpdesk : 0xfffff980`0c5e2f20 tagDESKTOP
+0x020 pSelf : 0xfffff900`c1a4ca70 "???"
Now, dt the address in pti as a tagTHREADINFO:
kd> dt 0xfffff900`c4f26c20 tagTHREADINFO
+0x000 pEThread : 0xfffff980`0ef6cb10 _ETHREAD
+0x008 RefCount : 1
+0x010 ptlW32 : (null)
+0x018 pgdiDcattr : 0x00000000`000f0d00 Void
Here, we only care about the value of pEThread, which we can pass to !thread:
kd> !thread 0xfffff980`0ef6cb10 e
THREAD fffff9800ef6cb10 Cid 087c.07ec Teb: 000007fffffde000 Win32Thread: fffff900c4f26c20 WAIT: (WrUserRequest) UserModeNon-Alertable
Owning Process fffff98032e18b30 Image: viewer02.exe
Attached Process N/A Image: N/A
Wait Start TickCount 5435847 Ticks: 33 (0:00:00:00.515)
Context Switch Count 809 IdealProcessor: 0 LargeStack
Win32 Start Address 0x000000013f203044
Stack Init fffff880050acdb0 Current fffff880050ac6f0
Base fffff880050ad000 Limit fffff880050a3000 Call 0
Priority 11 BasePriority 8 UnusualBoost 0 ForegroundBoost 2IoPriority 2 PagePriority 5
Child-SP RetAddr Call Site
fffff880`050ac730 fffff800`01488f32 nt!KiSwapContext+0x7a
fffff880`050ac870 fffff800`0148b74f nt!KiCommitThreadWait+0x1d2
fffff880`050ac900 fffff960`000dc8e7 nt!KeWaitForSingleObject+0x19f
fffff880`050ac9a0 fffff960`000dc989 win32k!xxxRealSleepThread+0x257
fffff880`050aca40 fffff960`000dafc0 win32k!xxxSleepThread+0x59
fffff880`050aca70 fffff960`000db0c5 win32k!xxxRealInternalGetMessage+0x7dc
fffff880`050acb50 fffff960`000dcab5 win32k!xxxInternalGetMessage+0x35
fffff880`050acb90 fffff800`01482ed3 win32k!NtUserGetMessage+0x75
fffff880`050acc20 00000000`77929e6a nt!KiSystemServiceCopyEnd+0x13 (TrapFrame @ fffff880`050acc20)
00000000`002ffb18 00000000`00000000 0x77929e6a
As you can see, we have a clipboard viewer registered from process viewer02.exe. Because of viewer's process-maintained chain architecture, it isn't easy to see the next process in the chain. However, we can do this for clipboard listeners. Let's look back at our window station:
+0x038 ptiDrawingClipboard : (null)
+0x048 spwndClipViewer : 0xfffff900`c1a4ca70tagWND
+0x050 spwndClipOwner : 0xfffff900`c1a3ef70tagWND
+0x070 spwndClipboardListener: 0xfffff900`c1a53440 tagWND
If we dt the spwndClipboardListener, there is a field that shows the next listener named spwndClipboardListenerNext:
kd> dt 0xfffff900`c1a53440 tagWND spwndClipboardListenerNext
+0x118 spwndClipboardListenerNext : 0xfffff900`c1a50080 tagWND
When you reach the last clipboard format listener's tagWND, its spwndClipboardListenerNext value will be null:
kd> dt 0xfffff900`c1a50080 tagWND spwndClipboardListenerNext
+0x118 spwndClipboardListenerNext : (null)
Using this window address, we can go through the same steps as above to identify this listener's process name. As mentioned earlier, since tagWND is a kernel structure, the OS is maintaining these spwndClipboardListener/spwndClipboardListenerNext pointers, so they aren't susceptible to the chain problems of clipboard viewers.
That wraps up our clipboard coverage. I hope you found it informative. Want to learn more about monitoring the clipboard? This MSDN article is a good resource.
Recently I had the opportunity to debug the clipboard in Windows, and I thought I’d share some of the things I learned. The clipboard is one of those parts of Windows that many of us use dozens (hundreds?) of times a day and don’t really think about. Before working on this case, I had never even considered how it works under the hood. It turns out that there’s more to it than you might think. In this first article, I’ll describe how applications store different types of data to the clipboard and how it is retrieved. My next post will describe how applications can hook the clipboard to monitor it for changes. In both, I’ll include debug notes to show you how to access the data from the debugger.
Let’s start by discussing clipboard formats. A clipboard format is used to describe what type of data is placed on the clipboard. There are a number of predefined standard formats that an application can use, such as bitmap, ANSI text, Unicode text, and TIFF. Windows also allows an application to specify its own formats. For example, a word processor may want to register a format that includes text, formatting, and images. Of course, this leads to one problem, what happens if you want to copy from your word processor and paste into Notepad, which doesn’t understand all of the formatting and pictures?
The answer is to allow multiple formats to be stored in the clipboard at one time. When I used to think of the clipboard I thought of it as a single object (“my text” or “my image”), but in reality the clipboard usually has my data in several different forms. The destination program gets a format it can use when I paste.
So how does this data end up on the clipboard? Simple, an application first takes ownership of the clipboard via the OpenClipboard function. Once it has done that, it can empty the clipboard with EmptyClipboard. Finally, it is ready to place data on the clipboard using SetClipboardData. SetClipboardData takes two parameters; the first is the identifier of one of the clipboard formats we discussed above. The second is a handle to the memory containing the data in that format. The application can continue to call SetClipboardData for each of the various formats it wishes to provide going from best to worst (since the destination application will select the first format in the list it recognizes). To make things easier for the developer, Windows will automatically provide converted formats for some clipboard format types. Once the program is done, it calls CloseClipboard.
When a user hits paste, the destination application will call OpenClipboard and one of these functions to determine what data format(s) are available: IsClipboardFormatAvailable, GetPriorityClipboardFormat, or EnumClipboardFormats. Assuming the application finds a format it can use, it will then call GetClipboardData with the desired format identifier as a parameter to get a handle to the data. Finally, it will call CloseClipboard.
Now let’s take a look at how you can find what data being written to the clipboard using the debugger. (Note that all of my notes are taken from a Win7/2008 R2 system – so things might vary slightly in different versions of the OS.) Since the clipboard is part of Win32k.sys, you’ll need to use a kernel debugger. I like to use win32k!InternalSetClipboardData+0xe4 as a breakpoint. The nice thing about this offset is that it is right after we’ve populated the RDI register with data from SetClipboardData in a structure known as tagCLIP.
kd> u win32k!InternalSetClipboardData+0xe4-c L5
fffff960`0011e278 894360 mov dword ptr [rbx+60h],eax
fffff960`0011e27b 8937 mov dword ptr [rdi],esi
fffff960`0011e27d 4c896708 mov qword ptr [rdi+8],r12
fffff960`0011e281 896f10 mov dword ptr [rdi+10h],ebp
fffff960`0011e284 ff15667e1900 call qword ptr[win32k!_imp_PsGetCurrentProcessWin32Process (fffff960`002b60f0)]
kd> dt win32k!tagCLIP
+0x000 fmt : Uint4B
+0x008 hData : Ptr64 Void
+0x010fGlobalHandle : Int4B
Here’s what a call to SetClipboardData from Notepad looks like:
Child-SP RetAddr Call Site
fffff880`0513a940 fffff960`0011e14f win32k!InternalSetClipboardData+0xe4
fffff880`0513ab90 fffff960`000e9312 win32k!SetClipboardData+0x57
fffff880`0513abd0 fffff800`01482ed3 win32k!NtUserSetClipboardData+0x9e
00000000`001dfad8 00000000`7792e494 USER32!ZwUserSetClipboardData+0xa
00000000`001dfae0 000007fe`fc5b892b USER32!SetClipboardData+0xdf
00000000`001dfb20 000007fe`fc5ba625 COMCTL32!Edit_Copy+0xdf
00000000`001dfb60 00000000`77929bd1 COMCTL32!Edit_WndProc+0xec9
00000000`001dfc00 00000000`779298da USER32!UserCallWinProcCheckWow+0x1ad
00000000`001dfcc0 00000000`ff5110bc USER32!DispatchMessageWorker+0x3b5
00000000`001dfd40 00000000`ff51133c notepad!WinMain+0x16f
00000000`001dfdc0 00000000`77a2652d notepad!DisplayNonGenuineDlgWorker+0x2da
00000000`001dfe80 00000000`77b5c521 kernel32!BaseThreadInitThunk+0xd
So here, we can dt RDI as a tagCLIP to see what was written:
kd> dt win32k!tagCLIP @rdi
+0x000 fmt : 0xd
+0x008 hData : 0x00000000`00270235 Void
+0x010fGlobalHandle : 0n1
Fmt is the clipboard format. 0xd is 13, which indicates this data is Unicode text. We can’t just ‘du’ the value in hData, however, because this is a handle, not a direct pointer to the data. So now we need to look up the handle. To do that, we need to look at a win32k global structure – gSharedInfo:
Evaluate expression: -7284261440224 = fffff960`002f3520
kd> dt win32k!tagSHAREDINFO fffff960`002f3520
+0x000 psi : 0xfffff900`c0980a70 tagSERVERINFO
+0x008 aheList : 0xfffff900`c0800000 _HANDLEENTRY
+0x010 HeEntrySize : 0x18
+0x018 pDispInfo : 0xfffff900`c0981e50 tagDISPLAYINFO
+0x020ulSharedDelta : 0
+0x028 awmControl :  _WNDMSG
+0x218DefWindowMsgs : _WNDMSG
+0x228DefWindowSpecMsgs : _WNDMSG
aheList in gSharedInfo contains an array of handle entries, and the last 2 bytes of hData multiplied by the size of a handle entry the address of our handle entry:
kd> ?0x00000000`00270235 & FFFF
Evaluate expression: 565 = 00000000`00000235
unsigned int64 0x18
kd> ? 0xfffff900`c0800000 + (0x235*0x18)
Evaluate expression: -7693351766792 = fffff900`c08034f8
kd> dt win32k!_HANDLEENTRY fffff900`c08034f8
+0x000 phead : 0xfffff900`c0de0fb0 _HEAD
+0x008 pOwner : (null)
+0x010 bType : 0x6 ''
+0x011 bFlags : 0 ''
+0x012 wUniq : 0x27
If we look in phead at offset 14, we’ll get our data (this offset may vary on different platforms):
kd> du fffff900`c0de0fb0 + 0x14
fffff900`c0de0fc4 "Hi NTDebugging readers!"
Let’s consider one other scenario. I copied some text out of Wordpad, and a number of SetClipboardData calls were made to accommodate different formats. The Unicode format entry looks like this:
Breakpoint 0 hit
fffff960`0011e284 ff15667e1900 call qword ptr[win32k!_imp_PsGetCurrentProcessWin32Process (fffff960`002b60f0)]
+0x008 hData : (null)
+0x010fGlobalHandle : 0n0
hData is null! Why is that? It turns out that the clipboard allows an application to pass in null to SetClipboardData for a given format. This indicates that the application can provide the data in that format, but is deferring doing so until it is actually needed. Sure enough, if I paste the text into Notepad, which needs the text in Unicode, Windows sends a WM_RENDERFORMAT message to the WordPad window, and WordPad provides the data in the new format. Of course, if the application exits before populating all of its formats, Windows needs all of the formats rendered. In this case, Windows will send the WM_RENDERALLFORMATS message so other applications can use the clipboard data after the source application has exited.
That’s all for now. Next time we’ll look at how applications can monitor the clipboard for changes using two hooks. If you want to know more about using the clipboard in your code, this is a great reference.
Part 2 of this article can be found here:
Matt Burrough here again. On rare occasions when debugging, we'll actually know (or strongly suspect) what the root cause of a problem is at the beginning of our analysis - but we still need to investigate to confirm our assertion. The following is a case study for an issue I worked on recently where the print spooler was hanging.
This customer had recently upgraded their print servers from Windows 2003 to Windows 2008 R2. After the upgrade, the spooler would frequently go unresponsive, and no jobs could be printed. Rebooting the server provided some relief, but the problem would reoccur as the jobs started coming in again.
One of my peers had completed an initial analysis of a user mode memory dump of the spooler process, and found that spooler seemed to be blocked waiting on PrintIsolationHost.exe. For those not familiar with recent changes to the print spooler's design - Print Isolation Host was added in Windows 7/2008 R2 as a way to isolate print drivers from each other and from the spooler process so a crash in one driver doesn't take down the entire printing environment. Given the large number of print drivers found on some print servers, this can be a great help for stability and availability of the spooler. See http://blogs.technet.com/b/askperf/archive/2009/10/08/windows-7-windows-server-2008-r2-print-driver-isolation.aspx if you would like more details on Print Isolation Host.
Unfortunately for my team mate, the data collected did not include dumps of Print Isolation Host, so he requested that the next time the problem happened that both spooler and PrintIsolationHost dumps would be gathered. The customer had configured his server for the "Isolated" Print Isolation Host option during troubleshooting, which places each driver in its own process. (The default is shared which places all drivers in one PrintIsolationHost.exe instance. Driver isolation is configured using the Print Management Console.)
The next morning, the newly requested data came in, and since the problem was urgent, I began looking at the new dumps immediately. This dataset included both a spoolsv.exe dump, as well as nearly two dozen PrintIsolationHost dumps! I knew from the past analysis that I should start with the PrintIsolationHost data - but where to begin? In order to triage the dumps, I wrote a small batch file to open each dump (luckily, they were sequentially numbered), write the call stacks in each thread to a file, and close the log. On every iteration, the script created a new cmd.txt file, which contained a set of commands that were passed to the debugger. This allowed me to name the debugger output files sequentially with names that matched their dump (e.g. PrintIsolationHost1.txt contained output from PrintIsolationHost1.dmp).
echo .reload > c:\data\cmd.txt
echo .logopen c:\data\PrintIsolationHost%x%.txt >> c:\data\cmd.txt
echo ~*kc >> c:\data\cmd.txt
echo .logclose >> c:\data\cmd.txt
echo qq >> c:\data\cmd.txt
c:\debuggers\kd.exe -cf c:\data\cmd.txt -z "c:\data\PrintIsolationHost%x%.DMP"
set /A x+=1
IF %x% LEQ 23 GOTO Top
Now that I had a directory full of text files, I used my favorite differencing tool to compare the stacks in each text file. I used the first PrintIsolationHost file as a reference. It only had four stacks, and these were common to all of the other files:
. 0 Id: 40c.1c28 Suspend: 0 Teb: 000007ff`fffde000 Unfrozen
1 Id: 40c.23c0 Suspend: 0 Teb: 000007ff`fffdb000 Unfrozen
2 Id: 40c.2598 Suspend: 0 Teb: 000007ff`fffd3000 Unfrozen
3 Id: 40c.820 Suspend: 0 Teb: 000007ff`fffd9000 Unfrozen
I was able to rule out a number of other PrintIsolationHost instances that were either identical to this one (aside from the process/thread IDs and Tebs), or which just had one or two additional idle worker threads (like thread 3 above).
Things got interesting when I looked at two of the PrintIsolationHost dumps. Both had these two stacks not found in any other the other dumps (note that I do not have symbols for the 3rd-party print processor ProseWarePrintProcessor):
2 Id: 20a4.1328 Suspend: 0 Teb: 000007ff`fffda000 Unfrozen
*** ERROR: Symbol file could not be found. Defaulted to export symbols for ProseWarePrintProcessor.dll -
6 Id: 20a4.1668 Suspend: 0 Teb: 000007ff`fffac000 Unfrozen
Interesting. Thread 6 is running the DllMain code for the ProseWarePrintProcessor DLL, which holds the loader lock. It is waiting on a critical section. Meanwhile, thread 2 is waiting on the loader lock. So who holds the critical section that thread 6 wants? First, let's find what lock thread 6 wants:
# Child-SP RetAddr Call Site
00 00000000`0104eb18 00000000`777fe4e8 ntdll!ZwWaitForSingleObject+0xa
01 00000000`0104eb20 00000000`777fe3db ntdll!RtlpWaitOnCriticalSection+0xe8
02 00000000`0104ebd0 00000000`750c5d6b ntdll!RtlEnterCriticalSection+0xd1
03 00000000`0104ec00 00000000`750c6256 ProseWarePrintProcessor+0xabf
04 00000000`0104ec30 00000000`750c7015 ProseWarePrintProcessor+0xfaa
05 00000000`0104f090 00000000`777dc76c ProseWarePrintProcessor+0x1d69
06 00000000`0104f1f0 00000000`777dc42f ntdll!LdrpInitializeThread+0x17c
07 00000000`0104f2f0 00000000`777dc32e ntdll!LdrpInitialize+0x9f
08 00000000`0104f360 00000000`00000000 ntdll!LdrInitializeThunk+0xe
0:006> .frame /c /r 2
rax=0000000300d1001a rbx=00000000750ca330 rcx=00000000001d0000
rdx=0000000000000040 rsi=0000000000000001 rdi=0000000000000004
rip=00000000777fe3db rsp=000000000104ebd0 rbp=00000000ff000000
r8=00000000002c6a00 r9=00000000002c6a10 r10=0000000000000073
r11=0000000000000001 r12=000007fffffd6000 r13=00000000778e2660
iopl=0 nv up ei pl zr na po nc
cs=0033 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00000246
00000000`777fe3db 83f801 cmp eax,1
0:006> ub 00000000`750c5d6b <<<< Let's see what gets passed to RtlEnterCriticalSection
00000000`750c5d4e f9 stc
00000000`750c5d4f ff743f33 push qword ptr [rdi+rdi+33h]
00000000`750c5d53 d2ff sar bh,cl
00000000`750c5d55 15aeb4ffff adc eax,0FFFFB4AEh
00000000`750c5d5a 85c0 test eax,eax
00000000`750c5d5c 7533 jne ProseWarePrintProcessor+0xae5 (00000000`750c5d91)
00000000`750c5d5e 488d0dcb450000 lea rcx,[ProseWarePrintProcessor+0x5084 (00000000`750ca330)] <<<< Critical section is in rcx
00000000`750c5d65 ff15bdb3ffff call qword ptr [ProseWarePrintProcessor+0x1128 (00000000`750c1128)]
0:006> u ntdll!RtlEnterCriticalSection
00000000`77802fc0 fff3 push rbx
00000000`77802fc2 4883ec20 sub rsp,20h
00000000`77802fc6 f00fba710800 lock btr dword ptr [rcx+8],0
00000000`77802fcc 488bd9 mov rbx,rcx <<<< Critical section is saved in RBX. RBX isn't modified between here and our current position
00000000`77802fcf 0f83e9b1ffff jae ntdll!RtlEnterCriticalSection+0x31 (00000000`777fe1be)
00000000`77802fd5 65488b042530000000 mov rax,qword ptr gs:[30h]
00000000`77802fde 488b4848 mov rcx,qword ptr [rax+48h]
00000000`77802fe2 c7430c01000000 mov dword ptr [rbx+0Ch],1
0:006> r rbx
Last set context:
rbx=00000000750ca330 <<<< This is the address of the critical section thread 6 is waiting for
Now let's look at the threads in this process, and the held locks:
# 0 Id: 20a4.180c Suspend: 0 Teb: 000007ff`fffde000 Unfrozen
1 Id: 20a4.2294 Suspend: 0 Teb: 000007ff`fffdc000 Unfrozen
. 2 Id: 20a4.1328 Suspend: 0 Teb: 000007ff`fffda000 Unfrozen
3 Id: 20a4.1c34 Suspend: 0 Teb: 000007ff`fffd8000 Unfrozen
4 Id: 20a4.5bc Suspend: 0 Teb: 000007ff`fffd4000 Unfrozen
5 Id: 20a4.3c4 Suspend: 0 Teb: 000007ff`fffae000 Unfrozen
CritSec ntdll!LdrpLoaderLock+0 at 00000000778e7490
CritSec ProseWarePrintProcessor+5084 at 00000000750ca330
Scanned 201 critical sections
That's not good! We can see that thread 6 indeed owns the loader lock, which thread 2 is waiting for. But thread 2 owns the ProseWarePrintProcessor lock - and thread 6 is waiting for it! This is a classic deadlock. In fact, Raymond Chen even described this on his blog. More information about LdrInitialize can be found here.
So we know that there is a deadlock in the Print Isolation Host, but why exactly does this cause spooler to hang? Here's where we work backwards. We know that thread 6 is handling DLL initialization, but what is thread 2 doing? From the stack we can see it is handling an RPC request that called into ProseWarePrintProcessor. Let's determine who called into this thread.
00 00000000`0085dd88 00000000`777fe4e8 ntdll!ZwWaitForSingleObject+0xa
01 00000000`0085dd90 00000000`777fe3db ntdll!RtlpWaitOnCriticalSection+0xe8
02 00000000`0085de40 00000000`777db9e7 ntdll!RtlEnterCriticalSection+0xd1
03 00000000`0085de70 000007fe`fd963706 ntdll!LdrLockLoaderLock+0x6d
04 00000000`0085deb0 00000000`750c58f3 KERNELBASE!GetModuleFileNameW+0x96
05 00000000`0085df10 00000000`750c5d77 ProseWarePrintProcessor!InstallPrintProcessor+0x647
06 00000000`0085e380 00000000`750c51d9 ProseWarePrintProcessor!InstallPrintProcessor+0xacb
07 00000000`0085e3b0 000007fe`f96766b4 ProseWarePrintProcessor!ControlPrintProcessor+0x25
08 00000000`0085e3e0 000007fe`fe2f23d5 PrintIsolationProxy!sandbox::PrintProcessor::ControlPrintProcessor+0x34
09 00000000`0085e420 000007fe`fe39b68e rpcrt4!Invoke+0x65
0a 00000000`0085e470 000007fe`fe2f48d6 rpcrt4!Ndr64StubWorker+0x61b
0b 00000000`0085ea30 000007fe`fdd50883 rpcrt4!NdrStubCall3+0xb5
0c 00000000`0085ea90 000007fe`fdd50ccd ole32!CStdStubBuffer_Invoke+0x5b
0d 00000000`0085eac0 000007fe`fdd50c43 ole32!SyncStubInvoke+0x5d
0e 00000000`0085eb30 000007fe`fdc0a4f0 ole32!StubInvoke+0xdb
0f 00000000`0085ebe0 000007fe`fdd514d6 ole32!CCtxComChnl::ContextInvoke+0x190
10 00000000`0085ed70 000007fe`fdd5122b ole32!AppInvoke+0xc2
11 00000000`0085ede0 000007fe`fdd4fd6d ole32!ComInvokeWithLockAndIPID+0x52b
12 00000000`0085ef70 000007fe`fe2e50f4 ole32!ThreadInvoke+0x30d
13 00000000`0085f010 000007fe`fe2e4f56 rpcrt4!DispatchToStubInCNoAvrf+0x14
14 00000000`0085f040 000007fe`fe2e775b rpcrt4!RPC_INTERFACE::DispatchToStubWorker+0x146
15 00000000`0085f160 000007fe`fe2e769b rpcrt4!RPC_INTERFACE::DispatchToStub+0x9b
16 00000000`0085f1a0 000007fe`fe2e7632 rpcrt4!RPC_INTERFACE::DispatchToStubWithObject+0x5b
17 00000000`0085f220 000007fe`fe2e532d rpcrt4!LRPC_SCALL::DispatchRequest+0x422
18 00000000`0085f300 000007fe`fe302e7f rpcrt4!LRPC_SCALL::HandleRequest+0x20d
19 00000000`0085f430 000007fe`fe302a35 rpcrt4!LRPC_ADDRESS::ProcessIO+0x3bf
1a 00000000`0085f570 00000000`777cb68b rpcrt4!LrpcIoComplete+0xa5
1b 00000000`0085f600 00000000`777cfeff ntdll!TppAlpcpExecuteCallback+0x26b
1c 00000000`0085f690 00000000`776a652d ntdll!TppWorkerThread+0x3f8
1d 00000000`0085f990 00000000`777dc521 kernel32!BaseThreadInitThunk+0xd
1e 00000000`0085f9c0 00000000`00000000 ntdll!RtlUserThreadStart+0x1d
I know that the code in frame 19 deals with processing the RPC and has a record of the calling process' PID and TID. In fact, from a bit of code review, I know that at this portion of the code, we get back a value that contains a ntdll!_CLIENT_ID structure at offset 8:
000007fe`fe302ba6 ff151c050a00 call qword ptr [rpcrt4!_imp_AlpcGetMessageFromCompletionList (000007fe`fe3a30c8)]
000007fe`fe302bac 4885c0 test rax,rax
000007fe`fe302baf 0f84d1020000 je rpcrt4!LRPC_ADDRESS::ProcessIO+0x3c6 (000007fe`fe302e86)
000007fe`fe302bb5 4c8bac2488000000 mov r13,qword ptr [rsp+88h]
000007fe`fe302bbd 41bc01000000 mov r12d,1
000007fe`fe302bc3 33d2 xor edx,edx
000007fe`fe302bc5 488bf8 mov rdi,rax
000007fe`fe302bc8 41bfff000000 mov r15d,0FFh
Reviewing the assembly, from ProcessIO+0xe6 to where we are now (ProcessIO+0x3bf), we don't modify rdi again - and rdi is nonvolatile - so if we switch to that frame and check out rdi+8, we'll know who called this thread!
0:002> .frame /c /r 19
rax=0000e5a47e7646ad rbx=0000000000000001 rcx=00000000750ca330
rdx=0000000000000000 rsi=00000000002b8aa0 rdi=00000000002c4a80
rip=000007fefe302e7f rsp=000000000085f430 rbp=0000000000ecff90
r8=000000000085e2d8 r9=0000000000000002 r10=0000000000000000
r11=0000000000000246 r12=0000000000ec8bf0 r13=0000000000000000
cs=0033 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00000246
000007fe`fe302e7f 4c8d357ad1fbff lea r14,[rpcrt4!COMMON_ResubmitListen <PERF> (rpcrt4+0x0) (000007fe`fe2c0000)]
0:002> dt _CLIENT_ID rdi+8
+0x000 UniqueProcess : 0x00000000`00002694 Void
+0x008 UniqueThread : 0x00000000`000005e8 Void
0:002> ? 0x00000000`00002694
Evaluate expression: 9876 = 00000000`00002694
0:002> ? 0x00000000`000005e8
Evaluate expression: 1512 = 00000000`000005e8
So now we know that the caller was process 0x2694 and thread 0x5e8, or 9876 and 1512 in decimal, respectively. Our current process (PrintIsolationHost.exe) is PID 20a4 (see above ~ output), or 8356 decimal. So who is 9876? I happen to have a process listing from the data collection:
 0 64 9876 spoolsv.exe Svcs: Spooler
Command Line: C:\Windows\System32\spoolsv.exe
 0 64 8356 PrintIsolationHost.exe
Command Line: C:\Windows\system32\PrintIsolationHost.exe -Embedding
Okay, so I know the caller is thread 5e8 in spooler. Loading up the spooler dump, what is that thread doing?
00000000`77801b6a c3 ret
# Child-SP RetAddr Call Site
00 00000000`0649d898 000007fe`fe2fa776 ntdll!NtAlpcSendWaitReceivePort+0xa
01 00000000`0649d8a0 000007fe`fe2f4e42 rpcrt4!LRPC_CCALL::SendReceive+0x156
02 00000000`0649d960 000007fe`fdd528c0 rpcrt4!I_RpcSendReceive+0x42
03 00000000`0649d990 000007fe`fdd5282f ole32!ThreadSendReceive+0x40
04 00000000`0649d9e0 000007fe`fdd5265b ole32!CRpcChannelBuffer::SwitchAptAndDispatchCall+0xa3
05 00000000`0649da80 000007fe`fdc0daaa ole32!CRpcChannelBuffer::SendReceive2+0x11b
06 00000000`0649dc40 000007fe`fdc0da0c ole32!CAptRpcChnl::SendReceive+0x52
07 00000000`0649dd10 000007fe`fdd5205d ole32!CCtxComChnl::SendReceive+0x68
08 00000000`0649ddc0 000007fe`fe39b949 ole32!NdrExtpProxySendReceive+0x45
09 00000000`0649ddf0 000007fe`fdd521d0 rpcrt4!NdrpClientCall3+0x2e2
0a 00000000`0649e0b0 000007fe`fdc0d8a2 ole32!ObjectStublessClient+0x11d
0b 00000000`0649e440 000007fe`fb059070 ole32!ObjectStubless+0x42
0c 00000000`0649e490 000007fe`fb057967 localspl!sandbox::PrintProcessorExecuteObserver::ControlPrintProcessor+0x10
0d 00000000`0649e4c0 000007fe`fb055e27 localspl!sandbox::PrintProcessorAdapterImpl::ControlPrintProcessor+0x27
0e 00000000`0649e4f0 000007fe`faff7392 localspl!sandbox::PrintProcessorAdapter::ControlPrintProcessor+0x1b
0f 00000000`0649e520 000007fe`faff8a0a localspl!DeleteJob+0x1ce
10 00000000`0649eae0 00000000`ff41fe25 localspl!SplSetJob+0x49e
11 00000000`0649eb80 000007fe`f9683603 spoolsv!SetJobW+0x25
12 00000000`0649ebc0 00000000`61001ce1 spoolss!SetJobW+0x1f
13 00000000`0649ec00 00000000`61001d7d Contoso!InitializePrintMonitor+0x781
14 00000000`0649ec40 000007fe`faffa674 Contoso!InitializePrintMonitor+0x81d
15 00000000`0649ec70 00000000`ff41c9c7 localspl!SplEndDocPrinter+0x214
16 00000000`0649ecd0 00000000`ff403ba6 spoolsv!EndDocPrinter+0x1f
17 00000000`0649ed00 00000000`ff3fe772 spoolsv!YEndDocPrinter+0x22
18 00000000`0649ed30 000007fe`fe2f23d5 spoolsv!RpcEndDocPrinter+0x3e
19 00000000`0649ed60 000007fe`fe39b68e rpcrt4!Invoke+0x65
1a 00000000`0649edb0 000007fe`fe2dac40 rpcrt4!Ndr64StubWorker+0x61b
1b 00000000`0649f370 000007fe`fe2e50f4 rpcrt4!NdrServerCallAll+0x40
1c 00000000`0649f3c0 000007fe`fe2e4f56 rpcrt4!DispatchToStubInCNoAvrf+0x14
1d 00000000`0649f3f0 000007fe`fe2e5679 rpcrt4!RPC_INTERFACE::DispatchToStubWorker+0x146
1e 00000000`0649f510 000007fe`fe2e532d rpcrt4!LRPC_SCALL::DispatchRequest+0x149
1f 00000000`0649f5f0 000007fe`fe302e7f rpcrt4!LRPC_SCALL::HandleRequest+0x20d
20 00000000`0649f720 000007fe`fe302a35 rpcrt4!LRPC_ADDRESS::ProcessIO+0x3bf
21 00000000`0649f860 00000000`777cb68b rpcrt4!LrpcIoComplete+0xa5
22 00000000`0649f8f0 00000000`777cfeff ntdll!TppAlpcpExecuteCallback+0x26b
23 00000000`0649f980 00000000`776a652d ntdll!TppWorkerThread+0x3f8
24 00000000`0649fc80 00000000`777dc521 kernel32!BaseThreadInitThunk+0xd
25 00000000`0649fcb0 00000000`00000000 ntdll!RtlUserThreadStart+0x1d
It's calling into the print isolation host as we expect. It looks like it is doing that to end a print job, based on an RPC call it received. Using our same method as last time, let's pull out the PID and TID he is responding to:
0:048> .frame /c /r 20
rax=000000000622986c rbx=0000000000000000 rcx=000000000622985c
rdx=000007fefe3a3c40 rsi=00000000027e5150 rdi=0000000008e6abd0
rip=000007fefe302e7f rsp=000000000649f720 rbp=0000000008e43480
r8=0000000000000010 r9=0000000000000000 r10=000007fefe2c0000
r11=0000000000000002 r12=000000000512d8c0 r13=0000000000000000
0:048> dt _CLIENT_ID rdi+8
+0x000 UniqueProcess : 0x00000000`000020a4 Void
+0x008 UniqueThread : 0x00000000`00001c34 Void
Look at that, it's doing work for a different thread back in our Print Isolation Host. Switching back to that dump, what is thread 1c34 doing?
00 00000000`00ebc098 000007fe`fe2fa776 ntdll!NtAlpcSendWaitReceivePort+0xa
01 00000000`00ebc0a0 000007fe`fe39cc74 rpcrt4!LRPC_CCALL::SendReceive+0x156
02 00000000`00ebc160 000007fe`fe39cf25 rpcrt4!NdrpClientCall3+0x244
03 00000000`00ebc420 000007fe`f9bfd878 rpcrt4!NdrClientCall3+0xf2
04 00000000`00ebc7b0 000007fe`f96845bf winspool!EndDocPrinter+0x15c
05 00000000`00ebc7f0 00000000`750c4102 spoolss!EndDocPrinter+0x2f
06 00000000`00ebc820 00000000`750c5013 ProseWarePrintProcessor+0x4102
07 00000000`00ebe620 000007fe`f9676be2 ProseWarePrintProcessor!PrintDocumentOnPrintProcessor+0xb3
08 00000000`00ebe650 000007fe`fe2f23d5 PrintIsolationProxy!sandbox::PrintProcessor::PrintDocThroughPrintProcessor+0x82
09 00000000`00ebe6b0 000007fe`fe39b68e rpcrt4!Invoke+0x65
0a 00000000`00ebe710 000007fe`fe2f48d6 rpcrt4!Ndr64StubWorker+0x61b
0b 00000000`00ebecd0 000007fe`fdd50883 rpcrt4!NdrStubCall3+0xb5
0c 00000000`00ebed30 000007fe`fdd50ccd ole32!CStdStubBuffer_Invoke+0x5b
0d 00000000`00ebed60 000007fe`fdd50c43 ole32!SyncStubInvoke+0x5d
0e 00000000`00ebedd0 000007fe`fdc0a4f0 ole32!StubInvoke+0xdb
0f 00000000`00ebee80 000007fe`fdd514d6 ole32!CCtxComChnl::ContextInvoke+0x190
10 00000000`00ebf010 000007fe`fdd5122b ole32!AppInvoke+0xc2
11 00000000`00ebf080 000007fe`fdd4fd6d ole32!ComInvokeWithLockAndIPID+0x52b
12 00000000`00ebf210 000007fe`fe2e50f4 ole32!ThreadInvoke+0x30d
13 00000000`00ebf2b0 000007fe`fe2e4f56 rpcrt4!DispatchToStubInCNoAvrf+0x14
14 00000000`00ebf2e0 000007fe`fe2e775b rpcrt4!RPC_INTERFACE::DispatchToStubWorker+0x146
15 00000000`00ebf400 000007fe`fe2e769b rpcrt4!RPC_INTERFACE::DispatchToStub+0x9b
16 00000000`00ebf440 000007fe`fe2e7632 rpcrt4!RPC_INTERFACE::DispatchToStubWithObject+0x5b
17 00000000`00ebf4c0 000007fe`fe2e532d rpcrt4!LRPC_SCALL::DispatchRequest+0x422
18 00000000`00ebf5a0 000007fe`fe302e7f rpcrt4!LRPC_SCALL::HandleRequest+0x20d
19 00000000`00ebf6d0 000007fe`fe302a35 rpcrt4!LRPC_ADDRESS::ProcessIO+0x3bf
1a 00000000`00ebf810 00000000`777cb68b rpcrt4!LrpcIoComplete+0xa5
1b 00000000`00ebf8a0 00000000`777cfeff ntdll!TppAlpcpExecuteCallback+0x26b
1c 00000000`00ebf930 00000000`776a652d ntdll!TppWorkerThread+0x3f8
1d 00000000`00ebfc30 00000000`777dc521 kernel32!BaseThreadInitThunk+0xd
1e 00000000`00ebfc60 00000000`00000000 ntdll!RtlUserThreadStart+0x1d
This thread called into spooler to end a document based on yet another RPC! To be clear, this is what we're looking at so far:
Again, who called into this thread?
0:003> .frame /c /r 19
rax=57633d3cd2c9c145 rbx=0000000000000000 rcx=0000000000861ff0
rdx=ffffffffff88ffe0 rsi=00000000002b8aa0 rdi=0000000000ec6bc0
rip=000007fefe302e7f rsp=0000000000ebf6d0 rbp=00000000002cdef0
r8=000000000003dbf0 r9=00000000000000fe r10=6dc0575d3d3cc4c8
r11=0000000000860020 r12=0000000000ec82b0 r13=0000000000000000
0:003> dt _CLIENT_ID rdi+8
+0x008 UniqueThread : 0x00000000`000014cc Void
So another call from spooler (note the same PID as earlier) - let's go back to spoolsv.dmp. Here is this thread:
00000000`0351e538 000007fe`fe2fa776 ntdll!NtAlpcSendWaitReceivePort+0xa
00000000`0351e540 000007fe`fe2f4e42 rpcrt4!LRPC_CCALL::SendReceive+0x156
00000000`0351e600 000007fe`fdd528c0 rpcrt4!I_RpcSendReceive+0x42
00000000`0351e630 000007fe`fdd5282f ole32!ThreadSendReceive+0x40
00000000`0351e680 000007fe`fdd5265b ole32!CRpcChannelBuffer::SwitchAptAndDispatchCall+0xa3
00000000`0351e720 000007fe`fdc0daaa ole32!CRpcChannelBuffer::SendReceive2+0x11b
00000000`0351e8e0 000007fe`fdc0da0c ole32!CAptRpcChnl::SendReceive+0x52
00000000`0351e9b0 000007fe`fdd5205d ole32!CCtxComChnl::SendReceive+0x68
00000000`0351ea60 000007fe`fe39b949 ole32!NdrExtpProxySendReceive+0x45
00000000`0351ea90 000007fe`fdd521d0 rpcrt4!NdrpClientCall3+0x2e2
00000000`0351ed50 000007fe`fdc0d8a2 ole32!ObjectStublessClient+0x11d
00000000`0351f0e0 000007fe`fb0591ac ole32!ObjectStubless+0x42
00000000`0351f130 000007fe`fb057882 localspl!sandbox::PrintProcessorExecuteObserver::PrintDocThroughPrintProcessor+0x124
00000000`0351f1f0 000007fe`fb05601d localspl!sandbox::PrintProcessorAdapterImpl::PrintDocumentOnPrintProcessor+0x3a
00000000`0351f220 000007fe`fb013b70 localspl!sandbox::PrintProcessorAdapter::PrintDocumentOnPrintProcessor+0x9d
00000000`0351f270 000007fe`fb014c7c localspl!PrintDocumentThruPrintProcessor+0x46c
00000000`0351fa70 00000000`776a652d localspl!PortThread+0x4d0
00000000`0351fd80 00000000`777dc521 kernel32!BaseThreadInitThunk+0xd
00000000`0351fdb0 00000000`00000000 ntdll!RtlUserThreadStart+0x1d
Perfect. So now we know that localspl was printing a document, which resulted in all of these RPC calls between spooler and Print Isolation Host, and ultimately the deadlock in Print Isolation Host is holding up this thread. Just out of curiosity, are there any other threads blocked on this wait chain?
CritSec +5a00060 at 0000000005a00060
Scanned 14148 critical sections
Yes there are. Thread 5e8, which we looked at earlier, is holding a lock that 104 other threads are waiting for! This dump had 177 threads, so we know now that thread 5e8, 14cc, and 104 others in spooler are all hung on this deadlock. With about 60% of all the threads in spooler hung, the deadlock in ProseWarePrintProcessor is clearly the cause of our issue. Here's the final wait chain diagram:
To resolve the issue, ProseWarePrintProcessor needs to avoid calling GetModuleFileName while its DllMain routine is still running, since the former requires and the latter holds the Loader Lock.
Hi, this is Bob Golding; I wanted to write a blog about an interesting hardware issue I ran into. Hardware problems can be tricky to isolate. I recently came across one that I thought was interesting and gave an example of how to trace code execution. The machine executed the filler “int 3” instructions generated by the compiler. Execution should never reach these filler instructions, so we needed to determine how the instruction pointer got there.
What was the issue?
The issue was a bug check 8E (unhandled exception). The exception was a debug exception (80000003), because a filler INT 3 instruction was executed.
1: kd> .bugcheck
Bugcheck code 0000008E
Arguments 80000003 8082e3e0 f78aec38 00000000
Below is the trap frame (the trap frame is the third argument in the bugcheck code). Note that the actual trapping instruction is 8082e3e0, the instruction pointer is incremented before the INT 3 generates a trap. The correct EIP is reported in the bug check values.
1: kd> .trap f78aec38
ErrCode = 00000000
eax=0e364a01 ebx=f78aed50 ecx=f78aecb8 edx=000000e1 esi=f7727ec0 edi=f75d9ca2
eip=8082e3e1 esp=f78aecac ebp=f78aecc0 iopl=0 nv up ei pl nz na po nc
cs=0008 ss=0010 ds=0023 es=0023 fs=0030 gs=0000 efl=00000202
8082e3e1 cc int 3
The machine executed code at the end of a function that should never be executed. This should never happen.
1: kd> ub 8082e3e1
8082e3c6 32c9 xor cl,cl
8082e3c8 ff2508118080 jmp dword ptr [nt!_imp_KfLowerIrql (80801108)]
8082e3ce 64a124010000 mov eax,dword ptr fs:[00000124h]
8082e3d4 c6403d01 mov byte ptr [eax+3Dh],1
8082e3d8 ff2598108080 jmp dword ptr [nt!_imp_HalRequestSoftwareInterrupt (80801098)]
8082e3de cc int 3 <<< This is after the function.
8082e3df cc int 3
8082e3e0 cc int 3
OK, So Now What?
Now, we need to find the execution path that caused the machine to execute this INT 3. There are places to look to find clues that will tell us. The first place to start looking is the stack. If a “call” instruction was made, the return will be pushed on the stack. This way we can try to determine if we arrived at this bad instruction pointer from a call or a ret instruction.
Using the value from esp in the above trap frame, let’s dump the stack.
1: kd> dps f78aecac
f78aecac 80a5d8fc hal!HalpIpiHandler+0xcc <<< Interesting?
In looking at the stack dump, we see that there may have been a call from HalpIpiHandler. Let’s dump the code leading up to hal!HalpIpiHandler+0xcc to see what it did.
1: kd> ub 80a5d8fc
80a5d8e1 e8d82b0000 call hal!HalBeginSystemInterrupt (80a604be)
80a5d8e6 0ac0 or al,al
80a5d8e8 7509 jne hal!HalpIpiHandler+0xc3 (80a5d8f3)
80a5d8ea 83c408 add esp,8
80a5d8ed ff2510b0a580 jmp dword ptr [hal!_imp_Kei386EoiHelper (80a5b010)]
80a5d8f3 6a00 push 0
80a5d8f5 55 push ebp
80a5d8f6 ff1520b0a580 call dword ptr [hal!_imp__KiIpiServiceRoutine (80a5b020)]
In the above assembly, we can see that there is a call made using a pointer in the import table. Now, let’s have a look at that pointer.
1: kd> dd 80a5b020 l 1
The pointer is very close to the instruction we trapped on. Is this a coincidence?
It looks like due to an effective address calculation failure, the machine starting executing at 8082e3e0 instead of 8082e3e4. Somewhere in the data path the processor executing this instruction stream dropped bit three, turning a 4 into a 0.
1: kd> ?0y0100
Evaluate expression: 4 = 00000000`00000004
1: kd> ?0y0000
Evaluate expression: 0 = 00000000`00000000
What does all of this mean?
There is some circumstantial evidence here that the machine was in the IPI handler. The IPI Handler is used in multiprocessor systems so that one processor may interrupt another. So, how can we further prove this is where we were? Let’s try to match the trap frame registers to the assembly from HalpIpiHandler before it calls KiIpiServiceRoutine.
eip=8082e3e1 esp=f78aecac ebp=f78aecc0 iopl=0 nv up ei pl nz na po nc
1: kd> ub 80a5d8fc l10
80a5d8c4 c74508000ddbba mov dword ptr [ebp+8],0BADB0D00h <<< EBP+ 8 does equal badb0d00
80a5d8cb 895d00 mov dword ptr [ebp],ebx <<< ebp does equal ebx
80a5d8ce 897d04 mov dword ptr [ebp+4],edi <<< ebp+4 does equal edi
80a5d8d1 68e1000000 push 0E1h
80a5d8d6 83ec04 sub esp,4
80a5d8d9 54 push esp
80a5d8da 68e1000000 push 0E1h
80a5d8df 6a1d push 1Dh
80a5d8e6 0ac0 or al,al <<< al is equal to 1
Below is the stack data that the instructions above produced (borrowed from the dps output above). It matches, so the machine did execute the above instructions.
f78aecac 80a5d8fc hal!HalpIpiHandler+0xcc <<< Return Pushed by call instruction
f78aecb0 f78aecc0 <<< Pushed EBP
f78aecb4 00000000 <<< Pushed a 0
Finally, the last bit of evidence:
Let's view the IPI state using the debugger command !ipi. From this output we can see that processor 1 is the receiver of a cross interrupt from Processor 0. This is consistent with the data we found on the stack.
1: kd> !ipi
IPI State for Processor 0
As a sender, awaiting packet completion from processor 1.
TargetSet 2 PacketBarrier e IpiFrozen 2 [Frozen]
IpiFrame 8089a570 SignalDone 00000000 RequestSummary 0
Packet State Active WorkerRoutine nt!KiFlushTargetMultipleTb
Parameter 00000000 Parameter 80899f10 Parameter 80899f04
IPI State for Processor 1
As a receiver, the following unhandled requests are pending: [Packet] [DPC]
TargetSet 0 PacketBarrier 0 IpiFrozen 0 [Running]
IpiFrame b5ad7be8 SignalDone ffdff120 RequestSummary 2 [DPC]
Packet State Stale WorkerRoutine nt!KiFlushTargetMultipleTb
Parameter 00000000 Parameter b5ad7950 Parameter b5ad7948
What went wrong?
Based on the evidence in this dump it appears that call instruction transferred execution to the wrong address. The machine ended up executing at address 8082e3e0 instead of 8082e3e4, a single bit difference. This same bit was flipped in several crashes from this machine, so all the evidence pointed to a faulty processor. After replacing the processor we were running on when we bugchecked, the issue did not occur again.
Hardware can sometimes cause pretty specific failures, such as the flipped bit we see here. To determine that this failure was a hardware issue, we had to reconstruct the execution path and trace how we ended up at the failing instruction. We were able to match the register contents to what they would have been before the call to KiIpiServiceRoutine. This demonstrated that the call should have been made to KiIpiServiceRoutine, but it unexpectedly went to the wrong address.
Hi, it's the Debug Ninja back again with another debugging adventure. Recently I have encountered several instances where processes fail to initialize, and a review of available resources showed that there was no obvious resource exhaustion. A more in depth review found that there were no available string atoms in the global atom table.
Global atoms are organized on a per-session basis. If atoms cannot be allocated in session 0, services may fail to start or processes launched by various services may fail to start. However, a user logged in to a different session will not experience any such failures.
String atoms are numbered from 0xC000 through 0xFFFF, providing a maximum of 0x4000 atoms per session. For more information on atoms, and atom tables, see http://technet.microsoft.com/en-us/query/ms649053.
When there are no more string atoms available, calls to APIs that allocate string atoms will fail. Because atoms are often allocated at process or dll init time, the most common symptom is that processes fail to initialize. The process may cleanly exit without an error. You are likely experiencing this problem if you debug your application and find that the failure originates from an API that allocates string atoms such as RegisterClass, RegisterClassEx, GlobalAddAtom, or AddAtom.
To determine if the global string atom table is full you will need to perform a kernel debug. This can be a live debug or a post-mortem debug using a dump.
First identify the session where the failures have occurred and set the process context to a process in this session. In my example, w3wp.exe was launching a process and this process failed to initialize.
2: kd> !process 0 0 w3wp.exe
SessionId: 0 Cid: 1668 Peb: fffdf000 ParentCid: 08ec
DirBase: 8a2df000 ObjectTable: fffff8a0128bbe40 HandleCount: 441.
2: kd> .process /p /r fffffa8005083060
Implicit process is now fffffa80`05083060
Loading User Symbols
Next we need to analyze the global atom table. The pointer to the table is stored in the UserAtomTableHandle global.
2: kd> dq win32k!UserAtomTableHandle l1
The UserAtomTableHandle has a pointer to a handle table at offset 0x10 in 64-bit, and offset 0x8 in 32-bit. Note that although the atom table is defined as a _RTL_ATOM_TABLE, the format shown by dt is for user mode and does not apply to the UserAtomTableHandle in kernel mode.
2: kd> dq fffff8a0`05e5bc70+10 l1
2: kd> dt nt!_HANDLE_TABLE fffff8a0`05db7740
+0x000 TableCode : 0xfffff8a0`109c8001
+0x008 QuotaProcess : (null)
+0x010 UniqueProcessId : 0x00000000`00000184 Void
+0x018 HandleLock : _EX_PUSH_LOCK
+0x020 HandleTableList : _LIST_ENTRY [ 0xfffff8a0`05db7760 - 0xfffff8a0`05db7760 ]
+0x030 HandleContentionEvent : _EX_PUSH_LOCK
+0x038 DebugInfo : (null)
+0x040 ExtraInfoPages : 0n0
+0x044 Flags : 0
+0x044 StrictFIFO : 0y0
+0x048 FirstFreeHandle : 0x10004
+0x050 LastFreeHandleEntry : 0xfffff8a0`10ca4ff0 _HANDLE_TABLE_ENTRY
+0x058 HandleCount : 0x3fc0
+0x05c NextHandleNeedingPool : 0x10400
+0x060 HandleCountHighWatermark : 0x3fc1
The FirstFreeHandle contains the handle number that will be given to the next handle allocated from this table. This value is encoded, to get the next handle number we need to right shift the FirstFreeHandle by 2 bits.
2: kd> ?00010004>>2
Evaluate expression: 16385 = 00000000`00004001
The result from above, 0x4001, is greater than the number of possible string atoms. As I mentioned earlier, there is a limit of 0x4000 string atoms. Now we know that the session is out of string atoms.
The next step is to dump the string atoms to identify whether there is an observable pattern in the leaked strings. The !atom command only works in user mode, so we need to dump the kernel mode strings manually. An atom table is comprised of multiple buckets. Each bucket is the head of a list of atoms. The buckets start at offset 0x20 in the atom table in 64-bit, and offset 0x10 in 32-bit.
2: kd> dq fffff8a0`05e5bc70+20
fffff8a0`05e5bc90 fffff8a0`05e5ba60 fffff8a0`05db7be0
fffff8a0`05e5bca0 fffff8a0`08cf1770 fffff8a0`05e5b3d0
fffff8a0`05e5bcb0 fffff8a0`05ea9020 fffff8a0`05e5b8e0
fffff8a0`05e5bcc0 fffff8a0`05ea9b10 fffff8a0`05ea9910
fffff8a0`05e5bcd0 fffff8a0`05ea9f00 fffff8a0`05e5b650
fffff8a0`05e5bce0 fffff8a0`05cda290 fffff8a0`05ea9e80
fffff8a0`05e5bcf0 fffff8a0`05e5b200 fffff8a0`05ea9e30
fffff8a0`05e5bd00 fffff8a0`05e5b7e0 fffff8a0`06c56210
2: kd> dq
fffff8a0`05e5bd10 fffff8a0`06d6b5a0 fffff8a0`05ea9d50
fffff8a0`05e5bd20 fffff8a0`05e5b790 fffff8a0`05e5b9d0
fffff8a0`05e5bd30 fffff8a0`06bd9bc0 fffff8a0`05ea9c90
fffff8a0`05e5bd40 fffff8a0`05e5b0c0 fffff8a0`06ae2020
fffff8a0`05e5bd50 fffff8a0`05e5b930 fffff8a0`04d2af40
fffff8a0`05e5bd60 fffff8a0`05e5b690 fffff8a0`05e5b980
fffff8a0`05e5bd70 fffff8a0`05e5b490 fffff8a0`05e5b410
fffff8a0`05e5bd80 fffff8a0`05e5ba20 fffff8a0`05e5b4f0
fffff8a0`05e5bd90 fffff8a0`05e5baa0 fffff8a0`05e5b390
fffff8a0`05e5bda0 fffff8a0`05e5b840 fffff8a0`05ea9c50
fffff8a0`05e5bdb0 fffff8a0`05e5b250 00000000`00000000
fffff8a0`05e5bdc0 00000000`00000000 00000000`00000000
fffff8a0`05e5bdd0 00000000`00000000 00000000`00000000
fffff8a0`05e5bde0 00000000`00000000 00000000`00000000
fffff8a0`05e5bdf0 00000000`00000000 00000000`00000000
fffff8a0`05e5be00 00000000`00000000 00000000`00000000
The quick and dirty way to dump the buckets is with !list. I am sure that some will say it is tedious to dump each bucket list by hand and that there are easier ways to accomplish this. To prevent this article from becoming a lesson on debugger scripting, I am leaving that as an exercise to the reader.
2: kd> !list "-t nt!_RTL_ATOM_TABLE_ENTRY.HashLink -e -x \"du @$extret+10\" fffff8a0`05e5ba60"
<snip strings that don't match a pattern>
Dumping the atoms I found that there is a continuous pattern of the string ControlOfs followed by 16 hexadecimal numbers. Some time spent with your favorite search engine should find other reports of atom leaks involving the string ControlOfs, and that these leaks have been identified as a problem in some specific software. In this instance the programmer using that software needs to change their application to avoid the problem.
Hello all, Scott Olson here again to share another interesting issue I recently debugged with pool corruption and found that using special pool does not work with large pool allocations (pool allocations greater than a PAGE_SIZE).
Here is an example of a valid large page allocation. Notice the size is 0x1fb0 and a PAGE_SIZE is 0x1000 or 4kb.
0: kd> !pool fffffa80`0dba6fa0
Pool page fffffa800dba6fa0 region is Nonpaged pool
*fffffa800dba5000 : large page allocation, Tag is Io , size is 0x1fb0 bytes
Pooltag Io : general IO allocations, Binary : nt!io
In Windows 7, at the end of the large pool allocation it will have an allocation tag of “Frag” then a “Free” tag with the rest of the page size and is stored on the free pool list for allocation less than a page in size.
0: kd> dc fffffa800dba5000 fffffa800dba5000+0x1fb0-4
fffffa80`0dba5000 00558001 32373242 00000000 00000000 ..U.B272........
fffffa80`0dba5010 55555555 55555555 98764321 01b75f55 UUUUUUUU!Cv.U_..
fffffa80`0dba5020 00000001 00000001 704e6ff0 fffff981 .........oNp....
fffffa80`0dba6f80 55555555 55555555 55555555 55555555 UUUUUUUUUUUUUUUU
fffffa80`0dba6f90 55555555 55555555 55555555 55555555 UUUUUUUUUUUUUUUU
fffffa80`0dba6fa0 55555555 55555555 00001fb0 00000000 UUUUUUUU........
0: kd> dc
fffffa80`0dba6fb0 02010100 67617246 55555555 55555555 ....FragUUUUUUUU
fffffa80`0dba6fc0 00040101 65657246 55555555 55555555 ....FreeUUUUUUUU
fffffa80`0dba6fd0 00802170 fffff880 0e49cf70 fffffa80 p!......p.I.....
fffffa80`0dba6fe0 15cc8fe8 fffff981 3b9c50a7 00000005 .........P.;....
Displayed with the !pool command:
0: kd> !pool fffffa80`0dba6fb0
Pool page fffffa800dba6fb0 region is Nonpaged pool
*fffffa800dba6fb0 size: 10 previous size: 0 (Allocated) *Frag
Owning component : Unknown (update pooltag.txt)
fffffa800dba6fc0 size: 40 previous size: 10 (Free) Free
The example above demonstrates how this normally works. The downside to this architecture is that if a driver were to overrun its pool allocation then special pool would not be useful because the large pool allocation has to be page-aligned. Special pool detects pool overruns by putting the data at the end of the page, which would not be feasible with a large pool allocation.
In Windows 7 there is a check while freeing the pool memory that will determine if this allocation had written past the end of its allocation, and if so will bug check the machine with a Stop 0x19 BAD_POOL_HEADER with the first parameter being a 0x21. Here is the definition along with what each parameter means:
The pool is already corrupt at the time of the current request.
This may or may not be due to the caller.
The internal pool links must be walked to figure out a possible cause of
the problem, and then special pool applied to the suspect tags or the driver
verifier to a suspect driver.
Arg1: 0000000000000021, the data following the pool block being freed is corrupt. Typically this means the consumer (call stack ) has overrun the block.
Arg2: fffffa800dc57000, The pool pointer being freed.
Arg3: 0000000000002180, The number of bytes allocated for the pool block.
Arg4: 006b0072006f0077, The corrupted value found following the pool block.
Here is an example of what this corruption looks like compared to the above valid large pool allocation:
0: kd> !pool fffffa800dc57000
Pool page fffffa800dc57000 region is Nonpaged pool
fffffa800dc57000 is not a valid large pool allocation, checking large session pool...
fffffa800dc57000 is freed (or corrupt) pool
Bad allocation size @fffffa800dc57000, zero is invalid
*** An error (or corruption) in the pool was detected;
*** Attempting to diagnose the problem.
*** Use !poolval fffffa800dc57000 for more details.
Pool page [ fffffa800dc57000 ] is __inVALID.
Analyzing linked list...
[ fffffa800dc57000 ]: invalid previous size [ 0x38 ] should be [ 0x0 ]
Scanning for single bit errors...
Next, I dump the allocation from the start to the end. Notice the size of the allocation is stored in the bugcheck code as argument 3.
0: kd> dc fffffa800dc57000 fffffa800dc57000+2180-4
fffffa80`0dc57000 00000038 0000000e 00000000 00000000 8...............
fffffa80`0dc57010 a24da497 01ccc5d6 c827993c 41946d1f ..M.....<.'..m.A
fffffa80`0dc57020 c0d75c9b b7cff1a5 00000000 00000020 .\.......... ...
fffffa80`0dc57030 000021e0 00000006 0000006c 00000110 .!......l.......
fffffa80`0dc57040 00000208 000003b8 00000208 00000660 ............`...
fffffa80`0dc57050 00000208 00000910 00000208 00000bb0 ................
fffffa80`0dc59150 002d0033 00300031 0063002e 006d006f 3.-.1.0...c.o.m.
fffffa80`0dc59160 006c002e 00660065 00680074 006e0061 ..l.e.f.t.h.a.n.
fffffa80`0dc59170 006e0064 00740065 006f0077 006b0072 d.n.e.t.w.o.r.k.
This should be the end of the allocation. The next thing we see should be the “Frag” and “Free” tags.
fffffa80`0dc59180 003a0073 0061006d 0061006e 00650067 s.:.m.a.n.a.g.e.
fffffa80`0dc59190 0065006d 0074006e 0038003a 00390036 m.e.n.t.:.8.6.9.
fffffa80`0dc591a0 0062003a 00670069 0075006c 00790063 :.b.i.g.l.u.c.y.
fffffa80`0dc591b0 0064002d 00740061 002d0061 006e0069 -.d.a.t.a.-.i.n.
fffffa80`0dc591c0 00650064 00650078 002d0073 00740063 d.e.x.e.s.-.c.t.
fffffa80`0dc591d0 006c0072 0031005f 00000031 00000000 r.l._.1.1.......
We clearly see that the Frag and Free tag have been overwritten with some string value which is causing the corruption. At this point, you would need to look at the current stack to determine which driver had allocated the memory, and review the code to investigate when this corruption could have occurred.
Yesterday's blog prompted some questions about how to set up a debugger for a Windows OS running in a Hyper-V VM. I was surprised that I wasn't able to find good, publicly available, Microsoft issued documentation for this configuration.
The first step is to configure the Windows OS in the VM to enable a kernel debugger on COM1. One would use these same steps if you were preparing the OS to be debugged using a null modem cable. Hyper-V will allow us to redirect the COM port so that we don't need such a cable.
Next, configure Hyper-V to redirect the COM1 port to a named pipe. We will use this pipe in place of a traditional null modem cable.
After the OS and the VM are configured for debugging, we need to connect a debugger.
To test the debugger connection in windbg, from the ‘Debug’ menu choose ‘Break’. This should cause the server to break into the debugger and display a kd> prompt. Please note that breaking into the debugger will cause the OS running in the VM to halt until you tell the debugger to go, the OS will appear to be hung during this time. The command 'g' followed by Enter will tell the debugger to ‘go’ causing the VM to resume operation.