Hi, my name is Chad. I work as an escalation engineer for Microsoft’s OEM support team.
A while back, I encountered an interesting crash on one of my computers at home, and I thought I’d post about how I debugged it.
This particular machine had been humming along quite happily for some time, but one day while I was scanning some photos, it bluescreened. Naturally, I hoped it was just a fluke, but after it happened a few more times while doing the same thing, I decided to debug it.
Ordinarily, if a machine crashes when performing a specific activity, like scanning photos, my first inclination would be to suspect a bug in one of the drivers involved in that activity, like the scanner driver or the USB driver. But in this case, I had been using this scanner for a long time, with the same drivers, and never had this problem, so this sudden crashing was kind of mysterious.
Let's see what we can tell from the dump!
The first order of business when looking at a crash dump is the !analyze -v command. I've trimmed some of it here for brevity, but it goes something like this:
kd> !analyze -v
*******************************************************************************
* *
* Bugcheck Analysis *
KERNEL_MODE_EXCEPTION_NOT_HANDLED (8e)
This is a very common bugcheck. Usually the exception address pinpoints
the driver/function that caused the problem. Always note this address
as well as the link date of the driver/image that contains this address.
Some common problems are exception code 0x80000003. This means a hard
coded breakpoint or assertion was hit, but this system was booted
/NODEBUG. This is not supposed to happen as developers should never have
hardcoded breakpoints in retail code, but ...
If this happens, make sure a debugger gets connected, and the
system is booted /DEBUG. This will let us see why this breakpoint is
happening.
Arguments:
Arg1: c0000005, The exception code that was not handled
Arg2: 8738e300, The address that the exception occurred at
Arg3: b9b3dc7c, Trap Frame
Arg4: 00000000
Debugging Details:
------------------
EXCEPTION_CODE: (NTSTATUS) 0xc0000005 - The instruction at "0x%08lx" referenced memory at "0x%08lx". The memory could not be "%s".
FAULTING_IP:
+ffffffff8738e300
8738e300 0000 add [eax],al
TRAP_FRAME: b9b3dc7c -- (.trap ffffffffb9b3dc7c)
ErrCode = 00000002
eax=00000001 ebx=bc514c68 ecx=0001065e edx=bc510000 esi=00000955 edi=b9b3dd64
eip=8738e300 esp=b9b3dcf0 ebp=b9b3dd08 iopl=0 nv up ei pl zr na po nc
cs=0008 ss=0010 ds=0023 es=0023 fs=0030 gs=0000 efl=00010246
8738e300 0000 add [eax],al ds:0023:00000001=??
Resetting default scope
DEFAULT_BUCKET_ID: DRIVER_FAULT
BUGCHECK_STR: 0x8E
LAST_CONTROL_TRANSFER: from 8051d6a7 to 8053331e
STACK_TEXT:
WARNING: Frame IP not in any known module. Following frames may be wrong.
b9b3dcec bf801619 b9b3dd64 0012eee4 bf81551f 0x8738e300
b9b3dd08 bf81553c b9b3dd64 0012eee4 bf81551f win32k!ValidateHwnd+0x5c
b9b3dd50 804de7ec 0001065e 012eb244 00000000 win32k!NtUserInvalidateRect+0x1d
b9b3dd50 7c90eb94 0001065e 012eb244 00000000 nt!KiFastCallEntry+0xf8
0012eecc 77d4b601 0105a943 0001065e 012eb244 ntdll!KiFastSystemCallRet
0012eed0 0105a943 0001065e 012eb244 00000000 USER32!NtUserInvalidateRect+0xc
77d4b601 90909090 001134b8 0300ba00 12ff7ffe NikonScan4!GetSource+0x21e93
...
From the stack trace, we can tell that NikonScan4.dll made a call into User32, which ultimately ended up calling into win32k.sys down in kernel mode. Win32k was in a function called ValidateHwnd() and then tried to call some function at address 0x8738e300, at which point we tried to dereference an invalid pointer and crashed.
What’s at 0x8738e300?
kd> dc 8738e300
8738e300 00000000 00000001 00000000 87360350 ............P.6.
8738e310 00000000 00000001 f71af9fe f71b0030 ............0...
8738e320 f71afb0e f71afbb4 f71b0098 f71b0214 ................
8738e330 f71afef6 f71aff8e 07fef800 00000000 ................
8738e340 f71afffc 00000000 0a0e000a 644c6d4d ............MmLd
8738e350 8732ea58 870303e0 ffffffff ffffffff X.2.............
8738e360 00000012 00000000 f797f000 f7989905 ................
8738e370 0000c000 00500050 e1971458 00160016 ....P.P.X.......
kd> !pool 8738e300 2
Pool page 8738e300 region is Nonpaged pool
*8738e2f8 size: 50 previous size: 8 (Allocated) *NV
Owning component : Unknown (update pooltag.txt)
Well, that’s bad. 0x8738e300 isn’t actually a valid address of a function. That location contains some user data (specifically, some nonpaged pool).
So, that's why we blew up: the ValidateHwnd() function in win32k made a call to this bad address which contains data instead of code! Let's see if we can figure out why it did this. We can find the return address in win32k!ValidateHwnd on the stack, and unassemble the instructions leading up to the point where we ran off into the weeds.
kd> kv L8
ChildEBP RetAddr Args to Child
b9b3dd08 bf81553c b9b3dd64 0012eee4 bf81551f win32k!ValidateHwnd+0x5c (FPO: [Non-Fpo])
b9b3dd50 804de7ec 0001065e 012eb244 00000000 win32k!NtUserInvalidateRect+0x1d (FPO: [Non-Fpo])
b9b3dd50 7c90eb94 0001065e 012eb244 00000000 nt!KiFastCallEntry+0xf8 (FPO: [0,0] TrapFrame @ b9b3dd64)
0012eecc 77d4b601 0105a943 0001065e 012eb244 ntdll!KiFastSystemCallRet (FPO: [0,0,0])
77d4b605 001134b8 0300ba00 12ff7ffe 900008c2 0x90909090
The return address from where we made the bad call is bf801619. Let’s unassemble a few instructions backwards from that address (using the handy but often-overlooked “ub” command) and see what the code was doing:
kd> ub bf801619
win32k!ValidateHwnd+0x2f:
bf8015ff 8d1c82 lea ebx,[edx+eax*4]
bf801602 8bc1 mov eax,ecx
bf801604 c1e810 shr eax,0x10
bf801607 663b430a cmp ax,[ebx+0xa]
bf80160b 75ad jnz win32k!ValidateHwnd+0x3d (bf8015ba)
bf80160d 807b0801 cmp byte ptr [ebx+0x8],0x1
bf801611 7573 jnz win32k!ValidateHwnd+0xff (bf801686)
bf801613 ff15e0b298bf call dword ptr [win32k!_imp__PsGetCurrentThread (bf98b2e0)]
OK, so it's pretty simple. Win32k decided which function address to call by reading it from a pointer stored at a hardcoded location (in other words, in a global variable) within Win32k itself. That pointer is located at bf98b2e0. The debugger helpfully tells us that this pointer is intended to contain the address of a function called PsGetCurrentThread, but let’s double-check this and make sure it actually does. At this point, the working assumption would be that this pointer had gotten corrupted somehow.
kd> dd bf98b2e0 L1
bf98b2e0 804e4a15
Interesting. bf98b2e0 contains the value 804e4a15. This is not even close to the bad address the processor actually called! Remember, from the stack trace, we jumped to 8738e300 instead. What does this pointer actually point to?
kd> ln 804e4a15
(804e4a15) nt!PsGetCurrentThread | (804e4a51) nt!KeSetEventBoostPriority
Exact matches:
nt!PsGetCurrentThread = <no type information>
Sure enough, 804e4a15 is the address of nt!PsGetCurrentThread.
So, basically, win32k was trying to call PsGetCurrentThread(), and even had the correct pointer in memory to get there, but the processor instead jumped to a bogus address located in the middle of some user data.
At this point it's pretty safe to say that this is a hardware problem, and furthermore, since the relevant memory looks fine, it looks like a problem within the CPU itself. (I’m not an expert on CPU architecture, but if I were to take a wild guess I’d say that maybe the processor had some problem when reading from its L2 cache.)
Upon discovering this, I decided to crack open my case and take a look. It didn’t take long to spot the problem:
Yes, that's my CPU fan. Notice that the heatsink is completely clogged up with dust!
The resolution to this story: I took a can of compressed air and blew all the dust out of the heatsink. This took care of the problem, and the computer happily scanned hundreds of photos and hasn’t crashed again since. Why was it only crashing when I was scanning photos? Most likely because this was causing the CPU to run at 100% utilization for extended periods of time, and it was simply overheating!
I recently was engaged on an issue where a server was depleting NonPagedPool over a period of a few days. Ordinarily, we would just use a tool like PoolMon to identify the offending pool tag and then find the driver that uses that pool tag using the method in this article.
However, what made this case interesting was the pool tag and that we were unable to identify the driver using the normal methodology. You’ll see what I mean in a moment. The engineer supplied me with a kernel dump of the server while it was in-state and this is what I found.
Let’s start by taking a look at the virtual memory usage:
2: kd> !vm
*** Virtual Memory Usage ***
Physical Memory: 851420 ( 3405680 Kb)
Page File: \??\C:\pagefile.sys
Current: 3584000 Kb Free Space: 3568552 Kb
Minimum: 3584000 Kb Maximum: 3584000 Kb
Available Pages: 573277 ( 2293108 Kb)
ResAvail Pages: 800628 ( 3202512 Kb)
Locked IO Pages: 1067 ( 4268 Kb)
Free System PTEs: 25102 ( 100408 Kb)
Free NP PTEs: 335 ( 1340 Kb)
Free Special NP: 0 ( 0 Kb)
Modified Pages: 22 ( 88 Kb)
Modified PF Pages: 22 ( 88 Kb)
NonPagedPool Usage: 31369 ( 125476 Kb) ß Very high
NonPagedPool Max: 31986 ( 127944 Kb)
********** Excessive NonPaged Pool Usage *****
PagedPool 0 Usage: 19071 ( 76284 Kb)
PagedPool 1 Usage: 735 ( 2940 Kb)
PagedPool 2 Usage: 747 ( 2988 Kb)
PagedPool 3 Usage: 720 ( 2880 Kb)
PagedPool 4 Usage: 746 ( 2984 Kb)
PagedPool Usage: 22019 ( 88076 Kb)
PagedPool Maximum: 38912 ( 155648 Kb)
********** 3 pool allocations have failed **********
So we can see that NPP usage is very high given the server is using the /3GB switch which limits NPP to 128MB by default. We need to identify what pool tag is associated with the high NPP usage:
2: kd> !poolused /t2 2
Sorting by NonPaged Pool Consumed
Pool Used:
NonPaged Paged
Tag Allocs Used Allocs Used
None 246479 50827424 0 0 call to ExAllocatePool
MmCm 1198 18462512 0 0 Calls made to MmAllocateContiguousMemory , Binary: nt!mm
Interesting, so the offending tag is “None”. This means that these allocations were made by calling the function ExAllocatePool instead of ExAllocatePoolWithTag. ExAllocatePool is obsolete and should no longer be used.
Now, I need to find out which driver is calling this function. First, I need to know where ExAllocatePool lives:
2: kd> x nt!ExAllocatePool
e0894d1f nt!ExAllocatePool
Next, I need to search all the drivers to see which one is importing this function:
2: kd> !for_each_module s-d @#Base @#End e0894d1f
f50b8058 e0894d1f e0828e04 e089b708 e084011b .M..............
Hmm, looks suspiciously like an import table, let’s see:
2: kd> dps f50b8058
f50b8058 e0894d1f nt!ExAllocatePool
f50b805c e0828e04 nt!_wcsnicmp
f50b8060 e089b708 nt!ExFreePoolWithTag
f50b8064 e083e30a nt!KeInitializeEvent
<SNIP>
Yep, that’s an import table. You can also verify that this is the import table of a particular module by checking the header (!dh on the module’s base address and look for “Import Address Table Directory”).
As you can see, we have only one driver that imports ExAllocatePool. Let’s see which driver this is:
2: kd> !lmi f50b8058
Loaded Module Info: [f50b8058]
Module: {XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX}
Base Address: f50b3000
Image Name: {XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX}.sys
I’ve removed the incriminating identifiers from module information displayed above to protect the guilty. It is interesting to note that the driver name was a GUID and that this driver did not exist on the disk. This was because the driver is dynamically created when its parent program loads.
The software package was removed and the server was happy again.
- David
Microsoft is looking for five Windows Internals subject matter experts to come work on a very special five to eight day project on the Redmond campus during the month of May 2008. Candidates must have good communications skills, be non Microsoft employees, have 5+ years experience with windows, be familiar with the Windows Internals book, have kernel and user mode debugging experience, and be C literate.
We prefer candidates strongly represent the Windows IT Professional or Windows development industry. These typically would be 2nd or 3rd tier escalation resources or Windows developers (developing Win32 code, or device drivers) that are working for Gold Partners, ISVs or are independent consultants or MVPs.
A Microsoft NDA (Non Disclosure Agreement) will be required.
Please contact Jeff Dailey at jeffda@microsoft.com with the subject line “Windows Internals Project”
Please include a copy of your resume and details explaining your Windows background, along with your contact info, and your location.
We will cover air travel, lodging and food for non local candidates attending from within the US.
Thank you.
More Info:
The Widows Internals experts selected for this project will be brought to Redmond to aid in developing a new certification program based on the Windows Internals. This is not a hiring effort or product testing session. Though the candidates should have C literacy they will not be writing or reviewing any Microsoft source code.
This is going to be a short blog post, but considering the amount of feedback we’ve received on the our two previous desktop heap posts, I think this is worth blogging about. 32-bit Vista SP1 and 32-bit Windows Server 2008 both have a new value for the default size of interactive desktop heaps. Previously, this heap size was 3 MB by default, but on Vista SP1 and Server 2008 the default interactive heap size is 12 MB. We’ve heard your comments about the problems associated with the heap size, and this new default size should go a long way towards avoiding interactive desktop heap depletion.
On 32-bit systems, running Vista SP1 or Windows Server 2008, you’ll now see that the relevant portion of the registry value looks like this (by default)...
HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\Session Manager\SubSystems\Windows
SharedSection=1024,12288,512
64-bit Windows still has a default interactive desktop heap size of 20MB. Keep in mind that in Vista, session view space (the memory range that desktop heaps are allocated from) is now a dynamic address range. This means that in Vista and Server 2008 the individual desktop heap sizes can be increased to larger sizes with less concern over exhausting session view space.
- Matthew Justice
This is a follow-up on the LPC hang blog. The same hang troubleshooting techniques apply to this, but when a named pipe is involved you’ll have to use a slightly different method to following the chain from a client application to the server application. For the purpose of this exercise I’ll use the named pipe server (http://msdn2.microsoft.com/en-us/library/aa365588.aspx) and client (http://msdn2.microsoft.com/en-us/library/aa365592.aspx) given by MSDN.
Here is a diagram of our scenario.
Here we have the client application that is waiting on a read operation to a named pipe. We need to determine what process will be putting data into the named pipe which will let the client application move forward.
Here is the thread in the client application.
THREAD 81e70858 Cid 06bc.06c0 Teb: 7ffdf000 Win32Thread: 00000000 WAIT: (Executive) UserMode Non-Alertable
81baaf7c NotificationEvent
Not impersonating
DeviceMap e25176b8
Owning Process 81b3a020 Image: client.exe
Attached Process N/A Image: N/A
Wait Start TickCount 73194 Ticks: 1599 (0:00:00:24.984)
Context Switch Count 21
UserTime 00:00:00.000
KernelTime 00:00:00.000
Win32 Start Address client (0x00401376)
Start Address kernel32!BaseProcessStartThunk (0x77e617f8)
Stack Init f7450000 Current f744fc04 Base f7450000 Limit f744d000 Call 0
Priority 10 BasePriority 8 PriorityDecrement 2
f744fc1c 808202b6 81e70858 81e70900 00000700 nt!KiSwapContext+0x25 (FPO: [Uses EBP] [0,0,4])
f744fc34 8081fb6e 8207dac0 00000000 81baaf20 nt!KiSwapThread+0x83 (FPO: [Non-Fpo])
f744fc78 8092deb3 81baaf7c 00000000 81f93601 nt!KeWaitForSingleObject+0x2e0 (FPO: [Non-Fpo])
f744fca0 80924ca9 81f936e0 00000103 81baaf20 nt!IopSynchronousServiceTail+0x180 (FPO: [Non-Fpo])
f744fd38 8082350b 00000024 00000000 00000000 nt!NtReadFile+0x5d5 (FPO: [Non-Fpo])
f744fd38 7c8285ec 00000024 00000000 00000000 nt!KiFastCallEntry+0xf8 (FPO: [0,0] TrapFrame @ f744fd64)
0012eed4 7c82776b 77e418b2 00000024 00000000 ntdll!KiFastSystemCallRet (FPO: [0,0,0])
0012eed8 77e418b2 00000024 00000000 00000000 ntdll!NtReadFile+0xc (FPO: [9,0,0])
0012ef40 004010ab 00000024 0012ef60 00001000 kernel32!ReadFile+0x16c (FPO: [Non-Fpo])
0012ff78 0040131f 00000001 00323038 00323078 client+0x10ab
0012ffc0 77e6f23b 00000000 00000000 7ffd8000 client+0x131f
0012fff0 00000000 00401376 00000000 78746341 kernel32!BaseProcessStart+0x23 (FPO: [Non-Fpo])
We have highlighted the first parameter to the ReadFile function. This parameter is the handle to the file we are trying to read.
Knowing this we can use the !handle extension to display details on this handle. The !handle extension takes three parameters: 1. The handle to get details on, 2: a flag controlling the level of verbosity the extension will display (f being the most verbose level), and 3: the process address.
kd> !handle 00000024 f 81b3a020
processor number 0, process 81b3a020
PROCESS 81b3a020 SessionId: 0 Cid: 06bc Peb: 7ffd8000 ParentCid: 0f2c
DirBase: 1d486000 ObjectTable: e252bbc8 HandleCount: 10.
Image: client.exe
Handle table at e2845000 with 10 Entries in use
0024: Object: 81baaf20 GrantedAccess: 0012019f Entry: e2845048
Object: 81baaf20 Type: (823ceca0) File
ObjectHeader: 81baaf08 (old version)
HandleCount: 1 PointerCount: 3
Directory Object: 00000000 Name: \mynamedpipe {NamedPipe}
We have highlighted the object address (yellow) and the type of the object (green) above. The object is of type file which is defined by the FILE_OBJECT structure. So we can examine that using the dt (Display Type) command. The dt command can take 3 parameters: 1. The structure you want to cast the data as, 2: the address of the object, and 3: the field in the structure to be displayed.
kd> dt nt!_FILE_OBJECT 81baaf20 FsContext2
+0x010 FsContext2 : 0x81b5cc90
The FsContext2 field points to an NPFS CCB structure. The importance of this structure is that it is charged to the server process which created it. The !pool extension can be used to display information about the ownership of the memory the file object is stored in.
kd> !pool 0x81b5cc90 2
Pool page 81b5cc90 region is Nonpaged pool
*81b5cc88 size: 50 previous size: 140 (Allocated) *NpFc Process: 81be0d88
Pooltag NpFc : CCB, client control block, Binary : npfs.sys
PROCESS 81be0d88 SessionId: 0 Cid: 07b4 Peb: 7ffd4000 ParentCid: 0160
DirBase: 02b6f000 ObjectTable: e234a530 HandleCount: 10.
Image: server.exe
Now you have the established the link from the client application and the server application. Now you will be able to move forward in your debugging and figure out why the server process isn’t writing data back to the client.
- Bryan
UPDATE: This is why we *LOVE* our readers! After posting a 32-bit disassembly of 16-bit MBR code, I was gently informed of my misstep by three sharp-eyed readers. I’ve corrected the original post below. Thanks go out to Ramon Sola, Jeroen Frijters and 'pedantic gnome' for keeping us honest. Thanks guys!
Hi Everyone,
One of our readers, Pete, asked a very interesting question about disassembling the MBR and I'd like to take a moment to show you the quick & dirty way to do this.
First, dump the MBR to a file using a low-level sector editor such as the Microsoft Resource Kit utility DiskProbe. Once you've saved the file, launch a program (like Notepad) and attach to it using Windbg (the Windows Debugger).
Next, find a valid, but unoccupied range of memory (the default heap is a good candidate). The memory range needs to be at least 512 (0x200) bytes. For example, here's where I found mine:
000120a0 00000000 00000000 00000000 00000000
000120b0 00000000 00000000 00000000 00000000
000120c0 00000000 00000000 00000000 00000000
000120d0 00000000 00000000 00000000 00000000
000120e0 00000000 00000000 00000000 00000000
000120f0 00000000 00000000 00000000 00000000
00012100 00000000 00000000 00000000 00000000
00012110 00000000 00000000 00000000 00000000
00012120 00000000 00000000 00000000 00000000
<snip>...
Once you've found a good memory range in your process, read the file contents into that memory location using the following command:
0:001> .readmem c:\<<path>>\sector00.bin 120a0 120a0+0x1ff
Tada! You now have the MBR in memory and it is fully examinable just like any other assembly code.
0:001> db 000120a0
000120a0 33 c0 8e d0 bc 00 7c 8e-c0 8e d8 be 00 7c bf 00 3.....|......|..
000120b0 06 b9 00 02 fc f3 a4 50-68 1c 06 cb fb b9 04 00 .......Ph.......
000120c0 bd be 07 80 7e 00 00 7c-0b 0f 85 10 01 83 c5 10 ....~..|........
000120d0 e2 f1 cd 18 88 56 00 55-c6 46 11 05 c6 46 10 00 .....V.U.F...F..
000120e0 b4 41 bb aa 55 cd 13 5d-72 0f 81 fb 55 aa 75 09 .A..U..]r...U.u.
000120f0 f7 c1 01 00 74 03 fe 46-10 66 60 80 7e 10 00 74 ....t..F.f`.~..t
00012100 26 66 68 00 00 00 00 66-ff 76 08 68 00 00 68 00 &fh....f.v.h..h.
00012110 7c 68 01 00 68 10 00 b4-42 8a 56 00 8b f4 cd 13 |h..h...B.V.....
Since this is 16-bit assembly, you can unassemble using ‘ur’:
0:001> ur 120a0000120a0 33c0 xor ax,ax000120a2 8ed0 mov ss,ax000120a4 bc007c mov sp,7C00h000120a7 8ec0 mov es,ax000120a9 8ed8 mov ds,ax000120ab be007c mov si,7C00h000120ae bf0006 mov di,600h000120b1 b90002 mov cx,200h<snip>...
Have fun Pete!
DiskProbe: http://technet.microsoft.com/en-us/library/bb457122.aspx
WinDbg: http://www.microsoft.com/whdc/devtools/debugging/default.mspx