Debugging a bluescreen at home

Debugging a bluescreen at home

  • Comments 15

 

Hi, my name is Chad. I work as an escalation engineer for Microsoft’s OEM support team.

 

A while back, I encountered an interesting crash on one of my computers at home, and I thought I’d post about how I debugged it.

 

This particular machine had been humming along quite happily for some time, but one day while I was scanning some photos, it bluescreened. Naturally, I hoped it was just a fluke, but after it happened a few more times while doing the same thing, I decided to debug it.

 

Ordinarily, if a machine crashes when performing a specific activity, like scanning photos, my first inclination would be to suspect a bug in one of the drivers involved in that activity, like the scanner driver or the USB driver. But in this case, I had been using this scanner for a long time, with the same drivers, and never had this problem, so this sudden crashing was kind of mysterious.

 

Let's see what we can tell from the dump!

 

The first order of business when looking at a crash dump is the !analyze -v command. I've trimmed some of it here for brevity, but it goes something like this:

 

kd> !analyze -v

*******************************************************************************

*                                                                             *

*                        Bugcheck Analysis                                    *

*                                                                             *

*******************************************************************************

 

KERNEL_MODE_EXCEPTION_NOT_HANDLED (8e)

This is a very common bugcheck.  Usually the exception address pinpoints

the driver/function that caused the problem.  Always note this address

as well as the link date of the driver/image that contains this address.

Some common problems are exception code 0x80000003.  This means a hard

coded breakpoint or assertion was hit, but this system was booted

/NODEBUG.  This is not supposed to happen as developers should never have

hardcoded breakpoints in retail code, but ...

If this happens, make sure a debugger gets connected, and the

system is booted /DEBUG.  This will let us see why this breakpoint is

happening.

Arguments:

Arg1: c0000005, The exception code that was not handled

Arg2: 8738e300, The address that the exception occurred at

Arg3: b9b3dc7c, Trap Frame

Arg4: 00000000

 

Debugging Details:

------------------

 

 

EXCEPTION_CODE: (NTSTATUS) 0xc0000005 - The instruction at "0x%08lx" referenced memory at "0x%08lx". The memory could not be "%s".

 

FAULTING_IP:

+ffffffff8738e300

8738e300 0000             add     [eax],al

 

TRAP_FRAME:  b9b3dc7c -- (.trap ffffffffb9b3dc7c)

ErrCode = 00000002

eax=00000001 ebx=bc514c68 ecx=0001065e edx=bc510000 esi=00000955 edi=b9b3dd64

eip=8738e300 esp=b9b3dcf0 ebp=b9b3dd08 iopl=0         nv up ei pl zr na po nc

cs=0008  ss=0010  ds=0023  es=0023  fs=0030  gs=0000             efl=00010246

8738e300 0000             add     [eax],al                ds:0023:00000001=??

Resetting default scope

 

DEFAULT_BUCKET_ID:  DRIVER_FAULT

 

BUGCHECK_STR:  0x8E

 

LAST_CONTROL_TRANSFER:  from 8051d6a7 to 8053331e

 

STACK_TEXT: 

WARNING: Frame IP not in any known module. Following frames may be wrong.

b9b3dcec bf801619 b9b3dd64 0012eee4 bf81551f 0x8738e300

b9b3dd08 bf81553c b9b3dd64 0012eee4 bf81551f win32k!ValidateHwnd+0x5c

b9b3dd50 804de7ec 0001065e 012eb244 00000000 win32k!NtUserInvalidateRect+0x1d

b9b3dd50 7c90eb94 0001065e 012eb244 00000000 nt!KiFastCallEntry+0xf8

0012eecc 77d4b601 0105a943 0001065e 012eb244 ntdll!KiFastSystemCallRet

0012eed0 0105a943 0001065e 012eb244 00000000 USER32!NtUserInvalidateRect+0xc

77d4b601 90909090 001134b8 0300ba00 12ff7ffe NikonScan4!GetSource+0x21e93

...

 

From the stack trace, we can tell that NikonScan4.dll made a call into User32, which ultimately ended up calling into win32k.sys down in kernel mode. Win32k was in a function called ValidateHwnd() and then tried to call some function at address 0x8738e300, at which point we tried to dereference an invalid pointer and crashed.

 

What’s at 0x8738e300?

 

kd> dc 8738e300

8738e300  00000000 00000001 00000000 87360350  ............P.6.

8738e310  00000000 00000001 f71af9fe f71b0030  ............0...

8738e320  f71afb0e f71afbb4 f71b0098 f71b0214  ................

8738e330  f71afef6 f71aff8e 07fef800 00000000  ................

8738e340  f71afffc 00000000 0a0e000a 644c6d4d  ............MmLd

8738e350  8732ea58 870303e0 ffffffff ffffffff  X.2.............

8738e360  00000012 00000000 f797f000 f7989905  ................

8738e370  0000c000 00500050 e1971458 00160016  ....P.P.X.......

 

kd> !pool 8738e300 2

Pool page 8738e300 region is Nonpaged pool

*8738e2f8 size:   50 previous size:    8  (Allocated) *NV 

              Owning component : Unknown (update pooltag.txt)

 

Well, that’s bad. 0x8738e300 isn’t actually a valid address of a function. That location contains some user data (specifically, some nonpaged pool).

 

So, that's why we blew up: the ValidateHwnd() function in win32k made a call to this bad address which contains data instead of code! Let's see if we can figure out why it did this. We can find the return address in win32k!ValidateHwnd on the stack, and unassemble the instructions leading up to the point where we ran off into the weeds.

 

kd> kv L8

ChildEBP RetAddr  Args to Child             

WARNING: Frame IP not in any known module. Following frames may be wrong.

b9b3dcec bf801619 b9b3dd64 0012eee4 bf81551f 0x8738e300

b9b3dd08 bf81553c b9b3dd64 0012eee4 bf81551f win32k!ValidateHwnd+0x5c (FPO: [Non-Fpo])

b9b3dd50 804de7ec 0001065e 012eb244 00000000 win32k!NtUserInvalidateRect+0x1d (FPO: [Non-Fpo])

b9b3dd50 7c90eb94 0001065e 012eb244 00000000 nt!KiFastCallEntry+0xf8 (FPO: [0,0] TrapFrame @ b9b3dd64)

0012eecc 77d4b601 0105a943 0001065e 012eb244 ntdll!KiFastSystemCallRet (FPO: [0,0,0])

0012eed0 0105a943 0001065e 012eb244 00000000 USER32!NtUserInvalidateRect+0xc

77d4b601 90909090 001134b8 0300ba00 12ff7ffe NikonScan4!GetSource+0x21e93

77d4b605 001134b8 0300ba00 12ff7ffe 900008c2 0x90909090

 

The return address from where we made the bad call is bf801619. Let’s unassemble a few instructions backwards from that address (using the handy but often-overlooked “ub” command) and see what the code was doing:

 

kd> ub bf801619

win32k!ValidateHwnd+0x2f:

bf8015ff 8d1c82           lea     ebx,[edx+eax*4]

bf801602 8bc1             mov     eax,ecx

bf801604 c1e810           shr     eax,0x10

bf801607 663b430a         cmp     ax,[ebx+0xa]

bf80160b 75ad             jnz     win32k!ValidateHwnd+0x3d (bf8015ba)

bf80160d 807b0801         cmp     byte ptr [ebx+0x8],0x1

bf801611 7573             jnz     win32k!ValidateHwnd+0xff (bf801686)

bf801613 ff15e0b298bf call dword ptr [win32k!_imp__PsGetCurrentThread (bf98b2e0)]

 

OK, so it's pretty simple. Win32k decided which function address to call by reading it from a pointer stored at a hardcoded location (in other words, in a global variable) within Win32k itself. That pointer is located at bf98b2e0. The debugger helpfully tells us that this pointer is intended to contain the address of a function called PsGetCurrentThread, but let’s double-check this and make sure it actually does. At this point, the working assumption would be that this pointer had gotten corrupted somehow.

 

kd> dd bf98b2e0 L1

bf98b2e0  804e4a15

 

Interesting. bf98b2e0 contains the value 804e4a15. This is not even close to the bad address the processor actually called! Remember, from the stack trace, we jumped to 8738e300 instead. What does this pointer actually point to?

 

kd> ln 804e4a15

(804e4a15)   nt!PsGetCurrentThread   |  (804e4a51)   nt!KeSetEventBoostPriority

Exact matches:

    nt!PsGetCurrentThread = <no type information>

 

Sure enough, 804e4a15 is the address of nt!PsGetCurrentThread.

 

So, basically, win32k was trying to call PsGetCurrentThread(), and even had the correct pointer in memory to get there, but the processor instead jumped to a bogus address located in the middle of some user data.

 

At this point it's pretty safe to say that this is a hardware problem, and furthermore, since the relevant memory looks fine, it looks like a problem within the CPU itself. (I’m not an expert on CPU architecture, but if I were to take a wild guess I’d say that maybe the processor had some problem when reading from its L2 cache.)

 

Upon discovering this, I decided to crack open my case and take a look. It didn’t take long to spot the problem:

image

Yes, that's my CPU fan. Notice that the heatsink is completely clogged up with dust!

 

The resolution to this story: I took a can of compressed air and blew all the dust out of the heatsink. This took care of the problem, and the computer happily scanned hundreds of photos and hasn’t crashed again since. Why was it only crashing when I was scanning photos? Most likely because this was causing the CPU to run at 100% utilization for extended periods of time, and it was simply overheating!

 

Leave a Comment
  • Please add 6 and 8 and type the answer here:
  • Post
  • Chad,

    Thanks for the great post! I enjoy reading walkthroughs of a bugcheck analysis like this - has me trying to think "what's the next step" each step of the way...

  • Nice post.  Thanks for sharing your analysis...

  • Good write up Chad, had a similar problem a few years back with constant bluescreens as soon as the cpu started doing some heavy work... turns out forgot to add the heatsink thermal gel to spread the heat from the cpu...

    Funny how we always think it's always a software problem!

  • I just wanted to say THANK YOU for writing articles like this. I often wonder how to analyze crash output to better my understanding of Windows OS troubleshooting.

    -Steve

  • Oh also if possible could you include a link to your dump files so we can download, import and !analyze like you do? Using your post and comments as a guide to the thought process.

    Thanks again!

    [That would be nice, but remember that memory dumps contain the entire contents of memory at the time they are created. As such, they always have potential to contain sensitive personal information. I wouldn’t ever feel comfortable posting a memory dump from my own personal machine (or any machine in a real production environment) on the web for the general public to see. Sorry!]
  • In the crash dump, there's a reference to nikonscan:

    NikonScan4!GetSource+0x21e93

    Where does the GetSource symbol come from?  I have never seen a third-party driver in the wild with symbols included.

  • Chad,

    Excellent article and writeup, and thanks for sharing this information and experience.  While the !analyze -v in this example could have easily led you astray in the early stages of analysis by the bucket ID being "Driver_Fault" is it not often when you see this string in a debug output "EXCEPTION_CODE: (NTSTATUS) 0xc0000005 - The instruction at "0x%08lx" referenced memory at "0x%08lx". The memory could not be "%s"." a clue that you should verify memory, hardware, etc. is not the culprit.  I know bad drivers can cause bit flips, corrupt memory, etc. and things like that sometimes, but many of the Hardware related bugchecks will typically have this in them from my experience.   I am open to your thoughts and comments regarding this.  Anyways I can see how this could have been a very tricky one to pinpoint, so good job of taking a perhaps nebulious bugcheck to root cause.  With many PCs trying to get more horsepower out of a box with smaller space, this is a great thing to keep in mind as cooling technology designs can be challenging, and dust buildup can negatively impact the capabilities of the cooling system.

  • Hi, i'm learning a lot with your blog. I have a question, what does the 0x90909090 that is at the bottom of the stack trace mean? I suppose it's an address that the debugger couldn't match with any loaded symbol, but why?

    Thanks,

    [The debugger doesn’t have any symbols for Nikonscan4.dll. When it doesn’t have symbols for a module, it will try to keep following the stack anyway, but it’s not always capable of doing so. In this case, it got confused (possibly because this function was compiled using Frame-Pointer Omission) and picked the wrong EBP for the next stack frame, leading to bad results.

    In a case like this, if you can’t get hold of the symbols for the module, you can dump out the contents of the stack manually by using a command like “dps esp” and try to walk further back through the stack on your own.]
  • @ David Moisan:

    GetSource is probably an exported function from NikonScan4.sys; the symbol engine probably resolved to the nearest export, given the (likely) unavailability of a PDB, and the large offset (0x21e93).

    [Yes, this is correct.]
  • Ha transcurrido un mes y medio desde mi último post , ya me vale. Hace unos días, en el blog de solución

  • .NET:Improving.NetPerformanceWithNgen:TheNativeImageGeneratorDefiningRoutesusingRegul...

  • Mike: I have a question, what does the 0x90909090 that is at the bottom of the stack trace mean? I suppose it's an address that the debugger couldn't match with any loaded symbol, but why?

    0x90 is the x86 NOP (no-operation) opcode; could be the top frame was a region of NOPs.

  • I had a similar problem with my video cards overheating. Dust and lint were cause.

  • Hello NTDebuggers, we have been very impressed with the responses we’ve gotten to our previous puzzlers

  • Indeed a beautiful post. A great learning especially those who are starters.

    Good Work shown Chad.. Hoping to see more from you.

Page 1 of 1 (15 items)