Hello all; my name is Scott Olson and I work as an Escalation Engineer for the Microsoft Global Escalation Services team in Platforms support. I wanted to share an interesting problem that came up recently. A co-worker was running Windows Vista Ultimate x64 on their home machine and ran into a problem where the system would get random bugchecks after upgrading the RAM from 2GB to 4GB. Any combination of the RAM totaling 2GB was fine; however, with 4GB of RAM installed the system would bugcheck within 10 minutes of booting. Once I heard about this I wanted to look at the memory dump in the kernel debugger.
Here is what I found:
The system got the following bugcheck:
0: kd> .bugcheck
Bugcheck code 000000D1
Arguments fffff800`03a192d0 00000000`00000002 00000000`00000000 fffff980`064aa8b6
Tip: The help file included with the Debugging Tools For Windows contains a Bug Check Code Reference that includes details on how to parse the Bug Check code and its arguments. See: Help > Debugging Techniques > Bug Checks (Blue Screens) > Bug Check Code Reference
!analyze -v provides the following information for this bugcheck:
DRIVER_IRQL_NOT_LESS_OR_EQUAL (d1)
An attempt was made to access a pageable (or completely invalid) address at an interrupt request level (IRQL) that is too high. This is usually caused by drivers using improper addresses.
If kernel debugger is available get stack backtrace.
Arguments:
Arg1: fffff80003a192d0, memory referenced
Arg2: 0000000000000002, IRQL
Arg3: 0000000000000000, value 0 = read operation, 1 = write operation
Arg4: fffff980064aa8b6, address which referenced memory
Debugging Details:
------------------
READ_ADDRESS: fffff80003a192d0
CURRENT_IRQL: 2
So with this data I can say that the system took a page fault on a read operation trying to reference the memory at fffff80003a192d0 at DISPATCH_LEVEL (IRQL 2). OK, so let's get the trap frame so we can get into the context of the system when the crash happened:
0: kd> kv 3
Child-SP RetAddr : Args to Child : Call Site
fffff800`03218f28 fffff800`0204da33 : 00000000`0000000a fffff800`03a192d0 00000000`00000002 00000000`00000000 : nt!KeBugCheckEx
fffff800`03218f30 fffff800`0204c90b : 00000000`00000000 fffffa80`0a3c6cf0 00000000`00000000 00000000`00000000 : nt!KiBugCheckDispatch+0x73
fffff800`03219070 fffff980`064aa8b6 : 00000000`00000002 00000000`00000000 00000000`000005e0 fffff800`03219220 : nt!KiPageFault+0x20b (TrapFrame @ fffff800`03219070)
Here is the trap frame. It looks like the system crashed while trying to reference memory at an offset from the stack pointer, rsp+0xD0:
0: kd> .trap fffff800`03219070
NOTE: The trap frame does not contain all registers.
Some register values may be zeroed or incorrect.
rax=0000000000000000 rbx=0000000000000010 rcx=0000000000000011
rdx=0000000000000002 rsi=0000000000000000 rdi=0000000000000001
rip=fffff980064aa8b6 rsp=fffff80003219200 rbp=00000000000071d6
r8=fffff80003219280 r9=00000000000071d6 r10=0000000000000000
r11=0000000000000000 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0 nv up ei pl zr na po nc
tcpip!InetInspectReceiveDatagram+0xf6:
fffff980`064aa8b6 440fb78c24d0000000 movzx r9d,word ptr [rsp+0D0h] ss:0018:fffff800`032192d0=8c13
As you can see above, fffff800`032192d0 looks like valid memory and shouldn't normally cause a page fault on a read operation. At this point, I want to make sure the system did what it was told, so I want to know what happened when the system trapped. To verify the faulting address, I dumped the CR2 register to see what address was referenced when the page fault happened; this is also the first parameter in the bugcheck code for a stop 0xd1.
0: kd> r cr2
cr2=fffff80003a192d0
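The same check can be done outside the debugger. Here is a minimal Python sketch, assuming the rsp value and the 0xD0 displacement from the trap frame and the CR2 value shown above. The function and variable names are illustrative only and are not part of any debugger API.

```python
# Values copied from the debugger output above (names are illustrative).
RSP = 0xFFFFF80003219200        # rsp from the .trap output
DISPLACEMENT = 0xD0             # from: movzx r9d,word ptr [rsp+0D0h]
CR2 = 0xFFFFF80003A192D0        # faulting address reported by "r cr2"

def effective_address(base, displacement):
    """Address the instruction was asked to read, wrapped to 64 bits."""
    return (base + displacement) & 0xFFFFFFFFFFFFFFFF

expected = effective_address(RSP, DISPLACEMENT)
print(f"expected: {expected:016x}")
print(f"cr2:      {CR2:016x}")
print("match" if expected == CR2 else "mismatch")
```

If the hardware had done what the instruction asked, the two values would be identical.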
Looking at this address, it is clear that it does not exactly match the trap frame, so let's look at how these addresses differ. Here are the address from the trap frame and the page fault address converted into varying formats (focus on the binary):
0: kd> .formats fffff800`032192d0
Evaluate expression:
Hex: fffff800`032192d0
Decimal: -8796040490288
Octal: 1777777600000310311320
Binary: 11111111 11111111 11111000 00000000 00000011 00100001 10010010 11010000
Chars: .....!..
Time: ***** Invalid FILETIME
Float: low 4.74822e-037 high -1.#QNAN
Double: -1.#QNAN
0: kd> .formats fffff800`03a192d0
Evaluate expression:
Hex: fffff800`03a192d0
Decimal: -8796032101680
Octal: 1777777600000350311320
Binary: 11111111 11111111 11111000 00000000 00000011 10100001 10010010 11010000
Chars: ........
Time: ***** Invalid FILETIME
Float: low 9.49644e-037 high -1.#QNAN
Double: -1.#QNAN
Notice that there is a one-bit difference between these two addresses:
11111111 11111111 11111000 00000000 00000011 00100001 10010010 11010000
11111111 11111111 11111000 00000000 00000011 10100001 10010010 11010000
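Rather than comparing the binary strings by eye, XOR the two addresses and only the differing bits remain set. Here is a small Python sketch using the two addresses above. The helper name differing_bits is made up for this example.

```python
def differing_bits(a, b):
    """Return the positions (0 = least significant) of the bits that differ."""
    diff = a ^ b
    return [i for i in range(diff.bit_length()) if (diff >> i) & 1]

requested = 0xFFFFF800032192D0   # rsp+0xD0 from the trap frame
faulted = 0xFFFFF80003A192D0     # CR2, also bugcheck argument 1

print(hex(requested ^ faulted))              # 0x800000
print(differing_bits(requested, faulted))    # [23]
```

A single set bit in the XOR result (here bit 23) is the signature of a bit flip; a software bug such as a bad pointer would usually produce a less tidy difference.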
Since the software asked the system to do one thing and it did something different, this is clearly some type of hardware problem (most likely with the processor). I reported this back to the co-worker, who contacted the hardware vendor. This must have been a common problem with this vendor, because I found out later that they replied within 10 minutes with a recommendation to change the memory voltage in the BIOS. The memory voltage was set to Auto, which is the default. They recommended changing it from 1.85 volts to 2.1 volts. After making the change, the system was stable with 4GB of RAM.
Very interesting. I have an issue with my wife's PC and would like to try this fix on it.
Very nice article. Good thing that hardware issues can also be diagnosed with a memory test like the Vista integrated or memtest86+. Good news for the non-kernel debugger enlightened users like me :)
Strange way of approaching the troubleshooting.
I would have booted into DOS and used goldmem or memtest86+ to check for memory errors -- that would catch the bit flip without any of the above brainstorming over a crash dump being required.
Furthermore, you usually do not add RAM sticks with different timings, bank sizes, or voltage requirements to the system -- if you can't find the same RAM you pull the old one out.
2.1V is probably not necessary unless it is higher-clocked / high-performance DDR2 memory. Better check the specification: RAM can overheat and the BIOS can enable thermal throttling, so the system might run slower with more voltage than necessary.
Finally, if they installed 4 RAM sticks instead of 2, then the "solution" of raising the voltage might have to do not with the RAM but with a crappy mainboard and northbridge that aren't able to supply enough "juice" to drive all 4 memory modules.
In any case, I would try running goldmem and memtest86+ at 1.9v, 1.95v, etc until I find the minimum voltage at which the system is stable.
memtest86 should be able to diagnose such a problem, without digging into windows.
This is exactly the kind of walk-through that teaches tips and techniques everyone computer literate and in charge of designing, building, and supporting systems should add to their tool belt. I want to say that this hasn't helped me resolve a problem right now, but adding skills is always a good thing. But there isn't a 'not yet' button.
When I press the shiny green button for "Did this blog post help you resolve a problem?", I get an error:
500 - Internal server error.
There is a problem with the resource you are looking for, and it cannot be displayed.
Keep these articles coming!
And yes, memory testing programs can and do find problems. But I've seen strange errors that testing programs don't find but operating systems, applications, and games do find. Ask me about the 6502 add bug sometime... (showing my gray hair)
The first thing you should assume about a blue screen is that it is caused by software (drivers), so opening the crash dump with a debugger is the correct first step. It's most likely going to be a waste of time if you use Memtest86 as the first step. If you see an obvious exception/instruction mismatch or a bit flip in the debugger, or if the bugcheck error code and/or the stack trace is different in each dump (especially if you get a random stack trace even with Verifier enabled), then use a hardware stress test program (like Memtest86).
[It's good to see our older articles are still generating interest.]