Hello NTDebuggers, we have been very impressed with the responses we’ve gotten to our previous puzzlers so far. We invited the Debug Ninja to come up with a real challenge for this week.
This server blue screened with a Bug Check 0xD1: DRIVER_IRQL_NOT_LESS_OR_EQUAL. The challenge this week is to tell us, as specifically as possible, why it crashed. We’ll give you a hint, the complete answer is not that ebx+5441F815h points to an invalid address.
1: kd> .bugcheck
Bugcheck code 000000D1
Arguments e074281d 00000002 00000001 ba502493
1: kd> kv
ChildEBP RetAddr Args to Child
f78b6544 ba502493 badb0d00 00000001 00000000 nt!_KiTrap0E+0x2a7 (FPO: [0,0] TrapFrame @ f78b6544)
f78b65e0 ba50d9d8 00000020 8c32cab8 00000022 tcpip!GetAddrType+0x19f (FPO: [Non-Fpo]) (CONV: stdcall)
f78b6694 ba50dc56 8c32cab8 8ca71c2c 000005c8 tcpip!IPRcvPacket+0x66c (FPO: [Non-Fpo]) (CONV: stdcall)
f78b66d4 ba50dd58 00000000 8ca63440 8ca71c0a tcpip!ARPRcvIndicationNew+0x149 (FPO: [Non-Fpo]) (CONV: stdcall)
f78b6710 bada5550 8c4b53b8 00000000 f78b678c tcpip!ARPRcvPacket+0x68 (FPO: [Non-Fpo]) (CONV: stdcall)
f78b6764 ba9c614b 8cac2ad0 f78b6784 00000002 NDIS!ethFilterDprIndicateReceivePacket+0x1d2 (FPO: [Non-Fpo]) (CONV: stdcall)
1: kd> .trap f78b6544
ErrCode = 00000002
eax=8c32cab8 ebx=8c323008 ecx=00000001 edx=00000001 esi=8ca71c18 edi=f78b6618
eip=ba502493 esp=f78b65b8 ebp=f78b65e0 iopl=0 nv up ei pl zr na pe nc
cs=0008 ss=0010 ds=0023 es=0023 fs=0030 gs=0000 efl=00250246
tcpip!GetAddrType+0x19f:
ba502493 008b15f84154 add byte ptr [ebx+5441F815h],cl ds:0023:e074281d=??
The remainder of this puzzler is quite long, click here to see the complete debug notes required to solve this puzzler. Post your responses to this page (and not the complete debug page), so we can keep all the comments together. Also, we won’t post any of your responses until Friday.
[Update, posted 4/30/2008]
Today is Wednesday and so far we haven’t received any complete responses; the partial answers we have received are close but no one has determined the root cause of the problem. When looking at problems like this it is often useful to ask “why?” when you think you have the answer and think about what other information may be available. In this situation it is possible to identify a specific problem and even determine appropriate action to fix the problem.
Perhaps we were a bit overzealous in posting 14 pages of assembly as a puzzler. Below is an abbreviated debug with less assembly, the link to the complete debug is still available for those of you who want to try the full challenge. Also, we are going to break our promise not to post responses until Friday and we have posted some responses that we have gotten so far, neither of these responses completely identifies the problem but they are getting close.
1: kd> dd @ebx+5441F815
e074281d ???????? ???????? ???????? ????????
e074282d ???????? ???????? ???????? ????????
e074283d ???????? ???????? ???????? ????????
e074284d ???????? ???????? ???????? ????????
e074285d ???????? ???????? ???????? ????????
e074286d ???????? ???????? ???????? ????????
e074287d ???????? ???????? ???????? ????????
e074288d ???????? ???????? ???????? ????????
1: kd> !pte e074281d
VA e074281d
PDE at 00000000C0603818 PTE at 00000000C0703A10
contains 000000021B980963 contains E154FC1000000400
pfn 21b980 -G-DA--KWEV not valid
Proto: 00000000E154FC10
1: kd> ub ba50d9d8
tcpip!IPRcvPacket+0x658:
ba50d9c4 51 push ecx
ba50d9c5 50 push eax
ba50d9c6 53 push ebx
ba50d9c7 ff75f4 push dword ptr [ebp-0Ch]
ba50d9ca ff75f8 push dword ptr [ebp-8]
ba50d9cd ff75ec push dword ptr [ebp-14h]
ba50d9d0 ff7508 push dword ptr [ebp+8]
ba50d9d3 e84a040000 call tcpip!DeliverToUser (ba50de22)
1: kd> uf tcpip!DeliverToUser
tcpip!DeliverToUser:
…
tcpip!DeliverToUser+0xc1:
ba50ded9 8b45f0 mov eax,dword ptr [ebp-10h]
ba50dedc 29451c sub dword ptr [ebp+1Ch],eax
ba50dedf 8b471c mov eax,dword ptr [edi+1Ch]
ba50dee2 ff7508 push dword ptr [ebp+8]
ba50dee5 2b4510 sub eax,dword ptr [ebp+10h]
ba50dee8 57 push edi
ba50dee9 8945fc mov dword ptr [ebp-4],eax
ba50deec e817ffffff call tcpip!UpdateIPSecRcvBuf (ba50de08)
tcpip!DeliverToUser+0xd9:
ba50def1 807e0906 cmp byte ptr [esi+9],6
ba50def5 753e jne tcpip!DeliverToUser+0x12f (ba50df35)
tcpip!DeliverToUser+0xdf:
ba50def7 8b430c mov eax,dword ptr [ebx+0Ch]
ba50defa 83b88c00000007 cmp dword ptr [eax+8Ch],7
ba50df01 0f857a510000 jne tcpip!DeliverToUser+0x124 (ba513081)
tcpip!DeliverToUser+0xeb:
ba50df07 837f1000 cmp dword ptr [edi+10h],0
ba50df0b 741a je tcpip!DeliverToUser+0x114 (ba50df27)
tcpip!DeliverToUser+0xf1:
ba50df0d 640fb61551000000 movzx edx,byte ptr fs:[51h]
ba50df15 8b8880010000 mov ecx,dword ptr [eax+180h]
ba50df1b 3bca cmp ecx,edx
ba50df1d 0f857145ffff jne tcpip!DeliverToUser+0x103 (ba502494)
tcpip!DeliverToUser+0x103:
ba502494 8b15f84154ba mov edx,dword ptr [tcpip!_imp__KeNumberProcessors (ba5441f8)]
ba50249a 0fbe12 movsx edx,byte ptr [edx]
ba50249d 3bca cmp ecx,edx
ba50249f 0f8582ba0000 jne tcpip!DeliverToUser+0x114 (ba50df27)
tcpip!DeliverToUser+0x110:
ba50df23 c645f402 mov byte ptr [ebp-0Ch],2
tcpip!DeliverToUser+0x114:
ba50df27 640fb60d51000000 movzx ecx,byte ptr fs:[51h]
ba50df2f 898880010000 mov dword ptr [eax+180h],ecx
ba5024a5 e979ba0000 jmp tcpip!DeliverToUser+0x110 (ba50df23)
[Update: our answer, posted 5/2/2008]
In response: I would like to congratulate the people who figured this out and say thank you to all of the people who gave this a try and sent in a response. I’m not sure if we made a challenge that was too easy, too difficult, or too long, but we did not get as many responses as we had expected. If you looked at this and chose not to post a comment, or you have looked at our previous puzzlers but chose to skip this one, let us know why so we can post better puzzlers in the future.
Here is the answer…
Experienced assembly readers will recognize that this instruction is a valid instruction, but it takes action that no assembly writer would ever take. An offset of 5441F815 is too big to be realistic; no structure is 5441F815 bytes long.
Why is this instruction telling us to go to a crazy offset? One possible cause is that the instruction is corrupt; a good debug method for this problem would be to unassemble the code around this instruction to see if the adjacent instructions look valid, and then to compare these instructions to ones in a known good version of tcpip.sys. Because this instruction was not corrupt neither of these steps helped, so I started looking at how we got to this instruction.
Looking at the prior function on the stack, IPRcvPacket, I see that we actually called DeliverToUser and not GetAddrType. Why is our current instruction in GetAddrType instead of DeliverToUser? It is possible that the call went to the wrong place, that DeliverToUser called something else which returned to the wrong place, or that somehow the flow of execution did not go as intended.
The next step I took was to analyze the DeliverToUser assembly to see how we might have ended up in GetAddrType instead of DeliverToUser. What I found, and what several of our readers found, was that DeliverToUser contained a jump that went to 1 byte away from where we encountered the bogus instruction.
Why would eip point to ba50d9d3 instead of ba50d9d4? That is the challenging question in this exercise, because the answer is not obvious. Some engineers will conclude that the hardware must not have done what we asked it to do, and say that the hardware must be at fault. However, we don’t really have any proof at this time that the hardware has actually done something wrong. Hardware problems usually manifest themselves as a bit flip, where one bit is changed from the correct value. We have gone from a 4 (0100) to a 3 (0011); a 3 bit change is hard to blame on hardware.
To answer the question “Why would eip point to ba50d9d3 instead of ba50d9d4?” we need to carefully analyze the instruction at ba50df1d. I broke out my trusty instruction set reference and looked up the inner workings of the jne instruction. On page 3-542 of the Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 2A: Instruction Set Reference, A-M I found the layout for a jne to a 32 bit offset. The first two bytes are 0F 85, and the next 8 bytes are the offset we are going to jump too. This is our opcode:
0f 85 ff ff 45 71
Remember that by the time we make this jump the instruction pointer will have been moved to the next instruction, that will be at ba50df23 (ba50df1d plus the 6 bytes that are in this instruction). If we take this offset (ff ff 45 71) and add it to the next instruction we see that we’ll end up at ba502494 as expected.
1: kd> ?ba50df23+ffff4571
Evaluate expression: -1169152876 = ba502494
If we change just one bit in this opcode, if we change 71 (0111 0001) to be 70 (0111 0000), we end up at ba502493 instead. Now we can conclude that the hardware must have made a mistake, and now that we know what bit was flipped we can confidently say that we have identified the mistake that was made.
1: kd> ?ba50df23+ffff4570
Evaluate expression: -1169152877 = ba502493
We still have some unanswered questions. The most common question I get when I’ve found a bit flip is what piece of hardware is broken? Unfortunately I can’t answer this question with the data I have. In this case the jne instruction was good when we read it to put it in the dump but obviously it was bad when it was executed. Thinking about the parts needed to get the instruction from RAM and execute it I could speculate that a processor, riser board, or motherboard could be at fault but I don’t have any data that proves where the problem happened. It is unlikely that hardware diagnostics will find this type of failure because it is intermittent. Unless the problem manifests itself while the diagnostics are running, they will likely report that there is no problem. In the case of this server I wasn’t able to find out specifically what failed, uptime was more important than troubleshooting so a whole new server was put in place of this one and the problem has not been seen since.
Bit flips can be caused by any number of problems. Problems with a processor, a processor riser board, a motherboard, or RAM are common causes. As we have documented before in this blog improper voltage or overheating can also cause these types of problems. However the potential list of bit-flip causes is nearly endless; power supplies, power cables, incoming power fluctuations, radiation from nearby electronics, even sun spots, can all cause bit flips.
I should also explain why the stack showed that we were in GetAddrType instead of DeliverToUser. This is because tcpip.sys has been optimized with Basic Block Tools (BBT) optimization, blocks of code from the same function are located in different pages to reduce the cache hits in common code paths. This means that blocks of the GetAddrType code are located next to the DeliverToUser code. As demonstrated below there is a block of GetAddrType code that ends with an unconditional jmp at ba50248f and there is a block of DeliverToUser code that starts at ba502494.
1: kd> ub ba502494
tcpip!GetAddrType+0x9e:
ba502474 f6401401 test byte ptr [eax+14h],1
ba502478 0f84bd760000 je tcpip!GetAddrType+0xb4 (ba509b3b)
ba50247e 85db test ebx,ebx
ba502480 0f84bd6b0200 je tcpip!GetAddrType+0xa8 (ba529043)
ba502486 32c0 xor al,al
ba502488 e967770000 jmp tcpip!GetAddrType+0x15a (ba509bf4)
ba50248d 33c9 xor ecx,ecx
ba50248f e993770000 jmp tcpip!GetAddrType+0x19b (ba509c27)
1: kd> u
ba5024aa 90 nop
ba5024ab 90 nop
ba5024ac 90 nop
Congratulations to those who have read this far. Hopefully you now understand the cause of a problem that has stumped many engineers, and you have some new tools in your toolbox when you’re investigating a tricky problem. Remember to keep asking “why?” and not to assume anything; those are the approaches I used to find the source of this problem.
----
--
Are you sure your code snippet isnt bugged?
Bad return address when KiTrap0E is called. It should be ba502494 instead of ba502493.
I suspect this may be a code corruption.
It is the code address "ba502493" == tcpip!GetAddrType+0x19f that caused the bugcheck.
However, from the "ub ba50d9d8" output, it should call tcpip!DeliverToUser function instead of "tcpip!GetAddrType".
Further searching with "ba5024" in the assembly output shows that the following instruction in tcpip!DeliverToUser will jump to one byte after the crashing address:
"jne tcpip!DeliverToUser+0x103 (ba502494)"
So, I believe this code instruction may be corrupted that the execution jumps one byte before the target "ba502493" which is the garbage instruction "add byte ptr [ebx+5441F815h],cl". It is this garbage instruction that caused the bugcheck.
Jeffrey Tan
Was this a bitflip error causing the jump to the improper address?
kd> .formats ba502493
10111010 01010000 00100100 10010011 <--last bit flipped?
should have been:
.formats ba502494
10111010 01010000 00100100 10010100
code path should have been:
But instead:
We jump to the wrong location and interpret the instruction as:
008b15f84154 <--Should have started at 8b15 instead. So we end up with the garbage below:
To add to what I posted before, bitfip errors are usually (or is it always??) hardware related. I recommend running a memory diagnostic utility (such as the one on the 2008/Vista DVD or running hardware diagnostics from the vendor.
From 0f857145ffff jne ba502494
Its a relative jump of FFFF4571 make that FFFF4570 and you'll jump back one more (negative numbers count downward) to ba502493. So you might have a memmory error at that location. Another possibility that could occur is that because you are jumping into another page the page mechanism might have modified EIP if it was programmed incorrectly, but it shouldn't ever touch EIP.
I see now that it's 3 bits differing instead of just 1 (I should slow down and look at the data more carefully).
Did an interrupt occur and then following the ISR the context was restored incorrectly?
Although the hamming distance in the resultant address is 3 bits, the instruction appears to be using relative addressing, and in the relative address only one bit needs to be flipped.
Anyway, tcpip!DeliverToUser+0x103 (ba502494) seems somewhat strange. Aren't entrypoints typically word aligned?
Oh, and the other strangeness -- there are two addresses which resolve to a name of tcpip!DeliverToUser apparently, perhaps taking error handling code out of the fast path or an overloaded name with the mangled part stripped.
I'll take a guess of why it jump to ba502493 instead of ba502494.
This line
is this:
FORCEINLINE
ULONG
NTAPI
KeGetCurrentProcessorNumber(VOID)
{
__asm { movzx eax, _PCR KPCR.Number }
}
The .trap ecx=edx=1 which explains why it did not jump on ba50df1d. It smells like SMP related issue.
Seems that the bitflip hit the instruction itself this time:
0:000> .formats ffff4571
Evaluate expression:
Hex: ffff4571
Decimal: -47759
Octal: 37777642561
Binary: 11111111 11111111 01000101 01110001
Chars: ..Eq
Time: ***** Invalid
Float: low -1.#QNAN high 0
Double: 2.12197e-314
0:000> .formats ffff4570
Hex: ffff4570
Decimal: -47760
Octal: 37777642560
Binary: 11111111 11111111 01000101 01110000
Chars: ..Ep
original instruction:
faulty instruction:
ba50df1d 0f857045ffff jne tcpip!GetAddrType+0x19f (ba502493)
Well it took me some time to figure it out because my time was being spent in other places.
My process for doing it was simple, the trapped instruction didn't make any sense, it was a 32bit processor with an ebx+offset > 4GB, doesn't seem like an array over run but a bad value loaded into ebx, so I decided to find where ebx was loaded, so I searched for the binary 8b15f841 and found it an enterly intresting place, a byte off from where it should of been (or so I thought), but then I had to do other things, when I came back I started using livekd and unassembling GetAddrType, got distracted again, when I came back I started over and searched for the instruction again, found it and seen that the way it got there was from a jump, so I looked at the jne, thought well its a relative jump, maybe an interrupt happened and modified eip, after a while of other crazy thoughts (processor error with carry flag? a page monitoring program that screwed up after hooking the paging mechanism?). Then I took a look at the comments and seen "bit flipping" comment and it dawned on my, hardware error, something I don't consider very often, to be honest and as it was very close it had to be the answer, flipped bit, eip/jne offset, well as the comment said
"
.formats ba502493
Can't be that, 3 to 4 is a multibit flip 100, 011, and one less than FFFF4571 is a single bit flip.
For those of you that could not make the live chat on 8/13, here is the transcript of the chat session....