Hello NTDebuggers, we have been very impressed with the responses we’ve gotten to our previous puzzlers so far.  We invited the Debug Ninja to come up with a real challenge for this week.

This server blue screened with a Bug Check 0xD1: DRIVER_IRQL_NOT_LESS_OR_EQUAL.  The challenge this week is to tell us, as specifically as possible, why it crashed.  We’ll give you a hint, the complete answer is not that ebx+5441F815h points to an invalid address.

1: kd> .bugcheck

Bugcheck code 000000D1

Arguments e074281d 00000002 00000001 ba502493

 

1: kd> kv

ChildEBP RetAddr  Args to Child

f78b6544 ba502493 badb0d00 00000001 00000000 nt!_KiTrap0E+0x2a7 (FPO: [0,0] TrapFrame @ f78b6544)

f78b65e0 ba50d9d8 00000020 8c32cab8 00000022 tcpip!GetAddrType+0x19f (FPO: [Non-Fpo]) (CONV: stdcall)

f78b6694 ba50dc56 8c32cab8 8ca71c2c 000005c8 tcpip!IPRcvPacket+0x66c (FPO: [Non-Fpo]) (CONV: stdcall)

f78b66d4 ba50dd58 00000000 8ca63440 8ca71c0a tcpip!ARPRcvIndicationNew+0x149 (FPO: [Non-Fpo]) (CONV: stdcall)

f78b6710 bada5550 8c4b53b8 00000000 f78b678c tcpip!ARPRcvPacket+0x68 (FPO: [Non-Fpo]) (CONV: stdcall)

f78b6764 ba9c614b 8cac2ad0 f78b6784 00000002 NDIS!ethFilterDprIndicateReceivePacket+0x1d2 (FPO: [Non-Fpo]) (CONV: stdcall)

 

1: kd> .trap f78b6544

ErrCode = 00000002

eax=8c32cab8 ebx=8c323008 ecx=00000001 edx=00000001 esi=8ca71c18 edi=f78b6618

eip=ba502493 esp=f78b65b8 ebp=f78b65e0 iopl=0         nv up ei pl zr na pe nc

cs=0008  ss=0010  ds=0023  es=0023  fs=0030  gs=0000             efl=00250246

tcpip!GetAddrType+0x19f:

ba502493 008b15f84154    add     byte ptr [ebx+5441F815h],cl ds:0023:e074281d=??

 

The remainder of this puzzler is quite long, click here to see the complete debug notes required to solve this puzzlerPost your responses to this page (and not the complete debug page), so we can keep all the comments together.  Also, we won’t post any of your responses until Friday.


[Update, posted 4/30/2008]

Today is Wednesday and so far we haven’t received any complete responses; the partial answers we have received are close but no one has determined the root cause of the problem.  When looking at problems like this it is often useful to ask “why?” when you think you have the answer and think about what other information may be available.  In this situation it is possible to identify a specific problem and even determine appropriate action to fix the problem.

Perhaps we were a bit overzealous in posting 14 pages of assembly as a puzzler.  Below is an abbreviated debug with less assembly, the link to the complete debug is still available for those of you who want to try the full challenge.  Also, we are going to break our promise not to post responses until Friday and we have posted some responses that we have gotten so far, neither of these responses completely identifies the problem but they are getting close.

 

1: kd> dd @ebx+5441F815

e074281d  ???????? ???????? ???????? ????????

e074282d  ???????? ???????? ???????? ????????

e074283d  ???????? ???????? ???????? ????????

e074284d  ???????? ???????? ???????? ????????

e074285d  ???????? ???????? ???????? ????????

e074286d  ???????? ???????? ???????? ????????

e074287d  ???????? ???????? ???????? ????????

e074288d  ???????? ???????? ???????? ????????

1: kd> !pte e074281d

               VA e074281d

PDE at 00000000C0603818    PTE at 00000000C0703A10

contains 000000021B980963  contains E154FC1000000400

pfn 21b980 -G-DA--KWEV                           not valid

                       Proto: 00000000E154FC10

 

1: kd> ub ba50d9d8

tcpip!IPRcvPacket+0x658:

ba50d9c4 51              push    ecx

ba50d9c5 50              push    eax

ba50d9c6 53              push    ebx

ba50d9c7 ff75f4          push    dword ptr [ebp-0Ch]

ba50d9ca ff75f8          push    dword ptr [ebp-8]

ba50d9cd ff75ec          push    dword ptr [ebp-14h]

ba50d9d0 ff7508          push    dword ptr [ebp+8]

ba50d9d3 e84a040000      call    tcpip!DeliverToUser (ba50de22)

 

1: kd> uf tcpip!DeliverToUser

tcpip!DeliverToUser:

tcpip!DeliverToUser+0xc1:

ba50ded9 8b45f0          mov     eax,dword ptr [ebp-10h]

ba50dedc 29451c          sub     dword ptr [ebp+1Ch],eax

ba50dedf 8b471c          mov     eax,dword ptr [edi+1Ch]

ba50dee2 ff7508          push    dword ptr [ebp+8]

ba50dee5 2b4510          sub     eax,dword ptr [ebp+10h]

ba50dee8 57              push    edi

ba50dee9 8945fc          mov     dword ptr [ebp-4],eax

ba50deec e817ffffff      call    tcpip!UpdateIPSecRcvBuf (ba50de08)

 

tcpip!DeliverToUser+0xd9:

ba50def1 807e0906        cmp     byte ptr [esi+9],6

ba50def5 753e            jne     tcpip!DeliverToUser+0x12f (ba50df35)

 

tcpip!DeliverToUser+0xdf:

ba50def7 8b430c          mov     eax,dword ptr [ebx+0Ch]

ba50defa 83b88c00000007  cmp     dword ptr [eax+8Ch],7

ba50df01 0f857a510000    jne     tcpip!DeliverToUser+0x124 (ba513081)

 

tcpip!DeliverToUser+0xeb:

ba50df07 837f1000        cmp     dword ptr [edi+10h],0

ba50df0b 741a            je      tcpip!DeliverToUser+0x114 (ba50df27)

 

tcpip!DeliverToUser+0xf1:

ba50df0d 640fb61551000000 movzx   edx,byte ptr fs:[51h]

ba50df15 8b8880010000    mov     ecx,dword ptr [eax+180h]

ba50df1b 3bca            cmp     ecx,edx

ba50df1d 0f857145ffff    jne     tcpip!DeliverToUser+0x103 (ba502494)

 

tcpip!DeliverToUser+0x103:

ba502494 8b15f84154ba    mov     edx,dword ptr [tcpip!_imp__KeNumberProcessors (ba5441f8)]

ba50249a 0fbe12          movsx   edx,byte ptr [edx]

ba50249d 3bca            cmp     ecx,edx

ba50249f 0f8582ba0000    jne     tcpip!DeliverToUser+0x114 (ba50df27)

 

tcpip!DeliverToUser+0x110:

ba50df23 c645f402        mov     byte ptr [ebp-0Ch],2

 

tcpip!DeliverToUser+0x114:

ba50df27 640fb60d51000000 movzx   ecx,byte ptr fs:[51h]

ba50df2f 898880010000    mov     dword ptr [eax+180h],ecx

 

tcpip!DeliverToUser+0x114:

ba5024a5 e979ba0000      jmp     tcpip!DeliverToUser+0x110 (ba50df23)

 


 

[Update: our answer, posted 5/2/2008]

In response:  I would like to congratulate the people who figured this out and say thank you to all of the people who gave this a try and sent in a response.  I’m not sure if we made a challenge that was too easy, too difficult, or too long, but we did not get as many responses as we had expected.  If you looked at this and chose not to post a comment, or you have looked at our previous puzzlers but chose to skip this one, let us know why so we can post better puzzlers in the future.

Here is the answer…

Experienced assembly readers will recognize that this instruction is a valid instruction, but it takes action that no assembly writer would ever take.  An offset of 5441F815 is too big to be realistic; no structure is 5441F815 bytes long.

1: kd> .trap f78b6544

ErrCode = 00000002

eax=8c32cab8 ebx=8c323008 ecx=00000001 edx=00000001 esi=8ca71c18 edi=f78b6618

eip=ba502493 esp=f78b65b8 ebp=f78b65e0 iopl=0         nv up ei pl zr na pe nc

cs=0008  ss=0010  ds=0023  es=0023  fs=0030  gs=0000             efl=00250246

tcpip!GetAddrType+0x19f:

ba502493 008b15f84154    add     byte ptr [ebx+5441F815h],cl ds:0023:e074281d=??

 

Why is this instruction telling us to go to a crazy offset?  One possible cause is that the instruction is corrupt; a good debug method for this problem would be to unassemble the code around this instruction to see if the adjacent instructions look valid, and then to compare these instructions to ones in a known good version of tcpip.sys.  Because this instruction was not corrupt neither of these steps helped, so I started looking at how we got to this instruction.

Looking at the prior function on the stack, IPRcvPacket, I see that we actually called DeliverToUser and not GetAddrType.  Why is our current instruction in GetAddrType instead of DeliverToUser?  It is possible that the call went to the wrong place, that DeliverToUser called something else which returned to the wrong place, or that somehow the flow of execution did not go as intended.

1: kd> ub ba50d9d8

tcpip!IPRcvPacket+0x658:

ba50d9c4 51              push    ecx

ba50d9c5 50              push    eax

ba50d9c6 53              push    ebx

ba50d9c7 ff75f4          push    dword ptr [ebp-0Ch]

ba50d9ca ff75f8          push    dword ptr [ebp-8]

ba50d9cd ff75ec          push    dword ptr [ebp-14h]

ba50d9d0 ff7508          push    dword ptr [ebp+8]

ba50d9d3 e84a040000      call    tcpip!DeliverToUser (ba50de22)

 

The next step I took was to analyze the DeliverToUser assembly to see how we might have ended up in GetAddrType instead of DeliverToUser.  What I found, and what several of our readers found, was that DeliverToUser contained a jump that went to 1 byte away from where we encountered the bogus instruction.

ba50df1d 0f857145ffff    jne     tcpip!DeliverToUser+0x103 (ba502494)

 

Why would eip point to ba50d9d3 instead of ba50d9d4?  That is the challenging question in this exercise, because the answer is not obvious.  Some engineers will conclude that the hardware must not have done what we asked it to do, and say that the hardware must be at fault.  However, we don’t really have any proof at this time that the hardware has actually done something wrong.  Hardware problems usually manifest themselves as a bit flip, where one bit is changed from the correct value.  We have gone from a 4 (0100) to a 3 (0011); a 3 bit change is hard to blame on hardware.

To answer the question “Why would eip point to ba50d9d3 instead of ba50d9d4?” we need to carefully analyze the instruction at ba50df1d.  I broke out my trusty instruction set reference and looked up the inner workings of the jne instruction.  On page 3-542 of the Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 2A: Instruction Set Reference, A-M I found the layout for a jne to a 32 bit offset.  The first two bytes are 0F 85, and the next 8 bytes are the offset we are going to jump too.  This is our opcode:

0f 85 ff ff 45 71

Remember that by the time we make this jump the instruction pointer will have been moved to the next instruction, that will be at ba50df23 (ba50df1d plus the 6 bytes that are in this instruction).  If we take this offset (ff ff 45 71) and add it to the next instruction we see that we’ll end up at ba502494 as expected.

1: kd> ?ba50df23+ffff4571

Evaluate expression: -1169152876 = ba502494

 

If we change just one bit in this opcode, if we change 71 (0111 0001) to be 70 (0111 0000), we end up at ba502493 instead.  Now we can conclude that the hardware must have made a mistake, and now that we know what bit was flipped we can confidently say that we have identified the mistake that was made.

1: kd> ?ba50df23+ffff4570

Evaluate expression: -1169152877 = ba502493

 

We still have some unanswered questions.  The most common question I get when I’ve found a bit flip is what piece of hardware is broken?  Unfortunately I can’t answer this question with the data I have.  In this case the jne instruction was good when we read it to put it in the dump but obviously it was bad when it was executed.  Thinking about the parts needed to get the instruction from RAM and execute it I could speculate that a processor, riser board, or motherboard could be at fault but I don’t have any data that proves where the problem happened.  It is unlikely that hardware diagnostics will find this type of failure because it is intermittent.  Unless the problem manifests itself while the diagnostics are running, they will likely report that there is no problem.  In the case of this server I wasn’t able to find out specifically what failed, uptime was more important than troubleshooting so a whole new server was put in place of this one and the problem has not been seen since.

Bit flips can be caused by any number of problems.  Problems with a processor, a processor riser board, a motherboard, or RAM are common causes.  As we have documented before in this blog improper voltage or overheating can also cause these types of problems.  However the potential list of bit-flip causes is nearly endless; power supplies, power cables, incoming power fluctuations, radiation from nearby electronics, even sun spots, can all cause bit flips.

I should also explain why the stack showed that we were in GetAddrType instead of DeliverToUser.  This is because tcpip.sys has been optimized with Basic Block Tools (BBT) optimization, blocks of code from the same function are located in different pages to reduce the cache hits in common code paths.  This means that blocks of the GetAddrType code are located next to the DeliverToUser code.  As demonstrated below there is a block of GetAddrType code that ends with an unconditional jmp at ba50248f and there is a block of DeliverToUser code that starts at ba502494.

1: kd> ub ba502494

tcpip!GetAddrType+0x9e:

ba502474 f6401401        test    byte ptr [eax+14h],1

ba502478 0f84bd760000    je      tcpip!GetAddrType+0xb4 (ba509b3b)

ba50247e 85db            test    ebx,ebx

ba502480 0f84bd6b0200    je      tcpip!GetAddrType+0xa8 (ba529043)

ba502486 32c0            xor     al,al

ba502488 e967770000      jmp     tcpip!GetAddrType+0x15a (ba509bf4)

ba50248d 33c9            xor     ecx,ecx

ba50248f e993770000      jmp     tcpip!GetAddrType+0x19b (ba509c27)

1: kd> u

tcpip!DeliverToUser+0x103:

ba502494 8b15f84154ba    mov     edx,dword ptr [tcpip!_imp__KeNumberProcessors (ba5441f8)]

ba50249a 0fbe12          movsx   edx,byte ptr [edx]

ba50249d 3bca            cmp     ecx,edx

ba50249f 0f8582ba0000    jne     tcpip!DeliverToUser+0x114 (ba50df27)

ba5024a5 e979ba0000      jmp     tcpip!DeliverToUser+0x110 (ba50df23)

ba5024aa 90              nop

ba5024ab 90              nop

ba5024ac 90              nop

 

Congratulations to those who have read this far.  Hopefully you now understand the cause of a problem that has stumped many engineers, and you have some new tools in your toolbox when you’re investigating a tricky problem.  Remember to keep asking “why?” and not to assume anything; those are the approaches I used to find the source of this problem.