NTDebugging Puzzler 0x00000004: This didn’t puzzle the Debug Ninja, how about you?

NTDebugging Puzzler 0x00000004: This didn’t puzzle the Debug Ninja, how about you?

  • Comments 12

Hello NTDebuggers, we have been very impressed with the responses we’ve gotten to our previous puzzlers so far.  We invited the Debug Ninja to come up with a real challenge for this week.

This server blue screened with a Bug Check 0xD1: DRIVER_IRQL_NOT_LESS_OR_EQUAL.  The challenge this week is to tell us, as specifically as possible, why it crashed.  We’ll give you a hint, the complete answer is not that ebx+5441F815h points to an invalid address.

1: kd> .bugcheck

Bugcheck code 000000D1

Arguments e074281d 00000002 00000001 ba502493

 

1: kd> kv

ChildEBP RetAddr  Args to Child

f78b6544 ba502493 badb0d00 00000001 00000000 nt!_KiTrap0E+0x2a7 (FPO: [0,0] TrapFrame @ f78b6544)

f78b65e0 ba50d9d8 00000020 8c32cab8 00000022 tcpip!GetAddrType+0x19f (FPO: [Non-Fpo]) (CONV: stdcall)

f78b6694 ba50dc56 8c32cab8 8ca71c2c 000005c8 tcpip!IPRcvPacket+0x66c (FPO: [Non-Fpo]) (CONV: stdcall)

f78b66d4 ba50dd58 00000000 8ca63440 8ca71c0a tcpip!ARPRcvIndicationNew+0x149 (FPO: [Non-Fpo]) (CONV: stdcall)

f78b6710 bada5550 8c4b53b8 00000000 f78b678c tcpip!ARPRcvPacket+0x68 (FPO: [Non-Fpo]) (CONV: stdcall)

f78b6764 ba9c614b 8cac2ad0 f78b6784 00000002 NDIS!ethFilterDprIndicateReceivePacket+0x1d2 (FPO: [Non-Fpo]) (CONV: stdcall)

 

1: kd> .trap f78b6544

ErrCode = 00000002

eax=8c32cab8 ebx=8c323008 ecx=00000001 edx=00000001 esi=8ca71c18 edi=f78b6618

eip=ba502493 esp=f78b65b8 ebp=f78b65e0 iopl=0         nv up ei pl zr na pe nc

cs=0008  ss=0010  ds=0023  es=0023  fs=0030  gs=0000             efl=00250246

tcpip!GetAddrType+0x19f:

ba502493 008b15f84154    add     byte ptr [ebx+5441F815h],cl ds:0023:e074281d=??

 

The remainder of this puzzler is quite long, click here to see the complete debug notes required to solve this puzzlerPost your responses to this page (and not the complete debug page), so we can keep all the comments together.  Also, we won’t post any of your responses until Friday.


[Update, posted 4/30/2008]

Today is Wednesday and so far we haven’t received any complete responses; the partial answers we have received are close but no one has determined the root cause of the problem.  When looking at problems like this it is often useful to ask “why?” when you think you have the answer and think about what other information may be available.  In this situation it is possible to identify a specific problem and even determine appropriate action to fix the problem.

Perhaps we were a bit overzealous in posting 14 pages of assembly as a puzzler.  Below is an abbreviated debug with less assembly, the link to the complete debug is still available for those of you who want to try the full challenge.  Also, we are going to break our promise not to post responses until Friday and we have posted some responses that we have gotten so far, neither of these responses completely identifies the problem but they are getting close.

 

1: kd> dd @ebx+5441F815

e074281d  ???????? ???????? ???????? ????????

e074282d  ???????? ???????? ???????? ????????

e074283d  ???????? ???????? ???????? ????????

e074284d  ???????? ???????? ???????? ????????

e074285d  ???????? ???????? ???????? ????????

e074286d  ???????? ???????? ???????? ????????

e074287d  ???????? ???????? ???????? ????????

e074288d  ???????? ???????? ???????? ????????

1: kd> !pte e074281d

               VA e074281d

PDE at 00000000C0603818    PTE at 00000000C0703A10

contains 000000021B980963  contains E154FC1000000400

pfn 21b980 -G-DA--KWEV                           not valid

                       Proto: 00000000E154FC10

 

1: kd> ub ba50d9d8

tcpip!IPRcvPacket+0x658:

ba50d9c4 51              push    ecx

ba50d9c5 50              push    eax

ba50d9c6 53              push    ebx

ba50d9c7 ff75f4          push    dword ptr [ebp-0Ch]

ba50d9ca ff75f8          push    dword ptr [ebp-8]

ba50d9cd ff75ec          push    dword ptr [ebp-14h]

ba50d9d0 ff7508          push    dword ptr [ebp+8]

ba50d9d3 e84a040000      call    tcpip!DeliverToUser (ba50de22)

 

1: kd> uf tcpip!DeliverToUser

tcpip!DeliverToUser:

tcpip!DeliverToUser+0xc1:

ba50ded9 8b45f0          mov     eax,dword ptr [ebp-10h]

ba50dedc 29451c          sub     dword ptr [ebp+1Ch],eax

ba50dedf 8b471c          mov     eax,dword ptr [edi+1Ch]

ba50dee2 ff7508          push    dword ptr [ebp+8]

ba50dee5 2b4510          sub     eax,dword ptr [ebp+10h]

ba50dee8 57              push    edi

ba50dee9 8945fc          mov     dword ptr [ebp-4],eax

ba50deec e817ffffff      call    tcpip!UpdateIPSecRcvBuf (ba50de08)

 

tcpip!DeliverToUser+0xd9:

ba50def1 807e0906        cmp     byte ptr [esi+9],6

ba50def5 753e            jne     tcpip!DeliverToUser+0x12f (ba50df35)

 

tcpip!DeliverToUser+0xdf:

ba50def7 8b430c          mov     eax,dword ptr [ebx+0Ch]

ba50defa 83b88c00000007  cmp     dword ptr [eax+8Ch],7

ba50df01 0f857a510000    jne     tcpip!DeliverToUser+0x124 (ba513081)

 

tcpip!DeliverToUser+0xeb:

ba50df07 837f1000        cmp     dword ptr [edi+10h],0

ba50df0b 741a            je      tcpip!DeliverToUser+0x114 (ba50df27)

 

tcpip!DeliverToUser+0xf1:

ba50df0d 640fb61551000000 movzx   edx,byte ptr fs:[51h]

ba50df15 8b8880010000    mov     ecx,dword ptr [eax+180h]

ba50df1b 3bca            cmp     ecx,edx

ba50df1d 0f857145ffff    jne     tcpip!DeliverToUser+0x103 (ba502494)

 

tcpip!DeliverToUser+0x103:

ba502494 8b15f84154ba    mov     edx,dword ptr [tcpip!_imp__KeNumberProcessors (ba5441f8)]

ba50249a 0fbe12          movsx   edx,byte ptr [edx]

ba50249d 3bca            cmp     ecx,edx

ba50249f 0f8582ba0000    jne     tcpip!DeliverToUser+0x114 (ba50df27)

 

tcpip!DeliverToUser+0x110:

ba50df23 c645f402        mov     byte ptr [ebp-0Ch],2

 

tcpip!DeliverToUser+0x114:

ba50df27 640fb60d51000000 movzx   ecx,byte ptr fs:[51h]

ba50df2f 898880010000    mov     dword ptr [eax+180h],ecx

 

tcpip!DeliverToUser+0x114:

ba5024a5 e979ba0000      jmp     tcpip!DeliverToUser+0x110 (ba50df23)

 


 

[Update: our answer, posted 5/2/2008]

In response:  I would like to congratulate the people who figured this out and say thank you to all of the people who gave this a try and sent in a response.  I’m not sure if we made a challenge that was too easy, too difficult, or too long, but we did not get as many responses as we had expected.  If you looked at this and chose not to post a comment, or you have looked at our previous puzzlers but chose to skip this one, let us know why so we can post better puzzlers in the future.

Here is the answer…

Experienced assembly readers will recognize that this instruction is a valid instruction, but it takes action that no assembly writer would ever take.  An offset of 5441F815 is too big to be realistic; no structure is 5441F815 bytes long.

1: kd> .trap f78b6544

ErrCode = 00000002

eax=8c32cab8 ebx=8c323008 ecx=00000001 edx=00000001 esi=8ca71c18 edi=f78b6618

eip=ba502493 esp=f78b65b8 ebp=f78b65e0 iopl=0         nv up ei pl zr na pe nc

cs=0008  ss=0010  ds=0023  es=0023  fs=0030  gs=0000             efl=00250246

tcpip!GetAddrType+0x19f:

ba502493 008b15f84154    add     byte ptr [ebx+5441F815h],cl ds:0023:e074281d=??

 

Why is this instruction telling us to go to a crazy offset?  One possible cause is that the instruction is corrupt; a good debug method for this problem would be to unassemble the code around this instruction to see if the adjacent instructions look valid, and then to compare these instructions to ones in a known good version of tcpip.sys.  Because this instruction was not corrupt neither of these steps helped, so I started looking at how we got to this instruction.

Looking at the prior function on the stack, IPRcvPacket, I see that we actually called DeliverToUser and not GetAddrType.  Why is our current instruction in GetAddrType instead of DeliverToUser?  It is possible that the call went to the wrong place, that DeliverToUser called something else which returned to the wrong place, or that somehow the flow of execution did not go as intended.

1: kd> ub ba50d9d8

tcpip!IPRcvPacket+0x658:

ba50d9c4 51              push    ecx

ba50d9c5 50              push    eax

ba50d9c6 53              push    ebx

ba50d9c7 ff75f4          push    dword ptr [ebp-0Ch]

ba50d9ca ff75f8          push    dword ptr [ebp-8]

ba50d9cd ff75ec          push    dword ptr [ebp-14h]

ba50d9d0 ff7508          push    dword ptr [ebp+8]

ba50d9d3 e84a040000      call    tcpip!DeliverToUser (ba50de22)

 

The next step I took was to analyze the DeliverToUser assembly to see how we might have ended up in GetAddrType instead of DeliverToUser.  What I found, and what several of our readers found, was that DeliverToUser contained a jump that went to 1 byte away from where we encountered the bogus instruction.

ba50df1d 0f857145ffff    jne     tcpip!DeliverToUser+0x103 (ba502494)

 

Why would eip point to ba50d9d3 instead of ba50d9d4?  That is the challenging question in this exercise, because the answer is not obvious.  Some engineers will conclude that the hardware must not have done what we asked it to do, and say that the hardware must be at fault.  However, we don’t really have any proof at this time that the hardware has actually done something wrong.  Hardware problems usually manifest themselves as a bit flip, where one bit is changed from the correct value.  We have gone from a 4 (0100) to a 3 (0011); a 3 bit change is hard to blame on hardware.

To answer the question “Why would eip point to ba50d9d3 instead of ba50d9d4?” we need to carefully analyze the instruction at ba50df1d.  I broke out my trusty instruction set reference and looked up the inner workings of the jne instruction.  On page 3-542 of the Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 2A: Instruction Set Reference, A-M I found the layout for a jne to a 32 bit offset.  The first two bytes are 0F 85, and the next 8 bytes are the offset we are going to jump too.  This is our opcode:

0f 85 ff ff 45 71

Remember that by the time we make this jump the instruction pointer will have been moved to the next instruction, that will be at ba50df23 (ba50df1d plus the 6 bytes that are in this instruction).  If we take this offset (ff ff 45 71) and add it to the next instruction we see that we’ll end up at ba502494 as expected.

1: kd> ?ba50df23+ffff4571

Evaluate expression: -1169152876 = ba502494

 

If we change just one bit in this opcode, if we change 71 (0111 0001) to be 70 (0111 0000), we end up at ba502493 instead.  Now we can conclude that the hardware must have made a mistake, and now that we know what bit was flipped we can confidently say that we have identified the mistake that was made.

1: kd> ?ba50df23+ffff4570

Evaluate expression: -1169152877 = ba502493

 

We still have some unanswered questions.  The most common question I get when I’ve found a bit flip is what piece of hardware is broken?  Unfortunately I can’t answer this question with the data I have.  In this case the jne instruction was good when we read it to put it in the dump but obviously it was bad when it was executed.  Thinking about the parts needed to get the instruction from RAM and execute it I could speculate that a processor, riser board, or motherboard could be at fault but I don’t have any data that proves where the problem happened.  It is unlikely that hardware diagnostics will find this type of failure because it is intermittent.  Unless the problem manifests itself while the diagnostics are running, they will likely report that there is no problem.  In the case of this server I wasn’t able to find out specifically what failed, uptime was more important than troubleshooting so a whole new server was put in place of this one and the problem has not been seen since.

Bit flips can be caused by any number of problems.  Problems with a processor, a processor riser board, a motherboard, or RAM are common causes.  As we have documented before in this blog improper voltage or overheating can also cause these types of problems.  However the potential list of bit-flip causes is nearly endless; power supplies, power cables, incoming power fluctuations, radiation from nearby electronics, even sun spots, can all cause bit flips.

I should also explain why the stack showed that we were in GetAddrType instead of DeliverToUser.  This is because tcpip.sys has been optimized with Basic Block Tools (BBT) optimization, blocks of code from the same function are located in different pages to reduce the cache hits in common code paths.  This means that blocks of the GetAddrType code are located next to the DeliverToUser code.  As demonstrated below there is a block of GetAddrType code that ends with an unconditional jmp at ba50248f and there is a block of DeliverToUser code that starts at ba502494.

1: kd> ub ba502494

tcpip!GetAddrType+0x9e:

ba502474 f6401401        test    byte ptr [eax+14h],1

ba502478 0f84bd760000    je      tcpip!GetAddrType+0xb4 (ba509b3b)

ba50247e 85db            test    ebx,ebx

ba502480 0f84bd6b0200    je      tcpip!GetAddrType+0xa8 (ba529043)

ba502486 32c0            xor     al,al

ba502488 e967770000      jmp     tcpip!GetAddrType+0x15a (ba509bf4)

ba50248d 33c9            xor     ecx,ecx

ba50248f e993770000      jmp     tcpip!GetAddrType+0x19b (ba509c27)

1: kd> u

tcpip!DeliverToUser+0x103:

ba502494 8b15f84154ba    mov     edx,dword ptr [tcpip!_imp__KeNumberProcessors (ba5441f8)]

ba50249a 0fbe12          movsx   edx,byte ptr [edx]

ba50249d 3bca            cmp     ecx,edx

ba50249f 0f8582ba0000    jne     tcpip!DeliverToUser+0x114 (ba50df27)

ba5024a5 e979ba0000      jmp     tcpip!DeliverToUser+0x110 (ba50df23)

ba5024aa 90              nop

ba5024ab 90              nop

ba5024ac 90              nop

 

Congratulations to those who have read this far.  Hopefully you now understand the cause of a problem that has stumped many engineers, and you have some new tools in your toolbox when you’re investigating a tricky problem.  Remember to keep asking “why?” and not to assume anything; those are the approaches I used to find the source of this problem.





Leave a Comment
  • Please add 5 and 5 and type the answer here:
  • Post
  • tcpip!DeliverToUser+0x114:

    ba50df27 640fb60d51000000 movzx   ecx,byte ptr fs:[51h]

    ba50df2f 898880010000    mov     dword ptr [eax+180h],ecx

    ----

    tcpip!DeliverToUser+0x114:

    ba5024a5 e979ba0000      jmp     tcpip!DeliverToUser+0x110 (ba50df23)

    --

    Are you sure your code snippet isnt bugged?

  • f78b6544 ba502493 badb0d00 00000001 00000000 nt!_KiTrap0E+0x2a7 (FPO: [0,0] TrapFrame @ f78b6544)

    Bad return address when KiTrap0E is called. It should be ba502494 instead of ba502493.

    tcpip!DeliverToUser+0x103:

    ba502494 8b15f84154ba    mov     edx,dword ptr [tcpip!_imp__KeNumberProcessors (ba5441f8)]

    [This was the first close response we got, and was a very good start. Is there a specific problem that could cause us to go to ba502493 instead of ba502494?]
  • I suspect this may be a code corruption.

    It is the code address "ba502493" == tcpip!GetAddrType+0x19f that caused the bugcheck.

    However, from the "ub ba50d9d8" output, it should call tcpip!DeliverToUser function instead of "tcpip!GetAddrType".

    Further searching with "ba5024" in the assembly output shows that the following instruction in tcpip!DeliverToUser will jump to one byte after the crashing address:

    "jne tcpip!DeliverToUser+0x103 (ba502494)"

    So, I believe this code instruction may be corrupted that the execution jumps one byte before the target "ba502493" which is the garbage instruction "add     byte ptr [ebx+5441F815h],cl". It is this garbage instruction that caused the bugcheck.

    Jeffrey Tan

    [This answer is very close, but there is more that can be said about this problem. How can we jump to ba502493 instead of ba502494?]
  • Was this a bitflip error causing the jump to the improper address?

    kd> .formats ba502493

    10111010 01010000 00100100 10010011 <--last bit flipped?

    should have been:

    .formats ba502494

    10111010 01010000 00100100 10010100

    code path should have been:

    ba50df1d 0f857145ffff    jne     tcpip!DeliverToUser+0x103 (ba502494)

    tcpip!DeliverToUser+0x103:

    ba502494 8b15f84154ba    mov     edx,dword ptr [tcpip!_imp__KeNumberProcessors (ba5441f8)]

    But instead:

    We jump to the wrong location and interpret the instruction as:

    008b15f84154  <--Should have started at 8b15 instead. So we end up with the garbage below:

    tcpip!GetAddrType+0x19f:

    ba502493 008b15f84154    add     byte ptr [ebx+5441F815h],cl ds:0023:e074281d=??

    [Good effort, you’re getting very close to the answer. However, this example would require that three bits be flipped not just the last bit.]
  • To add to what I posted before, bitfip errors are usually (or is it always??) hardware related. I recommend running a memory diagnostic utility (such as the one on the 2008/Vista DVD or running hardware diagnostics from the vendor.

  • From 0f857145ffff jne ba502494

    Its a relative jump of FFFF4571 make that FFFF4570 and you'll jump back one more (negative numbers count downward) to ba502493. So you might have a memmory error at that location.  Another possibility that could occur is that because you are jumping into another page the page mechanism might have modified EIP if it was programmed incorrectly, but it shouldn't ever touch EIP.

    [WE HAVE A WINNER! Congratulations, you are the first person to identify and post the root cause of this problem.]
  • I see now that it's 3 bits differing instead of just 1 (I should slow down and look at the data more carefully).

    Did an interrupt occur and then following the ISR the context was restored incorrectly?

  • Although the hamming distance in the resultant address is 3 bits, the instruction appears to be using relative addressing, and in the relative address only one bit needs to be flipped.

    Anyway, tcpip!DeliverToUser+0x103 (ba502494) seems somewhat strange.  Aren't entrypoints typically word aligned?

    Oh, and the other strangeness -- there are two addresses which resolve to a name of tcpip!DeliverToUser apparently, perhaps taking error handling code out of the fast path or an overloaded name with the mangled part stripped.

  • I'll take a guess of why it jump to ba502493 instead of ba502494.

    This line

    ba50df0d 640fb61551000000 movzx   edx,byte ptr fs:[51h]

    is this:

    FORCEINLINE

    ULONG

    NTAPI

    KeGetCurrentProcessorNumber(VOID)

    {

       __asm {  movzx eax, _PCR KPCR.Number  }

    }

    The .trap ecx=edx=1 which explains why it did not jump on ba50df1d. It smells like SMP related issue.

  • Seems that the bitflip hit the instruction itself this time:

    0:000> .formats ffff4571

    Evaluate expression:

     Hex:     ffff4571

     Decimal: -47759

     Octal:   37777642561

     Binary:  11111111 11111111 01000101 01110001

     Chars:   ..Eq

     Time:    ***** Invalid

     Float:   low -1.#QNAN high 0

     Double:  2.12197e-314

    0:000> .formats ffff4570

    Evaluate expression:

     Hex:     ffff4570

     Decimal: -47760

     Octal:   37777642560

     Binary:  11111111 11111111 01000101 01110000

     Chars:   ..Ep

     Time:    ***** Invalid

     Float:   low -1.#QNAN high 0

     Double:  2.12197e-314

    original instruction:

    ba50df1d 0f857145ffff    jne     tcpip!DeliverToUser+0x103 (ba502494)

    faulty instruction:

    ba50df1d 0f857045ffff    jne     tcpip!GetAddrType+0x19f (ba502493)

  • Well it took me some time to figure it out because my time was being spent in other places.

    My process for doing it was simple, the trapped instruction didn't make any sense, it was a 32bit processor with an ebx+offset > 4GB, doesn't seem like an array over run but a bad value loaded into ebx, so I decided to find where ebx was loaded, so I searched for the binary 8b15f841 and found it an enterly intresting place, a byte off from where it should of been (or so I thought), but then I had to do other things, when I came back I started using livekd and unassembling GetAddrType, got distracted again, when I came back I started over and searched for the instruction again, found it and seen that the way it got there was from a jump, so I looked at the jne, thought well its a relative jump, maybe an interrupt happened and modified eip, after a while of other crazy thoughts (processor error with carry flag? a page monitoring program that screwed up after hooking the paging mechanism?). Then I took a look at the comments and seen "bit flipping" comment and it dawned on my, hardware error, something I don't consider very often, to be honest and as it was very close it had to be the answer, flipped bit, eip/jne offset, well as the comment said

    "

    .formats ba502493

    .formats ba502494

    "

    Can't be that, 3 to 4 is a multibit flip 100, 011, and one less than FFFF4571 is a single bit flip.

  • For those of you that could not make the live chat on 8/13, here is the transcript of the chat session....

Page 1 of 1 (12 items)