• Ntdebugging Blog

    Announcement: ODbgExt (Open Debugger Extension) on CodePlex

    • 3 Comments

     

    Hello NTDebuggers, I’d like to announce something new for our community to share.  We have decided to host an Open Source Debugger Extension project called ODbgExt on codeplex.com   Right now it’s just the basic framework.  This will be something we can work on together as a community.  Think of it as a debugger extension by the people for the people.  Jason Epperly and I (Jeff Dailey) will be the project managers for the codeplex environment.   You can find the project on http://www.codeplex.com/ODbgExt

     

     

    Project Description


    ODbgExt is an open source debugger extension for WinDbg that is intended to be developed by the debugging community. It is based on the Windows Debugger SDK Sample ‘exts’ and uses the COM Debugger interface IDebugControl. The primary objective of the project is to provide the debugging community a central location to share their debugger extensions that help to isolate common problems in the community and make debugging both live systems and dumps easier. The initial release will be the basic framework that we expect other developers to contribute to. There will be very basic functionality in the initial version. We, GES (Global Escalation Services) intend on doing a series of blogs to talk about writing debugger extensions over the coming year. As we blog about writing extension we will include the code in ODbgExt (This project). This is the same group that runs the
    http://blogs.msdn.com/ntdebugging blog. We encourage you to sign up and contribute your debugger extension ideas and or code to the project. Simply create a codeplex account and request access.


    Things we would like to include:


    • A Graphical representation of Kernel, and User mode execution time by process and thread, while correlating to idle / non-idle time
    • A Graphical representation of idle time for kernel and user mode.
    • A Graphical representation of pool resources used on a per process bases along with handle table counts.
    • Hang detection, scanning for various conditions that could cause hangs in a system or user mode process.
    • Detailed system information such as oldest and newest binaries.
    • Binary info based on vendor name
    • A better dissembler that colorizes calls, jumps and indents to show code flow
    • Extensive use of DML (Debugger Mark-up Language) to enable more point and click debugging within windbg.
    • Support for a SQL Database backend to allow storing information about debugging sessions or binaries in a SQL Database.
    • Warnings when critical thresholds are exceeded such as handle counts over 10,000, Low PTE Conditions, etc.
    • Simplified searching for pool tags in binaries.
    • Support for VBA for Windbg
    • Dump annotation, via dump streaming; the ability to embed data into a dump via the debugger extension and later retrieve it. (Imagine embedded debug notes)
    • Embedding a snapshot of performance data in the dump at the time the dump is taken, ie. CPU, IO etc.

    We hope you are as excited about this project as we are!  Please feel free to suggest more ideas for this project and by all means feel free to sign up and contribute some code!

     

    Thank you, 

     

    Jeff Dailey

    Platforms Global Escalation Services

  • Ntdebugging Blog

    NTDebugging Puzzler 0x00000004: This didn’t puzzle the Debug Ninja, how about you?

    • 12 Comments

    Hello NTDebuggers, we have been very impressed with the responses we’ve gotten to our previous puzzlers so far.  We invited the Debug Ninja to come up with a real challenge for this week.

    This server blue screened with a Bug Check 0xD1: DRIVER_IRQL_NOT_LESS_OR_EQUAL.  The challenge this week is to tell us, as specifically as possible, why it crashed.  We’ll give you a hint, the complete answer is not that ebx+5441F815h points to an invalid address.

    1: kd> .bugcheck

    Bugcheck code 000000D1

    Arguments e074281d 00000002 00000001 ba502493

     

    1: kd> kv

    ChildEBP RetAddr  Args to Child

    f78b6544 ba502493 badb0d00 00000001 00000000 nt!_KiTrap0E+0x2a7 (FPO: [0,0] TrapFrame @ f78b6544)

    f78b65e0 ba50d9d8 00000020 8c32cab8 00000022 tcpip!GetAddrType+0x19f (FPO: [Non-Fpo]) (CONV: stdcall)

    f78b6694 ba50dc56 8c32cab8 8ca71c2c 000005c8 tcpip!IPRcvPacket+0x66c (FPO: [Non-Fpo]) (CONV: stdcall)

    f78b66d4 ba50dd58 00000000 8ca63440 8ca71c0a tcpip!ARPRcvIndicationNew+0x149 (FPO: [Non-Fpo]) (CONV: stdcall)

    f78b6710 bada5550 8c4b53b8 00000000 f78b678c tcpip!ARPRcvPacket+0x68 (FPO: [Non-Fpo]) (CONV: stdcall)

    f78b6764 ba9c614b 8cac2ad0 f78b6784 00000002 NDIS!ethFilterDprIndicateReceivePacket+0x1d2 (FPO: [Non-Fpo]) (CONV: stdcall)

     

    1: kd> .trap f78b6544

    ErrCode = 00000002

    eax=8c32cab8 ebx=8c323008 ecx=00000001 edx=00000001 esi=8ca71c18 edi=f78b6618

    eip=ba502493 esp=f78b65b8 ebp=f78b65e0 iopl=0         nv up ei pl zr na pe nc

    cs=0008  ss=0010  ds=0023  es=0023  fs=0030  gs=0000             efl=00250246

    tcpip!GetAddrType+0x19f:

    ba502493 008b15f84154    add     byte ptr [ebx+5441F815h],cl ds:0023:e074281d=??

     

    The remainder of this puzzler is quite long, click here to see the complete debug notes required to solve this puzzlerPost your responses to this page (and not the complete debug page), so we can keep all the comments together.  Also, we won’t post any of your responses until Friday.


    [Update, posted 4/30/2008]

    Today is Wednesday and so far we haven’t received any complete responses; the partial answers we have received are close but no one has determined the root cause of the problem.  When looking at problems like this it is often useful to ask “why?” when you think you have the answer and think about what other information may be available.  In this situation it is possible to identify a specific problem and even determine appropriate action to fix the problem.

    Perhaps we were a bit overzealous in posting 14 pages of assembly as a puzzler.  Below is an abbreviated debug with less assembly, the link to the complete debug is still available for those of you who want to try the full challenge.  Also, we are going to break our promise not to post responses until Friday and we have posted some responses that we have gotten so far, neither of these responses completely identifies the problem but they are getting close.

     

    1: kd> dd @ebx+5441F815

    e074281d  ???????? ???????? ???????? ????????

    e074282d  ???????? ???????? ???????? ????????

    e074283d  ???????? ???????? ???????? ????????

    e074284d  ???????? ???????? ???????? ????????

    e074285d  ???????? ???????? ???????? ????????

    e074286d  ???????? ???????? ???????? ????????

    e074287d  ???????? ???????? ???????? ????????

    e074288d  ???????? ???????? ???????? ????????

    1: kd> !pte e074281d

                   VA e074281d

    PDE at 00000000C0603818    PTE at 00000000C0703A10

    contains 000000021B980963  contains E154FC1000000400

    pfn 21b980 -G-DA--KWEV                           not valid

                           Proto: 00000000E154FC10

     

    1: kd> ub ba50d9d8

    tcpip!IPRcvPacket+0x658:

    ba50d9c4 51              push    ecx

    ba50d9c5 50              push    eax

    ba50d9c6 53              push    ebx

    ba50d9c7 ff75f4          push    dword ptr [ebp-0Ch]

    ba50d9ca ff75f8          push    dword ptr [ebp-8]

    ba50d9cd ff75ec          push    dword ptr [ebp-14h]

    ba50d9d0 ff7508          push    dword ptr [ebp+8]

    ba50d9d3 e84a040000      call    tcpip!DeliverToUser (ba50de22)

     

    1: kd> uf tcpip!DeliverToUser

    tcpip!DeliverToUser:

    tcpip!DeliverToUser+0xc1:

    ba50ded9 8b45f0          mov     eax,dword ptr [ebp-10h]

    ba50dedc 29451c          sub     dword ptr [ebp+1Ch],eax

    ba50dedf 8b471c          mov     eax,dword ptr [edi+1Ch]

    ba50dee2 ff7508          push    dword ptr [ebp+8]

    ba50dee5 2b4510          sub     eax,dword ptr [ebp+10h]

    ba50dee8 57              push    edi

    ba50dee9 8945fc          mov     dword ptr [ebp-4],eax

    ba50deec e817ffffff      call    tcpip!UpdateIPSecRcvBuf (ba50de08)

     

    tcpip!DeliverToUser+0xd9:

    ba50def1 807e0906        cmp     byte ptr [esi+9],6

    ba50def5 753e            jne     tcpip!DeliverToUser+0x12f (ba50df35)

     

    tcpip!DeliverToUser+0xdf:

    ba50def7 8b430c          mov     eax,dword ptr [ebx+0Ch]

    ba50defa 83b88c00000007  cmp     dword ptr [eax+8Ch],7

    ba50df01 0f857a510000    jne     tcpip!DeliverToUser+0x124 (ba513081)

     

    tcpip!DeliverToUser+0xeb:

    ba50df07 837f1000        cmp     dword ptr [edi+10h],0

    ba50df0b 741a            je      tcpip!DeliverToUser+0x114 (ba50df27)

     

    tcpip!DeliverToUser+0xf1:

    ba50df0d 640fb61551000000 movzx   edx,byte ptr fs:[51h]

    ba50df15 8b8880010000    mov     ecx,dword ptr [eax+180h]

    ba50df1b 3bca            cmp     ecx,edx

    ba50df1d 0f857145ffff    jne     tcpip!DeliverToUser+0x103 (ba502494)

     

    tcpip!DeliverToUser+0x103:

    ba502494 8b15f84154ba    mov     edx,dword ptr [tcpip!_imp__KeNumberProcessors (ba5441f8)]

    ba50249a 0fbe12          movsx   edx,byte ptr [edx]

    ba50249d 3bca            cmp     ecx,edx

    ba50249f 0f8582ba0000    jne     tcpip!DeliverToUser+0x114 (ba50df27)

     

    tcpip!DeliverToUser+0x110:

    ba50df23 c645f402        mov     byte ptr [ebp-0Ch],2

     

    tcpip!DeliverToUser+0x114:

    ba50df27 640fb60d51000000 movzx   ecx,byte ptr fs:[51h]

    ba50df2f 898880010000    mov     dword ptr [eax+180h],ecx

     

    tcpip!DeliverToUser+0x114:

    ba5024a5 e979ba0000      jmp     tcpip!DeliverToUser+0x110 (ba50df23)

     


     

    [Update: our answer, posted 5/2/2008]

    In response:  I would like to congratulate the people who figured this out and say thank you to all of the people who gave this a try and sent in a response.  I’m not sure if we made a challenge that was too easy, too difficult, or too long, but we did not get as many responses as we had expected.  If you looked at this and chose not to post a comment, or you have looked at our previous puzzlers but chose to skip this one, let us know why so we can post better puzzlers in the future.

    Here is the answer…

    Experienced assembly readers will recognize that this instruction is a valid instruction, but it takes action that no assembly writer would ever take.  An offset of 5441F815 is too big to be realistic; no structure is 5441F815 bytes long.

    1: kd> .trap f78b6544

    ErrCode = 00000002

    eax=8c32cab8 ebx=8c323008 ecx=00000001 edx=00000001 esi=8ca71c18 edi=f78b6618

    eip=ba502493 esp=f78b65b8 ebp=f78b65e0 iopl=0         nv up ei pl zr na pe nc

    cs=0008  ss=0010  ds=0023  es=0023  fs=0030  gs=0000             efl=00250246

    tcpip!GetAddrType+0x19f:

    ba502493 008b15f84154    add     byte ptr [ebx+5441F815h],cl ds:0023:e074281d=??

     

    Why is this instruction telling us to go to a crazy offset?  One possible cause is that the instruction is corrupt; a good debug method for this problem would be to unassemble the code around this instruction to see if the adjacent instructions look valid, and then to compare these instructions to ones in a known good version of tcpip.sys.  Because this instruction was not corrupt neither of these steps helped, so I started looking at how we got to this instruction.

    Looking at the prior function on the stack, IPRcvPacket, I see that we actually called DeliverToUser and not GetAddrType.  Why is our current instruction in GetAddrType instead of DeliverToUser?  It is possible that the call went to the wrong place, that DeliverToUser called something else which returned to the wrong place, or that somehow the flow of execution did not go as intended.

    1: kd> ub ba50d9d8

    tcpip!IPRcvPacket+0x658:

    ba50d9c4 51              push    ecx

    ba50d9c5 50              push    eax

    ba50d9c6 53              push    ebx

    ba50d9c7 ff75f4          push    dword ptr [ebp-0Ch]

    ba50d9ca ff75f8          push    dword ptr [ebp-8]

    ba50d9cd ff75ec          push    dword ptr [ebp-14h]

    ba50d9d0 ff7508          push    dword ptr [ebp+8]

    ba50d9d3 e84a040000      call    tcpip!DeliverToUser (ba50de22)

     

    The next step I took was to analyze the DeliverToUser assembly to see how we might have ended up in GetAddrType instead of DeliverToUser.  What I found, and what several of our readers found, was that DeliverToUser contained a jump that went to 1 byte away from where we encountered the bogus instruction.

    ba50df1d 0f857145ffff    jne     tcpip!DeliverToUser+0x103 (ba502494)

     

    Why would eip point to ba50d9d3 instead of ba50d9d4?  That is the challenging question in this exercise, because the answer is not obvious.  Some engineers will conclude that the hardware must not have done what we asked it to do, and say that the hardware must be at fault.  However, we don’t really have any proof at this time that the hardware has actually done something wrong.  Hardware problems usually manifest themselves as a bit flip, where one bit is changed from the correct value.  We have gone from a 4 (0100) to a 3 (0011); a 3 bit change is hard to blame on hardware.

    To answer the question “Why would eip point to ba50d9d3 instead of ba50d9d4?” we need to carefully analyze the instruction at ba50df1d.  I broke out my trusty instruction set reference and looked up the inner workings of the jne instruction.  On page 3-542 of the Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 2A: Instruction Set Reference, A-M I found the layout for a jne to a 32 bit offset.  The first two bytes are 0F 85, and the next 8 bytes are the offset we are going to jump too.  This is our opcode:

    0f 85 ff ff 45 71

    Remember that by the time we make this jump the instruction pointer will have been moved to the next instruction, that will be at ba50df23 (ba50df1d plus the 6 bytes that are in this instruction).  If we take this offset (ff ff 45 71) and add it to the next instruction we see that we’ll end up at ba502494 as expected.

    1: kd> ?ba50df23+ffff4571

    Evaluate expression: -1169152876 = ba502494

     

    If we change just one bit in this opcode, if we change 71 (0111 0001) to be 70 (0111 0000), we end up at ba502493 instead.  Now we can conclude that the hardware must have made a mistake, and now that we know what bit was flipped we can confidently say that we have identified the mistake that was made.

    1: kd> ?ba50df23+ffff4570

    Evaluate expression: -1169152877 = ba502493

     

    We still have some unanswered questions.  The most common question I get when I’ve found a bit flip is what piece of hardware is broken?  Unfortunately I can’t answer this question with the data I have.  In this case the jne instruction was good when we read it to put it in the dump but obviously it was bad when it was executed.  Thinking about the parts needed to get the instruction from RAM and execute it I could speculate that a processor, riser board, or motherboard could be at fault but I don’t have any data that proves where the problem happened.  It is unlikely that hardware diagnostics will find this type of failure because it is intermittent.  Unless the problem manifests itself while the diagnostics are running, they will likely report that there is no problem.  In the case of this server I wasn’t able to find out specifically what failed, uptime was more important than troubleshooting so a whole new server was put in place of this one and the problem has not been seen since.

    Bit flips can be caused by any number of problems.  Problems with a processor, a processor riser board, a motherboard, or RAM are common causes.  As we have documented before in this blog improper voltage or overheating can also cause these types of problems.  However the potential list of bit-flip causes is nearly endless; power supplies, power cables, incoming power fluctuations, radiation from nearby electronics, even sun spots, can all cause bit flips.

    I should also explain why the stack showed that we were in GetAddrType instead of DeliverToUser.  This is because tcpip.sys has been optimized with Basic Block Tools (BBT) optimization, blocks of code from the same function are located in different pages to reduce the cache hits in common code paths.  This means that blocks of the GetAddrType code are located next to the DeliverToUser code.  As demonstrated below there is a block of GetAddrType code that ends with an unconditional jmp at ba50248f and there is a block of DeliverToUser code that starts at ba502494.

    1: kd> ub ba502494

    tcpip!GetAddrType+0x9e:

    ba502474 f6401401        test    byte ptr [eax+14h],1

    ba502478 0f84bd760000    je      tcpip!GetAddrType+0xb4 (ba509b3b)

    ba50247e 85db            test    ebx,ebx

    ba502480 0f84bd6b0200    je      tcpip!GetAddrType+0xa8 (ba529043)

    ba502486 32c0            xor     al,al

    ba502488 e967770000      jmp     tcpip!GetAddrType+0x15a (ba509bf4)

    ba50248d 33c9            xor     ecx,ecx

    ba50248f e993770000      jmp     tcpip!GetAddrType+0x19b (ba509c27)

    1: kd> u

    tcpip!DeliverToUser+0x103:

    ba502494 8b15f84154ba    mov     edx,dword ptr [tcpip!_imp__KeNumberProcessors (ba5441f8)]

    ba50249a 0fbe12          movsx   edx,byte ptr [edx]

    ba50249d 3bca            cmp     ecx,edx

    ba50249f 0f8582ba0000    jne     tcpip!DeliverToUser+0x114 (ba50df27)

    ba5024a5 e979ba0000      jmp     tcpip!DeliverToUser+0x110 (ba50df23)

    ba5024aa 90              nop

    ba5024ab 90              nop

    ba5024ac 90              nop

     

    Congratulations to those who have read this far.  Hopefully you now understand the cause of a problem that has stumped many engineers, and you have some new tools in your toolbox when you’re investigating a tricky problem.  Remember to keep asking “why?” and not to assume anything; those are the approaches I used to find the source of this problem.





  • Ntdebugging Blog

    The Debug Ninja speaks: Debugging a stop 0x20

    • 3 Comments

    Hello, I am the Debug Ninja.  Recently Jeff approached me about contributing to this debugging blog, and as the Debug Ninja I felt an obligation to share at least a small amount of Ninja knowledge with the world.  Today I will start by explaining how to debug stop 20 blue screens.  Unlike typical blue screens where debugging starts with stack analysis, a stop 20 requires a different approach.

     

    Now you are probably wondering, “Great Debug Ninja, what is a stop 20 blue screen?”  A stop 20’s literal translation is KERNEL_APC_PENDING_DURING_EXIT.  In common language that means that we attempted to terminate a thread while Asynchronous Procedure Calls were disabled for this thread.  The operating system forces a bugcheck under these conditions because if APCs are disabled at thread termination it means a driver has a bug that disabled APCs more times than it enabled them.  Usually these bugs result in difficult to debug crashes or hangs later, so we stop the system at thread termination to make debugging easier.

     

    Perhaps you are now asking “How might a driver disable APC’s more times than it enables them?”  Good question Grasshopper.  As described in the WDK, a driver can disable APCs by entering a critical region, a guarded region, or by raising the IRQL to APC_LEVEL or higher.  However, not all of those methods will result in a stop 20 bugcheck.  Only calls that change the APC disable count in the KTHREAD structure can result in a stop 20.  The APIs KeEnterCriticalRegion, KeWaitForSingleObject, KeWaitForMultipleObjects, KeWaitForMutexObject, or FsRtlEnterFileSystem will decrement the APC disable count.  A driver should then call KeLeaveCriticalRegion, KeReleaseMutex, or FsRtlExitFileSystem to re-enable APCs; these calls increment the APC disable count in the KTHREAD structure.

     

    As you review the APIs mentioned above you will see that there are several ways for a driver writer to get into a situation where APCs are disabled and not re-enabled.  Many of the ways we get into this situation are difficult to debug and require instrumentation that is beyond the scope of this blog.  In this blog we are going to focus on the most common cause a stop 20 blue screen, an orphaned ERESOURCE.  A brief review of the WDK documentation for ExAcquireResourceExclusiveLite and ExAcquireResourceSharedLite will reveal that before you can acquire an ERESOURCE you must first disable normal kernel APC delivery by calling KeEnterCriticalRegion.  This means that if you orphan an ERESOURCE you will leave the APC disable count decremented, and when the thread is terminated the system will bugcheck.

     

    Now you certainly want to ask “Kind Ninja, will you show me how to debug such a problem?”  Absolutely Grasshopper!

     

    We start by opening the dump and checking the cause of the crash.

     

    1: kd> .bugcheck

    Bugcheck code 00000020

    Arguments 00000000 0000fffc 00000000 00000001

     

    Next we check what thread was being terminated; we can see this in the call stack as the first parameter to PspTerminateThreadByPointer.

     

    1: kd> kb

    ChildEBP RetAddr  Args to Child

    b5e57c80 8094c546 00000020 00000000 0000fffc nt!KeBugCheckEx+0x1b

    b5e57d18 8094c63f 00000000 00000000 8bf99330 nt!PspExitThread+0x64c

    b5e57d30 8094c991 8bf99330 00000000 00000001 nt!PspTerminateThreadByPointer+0x4b

    b5e57d54 8088978c 00000000 00000000 05c2ffb8 nt!NtTerminateThread+0x71

    b5e57d54 7c8285ec 00000000 00000000 05c2ffb8 nt!KiFastCallEntry+0xfc

     

    Finally we can look at the list of ERESOURCE structures with !locks to see if our thread owns any of these locks.

     

    1: kd> !locks

    **** DUMP OF ALL RESOURCE OBJECTS ****

    KD: Scanning for held locks....

     

    Resource @ Ninja!NinjaLock (0x808a48c0)    Shared 2 owning threads

        Contention Count = 35

         Threads: 8bf99330-02<*> 8c1d19f0-01<*>

     

    !locks shows us that the thread in question is a shared owner of the Ninja driver’s NinjaLock.  The author of the Ninja driver needs to look at how their driver uses this ERESOURCE and determine why the lock was orphaned, unfortunately that means I need to do more work.  To find the bug that caused this problem I reviewed the code that uses NinjaLock.  That code was acquiring the NinjaLock inside of a try-except block.  I forgot to release the lock in the exception handler, resulting in the orphaned lock that we see here.  I guess that’s why I’m the Debug Ninja, and not the Code Writing Ninja.

  • Ntdebugging Blog

    NTDebugging Puzzler 0x00000003 (Matrix Edition) Some assembly required.

    • 37 Comments


    Hello NTdebuggers, I'm very impressed with the depth of the answers we are seeing from our readers.  As I stated in last week's response, this week's puzzler is going to be harder.  With that said let's take it up a notch.  One of the things that is really cool about be an Escalation Engineer in GES/CPR is how far we go in the pursuit of solving complex problems.  If we're debugging some Microsoft code in a kernel dump or user mode, and our quest takes us into a binary that we don't have code or symbols for, we don't stop, we forge on!  Over the years there are members of our team that have had to port to or support Alpha, PowerPC, MIPs, IA64 and x64, myself included.  As a result most of us have books for just about every mainstream processor under the sun.  It's a good idea if you're going to be debugging on these platforms to have general working knowledge the CPUs you will encounter.  The most common CPU's we deal with are x86, followed by x64 and IA64.  Microsoft doesn't support PPC, MIPS or Alpha anymore unless you're dealing with Xbox consoles, and those are PPC.  That said, this week's challenge is to tell us what the following assembly does.  You can tell us in C, or break it down and comment on the various sections. 


    Some people like cross word puzzles, Most of us in GES/CPR love to esreveR reenignE assembler. Have FUN!

     

    “I don’t even see the code anymore”  Cypher...
    
    0:000> uf myfun
    puzzler3!myfun [c:\source\puzzler\puzzler3\puzzler3\puzzler3.cpp @ 20]:
       20 00cc1480 55              push    ebp
       20 00cc1481 8bec            mov     ebp,esp
       20 00cc1483 81ecf0000000    sub     esp,0F0h
       20 00cc1489 53              push    ebx
       20 00cc148a 56              push    esi
       20 00cc148b 57              push    edi
       20 00cc148c 8dbd10ffffff    lea     edi,[ebp-0F0h]
       20 00cc1492 b93c000000      mov     ecx,3Ch
       20 00cc1497 b8cccccccc      mov     eax,0CCCCCCCCh
       20 00cc149c f3ab            rep stos dword ptr es:[edi]
       26 00cc149e 8b4508          mov     eax,dword ptr [ebp+8]
       26 00cc14a1 50              push    eax
       26 00cc14a2 e803fcffff      call    puzzler3!ILT+165(_strlen) (00cc10aa)
       26 00cc14a7 83c404          add     esp,4
       26 00cc14aa 8945e0          mov     dword ptr [ebp-20h],eax
       28 00cc14ad 8b45e0          mov     eax,dword ptr [ebp-20h]
       28 00cc14b0 8945f8          mov     dword ptr [ebp-8],eax
       28 00cc14b3 eb09            jmp     puzzler3!myfun+0x3e (00cc14be)
       28 00cc14b5 8b45f8          mov     eax,dword ptr [ebp-8]
       28 00cc14b8 83e801          sub     eax,1
       28 00cc14bb 8945f8          mov     dword ptr [ebp-8],eax
       28 00cc14be 837df800        cmp     dword ptr [ebp-8],0
       28 00cc14c2 7e60            jle     puzzler3!myfun+0xa4 (00cc1524)
       30 00cc14c4 c745ec00000000  mov     dword ptr [ebp-14h],0
       30 00cc14cb eb09            jmp     puzzler3!myfun+0x56 (00cc14d6)
       30 00cc14cd 8b45ec          mov     eax,dword ptr [ebp-14h]
       30 00cc14d0 83c001          add     eax,1
       30 00cc14d3 8945ec          mov     dword ptr [ebp-14h],eax
       30 00cc14d6 8b45f8          mov     eax,dword ptr [ebp-8]
       30 00cc14d9 83e801          sub     eax,1
       30 00cc14dc 3945ec          cmp     dword ptr [ebp-14h],eax
       30 00cc14df 7d41            jge     puzzler3!myfun+0xa2 (00cc1522)
       32 00cc14e1 8b4508          mov     eax,dword ptr [ebp+8]
       32 00cc14e4 0345ec          add     eax,dword ptr [ebp-14h]
       32 00cc14e7 0fbe08          movsx   ecx,byte ptr [eax]
       32 00cc14ea 8b5508          mov     edx,dword ptr [ebp+8]
       32 00cc14ed 0355ec          add     edx,dword ptr [ebp-14h]
       32 00cc14f0 0fbe4201        movsx   eax,byte ptr [edx+1]
       32 00cc14f4 3bc8            cmp     ecx,eax
       32 00cc14f6 7e28            jle     puzzler3!myfun+0xa0 (00cc1520)
       34 00cc14f8 8b4508          mov     eax,dword ptr [ebp+8]
       34 00cc14fb 0345ec          add     eax,dword ptr [ebp-14h]
       34 00cc14fe 8a08            mov     cl,byte ptr [eax]
       34 00cc1500 884dd7          mov     byte ptr [ebp-29h],cl
       35 00cc1503 8b4508          mov     eax,dword ptr [ebp+8]
       35 00cc1506 0345ec          add     eax,dword ptr [ebp-14h]
       35 00cc1509 8b4d08          mov     ecx,dword ptr [ebp+8]
       35 00cc150c 034dec          add     ecx,dword ptr [ebp-14h]
       35 00cc150f 8a5101          mov     dl,byte ptr [ecx+1]
       35 00cc1512 8810            mov     byte ptr [eax],dl
       36 00cc1514 8b4508          mov     eax,dword ptr [ebp+8]
       36 00cc1517 0345ec          add     eax,dword ptr [ebp-14h]
       36 00cc151a 8a4dd7          mov     cl,byte ptr [ebp-29h]
       36 00cc151d 884801          mov     byte ptr [eax+1],cl
       38 00cc1520 ebab            jmp     puzzler3!myfun+0x4d (00cc14cd)
       40 00cc1522 eb91            jmp     puzzler3!myfun+0x35 (00cc14b5)
       41 00cc1524 5f              pop     edi
       41 00cc1525 5e              pop     esi
       41 00cc1526 5b              pop     ebx
       41 00cc1527 81c4f0000000    add     esp,0F0h
       41 00cc152d 3bec            cmp     ebp,esp
       41 00cc152f e820fcffff      call    puzzler3!ILT+335(__RTC_CheckEsp) (00cc1154)
       41 00cc1534 8be5            mov     esp,ebp
       41 00cc1536 5d              pop     ebp
       41 00cc1537 c3              ret
    
    

    Good luck, and happy debugging.

    Jeff Dailey-




    In response:  Wow, you folks did it again. I was worried that not many of our readers would respond.  Our entire team was very impressed with the number and quality of the responses we saw.  Congratulations goes out to all those assembler gurus out there that figured out this was a simple bubble sort.  We enjoyed seeing how various people went about solving this.  Some people compiled their code as they worked on reversing the function to verify the assembler.  This is a good approach.  Others just seemed to work it out end to end.  This is the approach I usually end up using because I’m typically in the middle of a debug and don’t actually need the source.

     

    Great work!

     

    Here is the answer….

     

     

    void myfun(char *val)

    {

    00321480  push        ebp 

    00321481  mov         ebp,esp

    00321483  sub         esp,0F0h

    00321489  push        ebx 

    0032148A  push        esi 

    0032148B  push        edi 

    0032148C  lea         edi,[ebp-0F0h]

    00321492  mov         ecx,3Ch

    00321497  mov         eax,0CCCCCCCCh

    0032149C  rep stos    dword ptr es:[edi]

           int i;

           int j;

           int len;

           char t;

     

           len=strlen(val);

    0032149E  mov         eax,dword ptr [val]

    003214A1  push        eax 

    003214A2  call        @ILT+165(_strlen) (3210AAh)

    003214A7  add         esp,4

    003214AA  mov         dword ptr [len],eax

     

           for (i=len;i>0;i--)

    003214AD  mov         eax,dword ptr [len]

    003214B0  mov         dword ptr [i],eax

    003214B3  jmp         myfun+3Eh (3214BEh)

    003214B5  mov         eax,dword ptr [i]

    003214B8  sub         eax,1

    003214BB  mov         dword ptr [i],eax

    003214BE  cmp         dword ptr [i],0

    003214C2  jle         myfun+0A4h (321524h)

           {

                  for(j=0;j<i-1;j++)

    003214C4  mov         dword ptr [j],0

    003214CB  jmp         myfun+56h (3214D6h)

    003214CD  mov         eax,dword ptr [j]

    003214D0  add         eax,1

    003214D3  mov         dword ptr [j],eax

    003214D6  mov         eax,dword ptr [i]

    003214D9  sub         eax,1

    003214DC  cmp         dword ptr [j],eax

    003214DF  jge         myfun+0A2h (321522h)

                  {

                         if (val[j]>val[j+1])

    003214E1  mov         eax,dword ptr [val]

    003214E4  add         eax,dword ptr [j]

    003214E7  movsx       ecx,byte ptr [eax]

    003214EA  mov         edx,dword ptr [val]

    003214ED  add         edx,dword ptr [j]

    003214F0  movsx       eax,byte ptr [edx+1]

    003214F4  cmp         ecx,eax

    003214F6  jle         myfun+0A0h (321520h)

                         {

                               t=val[j];

    003214F8  mov         eax,dword ptr [val]

    003214FB  add         eax,dword ptr [j]

    003214FE  mov         cl,byte ptr [eax]

    00321500  mov         byte ptr [t],cl

                               val[j]=val[j+1];

    00321503  mov         eax,dword ptr [val]

    00321506  add         eax,dword ptr [j]

    00321509  mov         ecx,dword ptr [val]

    0032150C  add         ecx,dword ptr [j]

    0032150F  mov         dl,byte ptr [ecx+1]

    00321512  mov         byte ptr [eax],dl

                               val[j+1]=t;

    00321514  mov         eax,dword ptr [val]

    00321517  add         eax,dword ptr [j]

    0032151A  mov         cl,byte ptr [t]

    0032151D  mov         byte ptr [eax+1],cl

                         }

                  }

    00321520  jmp         myfun+4Dh (3214CDh)

          

           }

    00321522  jmp         myfun+35h (3214B5h)

    }

    00321524  pop         edi 

    00321525  pop         esi 

    00321526  pop         ebx 

    00321527  add         esp,0F0h

    0032152D  cmp         ebp,esp

    0032152F  call        @ILT+335(__RTC_CheckEsp) (321154h)

    00321534  mov         esp,ebp

    00321536  pop         ebp 

    00321537  ret         

     

    Thank You

    Jeff Dailey-

  • Ntdebugging Blog

    More dump forensics, understanding !locks, in this case a filter driver problem

    • 3 Comments

     

    Written by Jeff Dailey: 

     

    Hello NTDebuggers, one of the most important things to understand in kernel debugging hung servers is the output of !locks.  There can be a lot of data and it’s not always clear what is going on.  One of the things I like to do in order to better understand the output is to use a visual representation of the resources involved and the threads that are blocking on those resources.   Before we can do that we need to understand what to look for so we can document it in our diagram. 

     

    It’s a good idea to understand ERESOURCEs in general l before jumping into !locks.  The following MSDN article goes into lots of great detail.  http://msdn2.microsoft.com/en-us/library/aa490224.aspx

     

    Simply put, you will typically see threads either with access to or trying to gain access to resources.   If a thread has access to a resource it will be marked by <*>.   Threads that have access to a resource can block other threads from gaining access to said resource.

     

    You will see threads waiting for shared access.  These threads do not have the <*> and listed above the threads that are Waiting on Exclusive Access.

     

    You will also see threads that are Waiting on Exclusive Access.  These threads are typically blocked waiting for the threads that have access or ownership of the resource to release it.

     

    Let’s take a look at one section of !locks output and annotate each thread section...

     

    Resource @ 0x896d2a68    Shared 1 owning threads  << This info is the ERESOURCE in question.      Contention Count = 15292  << The amount of contention for the object.

        NumberOfSharedWaiters = 1  << This is self explanatory

        NumberOfExclusiveWaiters = 39 << Number of exclusive waiters in the Ex Waiter List

         Threads: 89bd1234-01<*> 896d2020-01   << We have two threads here.  The owner, or shared owner <*>89bd1234 and the shared Waiter 896d2020

         Threads Waiting On Exclusive Access:

                  888ed020       87c036f8       885dc7a0       8bc538b0  << All of these threads are waiting on exclusive access.    

                  88e8cda0       88796988       8905fda0       8974dc10      

     

     

    Note the following output is completely fabricated, so alignment and variable names may not be valid.

     

    The following is some sample output from !locks.  In this scenario I document any ERESOURCE that has any threads waiting on exclusive access.  I also document the ERESOURCES as nodes and show the relationship to the Threads.  The key point is to show the threads involved, the resources they own, and the resources they are blocked on or trying to get exclusive access to.  Ultimately you need to work your way toward the head of the blocking chain of events to figure out what is holding up the entire chain of execution from moving forward.

     

    In this case you will see that a filter driver called MYFILTER has passed an invalid object to KeWaitForSingleObject.  As a result the thread blocked and all the other threads and processes related to those threads froze and could not move forward.  The machine was completely hung.

     

     

    1: kd> !locks

    **** DUMP OF ALL RESOURCE OBJECTS ****

    KD: Scanning for held locks......

     

    Resource @ 0x8a50ee98    Shared 4 owning threads

         Threads: 896856d0-01<*> 89686778-01<*> 896862d0-01<*> 89685da0-01<*>

    KD: Scanning for held locks............................................................

     

    Resource @ 0x896dabcd    Exclusively owned

         Threads: 886e5678-01<*>

    KD: Scanning for held locks..

     

    Resource @ 0x896d2a68    Shared 1 owning threads

        Contention Count = 15292

        NumberOfSharedWaiters = 1

        NumberOfExclusiveWaiters = 39

         Threads: 89bd1234-01<*> 896d2020-01   

         Threads Waiting On Exclusive Access:

                  888ed020       87c036f8       885dc7a0       8bc538b0       

                  88e8cda0       88796988       8905fda0       8974dc10      

                  88d78020       87a7dda0       88b85b20       87b78020      

                  8936e8a0       87dd7ae8       886005a0       88557890      

                  887b3680       87cc2790       87dd4050       87fad8a0      

                  88179580       87b53d70       87cd2775       88ba0578      

                  87b676f8       8886b560       87f68388       89681da0      

                  88952720       888833c0      

     

    KD: Scanning for held locks................

     

    Resource @ 0x8959c790    Exclusively owned

        Contention Count = 4827

        NumberOfExclusiveWaiters = 35

         Threads: 89bd1234-01<*>

         Threads Waiting On Exclusive Access:

                  883e3aa0       88873020       88290020       87f5f588      

                  888154f0       88bd4b28       88cbc448       884bd6c8      

                  881e5da0       8935f518       87bcc978       8889e020      

                  88cb3020       88c92178       87cf9020       88daaac0      

                  89376020       88fe9020       887b29d0       87b6f7f0      

                  87e12020       87b4f498       894ee730       88810020      

                  881a8020       87dd55f0       888d3020       885f6da0      

                  881f7da0       880742e8       87a31b50       879ffb50      

                  88451da0       88646da0       8833a020      

     

    KD: Scanning for held locks.....................................................

    Resource @ 0x88ce81ff    Exclusively owned

        Contention Count = 108

         Threads: 87ad6f78-01<*>

    KD: Scanning for held locks......................................

     

    Resource @ 0x87da48fb    Exclusively owned

         Threads: 87bddda0-01<*>

    KD: Scanning for held locks.

     

    Resource @ 0x87df455c    Exclusively owned

        Contention Count = 2

        NumberOfExclusiveWaiters = 2

         Threads: 886e5678-01<*>

         Threads Waiting On Exclusive Access:

                  89bd1234       87ad6c68      

     

    KD: Scanning for held locks............................................

     

    Resource @ 0x87fcfe30    Shared 1 owning threads

         Threads: 8a60f8a3-01<*> *** Actual Thread 8a60f8a0

    KD: Scanning for held locks...........

     

    Resource @ 0x880ef1cd    Shared 1 owning threads

         Threads: 8a60c3af-01<*> *** Actual Thread 8a60c3a0

    KD: Scanning for held locks.

    27044 total locks, 9 locks currently held

    clip_image002[4]

    Good luck and happy debugging.

  • Ntdebugging Blog

    Would you be interested in a live chat session on debugging techniques?

    • 9 Comments

    We’ve been thinking about hosting a live group chat session to talk with the debugging community.  If we had such a chat, the discussion would focus on debugging techniques and any questions you may have about anything we’ve previously blogged about.  If you’d be interested in participating in a chat session, please answer the survey question below.  Also, feel free to leave a comment on this page if there’s a particular debugging topic you would like for us to cover in the chat.

     


    Thanks everyone for your feedback! This survey is now closed.

  • Ntdebugging Blog

    The Digital DNA of Bugs, Dump Analysis as Forensic Science

    • 2 Comments

    Written by Jeff Dailey:

    As a debugger, have you ever reflected on the interesting parallels between your job and work being done in other industries?  When I think about solving complex computer problems, I think of it as forensics.  The core diagnostics or troubleshooting skill can be applied to anything. I bet if you were to walk around and talk to the people in our group you would find a lot of guys that watch programs like CSI, House, Law and Order etc.  Heck there is even a guy just a few cubicles away that has a House poster on his wall.  Years ago I used to watch Quincy, and other detective shows with the same level of fascination. I started thinking Cops, Coroners, Doctors and Engineers, we all do the same type of work. In fact, at a recent conference we had a forensics expert come in and speak to us. It was really fascinating. Even though the presenter worked in a completely different industry, most of us found each other thinking "Hey I could do that Job!", because in essence, that's what we do already.

    I started having some fun thinking about the parallels between detective shows and our work. 

    Cut to scene:

    It's 2:00AM. The camera zooms in on a pager going off on a night stand…  It can only mean one thing.  Something bad has happened and people are reaching out for help.  The detective wakes up and tells the wife, "Sorry, they need me… I've got to go." 

    Funny,  I've done the same thing, only because someone found a dead server.

    The detective shows up at the scene of the crime.  All the officers on-site are baffled, so they just keep things roped off until the expert gets there.  His years of experience and unique insight will allow him to see things others don't.

    Hmm… This seems familiar only I typically use Live Meeting or Easy Assist...

    Using a combination of specialized tools and methods learned both in school and from other's methods handed down over time, evidence is gathered at the scene so that additional research can be done back at the office.  Witnesses are questioned, "So about what time did this happen?", "Did you hear any unusual noises", "and did you see anyone or anything unusual".  Pictures are taken, Objects are collected, Fibers, and DNA samples are gathered.

    Ok so the scope of the problem is determined and all available data is collected.  Hmm, I do this every day.

    The Mayor calls the office to tell the chief of detectives that we must have this case solved.   It can't happen again.  We must catch the Villain!

    Feel free to substitute Mayor with any high level management figure.  Wow this is either a nasty bad guy or someone's driver is causing pool corruption causing a critical server to crash!

    We now cut to a montage were the detective is in the Lab, using luminal, searching for DNA evidence, reflecting on the core facts of the case, researching past crimes.

    I don't know about you, but I simply refer to this as the debugging process.

    Finally a breakthrough, DNA collected at the scene of the crime identifies a suspect that should not have been at the scene.  In doing additional research the suspect has a history of this type of activity.  The bad guy is hauled in, charges are filed, and the case is solved!

    This would equate to finding root cause, filing a bug, and getting a fix out the door!

    Ultimately that's what we do.  We are all detectives looking for the digital DNA of bugs in the wild affecting our customers.  We hunt them down using tools, expertise, and experience.

    When it comes to collecting critical forensic information and looking for that Digital DNA of a bug it often comes down to getting a dump of the process or system.

    GES (Global Escalation Services, formerly known as CPR) Escalation Engineers have probably looked at more dumps then the average person passes telephone poles in a lifetime.  Don't get me wrong, we do a lot of live debugs also, however dumps are the staple item in our debugging diet.

    To begin, let's go over why we typically ask for a dump.  Customers often think it's drastic to bring down an entire server via a "CRASH DUMP", is it worth it?  The answer is ABSOLUTELY!!!

    About Dumps

       

    Full User-Mode dump

    This dump file includes the entire memory space of a process, the program's executable image itself, the handle table, and other information that will be useful to the debugger.

       

       

    Figure 1: Scope of a Full User-Mode dump

      

    A Kernel Summary Dump

    This dump contains all the memory in use by the kernel at the time of the crash.
    This kind of dump file is significantly smaller than the Complete Memory Dump.

       

    Figure 2: Scope of a Kernel Summary dump

       

    A Full/Complete Memory Dump

    This is the largest dump file. This file contains all the physical memory for the machine at the time of the fault.

       

    Figure 3: Scope of a Full/Complete Memory dump

        

    When you open a dump with WinDbg, you can use the || (pipe pipe) command to determine the type of dump that you are analyzing

       

    So the server is hung, and we request a dump. The dump can be created using a NMI Card, Crash.exe, or CTRL+SCRL LOCK+SCRL LOCK.

    If the problem does not directly involve any user mode processes or we suspect there is a driver at fault we may simply ask for Kernel Summary dump. 

    If the problem involves user mode (application) code along with actions taking place in the kernel, or there are multiple process on the same machine making cross process calls, we will need the full user and kernel mode addresses ranges from the machine (Full Memory dump) so we can debug into the various user mode parts of the code that were being executed at the time the crash dump was captured.

    Now that you have crashed your server and collected a dump for us to look at, you may ask what we can tell from the full / kernel dump.  Well the following is just a small fraction:

    • The state of every thread call stack in every process on the machine.
    • The state of every lock on the machine.
    • The destination for RPC, DCOM, and LPC calls.
    • The CPU Usage of every thread in every process, how much kernel time vs. user time, how long since it last executed.
    • How long has a thread or process been alive?
    • The version, date, size, and checksum for binaries loaded in memory (not paged out) at the time of the dump.
    • What process spawned what process or thread?
    • How much kernel and user mode memory is being used by any given process?
    • What programs are being run in what session?
    • What connections are currently made to what remote resources, things like IP Address and ports?
    • What files are open on the machine, locally and remotely?
    • How long the server has been up.
    • We can reconstruct bitmaps from GDI surfaces.
    • We can dump running scripts.
    • How long any I/O has been outstanding to any device.
    • If any I/O or device has had an I/O error.
    • What the last error was on any thread on the machine.
    • We can figure out who is corrupting memory, if memory corruption detection technologies are employed before the dump is taken.
    • We can at times figure out what driver or process is leaking memory.
    • What handles are being leaked.
    • In most cases we can match the line of source executing during every stack frame of every call to the windows source code and determine why each discreet call in the call stack was made.
    • We can dump parts of the Windows Event log.
    • We can dump out all of the SMB connections info and we can tell what types of activity is happening in the server services.
    • We can integrate the state of the system cache.
    • We can find out what filter drivers are in the I/O chain.

    What about LiveKd?

    Sometimes we need to collect a kernel dump from a server and we are just not able to actually crash the server.  Either the customer does not want us to, or their business will not allow it.  In this case we can use the Sysinternals tool LiveKd.  LiveKd Works with our kernel debugger by installing a device driver and extracting all the memory out of kernel to user-mode so that the debugger can open the rolling snapshot of the kernel.  You can then write this dump file to disk by doing a .dump /f C:\livekd.dmp.  While this will not crash or halt the server, it does give us some kernel information.

    Words of Caution!

    LiveKd does not provide an atomic snap shot of the server because while the driver is collecting the memory from the kernel the operating system is still running. It can take several minutes to get a LiveKd Dump and during this time LiveKd may start reading memory for structures such as lists, arrays, threads and other items that may be changing during the collection of the snapshot.  This being the case, the timing data within the dump is no longer valid.  For example: You are not able to see if one thread has been waiting longer than another based on its idle ticks because the items were not collected simultaneously.  Also, linked lists and pool may appear corrupt or inconsistent in the dump.  This may make some debugger extensions loop endlessly or even crash. However you can use this output to get an idea of what is going on in more general terms in respect to static variables such as handle counts in handle tables.  You can often dump out handles, look at how many threads are in various processes and the types of things they were doing during the window the dump was collected.  It's just very difficult to draw definitive conclusions from this type of dump.  However, in those cases where we have no other option, using LiveKd to get a dump can provide valuable information.

    About User Dumps

    So what about "user dumps"?

    Why, when and how?  We ask for user dumps if we know the problem is limited to just one process having trouble.  User dumps are usually associated with High CPU, Memory Corruption, Memory Consumption, or a Hung process.  A user dump is only providing memory from a single process vs. all processes (See Figure 1 above).  We can only debug the process you are targeting. 

    High CPU: "Three in a Row and we're Good to Go"

    When a process is consuming a lot of CPU we will typically ask for 3-4 dumps, and these are usually taken 10-15 seconds apart.  We recommend using either userdump.exe, adplus –hang (process id) or by attaching windbg.exe to the process and doing a .dump /ma C:\dump1.dmp ,2 ,3, etc.  The /ma switch collects some extra information from the kernel and stores it in the user dump.  Without this extra data in the user dump we would not be able to get things like thread execution time and handle information.  The thread execution time allows the !runaway and .ttime commands to work on the dump.  Without this data we could not tell what thread is consuming CPU. 

    We can then open the successive dumps and check the various states of the threads in the process over time.  If one thread is constantly changing in each dump and consuming more and more CPU this is the thread that is typically at fault.  We then examine the reason each call is being made in that thread's context.  We also check for things like the last error that occurred on that thread using !gle (Get Last Error).

    When it's Hung we Just Need One

    When a process is hung it's typically due to a deadlock, or an application making a call in its window proc that is blocking preventing a repaint event from happening.  In this type of case we need to get a dump once the application has become unresponsive.  You can use WinDbg and dump the process via .dump /ma, use userdump.exe, or adplus –hang.  Once we get this user dump we typically will look for a thread waiting on a critical section, event, semaphore or system call.  You can pretty quickly get an idea of what is going on by doing a ~*kv and looking at the various thread states.  If you see critical sections being waited on you can run !locks and it will tell you what the lock dependency is.

    Adplus –crash and You'll Save Some Cash

    With a process that is crashing, either due to memory corruption, divide by zero, access violations, or any number of potential unhandled exceptions we need to have either a debugger attached ahead of time, or adplus –crash monitoring the process in question.  If we get a dump of a process before it has crashed, or after it has crashed and restarted it will not show us what we need to know.  We need to be watching the process ahead of time so we can catch the dump in state at the time of the event.  Adplus –crash will typically catch a second chance exception.  This is for exceptions that occur in the application code that are not handled by the applications exception handler.  In this case it falls though to the operating system to handle and we then typically tear down the process in question and/or invoke JIT, Just-in-Time debugging (Search for the AEDebug registry key for details).  If you are using windbg.exe and have attached to the process a "Crash" or unhandled exception should halt or break into the debugger.  At this point you can do a .dump /ma C:\my dumpfile.dmp

    What about intermittent problems?

    An intermittent problem can be one of the most difficult to isolate.  Often times a condition may only last a few seconds yet it could critically effect server or application operation.  In this type of scenario we have to get inventive; literally inventing new tools and methods of catching the bug or problem in the act.  A good example is the sample I posted earlier of catching a hung window.  This sample shows how to monitor something; in this case windows message pump responsiveness and take action if the response time falls out of your specified parameters. See "Detecting and automatically dumping hung GUI based windows applications."

    Good luck and happy debugging!

    PS: Doesn't CSI stand for Crashed Server Investigation?

  • Ntdebugging Blog

    Debug puzzler 0x00000002 “Attack of the crazy stack”

    • 17 Comments

    Hi NTDebuggers, I have another puzzler for you.  We started crash2.exe under windbg and it crashed.  Go figure!  Sometimes we have a very limited amount of data available to figure out what went wrong.  That being said, this week’s puzzler only gives you a few clues.    Given this week’s debugger output, what do you suspect the problem is?  What would you do to further isolate the issue or prove your theory?

     

    If there is more data you need to solve it, post a comment / request and I will provide the data for you.  We will post all comments during the week and provide our answer on Friday.  We look forward to your comments.

     

    CommandLine: crash2.exe

    Symbol search path is: srv*C:\symbols*\\symbols\symbols

    Executable search path is:

    ModLoad: 00400000 00438000   crash2.exe

    ModLoad: 779b0000 77b00000   ntdll.dll

    ModLoad: 76180000 76290000   C:\Windows\syswow64\kernel32.dll

    (15d0.1688): Break instruction exception - code 80000003 (first chance)

    eax=00000000 ebx=00000000 ecx=cd7b0000 edx=00000000 esi=fffffffe edi=77a90094

    eip=779c0004 esp=0017faf8 ebp=0017fb28 iopl=0         nv up ei pl zr na pe nc

    cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000246

    ntdll!DbgBreakPoint:

    779c0004 cc              int     3

     

    0:000> g

     

    (15d0.1688): Access violation - code c0000005 (first chance)

    First chance exceptions are reported before any exception handling.

    This exception may be expected and handled.

    eax=0017feec ebx=7619140f ecx=0042ecc8 edx=00000000 esi=00000002 edi=00001770

    eip=65732074 esp=0017ff00 ebp=6f207473 iopl=0         nv up ei pl nz ac pe nc

    cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010216

    65732074 ??              ???

    0:000> k 123

     

    ChildEBP RetAddr 

    WARNING: Frame IP not in any known module. Following frames may be wrong.

    0017fefc 66692065 0x65732074

    0017ff3c 0041b9a3 0x66692065

    *** WARNING: Unable to verify checksum for crash2.exe

    0017ffa0 762019f1 crash2!_onexit+0x35

    0017ffac 77a2d109 kernel32!BaseThreadInitThunk+0xe

    0017ffec 00000000 ntdll!_RtlUserThreadStart+0x23

     

    0:000> lm

    start    end        module name

    00400000 00438000   crash2   C (private pdb symbols)  D:\CPRRAMP\source\crash2\debug\crash2.pdb

    76180000 76290000   kernel32   (private pdb symbols)  C:\symbols\wkernel32.pdb\20F7BB5ED22344A2910B27CA7252AE792\wkernel32.pdb

    779b0000 77b00000   ntdll      (private pdb symbols)  C:\symbols\wntdll.pdb\7099E4B6A6984FD08CBC90A4EDD40FD12\wntdll.pdb

     

    0:000> db 66692065

    66692065  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    66692075  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    66692085  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    66692095  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    666920a5  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    666920b5  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    666920c5  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    666920d5  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    0:000> db 0x65732074

    65732074  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    65732084  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    65732094  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    657320a4  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    657320b4  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    657320c4  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    657320d4  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    657320e4  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    0:000> db 0041b9a3

    0041b9a3  c3 e8 42 57 ff ff c3 cc-cc cc cc cc cc cc cc cc  ..BW............

    0041b9b3  cc cc cc cc cc cc ff 74-24 04 e8 42 5c ff ff f7  .......t$..B\...

    0041b9c3  d8 1b c0 f7 d8 59 48 c3-cc cc cc cc 8b 44 24 04  .....YH......D$.

    0041b9d3  a3 78 3c 43 00 a3 7c 3c-43 00 a3 80 3c 43 00 a3  .x<C..|<C...<C..

    0041b9e3  84 3c 43 00 c3 cc cc cc-cc cc cc 6a 10 68 58 16  .<C........j.hX.

    0041b9f3  43 00 e8 cc 5d ff ff 33-ff 57 e8 0f 59 ff ff 59  C...]..3.W..Y..Y

    0041ba03  89 7d fc 39 7d 08 75 1c-be 78 3c 43 00 ff 35 78  .}.9}.u..x<C..5x

    0041ba13  3c 43 00 e8 19 5e ff ff-89 45 e4 c7 45 e0 02 00  <C...^...E..E...

    0:000> db 0x66692065

    66692065  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    66692075  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    66692085  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    66692095  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    666920a5  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    666920b5  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    666920c5  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

    666920d5  ?? ?? ?? ?? ?? ?? ?? ??-?? ?? ?? ?? ?? ?? ?? ??  ????????????????

     

    0:000> dd esp

     

    0017ff00  66692065 72756f20 61747320 69206b63

    0017ff10  616c2073 72656772 6f6e6520 54686775

    0017ff20  20736968 6d207369 2065726f 61746164

    0017ff30  726f6620 00737520 6f76e0ae fffffffe

    0017ff40  0041b9a3 0041b9c2 00411181 004146d9

    0017ff50  0017ffa0 00412ac4 00000001 002620a0

    0017ff60  002620d8 6f220936 00000000 00000000

    0017ff70  7efde000 0017ff9c 00000000 00000006

     

    0:000> dds esp

     

    0017ff00  66692065

    0017ff04  72756f20

    0017ff08  61747320

    0017ff0c  69206b63

    0017ff10  616c2073

    0017ff14  72656772

    0017ff18  6f6e6520

    0017ff1c  54686775

    0017ff20  20736968

    0017ff24  6d207369

    0017ff28  2065726f

    0017ff2c  61746164

    0017ff30  726f6620

    0017ff34  00737520

    0017ff38  6f76e0ae

    0017ff3c  fffffffe

    0017ff40  0041b9a3 crash2!_onexit+0x35

    0017ff44  0041b9c2 crash2!atexit+0x9

    0017ff48  00411181 crash2!ILT+380(__RTC_Terminate)

    0017ff4c  004146d9 crash2!_cinit+0x49

    0017ff50  0017ffa0

    0017ff54  00412ac4 crash2!__tmainCRTStartup+0x15e

    0017ff58  00000001

    0017ff5c  002620a0

    0017ff60  002620d8

    0017ff64  6f220936

    0017ff68  00000000

    0017ff6c  00000000

    0017ff70  7efde000

    0017ff74  0017ff9c

    0017ff78  00000000

    0017ff7c  00000006

     

     

    Good luck and happy debugging!

     

    Jeff-


    [Update: more debugger output, per reader request. Posted 4/17/2008]

    Note this cashes without windbg also.

     

    CommandLine: crash2.exe

    Symbol search path is: SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols

    Executable search path is:

    ModLoad: 00400000 00438000   crash2.exe

    ModLoad: 777b0000 77900000   ntdll.dll

    ModLoad: 75a10000 75b20000   C:\Windows\syswow64\kernel32.dll

    (1324.16d0): Break instruction exception - code 80000003 (first chance)

    eax=00000000 ebx=00000000 ecx=73fe0000 edx=00000000 esi=fffffffe edi=77890094

    eip=777c0004 esp=0017faf8 ebp=0017fb28 iopl=0         nv up ei pl zr na pe nc

    cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000246

    ntdll!DbgBreakPoint:

    777c0004 cc              int     3

    0:000> g

    (1324.16d0): Access violation - code c0000005 (first chance)

    First chance exceptions are reported before any exception handling.

    This exception may be expected and handled.

    eax=0017feec ebx=75a2140f ecx=0042ecc8 edx=00000000 esi=00000002 edi=00001770

    eip=65732074 esp=0017ff00 ebp=6f207473 iopl=0         nv up ei pl nz ac pe nc

    cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010216

    65732074 ??              ???

    0:000> db esp

    0017ff00  65 20 69 66 20 6f 75 72-20 73 74 61 63 6b 20 69  e if our stack i

    0017ff10  73 20 6c 61 72 67 65 72-20 65 6e 6f 75 67 68 54  s larger enoughT

    0017ff20  68 69 73 20 69 73 20 6d-6f 72 65 20 64 61 74 61  his is more data

    0017ff30  20 66 6f 72 20 75 73 00-9b 7d 6b d3 fe ff ff ff   for us..}k.....

    0017ff40  a3 b9 41 00 c2 b9 41 00-81 11 41 00 d9 46 41 00  ..A...A...A..FA.

    0017ff50  a0 ff 17 00 c4 2a 41 00-01 00 00 00 98 21 03 00  .....*A......!..

    0017ff60  d0 21 03 00 03 94 3f d3-00 00 00 00 00 00 00 00  .!....?.........

    0017ff70  00 e0 fd 7e 9c ff 17 00-00 00 00 00 06 00 00 00  ...~............

    0:000> ln 0042ecc8

    (0042ecac)   crash2!`string'+0x1c   |  (0042eccc)   crash2!`string'

    0:000> .formats 0017ff044

    Evaluate expression:

      Hex:     017ff044

      Decimal: 25161796

      Octal:   00137770104

      Binary:  00000001 01111111 11110000 01000100

      Chars:   ..D

      Time:    Mon Oct 19 01:23:16 1970

      Float:   low 4.70085e-038 high 0

      Double:  1.24316e-316

    0:000> dd 0017ff044

    017ff044  ???????? ???????? ???????? ????????

    017ff054  ???????? ???????? ???????? ????????

    017ff064  ???????? ???????? ???????? ????????

    017ff074  ???????? ???????? ???????? ????????

    017ff084  ???????? ???????? ???????? ????????

    017ff094  ???????? ???????? ???????? ????????

    017ff0a4  ???????? ???????? ???????? ????????

    017ff0b4  ???????? ???????? ???????? ????????

    0:000> dds crash2!__onexitbegin

    00434354  59c96d56

    00434358  00000001

    0043435c  00000000

    00434360  00000000

    00434364  00000000

    00434368  00000000

    0043436c  00000000

    00434370  00033760

    00434374  00000000

    00434378  00000000

    0043437c  00000000

    00434380  00000000

    00434384  00000000

    00434388  00000000

    0043438c  00000000

    00434390  00000000

    00434394  00000000

    00434398  00000000

    0043439c  00000000

    004343a0  00000000

    004343a4  00000000

    004343a8  00000000

    004343ac  00000000

    004343b0  00000000

    004343b4  00000000

    004343b8  00000000

    004343bc  00000000

    004343c0  00000000

    004343c4  00000000

    004343c8  00000000

    004343cc  00000000

    004343d0  00000000

    0:000> dds crash2!__onexitend

    00434350  59c94d56

    00434354  59c96d56

    00434358  00000001

    0043435c  00000000

    00434360  00000000

    00434364  00000000

    00434368  00000000

    0043436c  00000000

    00434370  00033760

    00434374  00000000

    00434378  00000000

    0043437c  00000000

    00434380  00000000

    00434384  00000000

    00434388  00000000

    0043438c  00000000

    00434390  00000000

    00434394  00000000

    00434398  00000000

    0043439c  00000000

    004343a0  00000000

    004343a4  00000000

    004343a8  00000000

    004343ac  00000000

    004343b0  00000000

    004343b4  00000000

    004343b8  00000000

    004343bc  00000000

    004343c0  00000000

    004343c4  00000000

    004343c8  00000000

    004343cc  00000000

     


    [Update: our answer. Posted 4/18/2008]

    This week we had a lot of people that realized this was a buffer overrun.   You guys are so good I’m going to make next week’s puzzler a little harder!

    Good work all!   They are not all listed but some of the responses I liked the best were:

     

    moltov

    Matthieu

    Doug

    Tal Rosen

     

    This week’s official response

     

     

    Let’s start off with wmain.  You can see here that we push a pointer to a string onto the stack and call fun1 004117a3

     

    0:000> uf crash2!wmain

    crash2!wmain [d:\cprramp\source\crash2\crash2\crash2.cpp @ 11]:

       11 00412520 55              push    ebp

       11 00412521 8bec            mov     ebp,esp

       11 00412523 83ec40          sub     esp,40h

       11 00412526 53              push    ebx

       11 00412527 56              push    esi

       11 00412528 57              push    edi

       12 00412529 686cec4200      push    offset crash2!`string' (0042ec6c) << Pushing param onto the stack

       12 0041252e e870f2ffff      call    crash2!ILT+1950(?fun1YAXPADZ) (004117a3)  << Making call to fun1

       12 00412533 83c404          add     esp,4

       13 00412536 33c0            xor     eax,eax

       14 00412538 5f              pop     edi

       14 00412539 5e              pop     esi

       14 0041253a 5b              pop     ebx

       14 0041253b 8be5            mov     esp,ebp

       14 0041253d 5d              pop     ebp

       14 0041253e c3              ret

     

    Lets dump out the value we are passing.

     

    0:000> da 0042ec6c

    0042ec6c  "This is a test ot see if our sta"

    0042ec8c  "ck is larger enough"

     

    0:000> uf 004117a3

    crash2!fun1 [d:\cprramp\source\crash2\crash2\crash2.cpp @ 17]:

       17 00412550 55              push    ebp

       17 00412551 8bec            mov     ebp,esp

       17 00412553 83ec4c          sub     esp,4Ch

       17 00412556 53              push    ebx

       17 00412557 56              push    esi

       17 00412558 57              push    edi

       19 00412559 8b4508          mov     eax,dword ptr [ebp+8]  << Here we are moving EBP+8 into eax. This is basically just loading the address of parameter one into eax

       19 0041255c 50              push    eax  << We push it onto the stack to make our call to  strcpy

       19 0041255d 8d4df4          lea     ecx,[ebp-0Ch]  << Now we are loading the address of a local on the stack. 

    Note the local is ebp-c   That means that we can only write 0xC bytes to

    this location before we end up overwriting things on the stack like our base pointer and return address.

       19 00412560 51              push    ecx  << Now we push our local variable address onto the stack for our call to strcpy

       19 00412566 83c408          add     esp,8

       20 00412569 8d45f4          lea     eax,[ebp-0Ch]

       20 0041256c 50              push    eax

     

       19 00412561 e8c3eeffff      call    crash2!ILT+1060(_strcpy) (00411429)  << This is where things go WRONG, within the call to strcpy we have overwritten our return address with string data.

     

    Let’s look at the before and after.

     

    0:000> bp 00412560

    0:000> g

    Breakpoint 2 hit

    eax=0042ec6c ebx=75a2140f ecx=0017feec edx=00000000 esi=00000002 edi=00001770

    eip=00412560 esp=0017fe9c ebp=0017fef8 iopl=0         nv up ei pl nz ac pe nc

    cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000216

    crash2!fun1+0x10:

    00412560 51              push    ecx

    0:000> k 

    ChildEBP RetAddr 

    0017fef8 00412533 crash2!fun1+0x10 [d:\cprramp\source\crash2\crash2\crash2.cpp @ 19] << The return address is ok here!

    0017ff50 00412ac4 crash2!wmain+0x13 [d:\cprramp\source\crash2\crash2\crash2.cpp @ 12]

    0017ffa0 75a919f1 crash2!__tmainCRTStartup+0x15e [f:\sp\vctools\crt_bld\self_x86\crt\src\crt0.c @ 327]

    0017ffac 7782d109 kernel32!BaseThreadInitThunk+0xe

    0017ffec 00000000 ntdll!_RtlUserThreadStart+0x23

     

    0:000> p  << Lets step over the call to strcpy and look again.

    eax=0017feec ebx=75a2140f ecx=0042eca0 edx=00686775 esi=00000002 edi=00001770

    eip=00412569 esp=0017fea0 ebp=0017fef8 iopl=0         nv up ei pl nz ac pe nc

    cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000216

    crash2!fun1+0x19:

    00412569 8d45f4          lea     eax,[ebp-0Ch]

     

    0:000> k

    ChildEBP RetAddr 

    0017fef8 65732074 crash2!fun1+0x19 [d:\cprramp\source\crash2\crash2\crash2.cpp @ 20]

    WARNING: Frame IP not in any known module. Following frames may be wrong.

    0017ff3c 0041b9a3 0x65732074  << This is not going to be pretty when we ret out of fun1.  We will basically return to nowhere.

    0017ffa0 75a919f1 crash2!_onexit+0x35 [f:\sp\vctools\crt_bld\self_x86\crt\src\onexit.c @ 98]

    0017ffac 7782d109 kernel32!BaseThreadInitThunk+0xe

    0017ffec 00000000 ntdll!_RtlUserThreadStart+0x23

     

    0:000> da 0017fef8   if we do a da on the location of the stack frame we can see what is at that location.

    0017fef8  "st ot see if our stack is larger"  << It’s our string!

    0017ff18  " enough"

     

    Finally we have the C code.

     

    #include <windows.h>

     

    void fun1(char * szData);

    void fun2(char * szData);

     

    int _tmain(int argc, _TCHAR* argv[])

    {

                    fun1("This is a test ot see if our stack is larger enough");

                    return 0;

    }

     

    void fun1(char * szdata)

    {

                    char szData1[10];

                    strcpy(szData1,szdata);

                    fun2(szData1);

    }

    void fun2(char *szData)

    {

                    printf("Hello from fun2");

                    strcat(szData,"This is more data for us");

    }

     

     

    Have a great weekend, Good luck, and happy debugging!

     

    Jeff-





  • Ntdebugging Blog

    Closing the Loop: CPU Spike in Winlogon.exe

    • 3 Comments

    We recently dealt with an interesting issue that I would like to share, hope you enjoy. - Jason

    Issue Summary

    Customer reports that within their Citrix server farm (running on Windows Server 2003), when a user logs into or out of a session (seems more pronounced on logoff), ALL users connected to the server experience a ~5-20 second hang. The hang is described as follows:

    • Application in the session (i.e. Outlook and Word) stop accepting keyboard input. When the hang subsides, the typed characters show up all at once.
    • If applications are moved, they do not redraw
    • Cannot click the start menu
    • If the user was running performance monitor (in an attempt to diagnose), there would be a gap in the perfmon data that directly correlated to the hang duration

    Customer has found that during the timeframe of the hang, Winlogon.exe (in session 0) is showing 25% CPU usage.

    Troubleshooting & Analysis

    Where to start… to begin, given the details above, we can assume we are not experiencing a hard hang where the server is completely unresponsive. The user can still move windows and the connection to the server is still active. Keyboard input is working, but seems to be buffered during the issue. The windows not redrawing indicates the message pump for each of the affected applications is stuck waiting for something.

    In this scenario, I focused on the CPU usage for Winlogon under the premise that the other symptoms were side effects of whatever the CPU was working so hard on. On to the 25%; this is an interesting number to spike at. Generally speaking a thread has work to do, or it doesn't. If there was work to do it would normally take the CPU to 100%. So why 25%? In this scenario, each of the servers within the server farm had 4 processors. So we pegged 1 out of the 4 @ 100% resulting in task manager displaying 25% CPU utilization for the Winlogon in question.

    So now we have a CPU spike in Winlogon. Why only the Winlogon in session 0? Going back to the issue summary, the customer reported that if ANY user logs off, ALL users experience a hang, and Winlogon in session 0 spikes. First, let's talk about what winlogon does for a living.

    Sidebar: Winlogon on Windows Server 2003.

    Any time you ask what a feature is or how it works, you should begin your research with the Windows Technical Reference (try a Live search like "winlogon site:http://technet2.microsoft.com"). In this case I focused on the "How Interactive Logon Works" article.

    This article starts with:

    The Windows Server 2003 interactive logon architecture includes the following components:

    • Winlogon
    • Graphical Identification and Authentication (GINA) dynamic-link library (DLL)
    • Local Security Authority (LSA)
    • Authentication packages (NTLM and Kerberos)

    So Winlogon has something to do with Interactive logon.

    What else can we find out about the responsibilities of Winlogon

    • Registers the SAS (the secure attention sequence, a.k.a CTRL-ALT-DEL) during the boot process to keep other programs and processes from using it
    • SAS routine dispatching - When Winlogon recognizes a SAS event or the GINA delivers a SAS, Winlogon calls one of the SAS processing functions of the GINA.
    • Desktop lockdown - Winlogon helps prevent unauthorized user's from gaining access to system resources by locking down the computer desktop. At any time, Winlogon is in one of three possible states: logged on, logged off, or workstation locked
    • User profile loading - After a successful logon, Winlogon loads user profiles into the HKEY_CURRENT_USER registry key.
    • Screen saver control - Winlogon monitors keyboard and mouse activity to determine when to activate screen savers.
    • Multiple network provider support - If there are multiple network providers installed on a Windows-based system, they can be included in the authentication process and in password-updating operations

    Looking at this list of tasks performed by Winlogon, we need to try to determine which task aligns to our symptoms (hang during logon and logoff). I decided to focus on User Profiles due to the relationship between profiles and login and logoff.

    What is a User Profile? Back to Live with "user profiles site:http://technet2.microsoft.com". Out of the results, I went with the "User Profile Structure" link. The page provides the following detail:

    A user profile consists of:

    • A registry hive. The registry is a database used to store computer- and user-specific settings. Portions of the registry can be saved as files, called hives. These hives can then be reloaded for use as necessary. User profiles take advantage of the hive feature to provide roaming profile functionality. The user profile registry hive is the NTuser.dat in file form, and is mapped to the HKEY_CURRENT_USER portion of the registry when the user logs on. The NTuser.dat hive maintains the users environment preferences when the user is logged on. It stores those settings that maintain network connections, Control Panel configurations unique to the user (such as the desktop color and mouse), and application-specific settings. The majority of the settings stored in the registry are opaque to user profiles settings are owned and maintained by individual applications and operating system components.
    • A set of profile folders stored in the file system. User profile files are stored in the file system in the Documents and Settings directory, in a per user folder. The user profile folder is a container for applications and other operating system components to populate with subfolders and per-user data, such as shortcut links, desktop icons, startup applications, documents, configuration files and so forth. Windows Explorer uses the user profile folders extensively for special folders such as the users desktop, start menu and my documents folder.

    With this we can now look at the Winlogon that is spiking the CPU. My step when looking at a CPU spike for a process is to determine if the time spent is in User-mode or Kernel mode. In task manager on the Performance tab you can monitor CPU usage. If you select the View menu and chose to "Show Kernel Times" you can get additional detail showing you if the time is associated with a User-mode module or a Kernel module. In this case the Winlogon spike showed the spike to be in Kernel time. This means the application (Winlogon) asked the OS to do something and it is trying to get it done.

    My second step is to determine what user-mode request led to the spike in Kernel time. To answer this on a production system I chose to use Sysinternal's Process Explorer. I downloaded the tool and set it up so it would be able to obtain symbols from the public MS symbols server. During the repro I monitored the thread CPU time in Winlogon. Here is a screen shot of what Process Explorer looks when looking at the thread activity of an idle Winlogon.

    Figure 1 – Process Properties in Process Explorer

    When the CPU spike occurred we looked at the stack for the spiked thread (double click on the thread with the high CPU time or click the "Stack" button with the thread highlighted):

    ntdll.dll!KiFastSystemCall+0x3
    ntdll.dll!KiFastSystemCallRet
    ntdll.dll!ZwUnloadKey+0xc
    ADVAPI32.dll!LocalBaseRegUnLoadKey+0x51
    ADVAPI32.dll!RegUnLoadKeyW+0x73
    USERENV.dll!MyRegUnLoadKey+0x6d
    USERENV.dll!CUserProfile::UnloadUserProfileP+0x2a2
    USERENV.dll!UnloadUserProfileI+0x198

    RPCRT4.dll!Invoke+0x30
    RPCRT4.dll!NdrStubCall2+0x299
    RPCRT4.dll!NdrServerCall2+0x19
    RPCRT4.dll!DispatchToStubInCNoAvrf+0x38
    RPCRT4.dll!RPC_INTERFACE::DispatchToStubWorker+0x11f
    RPCRT4.dll!RPC_INTERFACE::DispatchToStub+0xa3
    RPCRT4.dll!LRPC_SCALL::DealWithRequestMessage+0x42c
    RPCRT4.dll!LRPC_ADDRESS::DealWithLRPCRequest+0x127
    RPCRT4.dll!LRPC_ADDRESS::ReceiveLotsaCalls+0x430
    RPCRT4.dll!RecvLotsaCallsWrapper+0xd
    RPCRT4.dll!BaseCachedThreadRoutine+0x9d
    RPCRT4.dll!ThreadStartRoutine+0x1b
    kernel32.dll!BaseThreadStart+0x34

    Just to make sure, we repro'd again and looked at the call stack

    ntdll.dll!KiFastSystemCall+0x3
    ntdll.dll!KiFastSystemCallRet
    ntdll.dll!ZwUnloadKey+0xc
    ADVAPI32.dll!LocalBaseRegUnLoadKey+0x51
    ADVAPI32.dll!RegUnLoadKeyW+0x73
    USERENV.dll!MyRegUnLoadKey+0x6d
    USERENV.dll!CUserProfile::UnloadUserProfileP+0x2a2
    USERENV.dll!UnloadUserProfileI+0x198

    RPCRT4.dll!Invoke+0x30
    RPCRT4.dll!NdrStubCall2+0x299
    RPCRT4.dll!NdrServerCall2+0x19
    RPCRT4.dll!DispatchToStubInCNoAvrf+0x38
    RPCRT4.dll!RPC_INTERFACE::DispatchToStubWorker+0x11f
    RPCRT4.dll!RPC_INTERFACE::DispatchToStub+0xa3
    RPCRT4.dll!LRPC_SCALL::DealWithRequestMessage+0x42c
    RPCRT4.dll!LRPC_ADDRESS::DealWithLRPCRequest+0x127
    RPCRT4.dll!LRPC_ADDRESS::ReceiveLotsaCalls+0x430
    RPCRT4.dll!RecvLotsaCallsWrapper+0xd
    RPCRT4.dll!BaseCachedThreadRoutine+0x9d
    RPCRT4.dll!ThreadStartRoutine+0x1b
    kernel32.dll!BaseThreadStart+0x34

    We can see here that both call stacks for the thread in question are RPC threads waiting for incoming requests. The request that came in was from userenv module running in Winlogon associated with the session of the user logging off and came into the Winlogon in session 0. The request is to unload the User Profile (highlighted above). This led to a call to RegUnloadKey call which removes a hive from the registry but does not modify the file containing the registry information.

    We now know the issue is related to unloading the user profile and specifically the registry portion of the user profile. Now we need to determine where we are spending the time in kernel mode.

    Kernrate

    Kernrate is a tool included with the Windows 2003 resource kit. This will let us dig into the kernel mode side of what is going on. I collected the following data from kernrate during the hang condition (Summarized):

    Parsing the output, we see that 72.66% of the Kernel time was spent in the idle process. This is attributed to the 3 idle threads that are on the 3 processors not being used by the Winlogon thread. We see that the Winlogon process accounts for 20.39% of the kernel time (the CPU spike on the 4th processor). If we move into the module summary, you can see the correlated CPU times for the modules, INTELPPM (CPU driver) running on the idle processors and NTOSKRNL running on the 4th processor (this is where we will focus). In the final drill down, we see the function summary for NTOSKRNL and we can identify the function in use 82% of the time was CmpCleanUpKCBCacheTable. The Cm function prefix lets us know this is related to Configuration Manager (the Registry – see Table 2-7 from Microsoft Windows Internals 4th Edition for Commonly used Kernel Prefixes).

    What do we know? We know the issue manifests most during user logoff. We know during the logoff we are trying to unload the profile. We know that in the process of doing this we are trying to unload a registry hive. We know that leads us to spend a lot of CPU time in the Kernel doing CmpCleanUpKCBCacheTable.

    Why does this hang the machine? The registry is protected / synchronized with an ERESOURCE named CmpRegistryLock. While this is held exclusively during this cleanup function, all registry access is blocked (both read and write). This explains all of our symptoms. Applications freeze and do not redraw due to operations that need registry access being done on the GUI thread of a process. Perfmon cannot report due to its dependency on the registry.

    Resolution

    Now that we know the exact behavior, we could align it to a known issue that was introduced with the release of Windows Sever 2003 SP1 (applies to both SP1 and SP2)

    KB927182

    From the KB cause section:

    This problem occurs because of the way that Windows Server 2003 Service Pack 1 (SP1) cleans up user registry information. When a user logs off, there is a five-second delay before Windows Server 2003 SP1 flushes the user registry hive. If the terminal server experiences heavy registry activity, the cached registry information on the terminal server may increase significantly. For example, this behavior may occur when a program scans the user profile during the logoff process. When the terminal server experiences heavy registry activity, it takes Windows longer to flush the data.

    The fix applies to Windows Server 2003 with SP1 or SP2.

  • Ntdebugging Blog

    Hardware bitflipping

    • 8 Comments

    Hello all; my name is Scott Olson and I work as an Escalation Engineer for Microsoft Global Escalation Services team in Platforms support, and I wanted to share an interesting problem that came up recently. A co-worker was running Windows Vista Ultimate x64 on their home machine and ran into a problem where the system would get random bugchecks after upgrading the RAM from 2GB to 4GB. Any combination of the RAM with 2GB was fine; however with 4GB of RAM installed the system would bugcheck within 10 minutes of booting. Once I heard about this I wanted to look at the memory dump in kernel debugger.

    Here's is what I found:

    The system got the following bugcheck:

    0: kd> .bugcheck
    Bugcheck code 000000D1
    Arguments fffff800`03a192d0 00000000`00000002 00000000`00000000 fffff980`064aa8b6

    Tip: The help file included with the Debugging Tools For Windows contains a Bug Check Code Reference that includes details on how to parse the Bug Check code and its arguments. See: Help > Debugging Techniques > Bug Checks (Blue Screens) > Bug Check Code Reference

    !analyze -v provides the following information for this bugcheck:

    DRIVER_IRQL_NOT_LESS_OR_EQUAL (d1)
    An attempt was made to access a pageable (or completely invalid) address at an interrupt request level (IRQL) that is too high. This is usually caused by drivers using improper addresses. If kernel debugger is available get stack backtrace.
    Arguments:
    Arg1: fffff80003a192d0, memory referenced
    Arg2: 0000000000000002, IRQL
    Arg3: 0000000000000000, value 0 = read operation, 1 = write operation
    Arg4: fffff980064aa8b6, address which referenced memory

    Debugging Details:
    ------------------

    READ_ADDRESS: fffff80003a192d0

    CURRENT_IRQL: 2

    So with this data I can say that the system took a page fault on a read operation trying to reference the memory at fffff80003a192d0 at DISPATCH_LEVEL. OK, so let's get the trap frame so we can get into context of the system when the crashed happened:

    0: kd> kv 3
    Child-SP RetAddr : Args to Child : Call Site
    fffff800`03218f28 fffff800`0204da33 : 00000000`0000000a fffff800`03a192d0 00000000`00000002 00000000`00000000 : nt!KeBugCheckEx
    fffff800`03218f30 fffff800`0204c90b : 00000000`00000000 fffffa80`0a3c6cf0 00000000`00000000 00000000`00000000 : nt!KiBugCheckDispatch+0x73
    fffff800`03219070 fffff980`064aa8b6 : 00000000`00000002 00000000`00000000 00000000`000005e0 fffff800`03219220 : nt!KiPageFault+0x20b (TrapFrame @ fffff800`03219070)

    Here is the trap frame and it looks like system crashed while trying to reference memory at an offset from the stack pointer, rsp+0xD0 (highlighted above)

    0: kd> .trap fffff800`03219070
    NOTE: The trap frame does not contain all registers.
    Some register values may be zeroed or incorrect.
    rax=0000000000000000 rbx=0000000000000010 rcx=0000000000000011
    rdx=0000000000000002 rsi=0000000000000000 rdi=0000000000000001
    rip=fffff980064aa8b6 rsp=fffff80003219200 rbp=00000000000071d6
    r8=fffff80003219280 r9=00000000000071d6 r10=0000000000000000
    r11=0000000000000000 r12=0000000000000000 r13=0000000000000000
    r14=0000000000000000 r15=0000000000000000
    iopl=0 nv up ei pl zr na po nc
    tcpip!InetInspectReceiveDatagram+0xf6:
    fffff980`064aa8b6 440fb78c24d0000000 movzx r9d,word ptr [rsp+0D0h] ss:0018:fffff800`032192d0=8c13

    As you can see above fffff800`032192d0 looks like valid memory and shouldn't normally cause a page fault on a read operation. At this point, I want to make sure the system did what it was told. I want to know what happened when the system trapped. To verify the faulting address I dumped the CR2 register to see what address was referenced when the page fault happened; this is also the first parameter in the bugcheck code for a stop 0xd1.

    0: kd> r cr2
    cr2=fffff80003a192d0

    Looking at this address it is clear that the trap frame does not exactly match, so let's look at how these addresses are different. Here is the stack pointer from the trap frame and the page fault converted into varying formats (focusing on the binary)

    0: kd> .formats fffff800`032192d0
    Evaluate expression:
    Hex: fffff800`032192d0
    Decimal: -8796040490288
    Octal: 1777777600000310311320
    Binary: 11111111 11111111 11111000 00000000 00000011 00100001 10010010 11010000
    Chars: .....!..
    Time: ***** Invalid FILETIME
    Float: low 4.74822e-037 high -1.#QNAN
    Double: -1.#QNAN

    0: kd> .formats fffff800`03a192d0
    Evaluate expression:
    Hex: fffff800`03a192d0
    Decimal: -8796032101680
    Octal: 1777777600000350311320
    Binary: 11111111 11111111 11111000 00000000 00000011 10100001 10010010 11010000
    Chars: ........
    Time: ***** Invalid FILETIME
    Float: low 9.49644e-037 high -1.#QNAN
    Double: -1.#QNAN

    Notice that there is a one bit difference between these 2 addresses

    11111111 11111111 11111000 00000000 00000011 00100001 10010010 11010000

    11111111 11111111 11111000 00000000 00000011 10100001 10010010 11010000

    Since the software asked the system to do one thing and it did something different this is clearly some type of hardware problem (most likely with the processor). I reported this back to the co-worker and they contacted their hardware vendor. This must have been a common problem with this vendor because I found out later that they replied back within 10 minutes of contacting them with a recommendation to change the memory voltage in the BIOS. The memory voltage was set to Auto, which is a default. They recommended it be changed from 1.85 volts to 2.1 volts. After making the change the system was stable with 4GB of RAM.

Page 1 of 2 (13 items) 12