• Ntdebugging Blog

    Windbg Tip: KN, .Frame , DV, and DT - It's so easy

    • 3 Comments

    Written by Jeff Dailey.

    Hello NTDebuggers, many of us take for granted some of the simple commands in the debugger that make life easy. I was thinking of several in particular that go great together.  The first command would be kn.  Kn will show the current call stack and includes the stack frame number to the far left of the individual call.  

    I loaded up a copy of BadWindow.EXE in Windbg, set my thread to thread Zero via ~0s.  Then I simply type kn.  This dumps out the call stack, and as you can see, to the far left each call there is the stack frame number.

    Using stack frame 2 from the above stack let’s look at the locals it uses.  First we must get in context using .frame and the frame number.

    Now to dump out the locals we can just use the dv (Display Local Variables) command.  (Note you need private symbols for the binary you are debugging for this to work).

    As you can see, we have several locals here.  Now, because I have private symbols for this binary, the debugger already understands the structures and data types for each variable.

    Furthermore because I’m already in the stack frames context I can simply dt (display type) with the name of the variable and the debugger figures out the address.  Let’s dt msg and see what data it contains.

    As you can see  have a couple of structures that are part of this structure.    Look at the right in the above output.  You will notice a data type name of HWND_ and tagPOINT to the right.  These are structures.  The good news is that dt will work recursively and will dump out structures under your current structure.   All you need to do is issue a DT –r.  You can specify the depth by appending a number to –r.

    For more information about the dt command and others mentioned in this blog make sure you take a look at the debugger.chm file that comes with the Windows Debugging Tools.

    Good luck and happy debugging.

  • Ntdebugging Blog

    How to debug WOW64 applications?

    • 3 Comments

     

    Hello, my name is Venkatesh Ganga, and I’m an Escalation Engineer on the Microsoft Platform Global Escalation Services team.  Recently, I worked on an issue where registry reflection was not happening while installing 32bit Office under the system account. This required looking into both the 32bit Office code and the Wow64 code where the registry reflection is implemented.  When attaching to the Wow64 process using the 32bit debugger it’s like debugging a 32bit process on the 32bit machine; there are no 64bit binaries in the process. However, we needed to debug Wow64 to debug the registry reflection code. To do this we attached to the Wow64 process using the 64bit debugger which allows you to see the Wow64 binaries.

    Ø  lm

                Base TimeStamp                     Module

              400000 42435b2a Mar 24 18:28:26 2005 C:\Program Files (x86)\Internet Explorer\IEXPLORE.EXE

            77ec0000 45d6cc72 Feb 17 03:35:46 2007 C:\WINDOWS\system32\ntdll.dll

            6b000000 45d6943d Feb 16 23:35:57 2007 C:\WINDOWS\system32\wow64.dll

            6b280000 45d695f3 Feb 16 23:43:15 2007 C:\WINDOWS\system32\wow64win.dll

            78b80000 42438b7a Mar 24 21:54:34 2005 C:\WINDOWS\system32\wow64cpu.dll

     

    There are 2 options for debugging Wow64 applications.

    1.       Using the 64bit debugger and Wow64 debugger extension (Wow64exts.dll)

    2.       Using the 32bit debugger

    Using 64bit debugger and Wow64 debugger extension (Wow64exts.dll)

    I ran the 32bit version of Internet Explorer on a 64bit machine and attached to it using the 64bit debugger. Here is the thread 0 call stack when viewed from 64bit debugger.

    0:000> kL

    Child-SP          RetAddr           Call Site

    00000000`0013edf8 00000000`78b8428e wow64cpu!CpupSyscallStub+0x9

    00000000`0013ee00 00000000`6b006a5a wow64cpu!Thunk0Arg+0x5

    00000000`0013ee70 00000000`6b005e0d wow64!RunCpuSimulation+0xa

    00000000`0013eea0 00000000`77ed8030 wow64!Wow64LdrpInitialize+0x2ed

    00000000`0013f6d0 00000000`77ed582f ntdll!LdrpInitializeProcess+0x1538

    00000000`0013f9d0 00000000`77ef30a5 ntdll!_LdrpInitialize+0x18f

    00000000`0013fab0 00000000`77d59620 ntdll!KiUserApcDispatch+0x15

    00000000`0013ffa8 00000000`00000000 0x77d59620

    00000000`0013ffb0 00000000`00000000 0x0

    00000000`0013ffb8 00000000`00000000 0x0

    00000000`0013ffc0 00000000`00000000 0x0

    00000000`0013ffc8 00000000`00000000 0x0

    00000000`0013ffd0 00000000`00000000 0x0

    00000000`0013ffd8 00000000`00000000 0x0

    00000000`0013ffe0 00000000`00000000 0x0

    00000000`0013ffe8 00000000`00000000 0x0

    00000000`0013fff0 00000000`00000000 0x0

    00000000`0013fff8 00000000`00000000 0x0

    00000000`00140000 00000020`78746341 0x0

    00000000`00140008 00005370`00000001 0x20`78746341

     

    The above stack only shows the 64 bit calls; we cannot see what the 32 bit calls are doing. To get the 32bit stack you must use one of the below methods.

    Ø  Option 1 : Run “!wow64exts.k”

    0:000> !wow64exts.k

    Walking 64bit Stack…

    Child-SP          RetAddr           Call Site

    00000000`0013edf8 00000000`78b8428e wow64cpu!CpupSyscallStub+0x9

    00000000`0013ee00 00000000`6b006a5a wow64cpu!Thunk0Arg+0x5

    00000000`0013ee70 00000000`6b005e0d wow64!RunCpuSimulation+0xa

    00000000`0013eea0 00000000`77ed8030 wow64!Wow64LdrpInitialize+0x2ed

    00000000`0013f6d0 00000000`77ed582f ntdll!LdrpInitializeProcess+0x1538

    00000000`0013f9d0 00000000`77ef30a5 ntdll!_LdrpInitialize+0x18f

    00000000`0013fab0 00000000`77d59620 ntdll!KiUserApcDispatch+0x15

    00000000`0013ffa8 00000000`00000000 0x77d59620

    00000000`0013ffb0 00000000`00000000 0x0

    00000000`0013ffb8 00000000`00000000 0x0

    00000000`0013ffc0 00000000`00000000 0x0

    00000000`0013ffc8 00000000`00000000 0x0

    00000000`0013ffd0 00000000`00000000 0x0

    00000000`0013ffd8 00000000`00000000 0x0

    00000000`0013ffe0 00000000`00000000 0x0

    00000000`0013ffe8 00000000`00000000 0x0

    00000000`0013fff0 00000000`00000000 0x0

    00000000`0013fff8 00000000`00000000 0x0

    00000000`00140000 00000020`78746341 0x0

    00000000`00140008 00005370`00000001 0x20`78746341

    Walking 32bit Stack...

    ChildEBP          RetAddr          

    002ded98 75ec1c83 USER32!NtUserWaitMessage+0x15

    002dee24 75ec61ef BROWSEUI!BrowserProtectedThreadProc+0x44

    002dfea8 779ba3a6 BROWSEUI!SHOpenFolderWindow+0x22c

    002dfec8 0040243d SHDOCVW!IEWinMain+0x129

    002dff1c 00402748 IEXPLORE!WinMain+0x316

    002dffc0 7d4e7d2a IEXPLORE!WinMainCRTStartup+0x186

    002dfff0 00000000 KERNEL32!BaseProcessStart+0x28

     

    Ø  Option 2 : Switch to x86 mode (using “!wow64exts.sw”) and do KB.

     

    0:000> !wow64exts.sw

    Switched to 32bit mode

    0:000:x86> kb

    ChildEBP          RetAddr           Args to Child                                        

    002ded98 75ec1c83 002f1be8 002dee50 002f1be8 USER32!NtUserWaitMessage+0x15

    002dee24 75ec61ef 002f1be8 002f1be8 00000000 BROWSEUI!BrowserProtectedThreadProc+0x44

    002dfea8 779ba3a6 002f1be8 00000001 00000000 BROWSEUI!SHOpenFolderWindow+0x22c

    002dfec8 0040243d 002e2508 00000001 ffffffff SHDOCVW!IEWinMain+0x129

    002dff1c 00402748 00400000 00000000 002e2508 IEXPLORE!WinMain+0x316

    002dffc0 7d4e7d2a 00000000 00000000 7efdf000 IEXPLORE!WinMainCRTStartup+0x186

    002dfff0 00000000 004025c2 00000000 000000c8 KERNEL32!BaseProcessStart+0x28

     

    The easiest way to see all of the 32bit call stacks is by switching to  32bit mode (!wow64exts.sw) and doing ~*k.  In addition, you can set breakpoints in 32bit or 64 bit binaries using the 64bit debugger.  Also note  “!peb” will show both the 64bit and 32bit PEB.

     

    Using the 32bit debugger

    As mentioned earlier there is nothing wrong with using the 32 bit debugger.  If you just need to debug the application’s 32bit code, using it is probably the simplest approach.  However, if you need to view Wow64 code or binaries, you must use the 64bit debugger.  Note that these techniques apply to debugging Wow64 dumps and live processes.

    You can find more information about WoW64 applications at http://msdn.microsoft.com/en-us/library/aa384249(VS.85).aspx 

     

  • Ntdebugging Blog

    NTDebugging Puzzler 0x00000007: Interlocked functions

    • 19 Comments

     

    Today, we will have some fun with interlocked functions.

     

    The following section of code is reentrant.  A “well meaning” developer used interlocked functions to avoid serializing on a global table lock.

     

    Initial smoke testing shows that the code works fine.  Sometimes things are not as they appear when doing initial code review.  After several hours of heavy stress testing, the developer finds the machine has bugchecked.  Analysis of the dump showed that the caller of this function had steamrolled through nonpaged pool writing stacks on top of everyone’s pool memory.

     

    The goal of today’s puzzler is to find the bug and describe how it could be fixed with minimal timing impact.

     

    Here are a few details before you begin.

     

    1.       The variable gIndex is an unsigned long global.

    2.       The gLogArray memory was allocated from nonpaged pool and the size of this allocation is correct.

     

    Ready?

     

    00:   PLOG_ENTRY GetNextLogEntry ()

          {

    01:         ULONG IterationCount = MAX_RECORDS;

     

    02:         PLOG_ENTRY pEntry;

     

    03:         do

                {

    04:               if (InterlockedCompareExchange(&gIndex, 0, MAX_RECORDS) == MAX_RECORDS)

     

    05:                     pEntry = &gLogArray[0];

     

    06:               else

     

    07:                     pEntry = &gLogArray[InterlockedIncrement(&gIndex)];

     

    08:               --IterationCount;

     

    09:         } while(InterlockedCompareExchange(&pEntry->Active, 1, 0) != 0 && (IterationCount > 0));

     

    0a:         return (IterationCount > 0) ? pEntry : NULL;

          }

     

     

     

    Happy hunting,

     

    Dennis Middleton “The NTFS Doctor”

     


      [Update: our answer. Posted 6/10/2008]

     

    Thanks for all the great posts!  I saw many interesting answers, and a few unique solutions that I hadn’t considered.

     

    Bug Description

     

    A slight time gap exists between the InterlockedCompareExchange and the InterlockedIncrement.  For this reason, several threads could pass the check for MAX_RECORDS prior to doing the increment.

     

    Assume that N is the number of threads that pass the check for MAX_RECORDS while gIndex is at a particular value.

    If N or more threads are able to pass the check while gIndex is equal to MAX_RECORDS-(N-1), then gIndex would be incremented beyond MAX_RECORDS.

     

    For example, let’s assume that 3 threads passed the check while gIndex was at MAX_RECORDS-2.  Then after the three increments occur, gIndex would be equal to MAX_RECORDS+1.  From that point, invalid pointers would be passed out to the caller.

     

    Possible Solutions

     

    There are several ways to solve this problem.  Some are more efficient than others.  I would avoid doing checks for MAX_RECORDS-1, MAX_RECORDS, or MAX_RECORDS+1 (interlocked or not) since there could potentially be more than two threads involved in the race condition.  Such a solution would only reduce the likelihood of an overflow.

     

    There were a few posts suggesting a lock or critical section for the section of code between the compare and increment.  That would be one solution, but it would also do away with the benefits of using interlocked functions.

     

    In keeping with the philosophy of keeping the code fast and simple, here’s a solution that gives a very good result with minimal impact.

     

    1.       I removed the if () {} else {} and simply allowed gIndex to increment unchecked.  With this change, gIndex can approach its max unsigned long value and possibly wrap - we need to keep the array index in check.

    2.       The modulus operator (added to line 4 below) will divide the incremented gIndex by MAX_RECORDS and use the remainder as the array index.  When dividing, the resultant remainder is always smaller than the divisor (MAX_RECORDS).  For this reason, the array index is guaranteed to be smaller than MAX_RECORDS.  As even multiples of MAX_RECORDS are reached, the array index resets back to zero mathemagically and no interlocked compare is even necessary.

     

     

    00:   PLOG_ENTRY GetNextLogEntry ()

          {

    01:         ULONG IterationCount = MAX_RECORDS;

     

    02:         PLOG_ENTRY pEntry;

     

    03:         do

                {

     

    04:               pEntry = &gLogArray[InterlockedIncrement(&gIndex) % MAX_RECORDS];

     

    05:               --IterationCount;

     

    06:         } while(InterlockedCompareExchange(&pEntry->Active, 1, 0) != 0 && (IterationCount > 0));

     

    07:         return (IterationCount > 0) ? pEntry : NULL;

          }

     

     

    With the fix in place, the code is smaller, faster, easier to read, and most of all - the bug is gone.  When developing code, always try to think small.             

     

    Best Regards,

    NTFS Doctor

     

     

     





  • Ntdebugging Blog

    Beefier Breakpoints Using Debugger Scripts

    • 5 Comments

     

    Written by Ron Stock

    I recently collaborated with a third-party vendor which required me to track down a problem during the Vista upgrade process. The issue came down to catching a Microsoft component opening a specific registry service key during the setup process. Issues like this frequently call for live debugging to catch registry access in real time. I could have set a breakpoint on RegOpenKeyExW() and examined every requested key passed into the function, but this method is very time consuming considering RegOpenKeyExW()  is a highly used code path. This methodology involves setting a break point on the function, waiting for the break in, checking the second parameter (lpSubKey) for a match to the desired registry key (string), and hitting “g” if the key doesn’t match. I could have potentially repeated these steps a hundred times before finding the call containing the key I needed. I wanted some way to set a ‘break on access’ only when my key was touched. We ran into a similar issue requiring the ability to catch a file system function working with a specific file when possibly hundreds of files were in use. So how do you gain the automation needed for this type of debugging? The answer is debugger scripts.

    Let’s run through an example with Windows Explorer.exe using this method. In this scenario we’ll attempt to catch Explorer.exe opening the HKEY_LOCAL_MACHINE\SYSTEM\Setup key.  Windows Explorer opens several keys per second, so the focus is to ‘break  in’ when this specific key is passed to RegOpenKeyEx() without having to manually wade through hundreds of possible keys passed into the function.

    1.       Identify the function in question.

     

    Per MSDN, RegOpenKeyEx is defined as the following, and we’re interested in the second parameter because it contains the name of the key.

    LONG WINAPI RegOpenKeyEx(

      __in        HKEY hKey,

      __in_opt    LPCTSTR lpSubKey,

      __reserved  DWORD ulOptions,

      __in        REGSAM samDesired,

      __out       PHKEY phkResult

    );

    2.       Create the debugger script

    The script logic is straight forward. Our debugger script will break in every time RegOpenKeyEx() is called.  It then checks the second parameter for the string SYSTEM\Setup. If it finds a match it stays broken in, otherwise it returns command to the running thread. Our script will look like this -

    r$t0=poi(esp+8)

    .if (@$t0 != 0) { as /mu ${/v:_str} @$t0 } .else { ad /q ${/v:_str} }

    .if ($spat("${_str}x", "*system\Setup*") == 0) { .echo _str;g } .else { .echo _str }

     

    r$t0 is considered a pseudo-register and is used as a variable in this script. Since the second parameter of RegOpenKeyEx() is passed via esp+8, our debugger command program will dereference esp+8 and move it into r$t0. Our pseudo-register now holds the memory address of lpSubKey. The second line of the program simply checks to see if r$t0 is NULL, and will setup our alias named _str at the address held in r$t0 if r$t0 is not NULL. The third line uses the MASM string operator, $spat, looking for a string pattern which matches system\setup. If it doesn’t find a match it prints the string, and returns command back to the executing thread. It’s the equivalent of typing ‘g’ in the debugger. If the string pattern system\setup is found, the string is printed out on the screen, and the debugger remains broken in. Then we can debug the underlying problem in the appropriate context. Now save them into a txt located at c:\debuggers\strings.txt so we can reference it from our breakpoint.

     

    3.       Set the break point.

     

    Next we break into the Explorer process from the kernel debugger

     

    kd> !process 0 0 Explorer.exe

    PROCESS 8340d6a8  SessionId: 1  Cid: 0270    Peb: 7ffd5000  ParentCid: 01b0

        DirBase: 2d48f000  ObjectTable: 90171348  HandleCount: 559.

        Image: explorer.exe

     

     

    kd> !bpid 0270

    Finding wininit.exe (1)...

    Finding winlogon.exe (1)...

    Waiting for winlogon.exe to break.  This can take a couple of minutes...

    Break instruction exception - code 80000003 (first chance)

    Break into process 270 set.  The next break should be in the desired process.

    Break instruction exception - code 80000003 (first chance)

    ntdll!DbgBreakPoint:

    77f17dfe cc              int     3

     

     

    Then we set a breakpoint on ADVAPI32!RegOpenKeyExW instructing it to use our script to evaluate whether SYSTEM\Setup is the key passed in at the second parameter. If it finds a match to our key it stays broken in otherwise execution is passed back to the running Explorer thread.

     

    kd> bp ADVAPI32!RegOpenKeyExW "$$<c:\\debuggers\\strings.txt"

    kd> bl

    0 e 76e8f09d     0001 (0001) ADVAPI32!RegOpenKeyExW "$$<c:\debuggers\strings.txt"

     

    This is the output we would expect to see for all of the non-matches.

     

    kd> g

    kd> r$t0=poi(esp+8)

    kd> .if (@$t0 != 0) { as /mu ${/v:_str} @$t0 } .else { ad /q ${/v:_str} }

    kd> .if ($spat("${_str}x", "*system\Setup*") == 0) { .echo _str;g } .else { .echo _str }

    System\CurrentControlSet\Services\DnsCache\Parameters

    kd> r$t0=poi(esp+8)

    kd> .if (@$t0 != 0) { as /mu ${/v:_str} @$t0 } .else { ad /q ${/v:_str} }

    kd> .if ($spat("${_str}x", "*system\Setup*") == 0) { .echo _str;g } .else { .echo _str }

    Software\Policies\Microsoft\Windows NT\DnsClient

     

    (Several entries have been removed here to conserve space in the blog)

     

    kd> .if (@$t0 != 0) { as /mu ${/v:_str} @$t0 } .else { ad /q ${/v:_str} }

    kd> .if ($spat("${_str}x", "*system\Setup*") == 0) { .echo _str;g } .else { .echo _str }

    System\CurrentControlSet\Services\DnsCache\Parameters

    kd> r$t0=poi(esp+8)

    kd> .if (@$t0 != 0) { as /mu ${/v:_str} @$t0 } .else { ad /q ${/v:_str} }

    kd> .if ($spat("${_str}x", "*system\Setup*") == 0) { .echo _str;g } .else { .echo _str }

    Software\Policies\Microsoft\Windows NT\DnsClient

    kd> r$t0=poi(esp+8)

    kd> .if (@$t0 != 0) { as /mu ${/v:_str} @$t0 } .else { ad /q ${/v:_str} }

    kd> .if ($spat("${_str}x", "*system\Setup*") == 0) { .echo _str;g } .else { .echo _str }

    System\CurrentControlSet\Services\DNS

    kd> r$t0=poi(esp+8)

    kd> .if (@$t0 != 0) { as /mu ${/v:_str} @$t0 } .else { ad /q ${/v:_str} }

    kd> .if ($spat("${_str}x", "*system\Setup*") == 0) { .echo _str;g } .else { .echo _str }

    System\Setup

    ADVAPI32!RegOpenKeyExW:

    76e8f09d 8bff            mov     edi,edi

     

    Bingo! It detected that System\Setup was the string passed in and remained broken in allowing us to start any necessary debugging.

     

     

    kd> kv

    ChildEBP RetAddr  Args to Child             

    077aeae0 760dfb75 80000002 760dfb94 00000000 ADVAPI32!RegOpenKeyExW (FPO: [Non-Fpo])

    WARNING: Frame IP not in any known module. Following frames may be wrong.

    077aed18 760dfbb7 00000004 00000000 0000003a 0x760dfb75

    077aee20 7678c1b2 00000000 00000000 00000000 0x760dfbb7

    077aee24 00000000 00000000 00000000 00000000 kernel32!WaitForSingleObject+0x12 (FPO: [Non-Fpo])

     

    kd> dc 760dfb94

    760dfb94  00790053 00740073 006d0065 0053005c  S.y.s.t.e.m.\.S.

    760dfba4  00740065 00700075 ff530000 4ce81075  e.t.u.p...S.u..L

    760dfbb4  8bffffff 0ffb3bf8 0032c284 89fb3b00  .....;....2..;..

    760dfbc4  840ff47d 0000012b f875ff56 013ee857  }...+...V.u.W.>.

    760dfbd4  f08b0000 840ff33b 000000f2 0000c3e9  ....;...........

    760dfbe4  90909000 ff8b9090 83ec8b55 458b1cec  ........U......E

    760dfbf4  db335308 fc458956 e8f45d89 fffff583  .S3.V.E..]......

    760dfc04  ff18758b 1e891475 fff5a9e8 89c33bff  .u..u........;..

     

    Debugger scripts are another cool tool to add to your debugging arsenal. I’ve personally benefited from this type of scripting many times. Again check the debugger help files for more information on the available commands.

     

    Share this post :
  • Ntdebugging Blog

    Corrupt Page Table Pages Caught in the MDL

    • 1 Comments

    Hello all, Scott Olson here again to share another interesting issue I worked on a while back.  The issue was that after upgrading to Windows XP Service Pack 2 the system would experience random bug checks with memory corruption.  Interestingly, there was a very specific pattern to the corruption - it looked like a PFN address and flags were randomly placed into the page table page in several places in the process.  The memory manager would never do this type of thing and I suspected that a driver was editing user page table pages, which should never be done.

    Let's take a look at the stack:

    kd> kb
      *** Stack trace for last set context - .thread/.cxr resets it
    ChildEBP RetAddr  Args to Child
    f15b1308 80523096 c00862d8 10c5b000 00000000 nt!MiDeletePte+0x198
    f15b13d0 80519776 000001d8 10d20fff 00000000 nt!MiDeleteVirtualAddresses+0x164
    f15b13ec 805b1d74 10c20000 10d20fff f15b14a4 nt!MiDeleteFreeVm+0x20
    f15b148c 8054060c ffffffff 049c6aa8 049c6ab0 nt!NtFreeVirtualMemory+0x42e
    f15b148c 7c90eb94 ffffffff 049c6aa8 049c6ab0 nt!KiFastCallEntry+0xfc
    03e4a398 7c90da54 7c8209b3 ffffffff 049c6aa8 ntdll!KiFastSystemCallRet
    03e4a39c 7c8209b3 ffffffff 049c6aa8 049c6ab0 ntdll!NtFreeVirtualMemory+0xc

    Here is the page table entry for the virtual address:

    kd> !pte 10c5b000
                   VA 10c5b000
    PDE at 00000000C0600430    PTE at 00000000C00862D8
    contains 000000003FC6F867  contains 0000000015E0086F
    pfn 3fc6f ---DA--UWEV    pfn 15e00 ---DA-TUWEV

    This shows that the value 15e0086f is incorrectly put into the page table pages.  This bad value corresponds to a write-through mapping to a range allocated via a call to MmAllocatePagesForMdl.

    c00862d0  00000000 00000000 15e0086f 00000000
    c00862e0  00000000 00000000 00000000 00000000
    c00862f0  00000000 00000000 00000000 00000000
    c0086300  00000000 00000000 00000000 00000000
    c0086310  00000000 00000000 00000000 00000000
    c0086320  00000000 00000000 00000000 00000000
    c0086330  00000000 00000000 00000000 00000000
    c0086340  00000000 00000000 00000000 00000000
    c0086350  00000000 00000000 15e0086f 00000000
    c0086360  00000000 00000000 00000000 00000000
    c0086370  00000000 00000000 00000000 00000000
    c0086380  00000000 00000000 00000000 00000000
    c0086390  00000000 00000000 00000000 00000000
    c00863a0  00000000 00000000 00000000 00000000
    c00863b0  00000000 00000000 00000000 00000000
    c00863c0  00000000 00000000 00000000 00000000
    c00863d0  00000000 00000000 00000000 00000000
    c00863e0  00000000 00000000 00000000 00000000
    c00863f0  00000000 00000000 00000000 00000000
    c0086400  00000000 00000000 00000000 00000000
    c0086410  00000000 00000000 00000000 00000000
    c0086420  00000000 00000000 00000000 00000000
    c0086430  00000000 00000000 00000000 00000000
    c0086440  00000000 00000000 00000000 00000000
    c0086450  00000000 00000000 00000000 00000000
    c0086460  15e0086f 00000000 00000000 00000000
    c0086470  00000000 00000000 00000000 00000000
    c0086480  00000000 00000000 00000000 00000000
    c0086490  00000000 00000000 00000000 00000000
    c00864a0  00000000 00000000 00000000 00000000
    c00864b0  00000000 00000000 00000000 00000000
    c00864c0  00000000 00000000 00000000 00000000
    c00864d0  00000000 00000000 00000000 00000000
    c00864e0  15e0086f 00000000 00000000 00000000
    c00864f0  00000000 00000000 00000000 00000000
    c0086500  00000000 00000000 00000000 00000000
    c0086510  00000000 00000000 00000000 00000000
    c0086520  00000000 00000000 00000000 00000000
    c0086530  00000000 00000000 00000000 00000000
    c0086540  00000000 00000000 00000000 00000000
    c0086550  00000000 00000000 00000000 00000000
    c0086560  15e0086f 00000000 00000000 00000000
    c0086570  00000000 00000000 00000000 00000000
    c0086580  00000000 00000000 00000000 00000000
    c0086590  00000000 00000000 00000000 00000000
    c00865a0  00000000 00000000 00000000 00000000
    c00865b0  00000000 00000000 00000000 00000000
    c00865c0  00000000 00000000 00000000 00000000
    c00865d0  00000000 00000000 00000000 00000000
    c00865e0  00000000 00000000 00000000 00000000
    c00865f0  00000000 00000000 00000000 00000000
    c0086600  00000000 00000000 00000000 00000000
    c0086610  00000000 00000000 00000000 00000000
    c0086620  00000000 00000000 00000000 00000000
    c0086630  00000000 00000000 00000000 00000000
    c0086640  00000000 00000000 00000000 00000000
    c0086650  00000000 00000000 00000000 00000000
    c0086660  00000000 00000000 15e0086f 00000000
    c0086670  00000000 00000000 00000000 00000000
    c0086680  00000000 00000000 00000000 00000000
    c0086690  00000000 00000000 00000000 00000000
    c00866a0  00000000 00000000 00000000 00000000
    c00866b0  00000000 00000000 00000000 00000000
    c00866c0  00000000 00000000 00000000 00000000
    c00866d0  00000000 00000000 00000000 00000000
    c00866e0  00000000 00000000 15e0086f 00000000
    c00866f0  00000000 00000000 00000000 00000000
    c0086700  00000000 00000000 00000000 00000000
    c0086710  00000000 00000000 00000000 00000000
    c0086720  00000000 00000000 00000000 00000000
    c0086730  00000000 00000000 00000000 00000000
    c0086740  00000000 00000000 00000000 00000000
    c0086750  00000000 00000000 00000000 00000000
    c0086760  00000000 00000000 15e0086f 00000000
    c0086770  00000000 00000000 00000000 00000000
    c0086780  00000000 00000000 00000000 00000000
    c0086790  00000000 00000000 00000000 00000000
    c00867a0  00000000 00000000 00000000 00000000
    c00867b0  00000000 00000000 00000000 00000000
    c00867c0  00000000 00000000 00000000 00000000
    c00867d0  00000000 00000000 00000000 00000000
    c00867e0  00000000 00000000 00000000 00000000

    kd> !pfn 15e00
        PFN 00015E00 at address 81BCA800
        flink       00000000  blink / share count 00000001  pteaddress 000AF001
        reference count 0002   Cached     color 0
        restore pte 00000080  containing page        FFEDCB  Active       RW
            ReadInProgress WriteInProgress

    The driver also has an outstanding call MmProbeAndLockPages call on the pages indicated by the reference count of 2.  Thinking that this pfn value is incorrect I decided to search for this value and see what I could find.

    kd> s -d 80000000 l?7fffffff 00015e00
    8022d534  00015e00 0001f190 00041d50 0001f140  .^......P...@...
    86cacbf4  00015e00 0000cd1c 0000cc27 0000cc08  .^......'.......
    86e25cdc  00015e00 0a130005 e56c6946 00000000  .^......Fil.....

    I found a few entries but the middle one looks like it could be an MDL allocation.  So I verified this:

    kd> !pool 86cacbf4 2
    Pool page 86cacbf4 region is Nonpaged pool
    *86cacbd0 size:   80 previous size:   28  (Allocated) *Mdl
                    Pooltag Mdl  : Io, Mdls

    Yes this is an MDL, let's inspect it:

    kd> dt nt!_MDL 86cacbd8
       +0x000 Next             : (null)
       +0x004 Size             : 32
       +0x006 MdlFlags         : 138
       +0x008 Process          : (null)
       +0x00c MappedSystemVa   : 0x00004000
       +0x010 StartVa          : 0xf7baa000
       +0x014 ByteCount        : 0xfff
       +0x018 ByteOffset       : 0

    Notice that the page 15e00 is in the MDL’s page list.

    kd> dd 86cacbd8+1c
    86cacbf4  00015e00 0000cd1c 0000cc27 0000cc08
    86cacc04  0000cc09 0000cc0a 0000cc0b 0000cbec
    86cacc14  0000cbed 0000cbee 0000cbef 0000cbd0
    86cacc24  0000cbd1 0000cbd2 0000cbd3 0000cbd4
    86cacc34  0000cbd5 0000cbd6 00000000 00000000
    86cacc44  00000000 00000000 00000000 00010010

    Next I wanted to see if I could find a driver that may have references to this MDL and I found two:

    kd> s -d 80000000 l?7fffffff 86cacbd8
    86f9c6a0  86cacbd8 0000003d 00000000 0000636a  ....=.......jc..
    86fc7e68  86cacbd8 00000001 00000001 00000000  ................

    Now let's see who owns these

    kd> !pool 86f9c6a0 2
    Pool page 86f9c6a0 region is Nonpaged pool
    *86f9c618 size:   d8 previous size:   30  (Allocated) *Crpt
                    Pooltag Crpt   : Memory corruption driver

    kd> !pool 86fc7e68 2
    Pool page 86fc7e68 region is Nonpaged pool
    *86fc7e00 size:   98 previous size:   40  (Allocated) *Crpt
                    Pooltag Crpt   : Memory corruption driver

    This gives us a pretty convincing probability that this driver is at fault.  So now you may ask, "Why did this problem only start after applying Service Pack 2?"  By default when you install Server Pack 2, Data Execution Prevention (DEP) is enabled on systems that support it.  The support for DEP is in the PAE kernel which uses extra bits to describe the page table entries.  In this crash the solution was to disable DEP until the driver could be corrected.  The driver was incorrectly using the memory mappings by ignoring the extra bits in the page number and causing the memory corruption by writing to the wrong page.  For more information on default DEP settings and enabling/disabling it in Windows see the following article.

    899298  The "Understanding Data Execution Prevention" help topic incorrectly states the default setting for DEP in Windows Server 2003 Service Pack 1
    http://support.microsoft.com/default.aspx?scid=kb;EN-US;899298

  • Ntdebugging Blog

    Understanding NTFS Compression

    • 5 Comments

    As our group’s file systems expert, I see most of the common problems associated with the use/misuse of NTFS compression.  Before you check the “Compress contents to save disk space” checkbox, it might be good to understand how this affects the happy little bits, bytes, & nibbles running around on the disk.

    Often backup applications receive ERROR_DISK_FULL errors attempting to back up compressed files and this causes quite a bit of confusion when there are still several gigabytes of free space on the drive.  Other issues may also occur when copying compressed files.  The goal of this blog is to give the reader a more thorough understanding of what really happens when you compress NTFS files.

    Compression Units…

    NTFS uses a parameter called the “compression unit” to define the granularity and alignment of compressed byte ranges within a data stream.  The size of the compression unit is based entirely on NTFS cluster size (refer to the table below for details).  In the descriptions below, the abbreviation "CU" is used to describe a Compression Unit and/or its size.

    The default size of the CU is 16 clusters, although the actual size of the CU really depends on the cluster size of the disk.  Below is a chart showing the CU sizes that correspond to each of the valid NTFS cluster sizes.

    Cluster Size Compression Unit Compression Unit (hex bytes)
    512 Bytes 8 KB 0x2000
    1 KB 16 KB 0x4000
    2 KB 32 KB 0x8000
    4 KB 64 KB 0x10000
    8 KB 64 KB 0x10000
    16 KB 64 KB 0x10000
    32 KB 64 KB 0x10000
    64 KB 64 KB 0x10000

    Native NTFS compression does not function on volumes where the cluster size is greater than 4KB, but sparse file compression can still be used.

    NTFS Sparse Files…

    The Sparse Files features of NTFS allow applications to create very large files consisting mostly of zeroed ranges without actually allocating LCNs (logical clusters) for the zeroed ranges.

    For the code-heads in the audience, this can be done by calling DeviceIoControl with the FSCTL_SET_SPARSE IO control code as shown below.

    BOOL SetSparse(HANDLE hFile)
         {
               DWORD Bytes;
               return DeviceIoControl(hFile, FSCTL_SET_SPARSE, NULL, 0, NULL, 0, &Bytes, NULL);
         }

    To specify a zeroed range, your application must then call the DeviceIoControl with the FSCTL_SET_ZERO_DATA IO control code.

    BOOL ZeroRange(HANDLE hFile, LARGE_INTEGER RangeStart, LONGLONG RangeLength)
         {
                FILE_ZERO_DATA_INFORMATION  FileZeroData;
                DWORD Bytes;
                FileZeroData.FileOffset.QuadPart = RangeStart.QuadPart;
                FileZeroData.BeyondFinalZero.QuadPart = RangeStart.QuadPart + RangeLength + 1;
                return DeviceIoControl(  hFile,
                                         FSCTL_SET_ZERO_DATA,
                                         &FileZeroData,
                                         sizeof(FILE_ZERO_DATA_INFORMATION),
                                         NULL,
                                         0,
                                         &Bytes,
                                         NULL);
         }

    Because sparse files don't actually allocate space for zeroed ranges, a sparse file can be larger than the parent volume.  To do this, NTFS creates a placeholder VCN (virtual cluster number) range with no logical clusters mapped to them. 

    Any attempt to access a “sparsed” range would result in NTFS returning a buffer full of zeroes.  Accessing an allocated range would result in a normal read of the allocated range.  When data is written to a sparse file, an allocated range is created which is exactly aligned with the compression unit boundaries containing the byte(s) written.  Refer to the example below.  If a single byte write occurs for virtual cluster number VCN 0x3a, then all of Compression Unit 3 (VCN 0x30 - 0x3f) would become an allocated LCN (logical cluster number) range.  The allocated LCN range would be filled with zeroes and the single byte would be written to the appropriate byte offset within the target LCN.

    [...] - ALLOCATED
    (,,,) - Compressed
    {   } - Sparse (or free) range
       / 00000000000000000000000000000000000000000000000000000000000000000000000000000000
    VCN  00000000000000001111111111111111222222222222222233333333333333334444444444444444
       \ 0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef
                CU0             CU1            CU2             CU3             CU4
         |++++++++++++++||++++++++++++++||++++++++++++++||++++++++++++++||++++++++++++++|
         {                                              }[..............]{              }
    Extents  {
        VCN = 0x0 LEN = 0x30                        CU0 - CU2
        VCN = 0x30 LEN = 0x10: LCN = 0x2a21f        CU3
        VCN = 0x10 LEN = 0x10                       CU4
    }

    Below is a screen shot of a 2GB file that was created using the sparse file API’s.

    NewVolume

    Notice that this volume is only 76.9 MB, yet it has a 2 GB file in the root folder.

    NewVolumeProperties

    If we attempt to duplicate this sparse file with the COPY command, it fails.  This is because COPY doesn’t know how to duplicate the sparse attributes on the file, so it attempts to create a real 2GB file in the root of D:.  This can occur in production when attempting to move a large database file from one volume to another.  If you have a database application that uses sparse attributes, then it is a good practice to use the database software’s backup / restore features when moving the database to a different volume.

    CmdPromptCopy

    Now let’s look at the file’s properties with the COMPACT utility.  Notice that the file shows up as compressed, and it has a huge compression ratio.

    CmdPromptCompact

    If you go back and look at the file properties from EXPLORER, you will notice that there is no compression checkbox (or any other indication that the file is compressed).  This is because the shell does not check for the sparse bit on the file.

    AdvancedAttributes

    In short, use caution when moving sparse files from one location to another.  Applications tell the file system the offsets of zeroed ranges, so you should always leave the management of sparse files to the application that created them.  Moving or copying the sparse files manually may cause unexpected results.

    NTFS Compression…

    Now that we have discussed sparse files, we will move on to conventional NTFS compression.

    NTFS compresses files by dividing the data stream into CU’s (this is similar to how sparse files work).  When the stream contents are created or changed, each CU in the data stream is compressed individually.  If the compression results in a reduction by one or more clusters, the compressed unit will be written to disk in its compressed format.  Then a sparse VCN range is tacked to the end of the compressed VCN range for alignment purposes (as shown in the example below).  If the data does not compress enough to reduce the size by one cluster, then the entire CU is written to disk in its uncompressed form.

    This design makes random access very fast since only one CU needs to be decompressed in order to access any single VCN in the file.  Unfortunately, large sequential access will be relatively slower since decompression of many CU’s is required to do sequential operations (such as backups).

    In the example below, the compressed file consists of six sets of mapping pairs (encoded file extents).  Three allocated ranges co-exist with three sparse ranges.  The purpose of the sparse ranges is to maintain VCN alignment on compression unit boundaries.  This prevents NTFS from having to decompress the entire file if a user wants to read a small byte range within the file.  The first compression unit (CU0) is compressed by 12.5% (which makes the allocated range smaller by 2 VCNs).  An additional free VCN range is added to the file extents to act as a placeholder for the freed LCNs at the tail of the CU.  The second allocated compression unit (CU1) is similar to the first except that the CU compressed by roughly 50%.

    NTFS was unable to compress CU2 and CU3, but part of CU4 was compressible by 69%.  For this reason, CU2 & CU3 are left uncompressed while CU4 is compressed from VCNs 0x40 to 0x44.  Thus, CU2, CU3, and CU4 are a single run, but the run contains a mixture of compressed & uncompressed VCNs.

    NOTE:  Each set of brackets represents a contiguous run of allocated or free space.  One set of NTFS mapping pairs describes each set of brackets.

    [...] - ALLOCATED
    (,,,) - Compressed
    {   } - Sparse (or free) range
       / 00000000000000000000000000000000000000000000000000000000000000000000000000000000
    VCN  00000000000000001111111111111111222222222222222233333333333333334444444444444444
       \ 0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef
               CU0             CU1             CU2             CU3             CU4
         |++++++++++++++||++++++++++++++||++++++++++++++||++++++++++++++||++++++++++++++|
         (,,,,,,,,,,,,){}(,,,,,,){      }[...............................,,,,){         }
    Extents  {
        VCN = 0x0 LEN = 0xe : LCN = 0x29e32d        CU0
        VCN = 0xe LEN = 0x2                         CU0
        VCN = 0x10 LEN = 0x8 : LCN = 0x2a291f       CU1
        VCN = 0x18 LEN = 0x8                        CU1
        VCN = 0x20 LEN = 0x25 : LCN = 0x39dd49      CU2 - CU4
        VCN = 0x45 LEN = 0xb                        CU4
    }

    Now we come to the part where we describe limitations of this design.  Below are some examples of what happens when things go wrong while reading / writing compressed files.

    Disk full error during a backup read operation or file copy…

    1. NTFS determines which compression unit is being accessed.
    2. The compression unit’s entire allocated range is read.
    3. If the unit is not compressed, then we skip to step 5.  Otherwise, NTFS would attempt to reserve (but not allocate) the space required to write the decompressed CU back to disk.  If insufficient free space exists on the disk, then the application might get an ERROR_DISK_FULL during the read.
    4. The CU would be decompressed in memory.
    5. The decompressed byte range would be mapped into cache and returned to the requesting application.
    6. If part of the CU is altered in cache…
      1. The reserved disk space from step 3 would become allocated space.
      2. The CU contents would be compressed and flushed back to the newly allocated region (the LCN location will usually not change).
      3. Any recoverable disk space within the CU would be freed.

    Failure to copy a large file to a compressed folder…

    This is the most common problem seen with compression, and currently the solution is to educate users about limitations.  NTFS compression creates approximately one file fragment for every 16 clusters of data.  Because the max cluster size allowed for standard compression is 4K, the largest compression unit allowed is 64KB.  In order to compress a 100 GB file into 64KB sections, you could potentially end up with 1,638,400 fragments.  Encoding 1,638,400 fragments becomes problematic for the file system and can cause a failure to create the compressed file.  On Vista and later, the file copy will fail with STATUS_FILE_SYSTEM_LIMITATION.  On earlier releases, the status code would be STATUS_INSUFFICIENT_RESOURCES.  If possible, avoid using compression on files that are large, or critical to system performance.

    I received feedback from the NTFS Principal Development Lead about this blog.  Fortunately, most of the feedback was good, but he asked that I add a maximum size recommendation.  According to our development team’s research, 50-60 GB is a “reasonable size” for a compressed file on a volume with a 4KB cluster size.  This “reasonable size” goes down sharply for volumes with smaller cluster sizes.

    Compressing .VHD (Virtual Hard Disk) files causes slow virtual machine performance…

    A popular hammer company received a call from a customer complaining “It sure hurts when I hit my thumb with your hammers!”  Likewise, if you compress .VHD files, it is going to be a painful experience, so please compress responsibly.

    Best Regards,

    Dennis Middleton “The NTFS Doctor”

  • Ntdebugging Blog

    NTDebugging Puzzler 0x00000006: Invalid Handle - can you handle it?

    • 5 Comments

    Hi NTDebuggers, this week’s puzzler just so happens to match its number: 0x000000006 = ERROR_INVALID_HANDLE.   That said, let me give you a scenario and the challenge will be to provide the best action plan to isolate the problem.  This should include an explanation of what types of code problems cause invalid handles.  

     

    Scenario 1 :  You have a customer or client that is getting invalid handle errors.  This is causing unhandled exceptions and the application is crashing.   What action plan do you give the customer?  In scenario 1, you don’t have access to the remote machine.  Your customer will execute any action plan you give them, gather the data, and provide you with the results. 

     

    Scenario 2: You have full access to the process and you can debug it live on your own personal machine; you even have source access.  What do you do?

     

    Good luck and happy debugging.

     


     

    [Update: our answer. Posted 5/27/2008]

     

    Hi NTDebuggers, if you’ve been debugging for a while I’m sure you’ve run into an Invalid Handle Error 6 (aka ERROR_INVALID_HANDLE or “The handle is invalid”).  If you see it in your debugger output it indicates you have tried to pass an invalid handle to a call that is either expecting a different handle type or a the value for the handle its self is not valid.   To understand this error in more detail you have to understand what handles are and where they live in the operating system.  Each process has a handle table in the kernel.  Think of this as an array or table of objects in kernel mode.   The handle values are actually multiples of four.  Review http://blogs.msdn.com/oldnewthing/archive/2005/01/21/358109.aspx to see for what the bottom 3 bits are used for.  For this illustration we will be numbering them 1,2,3 etc.

     

     

    When you open a file you will get back a simple pointer-sized value that in the handle table points to a file object.  In this case let’s say you opened a file and you received a handle value of 2.  If you envision the handle table you might see something like this.

     

    ------------- USER ADDRESS SPACE FOR YOUR PROCESS -------------

     

    Your handle for your open file is 2

     

    ------------- KERNEL ADDRESS SPACE FOR YOUR PROCESS -------------

     

    Handle table

     

    Handle [1]=Pointer to ->Reg

    Handle [2]=Pointer to ->File      <<<<<< Your handle is 2  It points to a file object

    Handle [3]=Pointer to ->File

    Handle [4]=Pointer to ->Reg

    Handle [5]=Pointer to ->Process

    Handle [6]=Pointer to ->Thread

    Handle [7]=Pointer to ->Thread

    Handle [8]=Pointer to ->File

    Handle [9]=Pointer to ->File

     

    In the debugger, you can see the handle table is referenced from the EPROCESS object for each process.  In the !process output it is called the ObjectTable.

     

    0: kd> !process 85df8da0 

    PROCESS 85df8da0  SessionId: 0  Cid: 0114    Peb: 7ffda000  ParentCid: 0268

        DirBase: 0acc07e0  ObjectTable: e4b6a4c8  HandleCount:  49

     

     

    Before we get into troubleshooting let’s look at a couple of scenarios for bad handles.

     

    In the first bad handle scenario, let’s say your application closes handle 2 using CloseHandle(), however your application keeps the value 2 in a variable somewhere. In this scenario we will assume no other code has come along and opened a new object that is occupying element 2 of the handle table.  If you go to make a file I/O call using handle 2 the object manager will catch this and your app will get an invalid handle exception.  This is somewhat harmless but your app my crash if you don’t catch the exception and your I/O will obviously not complete.

     

    In the second scenario we’ll say that your file I/O code closes the handle, and holds on to that handle value.  Meanwhile another part of your application opens a registry key or some other object that is not a file object.  Now your file I/O code goes to use handle 2 again.  The problem is it’s not a handle to a file now.  When the registry code opened the registry key element 2 was available so we now have a registry object in that location.  If you go to do a file I/O on handle 2 you will get an invalid handle message.

     

    Now in our final scenario, we close a file handle in our file I/O code and keep the handle around.  In this case though we have different code that is also doing file I/O, It opens a file and gets handle value 2.  Now you have big trouble because if the first file I/O code now writes to handle 2, and it should not be because it closed it!  This handle now belongs to a different file.  This means you have code basically doing a write into an unknown region for that code’s context and file format.   This will result in file corruption and no invalid handle error.

     

    So this really comes down to best practice.  If you’re writing code and close a handle NULL IT OUT!  That way you guarantee it will not be reused accidently at some other point.

     

    CloseHandle(MyHandle);

    MyHandle = NULL;

     

     

    On to debugging:  So how do you track this down? For the remote scenario you could give your customer the following action plan:

     

     

    1.       Start gflags and under the image file tab, enter your application name in the image edit control.

     

    2.       Check Create User mode stack trace database.

     

    3.       Start the process under windbg.   “WinDBG processname.exe”

     

    4.       In windbg Run !htrace -enable

     

    5.       Do a sxe ch.  This will cause windbg to break on an invalid handle exception.

     

    6.       Run sx to confirm  “ch - Invalid handle – break” should be set.

     

    7.       Enter g for go.

     

    8.       Let the process run until you get an invalid handle message.  The debugger should break in.

     

    9.       Now all you have to do is run !htrace

     

     

    Htrace will dump out all the call stacks for each handle that was opened or closed.

    You need to take the invalid handle value and search backward to see where it was last opened, and where it was closed before that.  It’s likely that the close before the last open is your culprit code path.  Make sure that suspect close nulls that handle value out so it does not reuse it. 

     

    In the live debug scenario following the same procedure.  In this case you have the luxury of going back and setting a breakpoint in the offending call stack that freed the handle before the reuse.  You can then figure out why it was not zeroed out.

     

    Good luck and happy debugging.

     

     





  • Ntdebugging Blog

    How to track down High CPU in User Mode Applications - A live debug!

    • 6 Comments

     

    Written by Jeff Dailey.

     

    Hello NTDebuggers,  I’d like to talk about a common issue we deal with on a regular basis.   We are often tasked with finding what functions are using CPU within a user mode process / application.  Typically a user will find an application that is using more CPU then they expect it to, and this can affect overall system performance and responsiveness.   

     

    For this exercise I’ve written some contrived sample code called EATCPU.  It’s included at the bottom of the blog post.    The following image of task manager shows EATCPU consuming 41% CPU time.   A customer or user may tell you this does not “normally” happen.  It’s always good to ask what “normal” is.   In this case we will say normal is ~10%.

    clip_image002[6]

    The best case scenario is a live debug on the process that is running at high CPU levels.  If you’re lucky enough to have a customer / user that will let you do a remote debug, and the problem reproduces on demand, you can take the following action.

     

    You need to install the debugging tools for Windows, and set your symbol path.  If at all possible acquire the symbols for the application you are debugging.  We’ll assume that you are the expert that supports said program.  If it’s written in house, get the symbols from the developer.  If it’s from a third party, that vendor may be willing to provide you with symbols for their product.  Microsoft has most of the symbols for our products available on our symbol server on the internet (!sympath srv*DownstreamStore*http://msdl.microsoft.com/download/symbols.)

     

    The next thing is to attach windbg.exe to the process in question.

    From the debuggers directory, type TLIST, this will list your process.  Get the process id and then run WinDBG.EXE –p PROCESSID, or if your debugging a program like EATCPU, you can run WINDBG C:\program\EATCPU.EXE.

     

    After the debugger is attached or the process is started in the debugger, reproduce the problem.


     

    Microsoft (R) Windows Debugger  Version 6.8.0001.0

    Copyright (c) Microsoft Corporation. All rights reserved.

     

     

    ***** WARNING: Your debugger is probably out-of-date.

    *****          Check http://dbg for updates.

     

    CommandLine: eatcpu.exe

    Symbol search path is: srv*C:\symbols*\\symbols\symbols

    Executable search path is:

    ModLoad: 00400000 0041a000   eatcpu.exe

    ModLoad: 779b0000 77b00000   ntdll.dll

    ModLoad: 76780000 76890000   C:\Windows\syswow64\kernel32.dll

    ModLoad: 62bb0000 62cd1000   C:\Windows\WinSxS\x86_microsoft.vc80.debugcrt_1fc8b3b9a1e18e3b_8.0.50727.762_none_24c8a196583ff03b\MSVCR80D.dll

    ModLoad: 75cb0000 75d5a000   C:\Windows\syswow64\msvcrt.dll

    (1090.164): Break instruction exception - code 80000003 (first chance)

    eax=00000000 ebx=00000000 ecx=712b0000 edx=00000000 esi=fffffffe edi=77a90094

    eip=779c0004 esp=0017faf8 ebp=0017fb28 iopl=0         nv up ei pl zr na pe nc

    cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000246

    ntdll!DbgBreakPoint:

    779c0004 cc              int     3

    0:000> g

    (1090.11d4): Break instruction exception - code 80000003 (first chance)

    eax=7efa3000 ebx=00000000 ecx=00000000 edx=77a1d894 esi=00000000 edi=00000000

    eip=779c0004 esp=0109ff74 ebp=0109ffa0 iopl=0         nv up ei pl zr na pe nc

    cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000246

    ntdll!DbgBreakPoint:

    779c0004 cc              int     3

    0:007> .sympath SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols

    Symbol search path is: SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols

    0:007> g

    (1090.17d4): Break instruction exception - code 80000003 (first chance)

    eax=7efa3000 ebx=00000000 ecx=00000000 edx=77a1d894 esi=00000000 edi=00000000

    eip=779c0004 esp=0109ff74 ebp=0109ffa0 iopl=0         nv up ei pl zr na pe nc

    cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000246

    ntdll!DbgBreakPoint:

    779c0004 cc              int     3

     

    Once the problem has started run the !runaway command.

     

    0:007> !runaway

     User Mode Time

      Thread       Time

       2:c04       0 days 0:01:08.827  ß Note this thread, thread 2:c04 is using more CPU than any other.

       7:17d4      0 days 0:00:00.000  ß Note the other threads are using very little if any CPU.

       6:1a4c      0 days 0:00:00.000

       5:d20       0 days 0:00:00.000

       4:157c      0 days 0:00:00.000

       3:1b28      0 days 0:00:00.000

       1:1134      0 days 0:00:00.000

       0:164       0 days 0:00:00.000

     

    0:007> ~2s  ß Using the thread number 2, set the thread context with the ~Ns command.

    *** WARNING: Unable to verify checksum for eatcpu.exe

    eax=cccccccc ebx=00b93c48 ecx=0000002b edx=00b937a8 esi=00000000 edi=00d9fcf0

    eip=0041169c esp=00d9fcd0 ebp=00d9fd9c iopl=0         nv up ei pl nz na po nc

    cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010202

    eatcpu!checkSomething+0x1c:

    0041169c f3ab            rep stos dword ptr es:[edi]  es:002b:00d9fcf0=cccccccc

    0:002> k    ß Dump the call stack using k.

     

    If you look at the following call stack, the applications code in this thread starts where you see EATCPU, the code before that is C  Runtime code for begin thread. I want to trace all the code that is running under being thread.  The assumption here is that I’ll find something looping and eating CPU.  To do this I will use the WT command.   However first I need to specify a beginning address for WT to start trace at.

     

    Let's use UF to Unassemble the Function that started our code by taking the return address of  eatcpu!myThreadFunction.

     

    ChildEBP RetAddr 

    00d9fd9c 00411657 eatcpu!checkSomething+0x1c [c:\source\eatcpu\eatcpu\eatcpu.cpp @ 49]

    00d9fe74 004115a8 eatcpu!trySomething+0x27 [c:\source\eatcpu\eatcpu\eatcpu.cpp @ 45]

    00d9ff58 62bb4601 eatcpu!myThreadFunction+0x38 [c:\source\eatcpu\eatcpu\eatcpu.cpp @ 35]

    00d9ff94 62bb459c MSVCR80D!_beginthread+0x221

    00d9ffa0 768019f1 MSVCR80D!_beginthread+0x1bc

    00d9ffac 77a2d109 kernel32!BaseThreadInitThunk+0xe

    00d9ffec 00000000 ntdll!_RtlUserThreadStart+0x23 [d:\vistartm\base\ntos\rtl\rtlexec.c @ 2695]

    0:002> uf 004115a8 ß This command will unassemble the function at this address beginning to end.  

     

    0:007> uf 004115a8

    eatcpu!myThreadFunction [c:\source\eatcpu\eatcpu\eatcpu.cpp @ 30]:

       30 00411570 55              push    ebp

       30 00411571 8bec            mov     ebp,esp

       30 00411573 81eccc000000    sub     esp,0CCh

       30 00411579 53              push    ebx

       30 0041157a 56              push    esi

       30 0041157b 57              push    edi

       30 0041157c 8dbd34ffffff    lea     edi,[ebp-0CCh]

       30 00411582 b933000000      mov     ecx,33h

       30 00411587 b8cccccccc      mov     eax,0CCCCCCCCh

       30 0041158c f3ab            rep stos dword ptr es:[edi]

       31 0041158e 8b4508          mov     eax,dword ptr [ebp+8]

       31 00411591 8945f8          mov     dword ptr [ebp-8],eax

     

    eatcpu!myThreadFunction+0x24 [c:\source\eatcpu\eatcpu\eatcpu.cpp @ 33]:

       33 00411594 b801000000      mov     eax,1

       33 00411599 85c0            test    eax,eax

       33 0041159b 7410            je      eatcpu!myThreadFunction+0x3d (004115ad)

     

    eatcpu!myThreadFunction+0x2d [c:\source\eatcpu\eatcpu\eatcpu.cpp @ 35]:

       35 0041159d 8b45f8          mov     eax,dword ptr [ebp-8]

       35 004115a0 8b08            mov     ecx,dword ptr [eax]

       35 004115a2 51              push    ecx

       35 004115a3 e880faffff      call    eatcpu!ILT+35(?trySomethingYAHHZ) (00411028)

       35 004115a8 83c404          add     esp,4

       36 004115ab ebe7            jmp     eatcpu!myThreadFunction+0x24 (00411594)

     

    eatcpu!myThreadFunction+0x3d [c:\source\eatcpu\eatcpu\eatcpu.cpp @ 37]:

       37 004115ad 5f              pop     edi

       37 004115ae 5e              pop     esi

       37 004115af 5b              pop     ebx

       37 004115b0 81c4cc000000    add     esp,0CCh

       37 004115b6 3bec            cmp     ebp,esp

       37 004115b8 e8a1fbffff      call    eatcpu!ILT+345(__RTC_CheckEsp) (0041115e)

       37 004115bd 8be5            mov     esp,ebp

       37 004115bf 5d              pop     ebp

       37 004115c0 c3              ret

     

    0:002> wt -or 00411570 ß We will use WT to Watch Trace this code.  I’ve selected the starting address of the myThreadFunction function.  I’ve also specified –or to print the return value of each function.  Wt produces very visual output.  It allows you to quickly identify patterns in the way the code executes that would be much more difficult just using the T (TRACE) command. 

     

        8     0 [  0] ntdll!RtlSetLastWin32Error eax = 0

    >> No match on ret

        8     0 [  0] ntdll!RtlSetLastWin32Error

        2     0 [  0] eatcpu!checkSomething

        1     0 [  1]   eatcpu!ILT+345(__RTC_CheckEsp)

        2     0 [  1]   eatcpu!_RTC_CheckEsp eax = 0

        9     3 [  0] eatcpu!checkSomething

        1     0 [  1]   eatcpu!ILT+345(__RTC_CheckEsp)

        2     0 [  1]   eatcpu!_RTC_CheckEsp eax = 0

       12     6 [  0] eatcpu!checkSomething eax = 0

    >> No match on ret

       12     6 [  0] eatcpu!checkSomething

        7     0 [  0] eatcpu!trySomething

        1     0 [  1]   eatcpu!ILT+345(__RTC_CheckEsp)

        2     0 [  1]   eatcpu!_RTC_CheckEsp eax = 0

       10     3 [  0] eatcpu!trySomething eax = 0

    >> No match on ret

       10     3 [  0] eatcpu!trySomething

        9     0 [  0] eatcpu!myThreadFunction  ß I see a pattern,  a loop.   Beginning of loop.

        1     0 [  1]   eatcpu!ILT+35(?trySomethingYAHHZ)

       60     0 [  1]   eatcpu!trySomething

        1     0 [  2]     eatcpu!ILT+180(?checkSomethingYAHHZ)

       62     0 [  2]     eatcpu!checkSomething

        5     0 [  3]       kernel32!SetLastError

       16     0 [  3]       ntdll!RtlSetLastWin32Error eax = 0

       64    21 [  2]     eatcpu!checkSomething

        1     0 [  3]       eatcpu!ILT+345(__RTC_CheckEsp)

        2     0 [  3]       eatcpu!_RTC_CheckEsp eax = 0

       71    24 [  2]     eatcpu!checkSomething

        1     0 [  3]       eatcpu!ILT+345(__RTC_CheckEsp)

        2     0 [  3]       eatcpu!_RTC_CheckEsp eax = 0

       74    27 [  2]     eatcpu!checkSomething eax = 0

       67   102 [  1]   eatcpu!trySomething

        1     0 [  2]     eatcpu!ILT+345(__RTC_CheckEsp)

        2     0 [  2]     eatcpu!_RTC_CheckEsp eax = 0

       70   105 [  1]   eatcpu!trySomething eax = 0

       18   176 [  0] eatcpu!myThreadFunction  ß End of loop / beginning of loop

        1     0 [  1]   eatcpu!ILT+35(?trySomethingYAHHZ)

       60     0 [  1]   eatcpu!trySomething

        1     0 [  2]     eatcpu!ILT+180(?checkSomethingYAHHZ)

       62     0 [  2]     eatcpu!checkSomething

        5     0 [  3]       kernel32!SetLastError  ß Always look for what might be going wrong! Last error can give you a clue.  We are setting last error at the low level of the loop

     16     0 [  3]       ntdll!RtlSetLastWin32Error eax = 0

       64    21 [  2]     eatcpu!checkSomething

        1     0 [  3]       eatcpu!ILT+345(__RTC_CheckEsp)

        2     0 [  3]       eatcpu!_RTC_CheckEsp eax = 0

       71    24 [  2]     eatcpu!checkSomething

        1     0 [  3]       eatcpu!ILT+345(__RTC_CheckEsp)

        2     0 [  3]       eatcpu!_RTC_CheckEsp eax = 0

       74    27 [  2]     eatcpu!checkSomething eax = 0  ß Also note checkSomething is returning ZERO, this might indicate a problem.  You need to look at the code or assembler.

       67   102 [  1]   eatcpu!trySomething

        1     0 [  2]     eatcpu!ILT+345(__RTC_CheckEsp)

        2     0 [  2]     eatcpu!_RTC_CheckEsp eax = 0

       70   105 [  1]   eatcpu!trySomething eax = 0

       27   352 [  0] eatcpu!myThreadFunction ß End of loop / beginning of loop

        1     0 [  1]   eatcpu!ILT+35(?trySomethingYAHHZ)

       60     0 [  1]   eatcpu!trySomething

        1     0 [  2]     eatcpu!ILT+180(?checkSomethingYAHHZ)

       62     0 [  2]     eatcpu!checkSomething

        5     0 [  3]       kernel32!SetLastError

       16     0 [  3]       ntdll!RtlSetLastWin32Error eax = 0

       64    21 [  2]     eatcpu!checkSomething

        1     0 [  3]       eatcpu!ILT+345(__RTC_CheckEsp)

        2     0 [  3]       eatcpu!_RTC_CheckEsp eax = 0

       71    24 [  2]     eatcpu!checkSomething

        1     0 [  3]       eatcpu!ILT+345(__RTC_CheckEsp)

        2     0 [  3]       eatcpu!_RTC_CheckEsp eax = 0

       74    27 [  2]     eatcpu!checkSomething eax = 0

       67   102 [  1]   eatcpu!trySomething

        1     0 [  2]     eatcpu!ILT+345(__RTC_CheckEsp)

        2     0 [  2]     eatcpu!_RTC_CheckEsp eax = 0

       70   105 [  1]   eatcpu!trySomething eax = 0

       36   528 [  0] eatcpu!myThreadFunction ß End of loop / beginning of loop

        1     0 [  1]   eatcpu!ILT+35(?trySomethingYAHHZ)

       60     0 [  1]   eatcpu!trySomething

        1     0 [  2]     eatcpu!ILT+180(?checkSomethingYAHHZ)

       62     0 [  2]     eatcpu!checkSomething

        5     0 [  3]       kernel32!SetLastError

       16     0 [  3]       ntdll!RtlSetLastWin32Error eax = 0

       64    21 [  2]     eatcpu!checkSomething

        1     0 [  3]       eatcpu!ILT+345(__RTC_CheckEsp)

        2     0 [  3]       eatcpu!_RTC_CheckEsp eax = 0

       71    24 [  2]     eatcpu!checkSomething

        1     0 [  3]       eatcpu!ILT+345(__RTC_CheckEsp)

        2     0 [  3]       eatcpu!_RTC_CheckEsp eax = 0

       74    27 [  2]     eatcpu!checkSomething eax = 0

       67   102 [  1]   eatcpu!trySomething

    12930 instructions were executed in 12929 events (0 from other threads)

     

    Function Name                               Invocations MinInst MaxInst AvgInst

    eatcpu!ILT+180(?checkSomethingYAHHZ)                 69       1       1       1

    eatcpu!ILT+345(__RTC_CheckEsp)                      210       1       1       1

    eatcpu!ILT+35(?trySomethingYAHHZ)                    70       1       1       1

    eatcpu!_RTC_CheckEsp                                210       2       2       2

    eatcpu!checkSomething                                70      60      74      73

    eatcpu!myThreadFunction                               1     630     630     630

    eatcpu!trySomething                                  71      10      70      68

    kernel32!SetLastError                                70       5       5       5

    ntdll!RtlSetLastWin32Error                           70      16      16      16

     

    0 system calls were executed

     

    eax=cccccccc ebx=00b93c48 ecx=00000002 edx=00b937a8 esi=00000000 edi=00d9fe6c

    eip=0041164c esp=00d9fda8 ebp=00d9fe74 iopl=0         nv up ei pl nz na pe nc

    cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010206

    eatcpu!trySomething+0x1c:

    0041164c f3ab            rep stos dword ptr es:[edi]  es:002b:00d9fe6c=cccccccc

     

    0:002> !gle ß Now that we have broken in let's check and see what the last error is using !GLE (Get Last Error)  This dumps out the last error from the TEB.

    LastErrorValue: (Win32) 0x57 (87) - The parameter is incorrect.

    LastStatusValue: (NTSTATUS) 0xc000000d - An invalid parameter was passed to a service or function.

     

    0:007> bp kernel32!SetLastError  ß Lets set a breakpoint on last error to examine what is going on in the function calling it.

    0:007> g

    Breakpoint 1 hit

    eax=cccccccc ebx=00b93c48 ecx=00000000 edx=00b937a8 esi=00d9fcd0 edi=00d9fd9c

    eip=767913dd esp=00d9fcc8 ebp=00d9fd9c iopl=0         nv up ei pl zr na pe nc

    cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000246

    kernel32!SetLastError:

    767913dd 8bff            mov     edi,edi

    0:002> kv ß Get the call stack

    ChildEBP RetAddr 

    0:002> kv

    ChildEBP RetAddr  Args to Child             

    00d9fcc4 004116cb 00000057 00d9fe74 00000000 kernel32!SetLastError (FPO: [Non-Fpo])  ß 0x57 Invalid parameter error,  Why?

    00d9fd9c 00411657 00000000 00d9ff58 00000000 eatcpu!checkSomething+0x4b (FPO: [Non-Fpo]) (CONV: cdecl) [c:\source\eatcpu\eatcpu\eatcpu.cpp @ 57]

    00d9fe74 004115a8 00000000 00000000 00000000 eatcpu!trySomething+0x27 (FPO: [Non-Fpo]) (CONV: cdecl) [c:\source\eatcpu\eatcpu\eatcpu.cpp @ 45]

    00d9ff58 62bb4601 0017ff34 4f9f12e9 00000000 eatcpu!myThreadFunction+0x38 (FPO: [Non-Fpo]) (CONV: cdecl) [c:\source\eatcpu\eatcpu\eatcpu.cpp @ 35]

    00d9ff94 62bb459c 00b937a8 00d9ffac 768019f1 MSVCR80D!_beginthread+0x221 (FPO: [Non-Fpo])

    00d9ffa0 768019f1 00b937a8 00d9ffec 77a2d109 MSVCR80D!_beginthread+0x1bc (FPO: [Non-Fpo])

    00d9ffac 77a2d109 00b93c48 00d926a6 00000000 kernel32!BaseThreadInitThunk+0xe (FPO: [Non-Fpo])

    00d9ffec 00000000 62bb4520 00b93c48 00000000 ntdll!_RtlUserThreadStart+0x23 (FPO: [Non-Fpo]) (CONV: stdcall) [d:\vistartm\base\ntos\rtl\rtlexec.c @ 2695]

    0:002> !error 00000057 ß double check, using !error, this will decode the error into a human readable string.

    Error code: (Win32) 0x57 (87) - The parameter is incorrect.

     

    0:002> uf checkSomething  ß lets disassemble the function calling SetLastError.

    eatcpu!checkSomething [c:\source\eatcpu\eatcpu\eatcpu.cpp @ 49]:

       49 00411680 55              push    ebp

       49 00411681 8bec            mov     ebp,esp

       49 00411683 81ecc0000000    sub     esp,0C0h

       49 00411689 53              push    ebx

       49 0041168a 56              push    esi

       49 0041168b 57              push    edi

       49 0041168c 8dbd40ffffff    lea     edi,[ebp-0C0h]

       49 00411692 b930000000      mov     ecx,30h

       49 00411697 b8cccccccc      mov     eax,0CCCCCCCCh

       49 0041169c f3ab            rep stos dword ptr es:[edi]

       50 0041169e 837d0800        cmp     dword ptr [ebp+8],0 ß Check what our first parameter is on the stack, EBP+8 remember  PLUS == Parameters.  Note looking at the stack it’s 00000000     

       50 004116a2 741d            je      eatcpu!checkSomething+0x41 (004116c1ß if parameter 1 ones 0 we are going to jump to this addres, else we execute the following code. (WE JUMP) eatcpu!checkSomething+0x24 [c:\source\eatcpu\eatcpu\eatcpu.cpp @ 52]:

       52 004116a4 8bf4            mov     esi,esp    ß The green would have been the good code path, non errant.

       52 004116a6 68fa000000      push    0FAh

       52 004116ab ff15a8814100    call    dword ptr [eatcpu!_imp__Sleep (004181a8)] ß Note we sleep or do some work other then eat CPU here if we are passed non ZERO

       52 004116b1 3bf4            cmp     esi,esp

       52 004116b3 e8a6faffff      call    eatcpu!ILT+345(__RTC_CheckEsp) (0041115e)

       53 004116b8 b801000000      mov     eax,1 ß We are setting EAX to 1, this means we have succeded

       53 004116bd eb15            jmp     eatcpu!checkSomething+0x54 (004116d4ß Now we jump to the clean up code for the fucntionn

     

    eatcpu!checkSomething+0x41 [c:\source\eatcpu\eatcpu\eatcpu.cpp @ 57]:

       57 004116c1 8bf4            mov     esi,esp  ß This appears to be a failure case.  We did not get an expected parameter so we report an error and return zero.

       57 004116c3 6a57            push    57h ß Pusing error 0x57 on the stack, invalid parameter.

       57 004116c5 ff15a4814100    call    dword ptr [eatcpu!_imp__SetLastError (004181a4)] ß Our call to setlasterror

       57 004116cb 3bf4            cmp     esi,esp

       57 004116cd e88cfaffff      call    eatcpu!ILT+345(__RTC_CheckEsp) (0041115e)

       58 004116d2 33c0            xor     eax,eax  ß XORing eax with eax will make EAX Zero.  This is an error condition.

     

    eatcpu!checkSomething+0x54 [c:\source\eatcpu\eatcpu\eatcpu.cpp @ 60]:

       60 004116d4 5f              pop     edi

       60 004116d5 5e              pop     esi

       60 004116d6 5b              pop     ebx

       60 004116d7 81c4c0000000    add     esp,0C0h

       60 004116dd 3bec            cmp     ebp,esp

       60 004116df e87afaffff      call    eatcpu!ILT+345(__RTC_CheckEsp) (0041115e)

       60 004116e4 8be5            mov     esp,ebp

       60 004116e6 5d              pop     ebp

       60 004116e7 c3              ret

     

    The key thing that should be observed by this scenario is that when dealing with a high CPU condition there is often some problem at the lower level of some loop condition that prevents the proper execution of code from happening.  If you’re lucky the problem is reported by some error facility in the OS or the application.  In either case you can use the above technique for isolation.

     

    The following is the sample code for EATCPU.     

     

    // eatcpu.cpp : Defines the entry point for the console application.

    //

     

    #include "stdafx.h"

    #include <windows.h>

    #include <process.h>

     

    void myThreadFunction(void *);

    int doSomething(int);

    int trySomething(int);

    int checkSomething(int);

     

    int _tmain(int argc, _TCHAR* argv[])

    {

          int truevalue = 1;

          int falsevalue = 0;

          // let's start some threads.  These should represent worker threads in a process.  Some of them will do “valid work” one will fail to do so.

          _beginthread(myThreadFunction,12000,(void *)&truevalue);

          _beginthread(myThreadFunction,12000,(void *)&truevalue);

          _beginthread(myThreadFunction,12000,(void *)&truevalue);

          _beginthread(myThreadFunction,12000,(void *)&falsevalue );

          _beginthread(myThreadFunction,12000,(void *)&truevalue);

          _beginthread(myThreadFunction,12000,(void *)&truevalue);

     

          Sleep(10*60000);

          return 0;

    }

     

    void myThreadFunction(void *value)

    {

          int *localvalue = (int *)value;

     

          while(1)

          {

                trySomething(*localvalue);

          }

    }

    int doSomething(int value)

    {

          return trySomething(value);

     

    }

    int trySomething(int value)

    {

          return checkSomething(value);

     

    }

    int checkSomething(int value)

    {

          if(value)

          { // Make sure we have have some valid input parameter. 

            //We will pretend we are doing work, this could be anything , file I/O etc.

                Sleep(250);

                return TRUE;

          }

          else

          { // This is an error condition, this function expects an NON Zero parameter and will report ZERO as an invalid parameter.

     

                SetLastError(ERROR_INVALID_PARAMETER);

                return FALSE;

          }

    }

     

     

     

  • Ntdebugging Blog

    Smoking Gun Pool Corruption

    • 3 Comments

    Hello, my name is Ron Stock and I’m an Escalation Engineer on the Microsoft Platforms Global Escalation Services Team. Today I’m going to talk about pool corruption which manifests itself in various ways.  It’s usually hard to track down because the culprit is long gone when the machine crashes. Tools such as Special Pool make our debug lives easier; however tracking down corruption doesn’t always have to make you pull your hair out. In some cases simply re-tracing the steps of the crash can reveal a smoking gun.

    Let’s take a look at a real world example. First we need to be in the right context so we set the trap frame to give us the register context when the machine crashed.

     

    2: kd> .trap 0xfffffffff470662c

    ErrCode = 00000002

    eax=35303132 ebx=fd24d640 ecx=fd24d78c edx=fd24d784 esi=fd24d598 edi=fd24d610

    eip=e083f7a5 esp=f47066a0 ebp=f47066e0 iopl=0         nv up ei pl nz na po nc

    cs=0008  ss=0010  ds=0023  es=0023  fs=0030  gs=0000             efl=00010202

    nt!KeWaitForSingleObject+0x25b:

    e083f7a5 ff4818          dec     dword ptr [eax+18h]  ds:0023:3530314a=????????

     

    From the register output we can tell that the system crashed while attempting to dereference a pointer at memory location [eax+18h]. The value stored in register eax is probably the address of a structure given that the code is attempting to dereference offset 18 from the base of eax. Currently eax is pointing to 0x35303132 which is clearly not a valid kernel mode address.  Most kernel mode addresses on 32-bit systems will be above the 0x80000000 range assuming the machine is not using something like the /3GB switch. Our mission now is to determine how eax was set.

     

    First we’ll unassemble the failing function using the UF command.

     

    2: kd> uf nt!KeWaitForSingleObject

    …..

    ……

                   …..

    nt!KeWaitForSingleObject+0x25b:

    e083f7a5 ff4818          dec     dword ptr [eax+18h]

    e083f7a8 8b4818          mov     ecx,dword ptr [eax+18h]

    e083f7ab 3b481c          cmp     ecx,dword ptr [eax+1Ch]

    e083f7ae 0f836ef9ffff    jae     nt!KeWaitForSingleObject+0x2a3 (e083f122)

     

    I truncated the results of the UF output to conserve space in this blog. Instruction e083f7a5 is the line of code that generated the fault so our focus is to determine how the value of eax was set prior to running instruction e083f7a5. Based on the UF output, instruction e083f11c could have jumped to e083f7a5. Let’s investigate how eax is set before instruction e083f11c jumped to the failing line.

     

    nt!KeWaitForSingleObject+0x244:

    e083f107 8d4208          lea     eax,[edx+8]

    e083f10a 8b4804          mov     ecx,dword ptr [eax+4]

    e083f10d 8903            mov     dword ptr [ebx],eax

    e083f10f 894b04          mov     dword ptr [ebx+4],ecx

    e083f112 8919            mov     dword ptr [ecx],ebx

    e083f114 895804          mov     dword ptr [eax+4],ebx

    e083f117 8b4668          mov     eax,dword ptr [esi+68h]

    e083f11a 85c0            test    eax,eax

    e083f11c 0f8583060000    jne     nt!KeWaitForSingleObject+0x25b (e083f7a5) ß--Jump

     

    Instruction e083f117 moves a value into eax so I’m dumping the value here.

     

    2: kd> dd esi+68h l1

    fd24d600  35303132

     

    Bingo! There’s our bad value of 35303132 which is the value of the eax register too, so we probably took this code path. Just to confirm the current value of eax, I’m dumping the register which should mirror the results for eax when using the “r” command to get the full register set.

     

    2: kd> r eax

    Last set context:

    eax=35303132

     

    Now our focus moves to why dword ptr [esi+68h] points to the bad value? Without source code this can be challenging to narrow down, however the !pool command comes in handy for cases like this.  

     

    2: kd> ? esi+68h

    Evaluate expression: -47917568 = fd24d600

     

    Let’s examine fd24d600 a little more in detail using the !pool command. The !pool command neatly displays an entire page of 4k kernel memory listing all of the allocations contained on the page. From the output we can determine that our address is allocated from NonPaged pool and holds some sort of thread data, evidenced by the Thre tag next to our allocation. Notice the asterisk next to fd24d578 indicating the start of our pool. Virtual address fd24d578 is the beginning of an 8 byte pool header, and the header is followed by the actual data blob. Be aware that not all memory is allocated from the pool so the !pool command is not always useful. I have more information on !pool  later in the blog.

     

    2: kd> !pool fd24d600

    Pool page fd24d600 region is Nonpaged pool

     fd24d000 size:  270 previous size:    0  (Allocated)  Thre (Protected)

     fd24d270 size:   10 previous size:  270  (Free)       `.lk

     fd24d280 size:   40 previous size:   10  (Allocated)  Ntfr

     fd24d2c0 size:   20 previous size:   40  (Free)       CcSc

     fd24d2e0 size:  128 previous size:   20  (Allocated)  PTrk

     fd24d408 size:  128 previous size:  128  (Allocated)  PTrk

     fd24d530 size:    8 previous size:  128  (Free)       Mdl

     fd24d538 size:   28 previous size:    8  (Allocated)  Ntfn

     fd24d560 size:   18 previous size:   28  (Free)       Muta

     

    *fd24d578 size:  270 previous size:   18  (Allocated) *Thre (Protected) ß-our pool

     

     fd24d7e8 size:  428 previous size:  270  (Allocated)  Mdl

     fd24dc10 size:   30 previous size:  428  (Allocated)  Even (Protected)

     fd24dc40 size:   30 previous size:   30  (Allocated)  TCPc

     fd24dc70 size:   18 previous size:   30  (Free)       SeTd

     fd24dc88 size:   28 previous size:   18  (Allocated)  Ntfn

     fd24dcb0 size:  128 previous size:   28  (Allocated)  PTrk

     fd24ddd8 size:  228 previous size:  128  (Allocated)  tdLL

     

    I’ll dump out the contents of the allocation using the dc command starting at the pool header for this block of memory. Remember, we expect to move a value from [esi+68] into eax. Later the code dereferences [eax+18] which leads me to believe that eax is the base of a structure. So we expect a valid Kernel mode value to be moved into eax rather than something like a string, otherwise the code wouldn’t dereference an offset.

     

     

    2: kd> dc fd24d578

    fd24d578  0a4e0003 e5726854 00000003 00000002  ..N.Thr.........

    fd24d588  eb10ee70 20000000 e08b5c60 eb136f96  p...... `\...o..

    fd24d598  006e0006 00000000 fd24d5a0 fd24d5a0  ..n.......$...$.

    fd24d5a8  fd24d5a8 fd24d5a8 f4707000 f4704000  ..$...$..pp..@p.

    fd24d5b8  f4706d48 00000000 fd24d700 fd24d700  Hmp.......$...$.

    fd24d5c8  fd24d5c8 fd24d5c8 fd270290 01000100  ..$...$...'.....

    fd24d5d8  00000002 00000000 00000001 01000a02  ................

    fd24d5e8  00000000 fd24d640 32110000 0200009f  ....@.$....2....

    2: kd> dc

    fd24d5f8  00000000 20202020 32313532 000a6953  ....   25125Si.. <-- appears to be a string.

    fd24d608  20202020 20202020 20202020 5c4e4556              VEN\

    fd24d618  32313532 20202020 20202020 20202020  2512           

    fd24d628  00000000 00000000 00000000 00000000  ................

    fd24d638  00000000 00000000 fd24d78c fd24d78c  ..........$...$.

    fd24d648  00000000 fd24d784 fd24d640 30010000  ......$.@.$....0

    fd24d658  00343033 00000000 00000000 00000000  304.............

    fd24d668  00000000 01000000 00000000 00000000  ................

    2: kd> dc

    fd24d678  fd24d598 00000000 00000000 00000000  ..$.............

    fd24d688  fd24d618 fd24d618 fd24d598 fd24d610  ..$...$...$...$.

    fd24d698  00000000 00010102 00000000 00000000  ................

    fd24d6a8  00000000 00000000 e08aeee0 00000000  ................

    fd24d6b8  00000801 0000000f fd270290 0000000f  ..........'.....

    fd24d6c8  fd24d5c0 fd24d6d0 00000000 00000000  ..$...$.........

    fd24d6d8  00000000 00000000 00000000 00000000  ................

    fd24d6e8  00000000 00000000 f4707000 06300612  .........pp...0.

     

    Examining the memory contents above you can clearly see a string overwrite starting around 0xfd24d5f8. The memory we dereferenced, fd24d600 or [esi+68], is right in the middle of the string. The string appears to be a vendor number for a piece of hardware. After examining the setupapi.log and the OEM**.inf  files in the Windows\inf directory we found a similar string and narrowed it down to a third party.

     

    A little more on the !pool command is important to mention. The memory address of interest may not always be allocated from the pool in which case you would encounter a message similar to this.

     

    0: kd> !pool 80000ae5

    Pool page 80000ae5 region is Unknown

    80000000 is not a valid large pool allocation, checking large session pool...

    80000000 is freed (or corrupt) pool

    Bad allocation size @80000000, too large

     

    ***

    *** An error (or corruption) in the pool was detected;

    *** Pool Region unknown (0xFFFFFFFF80000000)

    ***

    *** Use !poolval 80000000 for more details.

    ***

     

    If this had been the case I would have enabled Special Pool to narrow down the culprit.

     

  • Ntdebugging Blog

    Tracking down MmSt paged pool usage

    • 6 Comments

     

    A trend that I’ve noticed recently are cases involving paged pool depletion with high MmSt tag usage that remains after trying KB304101 (PoolUsageMaximum). These pool allocations are used by the memory manager for section object prototype PTEs. There are generally only two options when this happens: 1) upgrade to a 64-bit platform, or 2) reduce the size of the volumes. But we may want to know what mapped files are using this memory. Here is how it can be done. Start with !memusage.

     

    5: kd> !memusage

     loading PFN database

    loading (100% complete)

    Compiling memory usage data (99% Complete).

                 Zeroed:  19073 ( 76292 kb)

                   Free:      0 (     0 kb)

                Standby: 1468824 (5875296 kb)

               Modified:    368 (  1472 kb)

        ModifiedNoWrite:   1927 (  7708 kb)

           Active/Valid: 605772 (2423088 kb)

             Transition:      0 (     0 kb)

                    Bad:      0 (     0 kb)

                Unknown:      0 (     0 kb)

                  TOTAL: 2095964 (8383856 kb)

      Building kernel map

      Finished building kernel map

    Scanning PFN database - (100% complete)

     

    Following this you will see the list of mapped files and their control areas.

     

      Usage Summary (in Kb):

    Control Valid Standby Dirty Shared Locked PageTables  name

    8c62a638  1108  945868  3064     0     0     0  mapped_file( $Mft )

     

    The control area is the address at the far left and has a Segment field that contains the total number of PTEs.

     

    5: kd> dt 8c62a638 _CONTROL_AREA Segment->TotalNumberOfPtes

    nt!_CONTROL_AREA

       +0x000 Segment                    :

          +0x004 TotalNumberOfPtes          : 0x1e8b00

     

    The MmSt allocations contain these PTEs so all we need to do is multiply this by the size of a PTE to get the total size of the MmSt allocations for this control area. Note that there may be multiple allocations for this control area, but this number will reflect the total size all these allocations.

     

    5: kd> ?? 0x1e8b00 * sizeof(nt!_MMPTE)

    unsigned int 0xf45800

     

    So now we know the MmSt size in bytes for a single control area, or mapped file. What if we would like to see the totals for all mapped files from the !memusage output? First, place the !memusage output in a text file and remove all header information. You will also need to remove all tail information including the page file and process summaries. Every line should look like these.

     

    8c62a638  1108  945868  3064     0     0     0  mapped_file( $Mft )

    8b06ac18   516      0     0     0     0     0    No Name for File

     

    We want to include the “No Name for File” entries since those are valid mapped files even though the name could not be located. Next strip out everything but the control area address. You can use Excel or any other tool that allows you to select and delete columns in a text file. Now we have a file with a single column of all the control areas on the system. The following debugger command script can be used to process this file.

     

    $$ countptes.txt script

    r $t2 = 0;

     

    $$ Replace the memusage.txt file name with your file name.

    .foreach /f (ca "memusage.txt") {

        r $t1 = @@c++(((nt!_CONTROL_AREA *)(0x${ca}))->Segment->TotalNumberOfPtes);

        .printf "Control Area %p : %d\n", ${ca}, @$t1;

        r $t2 = @$t2 + @$t1;

    }

     

    .printf "Total PTEs : %d\n", @$t2;

    .printf "MmSt size  : %d bytes\n", (@$t2 * @@c++(sizeof(nt!_MMPTE)));

     

    The following command will execute the script.

     

    5: kd> $$><countptes.txt

     

    This will show the number of PTEs for each control area, followed by a summary.

     

    Total PTEs : 62790244

    MmSt size  : 502321952 bytes

     

    A common high user of MmSt allocations is $Mft. The cache manager will hold the MmSt allocations for these file system metadata files at a cost of up to 4 files per PTE. This technique can be used to determine how much $Mft is using MmSt pool memory by first using findstr at a command prompt to isolate just those values from the !memusage output.

     

    C:\Projects>findstr /c:"$Mft" memusage.txt >mftusage.txt

     

    After stripping out the control area addresses with Excel and running the command script you’ll have the size of the MmSt allocations for just the $Mft files. If this is consuming most of the MmSt bytes then you are limited to the options mentioned at the beginning of this article. There may be other options if something else is the primary user but it will likely involve reducing some heavy load on the system.

     

    -Bryan

Page 19 of 24 (238 items) «1718192021»