• Ntdebugging Blog

    This button doesn’t do anything!

    • 8 Comments

     

    Hello - Matthew here again.  Today I'll be discussing in detail hang scenario #1 that Tate first mentioned a few blogs posts ago.  From a debugging perspective, in an ideal world an application would always provide some kind of feedback when a failure occurs.  The reality is that sometimes an application just doesn't do anything in response to a user-initiated action (such as a mouse-click) that results in a failure of some kind.  When this happens, how can you determine what is going wrong?

     

    In general, Process Monitor from Sysinternals is a great tool for gathering clues about this kind of failure.  If the failure is due to a file system or registry problem, Process Explorer will catch the problematic event.  Or if you suspect that the problem is due to something failing over the network, a network trace could help.

     

    Let's assume that the techniques described above do not move us any closer to understanding the failure.  What next?  If you don't know what the app does in the background, and you don't have source code for the app, how do you proceed?  Let's answer those questions by debugging a sample application...

     

    The Nothing button

     

    Download the sample application here.

     

     

    Here's what we know:

    1. Clicking the “Button 1” button is supposed to display a useful dialog box

    2. For certain users, clicking “Button 1” results in no observable response

    3. You don't have access to the source code or debug symbols for the app

    4. No one seems to know what the “Button 1” button does behind the scenes

    5. The developer of the application hasn't been heard from in 3 years

     

    We need to understand what exactly happens when the user clicks the “Button 1” button.  Every window has a WindowProc function that receives messages upon user input.  Since a button is considered a "control", the WindowProc for the main application window will get a WM_COMMAND message indicating that the button was clicked.  To differentiate which specific control sent the message, each WM_COMMAND message has a control identifier associated with it.

     

    So we need to:

     

    a)     Find the control identifier for “Button 1”

    b)    Find the WindowProc for the main application window

    c)     Determine where WM_COMMAND messages for the “Button 1” button are handled

    d)    Figure out what is failing in that code

     

    Let’s get started...

     

     

     

    Find the control identifier for “Button 1”      

     

    Spy++ (available as part of Visual Studio) is the tool for this job.  Click Spy -> Log Messages.  Use the Finder Tool to select the main window of ntdbghang1.exe.  On the messages tab, click “Clear All” then select “WM_COMMAND”.  Click “OK” and SPY++ will start logging.  Click “Button 1” in ntdbghang1.exe and you should see one line of text appear in the output window.  For good measure, click “Button 2” also, and you should see a second line of text.  At this point, your SPY++ window should look something like this....

     

    Spy Windows Msg

     

    We can tell from this output, that the control identifier for “Button 1” is 257 (0x101), and the control identifier for “Button 2” is 258 (0x102).  We’ll need this information later.

     

     

     

    Find the WindowProc for the main application window

     

    We can find the WindowProc address by using SPY++ again.  Click Spy -> Find Window and use the Finder Tool to select the main window of ntdbghang1.exe.  Make sure “Show properties” is selected, and then click “OK”.  You will get a dialog that looks like this....

     

    Spy Window Properties

     

    Note the value of Window Proc.  This is the address of the window procedure for the main window (the exact value may differ on your system).   We know that this function is a WindowProc, which means it is of the form...

     

    LRESULT CALLBACK WindowProc(         

        HWND hwnd,

        UINT uMsg,

        WPARAM wParam,

        LPARAM lParam

    );

     

    The meaning of wParam and lParam differ depending on what uMsg is being processed.  When a WM_COMMAND message is passed, the low-word of wParam is the control identifier, which we determined to be 0x101.  The high-word of wParam is the control notification code, which in the case of a button being clicked is BN_CLICKED (literally 0).

     

    So, we specifically are interested in the case where:

               

                uMsg = WM_COMMAND (literally 0x111)

                wParam = 0x101

     

    While looking at assembly language for the WindowProc, the stack frame will look like this...

     

    ebp    =  “old ebp”

    ebp+4  =  “return address”

    ebp+8  =  hwnd

    ebp+c  =  uMsg

    ebp+10 =  wParam

    ebp+14 =  lParam

     

     

     

    Determine where WM_COMMAND messages for the “Button 1” button are handled

     

    Once you have the address of the WindowProc, we can examine the assembly code for that function using Windbg.  Launch windbg.exe and select “File -> Attach to a Process”.  Select “ntdbghang1.exe” from the list and click “OK”.  When the debugger breaks in, you can unassemble the beginning of the WindowProc by typing “u <address>”.  Based on the address on my system, that command will be “u 01002830”.  To unassemble more just type “u” again.  I’m now going to unassemble the relevant code and explain what it means along the way...

     

     

    0:001> u 01002830

    ntdbghang1+0x2830:

     

    The first 3 instructions are a function prologue, just getting everything set up.

    01002830 8bff            mov     edi,edi

    01002832 55              push    ebp

    01002833 8bec            mov     ebp,esp

     

    Here is the code that moves the value of uMsg into ecx

    01002835 8b4d0c          mov     ecx,dword ptr [ebp+0Ch]

     

    Now the app enters the assembly equivalent of a switch statement.

    We are interested in the check for uMsg = WM_COMMAND

     

    if uMsg = WM_CREATE (1) goto 01002893

    01002838 49              dec     ecx

    01002839 7458            je      ntdbghang1+0x2893 (01002893)

    if uMsg = WM_ DESTROY (2) goto 01002889

    0100283b 49              dec     ecx

    0100283c 744b            je      ntdbghang1+0x2889 (01002889)

    if uMsg = WM_CLOSE (0x10) goto 01002889

    0100283e 83e90e          sub     ecx,0Eh

    01002841 743b            je      ntdbghang1+0x287e (0100287e)

    if uMsg = WM_COMMAND (0x111) goto 01002853

    01002843 b801010000      mov     eax,101h

    01002848 2bc8            sub     ecx,eax

    0100284a 7407            je      ntdbghang1+0x2853 (01002853)

     

    From the above switch statement we see that when uMsg = WM_COMMAND, execution moves to 01002853.  So let’s go there...

     

    0:001> u 01002853

    ntdbghang1+0x2853:

     

    Move the value of wParam into edx

    01002853 8b5510          mov     edx,dword ptr [ebp+10h]

    If LOWORD(wParam) == 0x101 (control ID for button 1) goto 0100286f

    01002856 0fb7ca          movzx   ecx,dx

    01002859 2bc8            sub     ecx,eax

    0100285b 7412            je      ntdbghang1+0x286f (0100286f)

     

    From the above assembly we can see that right away we are checking for the control ID for button 1.  This is the code path we want to follow....

     

    0:001> u 0100286f

    ntdbghang1+0x286f:

     

    If HIWORD(wparam) != BN_CLICKED (0) then goto 0100289b

    Otherwise, call the function at 010027f6

    0100286f c1ea10          shr     edx,10h

    01002872 6685d2          test    dx,dx

    01002875 7524            jne     ntdbghang1+0x289b (0100289b)

    01002877 e87affffff      call    ntdbghang1+0x27f6 (010027f6)

     

    Now it is clear that when the BN_CLICKED occurs for control id 0x101, we are going to execute the function at address 010027f6. 

     

     

     

    Figure out what is failing

     

    So now we know what code is going to run every time the button is clicked (the function at 010027f6).  Let’s examine that code and figure out what may be failing.  We’ll use “uf” to unassemble the entire function, as it is rather short.  During this first pass we’ll make up names for our local variables and functions that are called, and then we’ll revisit and figure out what they are.

     

     

    0:001> uf 010027f6

    ntdbghang1+0x27f6:

     

    Function Prologue

    010027f6 8bff            mov     edi,edi

    010027f8 55              push    ebp

    010027f9 8bec            mov     ebp,esp

     

    010027fb 51              push    ecx

    010027fc 56              push    esi

     

    call func1(0x20) [func1 address at 01002dca]

    set localvar1=0x10

    010027fd 6a20            push    20h

    010027ff c745fc10000000  mov     dword ptr [ebp-4],10h

    01002806 e8bf050000      call    ntdbghang1+0x2dca (01002dca)

     

    Store the result of func1 in esi (call it localvar2)

    0100280b 8bf0            mov     esi,eax

     

    Call func2(localvar2, &localvar1) [func2 address at 0100b484]

    0100280d 59              pop     ecx

    0100280e 8d45fc          lea     eax,[ebp-4]

    01002811 50              push    eax

    01002812 56              push    esi

    01002813 e86c8c0000      call    ntdbghang1+0xb484 (0100b484)

     

    If func2 returns 0, then goto 01002821

    01002818 85c0            test    eax,eax

    0100281a 7405            je      ntdbghang1+0x2821 (01002821)

     

    Call func3() [func3 address at 0100278d]

    ntdbghang1+0x281c:

    0100281c e86cffffff      call    ntdbghang1+0x278d (0100278d)

     

    Call func4(localvar2) [func4 address at 01002ce3]

    ntdbghang1+0x2821:

    01002821 56              push    esi

    01002822 e8bc040000      call    ntdbghang1+0x2ce3 (01002ce3)

     

    Clean up and exit the function

    01002827 59              pop     ecx

    01002828 5e              pop     esi

    01002829 c9              leave

    0100282a c3              ret

     

     

    So let’s translate this to some pseudo-code...

     

    localvar1 = 0x10;

    localvar2 = func1(0x20);

     

    if(func2(localvar2, &localvar1)

    {

           func3();

     

    func4(localvar2);

     

    So the behavior of the function will differ, depending on the results of func2.  Let’s see if we can figure out what func2 does...

     

    0:001> u 0100b484

    ntdbghang1+0xb484:

    0100b484 ff2568110001    jmp     dword ptr [ntdbghang1+0x1168 (01001168)]

    0100b48a cc              int     3

    0100b48b cc              int     3

    0100b48c cc              int     3

     

    0:001> dps 01001168 L1

    01001168  70b88cb1 WINSPOOL!GetDefaultPrinterW

     

     

    GetDefaultPrinterW is a public API function.  Here’s the function prototype from MSDN...

     

    BOOL GetDefaultPrinter(

      LPTSTR pszBuffer,   // printer name buffer

      LPDWORD pcchBuffer  // size of name buffer

    );

     

    This fits our pseudo-code, since the function takes two parameters and returns a BOOL.  Let’s update our pseudo code to match what we know, replacing func2 with GetDefaultPrinter, etc...

     

    DWORD cchBuffer = 0x10;

    LPWSTR pszBuffer = func1(0x20);

     

    if(GetDefaultPrinterW(pszBuffer,  &cchBuffer))

    {

           func3();

    }

     

    func4(pszBuffer);

     

    Based on the above, we can make a couple of assumptions.  Func1 is likely an allocator function of some kind (such as malloc) and func4 is likely a memory free function (such as free). Since the app is using the Unicode version of GetDefaultPrinter (it ends with a W) it make sense that func1 is an allocator asking for 0x20 bytes, and the size of the buffer, in characters, passed to GetDefaultPrinterW is 0x10.  So that accounts for all of the functions, except for func3.  We could unassemble func3 to see what it does, but that may not be needed.  From the pseudo-code we’ve already created, we can see that a failure in GetDefaultPrinterW will prevent the button click function from doing anything except allocating and freeing some memory.  So GetDefaultPrinterW is a likely potential point of failure.

     

    Let’s test this theory out.  We’ll set a breakpoint on GetDefaultPrinterW and see if it is failing upon the button click.

     

    0:001> bp WINSPOOL!GetDefaultPrinterW

     

    0:001> g

     

    <Now click Button 1>

     

    Breakpoint 0 hit

    eax=0006fb84 ebx=00000000 ecx=00000020 edx=00dc0e98 esi=00dc0e70 edi=0006fc0c

    eip=70b88cb1 esp=0006fb74 ebp=0006fb88 iopl=0         nv up ei pl nz na po nc

    cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000202

    WINSPOOL!GetDefaultPrinterW:

    70b88cb1 8bff            mov     edi,edi

     

    // Dump out the call stack...

    0:000> kb

    ChildEBP RetAddr  Args to Child             

    0006fb70 01002818 00dc0e70 0006fb84 00000111 WINSPOOL!GetDefaultPrinterW

    WARNING: Stack unwind information not available. Following frames may be wrong.

    0006fb88 0100287c 0006fbbc 75d41a10 00320f78 ntdbghang1+0x2818

    0006fb90 75d41a10 00320f78 00000111 00000101 ntdbghang1+0x287c

    0006fbbc 75d41ae8 01002830 00320f78 00000111 USER32!InternalCallWinProc+0x23

    0006fc34 75d4286a 00000000 01002830 00320f78 USER32!UserCallWinProcCheckWow+0x14b

    0006fc74 75d42bba 00a90b80 0095ee08 00000101 USER32!SendMessageWorker+0x4b7

     

    // Go to the return address.  Note that eax=0, meaning GetDefaultPrinterW returned FALSE.

    0:000> gu

    eax=00000000 ebx=00000000 ecx=76f22033 edx=00e10178 esi=00dc0e70 edi=0006fc0c

    eip=01002818 esp=0006fb80 ebp=0006fb88 iopl=0         nv up ei pl zr na pe nc

    cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000246

    ntdbghang1+0x2818:

    01002818 85c0            test    eax,eax

     

    // Check the last error...

    0:000> !gle

    LastErrorValue: (Win32) 0x7a (122) - The data area passed to a system call is too small.

     

           // Resume execution

    0:000> g

     

    The error returned was 0x7a = ERROR_INSUFFICIENT_BUFFER.  It looks like the buffer of 0x20 bytes being passed to GetDefaultPrinterW is too small to hold the name of the default printer for that user.  This explains why the app works for some users and fails for others.  A quick change of the default printer name to something less than 0x10 characters (16 decimal) will work around this issue.

     

    To wrap things up, here is the C source for the Button1_OnClick function (at address 010027f6 in the assembly) so you can see how this was actually written...

     

    VOID Button1_OnClick()

    {

          DWORD cch = 16;

          LPTSTR pPrinterName;

     

          pPrinterName = (LPTSTR) malloc(16 * sizeof(TCHAR));

     

          if(GetDefaultPrinter(pPrinterName, &cch))

          {

                DisplayGoButtonMessage();

          }

     

          free(pPrinterName);

     

          return;

    }

     

    Look for an upcoming blog post that shows another way this problem could have been approached – using the “wt” command in the debugger.  I hope you found this useful and can apply to real-world debugging scenarios.  Please post any questions / comments!

  • Ntdebugging Blog

    Hung Window?, No Source?, No Problem!! Part 2

    • 4 Comments

    Written by Jeff Dailey

     

    Hello, my name is Jeff, I’m a escalation engineer on the Microsoft CPR (critical problem resolution) platforms team.   This blog entry is part 2 of my Hung Window?, No source?, No problem!! Part 1 blog.   In this lab we will be debugging a problem involving multi threaded applications and synchronization objects and the types of things that can go wrong, and how to track them down. This process and training lab is right out of our CPR Training curriculum.  In order to do the lab I have prepared for you, you will need to have downloaded the dumphungwindow and then badwindow.exe from my earlier blog post.  You will also need to install the debugging tools for windows.

     

    Debugging tools:

     http://www.microsoft.com/whdc/devtools/debugging/installx86.mspx

    Previous blog http://blogs.msdn.com/ntdebugging/archive/2007/05/29/detecting-and-automatically-dumping-hung-gui-based-windows-applications.aspx

     

    After you have both of these installed we can get started.  We are going to debug and figure out why the window stops repainting and does not respond.

     

    Step 1 start badwindow.exe

    Step 2 run dumphungwindow.exe

    Step 3 select Hang \ Hang Type 2 from  the BadWindow.exe menu.

    You should see dump hung window detect your window no processing messages and as a result it will dump the badwindow.exe process

     

    ************ OUTPUT *************

    C:\source\dumphungwindow\release>dumphungwindow.exe
    Dumps will be saved in C:\Users\jeffda\AppData\Local\Temp\
    scanning for hung windows

    **

    Hung Window found dumping process (12912) badwindow.exe
    Dumping unresponsive process
    C:\Users\jeffda\AppData\Local\Temp\HWNDDump_Day6_14_2007_Time7_34_5_Pid12912_badwindow.exe.dmp

    Dump complete

     

    Hung Window found dumping process (12912) badwindow.exe

    Dumping unresponsive process
    C:\Users\jeffda\AppData\Local\Temp\HWNDDump_Day6_14_2007_Time7_34_24_Pid12912_badwindow.exe.dmp\jeffda\AppData\Local\Temp\HWNDDump_Day6_12_2007_Time9_53_56_Pid7924_badwindow.exe.dmp

    Dump complete*

    ************ OUTPUT *************

     

    Step 4 create a local symbol directory at C:\websymbols

    Step 5 set your symbol path under file \ symbols in windbg to SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols

    See http://www.microsoft.com/whdc/devtools/debugging/debugstart.mspx for details.


    Step 6 start windbg select file\open crash dump and select the first dump file.

    Your initial output should look like this.

     

    Microsoft (R) Windows Debugger  Version 6.7.0001.0

    Copyright (c) Microsoft Corporation. All rights reserved.

     

    ***** WARNING: Your debugger is probably out-of-date.

    *****          Check http://dbg for updates.

     

    Loading Dump File [C:\Users\jeffda\AppData\Local\Temp\HWNDDump_Day6_12_2007_Time9_53_34_Pid7924_badwindow.exe.dmp]

    User Mini Dump File with Full Memory: Only application data is available

     

    Symbol search path is: SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols;srv

    Executable search path is:

    Windows Vista Version 6000 MP (2 procs) Free x86 compatible

    Product: WinNt, suite: SingleUserTS

    Debug session time: Tue Jun 12 09:53:35.000 2007 (GMT-4)

    System Uptime: 11 days 18:41:43.089

    Process Uptime: 0 days 0:00:32.000

    ....................................

    Loading unloaded module list

    .

    eax=00000000 ebx=00000002 ecx=00000000 edx=00000000 esi=00000000 edi=00000000
    eip=777faec5 esp=0017faf4 ebp=0017fb8c iopl=0         nv up ei pl nz na po nc
    cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000202
    ntdll!ZwWaitForMultipleObjects+0x15:
    777faec5 c21400          ret     14h

    0:000> !reload

     

    Step 7 from the debugger prompt (Locate a prompt at the bottom of windbg that has a 0:000> next to it.
    Type ~* k

     

     

    You will most likely see output similar to this.

    .  0  Id: 3270.2b10 Suspend: 0 Teb: 7efdd000 Unfrozen

    ChildEBP RetAddr 

    0017faf0 76e4edb5 ntdll!ZwWaitForMultipleObjects+0x15

    0017fb8c 76e430c3 kernel32!WaitForMultipleObjectsEx+0x11d

    0017fba8 00401502 kernel32!WaitForMultipleObjects+0x18

    0017fbc8 0040139b badwindow!hangtype2+0x42 [c:\source\badwindow\badwindow\badwindow.cpp @ 340]

    0017fc24 772a87af badwindow!WndProc+0x17b [c:\source\badwindow\badwindow\badwindow.cpp @ 274]

    0017fc50 772a8936 user32!InternalCallWinProc+0x23

    0017fcc8 772a8a7d user32!UserCallWinProcCheckWow+0x109

    0017fd2c 772a8ad0 user32!DispatchMessageWorker+0x380

    0017fd3c 004010fb user32!DispatchMessageW+0xf

    0017ff0c 00401817 badwindow!wWinMain+0xfb [c:\source\badwindow\badwindow\badwindow.cpp @ 124]

    0017ffa0 76eb19f1 badwindow!__tmainCRTStartup+0x150 [f:\sp\vctools\crt_bld\self_x86\crt\src\crtexe.c @ 589]

    0017ffac 7782d109 kernel32!BaseThreadInitThunk+0xe

    0017ffec 00000000 ntdll!_RtlUserThreadStart+0x23

     

       1  Id: 3270.2cd0 Suspend: 0 Teb: 7efda000 Unfrozen

    ChildEBP RetAddr 

    026ffebc 777ecfad ntdll!ZwWaitForSingleObject+0x15

    026fff20 777ecf78 ntdll!RtlpWaitOnCriticalSection+0x154

    026fff48 0040153c ntdll!RtlEnterCriticalSection+0x152

    026fff64 757c2848 badwindow!hangtype2threada+0x2c [c:\source\badwindow\badwindow\badwindow.cpp @ 358]

    026fff9c 757c28c8 msvcr80!_endthread+0x4b

    026fffa0 76eb19f1 msvcr80!_endthread+0xcb

    026fffac 7782d109 kernel32!BaseThreadInitThunk+0xe

    026fffec 00000000 ntdll!_RtlUserThreadStart+0x23

    Note the [NUMBER] Id: indicates the thread number, to the right of this you have the process id and thread id > PROCESS 3270.2b10 < THREAD | THREAD STATE > Suspend: 0 Teb: 7efdd000 Unfrozen

     

    Each of these threads represents a call stack.  The most recent call is at the TOP of the stack.  As each call is made the stack grows larger.   Looking at thread 0 you will see that our winproc appears to be blocked on a call to hangtype2, hangtype2 is making a call to WaitForMultipleObjects Lets look more closely at WaitForMultipleObjects

     

    Docs for WaitForMultipleObjects

    http://msdn2.microsoft.com/en-us/library/ms687025.aspx

     

    DWORD WINAPI WaitForMultipleObjects( DWORD nCount, const HANDLE* lpHandles, BOOL bWaitAll, DWORD dwMilliseconds

     

    Lets look at the parameters passed to

     

    0:000> kv

    ChildEBP RetAddr  Args to Child             

    0017faf0 76e4edb5 00000002 0017fb40 00000000 ntdll!ZwWaitForMultipleObjects+0x15 (FPO: [5,0,0])

    0017fb8c 76e430c3 0017fb40 0017fbc4 00000001 kernel32!WaitForMultipleObjectsEx+0x11d (FPO: [Non-Fpo])

    0017fba8 00401502 00000002 0017fbc4 00000001 kernel32!WaitForMultipleObjects+0x18 (FPO: [Non-Fpo])

    0017fbc8 0040139b 00401220 0017fbfc 00401220 badwindow!hangtype2+0x42 (FPO: [0,2,0]) (CONV: cdecl) [c:\source\badwindow\badwindow\badwindow.cpp @ 340]

    0017fc24 772a87af 00063d36 00000111 00008004 badwindow!WndProc+0x17b (CONV: stdcall) [c:\source\badwindow\badwindow\badwindow.cpp @ 274]

    0017fc50 772a8936 00401220 00063d36 00000111 user32!InternalCallWinProc+0x23

    0017fcc8 772a8a7d 00000000 00401220 00063d36 user32!UserCallWinProcCheckWow+0x109 (FPO: [Non-Fpo])

    0017fd2c 772a8ad0 00401220 00000000 0017ff0c user32!DispatchMessageWorker+0x380 (FPO: [Non-Fpo])

    0017fd3c 004010fb 0017fd54 00403938 00000001 user32!DispatchMessageW+0xf (FPO: [Non-Fpo])

    0017ff0c 00401817 00400000 00000000 00280f8c badwindow!wWinMain+0xfb (CONV: stdcall) [c:\source\badwindow\badwindow\badwindow.cpp @ 124]

    0017ffa0 76eb19f1 7efde000 0017ffec 7782d109 badwindow!__tmainCRTStartup+0x150 (FPO: [Non-Fpo]) (CONV: cdecl) [f:\sp\vctools\crt_bld\self_x86\crt\src\crtexe.c @ 589]

    0017ffac 7782d109 7efde000 0017fb9e 00000000 kernel32!BaseThreadInitThunk+0xe (FPO: [Non-Fpo])

    0017ffec 00000000 00401987 7efde000 00000000 ntdll!_RtlUserThreadStart+0x23 (FPO: [Non-Fpo])

     

    The first parameter is 00000002  this is the number of objects we are waiting on.

    The second parmeter is the address of the array of objects,  Lets dump it out and take a look at the objects

     

    0:000> dd 0017fbc4

    0017fbc4  000000c4 000000c8 0040139b 00401220

    0017fbd4  0017fbfc 00401220 00063d36 0017fc48

    0017fbe4  772a8989 772a894d 53ca28e7 00000000

    0017fbf4  00063d36 00401220 00000000 00000000

    0017fc04  00000000 0017fca0 00000001 00000000

    0017fc14  ffffffff 772a88e5 53c4f4b4 75c12459

    0017fc24  0017fc50 772a87af 00063d36 00000111

    0017fc34  00008004 00000000 00401220 dcbaabcd

     

    0:000> !handle 000000c4

    Handle 000000c4

      Type            Thread

    0:000> !handle 000000c8

    Handle 000000c8

      Type            <Error retrieving type>

     

    Looking at the second value it would appear as if all the info needed to get the handle type info is not in the dump for some reason.  Handles are a index into the handle table in kernel.  It’s possible when the dump was created that no all the handle info was included.  However that’s ok.  We have a simple way to work around this and see what happened.

     

    We can use UF from part 1 of this blogs on badwindow.exe, All we need to do is UF the return address of

    WaitForMultipleObjects.  Lets run through the assembly and see what we are waiting on.

     

    0:000> uf 00401502

    badwindow!hangtype2 [c:\source\badwindow\badwindow\badwindow.cpp @ 334]:

     

    Reserving space on the stack by decrementing ESP (The stack pointer, remember the stack grows down in memory)

      334 004014c0 83ec08          sub     esp,8

     

    Save the state of ESI so it can be restored later.

      334 004014c3 56              push    esi

     

    Get the pointer to _beginthread from the import table and store it in ESI
    Docs on being thread http://msdn2.microsoft.com/en-us/library/kdzttdcb(VS.80).aspx 1 (start address), 2 (stack size), 3 (arglist)

      337 004014c4 8b3580204000    mov     esi,dword ptr [badwindow!_imp___beginthread (00402080)]

     

     

    Push the last arg to _beginthread on the stack this is the arg list for _beginthread in this case 0 we are passing no args.

      337 004014ca 6a00            push    0


    This is our stack space.  Note in the debugger you can do a ? 2EE0h and it will show value in Hex and Dec, this value is 12000 dec.

      337 004014cc 68e02e0000      push    2EE0h

     

    This is the start address for our thread function, in this case hangtype2threada

      337 004014d1 6810154000      push    offset badwindow!hangtype2threada (00401510)

     

    Here we call _beginthread this starts the thread up.  The return value is a thread handle.

      337 004014d6 ffd6            call    esi

     

    Push the last arg to _beginthread on the stack this is the arg list for _beginthread in this case 0 we are passing no args.

      338 004014d8 6a00            push    0

     

    This is our stack space arg for _beingthread

      338 004014da 68e02e0000      push    2EE0h

     

    This is the start address for our thread function, in this case hangtype2threadb

      338 004014df 6870154000      push    offset badwindow!hangtype2threadb (00401570)

     

    We are now storing EAX (The return from the first _beginthreadcall ) into ESP+1ch (on our stack)

      338 004014e4 8944241c        mov     dword ptr [esp+1Ch],eax

     

    Here we call _beginthread this starts the thread up.  The return value is a thread handle.

      338 004014e8 ffd6            call    esi

     

    Any time we add to ESP We are shrinking or cleaning up the stack.

      338 004014ea 83c418          add     esp,18h

     

    We are pushing our wait time for WaitForMultipleObjects in this case 0FFFFFFFFh (-1) Wait forever.

      340 004014ed 6aff            push    0FFFFFFFFh

     

    Storing EAX on the stack, this is the thread handle from our last _beginthread call.

      340 004014ef 8944240c        mov     dword ptr [esp+0Ch],eax

     

    This is our wait logic,  in this case it’s WaitAll, so we will only unblock once all handles are signaled or in this case threads complete running.

      340 004014f3 6a01            push    1

     

    Here we are loading the pointer of the stack location that contains our handles that we will wait on into EAX.

      340 004014f5 8d44240c        lea     eax,[esp+0Ch]

     

    Now we push the pointer to our handles / objects on the to the stack.

      340 004014f9 50              push    eax

     

    And this is the count of objects, 2 in this case both of them threads.

      340 004014fa 6a02            push    2

     

    Now we call our WaitForMultipleObjects call to wait on hangtype2threadb and hangtype2threada to finish executing.

      340 004014fc ff1510204000    call    dword ptr [badwindow!_imp__WaitForMultipleObjects (00402010)]

     

    Restore our ESI register, this will happen when we return. 

      340 00401502 5e              pop     esi 

     

    Dec our stack pointer.

      342 00401503 83c408          add     esp,8

     

    Return we are done.

      342 00401506 c3              ret

     

    Here is the source.

     

    void hangtype2(void)

    {

          HANDLE handles[2];

     

          handles[0] = (HANDLE)_beginthread(hangtype2threada, 12000, NULL);

          handles[1] = (HANDLE)_beginthread(hangtype2threadb, 12000, NULL);

         

          WaitForMultipleObjects(2,handles,1,INFINITE);

     

    }

     

     

    So what went wrong?  Let’s look at our threads again. 

     

    We have our main message pump thread thread 0 waiting on two threads, One is still running badwindow!hangtype2threada and the other one is gone or has completed hangtype2threadb.

     

    0  Id: 3270.2b10 Suspend: 0 Teb: 7efdd000 Unfrozen

    ChildEBP RetAddr 

    0017faf0 76e4edb5 ntdll!ZwWaitForMultipleObjects+0x15

    0017fb8c 76e430c3 kernel32!WaitForMultipleObjectsEx+0x11d

    0017fba8 00401502 kernel32!WaitForMultipleObjects+0x18

    0017fbc8 0040139b badwindow!hangtype2+0x42 [c:\source\badwindow\badwindow\badwindow.cpp @ 340]

    0017fc24 772a87af badwindow!WndProc+0x17b [c:\source\badwindow\badwindow\badwindow.cpp @ 274]

    0017fc50 772a8936 user32!InternalCallWinProc+0x23

    0017fcc8 772a8a7d user32!UserCallWinProcCheckWow+0x109

    0017fd2c 772a8ad0 user32!DispatchMessageWorker+0x380

    0017fd3c 004010fb user32!DispatchMessageW+0xf

    0017ff0c 00401817 badwindow!wWinMain+0xfb [c:\source\badwindow\badwindow\badwindow.cpp @ 124]

    0017ffa0 76eb19f1 badwindow!__tmainCRTStartup+0x150 [f:\sp\vctools\crt_bld\self_x86\crt\src\crtexe.c @ 589]

    0017ffac 7782d109 kernel32!BaseThreadInitThunk+0xe

    0017ffec 00000000 ntdll!_RtlUserThreadStart+0x23

     

    Looking at hangtype2threada it would seem that it is blocked on RtlEnterCriticalSection.

     

       1  Id: 3270.2cd0 Suspend: 0 Teb: 7efda000 Unfrozen

    ChildEBP RetAddr 

    026ffebc 777ecfad ntdll!ZwWaitForSingleObject+0x15

    026fff20 777ecf78 ntdll!RtlpWaitOnCriticalSection+0x154

    026fff48 0040153c ntdll!RtlEnterCriticalSection+0x152

    026fff64 757c2848 badwindow!hangtype2threada+0x2c [c:\source\badwindow\badwindow\badwindow.cpp @ 358]

    026fff9c 757c28c8 msvcr80!_endthread+0x4b

    026fffa0 76eb19f1 msvcr80!_endthread+0xcb

    026fffac 7782d109 kernel32!BaseThreadInitThunk+0xe

    026fffec 00000000 ntdll!_RtlUserThreadStart+0x23

     

    Lets look and see what is happening with this critical section call..

     

    First lets set our thread context to thread id 1

     

    0:000> ~1s

    eax=00000000 ebx=00000000 ecx=00000000 edx=00000000 esi=00403780 edi=00000000

    eip=777fa69d esp=026ffec0 ebp=026fff20 iopl=0         nv up ei pl nz na po nc

    cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000202

    ntdll!ZwWaitForSingleObject+0x15:

    777fa69d c20c00          ret     0Ch

     

    Lets get our call stack and get the first and only arg for entercriticalsection.

     

    0:001> kv

    ChildEBP RetAddr  Args to Child              

    026ffebc 777ecfad 000000cc 00000000 00000000 ntdll!ZwWaitForSingleObject+0x15 (FPO: [3,0,0])

    026fff20 777ecf78 00000000 00000000 76e61d5a ntdll!RtlpWaitOnCriticalSection+0x154 (FPO: [Non-Fpo])

    026fff48 0040153c 00403780 00000000 00000000 ntdll!RtlEnterCriticalSection+0x152 (FPO: [Non-Fpo])

    026fff64 757c2848 00000000 51b22bb2 00000000 badwindow!hangtype2threada+0x2c (FPO: [Uses EBP] [1,1,0]) (CONV: cdecl) [c:\source\badwindow\badwindow\badwindow.cpp @ 358]

    026fff9c 757c28c8 76eb19f1 02274358 026fffec msvcr80!_endthread+0x4b (FPO: [Non-Fpo])

    026fffa0 76eb19f1 02274358 026fffec 7782d109 msvcr80!_endthread+0xcb (FPO: [Non-Fpo])

    026fffac 7782d109 02274358 026ffb9e 00000000 kernel32!BaseThreadInitThunk+0xe (FPO: [Non-Fpo])

    026fffec 00000000 757c286e 02274358 00000000 ntdll!_RtlUserThreadStart+0x23 (FPO: [Non-Fpo])

     

    We can do a couple of things at this point first lets look at the CS. (Critical Section)

     

    0:001> !cs 00403780

    -----------------------------------------

    Critical section   = 0x00403780 (badwindow!csCritSec1+0x0)

    DebugInfo          = 0x0029bd40

    LOCKED             < It’s LOCKED.

    LockCount          = 0x1

    WaiterWoken        = No

    OwningThread       = 0x00002048  < This is the owning thread.

    RecursionCount     = 0x14

    LockSemaphore      = 0xCC

    SpinCount          = 0x00000000

     

    Are there any other locked critical sections?  !locks will tell us and no this is the only one.

     

    0:001> !locks

     

    CritSec badwindow!csCritSec1+0 at 00403780

    WaiterWoken        No

    LockCount          1

    RecursionCount     20

    OwningThread       2048

    EntryCount         0

    ContentionCount    1

    *** Locked

     

    Scanned 156 critical sections

     

    What thread are running in our process and what is 2048 doing?

     

    0:001> ~

    #  0  Id: 3270.2b10 Suspend: 0 Teb: 7efdd000 Unfrozen

    .  1  Id: 3270.2cd0 Suspend: 0 Teb: 7efda000 Unfrozen

     

    Ok here is our problem.  Apparently both threads hangtype2threada and hangtype2threadb were using this same critical section however something happened to hangtype2threadb.  We need to figure out what happened so let’s go take a look at that function.

     

    Looking back where we unassembled badwindow!hangtype2 we got it’s address, lets verify it with a ln (list near), we are lucky enough to have symbols in this case.

     

    0:001> ln 00401570

    c:\source\badwindow\badwindow\badwindow.cpp(370)

    (00401570)   badwindow!hangtype2threadb   |  (004015e0)   badwindow!hangtype3thread

    Exact matches:

        badwindow!hangtype2threadb (void *)

     

    Looks like we have an exact match.  Now lets unassemble it and see what went wrong.

     

     

    0:001> uf 00401570

    badwindow!hangtype2threadb [c:\source\badwindow\badwindow\badwindow.cpp @ 370]:

     

    Save ECX

      370 00401570 51              push    ecx

     

    Save EBX

      370 00401571 53              push    ebx

     

    Move the pointer to sprint into EBX

      371 00401572 8b1d7c204000    mov     ebx,dword ptr [badwindow!_imp__sprintf (0040207c)]

     

    Save EBP

      371 00401578 55              push    ebp

     

    Move the pointer to outputdebugstring into ebp

      371 00401579 8b2d14204000    mov     ebp,dword ptr [badwindow!_imp__OutputDebugStringA (00402014)]

     

    Save ESI

      371 0040157f 56              push    esi

     

    Move the pointer to EnterCriticalSection into ESI

      371 00401580 8b351c204000    mov     esi,dword ptr [badwindow!_imp__EnterCriticalSection (0040201c)]

     

    Save EDI

      371 00401586 57              push    edi

     

    Move the pointer for Sleep into EDI

      371 00401587 8b3d0c204000    mov     edi,dword ptr [badwindow!_imp__Sleep (0040200c)]

     

    Save  14h or 20dec to ESP+10h (A local on the stack)  Maybe this is a counter?

      371 0040158d c744241014000000 mov     dword ptr [esp+10h],14h

     

    Push the address of the critical section csCritSec1 00403780 onto the stack.

      374 00401595 6880374000      push    offset badwindow!csCritSec1 (00403780)

     

    Call entercriticalsection

      374 0040159a ffd6            call    esi

     

    Push 0xFA, 250Dec on the stack

      376 0040159c 68fa000000      push    0FAh

    Call Sleep (Wait for 250ms)

      376 004015a1 ffd7            call    edi

    Push pointer to value on the stack. When in doubt dump it out.

    0:001> db 004021e4

    004021e4  57 65 20 61 72 65 20 69-6e 20 68 61 6e 67 74 79  We are in hangty

    004021f4  70 65 32 74 68 72 65 61-64 62 00 00 48 00 00 00  pe2threadb..H...

    00402204  00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00  ................

    00402214  00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00  ................

      377 004015a3 68e4214000      push    offset badwindow!`string' (004021e4)

     

     

    Push pointer on the stack what is it?

    0:001> db 00403380

    00403380  57 65 20 61 72 65 20 69-6e 20 68 61 6e 67 74 79  We are in hangty

    00403390  70 65 32 74 68 72 65 61-64 62 00 00 00 00 00 00  pe2threadb......

    004033a0  00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00  ................

    004033b0  00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00  ................

      377 004015a8 6880334000      push    offset badwindow!szTrace (00403380)

     

    Call spirintf

      377 004015ad ffd3            call    ebx

     

    Clean up the stack

      377 004015af 83c408          add     esp,8

     

    This is out buffer we just did the sprintf to

      378 004015b2 6880334000      push    offset badwindow!szTrace (00403380)


    Call outputdebugstring

      378 004015b7 ffd5            call    ebp

     

    Push csCritSec2’s address on the stack

      380 004015b9 6868374000      push    offset badwindow!csCritSec2 (00403768)

     

    Call LeaveCriticalSection for csCritSec2’s

      380 004015be ff1524204000    call    dword ptr [badwindow!_imp__LeaveCriticalSection (00402024)]

     

    Decrement a counter on the stack this is a local counting down to zero..

      382 004015c4 836c241001      sub     dword ptr [esp+10h],1

     

    Check counter local counting down to zero if we are not ZERO yet dump to the top of the loop.

      382 004015c9 75ca            jne     badwindow!hangtype2threadb+0x25 (00401595)

     

     

    Restore all the registers and then return

      382 004015cb 5f              pop     edi

      382 004015cc 5e              pop     esi

      382 004015cd 5d              pop     ebp

      382 004015ce 5b              pop     ebx

      387 004015cf 59              pop     ecx

      387 004015d0 c3              ret

     

     

    Did you see the BUG?,  Look closely,   If you need it here is the source.

    void __cdecl hangtype2threadb(void *)

    {

          int i=0;

          while(1)

          {

                EnterCriticalSection(&csCritSec1);

               

                Sleep(250);

                sprintf(szTrace, "We are in hangtype2threadb");

                OutputDebugStringA(szTrace);

               

                LeaveCriticalSection(&csCritSec2);

                i++;

                if(i==20)

                {

                      break;

                }

          }

    }

     

    We are entering one critical section and leaving another.  Then we drop out of the function once we dec our counter to zero and the thread terminates leaving csCritSec1 entered but never left.  The fix for this seems rather simple,  we just need to leave critsec1 vis leave creatsec2.  That should fix it.  But it we don’t have the source how can we verify that?

    SIMPLE! We just modify the machine code in the debugger!   Often if we think we know how to fix something we will edit the code bytes to make the machine code do the right thing and let it run. 

     

    Do do this, from the command line in your debuggers directory run windbg.exe C:\source\badwindow\release\badwindow.exe   asuming you have your bad window sample in the same directory, I do.   When the debugger fires up make sure you have your symbol path set.  .sympath SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols.

     

    Our bad funtion call was

     

    Push csCritSec2’s address on the stack  << WRONG CRITICALSECTION

      380 004015b9 6868374000      push    offset badwindow!csCritSec2 (00403768)

     

    Call LeaveCriticalSection for csCritSec2’s

      380 004015be ff1524204000    call    dword ptr [badwindow!_imp__LeaveCriticalSection (00402024)]

     

    Push the address of the critical section csCritSec1 00403780onto the stack.   << CORRECT CRITICALSECTION

      374 00401595 6880374000      push    offset badwindow!csCritSec1 (00403780)

     

    Call entercriticalsection

      374 0040159a ffd6            call    esi

     

    Remember all we need to do is change what criticalsection was pushed on the stack for the leavecriticalsection call.

     

    004015b9 6868374000  (BAD)   
    00401595 6880374000  (GOOD)

    Now we just do a edit bytes on the bad instruction.

    0:001> eb 004015b9

    004015b9 68 68  << ENTER 68

    68

    004015ba 68 80  << We don’t want 68 enter 80

    80

    004015bb 37  << Now just hit enter to finish editing memory.

     

    Here is the fixed code.

     

    0:001> uf 00401570

    badwindow!hangtype2threadb [c:\source\badwindow\badwindow\badwindow.cpp @ 370]:
      370 00401570 51              push    ecx
      370 00401571 53              push    ebx
      371 00401572 8b1d7c204000    mov     ebx,dword ptr [badwindow!_imp__sprintf (0040207c)]
      371 00401578 55              push    ebp
      371 00401579 8b2d14204000    mov     ebp,dword ptr [badwindow!_imp__OutputDebugStringA (00402014)]
      371 0040157f 56              push    esi
      371 00401580 8b351c204000    mov     esi,dword ptr [badwindow!_imp__EnterCriticalSection (0040201c)]
      371 00401586 57              push    edi
      371 00401587 8b3d0c204000    mov     edi,dword ptr [badwindow!_imp__Sleep (0040200c)]
      371 0040158d c744241014000000 mov     dword ptr [esp+10h],14h

    ENTERING CORRET CRITICAL SECTION csCritSec1

      374 00401595 6880374000      push    offset badwindow!csCritSec1 (00403780)
      374 0040159a ffd6            call    esi
      376 0040159c 68fa000000      push    0FAh
      376 004015a1 ffd7            call    edi
      377 004015a3 68e4214000      push    offset badwindow!`string' (004021e4)
      377 004015a8 6880334000      push    offset badwindow!szTrace (00403380)
      377 004015ad ffd3            call    ebx
      377 004015af 83c408          add     esp,8
      378 004015b2 6880334000      push    offset badwindow!szTrace (00403380)
      378 004015b7 ffd5            call    ebp

    LEAVING CORRECT CRITICAL SECTION csCritSec1

      380 004015b9 6880374000      push    offset badwindow!csCritSec1 (00403780) 
      380 004015be ff1524204000    call    dword ptr [badwindow!_imp__LeaveCriticalSection
    (00402024)]

      382 004015c4 836c241001      sub     dword ptr [esp+10h],1
      382 004015c9 75ca            jne     badwindow!hangtype2threadb+0x25 (00401595)
      382 004015cb 5f              pop     edi
      382 004015cc 5e              pop     esi
      382 004015cd 5d              pop     ebp
      382 004015ce 5b              pop     ebx
      387 004015cf 59              pop     ecx
      387 004015d0 c3              ret

     

    Then just run the code (PRESS G then enter in the debugger) that’s it, it will work!

    Once you have proven this you can go to the developer of the application and recommend they change their code, remember to provide your debug notes.

     

    I hope you found this helpful and I welcome your feedback.

    Thank you,  Jeff-

     

     

  • Ntdebugging Blog

    Hung Window?, No source?, No problem!! Part 1

    • 2 Comments

    Written by Jeff Dailey

     

    Hello, my name is Jeff, I’m a escalation engineer on the Microsoft CPR Platforms team.   This blog entry is a follow on for how to detect a hung window.   This process and training lab is right out of our CPR Training curriculum.  In order to do the lab I have prepared for you, you will need to have downloaded the dumphungwindow and then badwindow.exe from my earlier blog post.  You will also need to install the debugging tools for windows.  

     

    Debugging tools:  http://www.microsoft.com/whdc/devtools/debugging/installx86.mspx

    Previous blog http://blogs.msdn.com/ntdebugging/archive/2007/05/29/detecting-and-automatically-dumping-hung-gui-based-windows-applications.aspx

     

    After you have both of these installed we can get started.  We are going to debug and figure out why the window stops repainting and does not respond.

     

    Step 1 start badwindow.exe

    Step 2 run dumphungwindow.exe

    Step 3 select Hang \ Hang Type 1 from  the BadWindow.exe menu.

     

    You should see dump hung window detect your window no processing messages and as a result it will dump the badwindow.exe process

     

    ************ OUTPUT *************

    C:\source\dumphungwindow\release>dumphungwindow.exe
    Dumps will be saved in C:\Users\jeffda\AppData\Local\Temp\
    scanning for hung windows

    **

    Hung Window found dumping process (7924) badwindow.exe
    Dumping unresponsive process
    C:\Users\jeffda\AppData\Local\Temp\HWNDDump_Day6_12_2007_Time9_53_34_Pid7924_badwindow.exe.dmp

    Dump complete

    Hung Window found dumping process (7924) badwindow.exe
    Dumping unresponsive process
    C:\Users\jeffda\AppData\Local\Temp\HWNDDump_Day6_12_2007_Time9_53_46_Pid7924_badwindow.exe.dmp

    Dump complete

    Hung Window found dumping process (7924) badwindow.exe
    Dumping unresponsive process
    C:\Users\jeffda\AppData\Local\Temp\HWNDDump_Day6_12_2007_Time9_53_56_Pid7924_badwindow.exe.dmp

    Dump complete*

    ************ OUTPUT *************

     

    Step 4 create a local symbol directory at C:\websymbols

    Step 5 set your symbol path under file \ symbols in windbg to SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols

    See http://www.microsoft.com/whdc/devtools/debugging/debugstart.mspx for details.

     

    Step 6 start windbg select file\open crash dump and select the first dump file.

    Your initial output should look like this.

     

    Microsoft (R) Windows Debugger  Version 6.7.0001.0

    Copyright (c) Microsoft Corporation. All rights reserved.

     

    ***** WARNING: Your debugger is probably out-of-date.

    *****          Check http://dbg for updates.

     

    Loading Dump File [C:\Users\jeffda\AppData\Local\Temp\HWNDDump_Day6_12_2007_Time9_53_34_Pid7924_badwindow.exe.dmp]

    User Mini Dump File with Full Memory: Only application data is available

     

    Symbol search path is: Symbol search path is: SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols

    Executable search path is:

    Windows Vista Version 6000 MP (2 procs) Free x86 compatible

    Product: WinNt, suite: SingleUserTS

    Debug session time: Tue Jun 12 09:53:35.000 2007 (GMT-4)

    System Uptime: 11 days 18:41:43.089

    Process Uptime: 0 days 0:00:32.000

    ....................................

    Loading unloaded module list

    .

    eax=00000000 ebx=00000000 ecx=00000000 edx=00000000 esi=004e3750 edi=004d9c60

    eip=777fb2b2 esp=0017ee18 ebp=0017ee44 iopl=0         nv up ei pl nz na po nc

    cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000202

    ntdll!NtAlpcSendWaitReceivePort+0x12:

    777fb2b2 c22000          ret     20h

     

    Step 7 from the debugger prompt (Locate a prompt at the bottom of windbg that has a 0:000> next to it.
    Type ~* k

    You will most likely see output similar to this.

     

    .  0  Id: 1ef4.17a8 Suspend: 0 Teb: 7efdd000 Unfrozen

    ChildEBP RetAddr 
    0017ee14 76f584c5 ntdll!NtAlpcSendWaitReceivePort+0x12
    0017ee44 76f5899d rpcrt4!LRPC_CASSOCIATION::AlpcSendWaitReceivePort+0x2f
    0017ee78 76f58514 rpcrt4!LRPC_BASE_CCALL::DoSendReceive+0x48
    0017ee90 76f57785 rpcrt4!LRPC_BASE_CCALL::SendReceive+0x2c
    0017eea0 76f5766b rpcrt4!I_RpcSendReceive+0x25
    0017eeb4 76f5763a rpcrt4!NdrSendReceive+0x31
    0017eec0 76ff0146 rpcrt4!NdrpSendReceive+0x9
    0017f2c8 756e5ff3 rpcrt4!NdrClientCall2+0x18f
    0017f2e0 756e5ef9 dnsapi!R_ResolverQuery+0x19
    0017f34c 756e5d34 dnsapi!Query_PrivateExW+0x257
    0017f388 756e6025 dnsapi!Query_Shim+0xbb
    0017f3b0 74f461b4 dnsapi!DnsQuery_W+0x20
    0017f3e0 74f4611c mswsock!SaBlob_Query+0x2d
    0017f424 74f459db mswsock!Rnr_DoDnsLookup+0x129
    0017f6c0 76d75a72 mswsock!Dns_NSPLookupServiceNext+0x25c
    0017f6d8 76d75a52 ws2_32!NSPROVIDER::NSPLookupServiceNext+0x17
    0017f6f4 76d75994 ws2_32!NSPROVIDERSTATE::LookupServiceNext+0x1c
    0017f720 76d7590a ws2_32!NSQUERY::LookupServiceNext+0xa4
    0017f764 76d7b11f ws2_32!WSALookupServiceNextW+0x74
    0017f7b4 76d7ab77 ws2_32!WSALookupServiceNextA+0x52

       1  Id: 1ef4.f80 Suspend: 0 Teb: 7efda000 Unfrozen

    ChildEBP RetAddr 
    0275fef0 76e41220 ntdll!ZwWaitForSingleObject+0x15
    0275ff60 76e41188 kernel32!WaitForSingleObjectEx+0xbe
    *** ERROR: Symbol file could not be found.  Defaulted to export symbols for FwcWsp.dll -
    0275ff74 55620611 kernel32!WaitForSingleObject+0x12
    WARNING: Stack unwind information not available. Following frames may be wrong.
    0275ff88 5561f10c FwcWsp!RatDumpStack+0x1bd
    0275ffa0 76eb19f1 FwcWsp!RatDbgPrint+0x156
    0275ffac 7782d109 kernel32!BaseThreadInitThunk+0xe
    0275ffec 00000000 ntdll!_RtlUserThreadStart+0x23

     

       2  Id: 1ef4.1d04 Suspend: 0 Teb: 7efd7000 Unfrozen
    ChildEBP RetAddr 
    028cfe90 76e4edb5 ntdll!ZwWaitForMultipleObjects+0x15
    028cff2c 76e430c3 kernel32!WaitForMultipleObjectsEx+0x11d
    028cff48 5560c841 kernel32!WaitForMultipleObjects+0x18
    WARNING: Stack unwind information not available. Following frames may be wrong.
    028cff84 5560c9c0 FwcWsp!WspDisable+0x1b47
    028cffa0 76eb19f1 FwcWsp!WspDisable+0x1cc6
    028cffac 7782d109 kernel32!BaseThreadInitThunk+0xe
    028cffec 00000000 ntdll!_RtlUserThreadStart+0x23

     

    Note the [NUMBER] Id: indicates the thread number, to the right of this you have the process id and thread id > PROCESS 1ef4.1d04 < THREAD | THREAD STATE > Suspend: 0 Teb: 7efd7000 Unfrozen

     

    Each of these threads represents a call stack.  The most recent call is at the TOP of the stack.  As each call is made the stack grows larger.  In this case thread 0 is not showing us all it’s stack.  We can get more of the stack be specifying the number of frames we would like to see.

     

    In this case my current thread is 0 so I will issue a k 123 command asking for as many 123 calls to be displayed.

     

    0:000> k 123

    ChildEBP RetAddr 

    0017ee14 76f584c5 ntdll!NtAlpcSendWaitReceivePort+0x12

    0017ee44 76f5899d rpcrt4!LRPC_CASSOCIATION::AlpcSendWaitReceivePort+0x2f

    0017ee78 76f58514 rpcrt4!LRPC_BASE_CCALL::DoSendReceive+0x48

    0017ee90 76f57785 rpcrt4!LRPC_BASE_CCALL::SendReceive+0x2c

    0017eea0 76f5766b rpcrt4!I_RpcSendReceive+0x25

    0017eeb4 76f5763a rpcrt4!NdrSendReceive+0x31

    0017eec0 76ff0146 rpcrt4!NdrpSendReceive+0x9

    0017f2c8 756e5ff3 rpcrt4!NdrClientCall2+0x18f

    0017f2e0 756e5ef9 dnsapi!R_ResolverQuery+0x19

    0017f34c 756e5d34 dnsapi!Query_PrivateExW+0x257

    0017f388 756e6025 dnsapi!Query_Shim+0xbb

    0017f3b0 74f461b4 dnsapi!DnsQuery_W+0x20

    0017f3e0 74f4611c mswsock!SaBlob_Query+0x2d

    0017f424 74f459db mswsock!Rnr_DoDnsLookup+0x129

    0017f6c0 76d75a72 mswsock!Dns_NSPLookupServiceNext+0x25c

    0017f6d8 76d75a52 ws2_32!NSPROVIDER::NSPLookupServiceNext+0x17

    0017f6f4 76d75994 ws2_32!NSPROVIDERSTATE::LookupServiceNext+0x1c

    0017f720 76d7590a ws2_32!NSQUERY::LookupServiceNext+0xa4

    0017f764 76d7b11f ws2_32!WSALookupServiceNextW+0x74

    0017f7b4 76d7ab77 ws2_32!WSALookupServiceNextA+0x52

    0017f7e0 76d7dba3 ws2_32!getxyDataEnt+0xbd

    0017fa1c 0040147d ws2_32!gethostbyname+0xb4

    0017fbc8 004012fc badwindow!hangtype1+0x3d [c:\source\badwindow\badwindow\badwindow.cpp @ 310]

    0017fc24 772a87af badwindow!WndProc+0xdc [c:\source\badwindow\badwindow\badwindow.cpp @ 274]

    0017fc50 772a8936 user32!InternalCallWinProc+0x23

    0017fcc8 772a8a7d user32!UserCallWinProcCheckWow+0x109

    0017fd2c 772a8ad0 user32!DispatchMessageWorker+0x380

    0017fd3c 004010fb user32!DispatchMessageW+0xf

    0017ff0c 00401817 badwindow!wWinMain+0xfb [c:\source\badwindow\badwindow\badwindow.cpp @ 124]

    0017ffa0 76eb19f1 badwindow!__tmainCRTStartup+0x150 [f:\sp\vctools\crt_bld\self_x86\crt\src\crtexe.c @ 589]

    0017ffac 7782d109 kernel32!BaseThreadInitThunk+0xe

    0017ffec 00000000 ntdll!_RtlUserThreadStart+0x23

     

    As you can see our current thread is the WinMain thread.   Note this from the call badwindow!wWinMain,  this winmain is processing messages, you can see it calling DispatchMessageW this goes back into User32, and user32 has to rout this message for the proper WinProc that is registered for the associated Window.  

     

    Looking further up the stack you see our winproc being called badwindow!WndProc.  Remember it’s critical that your WinProc stays available to process incoming messages.  If it makes a blocking call or runs in a loop you will not be able to process incoming messages and your window will not repaint and your application will appear to be hung.

     

    If you look deeper yet into the call stack you can see that the WndProc calls hangtype1, this calls gethostbyname.    http://msdn2.microsoft.com/en-us/library/ms738524.aspx 

     

    Looking at the docs, Gethostbyname takes one parameter being the server name.  Looks dump it out and see what we are looking for.

     

    0:000> kv 123

    ChildEBP RetAddr  Args to Child             

    0017ee14 76f584c5 00000424 40020000 004d9c60 ntdll!NtAlpcSendWaitReceivePort+0x12 (FPO: [8,0,0])

    0017ee44 76f5899d 00020000 004d9c60 004e3638 rpcrt4!LRPC_CASSOCIATION::AlpcSendWaitReceivePort+0x2f (FPO: [Non-Fpo])

    0017ee78 76f58514 0017eed8 0017ef04 0017f2e8 rpcrt4!LRPC_BASE_CCALL::DoSendReceive+0x48 (FPO: [Non-Fpo])

    0017ee90 76f57785 0017eed8 0017ef04 0017eeb4 rpcrt4!LRPC_BASE_CCALL::SendReceive+0x2c (FPO: [Non-Fpo])

    0017eea0 76f5766b 0017eed8 756e54f0 0017ef04 rpcrt4!I_RpcSendReceive+0x25 (FPO: [Non-Fpo])

    0017eeb4 76f5763a 0017ef04 004d9d18 76ff0146 rpcrt4!NdrSendReceive+0x31

    0017eec0 76ff0146 59a28201 00000000 00000000 rpcrt4!NdrpSendReceive+0x9 (FPO: [0,0,0])

    0017f2c8 756e5ff3 756e54f0 756e5aec 0017f2e8 rpcrt4!NdrClientCall2+0x18f

    0017f2e0 756e5ef9 00000000 0279091c 00000001 dnsapi!R_ResolverQuery+0x19 (FPO: [Non-Fpo])

    0017f34c 756e5d34 0279091c 00000001 14000000 dnsapi!Query_PrivateExW+0x257 (FPO: [Non-Fpo])

    0017f388 756e6025 00000001 0279091c 00000001 dnsapi!Query_Shim+0xbb (FPO: [Non-Fpo])

    0017f3b0 74f461b4 0279091c 00000001 14000000 dnsapi!DnsQuery_W+0x20 (FPO: [Non-Fpo])

    0017f3e0 74f4611c 0279091c 00000001 14000000 mswsock!SaBlob_Query+0x2d (FPO: [Non-Fpo])

    0017f424 74f459db 0279091c 0031c2e0 00307118 mswsock!Rnr_DoDnsLookup+0x129 (FPO: [Non-Fpo])

    0017f6c0 76d75a72 027908b8 00000000 0017f794 mswsock!Dns_NSPLookupServiceNext+0x25c (FPO: [Non-Fpo])

    0017f6d8 76d75a52 00307270 027908b8 00000000 ws2_32!NSPROVIDER::NSPLookupServiceNext+0x17 (FPO: [Non-Fpo])

    0017f6f4 76d75994 0031bc18 00000000 0017f794 ws2_32!NSPROVIDERSTATE::LookupServiceNext+0x1c (FPO: [Non-Fpo])

    0017f720 76d7590a 00307130 00000000 0017f794 ws2_32!NSQUERY::LookupServiceNext+0xa4 (FPO: [Non-Fpo])

    0017f764 76d7b11f 00307118 00000000 0017f794 ws2_32!WSALookupServiceNextW+0x74 (FPO: [Non-Fpo])

    0017f7b4 76d7ab77 00307118 00000000 0017f7ec ws2_32!WSALookupServiceNextA+0x52 (FPO: [Non-Fpo])

    0017f7e0 76d7dba3 00307118 0000013c 0017fa38 ws2_32!getxyDataEnt+0xbd (FPO: [Non-Fpo])

    0017fa1c 0040147d 0017fa38 00000000 00070520 ws2_32!gethostbyname+0xb4 (FPO: [Non-Fpo])

    0017fbc8 004012fc 00401220 0017fbfc 00401220 badwindow!hangtype1+0x3d (FPO: [Uses EBP] [0,101,0]) (CONV: cdecl) [c:\source\badwindow\badwindow\badwindow.cpp @ 310]

    0017fc24 772a87af 00070520 00000111 00008003 badwindow!WndProc+0xdc (CONV: stdcall) [c:\source\badwindow\badwindow\badwindow.cpp @ 274]

    0017fc50 772a8936 00401220 00070520 00000111 user32!InternalCallWinProc+0x23

    0017fcc8 772a8a7d 00000000 00401220 00070520 user32!UserCallWinProcCheckWow+0x109 (FPO: [Non-Fpo])

    0017fd2c 772a8ad0 00401220 00000000 0017ff0c user32!DispatchMessageWorker+0x380 (FPO: [Non-Fpo])

    0017fd3c 004010fb 0017fd54 00403938 00000001 user32!DispatchMessageW+0xf (FPO: [Non-Fpo])

    0017ff0c 00401817 00400000 00000000 004a0f84 badwindow!wWinMain+0xfb (CONV: stdcall) [c:\source\badwindow\badwindow\badwindow.cpp @ 124]

    0017ffa0 76eb19f1 7efde000 0017ffec 7782d109 badwindow!__tmainCRTStartup+0x150 (FPO: [Non-Fpo]) (CONV: cdecl) [f:\sp\vctools\crt_bld\self_x86\crt\src\crtexe.c @ 589]

    0017ffac 7782d109 7efde000 0017206b 00000000 kernel32!BaseThreadInitThunk+0xe (FPO: [Non-Fpo])

    0017ffec 00000000 00401987 7efde000 00000000 ntdll!_RtlUserThreadStart+0x23 (FPO: [Non-Fpo])

    0:000> da 0017fa38

    0017fa38  "AServerThatDoesNotExist14"

     

    Generally speaking it’s not a good idea to make any call that could possibly block in your winproc. 

     

    Let’s look deeper into what the caller (hangtype1) is really doing, Feel free to look at the C code as it was included with the project.  I like to look at the assembler,  it’s nice practice for those times you don’t have the source.  We can take the return address from the gethostbyname call 0040147d to figure out how it was called.  A neat trick for disassembling the entire function based on an address within that functions code range is to use the unassembled function or UF command.  In the following sample output I’ll simply UF the return address of the gethostbyname call.   I’ll then walk trough and annotate the disassembly.  Then we will look at the C source.

     

    For reference on intel cpu programming see http://developer.intel.com/design/pentium/manuals/24319001.pdf

     

    0:000> uf 0040147d

    We are making run for locals on the stack, 402 dec bytes.  If you look at the source you will see this is a sznamebuff[400], int I, and pointer to a host structure.
    badwindow!hangtype1 [c:\source\badwindow\badwindow\badwindow.cpp @ 298]:
    298 00401440 81ec94010000    sub     esp,194h 

     

    This is done for  stack checking to ensure none is trying to exploit our stack.
    298 00401446 a100304000      mov     eax,dword ptr [badwindow!__security_cookie (00403000)]
    298 0040144b 33c4            xor     eax,esp
    298 0040144d 89842490010000  mov     dword ptr [esp+190h],eax

    Save ebx register state on the stack
    298 00401454 53              push    ebx 

    Get the function pointer for sprint from the import table.
    299 00401455 8b1d7c204000    mov     ebx,dword ptr [badwindow!_imp__sprintf (0040207c)]

    Save the stack base / frame pointer.
    299 0040145b 55              push    ebp

    Get the address of gethostbyname from the import table and save it in EBP.
    299 0040145c 8b2d20214000    mov     ebp,dword ptr [badwindow!_imp__gethostbyname (00402120)]

    Save ESI and EDI on the stack.
    299 00401462 56              push    esi
    299 00401463 57              push    edi

    Zero out ESI

    299 00401464 33f6            xor     esi,esi
    badwindow!hangtype1+0x26 [c:\source\badwindow\badwindow\badwindow.cpp @ 306]:

    Push ESI on the stack and the first pass it will be zero == 0, this is the 3rd parameter to the call we are about to make.
    306 00401466 56              push    esi

    We are loading the effective address of something on the stack, we will see what it is used for when we use EAX, remember that.
    306 00401467 8d442414        lea     eax,[esp+14h]

    Hum what is this?.
    badwindow!`string' (004021ac) Well when in doubt dump it out.
    0:000> db 004021ac
    004021ac  41 53 65 72 76 65 72 54-68 61 74 44 6f 65 73 4e  AServerThatDoesN
    004021bc  6f 74 45 78 69 73 74 25-64 00 00 00 57 65 20 61  otExist%d...We a
    004021cc  72 65 20 69 6e 20 68 61-6e 67 74 79 70 65 32 74  re in hangtype2t
    004021dc  68 72 65 61 64 61 00 00-57 65 20 61 72 65 20 69  hreada..We are i
    004021ec  6e 20 68 61 6e 67 74 79-70 65 32 74 68 72 65 61  n hangtype2threa
    004021fc  64 62 00 00 48 00 00 00-00 00 00 00 00 00 00 00  db..H...........
    0040220c  00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00  ................
    0040221c  00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00  ................

    Ok so we are pushing our print formatting string for sprintf this is the second parameter to the call we are about to make.

    306 0040146b 68ac214000      push    offset badwindow!`string' (004021ac)

    Ok here we are pushing EAX,  remember eax is just an address on our stack, and in this case the 1st parameter passed to sprint,  it’s our BUFFER.  szNameBuffer to be exact.
    306 00401470 50              push    eax

     

    Call sprint (Remember we stored it’s address in ebx)
    306 00401471 ffd3            call    ebx

     

    Stacks grown down or negative as they get larger, so adding in this case is cleaning up the stack from our sprintf call
    306 00401473 83c40c          add     esp,0Ch

     

    Get the address of the szNameBuffer
    307 00401476 8d4c2410        lea     ecx,[esp+10h]

     

    Push the address of the name buffer on the stack.
    307 0040147a 51              push    ecx

     

    Call gethostbyname (Rember we stored it’s address in ebp the)  NOTE when you do a return the value is stored in EAX (REMBER THIS)
    307 0040147b ffd5            call    ebp

     

    Push 250 dec FA hex on the stack.  (This is our return address from get host by name we see on the stack)
    310 0040147d 68fa000000      push    0FAh 

     

    Store EAX in EDI We need to save it because EAX will be over written when we come back from Sleep and we may want to use it later.  Eax is the return value from gethostbyname.
    310 00401482 8bf8            mov     edi,eax

     

    Call sleep it takes 1 parameter and that values was just pushed on the stack 250ms ¼ second.
    310 00401484 ff150c204000    call    dword ptr [badwindow!_imp__Sleep (0040200c)]

    Increment ESI this is where we are storing the value int i.   It was zero the first time through now it’s ESI+1 if we loop it will grow
    311 0040148a 83c601          add     esi,1

     

    Compare ESI we know this is in I to 0x64h or 100dec.
    312 0040148d 83fe64          cmp     esi,64h

    Where we have a jump if greater then, that is to say if ESI > 64h or 100dec jump to 00401496
    312 00401490 7f04            jg      badwindow!hangtype1+0x56 (00401496)

    Now we test EDI, this is the return value from gethostbyname.  TEST is a logical compare.  So if you test EDI and EDI and EDI = 0
    304 00401492 85ff            test    edi,edi

    If EDI is ZERO 0 jump to this address je is jump if ZERO this take us to the TOP of the loop.

    304 00401494 74d0            je      badwindow!hangtype1+0x26 (00401466)

     

    The rest of this is us cleaning up the stack and getting ready to drop out of the function.  We can stop here.
    318 00401496 8b8c24a0010000  mov     ecx,dword ptr [esp+1A0h]
    318 0040149d 5f              pop     edi
    318 0040149e 5e              pop     esi
    318 0040149f 5d              pop     ebp
    318 004014a0 5b              pop     ebx
    318 004014a1 33cc            xor     ecx,esp
    318 004014a3 e8c5010000      call    badwindow!__security_check_cookie (0040166d)
    318 004014a8 81c494010000    add     esp,194h
    318 004014ae c3              ret

     

    Here is the C code snipped for hangtype1

     

    void hangtype1(void)

    {

    int i =0;

    char szNameBuff[400];

    struct hostent * myhost=NULL;

     

          // Lets see if any of my favorte machines are on the network.

          while (!myhost)

          {

                sprintf(szNameBuff,"AServerThatDoesNotExist%d",i);

                myhost = gethostbyname (szNameBuff);

                // We are hitting network resource in a loop.

                // Let's not hit the server with to many request at once. We will pause 250ms.

                Sleep(250);

                i++;

                if(i>100)

                {

                      break;

                }

          }

         

    }

     

    So what is happening,  if you recall we saw AServerThatDoesNotExist14 on the stack.  We know from this loop and the format string passed to sprint that this loop will run to 100 and will sleep 250ms with each pass.  That’s 25 seconds total just in sleeps not to mention the network stall on a bad server name.   I realize this code is somewhat contrived but it’s based on the types of things we actually see and troubleshoot every day.

     

    So how do we get this work done without stalling the application?.  The answer is rather simple.  We go multithreaded.   You can look for hangtype1fix in the project.  Instread of calling hangtype1 from our windproc directly we declare a threadproc called hangtype1fixthread

     

    void __cdecl hangtype1fixthread(void *)

    {

           // our blocking function.

           hangtype1();

    }

     

    Then from our winproc instead of calling hangtype1 we call another function called hangtype1fix that creates a new thread that in turn executes out hangtype1 code in another thread context so our winproc can continue to execute and process incoming messages.  The being thread call is asynchronously .  The thread is create immediately and your thread continues to run.

     

    void hangtype1fix(void)

    {

           _beginthread(hangtype1fixthread, 12000, NULL);

    }

     

    I hope you found this blog informative and I welcome your feedback.

     

    Thanks  Jeff-

     

     

  • Ntdebugging Blog

    How Windows Shuts Down

    • 7 Comments

    Hi my name is Bryan,  I'm a escalation engineer on the Microsoft CPR platforms team.   A common problem scenario involves shutting down Windows. When troubleshooting problems during shut down we have to look at the Winlogon.exe process which can be tricky and must be done correctly.

     

    Troubleshooting Techniques

     

    Common Settings

     

    Settings that would affect how Windows shuts down are mostly in HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Winlogon, and are documented thoroughly at http://www.microsoft.com/technet/prodtechnol/windows2000serv/reskit/regentry/12314.mspx. The setting to really pay attention is:

     

    DisableCAD (REG_DWORD). When set to 1 the GINA will bypass the Ctrl-Alt-Del dialog and go straight to the logon prompt. The system cannot be shutdown when the logon prompt at the console is displayed.

     

    Another common setting that is sometimes needed for security reasons is in HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management.

     

    ClearPageFileAtShutdown (REG_DWORD). When set to 1 the memory manager will clear the paging file on shutdown. If the paging file is large this can significantly the time it takes for the system to shut down.

     

    Winlogon Event Notification

     

    Prior to Windows Vista/2008 Server, Winlogon Notification Packages can be used to reliably monitor the state of Winlogon. These are simple export DLLs that are registered with entry points for each event monitored by the package. Winlogon explicitly calls these entry points when a specific event occurs. Each package handles the notification within the context of the Winlogon process. Keep in mind that a bug in a notification package can cause problems in Winlogon and render the system inaccessible. You can find more information about Winlogon Notification Packages at http://msdn2.microsoft.com/en-us/library/aa380545.aspx.

     

    Starting with Windows Vista support for Winlogon Notification Packages has been pulled. Most of this functionality is still provided through SCM event notifications. See the following link for more information about this.

     

    http://technet2.microsoft.com/WindowsVista/en/library/6ec4ec6d-6b84-44c9-b3af-116589a42b861033.mspx?mfr=true

     

    You can find information about how to write a service here.

     

    http://msdn2.microsoft.com/en-us/library/ms685969.aspx

     

    However these notifications are performed asynchronously so the exact moment of each state within Winlogon cannot be reliably known.

     

    Winlogon Logging

     

    Winlogon logging is obtained by applying the checked build of Winlogon and setting the following registry values under HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Winlogon.

     

    DebugFlags (REG_SZ). This is a comma-separated list of debug flags to capture. The list of flags is Error, Warning, Trace, Init, Timeout, Sas, State, MPR, CoolSwitch, Profile, DebugLsa, DebugSpm, DebugMpr, DebugGo, Migrate, DebugServices, Setup, SC, Notify, and Job.

     

    LogFile (REG_SZ). This is the path to the log file where the events should be written. If this value is missing then the events are written to the debug port.

     

    These options can also be placed in the win.ini file. This option and other debug logging options for Winlogon can be found in these articles.

    http://support.microsoft.com/default.aspx?scid=kb;EN-US;232575

    http://support.microsoft.com/default.aspx?scid=kb;EN-US;221833

     

    Checked builds of Windows binaries are available through MSDN subscriptions. The checked binary applied to the system must match the service pack level of the system.

     

    When looking at the Winlogon logs there are three check points to identify. They are the three ExitWindowsEx calls made during shutdown. These log entries were made during a shutdown initiated with “shutdown –r –f –t 0”.

     

    328.372> Winlogon-Trace: Doing remote-initiated (Administrator) Reboot=true, Force=true

    328.372> Winlogon-Trace: Starting user thread for Logoff, flags = 3807

    328.764> Winlogon-Trace: Doing immediate shutdown, LastGinaRet = b, Flags = 3804

    328.764> Winlogon-Trace: Calling ExitWindowsEx(0x3804, 0)

    .

    .

    .

    328.1528> Winlogon-Trace: ExitWindowsEx called to shut down COM processes

    .

    .

    .

    328.332> Winlogon-Trace: Starting shutdown

    328.332> Winlogon-Trace: Starting user thread for Logoff, flags = 7

    .

    .

    .

    328.284> Winlogon-Trace: Calling ExitWindowsEx(0x7, 0)

     

    If shutdown stalls then based on the log entries you should be able to identify which part of shutdown it is stuck in.

     

    Debugging

     

    In order to debug Winlogon you will need to first install the debugging tools for Windows (http://www.microsoft.com/whdc/devtools/debugging/default.mspx). It contains information about how to set up the debugger for Winlogon debugging. If the debugger is not set up correctly on Winlogon the system can easily get in an unusable state and the system will need to be reinstalled.

     

    You should debug Winlogon whenever a problem manifests itself on the Winlogon desktop where the SAS and logon prompt dialogs are displayed. When debugging Winlogon you should start by looking at the main process thread. If shutdown (or any other Winlogon activity) is hung that thread will show what happened to prevent shutdown from completing. Also look at these Winlogon flags which provide some indication of whether or not the main thread is even processing a shutdown.

     

    0105fe8c winlogon!g_fReadyForShutdown

    0105fdf8 winlogon!ShutdownHasBegun

    01062b3c winlogon!ShutdownInProgress

    01062b30 winlogon!ShutdownTime

     

    I will point out how these variables are modified.

     

    Shutdown Sequence of Events

     

    If you are debugging a shutdown issue you first need to understand the sequence of events that take place during shutdown. This will help isolate what activity may be failing.

     

    RPC Call

     

    An application, local or remote, tells Windows to shut down by calling the Win32 API InitiateSystemShutdownEx (http://msdn2.microsoft.com/en-us/library/aa376874.aspx). This API creates an RPC call over the named pipe InitShutdown. The remote connect and RPC call looks like this in a network trace.

     

    11:08:40.025                 CLIENT         SERVER        SMB              SMB: C; Nt Create Andx, FileName = \InitShutdown

    11:08:40.027                 CLIENT         SERVER        MSRPC        MSRPC: c/o Request: unknown Call=0x1 Opnum=0x2 Context=0x0 Hint=0x20

     

    A local connect would not need to go over the network, but it uses the same mechanism to make this call.

     

    The server side of this RPC call is handled by the function BaseInitiateShutdownEx in the Winlogon.exe process. This RPC worker performs the following actions.

     

    1.       Checks the privilege level of the client. If the client privilege test fails then the error status code is returned. This will typically be ERROR_ACCESS_DENIED (0x5).

    2.       Parses the shutdown command sent by the RPC client. If the command is malformed then the status error code of ERROR_INVALID_PARAMETER (0x57) is returned.

    3.       Checks the Winlogon flags ShutdownInProgress, ShutdownHasBegun, g_fReadyForShutdown, and the current state of the terminal desktop to see if we can shutdown. Winlogon cannot begin the shutdown if a shutdown is in progress or if it is not ready. Winlogon also will not start a shutdown if the force flag was not specified in the command and the desktop is locked. If the ShutdownInProgress or ShutdownHasBegun flag is set then the return value is ERROR_SHUTDOWN_IN_PROGRESS (0x45B). If the g_fReadyForShutdown is not set then the return value is ERROR_NOT_READY (0x15).

     

    Debugger output:

    dd winlogon!g_fReadyForShutdown l 1

    0105fe8c  00000001

     

    dd winlogon!ShutdownInProgress l 1

    01062b3c  00000000

     

    dd winlogon!ShutdownHasBegun l 1

    0105fdf8  00000000

     

    4.       Winlogon initializes the shutdown data including the global variable ShutdownTime. If this variable has been set then we know we’ve gotten this far.

     

    Debugger Output:

    dq winlogon!ShutdownTime l 1

    01062b30  01c7a859`baee0060

     

    .formats 01c7a859`baee0060

    Evaluate expression:

      Time:    Wed Jun  6 12:42:54.506 2007 (GMT-4)

     

    5.       If Winlogon has determined that it can proceed then the ShutdownInProgress flag is set. Subsequent shutdown requests from this point will fail in step 3 when it checks this flag.

    6.       Generates the shutdown audit event if auditing was enabled. Note that shutdown may fail at some point after this even though the audit log is generated.

    7.       Creates another thread to continue working on the shutdown. At this point the RPC worker thread returns to the caller.

     

    To this point there has been no visible indication on the server being shutdown. If the RPC worker fails for some reason then the client application will get an indication of the failure. If the RPC worker successfully hands the shutdown request to the next thread then the client application will get ERROR_SUCCESS (0) as the return code. The client application will get no indication of a failure after that point.

     

    Worker Thread

     

    The RPC worker thread hands control to the LogoffThreadProc thread. If the delay specified in the shutdown command was 0 then the ShutdownInProgress flag is cleared and the ShutdownHasBegun is set. In the Winlogon log you will see a line that starts with “Doing immediate shutdown”. Otherwise you will not see any visible indication at this point that a shutdown is occurring.

     

    If the delay specified in the shutdown command was greater than 0 then the countdown shutdown dialog is displayed.

     

    Shutdown

     

    When that dialog completes the AbortShutdown flag is checked. This flag would get set as a result of a call to AbortSystemShutdown (http://msdn2.microsoft.com/en-us/library/aa376630.aspx). If it is set then the shutdown is aborted. Otherwise the ShutdownInProgress flag is cleared, the ShutdownHasBegun flag is set and if the shutdown was initiated by a system process then we shut the system down immediately at this point with no further clean up.

     

    After these flags are updated the ExitWindowsInProgress flag is set and we call the Win32 API ExitWindowsEx (http://msdn2.microsoft.com/en-us/library/ms893047.aspx). In the Winlogon log you will see a line that starts with “Calling ExitWindowsEx”.

     

    Debugger Output:

    dd winlogon!ExitWindowsInProgress l 1

    0105fd84  00000001

     

    The Win32 API ExitWindowsEx makes an RPC call to CSRSS.EXE. CSRSS synchronously sends a WM_QUERYENDSESSION message to all Windows applications. When an application gets this message it indicates that shutdown can continue and CSRSS then sends the WM_ENDSESSION message. After that the process is terminated. If the application indicates that it cannot be terminated then CSRSS stops processing any further applications and waits for the interactive user to close the application. The ExitWindowsEx call will fail with error ERROR_OPERATION_ABORTED (0x3E3) and the Winlogon flags are reset so that a new shutdown request can be processed.

     

    An application that prevents shutdown from proceeding in this manner can be seen visual since it will be the foreground window on the desktop. To confirm which application returned did this you will need to live debug CSRSS. A return code of 3 from either winsrv!ConsoleClientShutdown or winsrv!UserClientShutdown will indicate the application did this.

     

    Debugger Output:

    0:002> pc

    eax=00000000 ebx=7c81a3ab ecx=7ffdb000 edx=75a58ca0 esi=75a58ca0 edi=00164600

    eip=75a564de esp=0052fe40 ebp=0052fe68 iopl=0         nv up ei pl zr na po nc

    cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000246

    CSRSRV!CsrShutdownProcesses+7e:

    75a564de ff5740 call dword ptr [edi+0x40]{winsrv!UserClientShutdown (75a9db1f)} ds:0023:00164640=75a9db1f

     

    ; Step past the call.

    0:002> p

    eax=00000003 ebx=7c81a3ab ecx=7ffdb000 edx=75a58ca0 esi=75a58ca0 edi=00164600

    eip=75a564e1 esp=0052fe4c ebp=0052fe68 iopl=0         nv up ei pl zr na po nc

    cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000246

    CSRSRV!CsrShutdownProcesses+81:

    75a564e1 8bf8             mov     edi,eax

     

    ; The first parameter is a structure that contains the process ID.

    75a564d5 ff75f4           push    dword ptr [ebp-0xc]

    75a564d8 ff750c           push    dword ptr [ebp+0xc]

    75a564db ff75f8           push    dword ptr [ebp-0x8]

    75a564de ff5740           call    dword ptr [edi+0x40]

     

    ; Get the pointer to this structure.

    0:002> dd ebp-8 l 1

    dd ebp-8 l 1

    0052fe60  0018a530

     

    ; The first DWORD is the process ID. The second DWORD is the thread ID.

    0:002> dd 0018a530 l 1

    dd 0018a530

    0018a530  0000066c

     

    ; Break into kernel mode so we can look at all the processes.

    0:002> .breakin

    .breakin

    Break instruction exception - code 80000003 (first chance)

    nt!RtlpBreakWithStatusInstruction:

    8081db0e cc              int     3

     

    ; Get the process object with that process ID.

    kd> !process 0000066c 0

    Searching for Process with Cid == 66c

    PROCESS ff62a638  SessionId: 0  Cid: 066c    Peb: 7ffdf000  ParentCid: 0108

        DirBase: 0390d000  ObjectTable: e1658e38  HandleCount:  51.

        Image: test.exe

     

    Console (text-based) applications are asynchronously sent a separate CTRL_SHUTDOWN_EVENT notification. This means that ExitWindowsEx will proceed no matter how the application chooses to handle this notification.

     

    Services.exe (and all Windows services) is a console application which receives this notification from CSRSS. Services.exe registers a control handler ScShutdownNotificationRoutine which calls ScShutdownAllServices on shutdown. This function traverses through all Windows services that are not stopped or stopping and that accept the shutdown notification and sends them the SERVICE_CONTROL_SHUTDOWN notification. Each service has 20 seconds by default to shutdown. However a service may request more time by calling SetServiceStatus with a wait hint and updated check point. It can do this so long as it continues to respond within the current timeout period. However since Services.exe received this as an asynchronous message from CSRSS.EXE it will not prevent the system from shutting down.

     

    After ExitWindowsEx returns control to Winlogon the ExitWindowsInProgress flag is cleared and LogoffThreadProc exits.

     

    MainLoop

     

    The state of Winlogon is controlled by the main process thread in a function called MainLoop. As LogoffThreadProc sets the ShutdownHasBegun flag and calls ExitWindowsEx the MainLoop function picks up on this change of state and begins executing its shutdown code. Since MainLoop is responsible for interaction with the user this is the first place where the user will get visible confirmation from Winlogon that the system is shutting down.

     

    When MainLoop sees that the Winlogon state has changed to shutting down it takes the following actions.

     

    1.       Signal the shell that we are shutting down. This causes the Explorer shell to disappear.

    2.       Checks to see if there are any updates to the user’s profile.

    3.       Send out the logoff notification event.

    4.       Delete network connections.

    5.       Play the logoff sound.

    6.       Play the system exit sound.

    7.       Creates a KillComProcesses thread. This calls ExitWindowsEx and will wait up to 15 minutes for this to complete.

    8.       Save and unload the user’s profile.

    9.       Delete RAS connections.

    10.    Send out the shutdown notification event.

    11.    Stop Windows file protection.

    12.    Creates another LogoffThreadProc thread which again calls ExitWindowsEx.

    13.    Call the shutdown function in the GINA. This displays the Windows is shutting down dialog.

    14.    Wait for any remaining system processes to complete. If we are stuck here then we would need to look at System, smss.exe, or csrss.exe. One of those will be stuck in some action.

    15.    Shut down the system.

     

    Winlogon uses the native API NtShutdownSystem in step 15. If there is a hang after this point then you will see the main thread in Winlogon stuck in this call and the real problem is likely the result of a device driver in the System process. This call will never return.

  • Ntdebugging Blog

    Detecting and automatically dumping hung GUI based windows applications..

    • 11 Comments

    Written by Jeff Dailey 

    My name is Jeff,  I’m an Escalation Engineer on CPR Platforms team.  Following Tate’s blog on scoping hangs I’d like discus a common category of hangs and some creative ways to track them down.  I will be providing a couple of labs to go with this post that you can run and debug on your machine and I will also be showing you how to write a hang detection tool that will dump processes that go unresponsive.  In addition to this I will be writing several more blog entries about the various hang scenarios contained in the badwindows.zip that is included with this blog.

    GUI Hangs

    Sometimes a Windows application that is GUI (Graphical User Interface) based, that is to say uses windows, buttons, scroll bars etc. may stop responding (Not Responding status in Task Manager).  When this happens in most cases the rest of the operating system seems to continue functioning ok.  However the application does not repaint or respond to mouse clicks or key strokes.  Sometimes these types of problem may be transient.   Your app may hang once or twice a day for 10-30 seconds.  In other cases it may hang for long periods of time or never recover.

    To get a better understand of this scenario it’s important to understand that all GUI based Windows applications work by passing messages to one another via a message queues.   Each Windows application typically has a single main thread that is responsible for processes these messages.   Though the application may be multi threaded there will typically be one thread processes messages.  This functionally is normally implement in WinMain.    This thread does different tasks based on the messages it receives.   It could open a dialog, create anther thread, or take actions based on a mouse click of even send a message to another Windows application or applications.

    When your application stops responding it’s generally due to this thread making a blocking / synchronous call that takes too long.  If the thread is unable to pull incoming messages from the OS it will appear to be hung.   Most of the time once you have the dump of the process you can look at thread 0 by doing a ~0s in cdb or windbg.  Then do a KB and see what the thread is blocking on or possibly looping in that is preventing it from processing messages.  If thread 0 is not the thread processing your messages you may be able to find it by dumping all the thread stacks, ~*kb

    The problem is you may not be able to fire up cdb or windbg to get a dump in time.  Or you may have a non technical user community that does not know about debuggers or creating dumps.  In this case you can do what I sometimes do. 

    Create your own tool.

    That’s right.  Sometimes I will see a scenario that warrants a slightly more elegant solution and there is nothing more powerful than a determined engineer and a C complier.   

    What is required?  Visual Studio (The Express edition is free), Windows SDK (free), the debugger SDK (free with Debugging Tools Install), and a little knowledge of how Windows works.

    Let’s take a moment and think about what our ideal debug application will and won’t do.  

    1.     It will be easy to use and configure and use.

    2.     It will not break or negatively impact our operating system.  That is to say, it will not use much CPU or resources.

    3.     It will wait quietly for our desired condition (in this case a hung window) to manifest.

    4.     It will spring into action and gather the critical information about the state of our misbehaving application by creating a dump file without raising a fuss.

    5.     It will be multi user aware and not place dumps in insecure locations, this means the dumps will go in the user’s temp directory.

    6.     We will only collect a limited number of dump files so we do not fill up the hard drive.

    7.     It will notify the admin of a hang and dump event by putting a message in the event log. 

    8.     It will execute an optional binary when a hang is detected.

     

    Here are the details of how it will work.

    To keep things simple we will just create a console application.  The application will be called dumphungwindow.exe.  We will run in a loop until we collect the desired number of dump files.  We will wake up every so many seconds,  get the top most window, loop through each window sending it a message with the SendMessageTimeout  API and if any window takes more then what we signify as our timeout we will create a dump of that process and log an event in the event log.   

    I have the sample dumphungwindow.zip and badwindow.zip embedded within it available for download here, it has the EXEs and the visual studio 2005 project with all of the source.  The tool project is called dumphungwindow, and the test application is in a project called badwindow.  This project contains a lab with three different hang scenarios that cause a window to stop responding. 

    The command line options are as follows.

    C:\source\dumphungwindow\debug>dumphungwindow.exe /?
     This sample application shows you how to use the debugger
     help api to dump an application if it stop responding.

     This tool depends on dbghelp.dll, this comes with the Microsoft debugger tools on www.microsoft.com

     Please make sure you have the debugger tools installed before running this tool.
     This tool is based on sample source code and is provided as is without warranty.

     feel free to contact jeffda@microsoft.com to provide feedback on this sample application

     /m[Number] Default is 5 dumps

     The max number of dumps to take of hung windows before exiting.

     /t[Seconds]  Default is 5 seconds

     The number of seconds a window must hang before dumping it.

     /p[Seconds] Default is 0 seconds

     The number of seconds to pause when dumping before continuing scan.

     /s[Seconds] Default is 5 seconds.

     The scan interval in seconds to wait before rescanning all windows.

     /d[DUMP_FILE_PATH] The default is the SystemRoot folder

     The path or location to place the dump files.

     /e[EXECUTABLE NAME] This allows you to start another program if an application hangs

    To run the tool simply start dumphungwindow.exe  The output should look something like this.

    C:\source\dumphungwindow\debug>dumphungwindow.exe
    Dumps will be saved in C:\Users\jeff\AppData\Local\Temp\
    scanning for hung windows

    ****

    To start our bad application extract the badwindowapp.zip file contained in the dumphungwindows.zip

     

    Then run badwindow.exe and from the menu select hang \ hang type 2.

     

    After a few seconds findhungwindow should detect the unresponsive badwindow.exe and generate a dump.

     

    Hung Window found dumping process (7064) badwindow.exe

    Dumping unresponsive process

    C:\Users\jeffda\AppData\Local\Temp\HWNDDump_Day5_29_2007_Time10_36_38_Pid7064_badwindow.exe.dmp

     

     

    Please take a moment and review the source.  I’ve added comments that explain how we go about finding the hung window, and how we go about dumping it to a dump file you can open in windbg..

     

    Feel free to download and try out dumphungwindow against the badwindow.exe application.  Try looking at “hang type 1” first as that will be my next blog.  Over the coming weeks I’ll be writing about hang types 1,2 and 3 in the badwindow.exe application.   Once you have the dump file you can open it by inside of windbg via file \ open crash dump.  See Debugging tools for the install location.

     

    I hope you find this tool and sample helpful.

     

    Thank you Jeff-

     

     

    /********************************************************************************************************************

    Warranty Disclaimer

    --------------------------

    This sample code, utilities, and documentation are provided as is, without warranty of any kind. Microsoft further disclaims all

    implied warranties including without limitation any implied warranties of merchantability or of fitness for a particular  purpose.

    The entire risk arising out of the use or performance of the product and documentation remains with you.

     

    In no event shall Microsoft be liable for any damages whatsoever  (including, without limitation, damages for loss of business

    profits, business interruption, loss of business information, or other pecuniary loss) arising out of the use of or inability to

    use the sample code, utilities, or documentation, even if  Microsoft has been advised of the possibility of such damages.

    Because some states do not allow the exclusion or limitation of liability for consequential or incidental damages, the above

    limitation may not apply to you.

     

    ********************************************************************************************************************/

     

    #include <stdio.h>

    #include <windows.h>

    #include <dbghelp.h>

    #include <psapi.h>

     

    // don't warn about old school strcpy etc.

    #pragma warning( disable : 4996 )

     

    int iMaxDump=5;

    int iDumpsTaken=0;

    int iHangTime=5000;

    int iDumpPause=1;

    int iScanRate=5000;

    HANDLE hEventLog;

    char * szDumpLocation;

    int FindHungWindows(void);

    char * szDumpFileName = 0;

    char * szEventInfo = 0;

    char * szDumpFinalTarget = 0;

    char * szModName = 0;

    char * szAppname = 0;

    DWORD dwExecOnHang = 0;

     

    #define MAXDUMPFILENAME 1000

    #define MAXEVENTINFO 5000

    #define MAXDUMPFINALTARGET 2000

    #define MAXDUMPLOCATION 1000

    #define MAXAPPPATH 1000

    #define MAXMODFILENAME 500

    #define HMODSIZE 255

     

    int main(int argc, char * argv[])

    {

          int i;

          int z;

          size_t j;

          char scan;

     

          // check to make sure we have dbghelp.dll on the machine.

          if(!LoadLibrary("dbghelp.dll"))

          {

                printf("dbghelp.dll not found please install the debugger tools and place this tool in \r\nthe debugging tools directory or a copy of dbghelp.dll in this tools directory\r\n");

                return 0;

          }

     

          // Allocate a buffer for our dump location

          szDumpLocation = (char *)malloc(MAXDUMPLOCATION);

          {

                if(!szDumpLocation)

                {

                printf("Failed to alloc buffer for szdumplocation %d",GetLastError());

                return 0;

                }

          }

     

          szAppname = (char *)malloc(MAXAPPPATH);

          {

                if(!szAppname)

                {

                printf("Failed to alloc buffer for szAppname  %d",GetLastError());

                return 0;

                }

          }

     

          // We use temp path because if we are running under terminal server sessions we want the dump to go to each

          // users secure location, ie. there private temp dir. 

          GetTempPath(MAXDUMPLOCATION, szDumpLocation );

         

          for (z=0;z<argc;z++)

          {

                switch(argv[z][1])

                {

                case '?':

                      {

                      printf("\n This sample application shows you how to use the debugger \r\n help api to dump an application if it stop responding.\r\n\r\n");

                      printf("\n This tool depends on dbghelp.dll, this comes with the Microsoft debugger tools on www.microsoft.com");

                      printf("\n Please make sure you have the debugger tools installed before running this tool.");

                      printf("\n This tool is based on sample source code and is provided as is without warranty.");

                      printf("\n feel free to contact jeffda@microsoft.com to provide feedback on this sample application\r\n\r\n");

                      printf(" /m[Number] Default is 5 dumps\r\n The max number of dumps to take of hung windows before exiting.\r\n\r\n");

                      printf(" /t[Seconds]  Default is 5 seconds\r\n The number of seconds a window must hang before dumping it. \r\n\r\n");

                      printf(" /p[Seconds] Default is 0 seconds\r\n The number of seconds to pause when dumping before continuing scan. \r\n\r\n");

                      printf(" /s[Seconds] Default is 5 seconds.\r\n The scan interval in seconds to wait before rescanning all windows.\r\n\r\n");

                      printf(" /d[DUMP_FILE_PATH] The default is the SystemRoot folder\r\n The path or location to place the dump files.  \r\n\r\n");

                      printf(" /e[EXECUTABLE NAME] This allows you to start another program if an application hangs\r\n\r\n");

     

                      return 0;

                      }

                case 'm':

                case 'M':

                      {

                            iMaxDump = atoi(&argv[z][2]);

                            break;

                      }

                case 't':

                case 'T':

                      {

                            iHangTime= atoi(&argv[z][2]);

                            iHangTime*=1000;

                            break;

                      }

                case 'p':

                case 'P':

                      {

                            iDumpPause= atoi(&argv[z][2]);

                            iDumpPause*=1000;

                            break;           

                      }

                case 's':

                case 'S':

                      {

                            iScanRate = atoi(&argv[z][2]);

                            iScanRate*=1000;             

                            break;

                      }

                case 'd':

                case 'D':

                      { // Dump file directory path

                            strcpy(szDumpLocation,&argv[z][2]);

                            j = strlen(szDumpLocation);

     

                            if (szDumpLocation[j-1]!='\\')

                            {

                                  szDumpLocation[j]='\\';

                                  szDumpLocation[j+1]=NULL;

                            }

                            break;

                      }

                case 'e':

                case 'E':

                      { // applicaiton path to exec if hang happens

                            strcpy(szAppname,&argv[z][2]);

                            dwExecOnHang = 1;

                            break;

                      }

                }

          }

     

     

          printf("Dumps will be saved in %s\r\n",szDumpLocation);

          puts("scanning for hung windows\n");

     

          hEventLog = OpenEventLog(NULL, "HungWindowDump");

     

          i=0;

          scan='*';

          while(1)

          {

                if(i>20)

                {

                      if ('*'==scan)

                      {

                      scan='.';

                }

                      else

                      {

                      scan='*';

                }

                      printf("\r");

                i=0;

                }

                i++;

                putchar(scan);

                if(!FindHungWindows())

                {

                      return 0;

                }

                if (iMaxDump == iDumpsTaken)

                {

                      printf("\r\n%d Dumps taken, exiting\r\n",iDumpsTaken);

                      return 0;

                }

                Sleep(iScanRate);

          }

     

          free(szDumpLocation);

          return 0;

    }

     

    int FindHungWindows(void)

    {

    DWORD dwResult = 0;

    DWORD ProcessId = 0;

    DWORD tid = 0;

    DWORD dwEventInfoSize = 0;

     

    // Handles

    HWND hwnd = 0;

    HANDLE hDumpFile = 0;

    HANDLE hProcess = 0;

    HRESULT hdDump = 0;

     

    SYSTEMTIME SystemTime;

    MINIDUMP_TYPE dumptype = (MINIDUMP_TYPE) (MiniDumpWithFullMemory | MiniDumpWithHandleData | MiniDumpWithUnloadedModules | MiniDumpWithProcessThreadData);

     

    // These buffers are presistant.

     

    // security stuff to report the SID of the dumper to the event log.

    PTOKEN_USER pInstTokenUser;

    HANDLE ProcessToken;

    TOKEN_INFORMATION_CLASS TokenInformationClass = TokenUser;

    DWORD ReturnLength =0;

     

    // This allows us to get the first window in the chain of top windows.

    hwnd = GetTopWindow(NULL);

    if(!hwnd)

    {

          printf("Could not GetTopWindow\r\n");

          return 0;

    }

     

    // We will iterate through all windows until we get to the end of the list.

    while(hwnd)

    {

          // Get the process ID for the current window   

          tid = GetWindowThreadProcessId(hwnd, &ProcessId);

     

          // Sent a message to this window with our timeout. 

          // If it times out we consider the window hung

          if (!SendMessageTimeout(hwnd, WM_NULL, 0, 0, SMTO_BLOCK, iHangTime, &dwResult))

          {

                // SentMessageTimeout can fail for other reasons, 

                // if it's not a timeout we exit try again later

                if(ERROR_TIMEOUT != GetLastError())

                {

                      printf("SendMessageTimeout has failed with error %d\r\n",GetLastError());

                      return 1;

                }

                      // Iint our static buffers points.

                      // On our first trip through if we have not

                      // malloced memory for our buffers do so now.

                      if(!szModName)

                      {

                            szModName = (char *)malloc(MAXMODFILENAME);

                            {

                                  if(!szModName)

                                  {

                                  printf("Failed to alloc buffer for szModName %d",GetLastError());

                                  return 0;

                                  }

                            }

                      }

                      if(!szDumpFileName)// first time through malloc a buffer.

                      {

                            szDumpFileName = (char *)malloc(MAXDUMPFINALTARGET);

                            {

                                  if(!szDumpFileName)

                                  {

                                        printf("Failed to alloc buffer for dumpfilename %d",GetLastError());

                                        return 0;

                                  }

                            }

                      }

                      if(!szDumpFinalTarget)// first time through malloc a buffer.

                      {

                            szDumpFinalTarget= (char *)malloc(MAXDUMPFINALTARGET);

                            {

                                  if(!szDumpFinalTarget)

                                  {

                                  printf("Failed to alloc buffer for dumpfiledirectory %d",GetLastError());

                                  return 0;

                                  }

                            }

                      }

                      if(!szEventInfo)

                      {

                            szEventInfo= (char *)malloc(MAXEVENTINFO);

                            {

                                  if(!szEventInfo)

                                  {

                                  printf("Failed to alloc buffer for szEventInfo %d",GetLastError());

                                  return 0;

                                  }

                            }

                      }

                      // End of initial buffer allocations.

     

                GetLocalTime (&SystemTime);

               

                // Using the process id we open the process for various tasks.

                hProcess = OpenProcess(PROCESS_ALL_ACCESS,NULL,ProcessId);

                if(!hProcess )

                {

                      printf("Open process of hung window failed with error %d\r\n",GetLastError());

                      return 1;

                }

                // What is the name of the executable?

                GetModuleBaseName( hProcess, NULL, szModName,MAXMODFILENAME);

     

                printf("\r\n\r\nHung Window found dumping process (%d) %s\n",ProcessId,szModName);

     

                // Here we build the dump file name time, date, pid and binary name

                sprintf(szDumpFileName,"HWNDDump_Day%d_%d_%d_Time%d_%d_%d_Pid%d_%s.dmp",SystemTime.wMonth,SystemTime.wDay,SystemTime.wYear,SystemTime.wHour,SystemTime.wMinute,SystemTime.wSecond,ProcessId,szModName);

                strcpy(szDumpFinalTarget,szDumpLocation);

                strcat(szDumpFinalTarget,szDumpFileName);

     

                // We have to create the file and then pass it's handle to the dump api

                hDumpFile = CreateFile(szDumpFinalTarget,FILE_ALL_ACCESS,0,NULL,CREATE_ALWAYS,FILE_ATTRIBUTE_NORMAL,NULL);

                if(!hDumpFile)

                {

                      printf("CreateFile failed to open dump file at location %s, with error %d\r\n",szDumpLocation,GetLastError());

                      return 0;

                }

     

                printf("Dumping unresponsive process\r\n%s",szDumpFinalTarget);

               

                // This dump api will halt the target process while it writes it's

                // image to disk in the form a dump file.

                // this can be opened later by windbg or cdb for debugging.

                if(!MiniDumpWriteDump(hProcess,ProcessId,hDumpFile,dumptype ,NULL,NULL,NULL))

                {

                      // We do this on failure

                      hdDump = HRESULT_FROM_WIN32(GetLastError());

                      printf("MiniDumpWriteDump failed with a hresult of %d last error %d\r\n",hdDump,GetLastError());

                      CloseHandle (hDumpFile);

                      return 0;

                }

                else

                {

                      // If we are here the dump worked.  Now we need to notify the machine admin by putting a event in

                      // the application event log so someone knows a dump was taken and where it is stored.

                      sprintf(szEventInfo,"An application hang was caught by findhungwind.exe, the process was dumped to %s",szDumpFinalTarget);

     

                      // We need to get the process token so we can get the user sit so ReportEvent will have the

                      // User name / account in the event log.

                      if (OpenProcessToken(hProcess,      TOKEN_QUERY,&ProcessToken ) )

                      {

                            // Make the firt call to findout how big the sid needs to be.    

                            GetTokenInformation(ProcessToken,TokenInformationClass, NULL,NULL,&ReturnLength);

                            pInstTokenUser = (PTOKEN_USER) malloc(ReturnLength);

                            if(!pInstTokenUser)

                            {

                                  printf("Failed to malloc buffer for InstTokenUser exiting error %d\r\n",GetLastError());

                                  return 0;

                            }

                            if(!GetTokenInformation(ProcessToken,TokenInformationClass, (VOID *)pInstTokenUser,ReturnLength,&ReturnLength))

                            {

                                  printf("GetTokenInformation failed with error %d\r\n",GetLastError());

                                  return 0;

                            }

                      }

                      // write the application event log message. 

                      // This will show up as source DumpHungWindow

                      dwEventInfoSize=(DWORD)strlen(szEventInfo);

         

                      ReportEvent(hEventLog,EVENTLOG_WARNING_TYPE,1,1,pInstTokenUser->User.Sid,NULL,dwEventInfoSize,NULL,szEventInfo);

     

                      // Free to token buffer, we don't want to leak anything.

                      free(pInstTokenUser);

                     

                      // In additon to leaking a handle if you don't close the handle

                      // you may not get the dump to flush to the hard drive.

                      CloseHandle (hDumpFile);

                      printf("\r\nDump complete");

                     

                      // This allows you to execute something if you get a hang like crash.exe

                      if (dwExecOnHang)

                      {

                            system(szAppname);

                      }

                     

                      //  The Sleep is here so in the event you want to wait N seconds

                      //  before collecting another dump

                      //  you can pause.  This is helpful if you want to see if any

                      //  forward progress is happening over time

                     

                      Sleep(iDumpPause);

                }

                // Once we are at our threadshold for max dumps

                // we exit so we do not fill up the hard drive.

                iDumpsTaken++;

                if (iMaxDump == iDumpsTaken)

                {

                      return 0;

                }

            }

            // This is where we traverse to the next window.

                hwnd = GetNextWindow(hwnd, GW_HWNDNEXT);

          }

          return 1;

    }

     

  • Ntdebugging Blog

    Scoping and Troubleshooting Hangs of Various Causes

    • 4 Comments

    Hi again!  Today I want to bring to your attention an upcoming series of posts on troubleshooting hangs and this post as a primer for understanding hangs and how we scope these scenarios.

     

    Scoping is a practice we use in troubleshooting that helps us to quickly narrow down the domain or scope of a problem from the entire operating system or enterprise to a specific computer and component.  This allows the elimination of millions of other possible problems or interactions.

     

    Hangs are a common and can be a sometimes lengthy support request because of the mere nature of the problem, and just describing it can be difficult.  “Okay, what do you mean, it’s hung?”  By nature I mean that some internal architecture knowledge is necessary to discover what component of the application or OS is not doing what is should and thus leaving us with either an unresponsive user interface or service or both.  So how do we isolate what is going on here?

     

    We will cover the main buckets or symptoms and I will list these in increasing depth or dependency into the OS below, in other words, moving from the Application Layer down into the OS.  But let’s scope first…the most important step!

     

    Scoping the Hang

    We can determine which bucket or symptom we are running into by testing increasing layers of the operating system (OSI stack). Meaning, what layer of the system is working and which ones are not.  The heart of this is to determine “What IS working properly and what IS NOT?”

    The following table outlines the layers and tools we usually use to determine their responsiveness.

     

    Functional Layer to Test

    Tools To Test

    Basic hardware + Network driver + Bottom of the network stack

    Does Ping work? Num Lock light on keyboard?

    SMB over Tcp/ip + Kernel as Server Service runs in the system process)

    Does Net view work?

    Rpc over Tcp/ip

    RPC? (Event Viewer, Remote Management, Event Log, or rpcping.exe)

    Rpc+Application

    COM/DCOM? (Dcomtest.exe)

     

    For example, if a machine is reported “hung” and we can ping it, and net view does not work (when it normally would) we should conclude that the server side of that request failure in most likely in the Server Service or one of its sub components.   This being the case it would not make sense to troubleshoot why myapplication.exe is hung on the same server if lower level things like the server service itself do not work which may be a direct dependency!

     

    Tip:  This is a scoping method we use in isolating all problems.  Look at the interaction of applications, services, the OS, drivers, etc. in light of their dependencies.  “Okay, A is failing not because of A but because B failed, because C failed, and aha here is root cause in D’s failure”.  Testing dependencies can yield considerable time savings vs. debugging “through” the application.  Another example, if an RPC dependent application stops working, testing RPC by using another RPC app might be the first thing to do vs. debugging the first app which could be very time consuming and require specific knowledge about that app.

    Here are some common scoping questions to help think about the context of the issue which could also isolate the problem quickly.

     

    • What is the smallest action we must take to recover from the hang?
    • How often does it occur?
    • Does it occur on a cycle? If so, what cycle?
    • What time did the last occurrence take place?
    • Was it under load at that time? 
    • What was that load?
    • When does it occur, at a particular time of day?
    • How long does the hang last?
    • Can we make it hang?
    • What else happened just before the hang?
    • What changed?
    • When did it start?
    • Relevant or timely errors in the Application or System logs?
    • Is the observation from the console or a remote (RDP or ICA) session?
    • Does the machine still hang if we disconnect from network?
    • Does Task Manager show a particular process taking up CPU?
    • Does the hang occur in Safe Mode or Safe Mode with Networking?

    Answering these simple questions may have obvious yet extremely helpful results.

     

    For example, if the machine is reported as hung and the observation was just made through a Remote Desktop (RDP) session, is it responsive at the console?  Let’s say it is responsive at the console, we must then conclude that only the Terminal Server Service layer or one of its unique dependencies (lower in the stack) is the problem vs. the entire server. Jump to Terminal Services specific troubleshooting, etc.

     

    Common Hang Buckets or Symptoms

    Using the above scoping usually leads to these main classes of hangs which we will cover in future posts:

     

    1.)    Specific Application Menu/Button/Function Hang

    The application looks “OK” in that it will repaint if we drag another window over it; however, if we click on a menu item or send a key stroke whatever functionality associated with it does not…function.

    2.)    Application Window Hang  “I’m not dead yet…just Not Responding

    The application stops responding entirely at the UI layer, meaning it no longer refreshes and dragging another window over the top does not repaint thus displays artifacts of other windows result.

    3.)    The Start Menu, Desktop, or the “Shell” is hung

    So here we know that the Microsoft process responsible for these windows, explorer.exe, is hung.

    4.)    All Windows are Hung…but Task Manager comes up eventually

    Here the mouse still moves, and if we hit Ctrl+Shift+Esc we can invoke task manager, or via Ctrl+Alt+Del.  This may not be a true hang, but slow or unresponsive enough to qualify or be reported as a hang!

    5.)    There are No Windows!

    I can move the Mouse and Keyboard but they don’t “do” anything and there’s just a blank desktop, no windows, it’s hung.

    In this case it may be that the server appears hung interactively while specialized services like file sharing, mail server, etc. still actually function…impending doom?

    6.)    No Windows + No Mouse/Keyboard + but the machine is still running, well, sort of…

    Obviously the most drastic of the symptoms leaving little recourse but a debug of the machine…which might be easier than it sounds!

    The server may or may not be responsive remotely via services, etc.

     

    In each of the upcoming posts expect to see for each symptom:

    Common Example(s)

    Scoping Steps (what works vs. what doesn’t in each scenario)

    Troubleshooting Steps

    Specific Debug Steps

     

    Please look forward to these installments in the New Year!


    -Tate

  • Ntdebugging Blog

    Desktop Heap Overview

    • 101 Comments

     

    Desktop heap is probably not something that you spend a lot of time thinking about, which is a good thing.  However, from time to time you may run into an issue that is caused by desktop heap exhaustion, and then it helps to know about this resource.  Let me state up front that things have changed significantly in Vista around kernel address space, and much of what I’m talking about today does not apply to Vista.

     

    Laying the groundwork: Session Space

    To understand desktop heap, you first need to understand session space.  Windows 2000, Windows XP, and Windows Server 2003 have a limited, but configurable, area of memory in kernel mode known as session space.  A session represents a single user’s logon environment.  Every process belongs to a session.  On a Windows 2000 machine without Terminal Services installed, there is only a single session, and session space does not exist.  On Windows XP and Windows Server 2003, session space always exists.  The range of addresses known as session space is a virtual address range.  This address range is mapped to the pages assigned to the current session.  In this manner, all processes within a given session map session space to the same pages, but processes in another session map session space to a different set of pages. 

    Session space is divided into four areas: session image space, session structure, session view space, and session paged pool.  Session image space loads a session-private copy of Win32k.sys modified data, a single global copy of win32k.sys code and unmodified data, and maps various other session drivers like video drivers, TS remote protocol driver, etc.  The session structure holds various memory management (MM) control structures including the session working set list (WSL) information for the session.  Session paged pool allows session specific paged pool allocations.  Windows XP uses regular paged pool, since the number of remote desktop connections is limited.  On the other hand, Windows Server 2003 makes allocations from session paged pool instead of regular paged pool if Terminal Services (application server mode) is installed.  Session view space contains mapped views for the session, including desktop heap. 

    Session Space layout:

    Session Image Space: win32k.sys, session drivers

    Session Structure: MM structures and session WSL

    Session View Space: session mapped views, including desktop heap

    Session Paged Pool

     

    Sessions, Window Stations, and Desktops

    You’ve probably already guessed that desktop heap has something to do with desktops.  Let’s take a minute to discuss desktops and how they relate to sessions and window stations.  All Win32 processes require a desktop object under which to run.  A desktop has a logical display surface and contains windows, menus, and hooks.  Every desktop belongs to a window station.  A window station is an object that contains a clipboard, a set of global atoms and a group of desktop objects.  Only one window station per session is permitted to interact with the user. This window station is named "Winsta0."  Every window station belongs to a session.  Session 0 is the session where services run and typically represents the console (pre-Vista).  Any other sessions (Session 1, Session 2, etc) are typically remote desktops / terminal server sessions, or sessions attached to the console via Fast User Switching.  So to summarize, sessions contain one or more window stations, and window stations contain one or more desktops.

    You can picture the relationship described above as a tree.  Below is an example of this desktop tree on a typical system:

    - Session 0

    |   |

    |   ---- WinSta0           (interactive window station)

    |   |      |

    |   |      ---- Default    (desktop)

    |   |      |

    |   |      ---- Disconnect (desktop)

    |   |      |

    |   |      ---- Winlogon   (desktop)

    |   |

    |   ---- Service-0x0-3e7$  (non-interactive window station)

    |   |      |

    |   |      ---- Default    (desktop)

    |   |

    |   ---- Service-0x0-3e4$  (non-interactive window station)

    |   |      |

    |   |      ---- Default    (desktop)

    |   |

    |   ---- SAWinSta          (non-interactive window station)

    |   |      |

    |   |      ---- SADesktop  (desktop)

    |   |

    - Session 1

    |   |

    |   ---- WinSta0           (interactive window station)

    |   |      |

    |   |      ---- Default    (desktop)

    |   |      |

    |   |      ---- Disconnect (desktop)

    |   |      |

    |   |      ---- Winlogon   (desktop)

    |   |

    - Session 2

        |

        ---- WinSta0           (interactive window station)

               |

               ---- Default    (desktop)

               |

               ---- Disconnect (desktop)

               |

               ---- Winlogon   (desktop)

     

    In the above tree, the full path to the SADesktop (as an example) can be represented as “Session 0\SAWinSta\SADesktop”.

     

    Desktop Heap – what is it, what is it used for?

    Every desktop object has a single desktop heap associated with it.  The desktop heap stores certain user interface objects, such as windows, menus, and hooks.  When an application requires a user interface object, functions within user32.dll are called to allocate those objects.  If an application does not depend on user32.dll, it does not consume desktop heap.  Let’s walk through a simple example of how an application can use desktop heap. 

    1.     An application needs to create a window, so it calls CreateWindowEx in user32.dll.

    2.     User32.dll makes a system call into kernel mode and ends up in win32k.sys.

    3.     Win32k.sys allocates the window object from desktop heap

    4.     A handle to the window (an HWND) is returned to caller

    5.     The application and other processes in the same session can refer to the window object by its HWND value

     

    Where things go wrong

    Normally this “just works”, and neither the user nor the application developer need to worry about desktop heap usage.  However, there are two primary scenarios in which failures related to desktop heap can occur:

    1. Session view space for a given session can become fully utilized, so it is impossible for a new desktop heap to be created.
    2. An existing desktop heap allocation can become fully utilized, so it is impossible for threads that use that desktop to use more desktop heap.

     

    So how do you know if you are running into these problems?  Processes failing to start with a STATUS_DLL_INIT_FAILED (0xC0000142) error in user32.dll is a common symptom.  Since user32.dll needs desktop heap to function, failure to initialize user32.dll upon process startup can be an indication of desktop heap exhaustion.  Another symptom you may observe is a failure to create new windows.  Depending on the application, any such failure may be handled in different ways.  Note that if you are experiencing problem number one above, the symptoms would usually only exist in one session.  If you are seeing problem two, then the symptoms would be limited to processes that use the particular desktop heap that is exhausted.

     

    Diagnosing the problem

    So how can you know for sure that desktop heap exhaustion is your problem?  This can be approached in a variety of ways, but I’m going to discuss the simplest method for now.  Dheapmon is a command line tool that will dump out the desktop heap usage for all the desktops in a given session.  See our first blog post for a list of tool download locations.  Once you have dheapmon installed, be sure to run it from the session where you think you are running out of desktop heap.  For instance, if you have problems with services failing to start, then you’ll need to run dheapmon from session 0, not a terminal server session.

    Dheapmon output looks something like this:

    Desktop Heap Information Monitor Tool (Version 7.0.2727.0)

    Copyright (c) 2003-2004 Microsoft Corp.

    -------------------------------------------------------------

      Session ID:    0 Total Desktop: (  5824 KB -    8 desktops)

     

      WinStation\Desktop            Heap Size(KB)    Used Rate(%)

    -------------------------------------------------------------

      WinSta0\Default                    3072              5.7

      WinSta0\Disconnect                   64              4.0

      WinSta0\Winlogon                    128              8.7

      Service-0x0-3e7$\Default            512             15.1

      Service-0x0-3e4$\Default            512              5.1

      Service-0x0-3e5$\Default            512              1.1

      SAWinSta\SADesktop                  512              0.4

      __X78B95_89_IW\__A8D9S1_42_ID       512              0.4

    -------------------------------------------------------------

     

    As you can see in the example above, each desktop heap size is specified, as is the percentage of usage.  If any one of the desktop heaps becomes too full, allocations within that desktop will fail.  If the cumulative heap size of all the desktops approaches the total size of session view space, then new desktops cannot be created within that session.  Both of the failure scenarios described above depend on two factors: the total size of session view space, and the size of each desktop heap allocation.  Both of these sizes are configurable. 

     

    Configuring the size of Session View Space

    Session view space size is configurable using the SessionViewSize registry value.  This is a REG_DWORD and the size is specified in megabytes.  Note that the values listed below are specific to 32-bit x86 systems not booted with /3GB.  A reboot is required for this change to take effect.  The value should be specified under:

    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management

    OS

    Size if no registry value configured

    Default registry value

    Windows 2000 *

    20 MB

    none

    Windows XP

    20 MB

    48 MB

    Windows Server 2003

    20 MB

    48 MB

    * Settings for Windows 2000 are with Terminal Services enabled and hotfix 318942 installed.  Without the Terminal Services installed, session space does not exist, and desktop heap allocations are made from a fixed 48 MB region for system mapped views.  Without hotfix 318942 installed, the size of session view space is fixed at 20 MB.

    The sum of the sizes of session view space and session paged pool has a theoretical maximum of slightly under 500 MB for 32-bit operating systems.  The maximum varies based on RAM and various other registry values.  In practice the maximum value is around 450 MB for most configurations.  When the above values are increased, it will result in the virtual address space reduction of any combination of nonpaged pool, system PTEs, system cache, or paged pool.

     

    Configuring the size of individual desktop heaps

    Configuring the size of the individual desktop heaps is bit more complex.  Speaking in terms of desktop heap size, there are three possibilities:

    ·         The desktop belongs to an interactive window station and is a “Disconnect” or “Winlogon” desktop, so its heap size is fixed at 64KB or 128 KB, respectively (for 32-bit x86)

    ·         The desktop heap belongs to an interactive window station, and is not one of the above desktops.  This desktop’s heap size is configurable.

    ·         The desktop heap belongs to a non-interactive window station.  This desktop’s heap size is also configurable.

     

    The size of each desktop heap allocation is controlled by the following registry value:

                HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\Session Manager\SubSystems\Windows

     

     The default data for this registry value will look something like the following (all on one line):

                   %SystemRoot%\system32\csrss.exe ObjectDirectory=\Windows

                   SharedSection=1024,3072,512 Windows=On SubSystemType=Windows

                   ServerDll=basesrv,1 ServerDll=winsrv:UserServerDllInitialization,3

                   ServerDll=winsrv:ConServerDllInitialization,2 ProfileControl=Off

                   MaxRequestThreads=16

                                                               

     

    The numeric values following "SharedSection=" control how desktop heap is allocated. These SharedSection values are specified in kilobytes.

    The first SharedSection value (1024) is the shared heap size common to all desktops. This memory is not a desktop heap allocation, and the value should not be modified to address desktop heap problems.

    The second SharedSection value (3072) is the size of the desktop heap for each desktop that is associated with an interactive window station, with the exception of the “Disconnect” and “Winlogon” desktops.

    The third SharedSection value (512) is the size of the desktop heap for each desktop that is associated with a "non-interactive" window station. If this value is not present, the size of the desktop heap for non-interactive window stations will be same as the size specified for interactive window stations (the second SharedSection value). 

    Consider the two desktop heap exhaustion scenarios described above.  If the first scenario is encountered (session view space is exhausted), and most of the desktop heaps are non-interactive, then the third SharedSection can be decreased in an effort to allow more (smaller) non-interactive desktop heaps to be created.  Of course, this may not be an option if the processes using the non-interactive heaps require a full 512 KB.  If the second scenario is encountered (a single desktop heap allocation is full), then the second or third SharedSection value can be increased to allow each desktop heap to be larger than 3072 or 512 KB.  A potential problem with this is that fewer total desktop heaps can be created.

     

    What are all these window stations and desktops in Session 0 anyway?

    Now that we know how to tweak the sizes of session view space and the various desktops, it is worth talking about why you have so many window stations and desktops, particularly in session 0.  First off, you’ll find that every WinSta0 (interactive window station) has at least 3 desktops, and each of these desktops uses various amounts of desktop heap.  I’ve alluded to this previously, but to recap, the three desktops for each interactive window stations are:

    ·         Default desktop - desktop heap size is configurable as described below

    ·         Disconnect desktop - desktop heap size is 64k on 32-bit systems

    ·         Winlogon desktop - desktop heap size is 128k on 32-bit systems

     

    Note that there can potentially be more desktops in WinSta0 as well, since any process can call CreateDesktop and create new desktops.

    Let’s move on to the desktops associated with non-interactive window stations: these are usually related to a service.  The system creates a window station in which service processes that run under the LocalSystem account are started. This window station is named service-0x0-3e7$. It is named for the LUID for the LocalSystem account, and contains a single desktop that is named Default. However, service processes that run as LocalSystem interactive start in Winsta0 so that they can interact with the user in Session 0 (but still run in the LocalSystem context).

    Any service process that starts under an explicit user or service account has a window station and desktop created for it by service control manager, unless a window station for its LUID already exists. These window stations are non-interactive window stations.  The window station name is based on the LUID, which is unique for every logon.  If an entity (other than System) logs on multiple times, a new window station is created for each logon.  An example window station name is “service-0x0-22e1$”.

    A common desktop heap issue occurs on systems that have a very large number of services.  This can be a large number of unique services, or one (poorly designed, IMHO) service that installs itself multiple times.  If the services all run under the LocalSystem account, then the desktop heap for Session 0\Service-0x0-3e7$\Default may become exhausted.  If the services all run under another user account which logs on multiples times, each time acquiring a new LUID, there will be a new desktop heap created for every instance of the service, and session view space will eventually become exhausted.

    Given what you now know about how service processes use window stations and desktops, you can use this knowledge to avoid desktop heap issues.  For instance, if you are running out of desktop heap for the Session 0\Service-0x0-3e7$\Default desktop, you may be able to move some of the services to a new window station and desktop by changing the user account that the service runs under.

     

    Wrapping up

    I hope you found this post interesting and useful for solving those desktop heap issues!  If you have questions are comments, please let us know.

     

    - Matthew Justice

     

    [Update: 7/5/2007 - Desktop Heap, part 2 has been posted]

    [Update: 9/13/2007 - Talkback video: Desktop Heap has been posted]

    [Update: 3/20/2008 - The default interactive desktop heap size has been increased on 32-bit Vista SP1]

     

  • Ntdebugging Blog

    Understanding Pool Consumption and Event ID: 2020 or 2019

    • 41 Comments

     

    Hi!  My name is Tate.  I’m an Escalation Engineer on the Microsoft Critical Problem Resolution Platforms Team.  I wanted to share one of the most common errors we troubleshoot here on the CPR team, its root cause being pool consumption, and the methods by which we can remedy it quickly!

     

    This issue is commonly misdiagnosed, however, 90% of the time it is actually quite possible to determine the resolution quickly without any serious effort at all!

     

     

    First, what do these events really mean?

     

    Event ID 2020
    Event Type: Error
    Event Source: Srv
    Event Category: None
    Event ID: 2020
    Description:
    The server was unable to allocate from the system paged pool because the pool was empty.

     

    Event ID 2019
    Event Type: Error
    Event Source: Srv
    Event Category: None
    Event ID: 2019
    Description:
    The server was unable to allocate from the system NonPaged pool because the pool was empty.

     

     

    This is our friend the Server Service reporting that when it was trying to satisfy a request, it was not able to find enough free memory of the respective type of pool.  2020 indicates Paged Pool and 2019, NonPaged Pool.  This doesn’t mean that the Server Service (srv.sys) is broken or the root cause of the problem, more often rather it is the first component to see the resource problem and report it to the Event Log.  Thus, there could be (and usually are) a few more symptoms of pool exhaustion on the system such as hangs, or out of resource errors reported by drivers or applications, or all of the above!

     

     

    What is Pool?

     

    First, Pool is not the amount of RAM on the system, it is however a segment of the virtual memory or address space that Windows reserves on boot.  These pools are finite considering address space itself is finite.  So, because 32bit(x86) machines can address 2^32==4Gigs, Windows uses (by default) 2GB for applications and 2GB for kernel.  Of the 2GB for kernel there are other things we must fit in our 2GB such as Page Table Entries (PTEs) and as such the maximum amount of Paged Pool for 32bit(x86) of ~460MB puts this in perspective in terms of our realistic limits per processor architecture.  As this implies, 64bit(x64&ia64) machines have less of a problem here due to their larger address space but there are still limits and thus no free lunch.

     

    *For more about determining current pool limits see the common question post “Why am I out of Paged Pool at ~200MB…” at the end of this post.

     

    *For more info about pools:  About Memory Management > Memory Pools

    *This has changed a bit for Vista, see Dynamic Kernel Address space

     

     

    What are these pools used for?

     

    These pools are used by either the kernel directly, indirectly by its support of various structures due to application requests on the system (CreateFile for example), or drivers installed on the system for their memory allocations made via the kernel pool allocation functions.

     

    Literally, NonPaged means that this memory when allocated will not be paged to disk and thus resident at all times, which is an important feature for drivers.  Paged conversely, can be, well… paged out to disk.  In the end though, all this memory is allocated through a common set of functions, most common is ExAllocatePoolWithTag.

     

     

    Ok, so what is using it/abusing it? (our goal right!?)

     

    Now that we know that the culprit is Windows or a component shipping with Windows, a driver, or an application requesting lots of things that the kernel has to create on its behalf, how can we find out which?

     

    There are really four basic methods that are typically used (listing in order of increasing difficulty)

     

    1.)    Find By Handle Count

     

    Handle Count?  Yes, considering that we know that an application can request something of the OS that it must then in turn create and provide a reference to…this is typically represented by a handle, and thus charged to the process’ total handle count!

     

    The quickest way by far if the machine is not completely hung is to check this via Task Manager.  Ctrl+Shift+Esc…Processes Tab…View…Select Columns…Handle Count.  Sort on Handles column now and check to see if there is a significantly large one there (this information is also obtainable via Perfmon.exe, Process Explorer, Handle.exe, etc.).

     

    What’s large?  Well, typically we should raise an eyebrow at anything over 5,000 or so.  Now that’s not to say that over this amount is inherently bad, just know that there is no free lunch and that a handle to something usually means that on the other end there is a corresponding object stored in NonPaged or Paged Pool which takes up memory.

     

    So for example let’s say we have a process that has 100,000 handles, mybadapp.exe.  What do we do next?

     

    Well, if it’s a service we could stop it (which releases the handles) or if an application running interactively, try to shut it down and look to see how much total Kernel Memory (Paged or NonPaged depending on which one we are short of) we get back.  If we were at 400MB of Paged Pool (Look at Performance Tab…Kernel Memory…Paged) and after stopping mybadapp.exe with its 100,000 handles are now at a reasonable 100MB, well there’s our bad guy and following up with the owner or further investigating (Process Explorer from sysinternals or the Windows debugger for example) what type of handles are being consumed would be the next step.

     

    Tip: 

    For essential yet legacy applications, which there is no hope of replacing or obtaining support, we may consider setting up a performance monitor alert on the handle count when it hits a couple thousand or so (Performance Object: Process, Counter: Handle Count) and taking action to restart the bad service.  This is a less than elegant solution for sure but it could keep the one rotten apple from spoiling the bunch by hanging/crashing the machine!

     

    2.)    By Pooltag (as read by poolmon.exe)

     

    Okay, so no handle count gone wild? No problem.

     

    For Windows 2003 and later machines, a feature is enabled by default that allows tracking of the pool consumer via something called a pooltag.  For previous OS’s we will need to use a utility such as gflags.exe to Enable Pool Tagging (which requires a reboot unfortunately).  This is usually just a 3-4 character string or more technically “a character literal of up to four characters delimited by single quotation marks” that the caller of the kernel api to allocate the pool will provide as its 3rd parameter.  (see ExAllocatePoolWithTag)

     

    The tool that we use to get the information about what pooltag is using the most is poolmon.exe.  Launch this from a cmd prompt, hit B to sort by bytes descending and P to sort the list by the type (Paged, NonPaged, or Both) and we have a live view into what’s going on in the system.  Look specifically at the Tag Name and its respective Byte Total column for the guilty party!  Get Poolmon.exe Here  or More info about poolmon.exe usage. 

     

    The cool thing is that we have most of the OS utilized pooltags already documented so we have an idea if there is a match for one of the Windows components in pooltag.txt.  So if we see MmSt as the top tag for instance consuming far and away the largest amount, we can look at pooltag.txt and know that it’s the memory manager and also using that tag in a search engine query we might get the more popular KB304101 which may resolve the issue!

     

    We will find pooltag.txt in the ...\Debugging Tools for Windows\triage folder when the debugging tools are installed.

     

    Oh no, what if it’s not in the list? No problem…

     

    We might be able to find its owner by using one of the following techniques:

     

    • For 32-bit versions of Windows, use poolmon /c to create a local tag file that lists each tag value assigned by drivers on the local machine (%SystemRoot%\System32\Drivers\*.sys). The default name of this file is Localtag.txt.

     

    Really all versions---->• For Windows 2000 and Windows NT 4.0, use Search to find files that contain a specific pool tag, as described in KB298102, How to Find Pool Tags That Are Used By Third-Party Drivers.

    From:  http://www.microsoft.com/whdc/driver/tips/PoolMem.mspx

     

     

     

    3.)    Using Driver Verifier

     

    Using driver verifier is a more advanced approach to this problem.  Driver Verifier provides a whole suite of options targeted mainly at the driver developer to run what amounts to quality control checks before shipping their driver.

     

    However, should pooltag identification be a problem, there is a facility here in Pool Tracking that does the heavy lifting in that it will do the matching of Pool consumer directly to driver!

     

    Be careful however, the only option we will likely want to check is Pool Tracking as the other settings are potentially costly enough that if our installed driver set is not perfect on the machine we could get into an un-bootable situation with constant bluescreens notifying that xyz driver is doing abc bad thing and some follow up suggestions.

     

    In summary, Driver Verifier is a powerful tool at our disposal but use with care only after the easier methods do not resolve our pool problems.

     

    4.)    Via Debug (live and postmortem)

     

    As mentioned earlier the api being used here to allocate this pool memory is usually ExAllocatePoolWithTag.  If we have a kernel debugger setup we can set a break point here to brute force debug who our caller is….but that’s not usually how we do it, can you say, “extended downtime?”  There are other creative live debug methods with are a bit more advanced that we may post later…

     

    Usually, debugging this problem involves a post mortem memory.dmp taken from a hung server or a machine that has experienced Event ID:  2020 or Event ID 2019 or is no longer responsive to client requests, hung, or often both.  We can gather this dump via the Ctrl+Scroll Lock method see KB244139 , even while the machine is “hung” and seemingly unresponsive to the keyboard or Ctrl+Alt+Del !

     

    When loading the memory.dmp via windbg.exe or kd.exe we can quickly get a feel for the state of the machine with the following commands.

     

    Debugger output Example 1.1  (the !vm command)

     

    2: kd> !vm 
    *** Virtual Memory Usage ***
      Physical Memory:   262012   ( 1048048 Kb)
      Page File: \??\C:\pagefile.sys
         Current:   1054720Kb Free Space:    706752Kb
         Minimum:   1054720Kb Maximum:      1054720Kb
      Page File: \??\E:\pagefile.sys
         Current:   2490368Kb Free Space:   2137172Kb
         Minimum:   2490368Kb Maximum:      2560000Kb
      Available Pages:    63440   (  253760 Kb)
      ResAvail Pages:    194301   (  777204 Kb)
      Modified Pages:       761   (    3044 Kb)
      NonPaged Pool Usage: 52461   (  209844 Kb)<<NOTE!  Value is near NonPaged Max
      NonPaged Pool Max:   54278   (  217112 Kb)
      ********** Excessive NonPaged Pool Usage *****

     

    Note how the NonPaged Pool Usage value is near the NonPaged Pool Max value.  This tells us that we are basically out of NonPaged Pool.

     

    Here we can use the !poolused command to give the same information that poolmon.exe would have but in the dump….

     

    Debugger output Example 1.2  (!poolused 2)

     

    Note the 2 value passed to !poolused orders pool consumers by NonPaged

     

    2: kd> !poolused 2
       Sorting by NonPaged Pool Consumed
      Pool Used:
                NonPaged            Paged
    Tag    Allocs     Used    Allocs     Used
    Thre   120145 76892800         0        0
    File   187113 29946176         0        0
    AfdE    89683 25828704         0        0
    TCPT    41888 18765824         0        0
    AfdC    90964 17465088         0        0 

     

    We now see the “Thre” tag at the top of the list, the largest consumer of NonPaged Pool, let’s go look it up in pooltag.txt….

     

    Thre - nt!ps        - Thread objects

     

    Note, the nt before the ! means that this is NT or the kernel’s tag for Thread objects.

    So from our earlier discussion if we have a bunch of thread objects, I probably have an application on the system with a ton of handles and or a ton of Threads so it should be easy to find!

     

    Via the debugger we can find this out easily via the !process 0 0 command which will show the TableSize (Handle Count) of over 90,000!

     

    Debugger output Example 1.3  (the !process command continued)

     

    Note the two zeros after !process separated by a space gives a list of all running processes on the system.

     

     

    PROCESS 884e6520  SessionId: 0  Cid: 01a0    Peb: 7ffdf000  ParentCid: 0124
    DirBase: 110f6000  ObjectTable: 88584448  TableSize: 90472
    Image: mybadapp.exe

     

    We can dig further here into looking at the threads…

     

    Debugger output Example 1.4  (the !process command continued)

     

    0: kd> !PROCESS 884e6520 4
    PROCESS 884e6520  SessionId: 0  Cid: 01a0    Peb: 7ffdf000  ParentCid: 0124
    DirBase: 110f6000  ObjectTable: 88584448  TableSize: 90472.
    Image: mybadapp.exe
            THREAD 884d8560  Cid 1a0.19c  Teb: 7ffde000  Win32Thread: a208f648 WAIT
            THREAD 88447560  Cid 1a0.1b0  Teb: 7ffdd000  Win32Thread: 00000000 WAIT
            THREAD 88396560  Cid 1a0.1b4  Teb: 7ffdc000  Win32Thread: 00000000 WAIT
            THREAD 88361560  Cid 1a0.1bc  Teb: 7ffda000  Win32Thread: 00000000 WAIT
            THREAD 88335560  Cid 1a0.1c0  Teb: 7ffd9000  Win32Thread: 00000000 WAIT
            THREAD 88340560  Cid 1a0.1c4  Teb: 7ffd8000  Win32Thread: 00000000 WAIT
     And the list goes on…

     

    We can examine the thread via !thread 88340560 from here and so on…

     

    So in this rudimentary example the offender is clear in mybadapp.exe in its abundance of threads and one could dig further to determine what type of thread or functions are being executed and follow up with the owner of this executable for more detail, or take a look at the code if the application is yours!

     

     

     

    Common Question:

     

    Why am I out of Paged Pool at ~200MB when we say that the limit is around 460MB?

     

    This is because the memory manager at boot decided that given the current amount of RAM on the system and other memory manager settings such as /3GB, etc. that our max is X amount vs. the maximum.  There are two ways to see the maximum’s on a system.

     

    1.)   Process Explorer using its Task Management.  View…System Information…Kernel Memory section.

     

    Note that we have to specify a valid path to dbghelp.dll and Symbols path via Options…Configure Symbols.

     

    For example:

     

          Dbghelp.dll path:

    c:\<path to debugging tools for windows>\dbghelp.dll

     

    Symbols path:

    SRV*C:\websymbols*http://msdl.microsoft.com/download/symbols

     

    2.)The debugger (live or via a memory.dmp by doing a !vm)

     

    *NonPaged pool size is not configurable other than the /3GB boot.ini switch which lowers NonPaged Pool’s maximum.

    128MB with the /3GB switch, 256MB without

     

    Conversely, Paged Pool size is often able to be raised to around its maximum manually via the PagedPoolSize registry setting which we can find for example in KB304101.

     

     

    So what is this Pool Paged Bytes counter I see in Perfmon for the Process Object?

     

    This is when the allocation is charged to a process via ExAllocatePoolWithQuotaTag.  Typically, we will see ExAlloatePoolWithTag used and thus this counter is less effective…but hey…don’t pass up free information in Perfmon so be on the lookout for this easy win.

     

     

    Additional Resources:

     

     “Who's Using the Pool?” from Driver Fundamentals > Tips: What Every Driver Writer Needs to Know

    http://www.microsoft.com/whdc/driver/tips/PoolMem.mspx

     

    Poolmon Remarks:  http://technet2.microsoft.com/WindowsServer/en/library/85b0ba3b-936e-49f0-b1f2-8c8cb4637b0f1033.mspx

     

     

     

     

     I hope you have enjoyed this post and hopefully it will get you going in the right direction next time you see one of these events or hit a pool consumption issue!

     

    -Tate

     

  • Ntdebugging Blog

    Getting Ready for Windows Debugging

    • 8 Comments

     

    Welcome to the Microsoft NTDebugging blog!  I’m Matthew Justice, an Escalation Engineer on Microsoft’s Platforms Critical Problem Resolution (CPR) team.  Our team will be blogging about troubleshooting Windows problems at a low level, often by using the Debugging Tools for Windows.  For more information about us and this blog, check out the about page.

     

    To get things started I want to provide you with a list of tools that we’ll be referencing in our upcoming blog posts, as well as links to some technical documents to help you get things configured.

     

    The big list of tools:

     

    The following tools are part of the “Debugging Tools for Windows” – you’ll definitely need these

    http://www.microsoft.com/whdc/devtools/debugging/

    ·         windbg

    ·         cdb

    ·         ntsd

    ·         tlist

    ·         gflags

    ·         adplus

    ·         UMDH

    ·         symcheck

     

    Sysinternals provides some great tools that we’ll be discussing

    http://www.sysinternals.com

    ·         Process Explorer

    ·         Process Monitor

    ·         Regmon

    ·         Filemon

    ·         DbgView

    ·         Handle.exe

    ·         Tcpview

    ·         LiveKD

    ·         AutoRuns

    ·         WinObj

     

    There are many tools contained in “MPS Reports” (MPSRPT_SETUPPerf.EXE), but I’m listing it here specifically for Checksym

    http://www.microsoft.com/downloads/details.aspx?FamilyID=CEBF3C7C-7CA5-408F-88B7-F9C79B7306C0&displaylang=en

    ·         Checksym

     

    “Windows Server 2003 Resource Kit Tools” is another great set of tools.  In particular Kernrate is a part of that package

    http://www.microsoft.com/downloads/details.aspx?displaylang=en&familyid=9D467A69-57FF-4AE7-96EE-B18C4790CFFD

    ·         Kernrate

     

    Windows XP SP2 Support Tools

    http://www.microsoft.com/downloads/details.aspx?FamilyID=49AE8576-9BB9-4126-9761-BA8011FABF38&displaylang=en

    ·         netcap

    ·         poolmon

    ·         memsnap

    ·         tracefmt  (64-bit versions available in the DDK)

    ·         tracelog

    ·         tracepdb

    ·         depends

    ·         pstat

     

    “Visual Studio “ – in addition to the compilers and IDE, the following tools come in handy:

    ·         SPY++

    ·         dumpbin

     

    Perfwiz (Performance Monitor Wizard)

    http://www.microsoft.com/downloads/details.aspx?FamilyID=31fccd98-c3a1-4644-9622-faa046d69214&DisplayLang=en

     

    DebugDiag

    http://www.iis.net/handlers/895/ItemPermaLink.ashx

     

    Userdump (User Mode Process Dumper)

    http://www.microsoft.com/downloads/details.aspx?FamilyID=E089CA41-6A87-40C8-BF69-28AC08570B7E&displaylang=en

     

    Dheapmon (Desktop Heap Monitor)

    http://www.microsoft.com/downloads/details.aspx?familyid=5CFC9B74-97AA-4510-B4B9-B2DC98C8ED8B&displaylang=en

     

    Netmon 3.0

    §  Go to http://connect.microsoft.com/

    §  Sign in with your passport account

    §  Choose "Available Connections" on the left

    §  Choose "Apply for Network Monitor 3.0” (once you've finished with the application, the selection appears in your "My Participation" page)

    §  Go to the Downloads page (On the left side), and select the appropriate build 32 or 64 bit build.

     

     

     

    Some articles you may find useful:

     

    Debugging Tools and Symbols: Getting Started

    http://www.microsoft.com/whdc/devtools/debugging/debugstart.mspx

     

    Boot Parameters to Enable Debugging

    http://msdn2.microsoft.com/en-us/library/ms791527.aspx

     

    How to Generate a Memory Dump File When a Server Stops Responding (Hangs)

    http://support.microsoft.com/kb/303021/

     

    After installing the “Debugging Tools for Windows”, you’ll find two documents at the root of the install folder that are helpful:

     

    ·         kernel_debugging_tutorial.doc - A guide to help you get started using the kernel debugger.

     

    ·         debugger.chm - The help file for the debuggers.  It details the commands you can use in the debugger.  Think of this as a reference manual, rather than a tutorial.





Page 24 of 24 (239 items) «2021222324