• Ntdebugging Blog

    Windows 7 XP Mode – Auto Publish ‘Not Available’

    • 3 Comments

    Hello All, Ron Riddle here again to share another mystery relating to the XP Mode Auto Publish feature for Windows 7 that had an unexpected root cause.  As you may know, the Auto Publish feature allows applications installed within the guest to be automagically available from the start menu of the Windows 7 host.  Of course, one requirement for this to work properly is that the feature must be enabled; however, I have encountered two such instances where the feature is listed as ‘Not Available’ within the Virtual PC Settings UI once the guest is started.

     

    image001

     

    Debugging the Issue

     

    The Virtual PC Integration Components Services Application (1-vmsrvc) service, which runs within a vmsrvc.exe instance on the guest, decides whether the Auto Publish feature should be enabled.  For non-Windows 7 guests, the first thing 1-vmsrvc does is issue a WMI query to determine whether KB961742 (for an XP guest) or KB961741 (for a Vista guest), which provide Remote Applications Integrated Locally (RAIL) support, have been applied.

     

    Root Cause Analysis

     

    Before we issue the WMI query, we must first activate the CLSID_WbemLevel1Login component.  Here’s an excerpt below:

     

    0:009> k

    ChildEBP RetAddr 

    00b1f730 74ef186e wbemprox!CDCOMTrans::DoActualCCI+0x3d

    00b1f774 74ef15db wbemprox!CDCOMTrans::DoCCI+0x12d

    00b1f830 74ef17e4 wbemprox!CDCOMTrans::DoActualConnection+0x25c

    00b1f85c 74ef1ee1 wbemprox!CDCOMTrans::DoConnection+0x25

    00b1f89c 01018283 wbemprox!CLocator::ConnectServer+0x7c

    00b1fae8 010182da vmsrvc!VPCRAILUpdates::Connect+0xa8

    00b1faf4 0101842b vmsrvc!VPCRAILUpdates::QueryInstalledFixes+0xb

    00b1fe34 0100b61f vmsrvc!VPCRAILUpdates::CheckIfUpdatesArePresent+0x91

    00b1ffb4 7c80b729 vmsrvc!Win32VPCAppPublisherService::AllowListNotificationThreadProc+0x90

    00b1ffec 00000000 kernel32!BaseThreadStart+0x37

     

    However, I noticed that the activation attempt failed with WBEM_E_CRITICAL_ERROR(0x8004100a).

     

    0:009> r eax

    eax=8004100a

     

    So, I proceeded to debug the activation attempt from within the Windows Management Instrumentation(winmgmt) service, since it provides the class factory for this component.  Notice the call to LoadLibraryExW passing a relative path for the lpFileName parameter.  This means that a search strategy must be applied which leverages the PATH environment variable.

     

    0:002> k

    ChildEBP RetAddr             

    0086f438 594976e2 kernel32!LoadLibraryExW

    0086f494 7751d8a7 wmisvc!CForwardFactory::CreateInstance+0xf8

    0086f4b8 7751daac ole32!GetInstanceHelperMulti+0x20

    0086f578 77e799f4 ole32!CObjServer::CreateInstance+0x251

    0086f59c 77ef421a RPCRT4!Invoke+0x30

    0086f9a8 77ef4bf3 RPCRT4!NdrStubCall2+0x297

    0086fa00 77600c15 RPCRT4!CStdStubBuffer_Invoke+0xc6

    0086fa40 77600bbf ole32!SyncStubInvoke+0x33

    0086fa88 7752ad31 ole32!StubInvoke+0xa7

    0086fb60 7752ac56 ole32!CCtxComChnl::ContextInvoke+0xe3

    0086fb7c 776007f5 ole32!MTAInvoke+0x1a

    0086fbac 77602df3 ole32!AppInvoke+0x9c

    0086fc80 77600715 ole32!ComInvokeWithLockAndIPID+0x2c2

    0086fccc 77e794bd ole32!ThreadInvoke+0x1cd

    0086fd00 77e79422 RPCRT4!DispatchToStubInC+0x38

    0086fd54 77e7934e RPCRT4!RPC_INTERFACE::DispatchToStubWorker+0x113

    0086fd78 77e8a384 RPCRT4!RPC_INTERFACE::DispatchToStub+0x84

    0086fdb8 77e8a3c5 RPCRT4!RPC_INTERFACE::DispatchToStubWithObject+0xc0

    0086fdf8 77e7bcc1 RPCRT4!LRPC_SCALL::DealWithRequestMessage+0x2cd

    0086fe1c 77e7bc05 RPCRT4!LRPC_ADDRESS::DealWithLRPCRequest+0x16d

     

    0:002> du poi(esp+4)

    59491668  "wbemcore.dll"

     

    Surprisingly, I found that the requested library could not be found using the standard search strategy.

     

    0:002> !gle

    LastErrorValue: (Win32) 0x7e (126) - The specified module could not be found.

    LastStatusValue: (NTSTATUS) 0xc0000135 - {Unable To Locate Component}  This application has failed to start because %hs was not found. Re-installing the application may fix this problem.

     

    I then decided to enable boot logging within the Process Monitor tool to catch a glimpse into why the load for wbemcore.dll was failing.  Here’s an excerpt from the Process Monitor log:

     

    8:18:11.1652951 PM      svchost2.exe      1628  QueryOpen   C:\WINDOWS\system32\%SystemRoot%\system32\wbemcore.dll  PATH NOT FOUND    

    8:18:11.1653627 PM      svchost2.exe      1628  QueryOpen   C:\WINDOWS\system32\%SystemRoot%\wbemcore.dll        PATH NOT FOUND 

    8:18:11.1654161 PM      svchost2.exe      1628  QueryOpen   C:\WINDOWS\system32\%SystemRoot%\System32\Wbem\wbemcore.dll  PATH NOT FOUND     

     

    At this point, it became clear to me that something must be wrong with the PATH environment variable configuration because the SystemRoot environment variable was not being properly expanded.  Sure enough, the registry showed that the Path value was of type REG_SZ rather than REG_EXPAND_SZ.

     

    image003

     

    Once I saved off the data for the Path value and recreated it specifying the proper type(REG_EXPAND_SZ), the issue was resolved!

     

    Sidenote on the Debugging Strategy

     

    I chose to debug the services by attaching ntsd.exe and redirecting the session to the kernel debugger.  Alternatively, I could have chosen to use a remote debugging session since it’s much more natural than redirecting to the kernel debugger, in my opinion.  However, this issue was further complicated by the fact that introducing a debugger in the mix had potential to change the timing enough such that I struggled to reproduce the issue using a remote session.  The reason for this is that if we slow down the 1-vmsrvc execution enough, the winmgmt service itself will attempt to load wbemcore.dll, albeit through an activation request where an absolute path is specified, thereby avoiding application of a search strategy, which will succeed.  Thus, when 1-vmsrvc attempts to activate CLSID_WbemLevel1Login, it will now succeed because wbemcore.dll is already loaded and the Auto Publish feature will now be ‘Enabled’!

     

    Configuration

     

    I leveraged the Image File Execution Options key, creating a new entry for vmsrvc.exe and configuring the Debugger value with the following command line:

     

    [HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\vmsrvc.exe]

    "Debugger"="c:\\debuggers\\ntsd.exe -d -y srv*c:\\vmsrvc -c \"bu vmsrvc!Win32VPCAppPublisherService::AllowListNotificationThreadProc;bu wbemprox!CDCOMTrans::DoActualCCI;g\""

     

    I also isolated the winmgmt service into its own svchost.exe, copied %systemroot%\system32\svchost.exe to %systemroot%\system32\svchost2.exe, and then created a new key for svchost2.exe with the following command line:

     

    [HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\svchost2.exe]

    "Debugger"="c:\\debuggers\\ntsd.exe -d -y srv*c:\\winmgmt -c \"bu wmisvc!CForwardFactory::CreateInstance;g\""

     

    Although redirecting ntsd.exe to the kernel debugger was cumbersome from a usability perspective, I found the ability to debug both processes from a central facility very appealing and worthwhile.

     

    Conclusion

     

    For issues like these that ultimately resolve to a misconfiguration of the OS, I can’t help but think how unfortunate it was that I didn’t stumble onto root cause sooner via the routine task of launching some executable from a command shell.  Surely this would have been a red flag and could have saved me a lot of time debugging!

     

    image005

     

    While this was a rather extreme example of how a misconfiguration of the OS can affect other seemingly unrelated parts such as the Auto Publish feature of XP Mode, the take-away here is when you detect that the Auto Publish feature is ‘Not Available’, you should begin with standard WMI troubleshooting; and, as we’ve just seen here, a quick sanity check of the environment might not be a bad idea either! :)

     

    Until next time, happy debugging!

  • Ntdebugging Blog

    How to debug WOW64 applications?

    • 3 Comments

     

    Hello, my name is Venkatesh Ganga, and I’m an Escalation Engineer on the Microsoft Platform Global Escalation Services team.  Recently, I worked on an issue where registry reflection was not happening while installing 32bit Office under the system account. This required looking into both the 32bit Office code and the Wow64 code where the registry reflection is implemented.  When attaching to the Wow64 process using the 32bit debugger it’s like debugging a 32bit process on the 32bit machine; there are no 64bit binaries in the process. However, we needed to debug Wow64 to debug the registry reflection code. To do this we attached to the Wow64 process using the 64bit debugger which allows you to see the Wow64 binaries.

    Ø  lm

                Base TimeStamp                     Module

              400000 42435b2a Mar 24 18:28:26 2005 C:\Program Files (x86)\Internet Explorer\IEXPLORE.EXE

            77ec0000 45d6cc72 Feb 17 03:35:46 2007 C:\WINDOWS\system32\ntdll.dll

            6b000000 45d6943d Feb 16 23:35:57 2007 C:\WINDOWS\system32\wow64.dll

            6b280000 45d695f3 Feb 16 23:43:15 2007 C:\WINDOWS\system32\wow64win.dll

            78b80000 42438b7a Mar 24 21:54:34 2005 C:\WINDOWS\system32\wow64cpu.dll

     

    There are 2 options for debugging Wow64 applications.

    1.       Using the 64bit debugger and Wow64 debugger extension (Wow64exts.dll)

    2.       Using the 32bit debugger

    Using 64bit debugger and Wow64 debugger extension (Wow64exts.dll)

    I ran the 32bit version of Internet Explorer on a 64bit machine and attached to it using the 64bit debugger. Here is the thread 0 call stack when viewed from 64bit debugger.

    0:000> kL

    Child-SP          RetAddr           Call Site

    00000000`0013edf8 00000000`78b8428e wow64cpu!CpupSyscallStub+0x9

    00000000`0013ee00 00000000`6b006a5a wow64cpu!Thunk0Arg+0x5

    00000000`0013ee70 00000000`6b005e0d wow64!RunCpuSimulation+0xa

    00000000`0013eea0 00000000`77ed8030 wow64!Wow64LdrpInitialize+0x2ed

    00000000`0013f6d0 00000000`77ed582f ntdll!LdrpInitializeProcess+0x1538

    00000000`0013f9d0 00000000`77ef30a5 ntdll!_LdrpInitialize+0x18f

    00000000`0013fab0 00000000`77d59620 ntdll!KiUserApcDispatch+0x15

    00000000`0013ffa8 00000000`00000000 0x77d59620

    00000000`0013ffb0 00000000`00000000 0x0

    00000000`0013ffb8 00000000`00000000 0x0

    00000000`0013ffc0 00000000`00000000 0x0

    00000000`0013ffc8 00000000`00000000 0x0

    00000000`0013ffd0 00000000`00000000 0x0

    00000000`0013ffd8 00000000`00000000 0x0

    00000000`0013ffe0 00000000`00000000 0x0

    00000000`0013ffe8 00000000`00000000 0x0

    00000000`0013fff0 00000000`00000000 0x0

    00000000`0013fff8 00000000`00000000 0x0

    00000000`00140000 00000020`78746341 0x0

    00000000`00140008 00005370`00000001 0x20`78746341

     

    The above stack only shows the 64 bit calls; we cannot see what the 32 bit calls are doing. To get the 32bit stack you must use one of the below methods.

    Ø  Option 1 : Run “!wow64exts.k”

    0:000> !wow64exts.k

    Walking 64bit Stack…

    Child-SP          RetAddr           Call Site

    00000000`0013edf8 00000000`78b8428e wow64cpu!CpupSyscallStub+0x9

    00000000`0013ee00 00000000`6b006a5a wow64cpu!Thunk0Arg+0x5

    00000000`0013ee70 00000000`6b005e0d wow64!RunCpuSimulation+0xa

    00000000`0013eea0 00000000`77ed8030 wow64!Wow64LdrpInitialize+0x2ed

    00000000`0013f6d0 00000000`77ed582f ntdll!LdrpInitializeProcess+0x1538

    00000000`0013f9d0 00000000`77ef30a5 ntdll!_LdrpInitialize+0x18f

    00000000`0013fab0 00000000`77d59620 ntdll!KiUserApcDispatch+0x15

    00000000`0013ffa8 00000000`00000000 0x77d59620

    00000000`0013ffb0 00000000`00000000 0x0

    00000000`0013ffb8 00000000`00000000 0x0

    00000000`0013ffc0 00000000`00000000 0x0

    00000000`0013ffc8 00000000`00000000 0x0

    00000000`0013ffd0 00000000`00000000 0x0

    00000000`0013ffd8 00000000`00000000 0x0

    00000000`0013ffe0 00000000`00000000 0x0

    00000000`0013ffe8 00000000`00000000 0x0

    00000000`0013fff0 00000000`00000000 0x0

    00000000`0013fff8 00000000`00000000 0x0

    00000000`00140000 00000020`78746341 0x0

    00000000`00140008 00005370`00000001 0x20`78746341

    Walking 32bit Stack...

    ChildEBP          RetAddr          

    002ded98 75ec1c83 USER32!NtUserWaitMessage+0x15

    002dee24 75ec61ef BROWSEUI!BrowserProtectedThreadProc+0x44

    002dfea8 779ba3a6 BROWSEUI!SHOpenFolderWindow+0x22c

    002dfec8 0040243d SHDOCVW!IEWinMain+0x129

    002dff1c 00402748 IEXPLORE!WinMain+0x316

    002dffc0 7d4e7d2a IEXPLORE!WinMainCRTStartup+0x186

    002dfff0 00000000 KERNEL32!BaseProcessStart+0x28

     

    Ø  Option 2 : Switch to x86 mode (using “!wow64exts.sw”) and do KB.

     

    0:000> !wow64exts.sw

    Switched to 32bit mode

    0:000:x86> kb

    ChildEBP          RetAddr           Args to Child                                        

    002ded98 75ec1c83 002f1be8 002dee50 002f1be8 USER32!NtUserWaitMessage+0x15

    002dee24 75ec61ef 002f1be8 002f1be8 00000000 BROWSEUI!BrowserProtectedThreadProc+0x44

    002dfea8 779ba3a6 002f1be8 00000001 00000000 BROWSEUI!SHOpenFolderWindow+0x22c

    002dfec8 0040243d 002e2508 00000001 ffffffff SHDOCVW!IEWinMain+0x129

    002dff1c 00402748 00400000 00000000 002e2508 IEXPLORE!WinMain+0x316

    002dffc0 7d4e7d2a 00000000 00000000 7efdf000 IEXPLORE!WinMainCRTStartup+0x186

    002dfff0 00000000 004025c2 00000000 000000c8 KERNEL32!BaseProcessStart+0x28

     

    The easiest way to see all of the 32bit call stacks is by switching to  32bit mode (!wow64exts.sw) and doing ~*k.  In addition, you can set breakpoints in 32bit or 64 bit binaries using the 64bit debugger.  Also note  “!peb” will show both the 64bit and 32bit PEB.

     

    Using the 32bit debugger

    As mentioned earlier there is nothing wrong with using the 32 bit debugger.  If you just need to debug the application’s 32bit code, using it is probably the simplest approach.  However, if you need to view Wow64 code or binaries, you must use the 64bit debugger.  Note that these techniques apply to debugging Wow64 dumps and live processes.

    You can find more information about WoW64 applications at http://msdn.microsoft.com/en-us/library/aa384249(VS.85).aspx 

     

  • Ntdebugging Blog

    Debugging a bluescreen at home

    • 15 Comments

     

    Hi, my name is Chad. I work as an escalation engineer for Microsoft’s OEM support team.

     

    A while back, I encountered an interesting crash on one of my computers at home, and I thought I’d post about how I debugged it.

     

    This particular machine had been humming along quite happily for some time, but one day while I was scanning some photos, it bluescreened. Naturally, I hoped it was just a fluke, but after it happened a few more times while doing the same thing, I decided to debug it.

     

    Ordinarily, if a machine crashes when performing a specific activity, like scanning photos, my first inclination would be to suspect a bug in one of the drivers involved in that activity, like the scanner driver or the USB driver. But in this case, I had been using this scanner for a long time, with the same drivers, and never had this problem, so this sudden crashing was kind of mysterious.

     

    Let's see what we can tell from the dump!

     

    The first order of business when looking at a crash dump is the !analyze -v command. I've trimmed some of it here for brevity, but it goes something like this:

     

    kd> !analyze -v

    *******************************************************************************

    *                                                                             *

    *                        Bugcheck Analysis                                    *

    *                                                                             *

    *******************************************************************************

     

    KERNEL_MODE_EXCEPTION_NOT_HANDLED (8e)

    This is a very common bugcheck.  Usually the exception address pinpoints

    the driver/function that caused the problem.  Always note this address

    as well as the link date of the driver/image that contains this address.

    Some common problems are exception code 0x80000003.  This means a hard

    coded breakpoint or assertion was hit, but this system was booted

    /NODEBUG.  This is not supposed to happen as developers should never have

    hardcoded breakpoints in retail code, but ...

    If this happens, make sure a debugger gets connected, and the

    system is booted /DEBUG.  This will let us see why this breakpoint is

    happening.

    Arguments:

    Arg1: c0000005, The exception code that was not handled

    Arg2: 8738e300, The address that the exception occurred at

    Arg3: b9b3dc7c, Trap Frame

    Arg4: 00000000

     

    Debugging Details:

    ------------------

     

     

    EXCEPTION_CODE: (NTSTATUS) 0xc0000005 - The instruction at "0x%08lx" referenced memory at "0x%08lx". The memory could not be "%s".

     

    FAULTING_IP:

    +ffffffff8738e300

    8738e300 0000             add     [eax],al

     

    TRAP_FRAME:  b9b3dc7c -- (.trap ffffffffb9b3dc7c)

    ErrCode = 00000002

    eax=00000001 ebx=bc514c68 ecx=0001065e edx=bc510000 esi=00000955 edi=b9b3dd64

    eip=8738e300 esp=b9b3dcf0 ebp=b9b3dd08 iopl=0         nv up ei pl zr na po nc

    cs=0008  ss=0010  ds=0023  es=0023  fs=0030  gs=0000             efl=00010246

    8738e300 0000             add     [eax],al                ds:0023:00000001=??

    Resetting default scope

     

    DEFAULT_BUCKET_ID:  DRIVER_FAULT

     

    BUGCHECK_STR:  0x8E

     

    LAST_CONTROL_TRANSFER:  from 8051d6a7 to 8053331e

     

    STACK_TEXT: 

    WARNING: Frame IP not in any known module. Following frames may be wrong.

    b9b3dcec bf801619 b9b3dd64 0012eee4 bf81551f 0x8738e300

    b9b3dd08 bf81553c b9b3dd64 0012eee4 bf81551f win32k!ValidateHwnd+0x5c

    b9b3dd50 804de7ec 0001065e 012eb244 00000000 win32k!NtUserInvalidateRect+0x1d

    b9b3dd50 7c90eb94 0001065e 012eb244 00000000 nt!KiFastCallEntry+0xf8

    0012eecc 77d4b601 0105a943 0001065e 012eb244 ntdll!KiFastSystemCallRet

    0012eed0 0105a943 0001065e 012eb244 00000000 USER32!NtUserInvalidateRect+0xc

    77d4b601 90909090 001134b8 0300ba00 12ff7ffe NikonScan4!GetSource+0x21e93

    ...

     

    From the stack trace, we can tell that NikonScan4.dll made a call into User32, which ultimately ended up calling into win32k.sys down in kernel mode. Win32k was in a function called ValidateHwnd() and then tried to call some function at address 0x8738e300, at which point we tried to dereference an invalid pointer and crashed.

     

    What’s at 0x8738e300?

     

    kd> dc 8738e300

    8738e300  00000000 00000001 00000000 87360350  ............P.6.

    8738e310  00000000 00000001 f71af9fe f71b0030  ............0...

    8738e320  f71afb0e f71afbb4 f71b0098 f71b0214  ................

    8738e330  f71afef6 f71aff8e 07fef800 00000000  ................

    8738e340  f71afffc 00000000 0a0e000a 644c6d4d  ............MmLd

    8738e350  8732ea58 870303e0 ffffffff ffffffff  X.2.............

    8738e360  00000012 00000000 f797f000 f7989905  ................

    8738e370  0000c000 00500050 e1971458 00160016  ....P.P.X.......

     

    kd> !pool 8738e300 2

    Pool page 8738e300 region is Nonpaged pool

    *8738e2f8 size:   50 previous size:    8  (Allocated) *NV 

                  Owning component : Unknown (update pooltag.txt)

     

    Well, that’s bad. 0x8738e300 isn’t actually a valid address of a function. That location contains some user data (specifically, some nonpaged pool).

     

    So, that's why we blew up: the ValidateHwnd() function in win32k made a call to this bad address which contains data instead of code! Let's see if we can figure out why it did this. We can find the return address in win32k!ValidateHwnd on the stack, and unassemble the instructions leading up to the point where we ran off into the weeds.

     

    kd> kv L8

    ChildEBP RetAddr  Args to Child             

    WARNING: Frame IP not in any known module. Following frames may be wrong.

    b9b3dcec bf801619 b9b3dd64 0012eee4 bf81551f 0x8738e300

    b9b3dd08 bf81553c b9b3dd64 0012eee4 bf81551f win32k!ValidateHwnd+0x5c (FPO: [Non-Fpo])

    b9b3dd50 804de7ec 0001065e 012eb244 00000000 win32k!NtUserInvalidateRect+0x1d (FPO: [Non-Fpo])

    b9b3dd50 7c90eb94 0001065e 012eb244 00000000 nt!KiFastCallEntry+0xf8 (FPO: [0,0] TrapFrame @ b9b3dd64)

    0012eecc 77d4b601 0105a943 0001065e 012eb244 ntdll!KiFastSystemCallRet (FPO: [0,0,0])

    0012eed0 0105a943 0001065e 012eb244 00000000 USER32!NtUserInvalidateRect+0xc

    77d4b601 90909090 001134b8 0300ba00 12ff7ffe NikonScan4!GetSource+0x21e93

    77d4b605 001134b8 0300ba00 12ff7ffe 900008c2 0x90909090

     

    The return address from where we made the bad call is bf801619. Let’s unassemble a few instructions backwards from that address (using the handy but often-overlooked “ub” command) and see what the code was doing:

     

    kd> ub bf801619

    win32k!ValidateHwnd+0x2f:

    bf8015ff 8d1c82           lea     ebx,[edx+eax*4]

    bf801602 8bc1             mov     eax,ecx

    bf801604 c1e810           shr     eax,0x10

    bf801607 663b430a         cmp     ax,[ebx+0xa]

    bf80160b 75ad             jnz     win32k!ValidateHwnd+0x3d (bf8015ba)

    bf80160d 807b0801         cmp     byte ptr [ebx+0x8],0x1

    bf801611 7573             jnz     win32k!ValidateHwnd+0xff (bf801686)

    bf801613 ff15e0b298bf call dword ptr [win32k!_imp__PsGetCurrentThread (bf98b2e0)]

     

    OK, so it's pretty simple. Win32k decided which function address to call by reading it from a pointer stored at a hardcoded location (in other words, in a global variable) within Win32k itself. That pointer is located at bf98b2e0. The debugger helpfully tells us that this pointer is intended to contain the address of a function called PsGetCurrentThread, but let’s double-check this and make sure it actually does. At this point, the working assumption would be that this pointer had gotten corrupted somehow.

     

    kd> dd bf98b2e0 L1

    bf98b2e0  804e4a15

     

    Interesting. bf98b2e0 contains the value 804e4a15. This is not even close to the bad address the processor actually called! Remember, from the stack trace, we jumped to 8738e300 instead. What does this pointer actually point to?

     

    kd> ln 804e4a15

    (804e4a15)   nt!PsGetCurrentThread   |  (804e4a51)   nt!KeSetEventBoostPriority

    Exact matches:

        nt!PsGetCurrentThread = <no type information>

     

    Sure enough, 804e4a15 is the address of nt!PsGetCurrentThread.

     

    So, basically, win32k was trying to call PsGetCurrentThread(), and even had the correct pointer in memory to get there, but the processor instead jumped to a bogus address located in the middle of some user data.

     

    At this point it's pretty safe to say that this is a hardware problem, and furthermore, since the relevant memory looks fine, it looks like a problem within the CPU itself. (I’m not an expert on CPU architecture, but if I were to take a wild guess I’d say that maybe the processor had some problem when reading from its L2 cache.)

     

    Upon discovering this, I decided to crack open my case and take a look. It didn’t take long to spot the problem:

    image

    Yes, that's my CPU fan. Notice that the heatsink is completely clogged up with dust!

     

    The resolution to this story: I took a can of compressed air and blew all the dust out of the heatsink. This took care of the problem, and the computer happily scanned hundreds of photos and hasn’t crashed again since. Why was it only crashing when I was scanning photos? Most likely because this was causing the CPU to run at 100% utilization for extended periods of time, and it was simply overheating!

     

  • Ntdebugging Blog

    Hung Window?, No source?, No problem!! Part 1

    • 2 Comments

    Written by Jeff Dailey

     

    Hello, my name is Jeff, I’m a escalation engineer on the Microsoft CPR Platforms team.   This blog entry is a follow on for how to detect a hung window.   This process and training lab is right out of our CPR Training curriculum.  In order to do the lab I have prepared for you, you will need to have downloaded the dumphungwindow and then badwindow.exe from my earlier blog post.  You will also need to install the debugging tools for windows.  

     

    Debugging tools:  http://www.microsoft.com/whdc/devtools/debugging/installx86.mspx

    Previous blog http://blogs.msdn.com/ntdebugging/archive/2007/05/29/detecting-and-automatically-dumping-hung-gui-based-windows-applications.aspx

     

    After you have both of these installed we can get started.  We are going to debug and figure out why the window stops repainting and does not respond.

     

    Step 1 start badwindow.exe

    Step 2 run dumphungwindow.exe

    Step 3 select Hang \ Hang Type 1 from  the BadWindow.exe menu.

     

    You should see dump hung window detect your window no processing messages and as a result it will dump the badwindow.exe process

     

    ************ OUTPUT *************

    C:\source\dumphungwindow\release>dumphungwindow.exe
    Dumps will be saved in C:\Users\jeffda\AppData\Local\Temp\
    scanning for hung windows

    **

    Hung Window found dumping process (7924) badwindow.exe
    Dumping unresponsive process
    C:\Users\jeffda\AppData\Local\Temp\HWNDDump_Day6_12_2007_Time9_53_34_Pid7924_badwindow.exe.dmp

    Dump complete

    Hung Window found dumping process (7924) badwindow.exe
    Dumping unresponsive process
    C:\Users\jeffda\AppData\Local\Temp\HWNDDump_Day6_12_2007_Time9_53_46_Pid7924_badwindow.exe.dmp

    Dump complete

    Hung Window found dumping process (7924) badwindow.exe
    Dumping unresponsive process
    C:\Users\jeffda\AppData\Local\Temp\HWNDDump_Day6_12_2007_Time9_53_56_Pid7924_badwindow.exe.dmp

    Dump complete*

    ************ OUTPUT *************

     

    Step 4 create a local symbol directory at C:\websymbols

    Step 5 set your symbol path under file \ symbols in windbg to SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols

    See http://www.microsoft.com/whdc/devtools/debugging/debugstart.mspx for details.

     

    Step 6 start windbg select file\open crash dump and select the first dump file.

    Your initial output should look like this.

     

    Microsoft (R) Windows Debugger  Version 6.7.0001.0

    Copyright (c) Microsoft Corporation. All rights reserved.

     

    ***** WARNING: Your debugger is probably out-of-date.

    *****          Check http://dbg for updates.

     

    Loading Dump File [C:\Users\jeffda\AppData\Local\Temp\HWNDDump_Day6_12_2007_Time9_53_34_Pid7924_badwindow.exe.dmp]

    User Mini Dump File with Full Memory: Only application data is available

     

    Symbol search path is: Symbol search path is: SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols

    Executable search path is:

    Windows Vista Version 6000 MP (2 procs) Free x86 compatible

    Product: WinNt, suite: SingleUserTS

    Debug session time: Tue Jun 12 09:53:35.000 2007 (GMT-4)

    System Uptime: 11 days 18:41:43.089

    Process Uptime: 0 days 0:00:32.000

    ....................................

    Loading unloaded module list

    .

    eax=00000000 ebx=00000000 ecx=00000000 edx=00000000 esi=004e3750 edi=004d9c60

    eip=777fb2b2 esp=0017ee18 ebp=0017ee44 iopl=0         nv up ei pl nz na po nc

    cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000202

    ntdll!NtAlpcSendWaitReceivePort+0x12:

    777fb2b2 c22000          ret     20h

     

    Step 7 from the debugger prompt (Locate a prompt at the bottom of windbg that has a 0:000> next to it.
    Type ~* k

    You will most likely see output similar to this.

     

    .  0  Id: 1ef4.17a8 Suspend: 0 Teb: 7efdd000 Unfrozen

    ChildEBP RetAddr 
    0017ee14 76f584c5 ntdll!NtAlpcSendWaitReceivePort+0x12
    0017ee44 76f5899d rpcrt4!LRPC_CASSOCIATION::AlpcSendWaitReceivePort+0x2f
    0017ee78 76f58514 rpcrt4!LRPC_BASE_CCALL::DoSendReceive+0x48
    0017ee90 76f57785 rpcrt4!LRPC_BASE_CCALL::SendReceive+0x2c
    0017eea0 76f5766b rpcrt4!I_RpcSendReceive+0x25
    0017eeb4 76f5763a rpcrt4!NdrSendReceive+0x31
    0017eec0 76ff0146 rpcrt4!NdrpSendReceive+0x9
    0017f2c8 756e5ff3 rpcrt4!NdrClientCall2+0x18f
    0017f2e0 756e5ef9 dnsapi!R_ResolverQuery+0x19
    0017f34c 756e5d34 dnsapi!Query_PrivateExW+0x257
    0017f388 756e6025 dnsapi!Query_Shim+0xbb
    0017f3b0 74f461b4 dnsapi!DnsQuery_W+0x20
    0017f3e0 74f4611c mswsock!SaBlob_Query+0x2d
    0017f424 74f459db mswsock!Rnr_DoDnsLookup+0x129
    0017f6c0 76d75a72 mswsock!Dns_NSPLookupServiceNext+0x25c
    0017f6d8 76d75a52 ws2_32!NSPROVIDER::NSPLookupServiceNext+0x17
    0017f6f4 76d75994 ws2_32!NSPROVIDERSTATE::LookupServiceNext+0x1c
    0017f720 76d7590a ws2_32!NSQUERY::LookupServiceNext+0xa4
    0017f764 76d7b11f ws2_32!WSALookupServiceNextW+0x74
    0017f7b4 76d7ab77 ws2_32!WSALookupServiceNextA+0x52

       1  Id: 1ef4.f80 Suspend: 0 Teb: 7efda000 Unfrozen

    ChildEBP RetAddr 
    0275fef0 76e41220 ntdll!ZwWaitForSingleObject+0x15
    0275ff60 76e41188 kernel32!WaitForSingleObjectEx+0xbe
    *** ERROR: Symbol file could not be found.  Defaulted to export symbols for FwcWsp.dll -
    0275ff74 55620611 kernel32!WaitForSingleObject+0x12
    WARNING: Stack unwind information not available. Following frames may be wrong.
    0275ff88 5561f10c FwcWsp!RatDumpStack+0x1bd
    0275ffa0 76eb19f1 FwcWsp!RatDbgPrint+0x156
    0275ffac 7782d109 kernel32!BaseThreadInitThunk+0xe
    0275ffec 00000000 ntdll!_RtlUserThreadStart+0x23

     

       2  Id: 1ef4.1d04 Suspend: 0 Teb: 7efd7000 Unfrozen
    ChildEBP RetAddr 
    028cfe90 76e4edb5 ntdll!ZwWaitForMultipleObjects+0x15
    028cff2c 76e430c3 kernel32!WaitForMultipleObjectsEx+0x11d
    028cff48 5560c841 kernel32!WaitForMultipleObjects+0x18
    WARNING: Stack unwind information not available. Following frames may be wrong.
    028cff84 5560c9c0 FwcWsp!WspDisable+0x1b47
    028cffa0 76eb19f1 FwcWsp!WspDisable+0x1cc6
    028cffac 7782d109 kernel32!BaseThreadInitThunk+0xe
    028cffec 00000000 ntdll!_RtlUserThreadStart+0x23

     

    Note the [NUMBER] Id: indicates the thread number, to the right of this you have the process id and thread id > PROCESS 1ef4.1d04 < THREAD | THREAD STATE > Suspend: 0 Teb: 7efd7000 Unfrozen

     

    Each of these threads represents a call stack.  The most recent call is at the TOP of the stack.  As each call is made the stack grows larger.  In this case thread 0 is not showing us all it’s stack.  We can get more of the stack be specifying the number of frames we would like to see.

     

    In this case my current thread is 0 so I will issue a k 123 command asking for as many 123 calls to be displayed.

     

    0:000> k 123

    ChildEBP RetAddr 

    0017ee14 76f584c5 ntdll!NtAlpcSendWaitReceivePort+0x12

    0017ee44 76f5899d rpcrt4!LRPC_CASSOCIATION::AlpcSendWaitReceivePort+0x2f

    0017ee78 76f58514 rpcrt4!LRPC_BASE_CCALL::DoSendReceive+0x48

    0017ee90 76f57785 rpcrt4!LRPC_BASE_CCALL::SendReceive+0x2c

    0017eea0 76f5766b rpcrt4!I_RpcSendReceive+0x25

    0017eeb4 76f5763a rpcrt4!NdrSendReceive+0x31

    0017eec0 76ff0146 rpcrt4!NdrpSendReceive+0x9

    0017f2c8 756e5ff3 rpcrt4!NdrClientCall2+0x18f

    0017f2e0 756e5ef9 dnsapi!R_ResolverQuery+0x19

    0017f34c 756e5d34 dnsapi!Query_PrivateExW+0x257

    0017f388 756e6025 dnsapi!Query_Shim+0xbb

    0017f3b0 74f461b4 dnsapi!DnsQuery_W+0x20

    0017f3e0 74f4611c mswsock!SaBlob_Query+0x2d

    0017f424 74f459db mswsock!Rnr_DoDnsLookup+0x129

    0017f6c0 76d75a72 mswsock!Dns_NSPLookupServiceNext+0x25c

    0017f6d8 76d75a52 ws2_32!NSPROVIDER::NSPLookupServiceNext+0x17

    0017f6f4 76d75994 ws2_32!NSPROVIDERSTATE::LookupServiceNext+0x1c

    0017f720 76d7590a ws2_32!NSQUERY::LookupServiceNext+0xa4

    0017f764 76d7b11f ws2_32!WSALookupServiceNextW+0x74

    0017f7b4 76d7ab77 ws2_32!WSALookupServiceNextA+0x52

    0017f7e0 76d7dba3 ws2_32!getxyDataEnt+0xbd

    0017fa1c 0040147d ws2_32!gethostbyname+0xb4

    0017fbc8 004012fc badwindow!hangtype1+0x3d [c:\source\badwindow\badwindow\badwindow.cpp @ 310]

    0017fc24 772a87af badwindow!WndProc+0xdc [c:\source\badwindow\badwindow\badwindow.cpp @ 274]

    0017fc50 772a8936 user32!InternalCallWinProc+0x23

    0017fcc8 772a8a7d user32!UserCallWinProcCheckWow+0x109

    0017fd2c 772a8ad0 user32!DispatchMessageWorker+0x380

    0017fd3c 004010fb user32!DispatchMessageW+0xf

    0017ff0c 00401817 badwindow!wWinMain+0xfb [c:\source\badwindow\badwindow\badwindow.cpp @ 124]

    0017ffa0 76eb19f1 badwindow!__tmainCRTStartup+0x150 [f:\sp\vctools\crt_bld\self_x86\crt\src\crtexe.c @ 589]

    0017ffac 7782d109 kernel32!BaseThreadInitThunk+0xe

    0017ffec 00000000 ntdll!_RtlUserThreadStart+0x23

     

    As you can see our current thread is the WinMain thread.   Note this from the call badwindow!wWinMain,  this winmain is processing messages, you can see it calling DispatchMessageW this goes back into User32, and user32 has to rout this message for the proper WinProc that is registered for the associated Window.  

     

    Looking further up the stack you see our winproc being called badwindow!WndProc.  Remember it’s critical that your WinProc stays available to process incoming messages.  If it makes a blocking call or runs in a loop you will not be able to process incoming messages and your window will not repaint and your application will appear to be hung.

     

    If you look deeper yet into the call stack you can see that the WndProc calls hangtype1, this calls gethostbyname.    http://msdn2.microsoft.com/en-us/library/ms738524.aspx 

     

    Looking at the docs, Gethostbyname takes one parameter being the server name.  Looks dump it out and see what we are looking for.

     

    0:000> kv 123

    ChildEBP RetAddr  Args to Child             

    0017ee14 76f584c5 00000424 40020000 004d9c60 ntdll!NtAlpcSendWaitReceivePort+0x12 (FPO: [8,0,0])

    0017ee44 76f5899d 00020000 004d9c60 004e3638 rpcrt4!LRPC_CASSOCIATION::AlpcSendWaitReceivePort+0x2f (FPO: [Non-Fpo])

    0017ee78 76f58514 0017eed8 0017ef04 0017f2e8 rpcrt4!LRPC_BASE_CCALL::DoSendReceive+0x48 (FPO: [Non-Fpo])

    0017ee90 76f57785 0017eed8 0017ef04 0017eeb4 rpcrt4!LRPC_BASE_CCALL::SendReceive+0x2c (FPO: [Non-Fpo])

    0017eea0 76f5766b 0017eed8 756e54f0 0017ef04 rpcrt4!I_RpcSendReceive+0x25 (FPO: [Non-Fpo])

    0017eeb4 76f5763a 0017ef04 004d9d18 76ff0146 rpcrt4!NdrSendReceive+0x31

    0017eec0 76ff0146 59a28201 00000000 00000000 rpcrt4!NdrpSendReceive+0x9 (FPO: [0,0,0])

    0017f2c8 756e5ff3 756e54f0 756e5aec 0017f2e8 rpcrt4!NdrClientCall2+0x18f

    0017f2e0 756e5ef9 00000000 0279091c 00000001 dnsapi!R_ResolverQuery+0x19 (FPO: [Non-Fpo])

    0017f34c 756e5d34 0279091c 00000001 14000000 dnsapi!Query_PrivateExW+0x257 (FPO: [Non-Fpo])

    0017f388 756e6025 00000001 0279091c 00000001 dnsapi!Query_Shim+0xbb (FPO: [Non-Fpo])

    0017f3b0 74f461b4 0279091c 00000001 14000000 dnsapi!DnsQuery_W+0x20 (FPO: [Non-Fpo])

    0017f3e0 74f4611c 0279091c 00000001 14000000 mswsock!SaBlob_Query+0x2d (FPO: [Non-Fpo])

    0017f424 74f459db 0279091c 0031c2e0 00307118 mswsock!Rnr_DoDnsLookup+0x129 (FPO: [Non-Fpo])

    0017f6c0 76d75a72 027908b8 00000000 0017f794 mswsock!Dns_NSPLookupServiceNext+0x25c (FPO: [Non-Fpo])

    0017f6d8 76d75a52 00307270 027908b8 00000000 ws2_32!NSPROVIDER::NSPLookupServiceNext+0x17 (FPO: [Non-Fpo])

    0017f6f4 76d75994 0031bc18 00000000 0017f794 ws2_32!NSPROVIDERSTATE::LookupServiceNext+0x1c (FPO: [Non-Fpo])

    0017f720 76d7590a 00307130 00000000 0017f794 ws2_32!NSQUERY::LookupServiceNext+0xa4 (FPO: [Non-Fpo])

    0017f764 76d7b11f 00307118 00000000 0017f794 ws2_32!WSALookupServiceNextW+0x74 (FPO: [Non-Fpo])

    0017f7b4 76d7ab77 00307118 00000000 0017f7ec ws2_32!WSALookupServiceNextA+0x52 (FPO: [Non-Fpo])

    0017f7e0 76d7dba3 00307118 0000013c 0017fa38 ws2_32!getxyDataEnt+0xbd (FPO: [Non-Fpo])

    0017fa1c 0040147d 0017fa38 00000000 00070520 ws2_32!gethostbyname+0xb4 (FPO: [Non-Fpo])

    0017fbc8 004012fc 00401220 0017fbfc 00401220 badwindow!hangtype1+0x3d (FPO: [Uses EBP] [0,101,0]) (CONV: cdecl) [c:\source\badwindow\badwindow\badwindow.cpp @ 310]

    0017fc24 772a87af 00070520 00000111 00008003 badwindow!WndProc+0xdc (CONV: stdcall) [c:\source\badwindow\badwindow\badwindow.cpp @ 274]

    0017fc50 772a8936 00401220 00070520 00000111 user32!InternalCallWinProc+0x23

    0017fcc8 772a8a7d 00000000 00401220 00070520 user32!UserCallWinProcCheckWow+0x109 (FPO: [Non-Fpo])

    0017fd2c 772a8ad0 00401220 00000000 0017ff0c user32!DispatchMessageWorker+0x380 (FPO: [Non-Fpo])

    0017fd3c 004010fb 0017fd54 00403938 00000001 user32!DispatchMessageW+0xf (FPO: [Non-Fpo])

    0017ff0c 00401817 00400000 00000000 004a0f84 badwindow!wWinMain+0xfb (CONV: stdcall) [c:\source\badwindow\badwindow\badwindow.cpp @ 124]

    0017ffa0 76eb19f1 7efde000 0017ffec 7782d109 badwindow!__tmainCRTStartup+0x150 (FPO: [Non-Fpo]) (CONV: cdecl) [f:\sp\vctools\crt_bld\self_x86\crt\src\crtexe.c @ 589]

    0017ffac 7782d109 7efde000 0017206b 00000000 kernel32!BaseThreadInitThunk+0xe (FPO: [Non-Fpo])

    0017ffec 00000000 00401987 7efde000 00000000 ntdll!_RtlUserThreadStart+0x23 (FPO: [Non-Fpo])

    0:000> da 0017fa38

    0017fa38  "AServerThatDoesNotExist14"

     

    Generally speaking it’s not a good idea to make any call that could possibly block in your winproc. 

     

    Let’s look deeper into what the caller (hangtype1) is really doing, Feel free to look at the C code as it was included with the project.  I like to look at the assembler,  it’s nice practice for those times you don’t have the source.  We can take the return address from the gethostbyname call 0040147d to figure out how it was called.  A neat trick for disassembling the entire function based on an address within that functions code range is to use the unassembled function or UF command.  In the following sample output I’ll simply UF the return address of the gethostbyname call.   I’ll then walk trough and annotate the disassembly.  Then we will look at the C source.

     

    For reference on intel cpu programming see http://developer.intel.com/design/pentium/manuals/24319001.pdf

     

    0:000> uf 0040147d

    We are making run for locals on the stack, 402 dec bytes.  If you look at the source you will see this is a sznamebuff[400], int I, and pointer to a host structure.
    badwindow!hangtype1 [c:\source\badwindow\badwindow\badwindow.cpp @ 298]:
    298 00401440 81ec94010000    sub     esp,194h 

     

    This is done for  stack checking to ensure none is trying to exploit our stack.
    298 00401446 a100304000      mov     eax,dword ptr [badwindow!__security_cookie (00403000)]
    298 0040144b 33c4            xor     eax,esp
    298 0040144d 89842490010000  mov     dword ptr [esp+190h],eax

    Save ebx register state on the stack
    298 00401454 53              push    ebx 

    Get the function pointer for sprint from the import table.
    299 00401455 8b1d7c204000    mov     ebx,dword ptr [badwindow!_imp__sprintf (0040207c)]

    Save the stack base / frame pointer.
    299 0040145b 55              push    ebp

    Get the address of gethostbyname from the import table and save it in EBP.
    299 0040145c 8b2d20214000    mov     ebp,dword ptr [badwindow!_imp__gethostbyname (00402120)]

    Save ESI and EDI on the stack.
    299 00401462 56              push    esi
    299 00401463 57              push    edi

    Zero out ESI

    299 00401464 33f6            xor     esi,esi
    badwindow!hangtype1+0x26 [c:\source\badwindow\badwindow\badwindow.cpp @ 306]:

    Push ESI on the stack and the first pass it will be zero == 0, this is the 3rd parameter to the call we are about to make.
    306 00401466 56              push    esi

    We are loading the effective address of something on the stack, we will see what it is used for when we use EAX, remember that.
    306 00401467 8d442414        lea     eax,[esp+14h]

    Hum what is this?.
    badwindow!`string' (004021ac) Well when in doubt dump it out.
    0:000> db 004021ac
    004021ac  41 53 65 72 76 65 72 54-68 61 74 44 6f 65 73 4e  AServerThatDoesN
    004021bc  6f 74 45 78 69 73 74 25-64 00 00 00 57 65 20 61  otExist%d...We a
    004021cc  72 65 20 69 6e 20 68 61-6e 67 74 79 70 65 32 74  re in hangtype2t
    004021dc  68 72 65 61 64 61 00 00-57 65 20 61 72 65 20 69  hreada..We are i
    004021ec  6e 20 68 61 6e 67 74 79-70 65 32 74 68 72 65 61  n hangtype2threa
    004021fc  64 62 00 00 48 00 00 00-00 00 00 00 00 00 00 00  db..H...........
    0040220c  00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00  ................
    0040221c  00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00  ................

    Ok so we are pushing our print formatting string for sprintf this is the second parameter to the call we are about to make.

    306 0040146b 68ac214000      push    offset badwindow!`string' (004021ac)

    Ok here we are pushing EAX,  remember eax is just an address on our stack, and in this case the 1st parameter passed to sprint,  it’s our BUFFER.  szNameBuffer to be exact.
    306 00401470 50              push    eax

     

    Call sprint (Remember we stored it’s address in ebx)
    306 00401471 ffd3            call    ebx

     

    Stacks grown down or negative as they get larger, so adding in this case is cleaning up the stack from our sprintf call
    306 00401473 83c40c          add     esp,0Ch

     

    Get the address of the szNameBuffer
    307 00401476 8d4c2410        lea     ecx,[esp+10h]

     

    Push the address of the name buffer on the stack.
    307 0040147a 51              push    ecx

     

    Call gethostbyname (Rember we stored it’s address in ebp the)  NOTE when you do a return the value is stored in EAX (REMBER THIS)
    307 0040147b ffd5            call    ebp

     

    Push 250 dec FA hex on the stack.  (This is our return address from get host by name we see on the stack)
    310 0040147d 68fa000000      push    0FAh 

     

    Store EAX in EDI We need to save it because EAX will be over written when we come back from Sleep and we may want to use it later.  Eax is the return value from gethostbyname.
    310 00401482 8bf8            mov     edi,eax

     

    Call sleep it takes 1 parameter and that values was just pushed on the stack 250ms ¼ second.
    310 00401484 ff150c204000    call    dword ptr [badwindow!_imp__Sleep (0040200c)]

    Increment ESI this is where we are storing the value int i.   It was zero the first time through now it’s ESI+1 if we loop it will grow
    311 0040148a 83c601          add     esi,1

     

    Compare ESI we know this is in I to 0x64h or 100dec.
    312 0040148d 83fe64          cmp     esi,64h

    Where we have a jump if greater then, that is to say if ESI > 64h or 100dec jump to 00401496
    312 00401490 7f04            jg      badwindow!hangtype1+0x56 (00401496)

    Now we test EDI, this is the return value from gethostbyname.  TEST is a logical compare.  So if you test EDI and EDI and EDI = 0
    304 00401492 85ff            test    edi,edi

    If EDI is ZERO 0 jump to this address je is jump if ZERO this take us to the TOP of the loop.

    304 00401494 74d0            je      badwindow!hangtype1+0x26 (00401466)

     

    The rest of this is us cleaning up the stack and getting ready to drop out of the function.  We can stop here.
    318 00401496 8b8c24a0010000  mov     ecx,dword ptr [esp+1A0h]
    318 0040149d 5f              pop     edi
    318 0040149e 5e              pop     esi
    318 0040149f 5d              pop     ebp
    318 004014a0 5b              pop     ebx
    318 004014a1 33cc            xor     ecx,esp
    318 004014a3 e8c5010000      call    badwindow!__security_check_cookie (0040166d)
    318 004014a8 81c494010000    add     esp,194h
    318 004014ae c3              ret

     

    Here is the C code snipped for hangtype1

     

    void hangtype1(void)

    {

    int i =0;

    char szNameBuff[400];

    struct hostent * myhost=NULL;

     

          // Lets see if any of my favorte machines are on the network.

          while (!myhost)

          {

                sprintf(szNameBuff,"AServerThatDoesNotExist%d",i);

                myhost = gethostbyname (szNameBuff);

                // We are hitting network resource in a loop.

                // Let's not hit the server with to many request at once. We will pause 250ms.

                Sleep(250);

                i++;

                if(i>100)

                {

                      break;

                }

          }

         

    }

     

    So what is happening,  if you recall we saw AServerThatDoesNotExist14 on the stack.  We know from this loop and the format string passed to sprint that this loop will run to 100 and will sleep 250ms with each pass.  That’s 25 seconds total just in sleeps not to mention the network stall on a bad server name.   I realize this code is somewhat contrived but it’s based on the types of things we actually see and troubleshoot every day.

     

    So how do we get this work done without stalling the application?.  The answer is rather simple.  We go multithreaded.   You can look for hangtype1fix in the project.  Instread of calling hangtype1 from our windproc directly we declare a threadproc called hangtype1fixthread

     

    void __cdecl hangtype1fixthread(void *)

    {

           // our blocking function.

           hangtype1();

    }

     

    Then from our winproc instead of calling hangtype1 we call another function called hangtype1fix that creates a new thread that in turn executes out hangtype1 code in another thread context so our winproc can continue to execute and process incoming messages.  The being thread call is asynchronously .  The thread is create immediately and your thread continues to run.

     

    void hangtype1fix(void)

    {

           _beginthread(hangtype1fixthread, 12000, NULL);

    }

     

    I hope you found this blog informative and I welcome your feedback.

     

    Thanks  Jeff-

     

     

  • Ntdebugging Blog

    DNS Slow to Start?

    • 7 Comments

    Hi my name is Jason. I’m an escalation engineer on the Microsoft critical problem resolutions platform team. We recently dealt with an interesting issue that I would like to share with you and show how we went about discovering the cause. 

    The customer stated that when his server (Windows Server 2003 SP1) boots up the DNS Server service does not become availabe/responsive for ~5 minutes. Customer also stated he has other DC's in this domain that are the same model server (same specs too) that do not have the problem. So I ask, "What do you mean not responsive?". During this 5 minute period, the administrator cannot use the DNS MMC to administer the server, and DNS lookups to this server fail with a timeout error. The customer can: Ping the server, connect to file shares, etc.

    Troubleshooting

    So my initial question was "What is DNS doing for 5 minutes?". In order to determine where dns.exe was hung up or spinning his wheels, I replaced dns.exe with the checked build.

    Walkthrough:

    1.     Downloaded the checked build of SP1

    2.     Extracted and expanded the checked build of dns.exe (What is a checked build?)

    3.     Worked through the process of replacing the free build of DNS.exe (Booting into safe mode in order to bypass WFP (Windows file protection))

    4.     Created the dnsdebug file (a text file with no extension under %windir%\system32\dns). The checked build of DNS looks for this file, and reads the contents to determine the level of output. 

    5.     In the file I set the debug flags for DNS_DEBUG_INIT (0x00000010 ) & DNS_DEBUG_INIT2 (0x01000000) and set the output to go to the debugger (0x00000002), net result (0x01000012). The image below shows contents of the file.
     

    6.     Instead of using the debugger I decided to use DebugView from Sysinternals to view the output. I could have attached a debugger to DNS.exe to see the output, but in this case the DebugView was sufficient.

     

    With all this in place we rebooted. We launched DebugView and did a "net start dns". Here is a screen scrape of what the output looked like:

    image

     

    I have highlighted the debug flag I have set.

    From this output we were able to determine that the Server was going through the following process:

    1.     DNS gets started by Service Control Manger (a.k.a. SCM) and then DNS communicates back to SCM (across the NtControlPipe)

    2.     DNS loads the zones from the registry (HKLM\Software\Microsoft\Windows NT\Current Version\DNS Server\Zones

    3.     DNS pulls this zones from the Active Directory (as this is an AD integrated zone)

     

    During these phases the ability to respond to administrative requests and name lookup requests is blocked (this is by design). So based on this I needed to figure out where we were spending the most of our 5 minutes.

    1.     For the SCM portion, I used Filemon monitoring only Named Pipes for dns.exe and services.exe.

    2.     For the registry portion, I used Regmon with a filter for dns.exe

    3.     For the AD portion, I used perfmon with the NTDS>LDAP Searches/Sec counter

    With these running we were able to isolate the delay to the point where DNS was pulling the zones from AD. Here is a scrape of me running this test on my VM:

    image

     

    You can see (top left) the service communicating back to SCM across the NtControlPipe. Lets take a moment here and have a short discussion how a service starting works.

    1. SCM goes to a registry value that will be used as a seed value to build a named pipe. (HKLM\System\CurrentControlSet\Control\ServiceCurrent\(default) = <a number>
    2. T
    he result is incremented by 1 and then returned back to SCM
    3. SCM then creates a named pipe using NtControlPipe as the prefix followed by the numerical value from the registry. So when starting the first service, the pipe name would be NtControlPipe1 (To get a peek at these pipes download PipeList from Sysinternals)
    4.
    The SCM then starts the service process
    5.
    The service then goes to the same registry value and uses this seed to know which named pipe he will need to talk to in order to convey progress back to SCM.
     

    We can then see (top right) the service pulling in the dns zone information from the registry. Finally we see (bottom) the small spike in LDAP searches/sec (highlighted) where the service was pulling the zone information from the AD. Until the pull of the zones from the registry and AD is complete, the DNS service does not respond to MMC administration or DNS requests.

    The majority of the 5 minute delta fell into the last bucket (the LDAP queries to pull the zone data). So begins our next discussion, What causes LDAP to be slow?

    So LDAP lives in LSASS.EXE. There are two LDAP interfaces in AD. The entire directory is available of TCP port 389, and then the subset of AD that is published (a.k.a. Global Catalog) over TCP port 3268. The DNS services is pulling the DNS zones from the "Entire Directory" interface.

    Lsass memory usage on domain controllers has two components:

    • Data structures, which are like those in other processes (i.e. threads, heaps, and stacks)
    • Database buffer cache, which consists of database pages and index pages for the directory ß This is where a "In-RAM" copy of AD database would be

       

    In Windows Server 2003, there is no limit to how large the database buffer cache can grow (outside of the limits of the 2GB of virtual address space that can be modified via the /3GB switch). This helps explain the behavior of lsass.exe normally being the largest value in the "Mem Usage" category in task manager on a DC.

    Q. What can affect the amount of Database buffer cache that LSASS is maintaining "In-RAM?
    A. Memory pressure. This could be caused by configuration (pagefile settings, etc.) or another application(s) using and paging a lot of memory on the machine.

    lsass.exe on the affected machine was using 300mb in the "Mem Usage" column in task manager. The machine that was not affected was closer to 450mb. This means there was 150mb more data in the Database buffer cache, and due to this, the process is able to respond a lot faster to LDAP requests since he can pull from the RAM cache more and this limits the amount of time spent pulling this data into the cache from the DIT file on the disk.

    The solution is to tune the pagefile settings to match the better performing server.

    By simply increasing the pagefile from ~1 gig to ~3gig the DNS Server can be available within ~30 seconds

    So what are we (Microsoft) doing about this?

    Excerpt from the "Book of Longhorn"

    What new DNS functionality is provided by this feature in Windows Server "Longhorn"?

    The DNS Server service in Windows Server "Longhorn" includes a number of new and enhanced features compared to the DNS Server service that was available in the Microsoft Windows NT® Server, Windows 2000 Server, and Windows Server® 2003 operating systems. The following sections describe these features.

     

    Background zone loading

    Very large organizations with extremely large zones that store their DNS data in AD DS sometimes discover that restarting a DNS server can take an hour or more while the DNS data is retrieved from the directory service. The result is that the DNS server is effectively unavailable to service client requests for the entire time that it takes to load AD DS-based zones.

    A DNS server running Windows Server "Longhorn" now loads zone data from AD DS in the background while it restarts so that it can respond to requests for data from other zones. When the DNS server starts, it:

    • Enumerates all zones to be loaded.
    • Loads root hints from files or AD DS storage.
    • Loads all file-backed zones, that is, zones that are stored in files rather than in AD DS.
    • Begins responding to queries and remote procedure calls (RPCs).
    • Spawns one or more threads to load the zones that are stored in AD DS.

    Because the task of loading zones is performed by separate threads, the DNS server is able to respond to queries while zone loading is in progress. If a DNS client requests data for a host in a zone that has already been loaded, the DNS server responds with the dat (or, if appropriate, a negative response) as expected. If the request is for a node that has not yet been loaded into memory, the DNS server reads the node's data from AD DS and updates the node's record list accordingly.

    Why is this functionality important?

    The DNS server can use background zone loading to begin responding to queries almost immediately when it restarts, instead of waiting until its zones are fully loaded. The DNS server can respond to queries for the nodes that it has loaded or that can be retrieved from AD DS. This functionality also provides another advantage when zone data is stored in AD DS rather than in a file: AD DS can be accessed asynchronously and immediately when a query is received, while file-based zone data can be accessed only through a sequential read of the file.

    I hope you find this information helpful.

  • Ntdebugging Blog

    Part 1 - ETW Introduction and Overview

    • 2 Comments

    Introduction:

     

    Event Tracing for Windows (ETW) is a system and software diagnostic, troubleshooting and performance monitoring component of Windows that has been around since Windows 2000. However, it wasn't until Windows Vista that major components of the OS were updated to heavily use ETW tracing; making it much more practical and useful.

    ETW is useful from a variety of scenarios, including:

     -User & Admin: Being in control of your system and knowing what is going on.
     -User & Admin: Troubleshooting performance, hardware and OS issues.
     -Enthusiast: Learning further about the OS and the low level guts of the OS.
     -Software Developer/ISV/OEM: Investigating issues with your software's interaction with Microsoft OS & technologies
     -Hardware Developer/IHV/OEM: Investigating issues with hardware interaction with the OS, including kernel, driver subsystems, up to the user stack.

    ETW is a set of technologies and tools that can absolutely complement existing tools while providing a look into the guts of the OS at a very low level.

    A great article from the April 2007 edition of MSDN Magazine - http://msdn.microsoft.com/en-us/magazine/cc163437.aspx, covers ETW in great depth, and is recommended reading.

    Here is graphical overview of the ETW infrastructure that covers how provider(s) log to high performance in memory buffers, which can be kept in memory in a circular buffer, or written to disk in a sequential or circular fashion.

     

    image

    Some uses of ETW:

     

    Today on Windows Vista, Windows 7, and sparingly on earlier OSes; ETW is used by Microsoft Development and Microsoft Support, as well as others; to help troubleshoot issues, find root-cause of bugs, analyze performance; and a large variety of other tasks.

    As it stands today, ETW has two major thrusts:

    1. Exposing time-based event data similar in practicality to plain old text based logged (without the normal performance overhead). Troubleshooting logs have long been used across the industry by software developers and IT professionals to troubleshoot issues. Here are some examples of how ETW logging is used.

    First off, shown below is Windows Event Viewer enabling viewing of analytic and debug logs (which show you many more logs than the default view):

    image

    A variety of logs can be viewed graphically using the Event Viewer or via the built-in command line OS utility – wevtutil. There are a quite a variety of logs that are enabled by default and are available via the Event Viewer or in a kernel memory dump.

    Since ETW is also a foundational component, you might find it appearing in other troubleshooting utilities. For example, Office Communications Server 2007 uses ETW for troubleshooting.

    ETW is used to monitor for network events from the kernel which are then added to Process Monitor, a utility from Microsoft Sysinternals. You should also see tools such netsh and NetMon being able to capture and decode ETW traces.

     

    image

    There are many other methods and utilities to view the ETW tracing providers available, which may be found in the OS, via the Device Driver Kit (DDK), enabled with Microsoft products (such as Exchange or IIS), or used by tools, such as Office Communication System Logging.

    The second major thrust of ETW is exposing performance data.

    2. Performance data is exposed from the lowest levels of the kernel, driver subsystems, and every major user-mode component of Windows (such as Windows Media Player).

    Performance data can be used in a hybrid fashion both for troubleshooting and performance reasons. Take for example a built in OS tool that uses ETW tracing to provide rich data - Resource Monitor.

    Resource Monitor is available from the Windows Task Manager - Performance Tab. The cool thing about Resource Monitor is that it internally uses kernel ETW tracing to gather its data, providing rich data from the kernel itself; and thus providing a practical exposure of ETW technology for Administrators to troubleshoot their systems. This goes beyond the data that Task Manager provides, which often tends to be based on performance counters which you cannot drill down into further.

    image

    image

    Resource Monitor is useful from both a performance standpoint, as well as troubleshooting. For example, you can view top cpu users, top disk users (why your disk is thrashing), or search for a locked file handle that is open.

    For a deeper drill-down into performance data XPerf can be used to capture ETW traces and make them available for decoding offline (on another box).

    Here, Xperf (available with the Windows DDK and as a standalone download) is shown looking at Internet Explorer's file access time.

    image

    There have been several posts on the NTDebugging Blog regarding XPerf, other blog posts, as well as the Windows Performance Analysis Dev Center.

    Where to go from here:

     

    In Windows 7 and Windows Server 2008 R2, ETW tracing has even been further improved with many more providers including 600+ inbox providers’ registered on the box. This number increases when add-on features or products are installed (such as Hyper-V).

    The tracing provided by ETW, whether it is used directly in the OS, or via add-on tools, provides deep insight into Windows and how it works. We will explore ETW in depth in future blog posts.

     

     

    Share this post :

  • Ntdebugging Blog

    Debugging a Crash, Found a Trojan

    • 8 Comments

    Hi, I'm Manish from Global Escalation Services. I would like to present a multiple random bug check issue, which was caused by malicious code (trojan) running on the machine. This is the walkthrough of how we found the virus on the server.


    In this particular dump, the machine crashed with Bugcheck 0xA (IRQL_NOT_LESS_OR_EQUAL) because we got a Page Fault at dispatch level, also known as DPC Level (IRQL 2). Windows system architecture governs that we cannot have a page fault at dispatch level because paging requires I/O, I/O requires a wait, and we cannot wait while the IRQL is above dispatch level.  So when this anomaly happens Windows will intentionally crash the machine.

    We trapped in CcMapData because we touched the address c226d800 which was not valid because it’s paged out. If we look at implementation of CcMapData on MSDN http://msdn.microsoft.com/en-us/library/windows/hardware/ff539155(v=vs.85).aspx we see that this function can only be called below Dispatch Level, but the current CPU IRQL is 2.


    So how did this happen?  The most likely possibility is that some driver on the stack raised the IRQL by calling KeRaiseIrql and then forgot to lower it by calling KeLowerIrql.  There are many drivers on this stack and anyone could be the culprit. It is difficult to track when one of these may have done it as it’s long gone.


    We could have enabled IRQL Checking using Driver Verifier to find the culprit but we have to enable it on all drivers listed in the stack (and possibly others), which could cause some performance issues.  Also this was not the only crash we were seeing; there were various stop codes.

    So before going the verifier route I decided to dig more into this dump. I noticed there is an address on the stack for which the module name is not getting resolved. This looked odd (suspicious) as most of the time the debugger does a great job of finding the module.

    I started investigating what is this module.  It turned out to be a Trojan “TrojanDropper:Win32/Sirefef.B”.
    http://www.microsoft.com/security/portal/Threat/Encyclopedia/Entry.aspx?Name=TrojanDropper%3AWin32%2FSirefef.B

     

    0: kd> !analyze -v

    *******************************************************************************

    *                                                                             *

    *                        Bugcheck Analysis                                    *

    *                                                                             *

    *******************************************************************************

    IRQL_NOT_LESS_OR_EQUAL (a)

    An attempt was made to access a pageable (or completely invalid) address at an

    interrupt request level (IRQL) that is too high.  This is usually

    caused by drivers using improper addresses.

    If a kernel debugger is available get the stack backtrace.

    Arguments:

    Arg1: c226d800, memory referenced

    Arg2: d0000002, IRQL

    Arg3: 00000000, bitfield :

          bit 0 : value 0 = read operation, 1 = write operation

          bit 3 : value 0 = not an execute operation, 1 = execute operation (only on chips which support this level of status)

    Arg4: 808b64a6, address which referenced memory

     

    0: kd> kv

    ChildEBP RetAddr  Args to Child             

    f78ae41c 808b64a6 badb0d00 00000000 00000001 nt!KiTrap0E+0x2a7 (FPO: [0,0] TrapFrame @ f78ae41c)

    f78ae4cc f71a6f2d 8b22d520 f78ae4fc 00000400 nt!CcMapData+0x8c (FPO: [Non-Fpo])

    f78ae4ec f71a4494 f78ae7ec 8b64c150 01c6d800 Ntfs!NtfsMapStream+0x4b (FPO: [Non-Fpo])

    f78ae560 f71a6df0 f78ae7ec 8b3a7100 e7c50ce0 Ntfs!NtfsReadMftRecord+0x86 (FPO: [Non-Fpo])

    f78ae598 f71a6fac f78ae7ec 8b3a7100 e7c50ce0 Ntfs!NtfsReadFileRecord+0x7a (FPO: [Non-Fpo])

    f78ae5d0 f718e46d f78ae7ec e7c50cd8 e7c50ce0 Ntfs!NtfsLookupInFileRecord+0x37 (FPO: [Non-Fpo])

    f78ae67c f718e541 f78ae7ec e7c50cd8 f718e2cb Ntfs!NtfsWalkUpTree+0xbe (FPO: [Non-Fpo])

    f78ae6d8 f718e263 f78ae7ec e7c50cd8 00000000 Ntfs!NtfsBuildNormalizedName+0x44 (FPO: [Non-Fpo])

    f78ae704 f7196c2e f78ae7ec 8801d600 e7c50da0 Ntfs!NtfsQueryNameInfo+0x4b (FPO: [Non-Fpo])

    f78ae774 f71a0ff6 f78ae7ec 88081cb0 8b5f9260 Ntfs!NtfsCommonQueryInformation+0x291 (FPO: [Non-Fpo])

    f78ae7d8 f71a102f f78ae7ec 88081cb0 00000001 Ntfs!NtfsFsdDispatchSwitch+0x12a (FPO: [Non-Fpo])

    f78ae8f4 8081df85 8b3a7020 88081cb0 88081cb0 Ntfs!NtfsFsdDispatchWait+0x1c (FPO: [Non-Fpo])

    f78ae908 f721fd28 8b2fd220 8b60d308 8b54b020 nt!IofCallDriver+0x45 (FPO: [Non-Fpo])

    f78ae934 8081df85 8b5f9260 88081cb0 88081cb0 fltMgr!FltpDispatch+0x152 (FPO: [Non-Fpo])

    f78ae948 f721fd28 88081ed0 8b60d308 c000000d nt!IofCallDriver+0x45 (FPO: [Non-Fpo])

    f78ae974 8081df85 8b54b020 88081cb0 88081cb0 fltMgr!FltpDispatch+0x152 (FPO: [Non-Fpo])

    f78ae988 f5c62bec 88081cb0 8b54a980 88081cb0 nt!IofCallDriver+0x45 (FPO: [Non-Fpo])

    WARNING: Stack unwind information not available. Following frames may be wrong.

    f78ae99c f5c5e3ee 8b03b690 00000000 f78ae9c0 CtxSbx+0x5bec

    f78ae9ac 8081df85 8b03b690 88081cb0 8b0528e0 CtxSbx+0x13ee

    f78ae9c0 f61074e1 8b0528e0 8801d600 f78aea00nt!IofCallDriver+0x45 (FPO: [Non-Fpo])

    f78ae9e4 f61075d0 8b0a9b80 00081cb0 f78aea08 CtxAltStr+0x44e1

    f78ae9f4 8081df85 8b0a9b80 88081cb0 88081ed0 CtxAltStr+0x45d0

    f78aea08 8b5cfc89 00000000 87ef0000 87ef9002 nt!IofCallDriver+0x45 (FPO: [Non-Fpo])

    f78aec90 8081df85 8b5ef610 8b08c110 8b08c2a8 0x8b5cfc89 <------------ This looks odd what is this module?

    f78aeca4 f7241607 8b08c2a8 00000000 f78aece8nt!IofCallDriver+0x45 (FPO: [Non-Fpo])

    f78aecb4 f72412b2 8b08c2a8 8b392b70 87f11974 CLASSPNP!SubmitTransferPacket+0xbb (FPO: [Non-Fpo])

    f78aece8 f7241533 00000000 00000e00 87f11808 CLASSPNP!ServiceTransferRequest+0x1e4 (FPO: [Non-Fpo])

    f78aed0c 8081df85 8b392ab8 00000000 8b2a2670 CLASSPNP!ClassReadWrite+0x159 (FPO: [Non-Fpo])

    f78aed20 f74c80cf 8b2b7d80 87f11998 f78aed44 nt!IofCallDriver+0x45 (FPO: [Non-Fpo])

    f78aed30 8081df85 8b6e4020 87f11808 87f119bc PartMgr!PmReadWrite+0x95 (FPO: [Non-Fpo])

    f78aed44 f7317053 87f119d8 8b6e8148 882b9888 nt!IofCallDriver+0x45 (FPO: [Non-Fpo])

    f78aed60 8081df85 8b2b7cc8 87f11808 87f119fc ftdisk!FtDiskReadWrite+0x1a9 (FPO: [Non-Fpo])

    f78aed74 f72cf4f5 885360cc 88536098 87f26810 nt!IofCallDriver+0x45 (FPO: [Non-Fpo])

    f78aed90 f72d1517 87f11808 885360cc 8b60c470 volsnap!VspDecrementIrpRefCount+0x14f (FPO: [Non-Fpo])

    f78aeda8 f72c0398 87f26810 87f26810 88536098 volsnap!VspWriteVolumePhase22+0x3d (FPO: [Non-Fpo])

    f78aee30 f72d15d3 00f26810 885360cc 00000000 volsnap!VspAcquireNonPagedResource+0xc6 (FPO: [Non-Fpo])

    f78aee50 8081e123 00000000 87f215e8 88536098 volsnap!VspWriteVolumePhase2+0x59 (FPO: [Non-Fpo])

    f78aee80 f7241829 f78aeeb0 f72413ec 8b392ab8 nt!IopfCompleteRequest+0xcd (FPO: [Non-Fpo])

    f78aee88 f72413ec 8b392ab8 87f215e8 00000001 CLASSPNP!ClassCompleteRequest+0x11 (FPO: [Non-Fpo])

    f78aeeb0 8081e123 00000000 8b08c428 8b08c5c0 CLASSPNP!TransferPktComplete+0x1fd (FPO: [Non-Fpo])

    f78aeee0 f7266545 8b6e30e8 8b08c428 f78aef24 nt!IopfCompleteRequest+0xcd (FPO: [Non-Fpo])

    f78aeef0 f7265a8a 87f856a8 00000001 00000000 SCSIPORT!SpCompleteRequest+0x5e (FPO: [Non-Fpo])

    f78aef24 f7265130 8b6e30e8 87f856a8 f78aef9b SCSIPORT!SpProcessCompletedRequest+0x6a7 (FPO: [Non-Fpo])

    f78aef9c 8083211c 8b6e30a4 8b6e3030 00000000 SCSIPORT!ScsiPortCompletionDpc+0x2bd (FPO: [Non-Fpo])

    f78aeff4 8088dba7 f3f4f92c 00000000 00000000 nt!KiRetireDpcList+0xca (FPO: [Non-Fpo])

    f3f4f948 80a603d9 ffdffa02 f3f4f980 f3f4f980 nt!KiDispatchInterrupt+0x37 (FPO: [Uses EBP] [0,0,1])

    f3f4f964 80a60577 8b6c226c f3f4f980 8088d91d hal!HalpCheckForSoftwareInterrupt+0x81 (FPO: [Non-Fpo])

    f3f4f970 8088d91d 8b304c00 000001a3 f3f4fa04 hal!HalEndSystemInterrupt+0x67 (FPO: [Non-Fpo])

    f3f4f970 809395e7 8b304c00 000001a3 f3f4fa04 nt!KiInterruptDispatch+0x5d (FPO: [0,2] TrapFrame @ f3f4f980)

    f3f4fa04 80939c37 88139178 88028301 00000008 nt!ObpAllocateObject+0x199 (FPO: [Non-Fpo])

    f3f4fa38 808f8d28 00000000 8b76aad0 f3f4fa70 nt!ObCreateObject+0x129 (FPO: [Non-Fpo])

    f3f4fb44 80937a40 8b6afd10 00000000 88028398 nt!IopParseDevice+0x710 (FPO: [Non-Fpo])

    f3f4fbc4 80933b74 00000000 f3f4fc04 00000042 nt!ObpLookupObjectName+0x5b0 (FPO: [Non-Fpo])

    f3f4fc18 808eaee7 00000000 00000000 dffa7c01 nt!ObOpenObjectByName+0xea (FPO: [Non-Fpo])

    f3f4fc94 808ec181 077ef198 c0140000 077ef14c nt!IopCreateFile+0x447 (FPO: [Non-Fpo])

    f3f4fcf0 808eec10 077ef198 c0140000 077ef14c nt!IoCreateFile+0xa3 (FPO: [Non-Fpo])

    f3f4fd30 808897ec 077ef198 c0140000 077ef14c nt!NtCreateFile+0x30 (FPO: [Non-Fpo])

    f3f4fd30 7c82847c 077ef198 c0140000 077ef14c nt!KiFastCallEntry+0xfc (FPO: [0,0] TrapFrame @ f3f4fd64)

    077ef22c 00000000 00000000 00000000 00000000 0x7c82847c

       

    The current CPU IRQL is 2, at which pagefaults cannot be serviced, hence windows crashed the machine.

    0: kd> !irql

    Debugger saved IRQL for processor 0x0 -- 2 (DISPATCH_LEVEL)

     

    Un-assembling from the return address of this unknown driver, to try to confirm that this is really driver code and that it really does belong in this call stack:

    0: kd> ub 8b5cfc89

    8b5cfc62 8bd7            mov     edx,edi

    8b5cfc64 c60605          mov     byte ptr [esi],5

    8b5cfc67 895e18          mov     dword ptr [esi+18h],ebx

    8b5cfc6a c7460400020000  mov     dword ptr [esi+4],200h

    8b5cfc71 c7460809000000  mov     dword ptr [esi+8],9

    8b5cfc78 c7461c70fb5c8b  mov     dword ptr [esi+1Ch],8B5CFB70h

    8b5cfc7f c64603e0        mov     byte ptr [esi+3],0E0h

    8b5cfc83 ff158c805d8b    call    dword ptr ds:[8B5D808Ch]


    Sure does call IofCallDriver

    0: kd> dps 8B5D808Ch l1

    8b5d808c  8081df40 nt!IofCallDriver

     

    So what is this driver? In Windows every image starts with a PE header which starts with letters “MZ”. So I started looking for PE header. I took the offset of the module on the stack and went back to its page boundary, then I started moving 1 page back at a time looking for the PE header.

    0: kd> dc 8b5cf000

    8b5cf000  0010b9f2 f02b0000 3b063c8b 83137538  ......+..<.;8u..

    8b5cf010  c08304e9 04f98304 b85fee73 00000001  ........s._.....

    8b5cf020  9cb8c35e b98b5d86 00000010 ff8bd02b  ^....]......+...

    8b5cf030  3b02348b 83137530 c08304e9 04f98304  .4.;0u..........

    8b5cf040  b85fee73 00000001 335fc35e ccc35ec0  s._.....^._3.^..

    8b5cf050  83ec8b55 78a04cec 538b5da4 c0b60f56  U....L.x.].SV...

    8b5cf060  827ae857 ff330000 33f46589 f845c7db  W.z...3..e.3..E.

    8b5cf070  00000400 8b084d8b 52510c55 50b4458d  .....M..U.QR.E.P

     

    0: kd> dc 8b5cf000-1000

    8b5ce000  01c73024 00000000 c70cc483 00800002  $0..............

    8b5ce010  845e5f00 b10874db 0815ff01 8b8b5d80  ._^..t.......]..

    8b5ce020  6a0c2444 15ff5000 8b5d8028 14c25b5d  D$.j.P..(.].][..

    8b5ce030  cccccc00 cccccccc cccccccc cccccccc  ................

    8b5ce040  6a306a56 2415ff00 8b8b5d80 74f685f0  Vj0j...$.].....t

    8b5ce050  6a006a53 68026a01 8b5d0892 5d08ae68  Sj.j.j.h..].h..]

    8b5ce060  ff006a8b 5d80ac15 ff56508b 5d80a815  .j.....].PV....]

    8b5ce070  000d8b8b ff8b5da2 5d80b015 6a006a8b  .....].....].j.j


    0: kd> dc 8b5cf000-1000*2

    8b5cd000  0689c033 89044689 46890846 1046890c  3....F..F..F..F.

    8b5cd010  89144689 ec831846 1c46891c 57204689  .F..F.....F..F W

    8b5cd020  85244689 8d0574f6 02eb284e c033c933  .F$..t..N(..3.3.

    8b5cd030  41890189 08418904 8b34468b 4e891048  ...A..A..F4.H..N

    8b5cd040  14508b10 8b145689 4e891848 1c508b18  ..P..V..H..N..P.

    8b5cd050  8b1c5689 4e891848 1c508b20 2424448d  .V..H..N .P..D$$

    8b5cd060  244c8d50 186a510c 8b245689 6a302454  P.L$.Qj..V$.T$0j

    8b5cd070  46c65205 46c70005 00000008 0c46c700  .R.F...F......F.

     

    Finally I found the PE header for this image.

    0: kd> dc 8b5cf000-1000*3

    8b5cc000  00905a4d 00000003 00000004 0000ffff  MZ..............

    8b5cc010  000000b8 00000000 00000040 00000000  ........@.......

    8b5cc020  00000000 00000000 00000000 00000000  ................

    8b5cc030  00000000 00000000 00000000 000000d0  ................

    8b5cc040  0eba1f0e cd09b400 4c01b821 685421cd  ........!..L.!Th

    8b5cc050  70207369 72676f72 63206d61 6f6e6e61  is program canno

    8b5cc060  65622074 6e757220 206e6920 20534f44  t be run in DOS

    8b5cc070  65646f6d 0a0d0d2e 00000024 00000000  mode....$.......

     

    Using the built-in debugger extension !dh I dumped the header of this image to find the name.  Unfortunately there is no name for this image. This address 8b5cc000 is not in the loaded module list, which raised further suspicion. Either it is hiding its load address or this driver was not loaded by standard loading mechanism. This module date shows it’s pretty recent build.

    0: kd> !dh 8b5cc000

     

    File Type: DLL

    FILE HEADER VALUES

         14C machine (i386)

           4 number of sections

    4EA3461E time date stamp Sun Oct 23 04:09:26 2011

     

           0 file pointer to symbol table

           0 number of symbols

          E0 size of optional header

        2102 characteristics

                Executable

                32 bit word machine

                DLL

     

    OPTIONAL HEADER VALUES

         10B magic #

        9.00 linker version

        A400 size of code

        2000 size of initialized data

           0 size of uninitialized data

        46C0 address of entry point

        1000 base of code

             ----- new -----

    10000000 image base

        1000 section alignment

         200 file alignment

           1 subsystem (Native)

        5.00 operating system version

        0.00 image version

        5.00 subsystem version

       10000 size of image

         400 size of headers

        DF1F checksum

    00100000 size of stack reserve

    00001000 size of stack commit

    00100000 size of heap reserve

    00001000 size of heap commit

           0  DLL characteristics

           0 [       0] address [size] of Export Directory

        C91C [      3C] address [size] of Import Directory

           0 [       0] address [size] of Resource Directory

           0 [       0] address [size] of Exception Directory

           0 [       0] address [size] of Security Directory

        F000 [     3FC] address [size] of Base Relocation Directory

           0 [       0] address [size] of Debug Directory

           0 [       0] address [size] of Description Directory

           0 [       0] address [size] of Special Directory

           0 [       0] address [size] of Thread Storage Directory

           0 [       0] address [size] of Load Configuration Directory

           0 [       0] address [size] of Bound Import Directory

        C000 [     1A8] address [size] of Import Address Table Directory

           0 [       0] address [size] of Delay Import Directory

           0 [       0] address [size] of COR20 Header Directory

           0 [       0] address [size] of Reserved Directory

     

     

    SECTION HEADER #1

       .text name

        A354 virtual size

        1000 virtual address

        A400 size of raw data

         400 file pointer to raw data

           0 file pointer to relocation table

           0 file pointer to line numbers

           0 number of relocations

           0 number of line numbers

    60000020 flags

             Code

             (no align specified)

             Execute Read

     

    SECTION HEADER #2

      .rdata name

        13AC virtual size

        C000 virtual address

        1400 size of raw data

        A800 file pointer to raw data

           0 file pointer to relocation table

           0 file pointer to line numbers

           0 number of relocations

           0 number of line numbers

    40000040 flags

             Initialized Data

             (no align specified)

             Read Only

     

    SECTION HEADER #3

       .data name

         4B0 virtual size

        E000 virtual address

         200 size of raw data

        BC00 file pointer to raw data

           0 file pointer to relocation table

           0 file pointer to line numbers

           0 number of relocations

           0 number of line numbers

    C0000040 flags

             Initialized Data

             (no align specified)

             Read Write

     

    SECTION HEADER #4

      .reloc name

         576 virtual size

        F000 virtual address

         600 size of raw data

        BE00 file pointer to raw data

           0 file pointer to relocation table

           0 file pointer to line numbers

           0 number of relocations

           0 number of line numbers

    42000040 flags

             Initialized Data

             Discardable

             (no align specified)

             Read Only


    0: kd> !lmi 8b5cc000 

    Loaded Module Info: [8b5cc000]

    ffffffff8b5cc000 is not a valid address

     

    0: kd> lmvm 8b5cc000 

    start    end        module name


    Checking the import table it does have calls to raise and lower the IRQL. We get to the import table by taking the base address plus the offset to the Import Address Table Directory.  The below output is just a snippet of the whole table.

    0: kd> dps 8b5cc000+c000

    8b5d8000  80a603f4 hal!KfLowerIrql

    8b5d8004  80a5ff00 hal!KeGetCurrentIrql

    8b5d8008  80a600b4 hal!KfRaiseIrql

     

    Dumping the entire image contents in memory to find more clues about this driver.  Again, I am only showing snippets of the whole output.

    0: kd> dc 8b5cc000  L?10000/4

    8b5cc000  00905a4d 00000003 00000004 0000ffff  MZ..............

    8b5cc010  000000b8 00000000 00000040 00000000  ........@.......

    8b5cc020  00000000 00000000 00000000 00000000  ................

    8b5cc030  00000000 00000000 00000000 000000d0  ................

    8b5cc040  0eba1f0e cd09b400 4c01b821 685421cd  ........!..L.!Th

    8b5cc050  70207369 72676f72 63206d61 6f6e6e61  is program canno

    8b5cc060  65622074 6e757220 206e6920 20534f44  t be run in DOS

    8b5cc070  65646f6d 0a0d0d2e 00000024 00000000  mode....$.......

    8b5d5e20  76000000 66697265 77252079 00000a5a  ...verify %wZ...

    8b5d5e30  31000000 35343332 39383736 33323130  ...1234567890123

    8b5d5e40  37363534 31303938 35343332 39383736  4567890123456789

    8b5d5e50  33323130 37363534 31303938 35343332  0123456789012345

    8b5d5e60  39383736 33323130 37363534 31303938  6789012345678901

    8b5d5e70  00343332 66000000 646e756f 67697320  234....found sig

    8b5d5e80  7574616e 3d206572 0a752520 b4000000  nature = %u.....   

     

    Interestingly this image has other images (modules) embedded in it.  We can see other PE headers, which again have no name.  This looks highly suspicious and resembles behavior used by malicious software.

    8b5d0970  f775c085 4848c78b 4dc35f5e 0300905a  ..u...HH^_.MZ...

    8b5d0980  04000000 ff000000 b80000ff 00000000  ................

    8b5d0990  40000000 00000000 00000000 00000000  ...@............

    8b5d09a0  00000000 00000000 00000000 00000000  ................

    8b5d09b0  00000000 c8000000 0e000000 000eba1f  ................

    8b5d09c0  21cd09b4 cd4c01b8 69685421 72702073  ...!..L.!This pr

    8b5d09d0  6172676f 6163206d 746f6e6e 20656220  ogram cannot be

    8b5d09e0  206e7572 44206e69 6d20534f 2e65646f  run in DOS mode.

    8b5d2200  4d00004e 0300905a 04000000 ff000000  N..MZ...........

    8b5d2210  b80000ff 00000000 40000000 00000000  ...........@....

    8b5d2220  00000000 00000000 00000000 00000000  ................

    8b5d2230  00000000 00000000 00000000 f0000000  ................

    8b5d2240  0e000000 000eba1f 21cd09b4 cd4c01b8  ...........!..L.

    8b5d2250  69685421 72702073 6172676f 6163206d  !This program ca

    8b5d2260  746f6e6e 20656220 206e7572 44206e69  nnot be run in D

    8b5d2270  6d20534f 2e65646f 240a0d0d 00000000  OS mode....$....

     

    After ensuring my AV definitions were up to date, I decided to dump this memory contents into a file onto my machine. The moment contents were written my AV Microsoft Forefront Endpoint Protection caught a Trojan in this file.

    0: kd> .writemem c:\temp\drv.sys 8b5cc000   L?10000
    Writing 10000 bytes................................

       
    image001

     

    image003


    Before any further trouble shooting we asked the customer to clean the Trojan. Once we did that bug checks disappeared.


    Hope you enjoyed reading this post.
  • Ntdebugging Blog

    Understanding !PTE , Part 1: Let’s get physical

    • 2 Comments

    Hello. It’s Ryan Mangipano again (Ryanman). Today’s blog will be the first in a multi-part post designed to help you understand the output of the !PTE debuger comand along with the basics of hardware Virtual Addressing.  To better understand Virtual Addressing, we will use the debugger to manually translate a 4-KByte Page Table PAE virtual address into the actual physical addresses in order to understand what !PTE is displaying. I’ll provide relevant information about Virtual Addresses and Virtual Memory along the way.  

     

    We’ll start by translating a non-prototype valid hardware VA from an x86 PAE system

    The actual process of manually decoding a virtual addresses is going to vary according to the architecture (x86, x64), size of the page, whether or not the virtual address is a large page, whether the page is marked as valid, whether it’s a hardware or software PTE, and whether PAE is enabled. For simplicity, we will not be going over the table entry (PTE/PDE)  flags until part two of the blog. For my first example, I am going to demonstrate how to use the information in the processor manuals together with the debugger to decode a valid non-prototype virtual address into the physical memory that it references.  You can then try this on your own using windbg and the !pte command to validate your findings.  

     

    Finding an address to translate

    To start, we'll need to locate a virtual address that maps to a valid PTE. I am going to use the following highlighted virtual address which I found in a memory dump.  

     

    f9a12d0c ff155400a1f9    call    dword ptr [sfilter+0x4054 (f9a10054)]

     

    Get out your processor manuals

    The AMD and Intel manuals both contain helpful reference material on this subject. PDFs of these manuals are available online. Since my CPU is an Intel, I’m going to refer to the Intel manuals.

     

    1: kd> !cpuinfo

    CP  F/M/S Manufacturer  MHz PRCB Signature    MSR 8B Signature Features

     0 15,6,4 GenuineIntel 3192 0000000400000000                   a0073fff

     

    My Intel manuals arrived on CD via snail mail a few days after placing my free order:

    http://www.intel.com/products/processor/manuals/order.htm

     

    Is PAE in use?

    On this Intel x86 system, we first need to determine if we are using PAE. In the Intel “System Programming section 3.6.1 Paging Options” I found that the PAE (Physical Address Extension) flag can be found in bit 5 of the CR4 register. The PG (Paging) flag in CR0 which enables paging must of course also be set. Register bit numbering starts at zero so it’s the fifth bit from the right. Let’s examine cr4 and convert the value it contains into binary:

     

    1: kd> r cr4

    cr4=000006f9

     

    1: kd> .formats 000006f9

      Binary:  00000000 00000000 00000110 11111001

     

    You can use the debugger to check for other flags in the same manner

    You can use the above method to check for other flags that you find documented in the processor manuals. For example, you can see that bit 4 is also set in the cr4 register output above. This is the Page Size Extensions (PSE) bit which enables large page sizes.

     

    Terminology

    Paging is simply a method of dividing up the linear address space into chunks. Pages are simply the name that we give to the chunks that result. The size of these sections is referred to as the Page Size. On x86 systems, the standard page size is 4-KBytes.  A Large Page means that the page is larger than the standard size (2MB on PAE x86 or 4MB on non-PAE x86).

     

    Keep the last three

    On a standard 4-KByte page size virtual address, the address f9a10054 can be thought of as being split up as follows:

     

     Information needed to locate the base of the page in physical memory      f9a10       

    Offset into that physical page once it is found                               054

     

    This means that the last three hexadecimal digits of our physical address will also be 0x054. Once we find the base of our physical page(which will always end in three zeros), we can simply add it to the offset 0x54 and we will have our physical address.

     

    Are there any commands that work with physical addresses?

    We will also need a way to work with these physical addresses. Most of us are familiar with using the dd command with virtual addresses. There is also a command that accepts a physical address instead of a virtual address:  !dd  (notice the ! before the command). There are also variants of the !dd command, such as !db , !du , !dq.

     

    What data do we need to obtain to perform this conversion?

    We need to determine how to use the f9a10 portion of the virtual address to find the physical page base. The Memory Management System sets up and maintains a hierarchy of tables that keep track of the mappings between virtual addresses and physical addresses. You will need the following information to traverse the tables yourself and convert this address.

     

    1.       Starting point. This will be in the form of a pointer to the base of the first table that you need to check

    2.       The number of table levels in use on the system and the size of each entry in the tables on our system

    3.       The offsets into the different tables and the bases of each table.

     

    Once you have all the above information, you can use the debugger to traverse these tables.

     

    Let’s get our plan together for how we are going to obtain these three pieces of information

    1.      Starting point. This can be obtained from the PDBR. The CR3 register is known as the Page Directory Base Register (PDBR) which points to the physical address of the base of the first table.

     

    2.      The number of table levels in use on the system and the size of each entry in the tables on our system. This information can also be obtained from the Intel processor manuals. As previously mentioned, the number of tables, the size of the entries in the tables, the flags, where the bits are split up, and the names of the tables vary according to platform and if you are using PAE or not.  All of these, however, use the same basic concepts:

    Register pointing to a àTable with entries pointing to a table à with more entries pointing to à another tableà pointing to the physical address of the page in memory.

     

    image

    Table1: By reading the Intel Processor Manual “System Programming:  Section 3.8, Physical Address Extension”, I was able to determine that the first table used for x86 PAE virtual addressing is called the Page-Directory-Pointer Table. It’s a table with only 4 Entries that are 64 bits (which is 8 bytes) wide. Each entry is referred to as a Page Directory Pointer Entry and abbreviated as PPE, PDP or PDPE depending on the source.  These entries provide the index into the page directory, which is known as the Page Directory Index (PDI). One of these four pointers will lead you to the physical address of the base of the next table that you need to visit in the x86 PAE hierarchy.  Just like the pointer that we used from the CR3 register, some of the bits in these table entries are not used as part of the index (referred to as a pointer in some documentation). We will grab the relevant bits and add the appropriate number of zeros to the index to obtain our physical address pointer.  I will cover what these other bits are used for in part two of this blog. We must substitute zeros for these bits.

     

    Table2: The table at the second level of tables in the x86 PAE hierarchy (which is referenced by the pointers in Table Level One) is called the Page Directory Table.  Don’t confuse this with the Page-Directory-Pointer Table. Each Page Directory table can hold 512 entries which are 64 bits in size. The entries in this table are called Page Directory Entries (PDE) and they provide Page Table Index (PTI). Just like the last table, these entries contain indexes (which we convert to a pointer by simply adding zeros) to the base of the next table in the hierarchy.

     

    Table3:  The last table is referred to as the Page Table. Each Entry in the page table is called a Page Table Entry (PTE) and provides the Page Offset.  Just like the last table, not all the bits are used for pointers. Each page table contains up to 512 entries which are also 64 bits in size. Each 64 bits entry in the table contains a pointer to the base of the page in physical memory.

     

    In summary there are 3 levels of tables when using x86 PAE.  These are the Page Directory Pointer Table, Page Directory Tables, and the Page Tables.

     

    1.      The offsets into the different tables and the bases of each table.

    Each table above is a listing of indexes that will be used to locate the base of the next table. However, once we arrive at each table, we will need to know the index or offset into the table in order to know which table to get to next. These offsets into the tables can be obtained from the virtual address itself.  Let’s review our virtual address again.  However, this time we will break the address down in binary:

     

     Virtual Address: f9a10054

     

    1: kd> .formats 0xf9a10054

      Binary:  11111001 10100001 00000000 01010100

     

    Page Directory Pointer Index(PDPI)       11                        Index into 1st table(Page Directory Pointer Table)

     Page Directory Index(PDI)                    111001 101          Index into 2nd table(Page Directory Table)

    Page Table Index(PTI)                            00001 0000          Index into 3rd table(Page Table)

    Byte Index                                               0000 01010100    0x054, the offset into the physical memory page

     

    So as you can see, the virtual address is nothing more than a bunch of indexes/offsets. 

     

    Putting all this data together to find the physical address

    Now that we have all the required data, let’s proceed to locate our physical address

     

    1.       Obtain our base pointer to the first table.  As we discussed earlier, we can obtain this value from the cr3 register.  Bits 0-4 of this register are not used for the pointer to the table base.  This means that in order to get the base pointer, we will need to replace these 5 least-significant bits with zero. This will result in the table base being located on a physical address that is always aligned to a 32-bit boundary. 

     

    Keep in mind that CR3 will have a different value here for each process. You must make sure that you are in the appropriate process context before proceeding.  This is because user mode tables are specific to a particular process. Notice that I said processes, not threads.  CR3 will not be changed when swapping threads, since each thread in a given process shares the same address space.  Tables relating the system address space (kernel mode) are shared between all processes.

     

    We’ll need to dump out the CR3 register (PDBR) in a format where we can view the last 5 bits. As you can see in the .formats output below, the 5 least significant bits are already set to zero. This means that the hexadecimal value located in the cr3 register is a pointer to the base of the first table. So we now have our starting point.  Physical address 023406e0 is the base of the first table.

     

    1: kd> .formats @cr3

      Hex:     023406e0

      Binary:  00000010 00110100 00000110 11100000

     

    The proper way to display the pointer out the value would be to & against the following mask

     

    1: kd> ? 0y11111111111111111111111111100000

    Evaluate expression: -32 = ffffffe0

     

    1: kd> .formats (@cr3 & ffffffe0)

      Binary:  00000010 00110100 00000110 11100000

     

    You can use the!process command to get and/or verify this value.

     

    1: kd> !process

    PROCESS ff981a58  SessionId: 0  Cid: 0d54    Peb: 7ffde000  ParentCid: 0550

        DirBase: 023406e0  ObjectTable: e1541510  HandleCount:  30.

                   

    You can also obtain this information from the EPROCESS structure.

     

    1: kd>  dt nt!_EPROCESS ff981a58 Pcb.DirectoryTableBase

       +0x000 Pcb                    :

          +0x018 DirectoryTableBase     : [2] 0x23406e0

     

    2.       Obtain our base to the second table by finding our index into the first table.  The first table (Page Directory Pointer Table) only has 4 entries and each entry is 64 bits wide. We know from the first two bits of the Virtual Address above that our offset into this table is 0y11  (The y tells the debugger the value is binary, instead of eleven). Eleven would be represented in the debugger as 0n11. We can simply multiply the offset (0y11) by the size of each entry and add the result to the base of the table to get our entry. The entries in the table are 8 bytes wide. We shall use sizeof() as shown below to obtain this value.  We can pass this math to the !dq command to dump the data at these physical addresses.

     

    1: kd> !dq (@cr3 & 0xffffffe0)+(0y11*@@(sizeof(nt!_MMPTE))) L1

    # 23406f8 00000000`05503801

     

    The processor manuals indicate that the first 12 bits of the entry above are not part of the pointer and must be discarded.  This will cause alignment on 4KB boundaries. Since each Hex digit above represents 4 bits, this means that we need to change 801 to 000. That gives us our physical address of the base of our second table, 05503000. We will accomplish this in the next step by ANDing our PDE against a mask.

     

    3.       Obtain our base to the third table by finding our index into the second table. The Second Table (Page Directory Table) works the same way, except that this table contains up to 512 entries (on PAE systems).  Keep in mind that there can be more than one Page Directory Tables for each process on a PAE system, however we are only concerned with the one that contains the data relating to our virtual address.  We know from the virtual address that the offset into this table is 0y111001101.  Calculate the address in the same manner as before. We will also need to set the last 12 bits to zero, just like before.

     

    1: kd> !dq (0x05503801 & 0xFFFFFF000) +( 0y111001101*@@(sizeof(nt! _MMPTE))) L1

    # 5503e68 00000000`0102d963

     

    The last three hexadecimal digits must be changed to zero, since they are not part of the pointer. This gives us the base address of our last table, the page table, 0x0102d000

     

    4.       Find the base of the physical page our memory resides in by finding our index into the third table. Using the base of the page table from the previous step, let’s add the index into this table that we obtained from the virtual address, 0y000010000. The last three digits need to be set to zero.

     

    1: kd> !dq (0x102d963&0xFFFFFF000)+(0y000010000*@@(sizeof(nt! _MMPTE))) L1

    # 102d080 00000000`02010121

     

    5.       So now we have the physical address of the base of the page, 2010000 Add the base of the page to our offset from the virtual address, 0x054.

     

    1: kd> ? (2010121 & 0xFFFFFF000) +0x054

    Evaluate expression: 33620052 = 02010054

     

    A shortcut is to simply change the 0x121 to 0x054 in the previous step.

     

    6.       Now, let’s dump out the data at our physical address and the virtual address

     

    1: kd> !dd (2010121 & 0xFFFFFF000) +0x054 L2

    # 2010054 804ef09c 804ef12c

     

    1: kd> dd 0xf9a10054 L2

    f9a10054  804ef09c 804ef12c

                    You can see that the data displayed by the two commands is the same.

     

    7.       Now let’s dump out the virtual address using the !PTE extension. Notice the values that it provides you with. Look above and compare the values displayed above to what is displayed below. You should now understand what the highlighted and bolded fields mean.

     

    1: kd> !pte 0xf9a10054

                   VA f9a10054

    PDE at 00000000C0603E68    PTE at 00000000C07CD080

    contains 000000000102D963  contains 0000000002010121

    pfn 102d       -G-DA--KWEV    pfn 2010       -G--A—KREV

     

    The virtual addresses in italics represent the virtual address of the PDE (Page Directory Entry) and the PTE (Page Table Entry). Also, please note that PFN represents Page Frame Number. PFN is the term used to describe what I referred to as “pointers to the base of the next table” in the hierarchy.  This is because it really isn’t a pointer; it’s an index into the table.

     

    Hopefully, the output of !PTE makes a lot of sense to you now. In part two of this blog, I’ll discuss what the PDE/PTE flags (-G-DA—KWEV) represent and provide an example of manual conversion of x86 PAE Large Page Virtual Addresses to Physical.

  • Ntdebugging Blog

    Troubleshooting Pool Leaks Part 3 – Debugging

    • 1 Comments

    In our previous articles we discussed identifying a pool leak with perfmon, and narrowing the source of the leak with poolmon.  These tools are often preferred because they are easy to use, provide verbose information, and can be run on a system without forcing downtime.  However, it is not always possible to get perfmon and poolmon data.  If a system is experiencing poor performance you may have a business need to get the system up and running as quickly as possible without allowing time to troubleshoot.  It is also possible to completely exhaust memory through a pool leak, leaving the system in a state where tools such as perfmon and poolmon will not work.  In these scenarios it may be possible to troubleshoot the poor performance by forcing a bugcheck, gathering a memory dump, and performing a post mortem analysis.

     

    Although a dump is not the ideal data to troubleshoot a leak, it can be done.  I say less than ideal because a dump is a snapshot of the system memory, and does not provide the historical data which perfmon would provide.  The lack of historical data makes it difficult to differentiate between high memory usage due to load and high memory usage due to a leak.  It is up to you, as the troubleshooter, to determine if the dump is sufficient evidence of a leak.  Sometimes identifying the tag and the driver will help you identify a known issue that causes a leak, or your knowledge of the driver architecture may allow you to determine if the memory usage is normal or not.  In some scenarios you may decide to start monitoring with perfmon and collect additional data for a future occurrence.

     

    The first step to debug a pool leak using a dump is to load the dump in windbg, set the symbol path, and reload symbols.

     

    1: kd> .symfix c:\symbols

    1: kd> .reload

    Loading Kernel Symbols

    ...............................................................

    ................................................................

    .....

     

    The !vm command will show memory utilization, the 1 flag will limit the verbosity of this command.  For the scenario of a pool leak, the significant values are “NonPagedPool Usage:” and “PagedPool Usage:”.  If the debugger identifies a value that is out of the normal range it will flag it, and we can see here that the debugger has flagged excessive nonpaged pool usage.  This is similar to the information we obtained in Part 1 using perfmon, but unlike perfmon we do not have trend data to indicate if this is temporary high pool usage due to load or if this is a leak.

     

    1: kd> !vm 1

     

    *** Virtual Memory Usage ***

            Physical Memory:      403854 (   1615416 Kb)

            Page File: \??\C:\pagefile.sys

              Current:   1048576 Kb  Free Space:   1015644 Kb

              Minimum:   1048576 Kb  Maximum:      4194304 Kb

            Available Pages:      106778 (    427112 Kb)

            ResAvail Pages:       225678 (    902712 Kb)

            Locked IO Pages:           0 (         0 Kb)

            Free System PTEs:   33533355 ( 134133420 Kb)

            Modified Pages:         4844 (     19376 Kb)

            Modified PF Pages:      4838 (     19352 Kb)

            NonPagedPool Usage:   155371 (    621484 Kb)

            NonPagedPool Max:     191078 (    764312 Kb)

            ********** Excessive NonPaged Pool Usage *****

            PagedPool 0 Usage:     27618 (    110472 Kb)

            PagedPool 1 Usage:      3848 (     15392 Kb)

            PagedPool 2 Usage:       299 (      1196 Kb)

            PagedPool 3 Usage:       283 (      1132 Kb)

            PagedPool 4 Usage:       344 (      1376 Kb)

            PagedPool Usage:       32392 (    129568 Kb)

            PagedPool Maximum:  33554432 ( 134217728 Kb)

            Session Commit:         7764 (     31056 Kb)

            Shared Commit:          6371 (     25484 Kb)

            Special Pool:              0 (         0 Kb)

            Shared Process:         5471 (     21884 Kb)

            PagedPool Commit:      32394 (    129576 Kb)

            Driver Commit:          2458 (      9832 Kb)

            Committed pages:      326464 (   1305856 Kb)

            Commit limit:         665998 (   2663992 Kb)

     

    The debugger can parse the pool tagging database and present similar information as poolmon.  The !poolused command will do this, the /t5 option will limit output to the top 5 consumers, the 2 flag will sort by nonpaged pool usage (use the 4 flag if your leak is in paged pool).

     

    1: kd> !poolused /t5 2

    ..

    Sorting by NonPaged Pool Consumed

     

                   NonPaged                  Paged

    Tag     Allocs         Used     Allocs         Used

     

    Leak       601    615424000          0            0    UNKNOWN pooltag 'Leak', please update pooltag.txt

    Pool         6      1717840          0            0    Pool tables, etc.

    nVsC       664      1531552          0            0    UNKNOWN pooltag 'nVsC', please update pooltag.txt

    netv      4369      1172224          1          144    UNKNOWN pooltag 'netv', please update pooltag.txt

    Thre       607       774048          0            0    Thread objects , Binary: nt!ps

     

    TOTAL     43424    634209952      63565    126487760

     

    The above output shows that the tag “Leak” is associated with almost all of the nonpaged pool usage.  This is the same information we obtained in Part 2 using poolmon.

     

    Now we must identify what drivers use the pool tag “Leak”. Because we have a snapshot of the system memory we can search the dump for this tag.  We can match each address to a module using the command lm a.

     

    1: kd> !for_each_module s -a @#Base @#End "Leak"

    fffff880`044b63aa  4c 65 61 6b 3b c1 0f 42-c1 41 8d 49 fd 8b d0 ff  Leak;..B.A.I....

    fffff880`044b6621  4c 65 61 6b 3b c1 0f 42-c1 33 c9 8b d0 ff 15 cc  Leak;..B.3......

    1: kd> lm a fffff880`044b63aa

    start             end                 module name

    fffff880`044b5000 fffff880`044bc000   myfault    (no symbols)

    1: kd> lm a fffff880`044b6621

    start             end                 module name

    fffff880`044b5000 fffff880`044bc000   myfault    (no symbols)

     

    The tag and driver name can be used to search the internet for known problems.  If a known issue is found a driver update may be available, and installing this update may prevent a future memory leak.

     

    If there are no updates available for the driver, or if this is your driver and you need to identify the cause of the leak, don’t panic.  In future articles we will show techniques for getting call stacks of pool allocations, these call stacks be used to identify under what conditions the driver leaks memory.

  • Ntdebugging Blog

    Capturing ADPlus CLR crashes

    • 2 Comments

    Trey Nash here again, and I would like to discuss a scenario we are all too familiar with.  You’ve worked your tail off for the past year, and for the past couple of months you even worked evenings and weekends.  Management rewarded your team with two weeks off in appreciation of your efforts.  But now that you’re back in the office, you’re hearing rumors percolating up from your tech support department that there are some cases where your application is crashing in the field for mysterious reasons.  What do you do?

    The application happens to have been built with an unhandled exception filter registered via the AppDomain.UnhandledException event.  Therefore, you at least know that the application is failing with an InvalidCastException, but you cannot imagine why this is happening.

    Wouldn’t it be nice if you could live debug on the affected system?  Unless you work on-site for your customer or your software is on a laptop and your customer is willing to send it to you, then I doubt you will get this opportunity.  What you need is a tool to capture the state of your application while it is failing.  Then the customer could capture this information and send it to you.

    Enter ADPlus.  ADPlus is a free tool in the Debugging Tools for Windows package that scripts the CDB debugger allowing you to capture dumps for one or multiple processes on a system.  It also offers the following advantages:

    ·         ADPlus can monitor desktop applications, services, etc.

    ·         ADPlus can monitor multiple processes on the system.  When it collects a dump of those processes, it freezes and dumps them simultaneously.  This is essential for tracking down problems with inter-process communications.

    ·         ADPlus supports xcopy deployment meaning the customer does not need to install anything via Windows Installer, etc.  This minimizes configuration changes on the machine, and that is music to customers’ ears.  

     

    Note:    Although ADPlus is xcopy installable, you still have to install the Debugging Tools for Windows package via Windows Installer as that is the only way Microsoft distributes it.  However, once you have installed Debugging Tools for Windows once, you can xcopy deploy ADPlus or the entire Debugging Tools for Windows package to another machine.  In fact, during development, I find it extremely handy to check in the development tools into the source repository.  Debugging Tools for Windows supports this by virtue of the fact that it is xcopy installable.

    For those of you that are savvy with Windows Installer, you can perform an admin install using the msi for Debugging Tools for Windows and that will allow you to extract the files without actually installing the package on the machine, for example:

    msiexec /a dbg_x86_6.11.1.404.msi

     

    With all of that said, let’s see how ADPlus can help you diagnose problems with .NET applications.

    The Sample Application

    In the rest of this post, I will reference the C# 3.0 sample application that I have put together to illustrate using ADPlus to capture an unhandled exception in a .NET application.  The code is listed below:

    using System;

    using System.Linq;

    using System.Runtime.Serialization;

     

    class A

    {

        public void SaySomething() {

            Console.WriteLine( "Yeah, Peter...." );

            throw new BadDesignException();

        }

    }

     

    class B : A

    {

    }

     

    class C

    {

    }

     

    class EntryPoint

    {

        static void Main() {

            DoSomething();

        }

     

        static void DoSomething() {

            Func<int, object> generatorFunc = (x) => {

                if( x == 7 ) {

                    return new C();

                } else {

                    return new B();

                }

            };

     

            var collection = from i in Enumerable.Range( 0, 10 )

                             select generatorFunc(i);

     

            // Let's iterate over each of the items in the collection

            //

            // ASSUMING THEY ARE ALL DERIVED FROM A !!!!

            foreach( var item in collection ) {

                A a = (A) item;

                try {

                    a.SaySomething();

                }

                catch( BadDesignException ) {

                    // Swallow these here. The programmer chose to

                    // use exceptions for normal control flow which

                    // is *very* poor design.

                }

            }

        }

    }

     

    public class BadDesignException : Exception

    {

        public BadDesignException() { }

        public BadDesignException( string msg ) : base( msg ) { }

        public BadDesignException( string msg, Exception x ) : base( msg, x ) { }

        protected BadDesignException( SerializationInfo si,

                                      StreamingContext ctx )

            :base( si, ctx ) { }

    }

     

    You can compile this sample code easily by putting it in a file, such as test.cs, and then from either a Visual Studio Command Prompt or from a Windows SDK Command Shell, execute the following:

    csc /debug+ test.cs

     

    Note:    The code is contrived and there are many bad things about this code from a design/coding standpoint, but that is intentional for the sake of illustration.  For example, one may want to re-think introducing a collection that contains references to System.Object.

    The code above also uses features that are new to C# 3.0 (for brevity) including lambda expressions and LINQ.  If you would like to become more familiar with them, visit the
    C# website on MSDN or reference one of the excellent books on C# such as Pro LINQ or Accelerated C# 2008.

     

    The notable section of the code to focus on is the foreach loop within the EntryPoint.Main() method.  In that foreach loop, we are iterating over a collection of objects where we are assuming that they all derive from type A.  Thus the code attempts to cast all instances to a reference of type A and since I have intentionally put an instance of type C within that collection, it will fail at some point with an exception of type System.InvalidCastException.

    Capturing the Problem

    Using ADPlus in crash mode, we can capture the exception in the customer’s environment.  To illustrate this in action, let’s launch ADPlus in crash mode and monitor the test.exe example application built above by using the following command line:

    adplus -crash -o c:\temp\test -FullOnFirst -sc c:\temp\test\test.exe

     

    Note:    I built and tested this code in a directory named c:\temp\test on my machine as I wrote this.  Therefore, you will see references to it throughout this post.  Incidentally, I have found ADPlus is a lot easier to use if you feed it fully qualified paths.  If you get mysterious errors or behavior, and you are using relative paths with ADPlus, try fully qualifying the paths before you beat your head against the wall for too long trying to figure out what could be going wrong.

     

    ADPlus is easier to launch if you add the directory where the Debugging Tools for Windows was installed to your PATH environment variable.  The command line above assumes this has been done.  Now, let me explain the command line options that I used above.  I highly recommend that you become familiar with all of the ADPlus command line options.

    ·         -crash launches ADPlus in crash mode.  This is the mode you want to use if your application is failing because of an unhandled exception.

    ·         -o c:\temp\test tells ADPlus that I want the output to be placed in c:\temp\test, which is the directory in which I built the test.exe application.

    ·         -FullOnFirst is very important for managed applications..  This tells ADPlus to grab a full process dump on first chance exceptions.  It is essential that you capture full dumps for managed applications, otherwise, all of the necessary data regarding the execution engine and the managed heap will be absent from the dump making it impossible to debug effectively.

    ·         -sc c:\temp\test.exe is only one of the ways you can point ADPlus to an application to monitor.  In this case, we’re instructing ADPlus to tell the debugger to explicitly launch the application.  Had it been a service, or if the application you want to monitor is already running, we would probably have used –p to attach to its PID or –pn to attach to the process by name.  Notice that I provided the full path to the application.

    After you launch ADPlus, you are presented with the following dialog unless you use the –quiet option.

     

    Once the application is finished executing, go to the directory that you specified in the ADPlus –o command line option and you should see a subdirectory named similarly to what you see in the previous dialog snapshot.  For the instance of test.exe I just executed, that directory is named Crash_Mode__Date_05-11-2009__Time_21-12-54PM.  Under that directory, there are quite a few files that I have listed below:

    C:\temp\test\Crash_Mode__Date_05-11-2009__Time_21-12-54PM>dir /b
    ADPlus_report.txt
    CDBScripts
    PID-0__Spawned0__1st_chance_CPlusPlusEH__full_15ac_2009-05-11_21-13-01-332_0eac.dmp
    PID-0__Spawned0__1st_chance_NET_CLR__full_15ac_2009-05-11_21-12-55-482_0eac.dmp
    PID-0__Spawned0__1st_chance_NET_CLR__full_15ac_2009-05-11_21-12-56-527_0eac.dmp
    PID-0__Spawned0__1st_chance_NET_CLR__full_15ac_2009-05-11_21-12-57-370_0eac.dmp
    PID-0__Spawned0__1st_chance_NET_CLR__full_15ac_2009-05-11_21-12-58-103_0eac.dmp
    PID-0__Spawned0__1st_chance_NET_CLR__full_15ac_2009-05-11_21-12-58-867_0eac.dmp
    PID-0__Spawned0__1st_chance_NET_CLR__full_15ac_2009-05-11_21-12-59-663_0eac.dmp
    PID-0__Spawned0__1st_chance_NET_CLR__full_15ac_2009-05-11_21-13-00-505_0eac.dmp
    PID-0__Spawned0__1st_chance_NET_CLR__full_15ac_2009-05-11_21-13-02-689_0eac.dmp
    PID-0__Spawned0__1st_chance_Process_Shut_Down__full_15ac_2009-05-11_21-13-27-743_0eac.dmp
    PID-0__Spawned0__2nd_chance_NET_CLR__full_15ac_2009-05-11_21-13-04-140_0eac.dmp
    PID-0__Spawned0__Date_05-11-2009__Time_21-12-54PM.log
    Process_List.txt

    For each first chance exception, a dump file (.dmp) has been collected.  Notice that the dump files can tend to be very large because of the –FullOnFirst option.  You can subsequently load these dump files into either Windbg (or its variants) or the Visual Studio Debugger.  Personally, I prefer Windbg because I can then load the SOS extension along with the SOSEX extension and dig into the state of the application and the CLR.

    Using the ADPlus default configuration as we have above, you can see that ADPlus generated dumps for one first-chance C++ exception, eight first-chance CLR exceptions, one second-chance CLR exception and one dump collected during process shut down.

    What is Included in the Output

    If you look in your Debugging Tools for Windows directory, you’ll notice that ADPlus is really just a very complex VBScript.  It generates quite a bit of useful information along with any problem dump files.  ADPlus_report.txt reports the configuration for ADPlus for this run.  This is handy if you need to know what it will do for a specific type of exception.

    Process_list.txt is generated from executing tlist.exe, another tool that comes with Debugging Tools for Windows.

    And finally, the CDBScripts subdirectory contains a .cfg file which is the exact debugger script ADPlus generated and subsequently fed to CDB to get the job done.  On my machine, when running against the sample application using the command line from the previous section, this file is named PID‑0__Spawned0.cfg.  If you’re ever curious or need to know exactly what ADPlus instructed the debugger to do, this file is the source.

     

    Note:    The reason the name contains a zero PID is because we used the –sc option to launch the application.  Had we used the –p or –pn option, the PID in the filename would not be zero.

     

    I don’t recommend executing this debugger script in a live debugger as doing so could overwrite the data that you already have collected.  Instead, if you need to couple ADPlus with live debugging, you should use the –gs option which I will describe shortly.

    Pinpointing a Specific Exception

    You’ll notice that in the previous run of ADPlus, quite a few dump files were generated and each one was fairly large.  In reality, there are times where a first chance exception does not always indicate a fatal situation.  In Accelerated C# 2008, I go into detail regarding how it is poor design practice to implement any sort of control flow or otherwise reasonably expected behavior using exceptions.  This is because exceptions are supposed to be truly exceptional events!  For more juicy details on how expensive exceptions are, I invite you to read an excellent blog post by Tess Ferrandez on the topic. At any rate, you may encounter situations where you get a lot of dumps for first chance exceptions you are not interested in as this example shows..

    To alleviate this situation, you can create an ADPlus configuration file coupled with the !StopOnException command provided by the SOS extension to instruct ADPlus to filter out only the exceptions you’re interested in.  To do this, I created a configuration file named filter.config with the following contents:

    <ADPlus>

     

        <!-- Configuring ADPlus to log only exceptions we're interested in -->

     

      <Exceptions>

         <Config>

            <!-- This is for the CLR exception -->

           <Code> clr </Code>

           <Actions1> Log </Actions1>

           <CustomActions1> .loadby sos mscorwks; !StopOnException System.InvalidCastException 1; j ($t1 = 1) '.dump /ma /u c:\dumps\InvalidCastException.dmp; gn' ; 'gn'  </CustomActions1>

           <ReturnAction1> VOID  </ReturnAction1>

           <Actions2> Log </Actions2>

         </Config>

      </Exceptions>

     

    </ADPlus>

     

     

    Note:    The <CustomActions1> element above is meant to be in one line in the config file.

     

    The <CustomActions1> element is the element of interest in this configuration file.  This element allows you to specify which commands the debugger should execute on first chance exceptions.  Within this element, you can put any valid debugger commands (except windbg GUI-related commands).  If you need to execute multiple commands, as I have above, you simply separate them with semicolons.  In the <CustomActions1> element shown above, I first load the SOS extension using the .loadby command.  Then, I use the !StopOnException command in predicate mode to set the $t1 pseudo-register to 1 if the exception is of type System.InvalidCastException and 0 otherwise.  Then, the following j command is used to create a full dump if $t1 is 1 and do nothing otherwise.  The gn command in both paths of the j command tells the debugger to go unhandled so that the exception is propagated up rather than swallowed by the debugger.  If you were to go handled, thus swallowing the exception, the exception handlers in the code would never see it and the debugger would alter the behavior of the application.  And finally, note that the .dump command used to create the dump indicates the path where the dump will be stored.  In this case, I am placing it in the c:\dumps directory on my machine.

    Now you can provide this configuration file to ADPlus using the following command:

    adplus -crash -o c:\temp\test -FullOnFirst -c c:\temp\test\filter.config -sc c:\temp\test\test.exe

    Notice that there are fewer dumps collected.  Along with the dump for the InvalidCastException, it also captures a C++ exception as well as when the application shuts down.  If you open the C++ exception dump in the debugger and inspect the stack, it shows that the CLR is in the process of generating the InvalidCastException.  The CLR catches the C++ exception and converts it into a managed exception.  The C++ exception dump was generated because I left the –FullOnFirst option in the previous command line.  You can eliminate the C++ exception dump by removing -FullOnFirst.

    Using ADPlus during Live Debugging

    I already hinted at the –gs ADPlus command line option earlier in this post.  This option allows you to create the debugger scripts ADPlus creates without actually running them in the debugger.  For example, if you execute the following command (I executed mine from my c:\temp\test directory):

    adplus -crash -o c:\temp\test -FullOnFirst -c c:\temp\test\filter.config -gs livedebug

    You will notice that ADPlus does not actually launch any debugger or your application.  Rather, it creates a subdirectory named livedebug.  When you go into that directory, you’ll notice it looks similar in layout to the crash directories created in the previous demonstration.  On my machine, I end up with the following two files:

    C:\temp\test\livedebug\ADPlus_report.txt
    C:\temp\test\livedebug\CDBScripts\PID-0__livedebug.cfg

    The PID-0__livedebug.cfg file is actually a debugger script file containing debugger commands.  All we have to do now is launch our test application in the debugger and then execute this script.  Within my c:\temp\test directory, I can launch the debugger using the following command:

    windbg test.exe

    Once inside the debugger, I can invoke the ADPlus debugger script by executing the following $$< command:

    $$<C:\temp\test\livedebug\CDBScripts\PID-0__livedebug.cfg

    Once you execute the $$< command, you will notice that the script takes over and performs the same actions as ADPlus does when run conventionally from the command line.

    As a further experiment, edit the filter.config file and remove the gn commands.  Now the debugger will wait for user input after executing the custom commands rather than continuing execution of the application.  This could be handy if you want the opportunity to perform debugging by hand if ADPlus encounters a certain situation.

    First Chance vs. Second Chance Exceptions

    Throughout this post, I have been focusing on first chance exceptions for the sake of illustration.  However, many times, you are only interested in the truly unhandled exceptions.  In that case, you would want to capture only second chance exceptions.  You certainly would want to capture first chance exceptions if you needed to capture the state at the exact point the exception was thrown and before the OS searches for any handlers.  Additionally, if you suspect that the application you are debugging may be handling an exception inappropriately (maybe even blindly swallowing it), then you certainly want to catch first chance exceptions in that case.

    Conclusion

    In this blog post I have introduced you to ADPlus and the utility it provides when troubleshooting problems in the field.  ADPlus lends itself well to capturing exceptional situations in the field since it requires no configuration changes on the affected machine thus making it an easy pill for your customers to swallow.  You may also find this very useful when working with your Quality Assurance team during the development phase.  For example, how many times have you encountered a situation where a problem only presents itself on a dusty old rarely-used machine in the back corner of a random QA engineer’s office?  How many times, in such situations, have you felt like the only way to trouble shoot the problem effectively is to install the Visual Studio debugger and start working on that machine?  Furthermore, what if the problem only happens on that dusty old machine about once a week.  ADPlus can help you avoid that madness by providing an easy mechanism for capturing full process dumps on the troubled machine so you can then take those dumps to your trusty development machine for further debugging and analysis.

    Have fun out there!

     

    Share this post :
Page 5 of 22 (216 items) «34567»