• Ntdebugging Blog

    Debugging Backwards: Proving root cause

    • 1 Comments

    Matt Burrough here again.  On rare occasions when debugging, we'll actually know (or strongly suspect) what the root cause of a problem is at the beginning of our analysis - but we still need to investigate to confirm our assertion.  The following is a case study for an issue I worked on recently where the print spooler was hanging.

     

    This customer had recently upgraded their print servers from Windows 2003 to Windows 2008 R2.  After the upgrade, the spooler would frequently go unresponsive, and no jobs could be printed.  Rebooting the server provided some relief, but the problem would reoccur as the jobs started coming in again.

     

    One of my peers had completed an initial analysis of a user mode memory dump of the spooler process, and found that spooler seemed to be blocked waiting on PrintIsolationHost.exe.  For those not familiar with recent changes to the print spooler's design - Print Isolation Host was added in Windows 7/2008 R2 as a way to isolate print drivers from each other and from the spooler process so a crash in one driver doesn't take down the entire printing environment.  Given the large number of print drivers found on some print servers, this can be a great help for stability and availability of the spooler.  See http://blogs.technet.com/b/askperf/archive/2009/10/08/windows-7-windows-server-2008-r2-print-driver-isolation.aspx if you would like more details on Print Isolation Host.

     

    Unfortunately for my team mate, the data collected did not include dumps of Print Isolation Host, so he requested that the next time the problem happened that both spooler and PrintIsolationHost dumps would be gathered.  The customer had configured his server for the "Isolated" Print Isolation Host option during troubleshooting, which places each driver in its own process.  (The default is shared which places all drivers in one PrintIsolationHost.exe instance.  Driver isolation is configured using the Print Management Console.)

     

    The next morning, the newly requested data came in, and since the problem was urgent, I began looking at the new dumps immediately.  This dataset included both a spoolsv.exe dump, as well as nearly two dozen PrintIsolationHost dumps!  I knew from the past analysis that I should start with the PrintIsolationHost data - but where to begin?  In order to triage the dumps, I wrote a small batch file to open each dump (luckily, they were sequentially numbered), write the call stacks in each thread to a file, and close the log.  On every iteration, the script created a new cmd.txt file, which contained a set of commands that were passed to the debugger.  This allowed me to name the debugger output files sequentially with names that matched their dump (e.g. PrintIsolationHost1.txt contained output from PrintIsolationHost1.dmp).

     

    set x=1

    :Top

    echo .reload > c:\data\cmd.txt

    echo .logopen c:\data\PrintIsolationHost%x%.txt  >> c:\data\cmd.txt

    echo ~*kc >> c:\data\cmd.txt

    echo .logclose >> c:\data\cmd.txt

    echo qq >> c:\data\cmd.txt

    c:\debuggers\kd.exe -cf c:\data\cmd.txt -z "c:\data\PrintIsolationHost%x%.DMP"

    set /A x+=1

    IF %x% LEQ 23 GOTO Top

     

    Now that I had a directory full of text files, I used my favorite differencing tool to compare the stacks in each text file.  I used the first PrintIsolationHost file as a reference.  It only had four stacks, and these were common to all of the other files:

     

    .  0  Id: 40c.1c28 Suspend: 0 Teb: 000007ff`fffde000 Unfrozen

    Call Site

    user32!ZwUserGetMessage

    user32!GetMessageW

    PrintIsolationHost!ATL::CAtlExeModuleT<Host>::Run

    PrintIsolationHost!Host::RunUsingConfiguration

    PrintIsolationHost!wWinMain

    PrintIsolationHost!__wmainCRTStartup

    kernel32!BaseThreadInitThunk

    ntdll!RtlUserThreadStart

     

       1  Id: 40c.23c0 Suspend: 0 Teb: 000007ff`fffdb000 Unfrozen

    Call Site

    ntdll!ZwWaitForMultipleObjects

    ntdll!TppWaiterpThread

    kernel32!BaseThreadInitThunk

    ntdll!RtlUserThreadStart

     

       2  Id: 40c.2598 Suspend: 0 Teb: 000007ff`fffd3000 Unfrozen

    Call Site

    ntdll!ZwWaitForSingleObject

    KERNELBASE!WaitForSingleObjectEx

    PrintIsolationHost!Host::MonitorShutdown

    PrintIsolationHost!Host::MonitorProc

    kernel32!BaseThreadInitThunk

    ntdll!RtlUserThreadStart

     

       3  Id: 40c.820 Suspend: 0 Teb: 000007ff`fffd9000 Unfrozen

    Call Site

    ntdll!ZwWaitForWorkViaWorkerFactory

    ntdll!TppWorkerThread

    kernel32!BaseThreadInitThunk

    ntdll!RtlUserThreadStart

     

    I was able to rule out a number of other PrintIsolationHost instances that were either identical to this one (aside from the process/thread IDs and Tebs), or which just had one or two additional idle worker threads (like thread 3 above).

     

    Things got interesting when I looked at two of the PrintIsolationHost dumps.  Both had these two stacks not found in any other the other dumps (note that I do not have symbols for the 3rd-party print processor ProseWarePrintProcessor):

     

       2  Id: 20a4.1328 Suspend: 0 Teb: 000007ff`fffda000 Unfrozen

    Call Site

    ntdll!ZwWaitForSingleObject

    ntdll!RtlpWaitOnCriticalSection

    ntdll!RtlEnterCriticalSection

    ntdll!LdrLockLoaderLock

    *** ERROR: Symbol file could not be found.  Defaulted to export symbols for ProseWarePrintProcessor.dll -

    KERNELBASE!GetModuleFileNameW

    ProseWarePrintProcessor

    ProseWarePrintProcessor

    ProseWarePrintProcessor!ControlPrintProcessor

    PrintIsolationProxy!sandbox::PrintProcessor::ControlPrintProcessor

    rpcrt4!Invoke

    rpcrt4!Ndr64StubWorker

    rpcrt4!NdrStubCall3

    ole32!CStdStubBuffer_Invoke

    ole32!SyncStubInvoke

    ole32!StubInvoke

    ole32!CCtxComChnl::ContextInvoke

    ole32!AppInvoke

    ole32!ComInvokeWithLockAndIPID

    ole32!ThreadInvoke

    rpcrt4!DispatchToStubInCNoAvrf

    rpcrt4!RPC_INTERFACE::DispatchToStubWorker

    rpcrt4!RPC_INTERFACE::DispatchToStub

    rpcrt4!RPC_INTERFACE::DispatchToStubWithObject

    rpcrt4!LRPC_SCALL::DispatchRequest

    rpcrt4!LRPC_SCALL::HandleRequest

    rpcrt4!LRPC_ADDRESS::ProcessIO

    rpcrt4!LrpcIoComplete

    ntdll!TppAlpcpExecuteCallback

    ntdll!TppWorkerThread

    kernel32!BaseThreadInitThunk

    ntdll!RtlUserThreadStart

     

       6  Id: 20a4.1668 Suspend: 0 Teb: 000007ff`fffac000 Unfrozen

    Call Site

    ntdll!ZwWaitForSingleObject

    ntdll!RtlpWaitOnCriticalSection

    ntdll!RtlEnterCriticalSection

    ProseWarePrintProcessor

    ProseWarePrintProcessor

    ProseWarePrintProcessor

    ntdll!LdrpInitializeThread

    ntdll!_LdrpInitialize

    ntdll!LdrInitializeThunk

     

    Interesting.  Thread 6 is running the DllMain code for the ProseWarePrintProcessor DLL, which holds the loader lock. It is waiting on a critical section.  Meanwhile, thread 2 is waiting on the loader lock.  So who holds the critical section that thread 6 wants?  First, let's find what lock thread 6 wants:

     

    0:006> kn

     # Child-SP          RetAddr           Call Site

    00 00000000`0104eb18 00000000`777fe4e8 ntdll!ZwWaitForSingleObject+0xa

    01 00000000`0104eb20 00000000`777fe3db ntdll!RtlpWaitOnCriticalSection+0xe8

    02 00000000`0104ebd0 00000000`750c5d6b ntdll!RtlEnterCriticalSection+0xd1

    03 00000000`0104ec00 00000000`750c6256 ProseWarePrintProcessor+0xabf

    04 00000000`0104ec30 00000000`750c7015 ProseWarePrintProcessor+0xfaa

    05 00000000`0104f090 00000000`777dc76c ProseWarePrintProcessor+0x1d69

    06 00000000`0104f1f0 00000000`777dc42f ntdll!LdrpInitializeThread+0x17c

    07 00000000`0104f2f0 00000000`777dc32e ntdll!LdrpInitialize+0x9f

    08 00000000`0104f360 00000000`00000000 ntdll!LdrInitializeThunk+0xe

    0:006> .frame /c /r 2

    02 00000000`0104ebd0 00000000`750c5d6b ntdll!RtlEnterCriticalSection+0xd1

    rax=0000000300d1001a rbx=00000000750ca330 rcx=00000000001d0000

    rdx=0000000000000040 rsi=0000000000000001 rdi=0000000000000004

    rip=00000000777fe3db rsp=000000000104ebd0 rbp=00000000ff000000

     r8=00000000002c6a00  r9=00000000002c6a10 r10=0000000000000073

    r11=0000000000000001 r12=000007fffffd6000 r13=00000000778e2660

    r14=0000000000000001 r15=000000007780a280

    iopl=0         nv up ei pl zr na po nc

    cs=0033  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000246

    ntdll!RtlEnterCriticalSection+0xd1:

    00000000`777fe3db 83f801          cmp     eax,1

    0:006> ub 00000000`750c5d6b      <<<< Let's see what gets passed to RtlEnterCriticalSection

    ProseWarePrintProcessor+0xaa2:

    00000000`750c5d4e f9              stc

    00000000`750c5d4f ff743f33        push    qword ptr [rdi+rdi+33h]

    00000000`750c5d53 d2ff            sar     bh,cl

    00000000`750c5d55 15aeb4ffff      adc     eax,0FFFFB4AEh

    00000000`750c5d5a 85c0            test    eax,eax

    00000000`750c5d5c 7533            jne     ProseWarePrintProcessor+0xae5 (00000000`750c5d91)

    00000000`750c5d5e 488d0dcb450000  lea     rcx,[ProseWarePrintProcessor+0x5084 (00000000`750ca330)] <<<< Critical section is in rcx

    00000000`750c5d65 ff15bdb3ffff    call    qword ptr [ProseWarePrintProcessor+0x1128 (00000000`750c1128)]

    0:006> u ntdll!RtlEnterCriticalSection

    ntdll!RtlEnterCriticalSection:

    00000000`77802fc0 fff3            push    rbx

    00000000`77802fc2 4883ec20        sub     rsp,20h

    00000000`77802fc6 f00fba710800    lock btr dword ptr [rcx+8],0

    00000000`77802fcc 488bd9          mov     rbx,rcx   <<<< Critical section is saved in RBX.  RBX isn't modified between here and our current position

    00000000`77802fcf 0f83e9b1ffff    jae     ntdll!RtlEnterCriticalSection+0x31 (00000000`777fe1be)

    00000000`77802fd5 65488b042530000000 mov   rax,qword ptr gs:[30h]

    00000000`77802fde 488b4848        mov     rcx,qword ptr [rax+48h]

    00000000`77802fe2 c7430c01000000  mov     dword ptr [rbx+0Ch],1

    0:006> r rbx

    Last set context:

    rbx=00000000750ca330  <<<< This is the address of the critical section thread 6 is waiting for

     

    Now let's look at the threads in this process, and the held locks:

     

    0:002> ~

    #  0  Id: 20a4.180c Suspend: 0 Teb: 000007ff`fffde000 Unfrozen

       1  Id: 20a4.2294 Suspend: 0 Teb: 000007ff`fffdc000 Unfrozen

    .  2  Id: 20a4.1328 Suspend: 0 Teb: 000007ff`fffda000 Unfrozen

       3  Id: 20a4.1c34 Suspend: 0 Teb: 000007ff`fffd8000 Unfrozen

       4  Id: 20a4.5bc Suspend: 0 Teb: 000007ff`fffd4000 Unfrozen

       5  Id: 20a4.3c4 Suspend: 0 Teb: 000007ff`fffae000 Unfrozen

       6  Id: 20a4.1668 Suspend: 0 Teb: 000007ff`fffac000 Unfrozen

    0:002> !locks

     

    CritSec ntdll!LdrpLoaderLock+0 at 00000000778e7490

    WaiterWoken        No

    LockCount          2

    RecursionCount     1

    OwningThread       1668

    EntryCount         0

    ContentionCount    2

    *** Locked

     

    CritSec ProseWarePrintProcessor+5084 at 00000000750ca330

    WaiterWoken        No

    LockCount          1

    RecursionCount     2

    OwningThread       1328

    EntryCount         0

    ContentionCount    1

    *** Locked

     

    Scanned 201 critical sections

     

    That's not good!  We can see that thread 6 indeed owns the loader lock, which thread 2 is waiting for.  But thread 2 owns the ProseWarePrintProcessor lock - and thread 6 is waiting for it!  This is a classic deadlock.  In fact, Raymond Chen even described this on his blog.  More information about LdrInitialize can be found here.

    image002

     

    So we know that there is a deadlock in the Print Isolation Host, but why exactly does this cause spooler to hang?  Here's where we work backwards.  We know that thread 6 is handling DLL initialization, but what is thread 2 doing?  From the stack we can see it is handling an RPC request that called into ProseWarePrintProcessor.  Let's determine who called into this thread.

     

    0:002> kn

     # Child-SP          RetAddr           Call Site

    00 00000000`0085dd88 00000000`777fe4e8 ntdll!ZwWaitForSingleObject+0xa

    01 00000000`0085dd90 00000000`777fe3db ntdll!RtlpWaitOnCriticalSection+0xe8

    02 00000000`0085de40 00000000`777db9e7 ntdll!RtlEnterCriticalSection+0xd1

    03 00000000`0085de70 000007fe`fd963706 ntdll!LdrLockLoaderLock+0x6d

    04 00000000`0085deb0 00000000`750c58f3 KERNELBASE!GetModuleFileNameW+0x96

    05 00000000`0085df10 00000000`750c5d77 ProseWarePrintProcessor!InstallPrintProcessor+0x647

    06 00000000`0085e380 00000000`750c51d9 ProseWarePrintProcessor!InstallPrintProcessor+0xacb

    07 00000000`0085e3b0 000007fe`f96766b4 ProseWarePrintProcessor!ControlPrintProcessor+0x25

    08 00000000`0085e3e0 000007fe`fe2f23d5 PrintIsolationProxy!sandbox::PrintProcessor::ControlPrintProcessor+0x34

    09 00000000`0085e420 000007fe`fe39b68e rpcrt4!Invoke+0x65

    0a 00000000`0085e470 000007fe`fe2f48d6 rpcrt4!Ndr64StubWorker+0x61b

    0b 00000000`0085ea30 000007fe`fdd50883 rpcrt4!NdrStubCall3+0xb5

    0c 00000000`0085ea90 000007fe`fdd50ccd ole32!CStdStubBuffer_Invoke+0x5b

    0d 00000000`0085eac0 000007fe`fdd50c43 ole32!SyncStubInvoke+0x5d

    0e 00000000`0085eb30 000007fe`fdc0a4f0 ole32!StubInvoke+0xdb

    0f 00000000`0085ebe0 000007fe`fdd514d6 ole32!CCtxComChnl::ContextInvoke+0x190

    10 00000000`0085ed70 000007fe`fdd5122b ole32!AppInvoke+0xc2

    11 00000000`0085ede0 000007fe`fdd4fd6d ole32!ComInvokeWithLockAndIPID+0x52b

    12 00000000`0085ef70 000007fe`fe2e50f4 ole32!ThreadInvoke+0x30d

    13 00000000`0085f010 000007fe`fe2e4f56 rpcrt4!DispatchToStubInCNoAvrf+0x14

    14 00000000`0085f040 000007fe`fe2e775b rpcrt4!RPC_INTERFACE::DispatchToStubWorker+0x146

    15 00000000`0085f160 000007fe`fe2e769b rpcrt4!RPC_INTERFACE::DispatchToStub+0x9b

    16 00000000`0085f1a0 000007fe`fe2e7632 rpcrt4!RPC_INTERFACE::DispatchToStubWithObject+0x5b

    17 00000000`0085f220 000007fe`fe2e532d rpcrt4!LRPC_SCALL::DispatchRequest+0x422

    18 00000000`0085f300 000007fe`fe302e7f rpcrt4!LRPC_SCALL::HandleRequest+0x20d

    19 00000000`0085f430 000007fe`fe302a35 rpcrt4!LRPC_ADDRESS::ProcessIO+0x3bf

    1a 00000000`0085f570 00000000`777cb68b rpcrt4!LrpcIoComplete+0xa5

    1b 00000000`0085f600 00000000`777cfeff ntdll!TppAlpcpExecuteCallback+0x26b

    1c 00000000`0085f690 00000000`776a652d ntdll!TppWorkerThread+0x3f8

    1d 00000000`0085f990 00000000`777dc521 kernel32!BaseThreadInitThunk+0xd

    1e 00000000`0085f9c0 00000000`00000000 ntdll!RtlUserThreadStart+0x1d

     

    I know that the code in frame 19 deals with processing the RPC and has a record of the calling process' PID and TID.  In fact, from a bit of code review, I know that at this portion of the code, we get back a value that contains a ntdll!_CLIENT_ID structure at offset 8:

     

    rpcrt4!LRPC_ADDRESS::ProcessIO+0xe6:

    000007fe`fe302ba6 ff151c050a00    call    qword ptr [rpcrt4!_imp_AlpcGetMessageFromCompletionList (000007fe`fe3a30c8)]

    000007fe`fe302bac 4885c0          test    rax,rax

    000007fe`fe302baf 0f84d1020000    je      rpcrt4!LRPC_ADDRESS::ProcessIO+0x3c6 (000007fe`fe302e86)

    000007fe`fe302bb5 4c8bac2488000000 mov     r13,qword ptr [rsp+88h]

    000007fe`fe302bbd 41bc01000000    mov     r12d,1

    000007fe`fe302bc3 33d2            xor     edx,edx

    000007fe`fe302bc5 488bf8          mov     rdi,rax

    000007fe`fe302bc8 41bfff000000    mov     r15d,0FFh

     

    Reviewing the assembly, from ProcessIO+0xe6 to where we are now (ProcessIO+0x3bf), we don't modify rdi again - and rdi is nonvolatile - so if we switch to that frame and check out rdi+8, we'll know who called this thread!

     

    0:002> .frame /c /r 19

    19 00000000`0085f430 000007fe`fe302a35 rpcrt4!LRPC_ADDRESS::ProcessIO+0x3bf

    rax=0000e5a47e7646ad rbx=0000000000000001 rcx=00000000750ca330

    rdx=0000000000000000 rsi=00000000002b8aa0 rdi=00000000002c4a80

    rip=000007fefe302e7f rsp=000000000085f430 rbp=0000000000ecff90

     r8=000000000085e2d8  r9=0000000000000002 r10=0000000000000000

    r11=0000000000000246 r12=0000000000ec8bf0 r13=0000000000000000

    r14=0000000000000000 r15=0000000000ec8080

    iopl=0         nv up ei pl zr na po nc

    cs=0033  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000246

    rpcrt4!LRPC_ADDRESS::ProcessIO+0x3bf:

    000007fe`fe302e7f 4c8d357ad1fbff  lea     r14,[rpcrt4!COMMON_ResubmitListen <PERF> (rpcrt4+0x0) (000007fe`fe2c0000)]

    0:002> dt _CLIENT_ID rdi+8

    ntdll!_CLIENT_ID

       +0x000 UniqueProcess    : 0x00000000`00002694 Void

       +0x008 UniqueThread     : 0x00000000`000005e8 Void

    0:002> ? 0x00000000`00002694

    Evaluate expression: 9876 = 00000000`00002694

    0:002> ? 0x00000000`000005e8

    Evaluate expression: 1512 = 00000000`000005e8

     

    So now we know that the caller was process 0x2694 and thread 0x5e8, or 9876 and 1512 in decimal, respectively.  Our current process (PrintIsolationHost.exe) is PID 20a4 (see above ~ output), or 8356 decimal.  So who is 9876?  I happen to have a process listing from the data collection:

     

    [0]  0 64 9876 spoolsv.exe     Svcs:  Spooler

         Command Line: C:\Windows\System32\spoolsv.exe

    [0]  0 64 8356 PrintIsolationHost.exe

         Command Line: C:\Windows\system32\PrintIsolationHost.exe -Embedding

     

    Okay, so I know the caller is thread 5e8 in spooler.  Loading up the spooler dump, what is that thread doing?

     

    0:000> ~~[5e8]s

    ntdll!NtAlpcSendWaitReceivePort+0xa:

    00000000`77801b6a c3              ret

    # Child-SP          RetAddr           Call Site

    00 00000000`0649d898 000007fe`fe2fa776 ntdll!NtAlpcSendWaitReceivePort+0xa

    01 00000000`0649d8a0 000007fe`fe2f4e42 rpcrt4!LRPC_CCALL::SendReceive+0x156

    02 00000000`0649d960 000007fe`fdd528c0 rpcrt4!I_RpcSendReceive+0x42

    03 00000000`0649d990 000007fe`fdd5282f ole32!ThreadSendReceive+0x40

    04 00000000`0649d9e0 000007fe`fdd5265b ole32!CRpcChannelBuffer::SwitchAptAndDispatchCall+0xa3

    05 00000000`0649da80 000007fe`fdc0daaa ole32!CRpcChannelBuffer::SendReceive2+0x11b

    06 00000000`0649dc40 000007fe`fdc0da0c ole32!CAptRpcChnl::SendReceive+0x52

    07 00000000`0649dd10 000007fe`fdd5205d ole32!CCtxComChnl::SendReceive+0x68

    08 00000000`0649ddc0 000007fe`fe39b949 ole32!NdrExtpProxySendReceive+0x45

    09 00000000`0649ddf0 000007fe`fdd521d0 rpcrt4!NdrpClientCall3+0x2e2

    0a 00000000`0649e0b0 000007fe`fdc0d8a2 ole32!ObjectStublessClient+0x11d

    0b 00000000`0649e440 000007fe`fb059070 ole32!ObjectStubless+0x42

    0c 00000000`0649e490 000007fe`fb057967 localspl!sandbox::PrintProcessorExecuteObserver::ControlPrintProcessor+0x10

    0d 00000000`0649e4c0 000007fe`fb055e27 localspl!sandbox::PrintProcessorAdapterImpl::ControlPrintProcessor+0x27

    0e 00000000`0649e4f0 000007fe`faff7392 localspl!sandbox::PrintProcessorAdapter::ControlPrintProcessor+0x1b

    0f 00000000`0649e520 000007fe`faff8a0a localspl!DeleteJob+0x1ce

    10 00000000`0649eae0 00000000`ff41fe25 localspl!SplSetJob+0x49e

    11 00000000`0649eb80 000007fe`f9683603 spoolsv!SetJobW+0x25

    12 00000000`0649ebc0 00000000`61001ce1 spoolss!SetJobW+0x1f

    13 00000000`0649ec00 00000000`61001d7d Contoso!InitializePrintMonitor+0x781

    14 00000000`0649ec40 000007fe`faffa674 Contoso!InitializePrintMonitor+0x81d

    15 00000000`0649ec70 00000000`ff41c9c7 localspl!SplEndDocPrinter+0x214

    16 00000000`0649ecd0 00000000`ff403ba6 spoolsv!EndDocPrinter+0x1f

    17 00000000`0649ed00 00000000`ff3fe772 spoolsv!YEndDocPrinter+0x22

    18 00000000`0649ed30 000007fe`fe2f23d5 spoolsv!RpcEndDocPrinter+0x3e

    19 00000000`0649ed60 000007fe`fe39b68e rpcrt4!Invoke+0x65

    1a 00000000`0649edb0 000007fe`fe2dac40 rpcrt4!Ndr64StubWorker+0x61b

    1b 00000000`0649f370 000007fe`fe2e50f4 rpcrt4!NdrServerCallAll+0x40

    1c 00000000`0649f3c0 000007fe`fe2e4f56 rpcrt4!DispatchToStubInCNoAvrf+0x14

    1d 00000000`0649f3f0 000007fe`fe2e5679 rpcrt4!RPC_INTERFACE::DispatchToStubWorker+0x146

    1e 00000000`0649f510 000007fe`fe2e532d rpcrt4!LRPC_SCALL::DispatchRequest+0x149

    1f 00000000`0649f5f0 000007fe`fe302e7f rpcrt4!LRPC_SCALL::HandleRequest+0x20d

    20 00000000`0649f720 000007fe`fe302a35 rpcrt4!LRPC_ADDRESS::ProcessIO+0x3bf

    21 00000000`0649f860 00000000`777cb68b rpcrt4!LrpcIoComplete+0xa5

    22 00000000`0649f8f0 00000000`777cfeff ntdll!TppAlpcpExecuteCallback+0x26b

    23 00000000`0649f980 00000000`776a652d ntdll!TppWorkerThread+0x3f8

    24 00000000`0649fc80 00000000`777dc521 kernel32!BaseThreadInitThunk+0xd

    25 00000000`0649fcb0 00000000`00000000 ntdll!RtlUserThreadStart+0x1d

     

    It's calling into the print isolation host as we expect.  It looks like it is doing that to end a print job, based on an RPC call it received.  Using our same method as last time, let's pull out the PID and TID he is responding to:

     

    0:048> .frame /c /r 20

    20 00000000`0649f720 000007fe`fe302a35 rpcrt4!LRPC_ADDRESS::ProcessIO+0x3bf

    rax=000000000622986c rbx=0000000000000000 rcx=000000000622985c

    rdx=000007fefe3a3c40 rsi=00000000027e5150 rdi=0000000008e6abd0

    rip=000007fefe302e7f rsp=000000000649f720 rbp=0000000008e43480

     r8=0000000000000010  r9=0000000000000000 r10=000007fefe2c0000

    r11=0000000000000002 r12=000000000512d8c0 r13=0000000000000000

    r14=0000000000000000 r15=00000000102e3710

    iopl=0         nv up ei pl zr na po nc

    cs=0033  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000246

    rpcrt4!LRPC_ADDRESS::ProcessIO+0x3bf:

    000007fe`fe302e7f 4c8d357ad1fbff  lea     r14,[rpcrt4!COMMON_ResubmitListen <PERF> (rpcrt4+0x0) (000007fe`fe2c0000)]

    0:048> dt _CLIENT_ID rdi+8

    ntdll!_CLIENT_ID

       +0x000 UniqueProcess    : 0x00000000`000020a4 Void

       +0x008 UniqueThread     : 0x00000000`00001c34 Void

     

    Look at that, it's doing work for a different thread back in our Print Isolation Host.  Switching back to that dump, what is thread 1c34 doing?

     

    0:002> ~~[1c34]s

    ntdll!NtAlpcSendWaitReceivePort+0xa:

    00000000`77801b6a c3 

    0:003> kn

     # Child-SP          RetAddr           Call Site

    00 00000000`00ebc098 000007fe`fe2fa776 ntdll!NtAlpcSendWaitReceivePort+0xa

    01 00000000`00ebc0a0 000007fe`fe39cc74 rpcrt4!LRPC_CCALL::SendReceive+0x156

    02 00000000`00ebc160 000007fe`fe39cf25 rpcrt4!NdrpClientCall3+0x244

    03 00000000`00ebc420 000007fe`f9bfd878 rpcrt4!NdrClientCall3+0xf2

    04 00000000`00ebc7b0 000007fe`f96845bf winspool!EndDocPrinter+0x15c

    05 00000000`00ebc7f0 00000000`750c4102 spoolss!EndDocPrinter+0x2f

    06 00000000`00ebc820 00000000`750c5013 ProseWarePrintProcessor+0x4102

    07 00000000`00ebe620 000007fe`f9676be2 ProseWarePrintProcessor!PrintDocumentOnPrintProcessor+0xb3

    08 00000000`00ebe650 000007fe`fe2f23d5 PrintIsolationProxy!sandbox::PrintProcessor::PrintDocThroughPrintProcessor+0x82

    09 00000000`00ebe6b0 000007fe`fe39b68e rpcrt4!Invoke+0x65

    0a 00000000`00ebe710 000007fe`fe2f48d6 rpcrt4!Ndr64StubWorker+0x61b

    0b 00000000`00ebecd0 000007fe`fdd50883 rpcrt4!NdrStubCall3+0xb5

    0c 00000000`00ebed30 000007fe`fdd50ccd ole32!CStdStubBuffer_Invoke+0x5b

    0d 00000000`00ebed60 000007fe`fdd50c43 ole32!SyncStubInvoke+0x5d

    0e 00000000`00ebedd0 000007fe`fdc0a4f0 ole32!StubInvoke+0xdb

    0f 00000000`00ebee80 000007fe`fdd514d6 ole32!CCtxComChnl::ContextInvoke+0x190

    10 00000000`00ebf010 000007fe`fdd5122b ole32!AppInvoke+0xc2

    11 00000000`00ebf080 000007fe`fdd4fd6d ole32!ComInvokeWithLockAndIPID+0x52b

    12 00000000`00ebf210 000007fe`fe2e50f4 ole32!ThreadInvoke+0x30d

    13 00000000`00ebf2b0 000007fe`fe2e4f56 rpcrt4!DispatchToStubInCNoAvrf+0x14

    14 00000000`00ebf2e0 000007fe`fe2e775b rpcrt4!RPC_INTERFACE::DispatchToStubWorker+0x146

    15 00000000`00ebf400 000007fe`fe2e769b rpcrt4!RPC_INTERFACE::DispatchToStub+0x9b

    16 00000000`00ebf440 000007fe`fe2e7632 rpcrt4!RPC_INTERFACE::DispatchToStubWithObject+0x5b

    17 00000000`00ebf4c0 000007fe`fe2e532d rpcrt4!LRPC_SCALL::DispatchRequest+0x422

    18 00000000`00ebf5a0 000007fe`fe302e7f rpcrt4!LRPC_SCALL::HandleRequest+0x20d

    19 00000000`00ebf6d0 000007fe`fe302a35 rpcrt4!LRPC_ADDRESS::ProcessIO+0x3bf

    1a 00000000`00ebf810 00000000`777cb68b rpcrt4!LrpcIoComplete+0xa5

    1b 00000000`00ebf8a0 00000000`777cfeff ntdll!TppAlpcpExecuteCallback+0x26b

    1c 00000000`00ebf930 00000000`776a652d ntdll!TppWorkerThread+0x3f8

    1d 00000000`00ebfc30 00000000`777dc521 kernel32!BaseThreadInitThunk+0xd

    1e 00000000`00ebfc60 00000000`00000000 ntdll!RtlUserThreadStart+0x1d

     

    This thread called into spooler to end a document based on yet another RPC!  To be clear, this is what we're looking at so far:

    image004

     

    Again, who called into this thread?

     

    0:003> .frame /c /r 19

    19 00000000`00ebf6d0 000007fe`fe302a35 rpcrt4!LRPC_ADDRESS::ProcessIO+0x3bf

    rax=57633d3cd2c9c145 rbx=0000000000000000 rcx=0000000000861ff0

    rdx=ffffffffff88ffe0 rsi=00000000002b8aa0 rdi=0000000000ec6bc0

    rip=000007fefe302e7f rsp=0000000000ebf6d0 rbp=00000000002cdef0

     r8=000000000003dbf0  r9=00000000000000fe r10=6dc0575d3d3cc4c8

    r11=0000000000860020 r12=0000000000ec82b0 r13=0000000000000000

    r14=0000000000000000 r15=0000000000ec8080

    iopl=0         nv up ei pl zr na po nc

    cs=0033  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000246

    rpcrt4!LRPC_ADDRESS::ProcessIO+0x3bf:

    000007fe`fe302e7f 4c8d357ad1fbff  lea     r14,[rpcrt4!COMMON_ResubmitListen <PERF> (rpcrt4+0x0) (000007fe`fe2c0000)]

    0:003> dt _CLIENT_ID rdi+8

    ntdll!_CLIENT_ID

       +0x000 UniqueProcess    : 0x00000000`00002694 Void

       +0x008 UniqueThread     : 0x00000000`000014cc Void

     

    So another call from spooler (note the same PID as earlier) - let's go back to spoolsv.dmp.  Here is this thread:

    0:048> ~~[14cc]s

    ntdll!NtAlpcSendWaitReceivePort+0xa:

    00000000`77801b6a c3              ret

    0:049> k

    Child-SP          RetAddr           Call Site

    00000000`0351e538 000007fe`fe2fa776 ntdll!NtAlpcSendWaitReceivePort+0xa

    00000000`0351e540 000007fe`fe2f4e42 rpcrt4!LRPC_CCALL::SendReceive+0x156

    00000000`0351e600 000007fe`fdd528c0 rpcrt4!I_RpcSendReceive+0x42

    00000000`0351e630 000007fe`fdd5282f ole32!ThreadSendReceive+0x40

    00000000`0351e680 000007fe`fdd5265b ole32!CRpcChannelBuffer::SwitchAptAndDispatchCall+0xa3

    00000000`0351e720 000007fe`fdc0daaa ole32!CRpcChannelBuffer::SendReceive2+0x11b

    00000000`0351e8e0 000007fe`fdc0da0c ole32!CAptRpcChnl::SendReceive+0x52

    00000000`0351e9b0 000007fe`fdd5205d ole32!CCtxComChnl::SendReceive+0x68

    00000000`0351ea60 000007fe`fe39b949 ole32!NdrExtpProxySendReceive+0x45

    00000000`0351ea90 000007fe`fdd521d0 rpcrt4!NdrpClientCall3+0x2e2

    00000000`0351ed50 000007fe`fdc0d8a2 ole32!ObjectStublessClient+0x11d

    00000000`0351f0e0 000007fe`fb0591ac ole32!ObjectStubless+0x42

    00000000`0351f130 000007fe`fb057882 localspl!sandbox::PrintProcessorExecuteObserver::PrintDocThroughPrintProcessor+0x124

    00000000`0351f1f0 000007fe`fb05601d localspl!sandbox::PrintProcessorAdapterImpl::PrintDocumentOnPrintProcessor+0x3a

    00000000`0351f220 000007fe`fb013b70 localspl!sandbox::PrintProcessorAdapter::PrintDocumentOnPrintProcessor+0x9d

    00000000`0351f270 000007fe`fb014c7c localspl!PrintDocumentThruPrintProcessor+0x46c

    00000000`0351fa70 00000000`776a652d localspl!PortThread+0x4d0

    00000000`0351fd80 00000000`777dc521 kernel32!BaseThreadInitThunk+0xd

    00000000`0351fdb0 00000000`00000000 ntdll!RtlUserThreadStart+0x1d

     

    Perfect.  So now we know that localspl was printing a document, which resulted in all of these RPC calls between spooler and Print Isolation Host, and ultimately the deadlock in Print Isolation Host is holding up this thread.  Just out of curiosity, are there any other threads blocked on this wait chain?

     

    0:049> !locks

    CritSec +5a00060 at 0000000005a00060

    WaiterWoken        No

    LockCount          105

    RecursionCount     1

    OwningThread       5e8

    EntryCount         0

    ContentionCount    84

    *** Locked

     

    Scanned 14148 critical sections

     

    Yes there are.  Thread 5e8, which we looked at earlier, is holding a lock that 104 other threads are waiting for!  This dump had 177 threads, so we know now that thread 5e8, 14cc, and 104 others in spooler are all hung on this deadlock. With about 60% of all the threads in spooler hung, the deadlock in ProseWarePrintProcessor is clearly the cause of our issue.  Here's the final wait chain diagram:

    image006

     

    To resolve the issue, ProseWarePrintProcessor needs to avoid calling GetModuleFileName while its DllMain routine is still running, since the former requires and the latter holds the Loader Lock.

  • Ntdebugging Blog

    Commitment Failures, Not Just a Failed Love Story

    • 1 Comments

    I was working on a debug the other day when I ran the “!vm” command and saw that the system had some 48,000 commit requests that failed. This was strange as the system was not out of memory and the page file was not full. I was left scratching my head and thinking “I wish I knew where !vm got that information from.” So I went on a quest to find out, here is what I found.

     

    13: kd> !vm 1

     

    *** Virtual Memory Usage ***

          Physical Memory:    12580300 (  50321200 Kb)

          Page File: \??\C:\pagefile.sys

            Current:  50331648 Kb  Free Space:  50306732 Kb

            Minimum:  50331648 Kb  Maximum:     50331648 Kb

          Available Pages:     4606721 (  18426884 Kb)

          ResAvail Pages:     12189247 (  48756988 Kb)

          Locked IO Pages:           0 (         0 Kb)

          Free System PTEs:   33460257 ( 133841028 Kb)

          Modified Pages:        20299 (     81196 Kb)

          Modified PF Pages:      6154 (     24616 Kb)

          NonPagedPool 0 Used:   19544 (     78176 Kb)

          NonPagedPool 1 Used:   22308 (     89232 Kb)

          NonPagedPool Usage:    53108 (    212432 Kb)

          NonPagedPool Max:    9408956 (  37635824 Kb)

          PagedPool 0 Usage:    168921 (    675684 Kb)

          PagedPool 1 Usage:   4149241 (  16596964 Kb)

          PagedPool 2 Usage:     17908 (     71632 Kb)

          PagedPool Usage:     4336070 (  17344280 Kb)

          PagedPool Maximum:  33554432 ( 134217728 Kb)

          Session Commit:         3438 (     13752 Kb)

          Shared Commit:          6522 (     26088 Kb)

          Special Pool:              0 (         0 Kb)

          Shared Process:        53597 (    214388 Kb)

          PagedPool Commit:    4336140 (  17344560 Kb)

          Driver Commit:          5691 (     22764 Kb)

          Committed pages:     5565215 (  22260860 Kb)

          Commit limit:       25162749 ( 100650996 Kb)

     

          ********** 48440 commit requests have failed  **********

     

    It turns out that this calculation is from a global ULONG array named “nt!MiChargeCommitmentFailures”.  The array has 4 members and they are used to trace the types of commit failures that have taken place. This is done by first calculating the new commit size NewCommitValue = CurrentCommitValue + SystemReservedMemory. Based on this calculation commit errors are tracked in a few different ways, which are listed below with the corresponding member in the array that is incremented.

     

    MiChargeCommitmentFailures[0] - If the system failed a commit request and an expansion of the pagefile has failed.

    MiChargeCommitmentFailures[1] - If the system failed a commit and we have already reached the maximum pagefile size.

    MiChargeCommitmentFailures[2] - If the system failed a commit while the pagefile lock is held.

    MiChargeCommitmentFailures[3] - If the system failed a commit and the NewCommitValue is less than or equal to CurrentCommitValue.

     

    In order to calculate the count of failures, "!vm" adds up the values stored in each array member of the array. Members 0 and 1 are always counted, member 2 is counted if the OS version is Windows 2003/XP and member 3 is counted if the build version is newer than Windows 2003/XP. 

     

    Let's look at the array in the dump I was debugging:

     

    13: kd> dc nt!MiChargeCommitmentFailures L4

    fffff800`01e45ce0  00000000 0000bd38 00000000 00000000  ....8...........

     

     

    Converting this to decimal we find the 48000+ commit failures I was seeing the in output of !VM.

     

    13: kd> ?0000bd38

    Evaluate expression: 48440 = 00000000`0000bd38

     

    Since I now had my answer, “where does the number come from?”, I was left wanting to know a bit more about the overall flow of why a VirtualAlloc fails to commit.

     

    When memory is allocated by VirtualAlloc the newly allocated memory is not committed to physical memory. Only when the memory is accessed by a read or write is it backed by physical memory.

     

    When this newly allocated memory is accessed for the first time it will need to be backed by commit space. Under normal conditions this is a smooth process, however when the system hits what’s called the commit limit and can’t expand this limit we see commit failures.

     

    So how is the commit limit calculated? Let’s say we have a system with 4GB of physical memory and a pagefile that is 6GB in size. To determine the commit limit we add physical memory and the pagefile size together - in this example the commit limit would be 10GB. Since memory manger will not let any user mode allocation consume every last morsel of commit space it keeps a small amount of the commit space for the system to avoid hangs. When the limit is reached the system tries to grow the page file. If there is no more room to grow the pagefile or the pagefile has reached its configured maximum size, the system will try and free some committed memory to make room for more requests. If expansion of the page file or the attempt to free memory do not allow the allocation to complete, the allocation fails and MiChargeCommitmentFailures is incremented.

     

     

    To sum it all up, commit limit is RAM + pagefile, commit failures happen when we hit the commit limit and the system is unable to grow the pagefile because it is already at its max.  It’s that simple, well almost.

     

    For those that will want to know more about how memory manger works please see the post from Somak: The Memory Shell Game.

     

    Randy Monteleone

  • Ntdebugging Blog

    Debugging a Debugger to Debug a Dump

    • 1 Comments

    Recently I came across an instance where my debugger did not do what I wanted.  Rarely do computers disobey me, but this one was unusually stubborn.  There was no other option; I had to bend the debugger to my will.

     

    There are many ways to make a computer program do what you want.  If you have the source code you can rewrite and recompile the program.  If you have a hex editor you can edit the code of the binary.  A shim can be used to modify a program at runtime.  In this instance I was in a hurry and I was ok with a temporary solution, so I used a debugger to change the execution of the debugger while it ran.

     

    Debug a debugger? Can you do such a thing?  Of course you can.

     

    In this example a memory dump was captured of a system and I was asked to determine if the system had run out of desktop heap.  Usually the !dskheap command can be used to determine how much heap has been used by each desktop. Unfortunately, this command failed me.

     

    23: kd> !dskheap

    Error Reading TotalFreeSize from nt!_HEAP @ fffffa8019c65c00

    Failed to GetHeapInfo for desktop @fffffa8019c65c00

    EnumDsktps failed on Winsta: 19c4f090FillWinstaArray failed

     

    The error indicates that the command couldn’t read from the _HEAP structure at fffffa8019c65c00 for desktop fffffa8019c65c00.  Further investigation found that reason I got this error is that the heap for the desktop in question is not valid memory.  Because the memory is described by a prototype PTE I assume that the heap has not been initialized (Note: See Windows Internals’ Memory Management chapter for more information about proto PTEs).

     

    23: kd> dt win32k!tagDESKTOP fffffa8019c65c00

       +0x000 dwSessionId      : 0

       +0x008 pDeskInfo        : 0xfffff900`c05e0a70 tagDESKTOPINFO

       +0x010 pDispInfo        : 0xfffff900`c0581e50 tagDISPLAYINFO

       +0x018 rpdeskNext       : 0xfffffa80`19c6ef20 tagDESKTOP

       +0x020 rpwinstaParent   : 0xfffffa80`19c4f090 tagWINDOWSTATION

       +0x028 dwDTFlags        : 0x110

       +0x030 dwDesktopId      : 0x19c65c00`00000003

       +0x038 spmenuSys        : (null)

       +0x040 spmenuDialogSys  : (null)

       +0x048 spmenuHScroll    : (null)

       +0x050 spmenuVScroll    : (null)

       +0x058 spwndForeground  : (null)

       +0x060 spwndTray        : (null)

       +0x068 spwndMessage     : 0xfffff900`c05e0d90 tagWND

       +0x070 spwndTooltip     : 0xfffff900`c05e0fa0 tagWND

       +0x078 hsectionDesktop  : 0xfffff8a0`00ef08e0 Void

       +0x080 pheapDesktop     : 0xfffff900`c05e0000 tagWIN32HEAP

       +0x088 ulHeapSize       : 0x18000

       +0x090 cciConsole       : _CONSOLE_CARET_INFO

       +0x0a8 PtiList          : _LIST_ENTRY [ 0xfffffa80`19c65ca8 - 0xfffffa80`19c65ca8 ]

       +0x0b8 spwndTrack       : (null)

       +0x0c0 htEx             : 0n0

       +0x0c4 rcMouseHover     : tagRECT

       +0x0d4 dwMouseHoverTime : 0

       +0x0d8 pMagInputTransform : (null)

    23: kd> dd 0xfffff900`c05e0000

    fffff900`c05e0000  ???????? ???????? ???????? ????????

    fffff900`c05e0010  ???????? ???????? ???????? ????????

    fffff900`c05e0020  ???????? ???????? ???????? ????????

    fffff900`c05e0030  ???????? ???????? ???????? ????????

    fffff900`c05e0040  ???????? ???????? ???????? ????????

    fffff900`c05e0050  ???????? ???????? ???????? ????????

    fffff900`c05e0060  ???????? ???????? ???????? ????????

    fffff900`c05e0070  ???????? ???????? ???????? ????????

    23: kd> !pte fffff900`c05e0000

                                               VA fffff900c05e0000

    PXE at FFFFF6FB7DBEDF90    PPE at FFFFF6FB7DBF2018    PDE at FFFFF6FB7E403010    PTE at FFFFF6FC80602F00

    contains 000000076245C863  contains 0000000762569863  contains 000000045FA17863  contains F8A000F4F9780400

    pfn 76245c    ---DA--KWEV  pfn 762569    ---DA--KWEV  pfn45fa17    ---DA--KWEV   not valid

                                                                                     Proto: FFFFF8A000F4F978

     

    There are many desktops in this session and I wanted to know about the usage of the other desktops, but the !dskheap command stopped after just one error.  I needed to force it to continue after this error, so I launched a debugger to debug my debugger.  There is a command to do this, just run .dbgdbg.

     

    23: kd> .dbgdbg

    Debugger spawned, connect with

        "-remotenpipe:icfenable,pipe=cdb_pipe,server=NINJA007"

     

    For clarity I will call the original debugger where I ran !dskheap debugger1, and the new debugger spawned by .dbgdbg debugger2 .

     

    Before switching to debugger2 I need to know what I am going to debug.  The error message gives a hint about where to set a breakpoint, I am looking for a failure from GetHeapInfo.

     

    23: kd> !dskheap

    Error Reading TotalFreeSize from nt!_HEAP @ fffffa8019c65c00

    Failed to GetHeapInfo for desktop @fffffa8019c65c00

    EnumDsktps failed on Winsta: 19c4f090FillWinstaArray failed

     

    I need to know which module GetHeapInfo is in, the .extmatch match command indicates which module contains the !dskheap command.

     

    23: kd> .extmatch dskheap

    !kdexts.dskheap

     

    Switching to debugger2 I set a breakpoint on kdexts!GetHeapInfo.  Use Ctrl+C to trigger a debug break in cdb (this is the same as a Ctrl+Break in windbg).

     

    0:004> bp kdexts!GetHeapInfo

    0:004> g

     

    Switch back to debugger1 and run the !dskheap command.

     

    23: kd> !dskheap

     

    In debugger2 I have hit the breakpoint.

     

    Breakpoint 0 hit

    kdexts!GetHeapInfo:

    000007f9`4237b9b0 4055            push    rbp

     

    The error says GetHeapInfo failed, so I am interested in what this function returns.  To see what GetHeapInfo returns I go up one level in the stack and set a new breakpoint on the code just after it returns.  This new breakpoint will also dump the return value of GetHeapInfo (return values are always in the rax register).

     

    0:000> gu

    kdexts!EnumDsktps+0x197:

    000007f9`4237b483 4885c0          test    rax,rax

    0:000> r rax

    rax=0000000000000000

    0:000> bc *

    0:000> bp 000007f9`4237b483 "r rax"

    0:000> g

     

    The next time the breakpoint hit the return value was 1, which in this instance means GetHeapInfo failed.  This is where I exerted my control over the computer: I forced the return value to 0.

     

    rax=0000000000000001

    kdexts!EnumDsktps+0x197:

    000007f9`4237b483 4885c0          test    rax,rax

    0:000> r rax=0

     

    I ran through the other breakpoints and changed rax as necessary.

     

    0:000> g

    rax=0000000000000000

    kdexts!EnumDsktps+0x197:

    000007f9`4237b483 4885c0          test    rax,rax

    0:000> g

    rax=0000000000000000

    kdexts!EnumDsktps+0x197:

    000007f9`4237b483 4885c0          test    rax,rax

    0:000> g

    rax=0000000000000000

    kdexts!EnumDsktps+0x197:

    000007f9`4237b483 4885c0          test    rax,rax

    0:000> g

    rax=0000000000000000

    kdexts!EnumDsktps+0x197:

    000007f9`4237b483 4885c0          test    rax,rax

    0:000> g

    rax=0000000000000001

    kdexts!EnumDsktps+0x197:

    000007f9`4237b483 4885c0          test    rax,rax

    0:000> r rax=0

    0:000> g

    rax=0000000000000000

    kdexts!EnumDsktps+0x197:

    000007f9`4237b483 4885c0          test    rax,rax

    0:000> g

     

    Everything was going well, until the computer defied me again.  The !dskheap output computes the percentage of heap usage by dividing the bytes used by the size of the heap.  Unfortunately, the size of the heap was left at 0 for the two heaps where I changed the return value.  It is well known that only Chuck Norris can divide by zero; to prevent a roundhouse kick to your computer the processor generates an exception.

     

    (2d0.928): Integer divide-by-zero - code c0000094 (first chance)

    First chance exceptions are reported before any exception handling.

    This exception may be expected and handled.

    kdexts!DisplayInfo+0x2ee:

    000007f9`4237b90e 49f7f3          div     rax,r11

    0:000> r r11

    r11=0000000000000000

    0:000> r rax

    rax=0000000000000000

    0:000> g

     

    Fortunately debugger1 handles the divide by zero exception and it is easy to run !dskheap again.

     

    23: kd> !dskheap

     

    Back in debugger2 I set a new breakpoint on the div instruction that outputs the divisor.  When the divisor (r11) is 0 I changed it to a non-zero value to avoid the wrath of Mr. Norris.

     

    rax=0000000000000000

    kdexts!EnumDsktps+0x197:

    000007f9`4237b483 4885c0          test    rax,rax

    0:000> bp 000007f9`4237b90e

    0:000> bp 000007f9`4237b90e "r r11"

    breakpoint 1 redefined

    0:000> g

    rax=0000000000000001

    kdexts!EnumDsktps+0x197:

    000007f9`4237b483 4885c0          test    rax,rax

    0:000> r rax=0

    0:000> g

    rax=0000000000000000

    kdexts!EnumDsktps+0x197:

    000007f9`4237b483 4885c0          test    rax,rax

    0:000> g

    rax=0000000000000000

    kdexts!EnumDsktps+0x197:

    000007f9`4237b483 4885c0          test    rax,rax

    0:000> g

    rax=0000000000000000

    kdexts!EnumDsktps+0x197:

    000007f9`4237b483 4885c0          test    rax,rax

    0:000> g

    rax=0000000000000000

    kdexts!EnumDsktps+0x197:

    000007f9`4237b483 4885c0          test    rax,rax

    0:000> g

    rax=0000000000000001

    kdexts!EnumDsktps+0x197:

    000007f9`4237b483 4885c0          test    rax,rax

    0:000> r rax=0

    0:000> g

    rax=0000000000000000

    kdexts!EnumDsktps+0x197:

    000007f9`4237b483 4885c0          test    rax,rax

    0:000> g

    r11=0000000000033333

    kdexts!DisplayInfo+0x2ee:

    000007f9`4237b90e 49f7f3          div     rax,r11

    0:000> g

    r11=0000000000000000

    kdexts!DisplayInfo+0x2ee:

    000007f9`4237b90e 49f7f3          div     rax,r11

    0:000> r r11=1

    0:000> g

    r11=00000000000007ae

    kdexts!DisplayInfo+0x2ee:

    000007f9`4237b90e 49f7f3          div     rax,r11

    0:000> g

    r11=0000000000013333

    kdexts!DisplayInfo+0x2ee:

    000007f9`4237b90e 49f7f3          div     rax,r11

    0:000> g

    r11=0000000000013333

    kdexts!DisplayInfo+0x2ee:

    000007f9`4237b90e 49f7f3          div     rax,r11

    0:000> g

    r11=0000000000013333

    kdexts!DisplayInfo+0x2ee:

    000007f9`4237b90e 49f7f3          div     rax,r11

    0:000> g

    r11=0000000000000000

    kdexts!DisplayInfo+0x2ee:

    000007f9`4237b90e 49f7f3          div     rax,r11

    0:000> r r11=1

    0:000> g

    r11=0000000000013333

    kdexts!DisplayInfo+0x2ee:

    000007f9`4237b90e 49f7f3          div     rax,r11

    0:000> g

     

    Finally, back in debugger1 I now have complete output for !dskheap.  After a few strategic modifications of the program’s execution I got it to output the data I wanted.  As it turns out we aren’t out of desktop heap after all.

     

    23: kd> !dskheap

    Error Reading TotalFreeSize from nt!_HEAP @ fffffa8019c65c00

    Error Reading TotalFreeSize from nt!_HEAP @ fffffa801a53ea30

      Winstation\Desktop            Heap Size(KB)   Used Rate(%)

    ------------------------------------------------------------

      WinSta0\Default                  20480                 0%

      WinSta0\Disconnect                   0                 0%

      WinSta0\Winlogon                   192                 2%

      Service-0x0-3e7$\Default          7680                 1%

      Service-0x0-3e4$\Default          7680                 0%

      Service-0x0-3e5$\Default          7680                 0%

      Service-0x0-26f46a$\Default          0                 0%

      Service-0x0-2706f2$\Default       7680                 0%

    ------------------------------------------------------

                    Total Desktop: (   51392 KB -   8 desktops)

                    Session ID:  0

    ============================================================

  • Ntdebugging Blog

    Leaving the Do Not Disturb Sign on the Door Will Cause the KERNEL_APC_PENDING_DURING_EXIT Bugcheck

    • 1 Comments

    This is Ron Stock from the Global Escalation Services team and I recently worked with a customer to determine which misbehaving driver was crashing their critical server. This particular crash was a STOP 0x00000020 which maps to KERNEL_APC_PENDING_DURING_EXIT.

     

    The KERNEL_APC_PENDING_DURING_EXIT bugcheck type indicates the APC disable count for a thread was not equal to zero when the thread exited. The APC disable count is a field in the _KTHREAD structure and it is decremented when drivers disable APCs by calling functions such as KeEnterCriticalRegion, FsRtlEnterFileSystem or by acquiring a mutex. Disabling APC delivery to a thread is the equivalent of hanging the “Do Not Disturb” sign on your door. When drivers need to perform a critical operation they ‘hang the sign on the door’ to prevent interruption from APCs. When the same driver fails to ‘take the sign off the door’ by calling KeLeaveCriticalRegion, FsRtlExitFileSystem or KeReleaseMutex, the APC disable count is never incremented back to its original value. This forgetful behavior causes a bugcheck because the APC disable count is checked when the thread is exiting. The OS expects this value to be zero on thread exit.

     

    In my case the value was 0xffff (negative 1) indicating a driver had forgot to remove the ‘Do Not Disturb’ sign.

     

    kd> !analyze –v

     

    KERNEL_APC_PENDING_DURING_EXIT (20)

    The key data item is the thread's APC disable count.

    If this is non-zero, then this is the source of the problem.

    Arguments:

    Arg1: 0000000000000000, The address of the APC found pending during exit.

    Arg2: 000000000000ffff, The thread's APC disable count

    Arg3: 0000000000000000, The current IRQL

    Arg4: 0000000000000001

     

    Because the value is decremented earlier in time the current call stack is not particularly useful. It merely shows the thread exiting under normal conditions.

     

    0: kd> !thread -1 e

    THREAD fffffa8049f04b50  Cid 0004.0998  Teb: 0000000000000000 Win32Thread: 0000000000000000 RUNNING on processor 0

    Not impersonating

    DeviceMap                 fffff8a000007ee0

    Owning Process            fffffa8048cad9e0       Image:         System

    Attached Process          N/A            Image:         N/A

    Wait Start TickCount      11503325       Ticks: 0

    Context Switch Count      185715         IdealProcessor: 0

    UserTime                  00:00:00.000

    KernelTime                00:00:06.078

    Win32 Start Address srv2!SrvProcWorkerThread(0xfffff88003c4b400)

    Stack Init fffff88005078db0 Current fffff880050789b0

    Base fffff88005079000 Limit fffff88005073000 Call 0

    Priority 15 BasePriority 15 UnusualBoost 0 ForegroundBoost 0 IoPriority 2 PagePriority 5

    Child-SP          RetAddr           Call Site

    fffff880`05078b08 fffff800`01984bd9 nt!KeBugCheckEx

    fffff880`05078b10 fffff800`019a1a3d nt!PspExitThread+0xffffffff`fffe3ae9

    fffff880`05078c10 fffff800`0195bc8a nt!PspTerminateThreadByPointer+0x4d

    fffff880`05078c60 fffff880`03c56769 nt!PsTerminateSystemThread+0x22

    fffff880`05078c90 fffff880`03c4b5b6 srv2!SrvProcTerminateWorkerThreadInternal+0x99

    fffff880`05078cc0 fffff800`01966e5a srv2!SrvProcWorkerThread+0x1b6

    fffff880`05078d40 fffff800`016c0d26 nt!PspSystemThreadStartup+0x5a

    fffff880`05078d80 00000000`00000000 nt!KxStartSystemThread+0x16

     

    Driver Verifier is the ideal tool for this type of bugcheck. It has a feature called Critical Region logging which tracks the call stack and KTRHEAD value for each call to either KeEnterCriticalRegion() and KeLeaveCriticalRegion(). I had the customer enable this logging by selecting the “Miscellaneous checks” option in Driver Verifier using these steps-

    • Run Verifier.exe
    • Select “Create custom settings (For code developers)”
    • Select individual settings from a full list
    • Select Miscellaneous checks
    • Select Driver Names from a list
    • Manually choose all of the third-party drivers.
    • Reboot after making the changes. 

     

    After running through the steps above, we gathered another STOP 0x00000020 dump. I confirmed the “Miscellaneous checks” option was enabled by using the !verifier command

     

    0: kd> !verifier

     

    Verify Level 800 ... enabled options are:

          Miscellaneous checks enabled

     

    The stack in this new dump was in the same SrvProcWorkerThread thread exit path so we had a consistent pattern. The thread with the negative APC Disable count was fffffa804b5be040.

     

    0: kd> !thread -1 e

    THREAD fffffa804b5be040 Cid 0004.082c  Teb: 0000000000000000 Win32Thread: 0000000000000000 RUNNING on processor 0

    Not impersonating

    DeviceMap                 fffff8a000007ee0

    Owning Process            fffffa8048cad9e0       Image:         System

    Attached Process          N/A            Image:         N/A

    Wait Start TickCount      4458237        Ticks: 0

    Context Switch Count      36067          IdealProcessor: 0            

    UserTime                  00:00:00.000

    KernelTime                00:00:01.218

    Win32 Start Address srv2!SrvProcWorkerThread(0xfffff88004827400)

    Stack Init fffff88005cc6db0 Current fffff88005cc69b0

    Base fffff88005cc7000 Limit fffff88005cc1000 Call 0

    Priority 15 BasePriority 15 UnusualBoost 0 ForegroundBoost 0 IoPriority 2 PagePriority 5

    Child-SP          RetAddr           Call Site

    fffff880`05cc6b08 fffff800`0198dbd9 nt!KeBugCheckEx

    fffff880`05cc6b10 fffff800`019aaa3d nt!PspExitThread+0xffffffff`fffe3ae9

    fffff880`05cc6c10 fffff800`01964c8a nt!PspTerminateThreadByPointer+0x4d

    fffff880`05cc6c60 fffff880`048326d9 nt!PsTerminateSystemThread+0x22

    fffff880`05cc6c90 fffff880`048275b6 srv2!SrvProcTerminateWorkerThreadInternal+0x99

    fffff880`05cc6cc0 fffff800`0196fe5a srv2!SrvProcWorkerThread+0x1b6

    fffff880`05cc6d40 fffff800`016c9d26 nt!PspSystemThreadStartup+0x5a

    fffff880`05cc6d80 00000000`00000000 nt!KxStartSystemThread+0x16

     

    I dumped the Critical Region log by using the !verifier 200 command. The Critical Region log has enough room for 128 stacks. After dumping the log, the first thing to do is to find the KTHREAD value of the thread with the non-zero APC disable count. Unfortunately in my case thread fffffa804b5be040 didn’t appear in the log. In fact all 128 stacks had a driver named Suspect.sys calling KeEnterCriticalRegion or KeLeaveCriticalRegion.  Note: To protect our vendor friends, I renamed the actual sys file in this article to suspect.sys.

     

    The customer disabled the suspect.sys driver hoping this was the driver forgetting to re-enable APCs. If nothing else, this would perhaps remove the noisy suspect.sys from the log in the next dump.

     

    0: kd> !verifier 200

     

    Enter/Leave Critical Region log:

    There are up to 0x80 entries in the log.

     

    Displaying all the log entries.

     

    ======================================================================

    Thread fffffa8048ce4b50

    fffff80001b74293 nt!VerifierKeLeaveCriticalRegion+0xc3

    fffff8800100aafa Suspect.sys+0xaafa

    fffff88001001e30 Suspect.sys+0x1e30

    fffff80001abc68c nt!IopLoadUnloadDriver+0x1c

    fffff800016e1641 nt!ExpWorkerThread+0x111

    fffff8000196ee5a nt!PspSystemThreadStartup+0x5a

    fffff800016c8d26 nt!KiStartSystemThread+0x16

    ======================================================================

    Thread fffffa8048ce4b50

    fffff80001b6b0a2 nt!VerifierKeEnterCriticalRegion+0x92

    fffff880010062a3 Suspect.sys+0x62a3

    fffff8800100a7e2 Suspect.sys+0xa7e2

    fffff88001001e30 Suspect.sys+0x1e30

    fffff80001abc68c nt!IopLoadUnloadDriver+0x1c

    fffff800016e1641 nt!ExpWorkerThread+0x111

    fffff8000196ee5a nt!PspSystemThreadStartup+0x5a

     

    Unfortunately, the system continued to crash and in the next dump the critical region log was completely empty. My guess is the complier was optimizing the KeEnterCriticalRegion and KeLeaveCriticalRegion calls in the driver, causing them to be inlined and skipping the call to VerifierKeLeaveCriticalRegion/VerifierKeEnterCriticalRegion. I needed another attack plan.

     

    There is another Verifier option called I/O Verification and it works in a similar way to the steps below.  Please note that this functionality is not documented and may be subject to change at any time.

    1. A call to IoCallDriver() is made to send an IO packet to a driver associated with a device.
    2. Verifier hooks the call.
    3. Verifier creates a structure to record state info.
    4. Verifier fills in the structure with data including the thread’s APC Disable Count.
    5. Next Verifier calls the normal IoCallDriver() routine to “continue” the call made in step 1.
    6. The driver does its work (disables and re-enables APCs as needed)
    7. The call to IoCallDriver() returns when the driver is finished.
    8. Verifier checks the real APC count in the thread. Next it compares the value to the recorded value in the structure from step 4. If the two values do not match, then Verifier crashes the machine so we can pull the bad driver out of the dump.

     

    I had the customer enable I/O Verification using these steps-

    • Run Verifier.exe
    • Select “Create custom settings (For code developers)”
    • Select individual settings from a full list
    • Select I/O Verification
    • Select Driver Names from a list
    • Manually choose all of the third-party drivers.
    • Reboot after making the changes. 

     

    As we expected, the machine crashed again because of the APC Disable issue.  Because we enabled I/O Verification, the bugcheck type changed to DRIVER_VERIFIER_DETECTED_VIOLATION and now we have a smoking gun.

     

    I used the !verifier command to review the Verifier settings. The output below shows “I\O subsystem checking enabled” which confirms I/O Verification was been enabled.

     

    0: kd> !verifier

     

    Verify Level 810 ... enabled options are:

          Io subsystem checking enabled

     

    The parameters to KeBugcheck reconfirmed the APC disable count was -1 (ffff). And this time we have an additional breadcrumb, the driver dispatch routine address.

     

    DRIVER_VERIFIER_DETECTED_VIOLATION (c4)

    A device driver attempting to corrupt the system has been caught.  This is

    because the driver was specified in the registry as being suspect (by the

    administrator) and the kernel has enabled substantial checking of this driver.

    If the driver attempts to corrupt the system, bugchecks 0xC4, 0xC1 and 0xA will

    be among the most commonly seen crashes.

    Arguments:

    Arg1: 00000000000000c5, Thread APC disable count changed by driver dispatch routine.

    Arg2: fffff88001345610, Driver dispatch routine address.

    Arg3: 000000000000ffff, Current thread APC disable count.

    Arg4: 0000000000000000, Thread APC disable count before calling driver dispatch routine.

          The APC disable count is decremented each time a driver calls

          KeEnterCriticalRegion, FsRtlEnterFileSystem, or acquires a mutex. The APC

          disable count is incremented each time a driver calls KeLeaveCriticalRegion,

          FsRtlExitFileSystem, or KeReleaseMutex. Since these calls should always be in

          pairs, this value should be zero when a thread exits. A negative value

          indicates that a driver has disabled APC calls without re-enabling them. A

          positive value indicates that the reverse is true.

     

    Notice the Verifier functions on the call stack which we I leveraged for the “saved state” information I discussed above in the I/O Verification architecture.

     

    0: kd> kn

    # Child-SP          RetAddr           Call Site

    00 fffff880`05db1b08 fffff800`0174a9c0 nt!KeBugCheckEx

    01 fffff880`05db1b10 fffff800`01b66b4ent!VfBugCheckNoStackUsage+0x30

    02 fffff880`05db1b50 fffff800`01b6cc2e nt!VfAfterCallDriver+0x33e

    03 fffff880`05db1ba0 fffff880`04054756 nt!IovCallDriver+0x57e

    04 fffff880`05db1c00 fffff880`0404b7b0 srv2!Smb2ExecuteRead+0x9a6

    05 fffff880`05db1c80 fffff880`0404b6fb srv2!SrvProcessPacket+0xa0

    06 fffff880`05db1cc0 fffff800`01960e5a srv2!SrvProcWorkerThread+0x2fb

    07 fffff880`05db1d40 fffff800`016bad26nt!PspSystemThreadStartup+0x5a

    08 fffff880`05db1d80 00000000`00000000nt!KiStartSystemThread+0x16

     

    Next I dumped the driver dispatch routine noted in the KeBugCheckoutput above using the ln command. This points to fltmgr!FltpDispatch which tells me we have a filter manager minifilter driver making calls to disable APCs but rudely forgetting to re-enable them. As I noted above we save the state info before the call to IoCallDriver().

     

    0: kd> ln fffff88001345610

    (fffff880`01345610)   fltmgr!FltpDispatch   |  (fffff880`01345710)   fltmgr!FltReleasePushLock

    Exact matches:

        fltmgr!FltpDispatch (<no parameter info>)

     

    Now the goal was to determine which minifilter is leaving the “Do Not Disturb” sign on the door and forgetting to remove it. We can find this using the fltmgr device object.  The “saved state” structure is passed to VfAfterCallDriver as the first parameter so I switched to the VfAfterCallDriverframe (second frame) to dig it out. I used the /r flag to show the original values of the registers for this frame.

     

    0: kd> .frame /r 2

    02 fffff880`05db1b50 fffff800`01b6cc2e nt!VfAfterCallDriver+0x33e

    rax=0000000000000000 rbx=fffffa804b729790 rcx=00000000000000c4

    rdx=00000000000000c5 rsi=fffffa804a3b0000 rdi=fffff8000183ce80

    rip=fffff80001b66b4e rsp=fffff88005db1b50 rbp=fffffa804de8c290

    r8=fffff88001345610   r9=000000000000ffff r10=fffff80001b7a640

    r11=0000000000000000 r12=000000004de8c290 r13=0000000000000000

    r14=0000000000000000 r15=0000000000000000

    iopl=0         nv up ei ng nz na pe nc

    cs=0010  ss=0018  ds=002b  es=002b  fs=0053  gs=002b             efl=00000282

    nt!VfAfterCallDriver+0x33e:

    fffff800`01b66b4e cc              int     3

     

    Parameter 1 on x64 is always passed via rcx. I dumped the assembly for VfAfterCallDriver and confirmed the value in rcx (the base of the saved state structure) is moved to rbx.

     

    0: kd> u nt!VfAfterCallDriver

    nt!VfAfterCallDriver:

    fffff800`01b66810 48895c2410      mov     qword ptr[rsp+10h],rbx

    fffff800`01b66815 48896c2418      mov     qword ptr[rsp+18h],rbp

    fffff800`01b6681a 4889742420      mov     qword ptr[rsp+20h],rsi

    fffff800`01b6681f 57              push    rdi

    fffff800`01b66820 4154            push    r12

    fffff800`01b66822 4155            push    r13

    fffff800`01b66824 4883ec30        sub     rsp,30h

    fffff800`01b66828 488bfa          mov     rdi,rdx

    fffff800`01b6682b 488bd9          mov     rbx,rcx

     

    The device object is stored in the save state information at offset 0xa0.

     

    0: kd> ? fffffa804b729790 + 0xa0

    Evaluate expression: -6046048151504 = fffffa80`4b729830

     

    0: kd> dq fffffa80`4b729830 l1

    fffffa80`4b729830  fffffa80`4a3b0060

     

    0: kd> !devobj fffffa80`4a3b0060

    Device object (fffffa804a3b0060) is for:

      \FileSystem\FltMgr DriverObject fffffa80491fb7c0

    Current Irp 00000000 RefCount 0 Type 00000008 Flags 00040000

    DevExt fffffa804a3b01b0 DevObjExt fffffa804a3b0208

    ExtensionFlags (0x80000800)  DOE_DEFAULT_SD_PRESENT, DOE_DESIGNATED_FDO

    Characteristics (0000000000) 

    AttachedTo (Lower) fffffa804a3b1030 \FileSystem\Ntfs

    Device queue is not busy.

     

    0: kd> !devstack fffffa80`4a3b0060

      !DevObj   !DrvObj            !DevExt   ObjectName

    > fffffa804a3b0060  \FileSystem\FltMgr fffffa804a3b01b0 

      fffffa804a3b1030  \FileSystem\Ntfs   fffffa804a3b1180 

     

    As http://msdn.microsoft.com/en-us/library/ff541610(v=vs.85).aspxexplains – “The filter manager is installed with Windows, but it becomes active only when a minifilter driver is loaded. The filter manager attaches to the file system stack for a target volume. A minifilter driver attaches to the file system stack indirectly, by registering with the filter manager for the I/O operations the minifilter driver chooses to filter.”

     

    Using the power of the fltkd extension, I dumped the volume information associated with this device object. From the output below, we can extract the name of the filter attached to the volume. The culprit is named BadDriver.sys. The customer removed the driver and the problem went away long enough for the vendor to create an update for BadDriver.sys. Happy Ending!

     

    0: kd> !fltkd.volume fffffa80`4a3b0060

     

    FLT_VOLUME: fffffa804a3b0800 "\Device\HarddiskVolume3"

       FLT_OBJECT: fffffa804a3b0800  [04000000] Volume

          RundownRef               : 0x000000000000020a (261)

          PointerCount             : 0x00000001

          PrimaryLink              : [fffffa804ae06810-fffffa804a2b16f0]

       Frame                    : fffffa8049fcd420 "Frame 0"

       Flags                    : [00000064] SetupNotifyCalledEnableNameCaching FilterAttached

       FileSystemType           : [00000002] FLT_FSTYPE_NTFS

       VolumeLink               : [fffffa804ae06810-fffffa804a2b16f0]

       DeviceObject             : fffffa804a3b0060

       DiskDeviceObject         : fffffa804a1c0350

       FrameZeroVolume          : fffffa804a3b0800

       VolumeInNextFrame        : 0000000000000000

       Guid                     : "\??\Volume{552791b0-455d-11de-b7b9-00145eed6acc}"

       CDODeviceName            : "\Ntfs"

       CDODriverName            : "\FileSystem\Ntfs"

       TargetedOpenCount        : 258

       Callbacks                : (fffffa804a3b0910)

       ContextLock              : (fffffa804a3b0cf8)

       VolumeContexts           : (fffffa804a3b0d00)  Count=0

       StreamListCtrls          : (fffffa804a3b0d08)  rCount=2871

       FileListCtrls            : (fffffa804a3b0d88)  rCount=0

       NameCacheCtrl            : (fffffa804a3b0e08)

       InstanceList             : (fffffa804a3b0890)

          FLT_INSTANCE: fffffa804b5b1010 "BadDriver.sys Instance" "189600"

  • Ntdebugging Blog

    ‘Hidden’ ETW Stack Trace Feature – Get Stacks All Over the Place!

    • 1 Comments

     

    I’ve heard Mark R. say that “sometimes Microsoft gave me a gift” when coming across an interesting feature not in mainstream documentation.  How true that often the odd find can prove ultimately very valuable.  Such was the case when I read Bruce Dawson’s Stack Walking in Xperf blog entry.

    Here’s the excerpt that blew me away:

    You can also record call stacks for manifest-based ETW (Event Tracing for Windows) events, but the syntax is quite different, and it only works on Windows 7 and above. When you specify your ETW provider to xperf after “-on” you can specify extra parameters after the provider name, separated by colons. These are flags, a level, and, for manifest-based providers, a list of extra data to record, which can include call stacks. You can leave the flags and level fields blank and just specify ‘stack’ (in single quotes) after three colons like this:

         xperf -on Latency -stackwalk profile -start browse -on Microsoft-IE:::'stack'  
         rem Your scenario goes here...     
         xperf -stop browse -stop -d mytrace.etl

    This sounds rather ordinary at first,  but looking at the syntax you may realize that this means for just about every manifest based provider (Microsoft-*) one can generate a stack trace on all ETW events therein.  This is really cool because if you are only interested in stacks, you don’t have to monkey around with any other trace level or flags which is often the most frustrating part about manual ETW tracing.  Recall from Ivan’s blog that there are over 600+ inbox providers to explore so we can go crazy here exploring components!  Let’s demonstrate.

    First, remember to use the Win7 SDK version of XPerf.  The XPerf(WPT) install *.msi files are found in the Bin directory after installing Win32 Development Tools, so you have to do a two-step at present to get the tools installed.

    Let’s list all the providers…

    C:\utils\xperf7sdk>xperf -providers | findstr "Microsoft-"
          .
           .
           .
           93c05d69-51a3-485e-877f-1806a8731346:0x0001000000000000:0x4       : Microsoft-Windows-LUA
           9580d7dd-0379-4658-9870-d5be7d52d6de:0x0001000000000000:0x4       : Microsoft-Windows-WLAN-AutoConfig
           959f1fac-7ca8-4ed1-89dc-cdfa7e093cb0:0x0001000000000000:0x4       : Microsoft-Windows-HealthCenterCPL
           96ac7637-5950-4a30-b8f7-e07e8e5734c1:0x0001000000000000:0x4       : Microsoft-Windows-Kernel-BootDiagnostics
           96f4a050-7e31-453c-88be-9634f4e02139:0x0001000000000000:0x4       : Microsoft-Windows-UserPnp
           98583af0-fc93-4e71-96d5-9f8da716c6b8:0x0001000000000000:0x4       : Microsoft-Windows-Dwm-Udwm
           a50b09f8-93eb-4396-84c9-dc921259f952:0x0001000000000000:0x4       : Microsoft-Windows-Diagnosis-MSDE<
           ad5162d8-daf0-4a25-88a7-01cbeb33902e:0x0001000000000000:0x4       : Microsoft-Windows-WPDClassInstaller
           ae4bd3be-f36f-45b6-8d21-bdd6fb832853:0x0001000000000000:0x4       : Microsoft-Windows-Audio
           af2e340c-0743-4f5a-b2d3-2f7225d215de:0x0001000000000000:0x4       : Microsoft-Windows-Netshell
           b03d4051-3564-4e93-93db-3c34f1b5b503:0x0001000000000000:0x4       : Microsoft-Windows-Diagnosis-TaskManager
           b1f90b27-4551-49d6-b2bd-dfc6453762a6:0x0001000000000000:0x4       : Microsoft-Windows-PowerCpl
           b2a40f1f-a05a-4dfd-886a-4c4f18c4334c:0x0001000000000000:0x4       : Microsoft-PerfTrack-IEFRAME
           c89b991e-3b48-49b2-80d3-ac000dfc9749:0x0001000000000000:0x4       : Microsoft-Windows-Documents
           c9bdb4eb-9287-4c8e-8378-6896f0d1c5ef:0x0001000000000000:0x4       : Microsoft-Windows-HomeGroup-ProviderService
           dbe9b383-7cf3-4331-91cc-a3cb16a3b538:0x0001000000000000:0x4       : Microsoft-Windows-Winlogon
           de513a55-c345-438b-9a74-e18cac5c5cc5:0x0001000000000000:0x4       : Microsoft-Windows-Help
           ded165cf-485d-4770-a3e7-9c5f0320e80c:0x0001000000000000:0x4       : Microsoft-Windows-DeviceUx
           e978f84e-582d-4167-977e-32af52706888:0x0001000000000000:0x4       : Microsoft-Windows-TabletPC-InputPanel
           ed6b3ba8-95b2-4cf5-a317-d4af7003884c:0x0001000000000000:0x4       : Microsoft-Windows-Sidebar
           f3f14ff3-7b80-4868-91d0-d77e497b025e:0x0001000000000000:0x4       : Microsoft-Windows-WMP
           fbcfac3f-8459-419f-8e48-1f0b49cdb85e:0x0001000000000000:0x4       : Microsoft-Windows-NetworkProfile
           ffdb9886-80f3-4540-aa8b-b85192217ddf:0x0001000000000000:0x4       : Microsoft-PerfTrack-MSHTML

    We have a lot to play with as you can see, but let’s say you are trying to track down a socket leak.  Since we know sockets are implemented in afd.sys, let’s search for that provider…

    C:\utils\xperf7sdk>xperf -providers | findstr "AFD"        
    e53c6823-7bb8-44bb-90dc3f86090d48a6                               : Microsoft-Windows-Winsock-AFD       
    Microsoft-Windows-Winsock-AFD


    Now, we can form the XPerf syntax to trace socket operations during the leak.  The “AFDSession” is just the session name we are making up, the rest is the prescriptive syntax. Just to explain the syntax we are starting multiple sessions at the time here – kernel logger session + AFDSession. Similarly we are stopping both sessions at the same time and then merging both into the one file AFD.etl file.

    xperf -on Latency -stackwalk profile -start AFDSession -on Microsoft-Windows-Winsock-AFD:::'stack'
    <reproduce leak>
    xperf -stop AFDSession -stop -d AFD.etl

    Run xperfview AFD.etl and switch on “Stack counts by type graph”. If you don’t see this frame, make sure you are on the 7SDK version

    clip_image001

    The beauty here is the Stack Counts by Type frame which allows us to explore which stack traces where collected, notice all the AFD stack count sources!

    clip_image002

    Let’s say we have the simple case of a massive number of Create’s happening and thus handle counts rocketing up (leaking) for a given process.  Perhaps exhausting all ephemeral ports on the system, ultimately resulting in any number of hang symptoms.

    Simply filter the Events for AfdCreate as follows.

    clip_image003

    Now, we have a filtered Frame of just those stack counts….

    clip_image004

    You can imagine that if there were a lot more creates here that the frequency distribution shown above would be much higher, but for the sake of just exploring the feature, let’s just look at these small number of creates anyway.  In fact, let’s select them all by just choosing Select View in the context menu which will get all of these events over the trace sample.

    clip_image005

    The frame area is now highlighted in light purple indicating the selected area.

    clip_image006

    Because we are interested in Stacks, make sure you have configured your symbol path and have checked Load Symbols

    (be patient, this could take a while to build up your symbol cache)

    clip_image007

    clip_image008

    Next, invoke the Summary Table

    clip_image009

    Click the flyout control on the left to select only the Creates again.

    clip_image010

    Next, make it easy on yourself by Filtering on the interesting process only.

    clip_image011

    After, much better.

    clip_image012

    Now, here is the fruit of our labor in the full stack trace inside iexplore.exe allocating the socket handles.

    clip_image013

    Continued…

    clip_image001[5]

     

    Consider that what we see above is read from the top down.  Iexplore.exe is doing this work through wininet, jsproxy, jscript, etc. all the way down through the calling of ws2_32.dll!socket which ultimately creates the socket and returned our file handle.  I’ve seen so many cases of ephemeral port exhaustion which is essentially a handle leak against sockets, these could have been solved in minutes with this tracing!  Keep in mind that this data isn’t perfectly suited for tracking handle leaks, especially small ones since there’s no accounting for outstanding handles.  However, sometimes the leak is egregious enough that you just need to look at activity over time to be pretty certain of the culprit.  Really, what we are doing here is just probing a particular ETW instrumented API, which is incredibly powerful.

    Typically, on 64-bit computers, you cannot capture the kernel stack in certain contexts when page faults are not allowed. To enable walking the kernel stack on AMD64, set the DisablePagingExecutive Memory Management registry value to 1 and reboot. The DisablePagingExecutive registry value is located under the following registry key: HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management

    In summary, this isn’t really a hidden feature since it’s documented in EnableTraceEx2’s Enable Parameters parameter, but I hope with this additional context you see what powerful analysis is possible.  Previously, this would have required live debugging a process, so now we have another example of the power of XPerf to decode and view key system activity, non-invasively!

    -Tate

  • Ntdebugging Blog

    Part 3: Understanding !PTE - Non-PAE and X64

    • 1 Comments

    Hello, Ryan Mangipano (ryanman) again with part three of my series on understanding the output of the !PTE command. In this last installment I’ll continue our manual conversion of Virtual Addresses by converting a Non-PAE VA. Afterwards I’ll convert a VA from X64 Long Mode. Then I’ll discuss the TLB. If you haven’t read part one and two of this series I recommend taking a looking before jumping into the rest of this article.

    Non-PAE system VA to Physical Address conversion

    In part 1 and part 2 I discussed using windbg to manually x86 PAE virtual to physical address conversion with 4KB and 2MB pages. Now I’ll mention the same thing for Non-PAE systems. The official processor manuals explain how systems not using PAE have only two level of tables instead of the three used by PAE. This fact is because PAE added the Page Directory Pointer Table. This means the CR3 register will contain a pointer to the Page Directory Table. Also note the size of the table entries will be 32-bit, not the 64-bit size observed in the PAE tables. PAE expanded the table size to allow for more bits for the purpose of allowing the addressing of more physical memory.

    The PAE bit is bit number five, which is the sixth bit due to bit numbering starting at zero. You can see PAE is not enabled on this system.

           0: kd> .formats @cr4

      Binary:  00000000 00000000 00000110 11010001

     Another method of checking this:

                   1: kd> j ((@cr4 & 0y00000000000000000000000000100000) != 0) '.echo PAE flag Enabled';'.echo PAE flag Disabled'

    PAE flag Enabled

     Now I'll use the following random virtual address which is valid:

     

    0: kd> !pte f72c5c00

    F72C5C00  - PDE at C0300F70        PTE at C03DCB14

              contains 01014963      contains 06CE7963

            pfn 1014 G-DA--KWV    pfn 6ce7 G-DA—KWV

     

    0: kd> .formats f72c5c00

      Binary:  11110111 00101100 01011100 00000000

     Let’s break it down:

     

    PD offset                        11110111 00

    Page Table Offset                       101100 0101

    Offset in the Physical Page             1100 00000000 (since it’s a even 12 bits,  just refer to it in hex as c00)

     

    0: kd> !dd @cr3 + (0y1111011100 * @@(sizeof(nt!HARDWARE_PTE)))L1

    # a07df70 01014963

     

    0: kd> !dd 1014000 + (0y1011000101 * @@(sizeof(nt!HARDWARE_PTE)))L1

    # 1014b14 06ce7963

     Now that I have the physical page base, I'll place  the last 3 hex digits (c00)  from the Virtual Address onto the address base.

     

    0: kd> !dd 6ce7c00 L4

    # 6ce7c00 00000001 c0000005 00000000 00000000

     

    0: kd> dd f72c5c00 L4

    f72c5c00  00000001 c0000005 00000000 00000000

    X64 VA to Physical Address Conversion

    Just as PAE added a third level to the non-PAE two-level system, x64 Long mode adds a fourth level to the hierarchy.  This table is called the Page-Map Level-4 (PML4 table). AMD refers to the entries in this table as PML4E (Page-Map Level-4 Entry).  Intel refers to each entry as PML4-Table Entry. Internally we refer to this as the eXtended Page directory Entry (PXE).  Regardless of how you refer to these entries they contain indexes into the PDP table (Page Directory Pointer Table).

     Here is the output of the !pte command against a 64-bit address:

                   7: kd> !pte fffffade`c24eb7c0

                                     VA fffffadec24eb7c0

    PXE @ FFFFF6FB7DBEDFA8     PPE at FFFFF6FB7DBF5BD8    PDE at FFFFF6FB7EB7B090    PTE at FFFFF6FD6F612758

    contains 0000000111800863  contains 0000000119826863  contains 0000000119839963  contains 0000000001FF6121

    pfn 111800     ---DA--KWEV  pfn 119826     ---DA--KWEV  pfn 119839     -G-DA--KWEV  pfn 1ff6       -G--A—KREV

    I'll break it down in binary and use data from the processor manuals to separate the bits

     

    7: kd> .formats fffffade`c24eb7c0

      Binary:  11111111 11111111 11111010 11011110 11000010 01001110 10110111 11000000

     

    Sign extend               11111111 11111111

    PML4 offset               11111010 1

    PDP offset                1011110 11

    PD offset                 000010 010

    Page-Table offset         01110 1011

    Physical Page Offset      0111 11000000

     

     Now that I have the numbers, I'll plug them in and find the physical address. If you are having problems following along, refer to part one of this blog and the AMD x64 System Programming manual. You should be comparing the output below to the !pte output above

     

    7: kd> !dq @cr3 + ( 0y111110101 * @@(sizeof(ntkrnlmp!HARDWARE_PTE))) L1

    #  147fa8 00000001`11800863

     

    7: kd> !dq 0x00111800000 + (  0y101111011  * @@(sizeof(ntkrnlmp!HARDWARE_PTE))) L1

    #111800bd8 00000001`19826863

     

    7: kd> !dq 0x119826000 + ( 0y000010010  * @@(sizeof(ntkrnlmp!HARDWARE_PTE))) L1

    #119826090 00000001`19839963

     

    7: kd> !dq 0x119839000 + ( 0y011101011  * @@(sizeof(ntkrnlmp!HARDWARE_PTE))) L1

    #119839758 00000000`01ff6121

     

    7: kd> !dc 1ff67c0 L4

    # 1ff67c0 5085ff48 48000005 68244c8b 04a8f633 H..P...H.L$h3...

     

    7: kd> dc fffffade`c24eb7c0 L4

    fffffade`c24eb7c0  5085ff48 48000005 68244c8b 04a8f633  H..P...H.L$h3...

    TLB- Translation Lookaside Buffer and Conclusion

    The CPU’s memory management unit performs these operations to translate virtual addresses to physical. Wouldn’t it be great if we could cache the virtual address to physical page information in a location that can be accessed very quickly so that the CPU doesn’t have to look this up for future references to this page?  That is just what the Translation Lookaside Buffer (TLB) does. Hopefully this will shed some light on some basic memory structures like Large Pages, Flags, and the TLB so I encourage you to read more about these topics from the following sources-

    How PAE x86 works (on MSDN): http://technet.microsoft.com/en-us/library/cc736309(WS.10).aspx

    Intel  & AMD processor manuals: http://www.intel.com/products/processor/manuals/index.htm and http://developer.amd.com/documentation/guides/Pages/default.aspx#manuals

    “Windows Internals, 5th Edition” Mark E. Russinovich and David A. Solomon with Alex Ionescu  -Chapter 9: Memory Management

     

     

  • Ntdebugging Blog

    I Want A Debugger Robot

    • 1 Comments

    Hi,

     

    My name is Sabin from the Platforms Global Escalation Services team at Microsoft, and today I want to share with you a recent experience I had debugging an issue reported by an hardware manufacturer.

     

    The customer was doing a reboot test for their new server product line. They found that after hundreds of continuous reboots there was always a single instance that the server took more than 20 minutes to start up, when compared to an average 2 minute normal startup time. This happened only once every 300+ to 1000+ reboots. The number of reboots it took before the problem happened again varied randomly so it was difficult to predict when the problem would occur.

     

    Although they setup a live kernel debugging environment, they didn’t want to watch the computer screen for 10+ hours waiting for the problem to happen so they could manually hit Ctrl+Break in windbg. So instead they setup a video camera to film the computer screen 24x7, and they managed to find that when the “mysterious delay” happened the computer showed a gray screen with “Microsoft (R) Windows (R) version 5.1 (Build 3790: Service Pack 2)”.

     

    The case came to me and the customer even shipped a problematic server to our office to troubleshoot the cause of the delay. The problem was that I didn’t want to stare at the computer screen for 10+ hours either!

     

    The first thing I thought was that it would be fantastic if there were a robot sitting in front of Windbg, watching the elapsed time for each reboot, so it could hit Ctrl+Break in windbg if the server takes more than 10 minute to start. Then I asked myself, “Why not?”

     

    I decided to build such a “robot” myself.  I went around and checked the Debuggers SDK document (which can be found in the windbg help document debugger.chm), and I realized that what I needed was a customized debugger. The functionality of the debugger is simple, it should be able to recognize the time when the server first starts and the time when the server reboots. If there is more than 10 minutes between these two times the customized debugger automatically breaks in to the server. The event callback interface IDebugEventCallbacks::SessionStatus and the client interface IDebugControl::SetInterrupt can meet my needs perfectly.

     

    It is not that difficult to build such a customized debugger, which I called DBGRobot. I would like to share some code snippets which you may find helpful when building a customized debugger for a special debugging scenario, or as the basis for building a more complicated debugging robot.

     

    First, we need to download and install the Windows Driver Kit Version 7.1.0. When installing the WDK be sure to select Debugging Tools for Windows.

     

     

     

    If you install the WDK to its default folder, which for version 7.1.0 is C:\WinDDK\7600.16385.1, the C:\WinDDK\7600.16385.1\Debuggers\sdk\samples folder will contain the sample code from the Debugger SDK. The dumpstk sample is the one particularly interesting to us. We can copy some common code from it, such as the out.cpp and out.hpp which is the implementation of the IDebugOutputCallbacks interface.

     

    Now let’s do some coding.  The common code is copied from the Debuggers SDK sample Dumpstk. I also listed it here for clarity.

     

    The first step is to create the IDebugClient, IDebugControl and IDebugSymbols interfaces (although IDebugSymbols is not used in this case). You need to call the DebugCreate() function to create the IDebugClient interface, and then use IDebugClient->QueryInterface() to query the IDebugControl and IDebugSymbols interfaces.

     

    void

    CreateInterfaces(void)

    {

        HRESULT Status;

     

        // Start things off by getting an initial interface from

        // the engine.  This can be any engine interface but is

        // generally IDebugClient as the client interface is

        // where sessions are started.

        if ((Status = DebugCreate(__uuidof(IDebugClient),

                                  (void**)&g_Client)) != S_OK)

        {

            Exit(1, "DebugCreate failed, 0x%X\n", Status);

        }

     

        // Query for some other interfaces that we'll need.

        if ((Status = g_Client->QueryInterface(__uuidof(IDebugControl),

                                               (void**)&g_Control)) != S_OK ||

            (Status = g_Client->QueryInterface(__uuidof(IDebugSymbols),

                                               (void**)&g_Symbols)) != S_OK)

        {

            Exit(1, "QueryInterface failed, 0x%X\n", Status);

        }

    }

     

    If you want to see the output from the debugging engine, you also need to implement the IDebugOutputCallbacks interface. The main function to be implemented is IDebugOutputCallbacks::Output(), which is quite simple as we only need to see the output in the command prompt stdout stream:

     

    STDMETHODIMP

    StdioOutputCallbacks::Output(

        THIS_

        IN ULONG Mask,

        IN PCSTR Text

        )

    {

        UNREFERENCED_PARAMETER(Mask);

        fputs(Text, stdout);

        return S_OK;

    }

     

    Here comes our main code logic: we need to implement the IDebugEventCallbacks interface and monitor the SessionStatus events. In order for the debugger engine to deliver the SessionStatus events to us we need to set the DEBUG_EVENT_SESSION_STATUS mask in IDebugEventCallbacks::GetInterestMask():

     

    STDMETHODIMP_(HRESULT)

    RobotEventCallBacks::GetInterestMask(

            THIS_

            OUT PULONG Mask

            )

    {

     

        // Here we monitor all the event types, although it is not necessary

        *Mask = DEBUG_EVENT_SESSION_STATUS | DEBUG_EVENT_EXCEPTION | DEBUG_EVENT_CHANGE_DEBUGGEE_STATE | DEBUG_EVENT_CHANGE_ENGINE_STATE;    

        return S_OK;

     

    }

     

    In the RobotEventCallBacks::SessionStatus() function, if we receive the DEBUG_SESSION_ACTIVE session status we know that Windows has started and connected to the debug session. Once the OS has started we create a new thread that waits for hEvent, an unnamed global event. In the new thread we send a timeout value for 10 minutes to the WaitForSingleObject() call.  The hEvent will be set when the session status is DEBUG_SESSION_END or DEBUG_SESSION_REBOOT, which means that the server rebooted and disconnected from the debug session. If WaitForSingleObject() returns WAIT_TIMEOUT, indicating that the server has not rebooted after 10 minutes, we send a break to the server via g_Control->SetInterrupt(DEBUG_INTERRUPT_ACTIVE).

     

    STDMETHODIMP_(HRESULT)

    RobotEventCallBacks::SessionStatus(

        THIS_

        IN ULONG Status

        )

    {

                   

        printf("Session Status changed, status = %d \n", Status);

     

        if (Status == DEBUG_SESSION_ACTIVE)

        {

                    //bSessionActive = TRUE;

                    ::CreateThread(NULL, 0, (LPTHREAD_START_ROUTINE)TimerThread, NULL, 0, NULL);

        }

        else if ((Status == DEBUG_SESSION_END) || (Status == DEBUG_SESSION_REBOOT) )

        {

                    //bSessionActive = FALSE;

                    SetEvent(hEvent);

     

        }

     

        return S_OK;

    }

     

    STDMETHODIMP_(HRESULT)

    RobotEventCallBacks::Exception(

          THIS_

          IN PEXCEPTION_RECORD64 Exception,

          IN ULONG FirstChance

          )    

    {

          printf("Captured exception 0x%x, FirstChance = %d", Exception->ExceptionCode, FirstChance);

          if (FirstChance && Exception->ExceptionCode == 0x80000003)

          {

                bInterrupted = TRUE;

          }

         

          return DEBUG_STATUS_NO_CHANGE;

    }

     

    void TimerThread()

    {

                    HRESULT Status;

                    DWORD result;

     

                    result = WaitForSingleObject(hEvent,10*60*1000);

     

                    if (result == WAIT_OBJECT_0) //the close event is signled within 10 minutes

                    {

                                    printf("Target reboots within 10 minutes \n");             

                                   

                    }

                    else if (result == WAIT_TIMEOUT)  //the close event is not singled within 10 minute, it means the server hangs

                    {

                        if ((Status = g_Control->SetInterrupt(DEBUG_INTERRUPT_ACTIVE)) !=S_OK)  //force the target to break

                        {

                                    Exit(1, "SetInterrupt failed, 0x%x\n", Status);

                        }

                                       

                    }

                   

    }

     

    Below is our main function. In the main function we need to create the interfaces, and set the callback interfaces g_Client->SetOutputCallbacks and g_Client->SetEventCallbacks. Then we call g_Control->WaitForEvent() in the while(TRUE) loop, which is a typical main loop for a debugger program. If WaitForEvent() returns it means that a debugging event happened, such an exception, and then you can send some debug commands via g_Control->Execute. Here we simply send a “.time” command to record the problematic time. The bInterrupted boolean is set in the RobotEventCallBacks::Exception() function when it detects a first chance 80000003 exception, which means a debugger break-in.

     

    Note: Initially I wanted to send a “.crash” command to crash the server and generate the memory dump.  When I tried this I found that although the “.crash” command did trigger a blue screen it failed to generate a memory dump because the problem happened so early in the boot process and the paging file was not initialized yet. So I changed it to just send a “.time” command and performed live debugging after the problem was captured by the tool. Initially I had forgotten to change the printf output, which is why you see the output “Succeeded in capturing the problem and crashing the server!” in the below screenshot.

     

    void __cdecl

    main(int argc, char* argv[])

    {

     

        HRESULT Status;

       

        hEvent = CreateEvent(NULL, FALSE, FALSE, NULL);  //create the global hEvent

     

        CreateInterfaces();  //create the global interfaces

     

        if ((Status = g_Client->SetOutputCallbacks(&g_OutputCb)) != S_OK)  //set the output callback interface

        {

            Exit(1, "SetOutputCallbacks failed, 0x%X\n", Status);

        }

     

        if ((Status = g_Client->SetEventCallbacks(&g_EventCb)) != S_OK)  //Set the event callback interface

        {

            Exit(1, "SetEventCallbacks failed, 0x%X\n", Status);

        }

     

     

     

        Status = g_Client->AttachKernel(DEBUG_ATTACH_KERNEL_CONNECTION, argv[1]); //Attach to the kernel debugging, the argv[1] should be “COM:Port=COM1,baud=115200”

     

        if (Status != S_OK)

        {

                    Exit(1, "AttachKernel failed, 0x%X\n", Status);              

     

        }

     

        while(TRUE)  //This is the main loop of the debugger program, it calls g_Control->WaitForEvent() repeatedly waiting for the debugging events

        {

     

                        Status = g_Control->WaitForEvent(DEBUG_WAIT_DEFAULT,

                                                              INFINITE);  //wait for 10 minutes

                                                             

                        printf("WaitForEvent returned. Status = 0x%x\n", Status);

                       

                       

                        if (Status == S_OK && bInterrupted)

                        {

                                    if ((Status = g_Control->Execute(DEBUG_OUTCTL_THIS_CLIENT, ".time", DEBUG_EXECUTE_ECHO)) !=S_OK)  //do .time to record the time the problem happens

                                        {

                                                    Exit(1, "Execute failed, 0x%x\n", Status);

                   

                                        }

                                       

                                    Exit(0, "Succeeded in capturing the problem!");

                        }

        }

       

        Exit(0, NULL);

    }

     

    Now that we have the code we can build DBGRobot.exe using the WDK 7.1.0 build environment. Remember to set the following environment variables, as mentioned in the readme.txt under \WinDDK\7600.16385.1\Debuggers\samples

     

    SET DBGLIB_LIB_PATH=C:\WinDDK\7600.16385.1\Debuggers\sdk\lib

    SET DBGSDK_INC_PATH=C:\WinDDK\7600.16385.1\Debuggers\sdk\inc

    SET DBGSDK_LIB_PATH=C:\WinDDK\7600.16385.1\Debuggers\sdk\lib

     

    Now I have the “robot” watching the problem for me. I set up the live kernel debugging environment as usual and then ran the DBGRobot.exe tool as the debugger instead of windbg.  I used the command: dbgrobot.exe com:port=COM1,baudrate=115200.

     

    This is the screenshot when the server starts and reboots normally:

    image

     

    I let the DBGRobot.exe run over night, and in the morning when I came back to the office, the “robot” had done its job perfectly and captured the problem for me! The problematic server was left in the break-in state so I just needed to launch windbg to attach to the server, and then I could do the debugging in windbg as normal.

    image

     

    The cause of the delay turned out to be a hardware problem, but the actual problem is not important here. I hope this article can give you some new ideas about debugging, and using the debug tools in new ways.

     

  • Ntdebugging Blog

    Understanding VSS in DPM

    • 1 Comments

    Data Protection Manager 2007 utilizes Windows’ Volume Shadow Copy Service to protect Data Sources on Production Servers, and to manage Recovery Points on the DPM server. This presentation provides step-by-step illustrations and explanations of how DPM uses VSS to backup a Protected Server’s file share and System State data sources, and then manage the resulting recovery points in DPM.

     

    Get Microsoft Silverlight

  • Ntdebugging Blog

    Uncovering How Workspaces Work in WinDbg

    • 1 Comments

    Author - Jason Epperly

    Workspaces have always been a little confusing to me. I knew how to bend them to do what I needed to get the job done, however they still remained a bit mysterious. Recently I decided to sort this out, just so I knew how they worked under the hood. But before I show you my investigation let's discuss the different types of workspaces. Windbg uses several built-in types including Base, User, Kernel, Remote, Processor Architecture, Per Dump, and Per Executable. It also uses named workspaces (or user defined workspaces). When you perform a particular type of debugging (e.g. live user-mode, post-mortem dump analysis etc.) these workspaces are combined into the final environment. Here's a diagram to illustrate the possible combination of workspaces.

    image

    • The green line is the case where WinDbg is used to open a dump file. In this scenario the Base workspace + per dump workspace is used. Note: per dump simply means each individual dump file opened gets its own workspace.
    • The blue line is the scenario where WinDbg is used to live debug a running application, using the Base+User-mode workspaces.
    • The orange line is an example of WinDbg used to perform a live kernel debug on an x86 machine. In this case windbg is using Base+Kernel+x86 workspaces.

    From the diagram you can see windbg typically uses a combination of two workspaces. While live kernel debugging it uses three workspaces.

    So what is in a workspace?

    • Session Information
      • All break point (bp's) and handling information for exceptions and events (sx settings).
      • All open source files.
      • All user-defined aliases.
    • Configuration settings
      • The symbol path.
      • The executable image path.
      • The source path.
      • The current source options that were set with l+, l- (Set Source Options).
      • Log file settings.
      • The COM or 1394 kernel connection settings, if the connection was started by using the graphical interface.
      • The most recent paths in each Open dialog box (except for the workspace file and text file paths, which are not saved).
      • The current .enable_unicode, .force_radix_output, and .enable_long_status settings.
    • WinDbg graphical interface.
      • The title of the WinDbg window
      • The Automatically Open Disassembly setting
      • The default font
      • The size and position of the WinDbg window on the desktop.
      • Which debugging information windows are open.
      • The size and position of each open window, including the window's size, its floating or docked status, whether it is tabbed with other windows, and all of the related settings in its shortcut menu.
      • The location of the pane boundary in the Debugger Command window and the word wrap setting in that window.
      • Whether the toolbar and status bar, and the individual toolbars on each debugging information window, are visible.
      • The customization of the Registers window.
      • The flags in the Calls window, Locals window, and Watch window.
      • The items that were viewed in the Watch window.
      • The cursor location in each Source window.

    All of these settings (except for the blue ones) are applied cumulatively (Base first, then the next workspace, etc). The blue items above are only loaded from the last workspace in the chain. To show this in action I created a simple walk through to illustrate the use of workspaces the debugger.

    First I opened windbg without the use of any command line options. When it opens in this dormant state (not attached to anything and has nothing opened) its using the Base workspace. If I don't change anything (e.g. window placement) I am not prompted with any workspace dialogs when I start debugging. However if I moved the debugger's main window to any location (we will call this position 1) followed by executing any of the highlighted operations below -

    image

    I am prompted with this dialog-

    image

    Choosing "Yes" on the dialog above integrates my changes into the "Base" workspace so window position 1 is now part of my Base workspace.

    Now I'm going to select "Open Executable" and browse to our old faithful target binary notepad.exe. Once the binary is opened, windbg uses Base+Notepad (per Executable file). Now I'll move the debugger's main window again (we will call this position 2) and choose the option Debug > Stop debugging. Because of the window location change, I am prompted with the following-

    image

    If I choose 'Yes', windbg will use window position 2 for anytime I open the notepad executable in the future. After closing the notepad.exe executable, windbg reverts back to using the Base workspace.

    This time I'll actually launch notepad (not from the debugger) and attach to the running notepad.exe process with the debugger. We are now at Base+User-mode. I moved the debugger window (new position 3), selected Debug > Stop Debugging and get prompted with this dialog-

    image

    Choosing "Yes" will store WinDbg window position 3 in the User-mode workspace. Once I have completed this step, Windbg is again using the Base workspace because we stopped debugging.

    To futher illustrate workspaces I'll attach a to a target Virtual Machine for Kernel Debugging but not break in. Windbg is using Base+Kernel now. I moved the window again and as soon as I break-in I get this dialog-

    image

    I chose 'No' on the dialog because I'm getting the hang of things. If I move the window again and type qd (quit and detach) to end the current kernel debug session, I will see this dialog-

    image

    So before we ended the session, we were at Base+Kernel+AMD64.

    Running through this exercise helped me understand why I usually create a named workspace, change all my settings and use the command line option -W to open my workspace. Hopefully this will clear up some of the complexities involved with workspaces. This is why the debugger help file recommends making all the changes you need at the lowest possible level (i.e. Base first, then the others).

    Hope this helps...

  • Ntdebugging Blog

    Updated Archive of the Debug Ninja’s Twitter Debug Tips

    • 1 Comments

    Every Wednesday (usually) I post a debug tip to our twitter page at https://twitter.com/#!/ntdebugging. This blog is an archive of these tips to allow our readers to find this information easily. I will update this blog periodically with the new tips; follow us on twitter if you want to see the tips as I post them.

    The goal of these tips is to share debug commands, and forms of commands (parameters, flags, etc) that my colleagues and I find useful. I hope you can add these commands to your toolkit and they will help you debug more efficiently.

    Tips:

    !thread/!process [address] e - on x64 will not show you the meaningless Args to Child information.

    .frame /c [FrameNumber] - sets context to specified stack frame. Provides more reliable information than .trap on x64.

    kn - Dumps call stack with frame numbers, easier than counting stacks for .frame.

    .frame /r [FrameNumber] - same as .frame /c, but shows registers without changing context.

    Note: With .frame /c or /r you can only trust the nonvolatile registers. See http://msdn.microsoft.com/en-us/library/9z1stfyw(VS.80).aspx for vol/nonvol regs.

    k=rbp rip FrameCount - Dumps call stack starting at rbp/rip on x64. Useful when the stack is corrupt. #debug ^DN

    .process/.thread /p /r [address] - sets new process context, sets .cache forcedecodeuser, and reloads user symbols. #debug ^DebugNinja

    !process [address] 17 - Sets the context for this command, avoids the need for .process to see user stacks. Try !process 0 17 #debug ^DN

    ~~[ThreadID]s - Changes threads in user mode. Use Thread ID number from output such as !locks. Ex: ~~[1bd4]s #debug ^DN

    runas /netonly /u:<account> windbg.exe - Launch windbg with domain account. Use when dbg computer isn't in domain and symbol server is. ^DN

    !heap -p -a <address> - Shows information about the heap block containing <address>, even if you aren't using pageheap. #debug ^DN

    ub - Unassembles starting at a location prior to your address. Accepts l<number> to specify how many instructions to go back. ub . l20 ^DN

    !stacks 2 [FilterString] - Finds kernel mode call stacks that contain the FilterString in a symbol. #debug ^DN

    !thread [address] 17 (or 1e on x64) - Sets context for this command, avoids the need for .thread/.process for user stacks. #debug ^DN

    .hh [Text] - Opens the debugger help. [Text] is the topic to lookup in the index. Example: .hh !pte   #debug ^DN

    ?? can dump structs using C++ style expressions. Ex: ??((nt!_KTHREAD*)(0xfffffa800ea43bb0))->ApcState #debug ^DN

    bp /t EThread - Sets a kernel mode breakpoint that only triggers when hit in the context of this thread. #debug ^DN

    bp /p EProcess - Sets a kernel mode breakpoint that only triggers when hit in the context of this process. #debug ^DN

    gc - If you run 'p' and hit a breakpoint, gc takes you where p would have gone if you had not hit the bp.  #debug ^DN

    gu - Go until the current function returns.  Effectively this unwinds one stack frame.  #debug #windbg ^DN

    pc - Steps through until the next 'call' instruction. Combine with other commands to find who returned your error> pc;p;r eax #debug ^DN

    pt - Steps through until the next 'ret' instruction. Similar to gu, but pt stops on the ret and gu stops after the ret. #debug ^DN

    .ignore_missing_pages 1 - supresses the error: "Page 2a49 not present in the dump file. Type ".hh dbgerr004" for details" #debug ^DN

    .exr -1 shows the most recent exception.  Useful in user dumps of crashes, especially for no execute crashes (NX/DEP). #debug ^DN

    wt - Trace calls until they return to the current address. More useful with -or to get return values. Use -l for depth. ^DN #debug

    .thread /w - Changes to the WOW64 32-bit context from 64-bit kernel mode. Wow64exts doesn't work in kernel mode. #debug ^DN

    ??sizeof(structure) - Gets the size of a structure, it's easier than counting. #debug ^DN

    sxe ld:module.dll - Enables an exception which will break into the debugger when module.dll is loaded. #debug ^DN

    vertarget - Shows OS version of the debug target. Also shows machine name, uptime, and session time (when the dump was taken). #debug ^DN

    !vm 1 - In a kernel debugger, shows basic information about memory usage. Available, committed, pagefile, pool, sysptes, etc. #debug ^DN

    .time - Shows session time (when dump was taken) and system uptime. In user mode shows process uptime, kernel/user time. #debug ^DN

    ba w size [address] - Break on write access only. Replace size with the num bytes you want to watch. Ex: ba w 4 005d5f10 #debug ^DN

    .bugcheck - Displays the bugcheck code of a blue screen crash. The format is more concise than !analyze.  #debug ^DN

    .process -i <address> - Make the process active and break into the debugger. Use in live kernel debugs to get into process context. ^DN

    .reload /f /o - Overwrites cached files in your downstream symbol store.  Useful when your store has corrupt pdbs. #debug ^DN

    ->* - Use with dt to dump pointers. Example: dt _EPROCESS [Address] ObjectTable->*

    !for_each_module s -a @#Base @#End "PTag" - Find the drivers using pool tag "PTag". #debug ^DN

    .unload [DllName] - Unloads the debug extension you didn't intend to load. Omit DllName to unload the last dll loaded. #debug ^DN

    !exqueue dumps the executive worker queues.  Use flags 7f to dump the worker threads and the queues. #debug ^DN

    lmvm <module> - Dumps information about the module.  Remember to use <module> and not <module.dll>. #debug ^DN

Page 21 of 25 (242 items) «1920212223»