• Ntdebugging Blog

    Determining the source of Bug Check 0x133 (DPC_WATCHDOG_VIOLATION) errors on Windows Server 2012

    • 2 Comments

    What is a bug check 0x133?

    Starting in Windows Server 2012, a DPC watchdog timer is enabled which will bug check a system if too much time is spent in DPC routines. This bug check was added to help identify drivers that are deadlocked or misbehaving.  The bug check is of type "DPC_WATCHDOG_VIOLATION" and has a code of 0x133.  (Windows 7 also included a DPC watchdog but by default, it only took action when a kernel debugger was attached to the system.)  A description of DPC routines can be found at http://msdn.microsoft.com/en-us/library/windows/hardware/ff544084(v=vs.85).aspx.

     

    The DPC_WATCHDOG_VIOLATION bug check can be triggered in two ways. First, if a single DPC exceeds a specified number of ticks, the system will stop with a 0x133 with parameter 1 of the bug check set to 0.  In this case, the system's time limit for a single DPC will be in parameter 3, and the number of ticks taken by this DPC will be in parameter 2.  Alternatively, if the system exceeds a larger timeout for time spent cumulatively in all DPCs since the IRQL was raised to DPC level, the system will stop with a 0x133 with parameter 1 set to 1.  Microsoft recommends that DPCs should not run longer than 100 microseconds and ISRs should not run longer than 25 microseconds; however, the actual timeout values on the system are set much higher.

     

    How to debug a 0x133 (0, …

    In the case of a stop 0x133 with the first parameter set to 0, the call stack should contain the offending driver.  For example, here is a debug of a 0x133 (0,…) kernel dump:

     

    0: kd> .bugcheck

    Bugcheck code 00000133

    Arguments 00000000`00000000 00000000`00000283 00000000`00000282 00000000`00000000  

     

    Per MSDN, we know that this DPC has run for 0x283 ticks, when the limit was 0x282.

     

    0: kd> k

    Child-SP          RetAddr           Call Site

    fffff803`08c18428 fffff803`098525df nt!KeBugCheckEx

    fffff803`08c18430 fffff803`09723f11 nt! ??::FNODOBFM::`string'+0x13ba4

    fffff803`08c184b0 fffff803`09724d98 nt!KeUpdateRunTime+0x51

    fffff803`08c184e0 fffff803`09634eba nt!KeUpdateTime+0x3f9

    fffff803`08c186d0 fffff803`096f24ae hal!HalpTimerClockInterrupt+0x86

    fffff803`08c18700 fffff803`0963dba2 nt!KiInterruptDispatchLBControl+0x1ce

    fffff803`08c18898 fffff803`096300d0 hal!HalpTscQueryCounter+0x2

    fffff803`08c188a0 fffff880`04be3409 hal!HalpTimerStallExecutionProcessor+0x131

    fffff803`08c18930 fffff880`011202ee ECHO!EchoEvtTimerFunc+0x7d                //Here is our driver, and we can see it calls into StallExecutionProcessor

    fffff803`08c18960 fffff803`097258b4 Wdf01000!FxTimer::TimerHandler+0x92

    fffff803`08c189a0 fffff803`09725ed5 nt!KiProcessExpiredTimerList+0x214

    fffff803`08c18ae0 fffff803`09725d88 nt!KiExpireTimerTable+0xa9

    fffff803`08c18b80 fffff803`0971fe76 nt!KiTimerExpiration+0xc8

    fffff803`08c18c30 fffff803`0972457a nt!KiRetireDpcList+0x1f6

    fffff803`08c18da0 00000000`00000000 nt!KiIdleLoop+0x5a

     

    Let’s unassemble the driver’s DPC routine and see what it is doing:

     

    0: kd> ub fffff880`04be3409

    ECHO!EchoEvtTimerFunc+0x54:

    fffff880`04be33e0 448b4320        mov     r8d,dword ptr[rbx+20h]

    fffff880`04be33e4 488b0d6d2a0000  mov     rcx,qword ptr [ECHO!WdfDriverGlobals (fffff880`04be5e58)]

    fffff880`04be33eb 4883631800      and     qword ptr [rbx+18h],0

    fffff880`04be33f0 488bd7          mov     rdx,rdi

    fffff880`04be33f3 ff150f260000    call    qword ptr [ECHO!WdfFunctions+0x838(fffff880`04be5a08)]

    fffff880`04be33f9 bbc0d40100      mov     ebx,1D4C0h

    fffff880`04be33fe b964000000      mov     ecx,64h

    fffff880`04be3403 ff15f70b0000    call    qword ptr[ECHO!_imp_KeStallExecutionProcessor (fffff880`04be4000)]   //It's calling KeStallExecutionProcessor with 0x64 (decimal 100) as the parameter

    0: kd> u fffff880`04be3409

    ECHO!EchoEvtTimerFunc+0x7d:

    fffff880`04be3409 4883eb01        sub     rbx,1

    fffff880`04be340d 75ef            jne     ECHO!EchoEvtTimerFunc+0x72 (fffff880`04be33fe)     //Here we can see it is jumping back to call KeStallExecutionProcessor in a loop

    fffff880`04be340f 488b5c2430      mov     rbx,qword ptr[rsp+30h]

    fffff880`04be3414 4883c420        add     rsp,20h

    fffff880`04be3418 5f              pop     rdi

    fffff880`04be3419 c3              ret

    fffff880`04be341a cc              int     3

    fffff880`04be341b cc              int     3

     

    0: kd> !pcr

    KPCR for Processor 0 at fffff80309974000:

        Major 1 Minor 1

          NtTib.ExceptionList: fffff80308c11000

              NtTib.StackBase: fffff80308c12080

             NtTib.StackLimit: 000000d70c7bf988

           NtTib.SubSystemTib: fffff80309974000

                NtTib.Version: 0000000009974180

            NtTib.UserPointer: fffff803099747f0

                NtTib.SelfTib: 000007f7ab80c000

     

                      SelfPcr: 0000000000000000

                         Prcb: fffff80309974180

                         Irql: 0000000000000000

                          IRR: 0000000000000000

                          IDR: 0000000000000000

                InterruptMode: 0000000000000000

                          IDT: 0000000000000000

                          GDT: 0000000000000000

                          TSS: 0000000000000000

     

                CurrentThread: fffff803099ce880

                   NextThread: fffffa800261cb00

                   IdleThread: fffff803099ce880

     

                    DpcQueue:  0xfffffa80020ce790 0xfffff880012e4e9c [Normal] NDIS!NdisReturnNetBufferLists

                               0xfffffa800185f118 0xfffff88000c0ca00 [Normal] ataport!AtaPortInitialize

                               0xfffff8030994fda0 0xfffff8030972bc30 [Normal] nt!KiBalanceSetManagerDeferredRoutine

                               0xfffffa8001dbc118 0xfffff88000c0ca00 [Normal] ataport!AtaPortInitialize

                               0xfffffa8002082300 0xfffff88001701df0 [Normal] USBPORT

     

    The !pcr output shows us queued DPCs for this processor. If you want to see more information about DPCs and the DPC Watchdog, you could dump the PRCB listed in the !pcr output like this:

     

    dt nt!_KPRCB fffff80309974180 Dpc*

     

    Often the driver will be calling into a function like KeStallExecutionProcessor in a loop, as in our example debug.  To resolve this problem, contact the driver vendor to request an updated driver version that spends less time in its DPC Routine.
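
    Based on the disassembly above, the offending callback boils down to a busy-wait loop. Here is a minimal C sketch of that pattern (a reconstruction for illustration, not the actual ECHO sample source): 0x1D4C0 (120,000) iterations of a 100 microsecond stall is roughly 12 seconds spent at DISPATCH_LEVEL, far beyond the 100 microsecond guideline.

    VOID
    EchoEvtTimerFunc(
        _In_ WDFTIMER Timer
        )
    {
        ULONG i;

        UNREFERENCED_PARAMETER(Timer);

        //
        // Busy-wait inside the timer DPC. Each iteration stalls the processor
        // for 100 microseconds, so the DPC monopolizes the CPU at DISPATCH_LEVEL
        // for roughly 12 seconds and trips the DPC watchdog.
        //
        for (i = 0; i < 0x1D4C0; i++) {
            KeStallExecutionProcessor(100);
        }
    }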

     

    How to troubleshoot a 0x133 (1, …

    Determining the cause of a stop 0x133 with a first parameter of 1 is a bit more difficult because the problem is a result of DPCs running from multiple drivers, so the call stack is insufficient to determine the culprit.  To troubleshoot this stop, first make sure that the NT Kernel Logger or Circular Kernel Context Logger ETW traces are enabled on the system.  (For directions on setting this up, see http://blogs.msdn.com/b/ntdebugging/archive/2009/12/11/test.aspx.)

     

    Once the logging is enabled and the system bug checks, dump out the list of ETW loggers using !wmitrace.strdump. Find the ID of the NT Kernel logger or the Circular logger.  You can then use !wmitrace.logsave (ID) (path to ETL) to write out the ETL log to a file.  Load it up with Windows Performance Analyzer and add the DPC or DPC/ISR Duration by Module, Function view (located in the Computation group) to your current analysis window:
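
    For example, the debugger session might look like this (the logger ID and output path are illustrative; use the values reported on your system):

    0: kd> !wmitrace.strdump

        ... note the Id of the NT Kernel Logger or Circular Kernel Context Logger session ...

    0: kd> !wmitrace.logsave <Id> c:\temp\watchdog.etl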

     

     

    Next, make sure the table is also shown by clicking the box in the upper right of the view:

     

     

    Ensure that the Address column is added on the left of the gold bar, then expand each address entry to see individual DPC enters/exits for each function.  Using this data, you can determine which DPC routines took the longest by looking at the inclusive duration column, which should be added to the right of the gold bar: 

     

    In this case, these DPCs took 1 second, which is well over the recommended maximum of 100 microseconds.  The module column (and possibly the function column, if you have symbols) will show which driver is responsible for that DPC routine.  Since our ECHO driver was based on WDF, that is the module named here.

     

    For an example of doing this type of analysis in xperf, see http://blogs.msdn.com/b/ntdebugging/archive/2008/04/03/windows-performance-toolkit-xperf.aspx.

     

    More Information

    For additional information about Stop 0x133 errors, see this page on MSDN: http://msdn.microsoft.com/en-us/library/windows/hardware/jj154556(v=vs.85).aspx.

     

    For DPC timing recommendations and for advice on capturing DPC timing information using tracelog, see http://msdn.microsoft.com/en-us/library/windows/hardware/ff545764(v=vs.85).aspx.

     

    Guidelines for writing DPC routines can be found at http://msdn.microsoft.com/en-us/library/windows/hardware/ff546551(v=vs.85).aspx.

     

     

    -Matt Burrough

  • Ntdebugging Blog

    Basics of Debugging Windows

    • 8 Comments

    Hello, this is East again. This blog post is about a topic that we always skip over when discussing debugging: what and where are the tools for debugging? I will touch on the different types of debuggers, loading symbols, and the basics of getting started with loading a dump under your preferred debugger.

    Microsoft currently offers four types of debugging tools. With these tools you can remote debug another machine over FireWire or a serial cable (USB also, but it may not work consistently), as well as debug user-mode processes and dump files.

    Command line debuggers:

    1) kd.exe: kernel debugger – used to review crash dumps created by a blue screen crash event or a stop error. (kd -z <location of dump> -y <location of symbols>)

    2) cdb.exe: user-mode debugger for reviewing applications, processes, and process dumps. (cdb -z <location of dump> -y <location of symbols>)

    3) ntsd.exe: CDB and NTSD are virtually identical, except that NTSD spawns a new text window when it is started, whereas CDB inherits the Command Prompt window from which it was invoked.  When I refer to "CDB", it applies to both CDB and NTSD.

    Graphical User Interface Debugger:

    4) Windbg.exe is a GUI-based debugger. It can debug the same things as KD and CDB using the same commands. Windbg gives you the ability to have multiple windows open simultaneously to review source code or other selectable items under the View menu.

    I like using windbg for all of my user and kernel debugging, while others I work with prefer kd for kernel debugging and cdb for user-mode debugging.

     

    There are 32-bit and 64-bit debuggers available.

    NOTE: Some people use Visual Studio as well, but this blog post will not cover using Visual Studio as a debugger.

     

    You can review applications that have already started on your machine using CDB or Windbg. You can also have the problematic application launch under the debugger:

    Cdb or Windbg:

    -p <pid> specifies the decimal process ID to attach to (use tlist or Task Manager to obtain the PID)

    -psn <name> specifies the process to attach to by service name

    <application to launch> -y <symbol path> launches the application under the debugger

    NOTE: windbg allows you to use menu options as well: select “Attach to a Process” on the File menu to debug a user-mode application that is currently running.
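
    For example (the PID, application, and symbol path are illustrative):

    cdb -p 1234 -y srv*c:\pubsymbols*http://msdl.microsoft.com/download/symbols

    windbg notepad.exe -y srv*c:\pubsymbols*http://msdl.microsoft.com/download/symbols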

     

    What are dumps?

    Memory dumps are a record of what was in memory and in the registers at the time of a crash. There are three types of memory dumps:

    NOTE: The type of dump that will be written upon bugcheck can be configured by right-clicking My Computer -> Properties -> Advanced tab -> Settings. In the Write debugging information section, use the first drop-down box to select what type of memory dump you want. (See KB307973.)

    • Mini dump – a subset of memory that is in use by the application creating the dump.

    A mini memory dump file is written to the %SystemRoot%\Minidump folder by default and is usually less than 1 MB in size.

    • Kernel only – used to review the machine’s kernel memory at the time of the crash.

    • Full/Complete – the largest kernel-mode dump file. It contains all information from the kernel and user-mode address spaces that was in physical memory at the time of the dump (about the same size as the physical memory on the box).

    Kernel and Complete Memory Dumps are written to %SystemRoot%\Memory.dmp by default.

    Note: You can configure the server to crash using certain keystrokes. This can be useful when troubleshooting a hung server or a timing issue; KB244139 explains how to configure your server for a manual crash.

    You can also create dump files from an application or process; these are known as user-mode dumps.  Additional information on these types of dumps can be found in the Debugging Tools for Windows help file.

     

    How do I read a dump file?

    In order to make fast progress with a memory dump file, it is best to load symbol files. Symbol files contain data that the debugger uses to interpret the application or driver code. They may contain:

    -  Global variable names

    -  Function names

    Private symbols would contain the above information and:

    -  Local variable names

    -  Source-line numbers

    -  Type information for variables, structures, etc.

    Microsoft currently has two ways you can access symbols for the operating system:

    Service pack download site – You will need to create:

    -  Separate directories for Windows 2000 RTM, Windows 2000 SP1, Windows 2000 SP2, Windows XP RTM, etc.

    -  Separate directories for all of the above for free vs. checked builds

    -  Separate directories for hotfix symbols

     

    Public symbol server – uses a symbol store, which is a collection of symbol files. The symbol server uses the time stamp and file size to match up symbols to the active binary.

    After getting your symbol files together, you will need a way to tell the debugger where they are located and set up some other options.

    To set the symbol path do one of the following:

    -  _NT_SYMBOL_PATH environment variable

    -  -y command line option

    -  .sympath (Set Symbol Path) debugger command

    -  WinDbg: File | Symbol File Path dialog, or CTRL+S

    To set the executable image path (needed for minidumps only), do one of the following:

    -  -i command line option

    -  .exepath debugger command

    -  WinDbg: File | Image File Path dialog, or CTRL+I

    To set the source path, do one of the following:

    -  .srcpath debugger command

    -  WinDbg: File | Source File Path dialog, or CTRL+P

    If symbol errors appear when you begin, you can try the commands below to help narrow down the problem:

    !sym noisy — gives verbose symbol information

    and

    .reload — reloads all symbols

     

    Also, using srv* in your symbol path tells the debugger to download the symbols being used and save them to a specific directory (a downstream store):

    srv*DownstreamStore*<symbol locations>

     

    NOTE: You must always use .reload after you change the symbol path or fix a symbol error — the debugger doesn’t automatically reload your symbols!
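
    For example, in the debugger (the downstream store path is illustrative):

    kd> .sympath srv*c:\pubsymbols*http://msdl.microsoft.com/download/symbols

    kd> .reload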

     

    Now that we are done with the overview, let’s configure our machine as a host computer to open a memory dump.  I will be using the Microsoft public symbol server, and I want to store the symbols I use locally on my host machine.

    Using windbg I will set my current workspace symbols to: srv*c:\pubsymbols*http://msdl.microsoft.com/download/symbols

    Click the menu option File -> Symbol File Path, or press Ctrl+S. This will bring up an empty box that will allow you to enter or browse to your symbol path.

    If using kd, you want to set an environment variable (_NT_SYMBOL_PATH) under My Computer -> Properties -> Advanced tab so the debugger always starts with your symbols set to “srv*c:\pubsymbols*http://msdl.microsoft.com/download/symbols”, or use this same path in your command line:

    kd -z <path to dump file> -y srv*c:\pubsymbols*http://msdl.microsoft.com/download/symbols

     

    NOTE: Windbg will append the path set by the _NT_SYMBOL_PATH environment variable to any workspace symbol path when loading a memory dump.

    OK, now we know which debugger we want to use and where our symbols are located. Let’s open our first kernel memory dump, located at <drive letter>\<path to dump file>.

    Using windbg, I will load a dump file using the menu option File -> Open Crash Dump (Ctrl+D), or by dragging the dump file into the debugger; you can even start windbg at the command prompt.  My command would look like this:

    windbg -z "C:\training\case 7f\MEMORY051308.22.DMP"

    I did not use -y for the symbol path, as it is already set in my default workspace or in my environment variable.

    When the debugger first loads a dump file it displays several lines of information before giving you a prompt to get started with your commands (by default):

    Microsoft (R) Windows Debugger Version 6.9.0003.113 X86  <-- debugger version

    Copyright (c) Microsoft Corporation. All rights reserved. <-- copyright of the debugger creator

    Loading Dump File [C:\training\case 7f\MEMORY051308.22.DMP] <-- location of the dump file being loaded

    Kernel Summary Dump File: Only kernel address space is available <-- type of memory dump (mini, kernel, or full)

    Symbol search path is: srv*c:\pubsymbols*http://msdl.microsoft.com/download/symbols <-- symbol path for this debug session

    Executable search path is:  <-- points to the directory where the executable files are located. For most situations this is not needed; for other situations please check the debugger help file.

     

    The next four lines show the OS version, service pack, and how many processors are on the box:

    1 - Windows Server 2003 Kernel Version 3790 (Service Pack 2) MP (8 procs) Free x86 compatible

    2 - Product: Server, suite: Enterprise TerminalServer SingleUserTS

    3 - Built by: 3790.srv03_sp2_gdr.070304-2240

    4 - Kernel base = 0x80800000 PsLoadedModuleList = 0x808a6ea8

     

    Next we would see when the machine crashed and how long it was up prior to this crash:

    Debug session time: Wed May 14 01:27:36.768 2008 (GMT-4)

    System Uptime: 0 days 16:32:51.921

     

    After completing the above process, the debugger starts loading the dump file and parsing through the loaded symbols. Here you may notice some warnings for user-space processes which are not included in the kernel dump. This is OK.

    WARNING: Process directory table base BFF0A080 doesn't match CR3 007AF000

    WARNING: Process directory table base BFF0A080 doesn't match CR3 007AF000

    Loading Kernel Symbols

    ...........................................................................................................................................

    Loading User Symbols

    PEB is paged out (Peb.Ldr = 7ffdf00c).  Type ".hh dbgerr001" for details

    Loading unloaded module list

    *******************************************************************************

    *                                                                             *

    *                        Bugcheck Analysis                                    *

    *                                                                             *

    *******************************************************************************

     

    1 - Use !analyze -v to get detailed debugging information.

    2 - BugCheck 7F, {8, f773ffe0, 0, 0}

    3 - *** ERROR: Module load completed but symbols could not be loaded for ql2300.sys

    The three things I want to point out from above are:

    1 - !analyze -v: This is the debugger command used to help analyze a dump file by reviewing information passed to KeBugCheck, including the specific parameters of that crash. It will analyze this information and provide a definition of the bugcheck, a stack showing all current function calls, and, when possible, the name of an offending driver or process that the debugger thinks is at fault.  Please review the debugger help file for additional information in this area.

    2 – The type of bugcheck that occurred on the machine.

    3 – An error telling you about symbols that are missing or not available to help diagnose a particular driver or application. This can lead to a misdiagnosis if you’re not careful.

    Once loading is completed you should be at a kd> prompt. This prompt shows you the current processor you are using (if the machine has more than one).

    For this dump we are at processor 3 on an 8 proc machine:

    3: kd>

     

    To view the current crash stack location you can use the "k" command. There are multiple forms of this command, each one showing the basic stack plus additional information. As functions are executed and call other functions, a call stack is created in stack memory. Here are two common commands to view the stack:

     

    3: kd> k

    ChildEBP RetAddr

    00000000 baebf0ce nt!KiTrap08+0x75

    b3a4bffc baebf737 storport!RaCallMiniportInterrupt+0x2

    b3a4c008 8088d889 storport!RaidpAdapterInterruptRoutine+0x1d

    b3a4c008 80a59d8e nt!KiInterruptDispatch+0x49

    b3a4c09c 80a5c2fc hal!HalpGenerateInterrupt+0x1d2

    b3a4c0c0 80a5c44d hal!HalpLowerIrqlHardwareInterrupts+0x108

    b3a4c0d0 808256ed hal!KfLowerIrql+0x59

    <snippet>

     

    3: kd> kb

    ChildEBP RetAddr  Args to Child

    00000000 baebf0ce 00000000 00000000 00000000 nt!KiTrap08+0x75

    b3a4bffc baebf737 97bedb88 b3a4c02c 8088d889 storport!RaCallMiniportInterrupt+0x2

    b3a4c008 8088d889 977b9e18 97bedad0 03010006 storport!RaidpAdapterInterruptRoutine+0x1d

    b3a4c008 80a59d8e 977b9e18 97bedad0 03010006 nt!KiInterruptDispatch+0x49

    b3a4c09c 80a5c2fc 97797004 97bedad0 00000102 hal!HalpGenerateInterrupt+0x1d2

    b3a4c0c0 80a5c44d 00000101 977b9e02 b3a4c0d8 hal!HalpLowerIrqlHardwareInterrupts+0x108

    b3a4c0d0 808256ed b3a4c0e8 baebf1c6 977b9bb0 hal!KfLowerIrql+0x59

    <snippet>

     

    Either one can be used, depending on how much information you want to see.

    This completes the Basics of Debugging Windows, Part I. I will create a Part II using specific questions gathered from our readers.

     

    Miscellaneous information:

    To go further with this topic I would suggest starting with the debugger help file included with the Microsoft Debugging Tools. 

    ADPlus – An automated way to use cdb.exe to capture/create a user-mode dump when a process hangs or crashes. (More info: http://msdn.microsoft.com/en-us/library/cc265629.aspx or KB286350.)

    Public Symbols for Microsoft Operating Systems:

    Microsoft public symbol server: srv*<DownstreamStore>*http://msdl.microsoft.com/download/symbols

    example: srv*c:\mysyms*http://msdl.microsoft.com/download/symbols

     Microsoft Symbol packages http://www.microsoft.com/whdc/devtools/debugging/symbolpkg.mspx#d

    Use !analyze -v to gather additional information about the bugcheck and a bucket-id for your dump file. The bucket-id can be submitted to Microsoft to check for similar crashes and resolutions. Try using Microsoft Online Crash Analysis to submit your crash dump bucket-id for possible follow-up from Microsoft, or for Microsoft to look for trends: http://oca.microsoft.com/en/Welcome.aspx

    For concepts, tools and information about the system architecture:

    http://msdn.microsoft.com/en-us/default.aspx

    Windows Internals, 4th Edition (by Mark E. Russinovich & David A. Solomon) – the whole book, or Chapter 14 - Crash Dump Analysis

    Advanced Windows Debugging (by Mario Hewardt & Daniel Pravat)

    http://technet.microsoft.com/en-us/default.aspx

  • Ntdebugging Blog

    Video: Using XPERF to root cause CPU consumption

    • 4 Comments

    Hi Debuggers,

     

    Here is a short video showing how to root cause CPU consumption issues on Vista/Server 2008 and higher operating systems with Xperf.

     

    For this scenario, Xperf is an easy-to-implement and powerful replacement for previous approaches like gathering a series of dumps during the CPU consumption, kernrate data, perfmon, other profilers, etc.

     

    Xperf allows one to quickly turn on profiling and see where the CPU time is being spent by using tracing already in the box!

     

     

    -Tate

     

     

    Note - Feel free to post questions here, or jump into a discussion about the video on our Twitter page ( @ntdebugging ).

     

     

  • Ntdebugging Blog

    Configuring a Hyper-V VM For Kernel Debugging

    • 2 Comments

    Yesterday's blog prompted some questions about how to set up a debugger for a Windows OS running in a Hyper-V VM.  I was surprised that I wasn't able to find good, publicly available, Microsoft-issued documentation for this configuration.

     

    The first step is to configure the Windows OS in the VM to enable a kernel debugger on COM1.  You would use these same steps if you were preparing the OS to be debugged over a null modem cable.  Hyper-V will allow us to redirect the COM port so that we don't need such a cable.

    1. Start an administrative command prompt.
    2. Turn on debugging with this command:
      bcdedit /debug on
    3. Configure the debugger to use COM1 with this command:
      bcdedit /dbgsettings SERIAL DEBUGPORT:1 BAUDRATE:115200
      Note that these are the default settings and already exist in most BCD stores.  However, setting them again won't damage anything, and it guards against a situation where the dbgsettings have been previously modified.
    4. Reboot so that the boot loader can read the new settings and configure the OS for debugging.

    CommandPrompt
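
    After the reboot you can confirm the configuration from an administrative command prompt (a quick sanity check; exact output varies by system):

      bcdedit /dbgsettings

    This should report debugtype Serial, debugport 1, and baudrate 115200.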

     

    Next, configure Hyper-V to redirect the COM1 port to a named pipe.  We will use this pipe in place of a traditional null modem cable.

    1. Open Hyper-V Manager and browse to the settings page of the VM you configured to debug.
    2. Under the Hardware list choose COM 1.
    3. Change the Attachment to 'Named pipe:' and provide a pipe name.
      1. Note that the Hyper-V Manager provides the complete path to your named pipe.  Make a note of this path as you will need it in the next step.

    Settings2

     

    After the OS and the VM are configured for debugging, we need to connect a debugger.

    1. On the Hyper-V parent partition download and install the Debugging Tools for Windows from http://msdn.microsoft.com/en-us/windows/hardware/gg463009.
    2. After installing the debugging tools you will have a ‘Debugging Tools for Windows’ entry in your start menu.
      1. From this folder right click ‘WinDbg’ and choose ‘Run as administrator’.  Windbg needs administrative rights to connect to the pipe.
    3. In windbg open the File menu and choose ‘Kernel Debug’.
    4. Enter a Baud Rate of 115200, to match the settings made in the VM.
    5. Enter the Port that you configured in the VM settings page.
      1. To connect to the pipe remotely, substitute the '.' in the path with the Hyper-V server name.
    6. Ensure that the Pipe and Reconnect boxes are checked.
    7. Set Resets to 0.
    8. Click OK to start debugging.
    9. Windbg should display the string ' Waiting to reconnect...'

    image004
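
    As an alternative to the dialog shown above, the same kernel connection can be started from an elevated command prompt (the pipe name is whatever you chose in the VM settings; substitute the Hyper-V server name for '.' to connect remotely):

    windbg -k com:pipe,port=\\.\pipe\<pipe name>,resets=0,reconnect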

     

    To test the debugger connection in windbg, from the ‘Debug’ menu choose ‘Break’.  This should cause the server to break into the debugger and display a kd> prompt.  Please note that breaking into the debugger will cause the OS running in the VM to halt until you tell the debugger to go; the OS will appear to be hung during this time.  The command 'g' followed by Enter tells the debugger to ‘go’, causing the VM to resume operation.

     

    Windbg

  • Ntdebugging Blog

    What Is In A RHS Dump File Created By Windows Error Reporting

    • 2 Comments

    Hello all, East here.  I wanted to give you a hint on how to use an RHS dump to find which thread was part of a Windows Server 2008 R2 cluster RHS recovery deadlock.

     

    First, let me start off by letting you know that Windows Server 2008 R2 will create two types of user-mode dumps:

    1 - A heap dump (.hdmp) - an extended mini dump that contains additional data, such as the process memory.

    2 - A mini dump (.mdmp) - contains mainly stack information, which is not enough to take the investigation much further.

     

    When a Windows Server 2008 R2 cluster has a resource that does not respond to one of the main cluster queries in a timely fashion, RHS will try to recover that resource by restarting it in an isolated process as well as creating a WER report and a user-mode dump of the deadlock.

     

    NOTE: More info about RHS and WER reports can be seen in the following blog - http://blogs.msdn.com/b/clustering/archive/2009/06/27/9806160.aspx

     

    When a deadlock is detected the cluster log will show:

    "00000dd4.00000aac::2011/05/13-16:12:39.559 ERR   [RHS] WER report is submitted. Result : WerReportQueued."

     

    You should find the WER report file and dumps located at the default location:

    <SystemDrive>\Users\All Users\Microsoft\Windows\WER\ReportQueue

     

    Here is what a cluster-created Windows Error Report looks like:

    Version=1

    EventType=WSFC Resource Deadlock

    EventTime=129497767591366327

    ReportType=1

    Consent=1

    ReportIdentifier=d2277837-7d7b-11e0-b5d0-00155d06402c

    Response.type=4

    Sig[0].Name=Resource Name

    Sig[0].Value=New flexdisk

    Sig[1].Name=Resource Type

    Sig[1].Value=flexdisk

    Sig[2].Name=Call Type

    Sig[2].Value=ISALIVE

    DynamicSig[1].Name=OS Version

    DynamicSig[1].Value=6.1.7601.2.1.0.274.10

    DynamicSig[2].Name=Locale ID

    DynamicSig[2].Value=1033

    File[0].CabName=memory.hdmp

    File[0].Path=WERC1F2.tmp.hdmp <----------------------------- Name of the dump file   

    File[0].Flags=2097152

    File[0].Type=3

    File[0].Original.Path=C:\Windows\Temp\WERC1F2.tmp.hdmp

    File[1].CabName=minidump.mdmp

    File[1].Path=WERC2CE.tmp.mdmp

    File[1].Flags=2162690

    File[1].Type=2

    File[1].Original.Path=C:\Windows\Temp\WERC2CE.tmp.mdmp

    FriendlyEventName=WSFC Resource Deadlock

    ConsentKey=WSFC Resource Host Monitor

    AppName=Failover Cluster Resource Host Subsystem <--------- The application that had an issue - RHS

    AppPath=C:\Windows\Cluster\rhs.exe

    ReportDescription=WSFC Resource Deadlock

     

    Now you have the area to look for in the cluster log, and an idea of which resource caused RHS to deadlock, and what it was doing at the time:

     

    The Resource Name = New flexdisk

    The Resource Type = flexdisk

    The type of call the resource was engaged in = ISALIVE

     

    You have all this information in the cluster log as well:

    00000dd4.00000aac::2011/05/13-16:12:39.136 ERR   [RHS] RhsCall::DeadlockMonitor: Call ISALIVE timed out for resource 'New flexdisk'.  RHS deadlock during an Isalive check for the New flexdisk resource.

    00000dd4.00000aac::2011/05/13-16:12:39.136 INFO  [RHS] Enabling RHS termination watchdog with timeout 1200000 and recovery action 3.

    00000dd4.00000aac::2011/05/13-16:12:39.136 ERR   [RHS] Resource New flexdisk handling deadlock. Cleaning current operation and terminating RHS process.

    00000dd4.00000aac::2011/05/13-16:12:39.136 ERR   [RHS] About to send WER report.

    000004dc.00000fb4::2011/05/13-16:12:39.136 WARN  [RCM] HandleMonitorReply: FAILURENOTIFICATION for 'New flexdisk', gen(3) result 4.

    000004dc.00000fb4::2011/05/13-16:12:39.136 INFO  [RCM] rcm::RcmResource::HandleMonitorReply: Resource 'New flexdisk' consecutive failure count 1.

    00000dd4.00000aac::2011/05/13-16:12:39.559 ERR   [RHS] WER report is submitted. Result : WerReportQueued.  WER report and user-mode dump created

    000004dc.00000fb4::2011/05/13-16:12:39.559 ERR   [RCM] rcm::RcmMonitor::RecoverProcess: Recovering monitor process 3540 / 0xdd4

    000004dc.00000fb4::2011/05/13-16:12:39.575 INFO  [RCM] Created monitor process 3280 / 0xcd0

    00000cd0.00000200::2011/05/13-16:12:39.575 INFO  [RHS] Initializing.

    000004dc.00000fb4::2011/05/13-16:12:39.591 INFO  [RCM] rcm::RcmResource::ReattachToMonitorProcess: (New flexdisk, OfflineCallIssued)

    000004dc.00000fb4::2011/05/13-16:12:39.591 WARN  [RCM] Canceling pending control GET_RO_PRIVATE_PROPERTIES for resource 'New flexdisk' due to monitor crash.

     

    Now if you want to get adventurous and review the dump file to see the call stacks of RHS and what the resource was doing, you will look at the WER dump with the extension “.hdmp”.  Since we are looking at a process with multiple threads, we need to find the one that is showing the WER reporting calls and the one showing the RHS deadlock.

     

    After opening the dump under your debugger, run the following command: “~*k”.  In the example below, thread 3 is the one we are interested in. You can tell because its process and thread ID (dd4.aac) match the cluster log above.  If you don’t have a cluster log, this is the thread with wer.dll on the stack.

    0:000> ~*k

     

    .  0  Id: dd4.9d4 Suspend: 0 Teb: 000007ff`fffdd000 Unfrozen

    Child-SP          RetAddr           Call Site

    00000000`001cf128 000007fe`fd611420 ntdll!ZwWaitForMultipleObjects+0xa

    00000000`001cf130 00000000`772c1220 KERNELBASE!WaitForMultipleObjectsEx+0xe8

    00000000`001cf230 00000000`ff763d53 kernel32!WaitForMultipleObjects+0xb0

    00000000`001cf2c0 00000000`ff74802c rhs!cxl::WaitHandle::WaitAny+0x3f

    00000000`001cf3b0 00000000`ff7458a1 rhs!Rhs::WaitForTermination+0x74

    00000000`001cf450 00000000`ff769099 rhs!wmain+0x4b1

    00000000`001cfbd0 00000000`772c652d rhs!__wmainCRTStartup+0x13d

    00000000`001cfc10 00000000`774fc521 kernel32!BaseThreadInitThunk+0xd

    00000000`001cfc40 00000000`00000000 ntdll!RtlUserThreadStart+0x1d

     

       1  Id: dd4.d0c Suspend: 0 Teb: 000007ff`fffdb000 Unfrozen

    Child-SP          RetAddr           Call Site

    00000000`00c3fa98 000007fe`fd611420 ntdll!ZwWaitForMultipleObjects+0xa

    00000000`00c3faa0 00000000`772c1220 KERNELBASE!WaitForMultipleObjectsEx+0xe8

    00000000`00c3fba0 00000000`ff763d53 kernel32!WaitForMultipleObjects+0xb0

    00000000`00c3fc30 00000000`ff75785c rhs!cxl::WaitHandle::WaitAny+0x3f

    00000000`00c3fd20 00000000`ff766988 rhs!PingAliveCallScheduler::Run+0x184

    00000000`00c3fda0 00000000`772c652d rhs!cxl::Thread::ThreadStartRoutine+0x18

    00000000`00c3fdf0 00000000`774fc521 kernel32!BaseThreadInitThunk+0xd

    00000000`00c3fe20 00000000`00000000 ntdll!RtlUserThreadStart+0x1d

     

       2  Id: dd4.e78 Suspend: 0 Teb: 000007ff`fffd9000 Unfrozen

    Child-SP          RetAddr           Call Site

    00000000`00baf998 00000000`774eb007 ntdll!ZwWaitForMultipleObjects+0xa

    00000000`00baf9a0 00000000`772c652d ntdll!TppWaiterpThread+0x14d

    00000000`00bafc40 00000000`774fc521 kernel32!BaseThreadInitThunk+0xd

    00000000`00bafc70 00000000`00000000 ntdll!RtlUserThreadStart+0x1d

     

       3  Id: dd4.aac Suspend: 0 Teb: 000007ff`fffd7000 Unfrozen

    Child-SP          RetAddr           Call Site

    00000000`00cbe5e8 000007fe`fd611420 ntdll!ZwWaitForMultipleObjects+0xa

    00000000`00cbe5f0 00000000`772d2d53 KERNELBASE!WaitForMultipleObjectsEx+0xe8

    00000000`00cbe6f0 00000000`773e8f7d kernel32!WaitForMultipleObjectsExImplementation+0xb3

    00000000`00cbe780 00000000`773e62b2 user32!RealMsgWaitForMultipleObjectsEx+0x12a

    00000000`00cbe820 00000000`773e62e0 user32!MsgWaitForMultipleObjectsEx+0x46

    00000000`00cbe860 000007fe`f6f3c35c user32!MsgWaitForMultipleObjects+0x20

    00000000`00cbe8a0 000007fe`f6f6dbd0 wer!UtilMsgWaitForMultipleObjects+0x54

    00000000`00cbe930 000007fe`f6f54158 wer!CReportManager::ReportProblemOutOfProcess+0x5c4

    00000000`00cbecc0 000007fe`f6f3b99b wer!CReportManager::ReportProblem+0x1eb

    00000000`00cbede0 000007fe`f6f61cbe wer!CReportHandleInstance::SubmitReport+0x386

    00000000`00cbee50 00000000`ff750dcf wer!WerReportSubmit+0x16a

    00000000`00cbeef0 00000000`ff75117b rhs!RhsResource::WERReportDeadlock+0x43f

    00000000`00cbfaf0 00000000`ff74b270 rhs!RhsResource::HandleDeadlock+0x187

    00000000`00cbfb60 00000000`772c652d rhs!RhsCall::DeadlockMonitor+0x324

    00000000`00cbfc60 00000000`774fc521 kernel32!BaseThreadInitThunk+0xd

    00000000`00cbfc90 00000000`00000000 ntdll!RtlUserThreadStart+0x1d

     

       4  Id: dd4.e48 Suspend: 0 Teb: 000007ff`fffae000 Unfrozen

    Child-SP          RetAddr           Call Site

    00000000`00e2f0f8 000007fe`fd611203 ntdll!ZwDelayExecution+0xa

    00000000`00e2f100 000007fe`f78f1ad9 KERNELBASE!SleepEx+0xab

    00000000`00e2f1a0 00000000`00326aa0 FlexRes+0x11ad9

    00000000`00e2f1a8 00000000`00000000 0x326aa0

     

    <snipped to save space>

     

    We will switch context to thread 3 by using the “~3s” command, and follow that with “kn” to show the stack with frame numbers.  Debug tip: since we already know that its thread id is “aac”, we can also switch directly to thread 3 by using its thread id in the following command syntax: “~~[aac]s”

    0:003> kn

    # Child-SP          RetAddr           Call Site

    00 00000000`00cbe5e8 000007fe`fd611420 ntdll!NtWaitForMultipleObjects+0xa

    01 00000000`00cbe5f0 00000000`772d2d53 KERNELBASE!WaitForMultipleObjectsEx+0xe8

    02 00000000`00cbe6f0 00000000`773e8f7d kernel32!WaitForMultipleObjectsExImplementation+0xb3

    03 00000000`00cbe780 00000000`773e62b2 user32!RealMsgWaitForMultipleObjectsEx+0x12a

    04 00000000`00cbe820 00000000`773e62e0 user32!MsgWaitForMultipleObjectsEx+0x46

    05 00000000`00cbe860 000007fe`f6f3c35c user32!MsgWaitForMultipleObjects+0x20

    06 00000000`00cbe8a0 000007fe`f6f6dbd0 wer!UtilMsgWaitForMultipleObjects+0x54

    07 00000000`00cbe930 000007fe`f6f54158 wer!CReportManager::ReportProblemOutOfProcess+0x5c4

    08 00000000`00cbecc0 000007fe`f6f3b99b wer!CReportManager::ReportProblem+0x1eb

    09 00000000`00cbede0 000007fe`f6f61cbe wer!CReportHandleInstance::SubmitReport+0x386

    0a 00000000`00cbee50 00000000`ff750dcf wer!WerReportSubmit+0x16a

    0b 00000000`00cbeef0 00000000`ff75117b rhs!RhsResource::WERReportDeadlock+0x43f

    0c 00000000`00cbfaf0 00000000`ff74b270 rhs!RhsResource::HandleDeadlock+0x187

    0d 00000000`00cbfb60 00000000`772c652d rhs!RhsCall::DeadlockMonitor+0x324

    0e 00000000`00cbfc60 00000000`774fc521 kernel32!BaseThreadInitThunk+0xd

    0f 00000000`00cbfc90 00000000`00000000 ntdll!RtlUserThreadStart+0x1d

     

    The frame of concern is frame d. You can either set the context (/c) or just display the registers (/r) for that frame:

    0:003> .frame /r d

    0d 00000000`00cbfb60 00000000`772c652d rhs!RhsCall::DeadlockMonitor+0x324

    rax=000007fffffd7000 rbx=0000000000346920 rcx=00000000007316c0

    rdx=0000000000000000 rsi=000000000035f8b0 rdi=000000000035f8b0

    rip=00000000ff74b270 rsp=0000000000cbfb60 rbp=0000000000000000

    r8=00000000ffffffff  r9=00000000000004ff r10=0000000000000000

    r11=0000000000cbe810 r12=0000000000000000 r13=0000000000000000

    r14=0000000000000000 r15=000000004dcd5877

    iopl=0         nv up ei pl zr na po nc

    cs=0033  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000246

    rhs!RhsCall::DeadlockMonitor+0x324:

    00000000`ff74b270 448b442430      mov     r8d,dword ptr [rsp+30h] ss:00000000`00cbfb90=ffffffff

     

    Here is where we get into tricky territory about where to look. I know from experience that the RDI register from above is going to end up containing my information for this resource. Another way to verify you are in the right area: dumping RDI will show the CallType.

     

    Dump that address out to find the thread containing the stack of interest:

    0:003> dqs 000000000035f8b0

    00000000`0035f8b0  00000000`ff704eb8 rhs!IsAliveCall::`vftable' == CallType

    00000000`0035f8b8  00000000`00000004

    00000000`0035f8c0  00000000`001e7da0

    00000000`0035f8c8  00000000`00000e48 <-- Thread id of concerning resource

    00000000`0035f8d0  00000000`00346920

    00000000`0035f8d8  0000000c`00000001

    00000000`0035f8e0  00000000`00000003

    00000000`0035f8e8  00000000`4dcd5876

    00000000`0035f8f0  00000000`048d2dd3

    00000000`0035f8f8  00000000`00000100

    00000000`0035f900  00000001`00000100

    00000000`0035f908  00000000`00000000

    00000000`0035f910  00430000`00000031

    00000000`0035f918  8a0000a4`52e48988

    00000000`0035f920  004c0041`00530049

    00000000`0035f928  00450045`00560049

     

    If we scan our previous “~*k” output, we should find that the thread id above matches the thread running the problem resource, FlexRes:

       4  Id: dd4.e48 Suspend: 0 Teb: 000007ff`fffae000 Unfrozen

    Child-SP          RetAddr           Call Site

    00000000`00e2f0f8 000007fe`fd611203 ntdll!ZwDelayExecution+0xa

    00000000`00e2f100 000007fe`f78f1ad9 KERNELBASE!SleepEx+0xab

    00000000`00e2f1a0 00000000`00326aa0 FlexRes+0x11ad9

    00000000`00e2f1a8 00000000`00000000 0x326aa0

     

    In this case I was using an in-house test app that just went to sleep instead of responding to the LooksAlive() check, which caused IsAlive() to fail.

     

    Also, if you check the cluster log 5 minutes prior to the RhsCall::DeadlockMonitor error, you will see the resource and thread id as well, showing what was going on prior to the deadlock:

    00000dd4.00000e48::2011/05/13-16:07:37.270 INFO  [RES] flexdisk <New flexdisk>: Set new entry point  = ISALIVEERROR:1722,ISALIVEFAILTYPE:1,ISALIVEHANG:1 Error = 0.

    000004dc.00000fd4::2011/05/13-16:07:37.286 WARN  [RCM] ResourceControl(SET_PRIVATE_PROPERTIES) to New flexdisk returned 5024.

    00000dd4.00000e48::2011/05/13-16:07:38.023 INFO  [RES] flexdisk <New flexdisk>: Set new entry point  = ISALIVEERROR:1722,ISALIVEFAILTYPE:1,ISALIVEHANG:0 Error = 0.

    00000dd4.00000e48::2011/05/13-16:07:38.023 INFO  [RES] flexdisk <New flexdisk>: Going to Hang in IsAlive.

     

    In most cases the cluster log can provide enough information to help you narrow down possible cluster resource issues.  If not, the user-mode dump may be able to help verify additional data, especially if you are debugging a third-party resource DLL that is deadlocking RHS.

  • Ntdebugging Blog

    Don't Believe Everything You Read

    • 3 Comments

    Recently, I was contacted by a customer who was advised by an ISV to set a registry value under one of the sub keys in HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\.  Let's call it UseQuantumComputing = 1 (value name has been changed to protect the ISV).  The customer wanted to know what this value actually did and no one could find any documentation explaining it.  These issues often come to our team because we have access to the Windows source code. I did a bit of code review to find out what this value does.  As it turns out, nowhere in Windows source code between Windows 2000 and Windows Server 2012 do we ever check for or set UseQuantumComputing.

     

    I can think of a few reasons the ISV would suggest setting this value.  Perhaps they were under the impression this did something but got confused about the value name.  It's possible they hoped making a registry change would have a placebo effect.  Or, perhaps their software actually checks this value, not Windows.

     

    The latter of these possibilities is probably the worst case scenario. An ISV should not create a registry value inside keys that are reserved for Windows’ own internal use. Why? The first reason is that there's no guarantee that Microsoft won't end up coincidentally using that same value name later. This would cause a conflict between the two users of the value.  Second, we have to consider the scenario where two different ISVs both decide to use the same value. That would be bad too.  Lastly, there's no guarantee that the key in use will still exist in later versions, or that it will be writeable or readable by the ISV due to permission changes.  In addition to all these reasons, there is the common sense issue that it is just confusing. Now the ISV's software and uninstaller need to look all over the registry, not just in their own keys.

     

    On a similar note, I also recently had a case where a "Windows Tips" blog (not created, endorsed, or run by Microsoft) suggested using a registry value that was implemented in Windows but was not documented by Microsoft.  It turns out this value wasn't thoroughly tested (because it was undocumented and wasn't intended to be used in production), and using it would cause server hangs under certain conditions.  These hangs were only discovered after a large customer decided to implement the undocumented value across their enterprise. 

     

    Here are a few tips for IT Pros, developers, and users alike:

    • Don't implement random registry settings if you can't find documentation for that setting on an official Microsoft site, like MSDN, TechNet, or support.microsoft.com (information on forums or answer boards (e.g. social.*.microsoft.com or answers.*.microsoft.com) is not official documentation).  At best these unknown registry settings will do nothing; at worst they will cause you headaches later.
    • If a key/value isn't documented, changes to it likely are not tested, and could put your machine in a state that makes it difficult or impossible to support.
    • If you are a developer, keep any of your registry settings in your own key; don't pollute other components' keys (see the sketch after this list).
    • If an ISV or Microsoft suggests you implement a setting, make sure you understand the implications of that setting.
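
    For example, a setting like the one above belongs under a key the ISV owns, not under Session Manager. A minimal sketch using hypothetical company and product names (the value name is the renamed one from this post):

    reg add "HKLM\SOFTWARE\Contoso\ContosoApp" /v UseQuantumComputing /t REG_DWORD /d 1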

     

    I'll leave you with the warning displayed in many of our KBs - it's there for a reason!

     

    WARNING: If you use Registry Editor incorrectly, you may cause serious problems that may require you to reinstall your operating system. Microsoft cannot guarantee that you can solve problems that result from using Registry Editor incorrectly. Use Registry Editor at your own risk.

     

    -Matt

  • Ntdebugging Blog

    Missing System Writer Case Explained

    • 6 Comments

    I worked on a case the other day where all I had was a procmon log and event logs to troubleshoot a problem where the System Writer did not appear in the VSSADMIN LIST WRITERS output. This might be review for the folks that know this component pretty well but I figured I would share how I did it for those that are not so familiar with the System Writer.

     

    WHAT WE KNOW:

    1. System State Backups fail
    2. Running vssadmin list writers does not list the System Writer

     

    Looking at the event logs, I found the error shown below. This error indicates there was a failure while “Writer Exposing its Metadata Context”. Each writer is responsible for providing a list of files, volumes, and other resources it is designed to protect. This list is called metadata and is formatted as XML. In the example we are working with, the error is “Unexpected error calling routine XML document is too long”.  While helpful, this message alone does not provide us with a clear reason why the XML document is too long.

     

    Event Type: Error

    Event Source: VSS

    Event ID: 8193

    Description: Volume Shadow Copy Service error: Unexpected error calling routine XML document is too long. hr = 0x80070018, The program issued a command but the command length is incorrect. . Operation: Writer Exposing its Metadata Context: Execution Context: Requestor Writer Instance ID: {636923A0-89C2-4823-ADEF-023A739B2515} Writer Class Id: {E8132975-6F93-4464-A53E-1050253AE220} Writer Name: System Writer

     

    The second event that was logged was also not very helpful as it only indicates that the writer did have a failure. It looks like we are going to need to collect more data to figure this out.

     

    Event Type: Error

    Event Source: VSS

    Event ID: 8193

    Description: Volume Shadow Copy Service error: Unexpected error calling routine CreateVssExamineWriterMetadata. hr = 0x80042302, A Volume Shadow Copy Service component encountered an unexpected error. Check the Application event log for more information. . Operation: Writer Exposing its Metadata Context: Execution Context: Requestor Writer Instance ID: {636923A0-89C2-4823-ADEF-023A739B2515} Writer Class Id: {E8132975-6F93-4464-A53E-1050253AE220} Writer Name: System Writer

     

    From the error above we learned that there was an issue with the metadata file for the System Writer. These errors are among the most common issues seen with this writer. There are some not-so-well-documented limitations within the writer due to some hard-set limits on path depth and the number of files in a given path. These limitations are frequently exposed by the C:\Windows\Microsoft.Net\ path. Often, this path is used by development software like Visual Studio as well as server applications like IIS. Below I have listed a few known issues that should help provide some scope when troubleshooting System Writer issues.

     

    Known limitations and common points of failure:

    • More than 1,000 folders in a folder causes writer to fail during OnIdentify
    • More than 10,000 files in a folder causes writer to fail during OnIdentify (frequently C:\Windows\Microsoft.Net)
    • Permissions issues (frequently in C:\Windows\WinSXS and C:\Windows\Microsoft.Net)
    • Permissions issues with COM+ Event System Service
      • This service needs to be running and needs to have Network Service with Service User Rights

     

    What data can I capture to help me find where the issue is?

     

    The best place to start is with a Process Monitor (Procmon) capture. To prepare for this capture you will need to download Process Monitor, open the Services MMC snap-in, and open an administrative command prompt, which will be used in a later step of the process.

     

    You should first stop the Cryptographic Services service using the Services MMC.

     

     

    Once the service is stopped, open Procmon; note that by default Procmon starts capturing as soon as it is opened. Now that Procmon is open and capturing data, start the Cryptographic Services service again. This allows you to capture any errors during service initialization. Once the service is started, use the command prompt opened earlier to run “vssadmin.exe list writers”. This signals the writers on the system to gather their metadata, which is a common place we see failures with the System Writer. When the vssadmin command completes, stop the Procmon capture and save the data to disk.
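
    If you prefer the command line over the Services MMC, the sequence looks roughly like this (CryptSvc is the short name of the Cryptographic Services service; start the Procmon capture between the stop and the start):

    net stop cryptsvc

    net start cryptsvc

    vssadmin.exe list writers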

     

    Now that we have data how do we find the errors?

     

    Open your newly created Procmon file. First, add a new filter for the path field that contains “VSS\Diag”.

     

     

    We do this because this is the registry path that all writers will log to when entering and leaving major functions. Now that we have our filtered view we need to look for an entry from the System Writer. You can see the highlighted line below shows the “IDENTIFY” entry for the System Writer. From here we can ascertain the PID of the svchost.exe that the system writer is running in. We now want to include only this svchost. To accomplish this you can right click on the PID for the svchost.exe and select “Include ‘(PID)’”. 

     

     

    Now that we have found our System Writer's svchost, we will want to remove the filter for “VSS\Diag”; to do that, return to the filter tool in Procmon and uncheck the box next to the entry.

     

     

    We now have a complete view of what this service was doing from the time it started through the completion of the OnIdentify. Our next step is to locate the IDENTIFY (Leave) entry, as this is often a great marker for where your next clue will be. While in most cases we can’t directly see the error the writer hit, we can make some educated connections based on the common issues we spoke about above. If we take a look at the events that took place just before the IDENTIFY (Leave), we can see that we were working in the C:\Windows\Microsoft.NET\assembly\ directory. This is one of the paths that the System Writer is responsible for protecting. As mentioned above, there are some known limitations to the depth of paths and number of files in the “C:\Windows\Microsoft.NET” folder. This is a great example of that limitation as seen in our Procmon capture. The example below shows the IDENTIFY (Leave), with the line before it being where the last work was taking place - meaning this is what the writer was touching when it failed.

     

     

    What does this tell us and what should we do next?

     

    Given the known path limitations, we need to check the number of files and folders in the C:\Windows\Microsoft.Net\ path and see where the bloat is. Some of these files can be safely removed; however, only files located in the Temp locations (Temporary ASP.NET Files) are safe to delete.

     

    Recently we released KB2807849, which addresses the issue shown above.

     

    There are other possible causes of the event log errors mentioned above, such as issues with file permissions. For those problems, follow the same steps as above and you are likely to see the IDENTIFY (Leave) just after a file access error appears in your Procmon log. For these failures you will need to investigate the permissions on the file the writer failed on. Likely the file is missing permissions for the writer’s service account, Network Service or Local System. All that is needed here is to add back the missing permissions on the failed file.

     

    While these issues can halt your nightly backups, it is often fairly easy to find the root cause. It just takes time and a little bit of experience with Process Monitor. 

     

    Good luck and successful backups!

  • Ntdebugging Blog

    Part 1 - ETW Introduction and Overview

    • 2 Comments

    Introduction:

     

    Event Tracing for Windows (ETW) is a system and software diagnostic, troubleshooting, and performance monitoring component of Windows that has been around since Windows 2000. However, it wasn't until Windows Vista that major components of the OS were updated to heavily use ETW tracing, making it much more practical and useful.

    ETW is useful in a variety of scenarios, including:

     - User & Admin: Being in control of your system and knowing what is going on.
     - User & Admin: Troubleshooting performance, hardware, and OS issues.
     - Enthusiast: Learning more about the OS and its low-level guts.
     - Software Developer/ISV/OEM: Investigating issues with your software's interaction with the Microsoft OS & technologies.
     - Hardware Developer/IHV/OEM: Investigating issues with hardware interaction with the OS, including the kernel and driver subsystems, up to the user stack.

    ETW is a set of technologies and tools that can absolutely complement existing tools while providing a look into the guts of the OS at a very low level.

    A great article from the April 2007 edition of MSDN Magazine - http://msdn.microsoft.com/en-us/magazine/cc163437.aspx, covers ETW in great depth, and is recommended reading.

    Here is a graphical overview of the ETW infrastructure that covers how providers log to high-performance in-memory buffers, which can be kept in memory as a circular buffer or written to disk in a sequential or circular fashion.

     

    image

    Some uses of ETW:

     

    Today on Windows Vista, Windows 7, and sparingly on earlier OSes, ETW is used by Microsoft development and support teams, as well as others, to help troubleshoot issues, find the root cause of bugs, analyze performance, and perform a large variety of other tasks.

    As it stands today, ETW has two major thrusts:

    1. Exposing time-based event data similar in practicality to plain old text-based logging (without the normal performance overhead). Troubleshooting logs have long been used across the industry by software developers and IT professionals to troubleshoot issues. Here are some examples of how ETW logging is used.

    First off, shown below is the Windows Event Viewer with viewing of Analytic and Debug logs enabled (which shows you many more logs than the default view):

    image

    A variety of logs can be viewed graphically using the Event Viewer or via the built-in command-line OS utility, wevtutil. There are quite a variety of logs that are enabled by default and are available via the Event Viewer or in a kernel memory dump.
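
    For example, from an administrative command prompt (the log name and event count are illustrative):

    wevtutil el                                  (enumerates the names of all logs on the system)

    wevtutil qe System /c:10 /rd:true /f:text    (displays the 10 most recent System log events as text)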

    Since ETW is also a foundational component, you might find it appearing in other troubleshooting utilities. For example, Office Communications Server 2007 uses ETW for troubleshooting.

    Process Monitor, a utility from Microsoft Sysinternals, uses ETW to monitor for network events from the kernel, which are then added to its trace. You should also see tools such as netsh and NetMon being able to capture and decode ETW traces.

     

    image

    There are many other methods and utilities for viewing the available ETW tracing providers, which may be found in the OS, in the Device Driver Kit (DDK), enabled with Microsoft products (such as Exchange or IIS), or used by tools such as Office Communications Server logging.

    The second major thrust of ETW is exposing performance data.

    2. Performance data is exposed from the lowest levels of the kernel, driver subsystems, and every major user-mode component of Windows (such as Windows Media Player).

    Performance data can be used in a hybrid fashion, both for troubleshooting and for performance analysis. Take, for example, a built-in OS tool that uses ETW tracing to provide rich data: Resource Monitor.

    Resource Monitor is available from the Windows Task Manager's Performance tab. The cool thing about Resource Monitor is that it internally uses kernel ETW tracing to gather its data, providing rich data from the kernel itself and thus giving administrators a practical exposure of ETW technology for troubleshooting their systems. This goes beyond the data that Task Manager provides, which tends to be based on performance counters that you cannot drill down into further.

    image

    image

    Resource Monitor is useful from both a performance and a troubleshooting standpoint. For example, you can view the top CPU consumers, the top disk users (to see why your disk is thrashing), or search for an open handle to a locked file.

    For a deeper drill-down into performance data, XPerf can be used to capture ETW traces and make them available for decoding offline (on another box).

    Here, Xperf (available with the Windows DDK and as a standalone download) is shown looking at Internet Explorer's file access time.

    image
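
    As a rough sketch, a capture like the one above could be taken from an elevated command prompt as follows; the kernel group and output file name are just examples:

    xperf -on DiagEasy
    xperf -d c:\traces\iexplore.etl
    xperf c:\traces\iexplore.etl

    The first command turns on the kernel logger with the DiagEasy group of kernel flags; after the scenario of interest has been reproduced, the second command stops logging and merges the buffers into an .etl file, and the third opens that trace in the viewer, where graphs such as file I/O can be examined.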

    There have been several posts regarding XPerf on the NTDebugging Blog and other blogs, as well as on the Windows Performance Analysis Dev Center.

    Where to go from here:

     

    In Windows 7 and Windows Server 2008 R2, ETW tracing has been improved even further, with many more providers: there are 600+ inbox providers registered on the box. This number increases when add-on features or products are installed (such as Hyper-V). You can enumerate the registered providers yourself, as shown below.
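
    For example, a quick way to list (and roughly count) the registered providers from a command prompt might be:

    wevtutil ep
    wevtutil ep | find /c /v ""

    The first command enumerates the registered event publishers; piping the output through find /c /v "" simply counts the lines, giving an approximate provider count for the box.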

    The tracing provided by ETW, whether it is used directly in the OS, or via add-on tools, provides deep insight into Windows and how it works. We will explore ETW in depth in future blog posts.

     

     


  • Ntdebugging Blog

    For testing purposes only, and other desktop watermarks

    • 3 Comments

    Hi all, Matt here again.  One of our team’s main functions is to work with our development teams to create hotfixes when customers run into issues that can only be resolved through a code change.  The developers will often prepare a private test fix that either tests the proposed change, or adds additional instrumentation to help pinpoint the issue. The private test fix is sent to the customer reporting the problem so they can confirm that it does indeed correct (or identify) the flaw.

     

    When testing a private test fix, customers frequently ask why their desktop now shows a message in the lower-right corner, and what it means.  The message reads “For testing purposes only”, and looks like this:

    For testing purposes only

     

    Often, users are concerned that this message means that they aren’t allowed to use the server in production, or maybe that it is now “unsupported.”  Neither is the case!  Since this message appears as a result of installing a fix during the course of a Microsoft Support case, the servers are, by definition, being supported. 

     

    The purpose of this message is simply to remind users that code that Microsoft Support has asked them to test has been installed on the system, and this code may not yet have undergone the full suite of quality assurance testing that publicly released fixes usually do.

     

    For comparison, let’s look at some of the other watermarks you may find in the lower corner of the desktop, as these can often be confused with the above message and may explain some of the customer concerns around it.

     

    First up is the old trusty text you see when a box is booted into ‘Safe Mode’. I’m sure every IT Pro has seen this at one time or another, so I won’t go into detail, but rest assured, the testing purposes text is completely unrelated to booting in safe mode or having a subset of services running.

    Safe Mode

     

    Next up is our ‘Evaluation copy’ watermark.  This message is shown on the desktops of copies of Windows that have a “time bomb” (ones that will cease to function after a certain date.)  This message is typically seen on beta versions of Windows which are designed to stop functioning sometime after the desired beta testing period ends. 

    Evaluation copy

     

    Third, we have our Windows is not genuine message.  This is shown if, for example, a copy of Windows is not activated during the grace period after the installation process, or if a number of hardware changes have been made and Windows needs to be re-activated.  This has nothing to do with the ‘testing purposes’ message.  See http://technet.microsoft.com/en-us/library/dd979803.aspx for more information about this message.

    This copy of Windows is not genuine

     

    Fourth, we have the general Windows build stamp.  This is enabled via the registry using the PaintDesktopVersion DWORD (http://technet.microsoft.com/en-us/library/cc782257(WS.10).aspx).  Some administrators like to enable this option so they always know what version of Windows they are using, sort of like a mini-bginfo.  Unlike the others, this message does not indicate anything else about a server’s state.

    Windows 7

     

    Finally, we have ‘Test Mode’.  This is actually somewhat related to the testing purposes message.  This ‘Test Mode’ text is shown when test signing is enabled on a PC.  This is done by running “bcdedit /set testsigning on” from a UAC-elevated command prompt.  Test signing is used to allow developers to load drivers they are still working on that have not yet been code signed with an official certificate.  This is actually one of the steps we need to do when loading our test fixes.  For more information on Test Signing, see http://msdn.microsoft.com/en-us/library/ff553484%28v=VS.85%29.aspx.

    Test Mode

     

    So now that you know what causes these various watermarks to appear, perhaps you’re wondering how to make the “For testing purposes only” message disappear.  This is a question we are frequently asked.  While you are running a private test fix, there is no way to disable this message.  Your only option is to remove the private test fix from your system.  This is something your engineer will ask you to do before you install the final, public version of a hotfix.  You can easily identify and uninstall private test fixes by going into Control Panel, Programs and Features, View Installed Updates, then looking for hotfixes with the words FOR TESTING PURPOSES ONLY in their name, like the one shown in the image below.  You may also notice that the KB number listed for these fixes is often a placeholder, and not a real KB article ID.

    Installed Updates

     

    If the ‘For testing purposes only’ message is still displayed even after uninstalling the private test fix, there is one more place to check.  If a system has the Microsoft Test Root Authority certificate installed in its Trusted Root Certification Authorities store, the text will be displayed.  We use this certificate to allow a PC to run test code that has been signed by our development team but not yet fully tested and signed off with the official Microsoft certificate.  To remove this certificate from your system, go to Start -> Run, enter certmgr.msc, and hit Enter.  In the Certificate Manager MMC, browse to Trusted Root Certification Authorities, then into Certificates.  You should see one labeled Microsoft Test Root Authority, as shown below.  This certificate will need to be deleted and the system rebooted to remove the testing purposes message.  Do not do this if you still have a private test fix installed though, as it would prevent that binary from continuing to function and may mean you can no longer boot into Normal or Safe mode.

    certmgr
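
    If you prefer the command line, a roughly equivalent step is sketched below; it is only an illustration, and the same caveat about installed private test fixes applies.  Run the following from an elevated command prompt, then reboot:

    certutil -delstore Root "Microsoft Test Root Authority"

    This deletes the certificate with that common name from the machine’s Trusted Root Certification Authorities store.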

     

    If you reboot and find that ‘Test Mode’ has replaced the ‘For testing purposes only’ text, you’ll need to launch a command prompt with administrative privileges, then run “bcdedit /set testsigning off” and reboot.  You can always check if test signing is enabled by running “bcdedit /enum” and looking for this line:

    bcdedit /enum
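
    As a rough sketch, you could run the command against the current boot entry:

    bcdedit /enum {current}

    and look in its output for a line similar to the following, which indicates that test signing is turned on:

    testsigning             Yes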

     

    That’s all for today.  Hopefully this post helped clear up any confusion about our different desktop watermarks.

  • Ntdebugging Blog

    Debugging a Bugcheck 0xF4

    • 1 Comments

    My name is Ryan Mangipano (ryanman) and I am a Sr. Support Escalation Engineer at Microsoft.  Today’s blog will be a quick walkthrough of the analysis of a bugcheck 0xF4 and how I determined that the action plan going forward should consist of enabling pool tagging on this system.

     

    I began my review with !analyze –v. From the output I can see that a process required for the system to function properly unexpectedly exited or was terminated. The goal of this debugging session will be to determine what failed and why.

     

    0: kd> !analyze -v

    *******************************************************************************

    *                                                                             *

    *                        Bugcheck Analysis                                    *

    *                                                                             *

    *******************************************************************************

     

    CRITICAL_OBJECT_TERMINATION (f4)

    A process or thread crucial to system operation has unexpectedly exited or been

    terminated.

    Several processes and threads are necessary for the operation of the

    system; when they are terminated (for any reason), the system can no

    longer function.

    Arguments:

    Arg1: 00000003, Process                                A value of 0x3 in this parameter indicates that a process terminated, not a thread

    Arg2: 8a03ada0, Terminating object        This value is a pointer to the _EPROCESS object that terminated

    Arg3: 8a03af14, Process image file name              The process name

    Arg4: 805d1204, Explanatory message (ascii)         A text message about the problem

     

    We shall begin by dumping out the parameters of the bug check. Let’s dump out the "Terminating object" from parameter 2 below:

    0: kd> !object 8a03ada0

    Object: 8a03ada0  Type: (8a490900) Process

        ObjectHeader: 8a03ad88 (old version)

        HandleCount: 3  PointerCount: 228

     

    Next, let’s dump out the process image file name from bug check parameter 3 above.

    0: kd> dc 8a03af14

    8a03af14  73727363 78652e73 00000065 00000000  csrss.exe

     

     

    0: kd> dt _EPROCESS 8a03ada0 imageFileName

    CSRSRV!_EPROCESS

       +0x174 ImageFileName : [16]  "csrss.exe"

     

    Notice that if we add the base of the _EPROCESS object (8a03ada0, parameter 2) to the offset of the ImageFileName field (+0x174), we get parameter 3, the address of the ImageFileName field.

    0: kd> ? 8a03ada0+0x174

    Evaluate expression: -1979470060 = 8a03af14

     

    0: kd> dc 8a03af14

    8a03af14  73727363 78652e73 00000065 00000000  csrss.exe.......

     

    Let’s dump out the ASCII message from parameter 4:

    0: kd> dc 805d1204

    805d1204  6d726554 74616e69 20676e69 74697263  Terminating crit

    805d1214  6c616369 6f727020 73736563 25783020  ical process 0x%

     

    Let’s review the debugger help file for more information. We can see that this bug check occurs when a critical process or thread terminates: “Several processes and threads are necessary for the operation of the system. When they are terminated for any reason, the system can no longer function.”

    0: kd> .hh bug check 0xf4

     

     

     

     

     

     

    Next, we need to determine why this process terminated. !analyze –v also provided us with an exception record, which contains an error code:

     

    PROCESS_NAME:  csrss.exe

     

    EXCEPTION_RECORD:  9a85e9d8 -- (.exr 0xffffffff9a85e9d8)

    ExceptionAddress: 7c92c375 (ntdll!RtlFindMessage+0x0000007c)

       ExceptionCode: c0000006 (In-page I/O error)

      ExceptionFlags: 00000000

    NumberParameters: 3

       Parameter[0]: 00000000

       Parameter[1]: 7c99c3d8

       Parameter[2]: c000009a

    Inpage operation failed at 7c99c3d8, due to I/O error c000009a

     

    EXCEPTION_CODE: (NTSTATUS) 0xc0000006 - The instruction at 0x%p referenced memory at 0x%p. The required data was not placed into memory because of an I/O error status of 0x%x.

     

    Since we have an error code, let’s investigate it. We can quickly decode it from within the debugger using the !error command:

     

    0: kd> !error c000009a

    Error code: (NTSTATUS) 0xc000009a (3221225626) - Insufficient system resources exist to complete the API.

     

    0: kd> .hh !error

     

     

    An in-page I/O error with a status of “insufficient system resources” suggests a memory or pool problem, so let’s check the output of the !vm command:

     

    0: kd> !vm 2

     

    *** Virtual Memory Usage ***

                    Physical Memory:      760875 (   3043500 Kb)

                    Page File: \??\C:\pagefile.sys

                      Current:   4190208 Kb  Free Space:   4156380 Kb

                      Minimum:   4190208 Kb  Maximum:      4190208 Kb

                    Available Pages:      579241 (   2316964 Kb)

                    ResAvail Pages:       673481 (   2693924 Kb)

                    Locked IO Pages:          69 (       276 Kb)

                    Free System PTEs:     115226 (    460904 Kb)

                    Free NP PTEs:              0 (         0 Kb)

                    Free Special NP:           0 (         0 Kb)

                    Modified Pages:          221 (       884 Kb)

                    Modified PF Pages:       219 (       876 Kb)

                    NonPagedPool Usage:    65534 (    262136 Kb)

                    NonPagedPool Max:      65536 (    262144 Kb)

                    ********** Excessive NonPaged Pool Usage *****

                    PagedPool 0 Usage:     24167 (     96668 Kb)

                    PagedPool 1 Usage:       967 (      3868 Kb)

                    PagedPool 2 Usage:       967 (      3868 Kb)

                    PagedPool 3 Usage:       984 (      3936 Kb)

                    PagedPool 4 Usage:       977 (      3908 Kb)

                    PagedPool Usage:       28062 (    112248 Kb)

                    PagedPool Maximum:     92160 (    368640 Kb)

     

                    ********** 2075 pool allocations have failed **********

     

                    Session Commit:         1562 (      6248 Kb)

                    Shared Commit:          2526 (     10104 Kb)

                    Special Pool:              0 (         0 Kb)

                    Shared Process:         4821 (     19284 Kb)

                    PagedPool Commit:      28062 (    112248 Kb)

                    Driver Commit:          5138 (     20552 Kb)

                    Committed pages:      153449 (    613796 Kb)

                    Commit limit:        1767229 (   7068916 Kb)

     

    0: kd> !poolused

    unable to get PoolTrackTable - pool tagging is disabled, enable it to use this command

    Use gflags.exe and check the box that says "Enable pool tagging".

     

    The output above has informed us that pool tagging is disabled. Let’s demonstrate how you can verify that it is disabled:

     

    0: kd> dd nt!NtGlobalFlag L1

    805597ec  00000000

     

    0: kd> !gflag

    Current NtGlobalFlag contents: 0x00000000

     

    Let’s explore the debugging help file entry on the !poolused command

     

    0: kd> .hh !poolused

     

     

     

    Reading the text above, we are informed that “Pool tagging is permanently enabled on Windows Server 2003 and later versions of Windows. On Windows XP and earlier versions of Windows, you must enable pool tagging by using Gflags.”

     

     

    Using the vertarget command, I can see that this system was running Windows XP.

    0: kd> vertarget

    Windows XP Kernel Version 2600 (Service Pack 2) MP (2 procs) Free x86 compatible

     

     

     

    0: kd> .hh !gflag

     

     

     

    By reviewing the help file entry for the !gflag extension, I was able to determine that if pool tagging were enabled, the following bit would be set:

     

    0x400 "ptg"         Enable pool tagging.

     

    0: kd> .formats 0x400

    Evaluate expression:

    ….

      Binary:  00000000 00000000 00000100 00000000                0x00000400
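
    To confirm directly that this bit is clear in the dump, you could mask NtGlobalFlag against that value; the command below is just an illustration and would simply evaluate to zero here:

    0: kd> ? dwo(nt!NtGlobalFlag) & 0x400

    A non-zero result would mean the ptg bit is set and pool tagging is enabled.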

     

    Gflags is included in the Debugging Tools for Windows. The screenshot below is from a Windows 7 system. Notice that Pool Tagging is enabled permanently as described above.

     

     

     

    Summary: This system bugchecked when the critical process csrss.exe failed an I/O operation due to insufficient non-paged pool. For an action plan, we recommended the use of gflags to enable pool tagging in order to obtain more information about pool consumption.
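
    As a rough sketch of that action plan (gflags ships with the Debugging Tools for Windows, and on Windows XP a reboot is required for the change to take effect), pool tagging can be enabled system-wide from an elevated command prompt:

    gflags /r +ptg

    Then, once the problem reproduces with pool tagging enabled, the pool consumers can be examined in the next dump, for example:

    0: kd> !poolused 2

    The 2 flag sorts the output by nonpaged pool usage, which is the pool that was exhausted here, so the tag responsible for the consumption should appear near the top of the list.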

     

     

