• Ntdebugging Blog

    Video: Using XPERF to root cause CPU consumption

    • 4 Comments

    Hi Debuggers,

     

    Here is a short video showing how to root cause CPU consumption issues on Vista/Server 2008 and higher operating systems with Xperf.

     

    For this scenario, Xperf is an easy to implement and powerful replacement for previous actions like gathering a series of dumps during the CPU, kernrate data, perfmon, other profilers, etc.

     

    Xperf allows one to quickly turn on profiling and see where the CPU time is being spent by using tracing already in the box!

     

     

    -Tate

     

     

    Note - Feel free to post questions here, or jump into a discussion about the video on our Twitter page ( @ntdebugging ).

     

     

    Share this post :
  • Ntdebugging Blog

    Bcdedit Tips and Tricks For Debugging Part 1

    • 6 Comments

    Hello everyone, my name is Sean Walker, and I am on the Platforms OEM team in Washington.  This article is for those people who have had a hard time switching from the old boot.ini configuration to the new BCD store (myself included). Doing the simple tasks such as enabling kernel debugging over com1 are easy to do with bcdedit.exe or the msconfig GUI, you just enable them and reboot the computer. However, if you need to do something more advanced such as break into the early boot process during resume from hibernation, things get a lot more complicated.

     

    This article has some samples for enabling and disabling debug settings that you may not be familiar with, and a list of bcdedit debug settings for Windows Vista/Server 2008 and Windows 7/Server 2008 R2.  This information has been helpful to me for quickly and accurately getting to the debug at hand rather than fumbling around with bcdedit.  Much of the following information has been taken from various sources, including the windbg help files, the OEM team blog, the MSDN bcdedit reference, and the WHDC debugger site.

     

    NOTE: For the examples below, you will need to run bcdedit.exe from an administrator (UAC-elevated) command prompt.  To output a summary view of the current state of the BCD store, just run "bcdedit.exe" from the command prompt.  To get detailed information about all of the store(s) that Windows knows about, use the following command:

    bcdedit /enum all

     

    What is a BCD store?

    A BCD store is a binary file that contains boot configuration data for Windows, basically it is a small registry file.  Boot applications use the system BCD store, located on the system partition, during the boot process.  You can also create additional BCD stores in separate files but only one store at a time can be designated as the system store.

     

    NOTE: The "/store" switch can be used to specify a particular BCD store for bcdedit commands (instead of the default store).  To enumerate all the settings in another BCD store, in this case e:\bcd_store\BCD, use the following command:

    bcdedit /store e:\bcd_store\BCD /enum all

     

    This will show you which options are currently set, and what their values are.  When /store switch is omitted, the system store is used.

     

    Using bootdebug

    To enable debugging for early boot problems, you may need to enable the bootdebug switch.  This is easy to do with bcdedit:

    bcdedit /set bootdebug on

     

    However, this only sets bootdebug for the current "boot application", which is generally winload.exe, so it does not break into the very early boot process.  There are multiple applications used for booting, hibernating, and resuming (bootmgr.exe, winload.exe and winresume.exe are examples of these).  Each application (called BCD Objects) has its own settings (called BCD Elements) in the BCD store and each can be modified globally and/or individually.

     

    So, to deal with different (or multiple) debug scenarios, you just enable boot debugging based on the boot application you are concerned with.  For early debugging, you can enable bootdebug for bootmgr:

    bcdedit /set {bootmgr} bootdebug on

     

    To set bootdebug for winload.exe (which will most often be your current, and default, boot object) all three of the following will give you the same result:

    bcdedit /set bootdebug on

    bcdedit /set {current} bootdebug on

    bcdedit /set {default} bootdebug on

     

    If you are modifying the settings in another store, or are booted into another OS on the same computer (such as WinPE), you need to specify the location of the BCD store:

    bcdedit /store d:\Boot\BCD /set {default} bootdebug on

     

    Not all of the boot objects have "friendly" names, so you may need to specify the full GUID (Globally Unique ID) to modify it.  As an example, if you wanted to enable bootdebug on resume from hibernation, you would include the identifier (see figure 1) for the "Resume from Hibernate" object:

    bcdedit /set {89a932d0-d5bc-11e0-a0af-00215add5ebc} bootdebug on

     

    image001

    Figure 1: Color coded bcdedit output

     

    Why won't my USB or 1394 debug work?

    When there are multiple debug ports of a certain type in a computer Windows may not default to the correct one for your situation.  This happens most commonly when there are either multiple 1394 host controllers or USB EHCI controllers.  When this occurs it can range from a slight inconvenience (different port is used so the cable needs to be plugged into another port), to complete failure (internal port is used, which is not accessible).  In the case of USB debugging the Intel USB 2.0 specification only provides one debug port, so debugging is not possible if the wrong host controller is used.

     

    There are several caveats with USB debugging, not the least of which is that you need to buy a separate, expensive, debug cable.  Some of the difficulties and implementation details necessary to get USB debugging to work are encompassed in the WHDC USB FAQ and in Setting Up Kernel Debugging with USB 2.0.

     

    NOTE: A correction to the WHDC USB documentation for Windows 7/Windows 2008 R2 is that the busparams switch now takes decimal rather than hexadecimal values, and the "loadoptions" parameter is no longer required.  So, to enable the busparams element (for USB or 1394 debugging) in Vista/2008, you would use something like this:

    bcdedit /set {current} loadoptions busparams=0.1D.7

     

    And the Win7/2008 R2 example would be:

    bcdedit /set {current} busparams 0.29.7

     

    In the case of loadoptions or busparams, deleting the setting is not as easy as changing a flag from yes to no. You must specifically delete the value to get rid of it, and one of the examples below can be used:

     

    For Vista/2008:

    bcdedit /deletevalue {current} loadoptions

     

    And Windows 7/2008 R2:

    bcdedit /deletevalue {current} busparams

     

    Bcdedit settings and examples

    This is just scratching the surface of using bcdedit for your troubleshooting and/or debugging needs, so there are more articles to follow. Part 2 will include some more detailed debugging scenarios, such as Hyper-V guest and host debugging.  Below is a consolidated table with many of the debugging switches/settings as well as a number of different usage examples.

     

    Table of debug-related bcdedit settings

    Option

    Description

    bootdebug

    Enables or disables the boot debugger for a specified boot entry. Although this command works for any boot entry, it is effective only for boot applications.

    Enable value(s): on, 1

    Disable value(s): off, 0

    Bcdedit /set bootdebug on

    debug

    Enables or disables the kernel debugger for a specified boot entry.

    Enable value(s): on, 1

    Disable value(s): off, 0

    /dbgsettings

    Used to modify the global settings for the debug connection (does not include hypervisor).  Values:

    Can change all settings at once instead of using the /set command to change them individually. Usage example:

    bcdedit /dbgsettings 1394 channel:30

    debugport

    Used to specify the debugger type.

    Values:

    Serial port – com1, com2, comx

    1394 port – 1394

    USB port - USB

    channel

    Specifies 1394 channel used.

    Values:

    Decimal integer between 0 and 62, inclusive.

    baudrate

    Used to specify the baud rate of a serial debug port.

    Values: 9600, 19200, 38400, 57600, 115200

    targetname

    Specifies a string to use as the identification for the USB 2.0 connection. This string can be any value.

    Usage example:

    bcdedit /dbgsettings usb targetname:usbdebug

    /hypervisorsettings

    Used the same way as /dbgsettings to configure all settings at once.

    Usage example:

    bcdedit /hypervisorsettings 1394 channel:10

    hypervisordebug

    Enables or disables hypervisor debug mode. This is for debugging a Hyper-V host system.

    Enable value(s): on, 1

    Disable value(s): off, 0

    Usage example:

    bcdedit /set {current} hypervisordebug on

    /noumex

    Specifies that the kernel debugger ignores user-mode exceptions. By default, the kernel debugger breaks for certain user-mode exceptions, such as STATUS_BREAKPOINT and STATUS_SINGLE_STEP. The /noumex parameter is effective only when there is no user-mode debugger attached to the process.

    /start

    This option specifies the debugger start policy. If a start policy is not specified, ACTIVE is the default.

    Values: active, disable, autoenable

    loadoptions

    Used to describe settings that are not covered by other types. One setting that is relevant here is busparams.

    Values: Any value followed by the setting.

    Usage example (Vista/2008):

    bcdedit /set {current} loadoptions busparams=0.1d.0

    busparams

    A boot setting (specified with loadoptions key word) used to point to the PCI address of the debugger in use. The PCI bus, device, and function are used, in the format bb.dd.ff. This is generally used to identify the location of a 1394 or USB debug port. In Vista/2008, hexadecimal values are used, whereas decimal values are used for Win7.

    Values: Decimal values between 0 and 255.

    Usage example:

    In Win7 - bcdedit /set busparams 0.29.0

    In Vista - bcdedit /set loadoptions busparams=0.1d.0

    kernel

    The loadoptions parameter used to point to a different kernel binary. This can be used to test with a checked or instrumented version of the kernel without replacing the existing one. The updated binary MUST be placed in the %windir%\system32 folder to be used

    Values: The 8.3 filename of the replacement kernel include the exe extension.

    Usage examples:

    In Win7 – bcdedit /set kernel kernchk.exe

    In Vista - bcdedit /set loadoptions kernel=kernchk.exe

    hal

    The loadoptions parameter used to point to a different hal binary. This can be used to test with a checked or instrumented version of the kernel without replacing the existing one. The updated binary MUST be placed in the %windir%\system32 folder to be used

    Values: the 8.3 filename of the replacement kernel include the .dll extension.

    Usage examples:

    In Win7 – bcdedit /set hal halchk.dll

    In Vista - bcdedit /set loadoptions hal=halchk.dll

    testsigning

    Controls whether Windows 7, Windows Server 2008, or Windows Vista will load any type of test-signed kernel-mode code. This option is not set by default, which means test-signed kernel-mode drivers on 64-bit versions of Windows 7, Windows Server 2008, and Windows Vista will not load without setting the testsigning switch

    Enable value(s): on, 1

    Disable value(s): off, 0

    Usage example:

    Bcdedit /set testsigning on

     

  • Ntdebugging Blog

    Don't Believe Everything You Read

    • 3 Comments

    Recently, I was contacted by a customer who was advised by an ISV to set a registry value under one of the sub keys in HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\.  Let's call it UseQuantumComputing = 1 (value name has been changed to protect the ISV).  The customer wanted to know what this value actually did and no one could find any documentation explaining it.  These issues often come to our team because we have access to the Windows source code. I did a bit of code review to find out what this value does.  As it turns out, nowhere in Windows source code between Windows 2000 and Windows Server 2012 do we ever check for or set UseQuantumComputing.

     

    I can think of a few reasons the ISV would suggest setting this value.  Perhaps they were under the impression this did something but got confused about the value name.  It's possible they hoped making a registry change would have a placebo effect.  Or, perhaps their software actually checks this value, not Windows.

     

    The latter of these possibilities is probably the worst case scenario. An ISV should not create a registry value inside of keys used for Windows’ own internal use. Why? The first reason is that there's no guarantee that Microsoft won't end up coincidentally using that same value name later. This would cause a conflict between the two users of the value.  Second, we have to consider the scenario where two different ISVs both decide to use the same value. That would be bad too.  Lastly, there's no guarantee that the key in use will still exist in later versions, or that it will be writeable or readable by the ISV due to permission changes.  In addition to all these reasons, there is the common sense issue that it is just confusing. Now the ISV's software and uninstaller needs to look all over the registry, not just in their own keys.

     

    On a similar note, I also recently had a case where a "Windows Tips" blog (not created, endorsed, or run by Microsoft) suggested using a registry value that was implemented in Windows but was not documented by Microsoft.  It turns out this value wasn't thoroughly tested (because it was undocumented and wasn't intended to be used in production), and using it would cause server hangs under certain conditions.  These hangs were only discovered after a large customer decided to implement the undocumented value across their enterprise. 

     

    Here are a few tips for IT Pros, developers, and users alike:

    • Don't implement random registry settings if you can't find documentation for that setting on an official Microsoft site, like MSDN, TechNet, or support.microsoft.com(information on forums or answer boards (e.g. social.*.microsoft.com or answers.*.microsoft.com) is not official documentation).  At best these unknown registry settings they will do nothing, at worst they will cause you headaches later.
    • If a key/value isn't documented, changes to it likely are not tested, and could put your machine in a state that makes it difficult or impossible to support.
    • If you are a developer, keep any of your registry settings in your own key. Don't pollute in others' keys.
    • If an ISV or Microsoft suggests you implement a setting, make sure you understand the implications of that setting.

     

    I'll leave you with the warning displayed in many of our KBs - it's there for a reason!

     

    WARNING: If you use Registry Editor incorrectly, you may cause serious problems that may require you to reinstall your operating system. Microsoft cannot guarantee that you can solve problems that result from using Registry Editor incorrectly. Use Registry Editor at your own risk.

     

    -Matt

  • Ntdebugging Blog

    Determining the source of Bug Check 0x133 (DPC_WATCHDOG_VIOLATION) errors on Windows Server 2012

    • 2 Comments

    What is a bug check 0x133?

    Starting in Windows Server 2012, a DPC watchdog timer is enabled which will bug check a system if too much time is spent in DPC routines. This bug check was added to help identify drivers that are deadlocked or misbehaving.  The bug check is of type "DPC_WATCHDOG_VIOLATION" and has a code of 0x133.  (Windows 7 also included a DPC watchdog but by default, it only took action when a kernel debugger was attached to the system.)  A description of DPC routines can be found at http://msdn.microsoft.com/en-us/library/windows/hardware/ff544084(v=vs.85).aspx.

     

    The DPC_WATCHDOG_VIOLATION bug check can be triggered in two ways. First, if a single DPC exceeds a specified number of ticks, the system will stop with 0x133 with parameter 1 of the bug check set to 0.  In this case, the system's time limit for single DPC will be in parameter 3, with the number of ticks taken by this DPC in parameter 2.  Alternatively, if the system exceeds a larger timeout of time spent cumulatively in all DPCs since the IRQL was raised to DPC level, the system will stop with a 0x133 with parameter 1 set to 1.  Microsoft recommends that DPCs should not run longer than 100 microseconds and ISRs should not run longer than 25 microseconds, however the actual timeout values on the system are set much higher.

     

    How to debug a 0x133 (0, …

    In the case of a stop 0x133 with the first parameter set to 0, the call stack should contain the offending driver.  For example, here is a debug of a 0x133 (0,…) kernel dump:

     

    0: kd> .bugcheck

    Bugcheck code 00000133

    Arguments 00000000`00000000 00000000`00000283 00000000`00000282 00000000`00000000  

     

    Per MSDN, we know that this DPC has run for 0x283 ticks, when the limit was 0x282.

     

    0: kd> k

    Child-SP          RetAddr           Call Site

    fffff803`08c18428 fffff803`098525df nt!KeBugCheckEx

    fffff803`08c18430 fffff803`09723f11 nt! ??::FNODOBFM::`string'+0x13ba4

    fffff803`08c184b0 fffff803`09724d98 nt!KeUpdateRunTime+0x51

    fffff803`08c184e0 fffff803`09634eba nt!KeUpdateTime+0x3f9

    fffff803`08c186d0 fffff803`096f24ae hal!HalpTimerClockInterrupt+0x86

    fffff803`08c18700 fffff803`0963dba2 nt!KiInterruptDispatchLBControl+0x1ce

    fffff803`08c18898 fffff803`096300d0 hal!HalpTscQueryCounter+0x2

    fffff803`08c188a0 fffff880`04be3409 hal!HalpTimerStallExecutionProcessor+0x131

    fffff803`08c18930 fffff880`011202ee ECHO!EchoEvtTimerFunc+0x7d                //Here is our driver, and we can see it calls into StallExecutionProcessor

    fffff803`08c18960 fffff803`097258b4 Wdf01000!FxTimer::TimerHandler+0x92

    fffff803`08c189a0 fffff803`09725ed5 nt!KiProcessExpiredTimerList+0x214

    fffff803`08c18ae0 fffff803`09725d88 nt!KiExpireTimerTable+0xa9

    fffff803`08c18b80 fffff803`0971fe76 nt!KiTimerExpiration+0xc8

    fffff803`08c18c30 fffff803`0972457a nt!KiRetireDpcList+0x1f6

    fffff803`08c18da0 00000000`00000000 nt!KiIdleLoop+0x5a

     

    Let’s view the driver’s unassembled DPC routine and see what it is doing

     

    0: kd> ub fffff880`04be3409

    ECHO!EchoEvtTimerFunc+0x54:

    fffff880`04be33e0 448b4320        mov     r8d,dword ptr[rbx+20h]

    fffff880`04be33e4 488b0d6d2a0000  mov     rcx,qword ptr [ECHO!WdfDriverGlobals (fffff880`04be5e58)]

    fffff880`04be33eb 4883631800      and     qword ptr [rbx+18h],0

    fffff880`04be33f0 488bd7          mov     rdx,rdi

    fffff880`04be33f3 ff150f260000    call    qword ptr [ECHO!WdfFunctions+0x838(fffff880`04be5a08)]

    fffff880`04be33f9 bbc0d40100      mov     ebx,1D4C0h

    fffff880`04be33fe b964000000      mov     ecx,64h

    fffff880`04be3403 ff15f70b0000    call    qword ptr[ECHO!_imp_KeStallExecutionProcessor (fffff880`04be4000)]   //Its Calling KeStallExecutionProcessor with 0x64 (decimal 100) as a parameter

    0: kd> u fffff880`04be3409

    ECHO!EchoEvtTimerFunc+0x7d:

    fffff880`04be3409 4883eb01        sub     rbx,1

    fffff880`04be340d 75ef            jne     ECHO!EchoEvtTimerFunc+0x72 (fffff880`04be33fe)     //Here we can see it is jumping back to call KeStallExecutionProcessor in a loop

    fffff880`04be340f 488b5c2430      mov     rbx,qword ptr[rsp+30h]

    fffff880`04be3414 4883c420        add     rsp,20h

    fffff880`04be3418 5f              pop     rdi

    fffff880`04be3419 c3              ret

    fffff880`04be341a cc              int     3

    fffff880`04be341b cc              int     3

     

    0: kd> !pcr

    KPCR for Processor 0 at fffff80309974000:

        Major 1 Minor 1

          NtTib.ExceptionList: fffff80308c11000

              NtTib.StackBase: fffff80308c12080

             NtTib.StackLimit: 000000d70c7bf988

           NtTib.SubSystemTib: fffff80309974000

                NtTib.Version: 0000000009974180

            NtTib.UserPointer: fffff803099747f0

                NtTib.SelfTib: 000007f7ab80c000

     

                      SelfPcr: 0000000000000000

                         Prcb: fffff80309974180

                         Irql: 0000000000000000

                          IRR: 0000000000000000

                          IDR: 0000000000000000

                InterruptMode: 0000000000000000

                          IDT: 0000000000000000

                          GDT: 0000000000000000

                          TSS: 0000000000000000

     

                CurrentThread: fffff803099ce880

                   NextThread: fffffa800261cb00

                   IdleThread: fffff803099ce880

     

                    DpcQueue:  0xfffffa80020ce790 0xfffff880012e4e9c [Normal] NDIS!NdisReturnNetBufferLists

                               0xfffffa800185f118 0xfffff88000c0ca00 [Normal] ataport!AtaPortInitialize

                               0xfffff8030994fda0 0xfffff8030972bc30 [Normal] nt!KiBalanceSetManagerDeferredRoutine

                               0xfffffa8001dbc118 0xfffff88000c0ca00 [Normal] ataport!AtaPortInitialize

                               0xfffffa8002082300 0xfffff88001701df0 [Normal] USBPORT

     

    The !pcr output shows us queued DPCs for this processor. If you want to see more information about DPCs and the DPC Watchdog, you could dump the PRCB listed in the !pcr output like this:

     

    dt nt!_KPRCB fffff80309974180 Dpc*

     

    Often the driver will be calling into a function like KeStallExecutionProcessor in a loop, as in our example debug.  To resolve this problem, contact the driver vendor to request an updated driver version that spends less time in its DPC Routine.

     

    How to troubleshoot a 0x133 (1, …

    Determining the cause of a stop 0x133 with a first parameter of 1 is a bit more difficult because the problem is a result of DPCs running from multiple drivers, so the call stack is insufficient to determine the culprit.  To troubleshoot this stop, first make sure that the NT Kernel Logger or Circular Kernel Context Logger ETW traces are enabled on the system.  (For directions on setting this up, see http://blogs.msdn.com/b/ntdebugging/archive/2009/12/11/test.aspx.)

     

    Once the logging is enabled and the system bug checks, dump out the list of ETW loggers using !wmitrace.strdump. Find the ID of the NT Kernel logger or the Circular logger.  You can then use !wmitrace.logsave (ID) (path to ETL) to write out the ETL log to a file.  Load it up with Windows Performance Analyzer and add the DPC or DPC/ISR Duration by Module, Function view (located in the Computation group) to your current analysis window:

     

     

    Next, make sure the table is also shown by clicking the box in the upper right of the view:

     

     

    Ensure that the Address column is added on the left of the gold bar, then expand each address entry to see individual DPC enters/exits for each function.  Using this data, you can determine which DPC routines took the longest by looking at the inclusive duration column, which should be added to the right of the gold bar: 

     

    In this case, these DPCs took 1 second, which is well over the recommended maximum of 100 us.  The module column (and possible the function column, if you have symbols) will show which driver is responsible for that DPC routine.  Since our ECHO driver was based on WDF, that is the module named here.

     

    For an example of doing this type of analysis in xperf, see http://blogs.msdn.com/b/ntdebugging/archive/2008/04/03/windows-performance-toolkit-xperf.aspx.

     

    More Information

    For additional information about Stop 0x133 errors, see this page on MSDN: http://msdn.microsoft.com/en-us/library/windows/hardware/jj154556(v=vs.85).aspx.

     

    For DPC timing recommendations and for advice on capturing DPC timing information using tracelog, see http://msdn.microsoft.com/en-us/library/windows/hardware/ff545764(v=vs.85).aspx.

     

    Guidelines for writing DPC routines can be found at http://msdn.microsoft.com/en-us/library/windows/hardware/ff546551(v=vs.85).aspx.

     

     

    -Matt Burrough

  • Ntdebugging Blog

    Configuring a Hyper-V VM For Kernel Debugging

    • 2 Comments

    Yesterday's blog prompted some questions about how to set up a debugger for a Windows OS running in a Hyper-V VM.  I was surprised that I wasn't able to find good, publicly available, Microsoft issued documentation for this configuration.

     

    The first step is to configure the Windows OS in the VM to enable a kernel debugger on COM1.  One would use these same steps if you were preparing the OS to be debugged using a null modem cable.  Hyper-V will allow us to redirect the COM port so that we don't need such a cable.

    1. Start an administrative command prompt.
    2. Turn on debugging with this command:
      bcdedit /debug on
    3. Configure the debugger to use COM1 with this command:
      bcdedit /dbgsettings SERIAL DEBUGPORT:1 BAUDRATE:115200
      Note that these are the default settings and already exist in most bcd stores.  However setting them again won't damage anything, and guards against a situation where the dbgsettings have been previously modified.
    4. Reboot so that the boot loader can read the new settings and configure the OS for debugging.

    CommandPrompt

     

    Next, configure Hyper-V to redirect the COM1 port to a named pipe.  We will use this pipe in place of a traditional null modem cable.

    1. Open Hyper-V Manager and browse to the settings page of the VM you configured to debug.
    2. Under the Hardware list choose COM 1.
    3. Change the Attachment to 'Named pipe:' and provide a pipe name.
      1. Note that the Hyper-V Manager provides the complete path to your named pipe.  Make a note of this path as you will need it in the next step.

    Settings2

     

    After the OS and the VM are configured for debugging, we need to connect a debugger.

    1. On the Hyper-V parent partition download and install the Debugging Tools for Windows from http://msdn.microsoft.com/en-us/windows/hardware/gg463009.
    2. After installing the debugging tools you will have a ‘Debugging Tools for Windows’ entry in your start menu.
      1. From this folder right click ‘WinDbg’ and choose ‘Run as administrator’.  Windbg needs administrative rights to connect to the pipe.
    3. In windbg open the File menu and choose ‘Kernel Debug’.
    4. Enter a Baud Rate of 115200, to match the settings made in the VM.
    5. Enter the Port that you configured in the VM settings page.
      1. To connect to the pipe remotely, substitute the '.' in the path with the Hyper-V server name.
    6. Ensure that the Pipe and Reconnect boxes are checked.
    7. Set Resets to 0.
    8. Click OK to start debugging.
    9. Windbg should display the string ' Waiting to reconnect...'

    image004

     

    To test the debugger connection in windbg, from the ‘Debug’ menu choose ‘Break’.  This should cause the server to break into the debugger and display a kd> prompt.  Please note that breaking into the debugger will cause the OS running in the VM to halt until you tell the debugger to go, the OS will appear to be hung during this time.  The command 'g' followed by Enter will tell the debugger to ‘go’ causing the VM to resume operation.

     

    Windbg

  • Ntdebugging Blog

    What Is In A RHS Dump File Created By Windows Error Reporting

    • 2 Comments

    Hello all, East here.  I wanted to give you a hint on how to use a RHS dump to find what thread was part of the Windows Server 2008 R2 cluster RHS recovery deadlock.

     

    First let me start off with letting you know that Windows Server 2008 R2 will create two types of user-mode dumps:

    1 - A heap dump (.hdmp) - which is an extended mini-dump that contains additional data such as the process memory.

    2 - A mini dump (.mdmp) - which will only contain mainly stack information and not enough info to help you get further.

     

    When a Windows Server 2008 R2 cluster has a resource that does not respond to one of the main cluster queries in a timely fashion, RHS will try to recover that resource by restarting it in an isolated process as well as creating a WER report and a user-mode dump of the deadlock.

     

    NOTE: More info about RHS and WER reports can be seen in the following blog - http://blogs.msdn.com/b/clustering/archive/2009/06/27/9806160.aspx

     

    When a deadlock is detected the cluster log will show:

    "00000dd4.00000aac::2011/05/13-16:12:39.559 ERR   [RHS] WER report is submitted. Result : WerReportQueued."

     

    You should find the WER report file and dumps located at the default location:

    <SystemDrive>\Users\All Users\Microsoft\Windows\WER\ReportQueue

     

    Here is what a cluster created Windows Error Report looks like:

    Version=1

    EventType=WSFC Resource Deadlock

    EventTime=129497767591366327

    ReportType=1

    Consent=1

    ReportIdentifier=d2277837-7d7b-11e0-b5d0-00155d06402c

    Response.type=4

    Sig[0].Name=Resource Name

    Sig[0].Value=New flexdisk

    Sig[1].Name=Resource Type

    Sig[1].Value=flexdisk

    Sig[2].Name=Call Type

    Sig[2].Value=ISALIVE

    DynamicSig[1].Name=OS Version

    DynamicSig[1].Value=6.1.7601.2.1.0.274.10

    DynamicSig[2].Name=Locale ID

    DynamicSig[2].Value=1033

    File[0].CabName=memory.hdmp

    File[0].Path=WERC1F2.tmp.hdmp <----------------------------- Name of the dump file   

    File[0].Flags=2097152

    File[0].Type=3

    File[0].Original.Path=C:\Windows\Temp\WERC1F2.tmp.hdmp

    File[1].CabName=minidump.mdmp

    File[1].Path=WERC2CE.tmp.mdmp

    File[1].Flags=2162690

    File[1].Type=2

    File[1].Original.Path=C:\Windows\Temp\WERC2CE.tmp.mdmp

    FriendlyEventName=WSFC Resource Deadlock

    ConsentKey=WSFC Resource Host Monitor

    AppName=Failover Cluster Resource Host Subsystem <--------- The application that had an issue - RHS

    AppPath=C:\Windows\Cluster\rhs.exe

    ReportDescription=WSFC Resource Deadlock

     

    Now you have the area to look for in the cluster log, and an idea of which resource caused RHS to deadlock, and what it was doing at the time:

     

    The Resource Name = New flexdisk

    The Resource Type = flexdisk

    The type of call the resource was engaged in = ISALIVE

     

    You have all this information in the cluster log as well:

    00000dd4.00000aac::2011/05/13-16:12:39.136 ERR   [RHS] RhsCall::DeadlockMonitor: Call ISALIVE timed out for resource 'New flexdisk'.  RHS deadlock during an Isalive check for the New flexdisk resource.

    00000dd4.00000aac::2011/05/13-16:12:39.136 INFO  [RHS] Enabling RHS termination watchdog with timeout 1200000 and recovery action 3.

    00000dd4.00000aac::2011/05/13-16:12:39.136 ERR   [RHS] Resource New flexdisk handling deadlock. Cleaning current operation and terminating RHS process.

    00000dd4.00000aac::2011/05/13-16:12:39.136 ERR   [RHS] About to send WER report.

    000004dc.00000fb4::2011/05/13-16:12:39.136 WARN  [RCM] HandleMonitorReply: FAILURENOTIFICATION for 'New flexdisk', gen(3) result 4.

    000004dc.00000fb4::2011/05/13-16:12:39.136 INFO  [RCM] rcm::RcmResource::HandleMonitorReply: Resource 'New flexdisk' consecutive failure count 1.

    00000dd4.00000aac::2011/05/13-16:12:39.559 ERR   [RHS] WER report is submitted. Result : WerReportQueued.  WER report and user-mode dump created

    000004dc.00000fb4::2011/05/13-16:12:39.559 ERR   [RCM] rcm::RcmMonitor::RecoverProcess: Recovering monitor process 3540 / 0xdd4

    000004dc.00000fb4::2011/05/13-16:12:39.575 INFO  [RCM] Created monitor process 3280 / 0xcd0

    00000cd0.00000200::2011/05/13-16:12:39.575 INFO  [RHS] Initializing.

    000004dc.00000fb4::2011/05/13-16:12:39.591 INFO  [RCM] rcm::RcmResource::ReattachToMonitorProcess: (New flexdisk, OfflineCallIssued)

    000004dc.00000fb4::2011/05/13-16:12:39.591 WARN  [RCM] Canceling pending control GET_RO_PRIVATE_PROPERTIES for resource 'New flexdisk' due to monitor crash.

     

    Now if you want to get adventurous and review the dump file to see the call stacks of RHS and what the resource was doing, you will look at the WER dump with the extension “.hdmp”.  Since we are looking at a process with multiple threads, we need to find the one that is showing the WER reporting calls and the one showing the RHS deadlock.

     

    Run the following command after opening the dump under your debugger “~*k”.  In the below example, thread 3 is the one we are interested in. You can tell this because the process and thread ID matches the cluster log above.  If you don’t have a cluster log, this is the thread with wer.dll on the stack.

    0:000> ~*k

     

    .  0  Id: dd4.9d4 Suspend: 0 Teb: 000007ff`fffdd000 Unfrozen

    Child-SP          RetAddr           Call Site

    00000000`001cf128 000007fe`fd611420 ntdll!ZwWaitForMultipleObjects+0xa

    00000000`001cf130 00000000`772c1220 KERNELBASE!WaitForMultipleObjectsEx+0xe8

    00000000`001cf230 00000000`ff763d53 kernel32!WaitForMultipleObjects+0xb0

    00000000`001cf2c0 00000000`ff74802c rhs!cxl::WaitHandle::WaitAny+0x3f

    00000000`001cf3b0 00000000`ff7458a1 rhs!Rhs::WaitForTermination+0x74

    00000000`001cf450 00000000`ff769099 rhs!wmain+0x4b1

    00000000`001cfbd0 00000000`772c652d rhs!__wmainCRTStartup+0x13d

    00000000`001cfc10 00000000`774fc521 kernel32!BaseThreadInitThunk+0xd

    00000000`001cfc40 00000000`00000000 ntdll!RtlUserThreadStart+0x1d

     

       1  Id: dd4.d0c Suspend: 0 Teb: 000007ff`fffdb000 Unfrozen

    Child-SP          RetAddr           Call Site

    00000000`00c3fa98 000007fe`fd611420 ntdll!ZwWaitForMultipleObjects+0xa

    00000000`00c3faa0 00000000`772c1220 KERNELBASE!WaitForMultipleObjectsEx+0xe8

    00000000`00c3fba0 00000000`ff763d53 kernel32!WaitForMultipleObjects+0xb0

    00000000`00c3fc30 00000000`ff75785c rhs!cxl::WaitHandle::WaitAny+0x3f

    00000000`00c3fd20 00000000`ff766988 rhs!PingAliveCallScheduler::Run+0x184

    00000000`00c3fda0 00000000`772c652d rhs!cxl::Thread::ThreadStartRoutine+0x18

    00000000`00c3fdf0 00000000`774fc521 kernel32!BaseThreadInitThunk+0xd

    00000000`00c3fe20 00000000`00000000 ntdll!RtlUserThreadStart+0x1d

     

       2  Id: dd4.e78 Suspend: 0 Teb: 000007ff`fffd9000 Unfrozen

    Child-SP          RetAddr           Call Site

    00000000`00baf998 00000000`774eb007 ntdll!ZwWaitForMultipleObjects+0xa

    00000000`00baf9a0 00000000`772c652d ntdll!TppWaiterpThread+0x14d

    00000000`00bafc40 00000000`774fc521 kernel32!BaseThreadInitThunk+0xd

    00000000`00bafc70 00000000`00000000 ntdll!RtlUserThreadStart+0x1d

     

       3  Id: dd4.aac Suspend: 0 Teb: 000007ff`fffd7000 Unfrozen

    Child-SP          RetAddr           Call Site

    00000000`00cbe5e8 000007fe`fd611420 ntdll!ZwWaitForMultipleObjects+0xa

    00000000`00cbe5f0 00000000`772d2d53 KERNELBASE!WaitForMultipleObjectsEx+0xe8

    00000000`00cbe6f0 00000000`773e8f7d kernel32!WaitForMultipleObjectsExImplementation+0xb3

    00000000`00cbe780 00000000`773e62b2 user32!RealMsgWaitForMultipleObjectsEx+0x12a

    00000000`00cbe820 00000000`773e62e0 user32!MsgWaitForMultipleObjectsEx+0x46

    00000000`00cbe860 000007fe`f6f3c35c user32!MsgWaitForMultipleObjects+0x20

    00000000`00cbe8a0 000007fe`f6f6dbd0 wer!UtilMsgWaitForMultipleObjects+0x54

    00000000`00cbe930 000007fe`f6f54158 wer!CReportManager::ReportProblemOutOfProcess+0x5c4

    00000000`00cbecc0 000007fe`f6f3b99b wer!CReportManager::ReportProblem+0x1eb

    00000000`00cbede0 000007fe`f6f61cbe wer!CReportHandleInstance::SubmitReport+0x386

    00000000`00cbee50 00000000`ff750dcf wer!WerReportSubmit+0x16a

    00000000`00cbeef0 00000000`ff75117b rhs!RhsResource::WERReportDeadlock+0x43f

    00000000`00cbfaf0 00000000`ff74b270 rhs!RhsResource::HandleDeadlock+0x187

    00000000`00cbfb60 00000000`772c652d rhs!RhsCall::DeadlockMonitor+0x324

    00000000`00cbfc60 00000000`774fc521 kernel32!BaseThreadInitThunk+0xd

    00000000`00cbfc90 00000000`00000000 ntdll!RtlUserThreadStart+0x1d

     

       4  Id: dd4.e48 Suspend: 0 Teb: 000007ff`fffae000 Unfrozen

    Child-SP          RetAddr           Call Site

    00000000`00e2f0f8 000007fe`fd611203 ntdll!ZwDelayExecution+0xa

    00000000`00e2f100 000007fe`f78f1ad9 KERNELBASE!SleepEx+0xab

    00000000`00e2f1a0 00000000`00326aa0 FlexRes+0x11ad9

    00000000`00e2f1a8 00000000`00000000 0x326aa0

     

    <snipped to save space>

     

    We will switch context to thread 3 by using the “ ~3s ” command, and follow that with “ kn “ to show the stack with frame numbers.  Debug Tip: Since we already know that its thread id is “aac” we can also switch directly to thread 3 by using its thread id in the following command syntax:  “ ~~[aac]s ”

    0:003> kn

    # Child-SP          RetAddr           Call Site

    00 00000000`00cbe5e8 000007fe`fd611420 ntdll!NtWaitForMultipleObjects+0xa

    01 00000000`00cbe5f0 00000000`772d2d53 KERNELBASE!WaitForMultipleObjectsEx+0xe8

    02 00000000`00cbe6f0 00000000`773e8f7d kernel32!WaitForMultipleObjectsExImplementation+0xb3

    03 00000000`00cbe780 00000000`773e62b2 user32!RealMsgWaitForMultipleObjectsEx+0x12a

    04 00000000`00cbe820 00000000`773e62e0 user32!MsgWaitForMultipleObjectsEx+0x46

    05 00000000`00cbe860 000007fe`f6f3c35c user32!MsgWaitForMultipleObjects+0x20

    06 00000000`00cbe8a0 000007fe`f6f6dbd0 wer!UtilMsgWaitForMultipleObjects+0x54

    07 00000000`00cbe930 000007fe`f6f54158 wer!CReportManager::ReportProblemOutOfProcess+0x5c4

    08 00000000`00cbecc0 000007fe`f6f3b99b wer!CReportManager::ReportProblem+0x1eb

    09 00000000`00cbede0 000007fe`f6f61cbe wer!CReportHandleInstance::SubmitReport+0x386

    0a 00000000`00cbee50 00000000`ff750dcf wer!WerReportSubmit+0x16a

    0b 00000000`00cbeef0 00000000`ff75117b rhs!RhsResource::WERReportDeadlock+0x43f

    0c 00000000`00cbfaf0 00000000`ff74b270 rhs!RhsResource::HandleDeadlock+0x187

    0d 00000000`00cbfb60 00000000`772c652d rhs!RhsCall::DeadlockMonitor+0x324

    0e 00000000`00cbfc60 00000000`774fc521 kernel32!BaseThreadInitThunk+0xd

    0f 00000000`00cbfc90 00000000`00000000 ntdll!RtlUserThreadStart+0x1d

     

    The frame of our concern will be frame d. You can either go the context ( /c ) or just display the registers  ( /r )  for that frame context:

    0:003> .frame /r d

    0d 00000000`00cbfb60 00000000`772c652d rhs!RhsCall::DeadlockMonitor+0x324

    rax=000007fffffd7000 rbx=0000000000346920 rcx=00000000007316c0

    rdx=0000000000000000 rsi=000000000035f8b0 rdi=000000000035f8b0

    rip=00000000ff74b270 rsp=0000000000cbfb60 rbp=0000000000000000

    r8=00000000ffffffff  r9=00000000000004ff r10=0000000000000000

    r11=0000000000cbe810 r12=0000000000000000 r13=0000000000000000

    r14=0000000000000000 r15=000000004dcd5877

    iopl=0         nv up ei pl zr na po nc

    cs=0033  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000246

    rhs!RhsCall::DeadlockMonitor+0x324:

    00000000`ff74b270 448b442430      mov     r8d,dword ptr [rsp+30h] ss:00000000`00cbfb90=ffffffff

     

    Here is where we get into tricky territory about where to look. I know from experience that the RDI register from above is going to end up containing my information for this resource. Another way to verify you are in the right area, dumping RDI will show the CallType.

     

    Dump that out to find the thread containing the stack with:

    0:003> dqs 000000000035f8b0

    00000000`0035f8b0  00000000`ff704eb8 rhs!IsAliveCall::`vftable' == CallType

    00000000`0035f8b8  00000000`00000004

    00000000`0035f8c0  00000000`001e7da0

    00000000`0035f8c8  00000000`00000e48 <-- Thread id of concerning resource

    00000000`0035f8d0  00000000`00346920

    00000000`0035f8d8  0000000c`00000001

    00000000`0035f8e0  00000000`00000003

    00000000`0035f8e8  00000000`4dcd5876

    00000000`0035f8f0  00000000`048d2dd3

    00000000`0035f8f8  00000000`00000100

    00000000`0035f900  00000001`00000100

    00000000`0035f908  00000000`00000000

    00000000`0035f910  00430000`00000031

    00000000`0035f918  8a0000a4`52e48988

    00000000`0035f920  004c0041`00530049

    00000000`0035f928  00450045`00560049

     

    If we scan our previous “~*k” output, we should find the thread id above to match our thread with the problem resource, FlexRes:

       4  Id: dd4.e48 Suspend: 0 Teb: 000007ff`fffae000 Unfrozen

    Child-SP          RetAddr           Call Site

    00000000`00e2f0f8 000007fe`fd611203 ntdll!ZwDelayExecution+0xa

    00000000`00e2f100 000007fe`f78f1ad9 KERNELBASE!SleepEx+0xab

    00000000`00e2f1a0 00000000`00326aa0 FlexRes+0x11ad9

    00000000`00e2f1a8 00000000`00000000 0x326aa0

     

    In this case I was using an in house test app that just went to sleep instead of responding to the LooksAlive() check, which caused IsAlive() to fail.

     

    Also if you check the cluster log 5 minutes prior to the RhsCall::DeadlockMonitor error you will see the resource and thread id as well, showing what was going on prior to the Deadlock:

    00000dd4.00000e48::2011/05/13-16:07:37.270 INFO  [RES] flexdisk <New flexdisk>: Set new entry point  = ISALIVEERROR:1722,ISALIVEFAILTYPE:1,ISALIVEHANG:1 Error = 0.

    000004dc.00000fd4::2011/05/13-16:07:37.286 WARN  [RCM] ResourceControl(SET_PRIVATE_PROPERTIES) to New flexdisk returned 5024.

    00000dd4.00000e48::2011/05/13-16:07:38.023 INFO  [RES] flexdisk <New flexdisk>: Set new entry point  = ISALIVEERROR:1722,ISALIVEFAILTYPE:1,ISALIVEHANG:0 Error = 0.

    00000dd4.00000e48::2011/05/13-16:07:38.023 INFO  [RES] flexdisk <New flexdisk>: Going to Hang in IsAlive.

     

    In most cases the cluster log can provide enough information to help you narrow down possible cluster resource issues.  If not, the user-mode dump may be able to help verify additional data, especially if you are debugging a third party resource dll that is deadlocking RHS.

  • Ntdebugging Blog

    MmCm – A Non Paged Pool Accounting Adventure

    • 1 Comments

    Here's one from the Rube Goldberg debug collection!

    The dripping sarcasm is because I'm about to show the reeeeally long way to figure out what's eating MmCm, skip down to the end if you are in a time crunch.

     

    Otherwise, do resist the temptation to skip ahead as some of the techniques can be used in other debugs.

     

    -Tate

     

     

    What is this Pool Tag?

     

    MmCm is a Non Paged Pool (NPP) tag allocated via a kernel mode call (a driver or the OS itself…applications cannot allocate this memory directly) to either MmAllocateContiguousMemory  or MmAllocateContiguousMemorySpecifyCache.

     

    Great comments from msdn…

     

    MmAllocateContiguousMemory can be called to allocate a contiguous block of physical memory for a long-term internal buffer, usually from the DriverEntry routine…

     

    A device driver that must use contiguous memory should allocate only what it needs during driver initialization because physical memory is likely to become fragmented as the operating system runs.

     

    Warning  If you use the MmAllocateContiguousMemory routine on computers with large amounts of memory, the operating system's performance might severely degrade when the system tries to create a contiguous chunk of memory. This degradation is reduced for Windows Server 2008. Memory that MmAllocateContiguousMemory allocates is uninitialized...

     

     

    First, how much is normal?

     

    As the comments indicate it is contingent mostly on the driver's needs, done at initialization time.  We have seen consumption in the single MB range up to tens of MB consumed depending on what drivers are loaded.

     

     

    What drivers are the most likely consumers of this memory and why would the amount vary?

     

    The most likely consumers on any machine would be the following types of drivers. Note that each will likely allocate a multiple of some amount per instance for each physical adapter exposed to Windows. (i.e. A single NIC driver used for 3 NICs in a system will likely allocate 3 times the MmCm vs. a single NIC.)

     

    Storage Adapters for your SAN...SCSI Controller, Fibre Channel Controller, etc.

    Video Adapters

    Network Adapters

     

    These drivers need to pre-allocate memory to service transfers and as mentioned need to do this right at boot time since memory can become fragmented later.  In other words, this allocation pattern attempts to "bake-in" a known set of scratch space if you will for the adapter to function properly given the features you wish to use.

     

     

    So what's the problem?

     

    Not everyone plays nice and the machine can run out of memory!  This is downright painful on x86 Exchange servers and their limited NPP ceiling.

     

    Recall from an earlier post and your likely experience, that running out of such memory can cause hangs or general system instability as the machine runs out of NPP.  The scary part is that this usually happens unexpectedly under incrementally higher than normal user demand. i.e when you most need the server it fails.  Not coincidently most often we see older  x86 machines, especially /3GB configured Exchange servers, have this memory consumption problem. I often wonder just how many machines are out there just waiting for a liiiiitle more user load and are hovering at the brink of NPP exhaustion…

     

    <Ok, queue the sad music for the heart breaking x86 Exchange server example>

     

    This server has the standard 128MB maximum for NPP because of /3GB

     

    0: kd> !vm

     

    *** Virtual Memory Usage ***

    Physical Memory:      851418 (   3405672 Kb)

    Page File: \??\C:\pagefile.sys

      Current:   4193280 Kb  Free Space:   4038364 Kb

      Minimum:   4193280 Kb  Maximum:      4193280 Kb

    Available Pages:      639032 (   2556128 Kb)

    ResAvail Pages:       796753 (   3187012 Kb)

    Locked IO Pages:         272 (      1088 Kb)

    Free System PTEs:      27083 (    108332 Kb)

    Free NP PTEs:           5778 (     23112 Kb)

    Free Special NP:           0 (         0 Kb)

    Modified Pages:         4138 (     16552 Kb)

    Modified PF Pages:      4116 (     16464 Kb)

    NonPagedPool Usage:    25151 (    100604 Kb)

    NonPagedPool Max:      32026 (    128104 Kb)

    PagedPool 0 Usage:     11497 (     45988 Kb)

    PagedPool 1 Usage:      1645 (      6580 Kb)

    PagedPool 2 Usage:      1667 (      6668 Kb)

    PagedPool 3 Usage:      1662 (      6648 Kb)

    PagedPool 4 Usage:      1679 (      6716 Kb)

    PagedPool Usage:       18150 (     72600 Kb)

    PagedPool Maximum:     63488 (    253952 Kb)

    Session Commit:          586 (      2344 Kb)

    Shared Commit:          4720 (     18880 Kb)

    Special Pool:              0 (         0 Kb)

    Shared Process:         7517 (     30068 Kb)

    PagedPool Commit:      18214 (     72856 Kb)

    Driver Commit:          8779 (     35116 Kb)

    Committed pages:      237424 (    949696 Kb)

    Commit limit:        1867524 (   7470096 Kb)

     

    Given 100Megs used, 28MB of that is MmCm

     

    0: kd> !poolused /t 10 2

       Sorting by  NonPaged Pool Consumed

     

      Pool Used:

                NonPaged            Paged

    Tag    Allocs     Used    Allocs     Used

    MmCm     3210 28779488         0        0        Calls made to MmAllocateContiguousMemory , Binary: nt!mm

    NDpp     1013  4076960         0        0        packet pool , Binary: ndis.sys

    MPIO   121985  3457752         0        0        UNKNOWN pooltag 'MPIO', please update pooltag.txt

    elxs        6  3299344         0        0        UNKNOWN pooltag 'elxs', please update pooltag.txt

    BCM0       24  3057232         0        0        UNKNOWN pooltag 'BCM0', please update pooltag.txt

    LSwi        1  2654208         0        0        initial work context

     RaME        3  2572288         0        0        RiAllocateMiniportDeviceExtension , Binary: storport.sys

    ElxA        5  2360208         3      160        UNKNOWN pooltag 'ElxA', please update pooltag.txt

    Io        223  2319712       123     5136        general IO allocations , Binary: nt!io

    TPLA      512  2097152         0        0        UNKNOWN pooltag 'TPLA', please update pooltag.txt

    TCPt       40  1662448         0        0        TCP/IP network protocol , Binary: TCP

    Mdl      7225  1435160         0        0        Io, Mdls

     Pool        3  1134592         0        0        Pool tables, etc.

     Devi      602  1124400         0        0        Device objects

     RcpI        1  1048576         0        0        Internal memory mgr initial heap block , Binary: sacdrv.sys

    brcm       38  1032528         0        0        UNKNOWN pooltag 'brcm', please update pooltag.txt

    TOTAL      219941 75555240     70157 56589336

     

    That's not a lot right?  Well, it may be more than you like.  Depending on load cycles on this server it could be too much, you'd have to know a bit more history of the maximum amount of NPP demanded at peak usage, etc. to know how close to the edge this server may be.  Let's go with "what's using the ~28MB?" as the question to answer...

     

     

    Who uses this memory?

     

    Here's where it gets interesting.  A useful heuristic here is to group the sizes of the allocations in Excel.

    You get this data from a !poolfind MmCm

     

    0: kd> !poolfind MmCm

     

    Scanning large pool allocation table for Tag: MmCm (f9a67000 : f9b67000)

     

    *f7fc7000 :free large page allocation, Tag was MmCm, size was 0x2000 bytes

    *f8298000 :free large page allocation, Tag was MmCm, size was 0x2000 bytes

    *faeae000 :large page allocation, Tag  is MmCm, size  is 0x4000 bytes

    *faeb2000 :large page allocation, Tag  is MmCm, size  is 0xd000 bytes

    *fae84000 :large page allocation, Tag  is MmCm, size  is 0x11000 bytes

    *fae95000 :large page allocation, Tag  is MmCm, size  is 0x11000 bytes

    *fae23000 :large page allocation, Tag  is MmCm, size  is 0x1000 bytes

    *fae24000 :large page allocation, Tag  is MmCm, size  is 0x3000 bytes

    *fae2d000 :large page allocation, Tag  is MmCm, size  is 0x1000 bytes

    .

    .

    .

    fdf2e000 size:  f18 previous size:    0  (Allocated) MmCm

    fdf2f000 size:  f18 previous size:    0  (Allocated) MmCm

    fdf30000 size:  f18 previous size:    0  (Allocated) MmCm

    fdf31000 size:  f18 previous size:    0  (Allocated) MmCm

    fdf32000 size:  f18 previous size:    0  (Allocated) MmCm

    fdf33000 size:  f18 previous size:    0  (Allocated) MmCm

    fdf34000 size:  f18 previous size:    0  (Allocated) MmCm

    fdf35000 size:  f18 previous size:    0  (Allocated) MmCm

    fdf36000 size:  f18 previous size:    0  (Allocated) MmCm

    fdf37000 size:  f18 previous size:    0  (Allocated) MmCm

    fdf38000 size:  f18 previous size:    0  (Allocated) MmCm

    fdf39000 size:  f18 previous size:    0  (Allocated) MmCm

    fdf3a000 size:  f18 previous size:    0  (Allocated) MmCm

    fdf3b000 size:  f18 previous size:    0  (Allocated) MmCm

    fdf3c000 size:  f18 previous size:    0  (Allocated) MmCm

     

    Searching NonPaged pool (fe000000 : ffb7e000) for Tag: MmCm

     

     

    When I have to do this, usually some variation of following works.

     

    .logopen

    !poolfind

    .logclose

     

    Open the txt file up in notepad, clean it up a bit (remove the frees and extra lines)

    Open the txt in Excel and use the old standby Text Import Wizard via file open

    I usually also save the large pool allocs to one txt file and the regular to another and then cut paste to combine them in Excel.

     

    There's probably an easier way to do this import, but this works well enough.

    image

    You can really go crazy getting the data all pretty but don't bother as even something as ugly as this is useful because a quick sort on Column B shows the allocation size pattern.  Note the Text Import Wizard allows you to skip columns here, so you can import only the address column and the size column.  i.e. faeae000’s column and the 0x4000 column in the above.

     

    image

    Via Excel I just generated another table on the fly to keep track of our totals here.  Note, I’m just counting the number of each size of alloc by visual inspection.  i.e. There are 9 allocations of 0x1000 size in my list.

     

    Allocation Sizes in Bytes (h)  

    Number of Allocations

    Total MmCm in Bytes

    0xf18 bytes 

    2014

    7,782,096

    0x1000 bytes

    9

    36,864

    0x2000 bytes

    1017

    8,331,264

    0xf000 bytes

    23

    1,413,120

    0x11000 bytes

    92

    6,406,144

    0x156000 bytes

    2

    2,801,664

     

    Ahhh…..In total we've got 26 of 28 Megs sampled here, well representing the MmCm usage minus some insignificant allocs and rounding, great.

     

    (I love the new Win7 calc.exe)

    clip_image003

    So how does this help me again?  Here's the crucial step.  Look at samples of these allocations (which is why I kept the address of the alloc in the Excel import).  There are very often telltale tags or strings inside the alloc that give you a strong indication if not proof positive who allocated this memory.

     

    Here's a common allocation example:

     

    Dump several samples of the f18 sized allocations…

     

    0: kd> dc fdf18000 fdf18000+f18

    fdf18000  0be30000 6d436d4d fd1df008 00000000  ....MmCm........<--the pooltag, okay…knew that...

    fdf18010  1f2e3d4c 00000000 00000000 00000000  L=..............<-------hum, this 1f2e3d4c seems to be in every one of these puppies...

    fdf18020  00000000 00000000 00000000 00000000  ................

    fdf18030  00000000 00000000 00000000 00000000  ................

    fdf18040  00000000 00000000 00000000 00000000  ................

     

    Remember we learned that these are predominantly allocated on boot?  Turns out that Storport actually allocates 1000 of these(per adapter) based on the extension size of the adapter and guess what it does, it puts a handy little tag in here defined as 1f2e3d4c.    Here's the proof.

     

    storport!RaInitializeRaidResources+0x47:

    f5d6aeb3 8bd7            mov     edx,edi

    f5d6aeb5 8bce            mov     ecx,esi

    f5d6aeb7 ff1510d1d6f5    call    dword ptr [storport!_imp_InterlockedPushEntrySList (f5d6d110)]

    f5d6aebd ff4508          inc     dword ptr [ebp+8]

    f5d6aec0 8b4508          mov     eax,dword ptr [ebp+8]

    f5d6aec3 3b450c          cmp     eax,dword ptr [ebp+0Ch]

    f5d6aec6 c747084c3d2e1f  mov     dword ptr [edi+8],1F2E3D4Ch<------------------ah ha!

    f5d6aecd 72be            jb      storport!RaInitializeRaidResources+0x21 (f5d6ae8d)

     

     

    Cool! So now I've accounted for 7.8 Megs of the 28, and I know at this point that since I've got ~2000 allocs I've got at least two storage adapters on this machine.

    Actually, If you are suspicious about the roughly 1000 0x2000 sized allocations and you think it's another adapter with a larger extension size, you'd be right!

     

    0: kd> dc fcc20000

    fcc20000  fcc22000 00000000 1f2e3d4c 00000000  . ......L=......<-------again!

    fcc20010  f7b7e000 00000000 00000000 00000000  ................

    fcc20020  00000000 00000000 00000000 00000000  ................

    fcc20030  00000000 00000000 00000000 00000000  ................

    fcc20040  00000000 00000000 00000000 00000000  ................

    fcc20050  00000000 00000000 00000000 00000000  ................

    fcc20060  00000000 00000000 00000000 00000000  ................

    fcc20070  00000000 00000000 00000000 00000000  ................

     

     

    Now, I've accounted for likely all my storage adapters (7,782,096 + 8,331,264  =  16,113,360) which are between the three of them consuming 16 of 28 Megs!

    (I say three because of the ~3000 allocations in total, with 1000 per adapter as stated.  So two adapters of size 0xf18 and one of size 0x2000)

    Sure enough checking msinfo32 on this machine shows three adapters.

     

    So what about the rest of the odd ball sized allocations and 12 Megs?  Especially the 92 0x11000 and the two whopping 0x156000 bytes ones?

     

    First , the two whopper 0x156000 byte allocs.  Here's where you are limited usually only by your own creativity and patience...( this particular sample is courtesy of a fellow persistent debugger,  Pushkar)

     

    //Dump the beginning of one…

     

    0: kd> dc fdb24000

    fdb24000  00036c40 fdb24080 fdc43100 08b24080  @l...@...1...@..

    fdb24010  00000000 08c43100 00000000 00000001  .....1..........

    fdb24020  00000000 00000000 00000000 00000000  ................

    fdb24030  00000000 00000000 00000000 00000000  ................

    fdb24040  00000000 00000000 00000000 00000000  ................

    fdb24050  00000000 00000000 00000000 00000000  ................

    fdb24060  00000000 00000000 00000000 00000000  ................

    fdb24070  00000000 00000000 00000000 00000000  ................

     

    On the line below with the highlight we are looking for interesting pointers and !pool'ing them to figure out who's associated with this alloc.

     

    0: kd> dc

    fdb24080  00000003 00000940 08b24100 00000000  ....@....A......

    fdb24090  fdef0080 00000000 08b241d8 00000000  .........A......<-----This is the line

    fdb240a0  00000080 08b25080 00000000 00000080  .....P..........

    fdb240b0  08b35080 00000000 00000080 08b29080  .P..............

    fdb240c0  00000000 00000080 08b37080 00000000  .........p......

    fdb240d0  00000080 08b2d080 00000000 00000080  ................

    fdb240e0  08b39080 00000000 00000080 08b31080  ................

    fdb240f0  00000000 00000080 08b3b080 00000000  ................

     

    //Oh, it happens to be a Device object...

     

    0: kd> !pool fdef0080

    Pool page fdef0080 region is Nonpaged pool

    *fdec1000 : large page allocation, Tag is Dev., size is 0x33000 bytes

    Owning component : Unknown (update pooltag.txt)

     

    //Dump it out via devobj, adding the object offset and poolheader length…this one looks like it's going to belong to my Emulex adapters

     

    0: kd> !devobj fdec1000+38

    Device object (fdec1038) is for:

    ElxPlus*** ERROR: Module load completed but symbols could not be loaded for elxplus.sys

    \Driver\elxplus DriverObject fdef41c0

    Current Irp 00000000 RefCount 1 Type 0000002a Flags 00000048

    DevExt fdec10f0 DevObjExt fdef3ca0

    ExtensionFlags (0000000000) 

    AttachedTo (Lower) fdf68ba0 \Driver\PnpManager

    Device queue is not busy.

     

    On to the 0x11000 sized, first pass is to just dc out the memory like before and look for interesting tags or strings…At the end of Network card adapter driver allocations via their calls into the Ndis layer for instance, you can flag them by the END of the allocation may have an ND** string.  I don't see any of those here but it "looks like" the 0x11000 sized allocations have network related data and strings as a common factor throughout, so I'm guessing these are associated with one of the four network adapters on this machine.  Turns out if you sample more of these they may have pointers back to other interesting pool allocations that can flag the network miniport adapter, etc.

     

    You may be thinking, hum…I get the storport one but these last two sound like mere correlation?  Absolutely.  However, given the lack of randomness sampling multiple allocations it's often strong enough to at least inform your hypothesis and test, quickly.

     

     

    Yes, there is an easier way!

     

    If this memory is allocated at driver initialization and most drivers initialize at system boot time, it stands to reason that most of this memory will be consumed on boot.  Cool!  Then it also stands to reason we can use our friend Poolmon.exe to quickly check how much is being used shortly after boot and do some quick testing by excluding some of the usual suspect adapters.  First, you could just trust me and check for ancillary storage, NIC, and enhanced video adapters and disable/remove these first as a low risk test.

     

    In any case, the testing sequence is:

     

    1.      Promptly record the amount of MmCm after a reboot via Poolmon.exe

    (Sort by Non Paged Pool (toggle through by hitting 'P') and then descending by bytes (same, but 'B')

    2.      Disable the ancillary test adapter(s)

    3.      Reboot, check poolmon.exe again

    4.      What's the decrease?

     

    5.      Perhaps repeat testing with updated drivers or disabled features if you must have the additional hardware in place (hoping they use less MmCm).

     

    I hope this post saved you time by being wary of configuring more adapters than absolutely necessary, especially /3GB x86 Exchange servers and if you have to reactively triage this a quick and dirty way to determine the primary consumers of this memory.

     

    Enjoy!

     

    -Tate

     

    P.S. If you can't scale down the number or features of let's say one of the suspect adapters, do check with your driver vendor to see if there is a way to throttle back the consumption.  For instance, when we shipped the Scalable Networking Pack feature, we often saw higher consumption in the NIC drivers supporting this functionality, later driver releases from hardware vendors reportedly used much less.

     

     

     

     

    Bonus:

     

    Here's another handy debug trick to isolate who can call these two Memory manager functions and narrow down the driver population...

     

    //Find the address of MmAllocateContiguousMemorySpecifyCache

    0: kd> x nt!MmAllocateContiguousMemorySpecifyCache

    e080efac nt!MmAllocateContiguousMemorySpecifyCache = <no type information>

     

    //Use the handy !for_each_module to scan each module for the address (e080efac )

    0: kd> !for_each_module ".echo ${@#ModuleName} ;s -d ${@#ModuleName} L?${@#Size} e080efac"

     

    //Hits will look like this, with the matching address listed…

    ATMFD

    nt

    e0a81044  e080efac e0806d12 e082e237 e082ddc7  .....m..7.......

    hal

    e0a81044  e080efac e0806d12 e082e237 e082ddc7  .....m..7.......

    RDPWD

    .

    .

    .

    ati2mtag

    f5575234  e080efac e083b69c e08454c9 e082f651  .........T..Q...

    f5836084  e080efac e08329f9 e0839c13 e08e2d78  .....)......x-..

    .

    .

    .

    ql2300

    f5d6d174  e080efac e0806d12 e08412f9 e08eaeb1  .....m..........

    f5e170cc  e080efac e08d15a3 e08e8543 e0815d48  ........C...H]..

    storport

    f5d6d174  e080efac e0806d12 e08412f9 e08eaeb1  .....m..........

    f5e170cc  e080efac e08d15a3 e08e8543 e0815d48  ........C...H]..

    elxstor

    f5e170cc  e080efac e08d15a3 e08e8543 e0815d48  ........C...H]..

    SCSIPORT

    f5e170cc  e080efac e08d15a3 e08e8543 e0815d48  ........C...H]..

     

     

    //Example match in the import table for ati2mtag...

    0: kd> dps f5575234

    f5575234  e080efac nt!MmAllocateContiguousMemorySpecifyCache

    f5575238  e083b69c nt!ZwQueryInformationProcess

    f557523c  e08454c9 nt!PsGetCurrentThreadProcessId

     

     

    //Repeat for the other call...

    0: kd> x nt!MmAllocateContiguousMemory

    e080be42 nt!MmAllocateContiguousMemory = <no type information>

     

     

     

     

    Share this post :

     

     

     

     

  • Ntdebugging Blog

    Part 1 - ETW Introduction and Overview

    • 2 Comments

    Introduction:

     

    Event Tracing for Windows (ETW) is a system and software diagnostic, troubleshooting and performance monitoring component of Windows that has been around since Windows 2000. However, it wasn't until Windows Vista that major components of the OS were updated to heavily use ETW tracing; making it much more practical and useful.

    ETW is useful from a variety of scenarios, including:

     -User & Admin: Being in control of your system and knowing what is going on.
     -User & Admin: Troubleshooting performance, hardware and OS issues.
     -Enthusiast: Learning further about the OS and the low level guts of the OS.
     -Software Developer/ISV/OEM: Investigating issues with your software's interaction with Microsoft OS & technologies
     -Hardware Developer/IHV/OEM: Investigating issues with hardware interaction with the OS, including kernel, driver subsystems, up to the user stack.

    ETW is a set of technologies and tools that can absolutely complement existing tools while providing a look into the guts of the OS at a very low level.

    A great article from the April 2007 edition of MSDN Magazine - http://msdn.microsoft.com/en-us/magazine/cc163437.aspx, covers ETW in great depth, and is recommended reading.

    Here is graphical overview of the ETW infrastructure that covers how provider(s) log to high performance in memory buffers, which can be kept in memory in a circular buffer, or written to disk in a sequential or circular fashion.

     

    image

    Some uses of ETW:

     

    Today on Windows Vista, Windows 7, and sparingly on earlier OSes; ETW is used by Microsoft Development and Microsoft Support, as well as others; to help troubleshoot issues, find root-cause of bugs, analyze performance; and a large variety of other tasks.

    As it stands today, ETW has two major thrusts:

    1. Exposing time-based event data similar in practicality to plain old text based logged (without the normal performance overhead). Troubleshooting logs have long been used across the industry by software developers and IT professionals to troubleshoot issues. Here are some examples of how ETW logging is used.

    First off, shown below is Windows Event Viewer enabling viewing of analytic and debug logs (which show you many more logs than the default view):

    image

    A variety of logs can be viewed graphically using the Event Viewer or via the built-in command line OS utility – wevtutil. There are a quite a variety of logs that are enabled by default and are available via the Event Viewer or in a kernel memory dump.

    Since ETW is also a foundational component, you might find it appearing in other troubleshooting utilities. For example, Office Communications Server 2007 uses ETW for troubleshooting.

    ETW is used to monitor for network events from the kernel which are then added to Process Monitor, a utility from Microsoft Sysinternals. You should also see tools such netsh and NetMon being able to capture and decode ETW traces.

     

    image

    There are many other methods and utilities to view the ETW tracing providers available, which may be found in the OS, via the Device Driver Kit (DDK), enabled with Microsoft products (such as Exchange or IIS), or used by tools, such as Office Communication System Logging.

    The second major thrust of ETW is exposing performance data.

    2. Performance data is exposed from the lowest levels of the kernel, driver subsystems, and every major user-mode component of Windows (such as Windows Media Player).

    Performance data can be used in a hybrid fashion both for troubleshooting and performance reasons. Take for example a built in OS tool that uses ETW tracing to provide rich data - Resource Monitor.

    Resource Monitor is available from the Windows Task Manager - Performance Tab. The cool thing about Resource Monitor is that it internally uses kernel ETW tracing to gather its data, providing rich data from the kernel itself; and thus providing a practical exposure of ETW technology for Administrators to troubleshoot their systems. This goes beyond the data that Task Manager provides, which often tends to be based on performance counters which you cannot drill down into further.

    image

    image

    Resource Monitor is useful from both a performance standpoint, as well as troubleshooting. For example, you can view top cpu users, top disk users (why your disk is thrashing), or search for a locked file handle that is open.

    For a deeper drill-down into performance data XPerf can be used to capture ETW traces and make them available for decoding offline (on another box).

    Here, Xperf (available with the Windows DDK and as a standalone download) is shown looking at Internet Explorer's file access time.

    image

    There have been several posts on the NTDebugging Blog regarding XPerf, other blog posts, as well as the Windows Performance Analysis Dev Center.

    Where to go from here:

     

    In Windows 7 and Windows Server 2008 R2, ETW tracing has even been further improved with many more providers including 600+ inbox providers’ registered on the box. This number increases when add-on features or products are installed (such as Hyper-V).

    The tracing provided by ETW, whether it is used directly in the OS, or via add-on tools, provides deep insight into Windows and how it works. We will explore ETW in depth in future blog posts.

     

     

    Share this post :

  • Ntdebugging Blog

    For testing purposes only, and other desktop watermarks

    • 3 Comments

    Hi all, Matt here again.  One of our team’s main functions is to work with our development teams to create hotfixes when customers run into issues that can only be resolved through a code change.  The developers will often prepare a private test fix that either tests the proposed change, or adds additional instrumentation to help pinpoint the issue. The private test fix is sent to the customer reporting the problem so they can confirm that it does indeed correct (or identify) the flaw.

     

    When testing a private test fix, customers frequently ask, why does my desktop now show a message on the lower right corner of the desktop, and what does it mean?  The message reads “For testing purposes only”, and looks like this:

    For testing purposes only

     

    Often, users are concerned that his message means that they aren’t allowed to use the server in production, or maybe that it is now “unsupported.”  These aren’t the case!  Since this message appears as a result of installing a fix during the course of a Microsoft Support case, the servers are, by definition, being supported. 

     

    The purpose of this message is simply to remind users that code that Microsoft Support has asked them to test has been installed on the system, and this code may not have yet undergone the full suite of quality assurance testing that fixes that are made public usually do.   

     

    For comparison, let’s look at some of the other watermarks you may find in the lower corner of the desktop – as these can often be confused for the above message, and may explain some of the customer concerns around these messages.

     

    First up is the old trusty text you see when a box is booted into ‘Safe Mode’. I’m sure every IT Pro has seen this at one time or another, so I won’t go into detail, but rest assured, the testing purposes text is completely unrelated to booting in safe mode or having a subset of services running.

    Safe Mode

     

    Next up is our ‘Evaluation copy’ watermark.  This message is shown on the desktops of copies of Windows that have a “time bomb” (ones that will cease to function after a certain date.)  This message is typically seen on beta versions of Windows which are designed to stop functioning sometime after the desired beta testing period ends. 

    Evaluation copy

     

    Third, we have our Windows is not genuine message.  This is shown if, for example, a copy of Windows is not activated during the grace period after the installation process, or if a number of hardware changes have been made and Windows needs to be re-activated.  This has nothing to do with the ‘testing purposes’ message.  See http://technet.microsoft.com/en-us/library/dd979803.aspx for more information about this message.

    This copy of Windows is not genuine

     

    Fourth, we have the general Windows build stamp.  This is enabled via the registry using the PaintDesktopVersion DWORD (http://technet.microsoft.com/en-us/library/cc782257(WS.10).aspx).  Some administrators like to enable this option so they always know what version of Windows they are using, sort of like a mini-bginfo.  Unlike the others, this message does not indicate anything else about a server’s state.

    Windows 7

     

    Finally, we have ‘Test Mode’.  This is actually somewhat related to the testing purposes message.  This ‘Test Mode’ text is shown when test signing is enabled on a PC.  This is done by running “bcdedit /set testsigning on” from an UAC-elevated command prompt.  Test signing is used to allow developers to load drivers they are still working on that have not yet been code signed with an official certificate.  This is actually one of the steps we need to do when loading our test fixes.  For more information on Test Signing, see http://msdn.microsoft.com/en-us/library/ff553484%28v=VS.85%29.aspx.

    Test Mode

     

    So now that you know what causes these various watermarks to appear, perhaps you’re wondering how to make the “For testing purposes only” message disappear.  This is a question we are frequently asked.  While you are running a private test fix, there is no way to disable this message.  Your only option is to remove the private test fix from your system.  This is something your engineer will ask you to do before you install the final, public version of a hotfix.  You can easily identify and uninstall private test fixes by going into Control Panel, Programs and Features, View Installed Updates, then look for hotfixes with the words FOR TESTING PURPOSES ONLY in their name, like the one shown in the image below.  You may also notice that the KB number listed for these fixes is often a place holder, and not a real KB article ID.

    Installed Updates

     

    If the ‘For testing purposes only’ message is still displayed even after uninstalling the private test fix, there is one more place to check.  If a system has the Microsoft Test Root Authority certificate installed into its Trusted Root Certification Authorities store, the text will be displayed.  We use this certificate to allow a PC to run test code that has been signed by our development team, but not yet fully tested and signed off with the official Microsoft certificate.  To remove this certificate from your system, go to Start -> Run, and enter certmgr.msc and hit enter.  In the Certificate Manager MMC, browse to Trusted Root Certification Authorities, then into Certificates.  You should see one labeled Microsoft Test Root Authority, as shown below.  This will need to be deleted and the system rebooted to remove the testing purposes message.  Do not do this if you still have a private test fix installed though, as it would prevent that binary from continuing to function and may mean you can no longer boot in to Normal or Safe mode.

    certmgr

     

    If you reboot and find that ‘Test Mode’ has replaced the ‘For testing purposes only’ text, you’ll need to launch a command prompt with administrative privileges, then run “bcdedit /set testsigning off” and reboot.  You can always check if test signing is enabled by running “bcdedit /enum” and looking for this line:

    bcdedit /enum

     

    That’s all for today.  Hopefully this post helped clear up any confusion about our different desktop watermarks.

  • Ntdebugging Blog

    Debugging a Bugcheck 0xF4

    • 1 Comments

    My name is Ryan Mangipano (ryanman) and I am a Sr. Support Escalation Engineer at Microsoft.  Today’s blog will be a quick walkthrough of the analysis of a bugcheck 0xF4 and how I determined that the action plan going forward should consist of enabling pool tagging on this system.

     

    I began my review with !analyze –v. From the output I can see that a process required for the system to function properly unexpectedly exited or was terminated. The goal of this debugging session will be to determine what failed and why.

     

    0: kd> !analyze -v

    *******************************************************************************

    *                                                                             *

    *                        Bugcheck Analysis                                    *

    *                                                                             *

    *******************************************************************************

     

    CRITICAL_OBJECT_TERMINATION (f4)

    A process or thread crucial to system operation has unexpectedly exited or been

    terminated.

    Several processes and threads are necessary for the operation of the

    system; when they are terminated (for any reason), the system can no

    longer function.

    Arguments:

    Arg1: 00000003, Process                                A value of 0x3 in this parameter indicated that it was a process that terminated, not a thread

    Arg2: 8a03ada0, Terminating object        This value is a pointer to the _EPROCESS object that terminated

    Arg3: 8a03af14, Process image file name              Process Name

    Arg4: 805d1204, Explanatory message (ascii)         text message about the problem

     

    We shall begin by dumping out all the parameters of the bugcheck. Let’s dump out the "Terminating Object" below

    0: kd> !object 8a03ada0

    Object: 8a03ada0  Type: (8a490900) Process

        ObjectHeader: 8a03ad88 (old version)

        HandleCount: 3  PointerCount: 228

     

    First, let’s dump out the process image file name from the bugcheck parameter 3 above.

    0: kd> dc 8a03af14

    8a03af14  73727363 78652e73 00000065 00000000  csrss.exe

     

     

    0: kd> dt _EPROCESS 8a03ada0 imageFileName

    CSRSRV!_EPROCESS

       +0x174 ImageFileName : [16]  "csrss.exe"

     

    Notice that if we add the base of the _EPROCESS object (8a03ada0- Parameter 2)to the offset of the imageFileName field (+0x174) we get parameter 3. The imageFileName field.

    0: kd> ? 8a03ada0+0x174

    Evaluate expression: -1979470060 = 8a03af14

     

    0: kd> dc 8a03af14

    8a03af14  73727363 78652e73 00000065 00000000  csrss.exe.......

     

    Let’s dump out the ASCII message from parameter number 4

    0: kd> dc 805d1204

    805d1204  6d726554 74616e69 20676e69 74697263  Terminating crit

    805d1214  6c616369 6f727020 73736563 25783020  ical process

     

    Let’s review the debugger help file for more information. We can see that this bugcheck occurs when a critical process or thread terminates. “Several processes and threads are necessary for the operation of the system. When they are terminated for any reason, the system can no longer function. “

    0: kd> .hh bug check 0xf4

     

     

     

     

     

     

    Next, we need to determine why this process terminated. !analyze –v also provided us with an exception record which provides us with an error code:

     

    PROCESS_NAME:  csrss.exe

     

    EXCEPTION_RECORD:  9a85e9d8 -- (.exr 0xffffffff9a85e9d8)

    ExceptionAddress: 7c92c375 (ntdll!RtlFindMessage+0x0000007c)

       ExceptionCode: c0000006 (In-page I/O error)

      ExceptionFlags: 00000000

    NumberParameters: 3

       Parameter[0]: 00000000

       Parameter[1]: 7c99c3d8

       Parameter[2]: c000009a

    Inpage operation failed at 7c99c3d8, due to I/O error c000009a

     

    EXCEPTION_CODE: (NTSTATUS) 0xc0000006 - The instruction at 0x%p referenced memory at 0x%p. The required data was not placed into memory because of an I/O error status of 0x%x.

     

    Since we have an error code, let’s investigate that error code. We can quickly perform this operation from within the debugger using the !error command

     

    0: kd> !error c000009a

    Error code: (NTSTATUS) 0xc000009a (3221225626) - Insufficient system resources exist to complete the API.

     

    0: kd > .hh !error

     

     

    Let’s check the output of the !vm command

     

    0: kd> !vm 2

     

    *** Virtual Memory Usage ***

                    Physical Memory:      760875 (   3043500 Kb)

                    Page File: \??\C:\pagefile.sys

                      Current:   4190208 Kb  Free Space:   4156380 Kb

                      Minimum:   4190208 Kb  Maximum:      4190208 Kb

                    Available Pages:      579241 (   2316964 Kb)

                    ResAvail Pages:       673481 (   2693924 Kb)

                    Locked IO Pages:          69 (       276 Kb)

                    Free System PTEs:     115226 (    460904 Kb)

                    Free NP PTEs:              0 (         0 Kb)

                    Free Special NP:           0 (         0 Kb)

                    Modified Pages:          221 (       884 Kb)

                    Modified PF Pages:       219 (       876 Kb)

                    NonPagedPool Usage:    65534 (    262136 Kb)

                    NonPagedPool Max:      65536 (    262144 Kb)

                    ********** Excessive NonPaged Pool Usage *****

                    PagedPool 0 Usage:     24167 (     96668 Kb)

                    PagedPool 1 Usage:       967 (      3868 Kb)

                    PagedPool 2 Usage:       967 (      3868 Kb)

                    PagedPool 3 Usage:       984 (      3936 Kb)

                    PagedPool 4 Usage:       977 (      3908 Kb)

                    PagedPool Usage:       28062 (    112248 Kb)

                    PagedPool Maximum:     92160 (    368640 Kb)

     

                    ********** 2075 pool allocations have failed **********

     

                    Session Commit:         1562 (      6248 Kb)

                    Shared Commit:          2526 (     10104 Kb)

                    Special Pool:              0 (         0 Kb)

                    Shared Process:         4821 (     19284 Kb)

                    PagedPool Commit:      28062 (    112248 Kb)

                    Driver Commit:          5138 (     20552 Kb)

                    Committed pages:      153449 (    613796 Kb)

                    Commit limit:        1767229 (   7068916 Kb)

     

    0: kd> !poolused

    unable to get PoolTrackTable - pool tagging is disabled, enable it to use this command

    Use gflags.exe and check the box that says "Enable pool tagging".

     

    The output above has informed us that pool tagging is disabled. Let’s demonstrate how you can verify that it is disabled:

     

    0: kd> dd nt!NtGlobalFlag L1

    805597ec  00000000

     

    0: kd> !gflag

    Current NtGlobalFlag contents: 0x00000000

     

    Let’s explore the debugging help file entry on the !poolused command

     

    0: kd > .hh !poolused

     

     

     

    Reading the text above, we are informed that “Pool tagging is permanently enabled on Windows Server 2003 and later versions of Windows. On Windows XP and earlier versions of Windows, you must enable pool tagging by using Gflags.”

     

     

    Using the vertarget command, I can see that this system was running Windows XP.

    0: kd> vertarget

    Windows XP Kernel Version 2600 (Service Pack 2) MP (2 procs) Free x86 compatible

     

     

     

    0: kd > .hh !gflag

     

     

     

    By reviewing the help file entry for the !gflag extension, I was able to determine that if pooltagging was set, the following bit would have been set:

     

    0x400 "ptg"         Enable pool tagging.

     

    0: kd> .formats 0x400

    Evaluate expression:

    ….

      Binary:  00000000 00000000 00000100 00000000                0x00000400

     

    Gflags is included in the Debugging Tools for Windows. The screenshot below is from a Windows 7 system. Notice that Pool Tagging is enabled permanently as described above.

     

     

     

    Summary: This system bugchecked when the critical process csrss.exe failed an I/O operation due to insufficient non-paged pool. For an action plan, we recommended the use of gflags to enable pool tagging in order to obtain more information about pool consumption.

     

     

    Share this post :

Page 5 of 24 (239 items) «34567»