• Ntdebugging Blog

    How Windows Shuts Down

    • 7 Comments

    Hi my name is Bryan,  I'm a escalation engineer on the Microsoft CPR platforms team.   A common problem scenario involves shutting down Windows. When troubleshooting problems during shut down we have to look at the Winlogon.exe process which can be tricky and must be done correctly.

     

    Troubleshooting Techniques

     

    Common Settings

     

    Settings that would affect how Windows shuts down are mostly in HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Winlogon, and are documented thoroughly at http://www.microsoft.com/technet/prodtechnol/windows2000serv/reskit/regentry/12314.mspx. The setting to really pay attention is:

     

    DisableCAD (REG_DWORD). When set to 1 the GINA will bypass the Ctrl-Alt-Del dialog and go straight to the logon prompt. The system cannot be shutdown when the logon prompt at the console is displayed.

     

    Another common setting that is sometimes needed for security reasons is in HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management.

     

    ClearPageFileAtShutdown (REG_DWORD). When set to 1 the memory manager will clear the paging file on shutdown. If the paging file is large this can significantly the time it takes for the system to shut down.

     

    Winlogon Event Notification

     

    Prior to Windows Vista/2008 Server, Winlogon Notification Packages can be used to reliably monitor the state of Winlogon. These are simple export DLLs that are registered with entry points for each event monitored by the package. Winlogon explicitly calls these entry points when a specific event occurs. Each package handles the notification within the context of the Winlogon process. Keep in mind that a bug in a notification package can cause problems in Winlogon and render the system inaccessible. You can find more information about Winlogon Notification Packages at http://msdn2.microsoft.com/en-us/library/aa380545.aspx.

     

    Starting with Windows Vista support for Winlogon Notification Packages has been pulled. Most of this functionality is still provided through SCM event notifications. See the following link for more information about this.

     

    http://technet2.microsoft.com/WindowsVista/en/library/6ec4ec6d-6b84-44c9-b3af-116589a42b861033.mspx?mfr=true

     

    You can find information about how to write a service here.

     

    http://msdn2.microsoft.com/en-us/library/ms685969.aspx

     

    However these notifications are performed asynchronously so the exact moment of each state within Winlogon cannot be reliably known.

     

    Winlogon Logging

     

    Winlogon logging is obtained by applying the checked build of Winlogon and setting the following registry values under HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Winlogon.

     

    DebugFlags (REG_SZ). This is a comma-separated list of debug flags to capture. The list of flags is Error, Warning, Trace, Init, Timeout, Sas, State, MPR, CoolSwitch, Profile, DebugLsa, DebugSpm, DebugMpr, DebugGo, Migrate, DebugServices, Setup, SC, Notify, and Job.

     

    LogFile (REG_SZ). This is the path to the log file where the events should be written. If this value is missing then the events are written to the debug port.

     

    These options can also be placed in the win.ini file. This option and other debug logging options for Winlogon can be found in these articles.

    http://support.microsoft.com/default.aspx?scid=kb;EN-US;232575

    http://support.microsoft.com/default.aspx?scid=kb;EN-US;221833

     

    Checked builds of Windows binaries are available through MSDN subscriptions. The checked binary applied to the system must match the service pack level of the system.

     

    When looking at the Winlogon logs there are three check points to identify. They are the three ExitWindowsEx calls made during shutdown. These log entries were made during a shutdown initiated with “shutdown –r –f –t 0”.

     

    328.372> Winlogon-Trace: Doing remote-initiated (Administrator) Reboot=true, Force=true

    328.372> Winlogon-Trace: Starting user thread for Logoff, flags = 3807

    328.764> Winlogon-Trace: Doing immediate shutdown, LastGinaRet = b, Flags = 3804

    328.764> Winlogon-Trace: Calling ExitWindowsEx(0x3804, 0)

    .

    .

    .

    328.1528> Winlogon-Trace: ExitWindowsEx called to shut down COM processes

    .

    .

    .

    328.332> Winlogon-Trace: Starting shutdown

    328.332> Winlogon-Trace: Starting user thread for Logoff, flags = 7

    .

    .

    .

    328.284> Winlogon-Trace: Calling ExitWindowsEx(0x7, 0)

     

    If shutdown stalls then based on the log entries you should be able to identify which part of shutdown it is stuck in.

     

    Debugging

     

    In order to debug Winlogon you will need to first install the debugging tools for Windows (http://www.microsoft.com/whdc/devtools/debugging/default.mspx). It contains information about how to set up the debugger for Winlogon debugging. If the debugger is not set up correctly on Winlogon the system can easily get in an unusable state and the system will need to be reinstalled.

     

    You should debug Winlogon whenever a problem manifests itself on the Winlogon desktop where the SAS and logon prompt dialogs are displayed. When debugging Winlogon you should start by looking at the main process thread. If shutdown (or any other Winlogon activity) is hung that thread will show what happened to prevent shutdown from completing. Also look at these Winlogon flags which provide some indication of whether or not the main thread is even processing a shutdown.

     

    0105fe8c winlogon!g_fReadyForShutdown

    0105fdf8 winlogon!ShutdownHasBegun

    01062b3c winlogon!ShutdownInProgress

    01062b30 winlogon!ShutdownTime

     

    I will point out how these variables are modified.

     

    Shutdown Sequence of Events

     

    If you are debugging a shutdown issue you first need to understand the sequence of events that take place during shutdown. This will help isolate what activity may be failing.

     

    RPC Call

     

    An application, local or remote, tells Windows to shut down by calling the Win32 API InitiateSystemShutdownEx (http://msdn2.microsoft.com/en-us/library/aa376874.aspx). This API creates an RPC call over the named pipe InitShutdown. The remote connect and RPC call looks like this in a network trace.

     

    11:08:40.025                 CLIENT         SERVER        SMB              SMB: C; Nt Create Andx, FileName = \InitShutdown

    11:08:40.027                 CLIENT         SERVER        MSRPC        MSRPC: c/o Request: unknown Call=0x1 Opnum=0x2 Context=0x0 Hint=0x20

     

    A local connect would not need to go over the network, but it uses the same mechanism to make this call.

     

    The server side of this RPC call is handled by the function BaseInitiateShutdownEx in the Winlogon.exe process. This RPC worker performs the following actions.

     

    1.       Checks the privilege level of the client. If the client privilege test fails then the error status code is returned. This will typically be ERROR_ACCESS_DENIED (0x5).

    2.       Parses the shutdown command sent by the RPC client. If the command is malformed then the status error code of ERROR_INVALID_PARAMETER (0x57) is returned.

    3.       Checks the Winlogon flags ShutdownInProgress, ShutdownHasBegun, g_fReadyForShutdown, and the current state of the terminal desktop to see if we can shutdown. Winlogon cannot begin the shutdown if a shutdown is in progress or if it is not ready. Winlogon also will not start a shutdown if the force flag was not specified in the command and the desktop is locked. If the ShutdownInProgress or ShutdownHasBegun flag is set then the return value is ERROR_SHUTDOWN_IN_PROGRESS (0x45B). If the g_fReadyForShutdown is not set then the return value is ERROR_NOT_READY (0x15).

     

    Debugger output:

    dd winlogon!g_fReadyForShutdown l 1

    0105fe8c  00000001

     

    dd winlogon!ShutdownInProgress l 1

    01062b3c  00000000

     

    dd winlogon!ShutdownHasBegun l 1

    0105fdf8  00000000

     

    4.       Winlogon initializes the shutdown data including the global variable ShutdownTime. If this variable has been set then we know we’ve gotten this far.

     

    Debugger Output:

    dq winlogon!ShutdownTime l 1

    01062b30  01c7a859`baee0060

     

    .formats 01c7a859`baee0060

    Evaluate expression:

      Time:    Wed Jun  6 12:42:54.506 2007 (GMT-4)

     

    5.       If Winlogon has determined that it can proceed then the ShutdownInProgress flag is set. Subsequent shutdown requests from this point will fail in step 3 when it checks this flag.

    6.       Generates the shutdown audit event if auditing was enabled. Note that shutdown may fail at some point after this even though the audit log is generated.

    7.       Creates another thread to continue working on the shutdown. At this point the RPC worker thread returns to the caller.

     

    To this point there has been no visible indication on the server being shutdown. If the RPC worker fails for some reason then the client application will get an indication of the failure. If the RPC worker successfully hands the shutdown request to the next thread then the client application will get ERROR_SUCCESS (0) as the return code. The client application will get no indication of a failure after that point.

     

    Worker Thread

     

    The RPC worker thread hands control to the LogoffThreadProc thread. If the delay specified in the shutdown command was 0 then the ShutdownInProgress flag is cleared and the ShutdownHasBegun is set. In the Winlogon log you will see a line that starts with “Doing immediate shutdown”. Otherwise you will not see any visible indication at this point that a shutdown is occurring.

     

    If the delay specified in the shutdown command was greater than 0 then the countdown shutdown dialog is displayed.

     

    Shutdown

     

    When that dialog completes the AbortShutdown flag is checked. This flag would get set as a result of a call to AbortSystemShutdown (http://msdn2.microsoft.com/en-us/library/aa376630.aspx). If it is set then the shutdown is aborted. Otherwise the ShutdownInProgress flag is cleared, the ShutdownHasBegun flag is set and if the shutdown was initiated by a system process then we shut the system down immediately at this point with no further clean up.

     

    After these flags are updated the ExitWindowsInProgress flag is set and we call the Win32 API ExitWindowsEx (http://msdn2.microsoft.com/en-us/library/ms893047.aspx). In the Winlogon log you will see a line that starts with “Calling ExitWindowsEx”.

     

    Debugger Output:

    dd winlogon!ExitWindowsInProgress l 1

    0105fd84  00000001

     

    The Win32 API ExitWindowsEx makes an RPC call to CSRSS.EXE. CSRSS synchronously sends a WM_QUERYENDSESSION message to all Windows applications. When an application gets this message it indicates that shutdown can continue and CSRSS then sends the WM_ENDSESSION message. After that the process is terminated. If the application indicates that it cannot be terminated then CSRSS stops processing any further applications and waits for the interactive user to close the application. The ExitWindowsEx call will fail with error ERROR_OPERATION_ABORTED (0x3E3) and the Winlogon flags are reset so that a new shutdown request can be processed.

     

    An application that prevents shutdown from proceeding in this manner can be seen visual since it will be the foreground window on the desktop. To confirm which application returned did this you will need to live debug CSRSS. A return code of 3 from either winsrv!ConsoleClientShutdown or winsrv!UserClientShutdown will indicate the application did this.

     

    Debugger Output:

    0:002> pc

    eax=00000000 ebx=7c81a3ab ecx=7ffdb000 edx=75a58ca0 esi=75a58ca0 edi=00164600

    eip=75a564de esp=0052fe40 ebp=0052fe68 iopl=0         nv up ei pl zr na po nc

    cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000246

    CSRSRV!CsrShutdownProcesses+7e:

    75a564de ff5740 call dword ptr [edi+0x40]{winsrv!UserClientShutdown (75a9db1f)} ds:0023:00164640=75a9db1f

     

    ; Step past the call.

    0:002> p

    eax=00000003 ebx=7c81a3ab ecx=7ffdb000 edx=75a58ca0 esi=75a58ca0 edi=00164600

    eip=75a564e1 esp=0052fe4c ebp=0052fe68 iopl=0         nv up ei pl zr na po nc

    cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000246

    CSRSRV!CsrShutdownProcesses+81:

    75a564e1 8bf8             mov     edi,eax

     

    ; The first parameter is a structure that contains the process ID.

    75a564d5 ff75f4           push    dword ptr [ebp-0xc]

    75a564d8 ff750c           push    dword ptr [ebp+0xc]

    75a564db ff75f8           push    dword ptr [ebp-0x8]

    75a564de ff5740           call    dword ptr [edi+0x40]

     

    ; Get the pointer to this structure.

    0:002> dd ebp-8 l 1

    dd ebp-8 l 1

    0052fe60  0018a530

     

    ; The first DWORD is the process ID. The second DWORD is the thread ID.

    0:002> dd 0018a530 l 1

    dd 0018a530

    0018a530  0000066c

     

    ; Break into kernel mode so we can look at all the processes.

    0:002> .breakin

    .breakin

    Break instruction exception - code 80000003 (first chance)

    nt!RtlpBreakWithStatusInstruction:

    8081db0e cc              int     3

     

    ; Get the process object with that process ID.

    kd> !process 0000066c 0

    Searching for Process with Cid == 66c

    PROCESS ff62a638  SessionId: 0  Cid: 066c    Peb: 7ffdf000  ParentCid: 0108

        DirBase: 0390d000  ObjectTable: e1658e38  HandleCount:  51.

        Image: test.exe

     

    Console (text-based) applications are asynchronously sent a separate CTRL_SHUTDOWN_EVENT notification. This means that ExitWindowsEx will proceed no matter how the application chooses to handle this notification.

     

    Services.exe (and all Windows services) is a console application which receives this notification from CSRSS. Services.exe registers a control handler ScShutdownNotificationRoutine which calls ScShutdownAllServices on shutdown. This function traverses through all Windows services that are not stopped or stopping and that accept the shutdown notification and sends them the SERVICE_CONTROL_SHUTDOWN notification. Each service has 20 seconds by default to shutdown. However a service may request more time by calling SetServiceStatus with a wait hint and updated check point. It can do this so long as it continues to respond within the current timeout period. However since Services.exe received this as an asynchronous message from CSRSS.EXE it will not prevent the system from shutting down.

     

    After ExitWindowsEx returns control to Winlogon the ExitWindowsInProgress flag is cleared and LogoffThreadProc exits.

     

    MainLoop

     

    The state of Winlogon is controlled by the main process thread in a function called MainLoop. As LogoffThreadProc sets the ShutdownHasBegun flag and calls ExitWindowsEx the MainLoop function picks up on this change of state and begins executing its shutdown code. Since MainLoop is responsible for interaction with the user this is the first place where the user will get visible confirmation from Winlogon that the system is shutting down.

     

    When MainLoop sees that the Winlogon state has changed to shutting down it takes the following actions.

     

    1.       Signal the shell that we are shutting down. This causes the Explorer shell to disappear.

    2.       Checks to see if there are any updates to the user’s profile.

    3.       Send out the logoff notification event.

    4.       Delete network connections.

    5.       Play the logoff sound.

    6.       Play the system exit sound.

    7.       Creates a KillComProcesses thread. This calls ExitWindowsEx and will wait up to 15 minutes for this to complete.

    8.       Save and unload the user’s profile.

    9.       Delete RAS connections.

    10.    Send out the shutdown notification event.

    11.    Stop Windows file protection.

    12.    Creates another LogoffThreadProc thread which again calls ExitWindowsEx.

    13.    Call the shutdown function in the GINA. This displays the Windows is shutting down dialog.

    14.    Wait for any remaining system processes to complete. If we are stuck here then we would need to look at System, smss.exe, or csrss.exe. One of those will be stuck in some action.

    15.    Shut down the system.

     

    Winlogon uses the native API NtShutdownSystem in step 15. If there is a hang after this point then you will see the main thread in Winlogon stuck in this call and the real problem is likely the result of a device driver in the System process. This call will never return.

  • Ntdebugging Blog

    How Windows Starts Up (Part the second)

    • 7 Comments

    Howdy folks, David here again with part two of How Windows Starts Up.  Today we’ll be covering the Boot Loader Phase.  Let’s take a moment to recap where we are at this point.  So far, the computer has completed POST, executed the MBR code, located the active partition, executed the Boot Sector code in that partition, found NTLDR and loaded it into memory.  So what happens next?  During this phase, we’ll be discussing the following files located in the root of your system volume:  NTLDR, ntdetect.com and boot.ini.

     

    The Boot Loader Phase begins when NTLDR executes:

     

    1.    NTLDR switches the processor from real mode to 32-bit flat memory mode.

    How does it do this you ask?  NTLDR contains two parts – a 16-bit component which handles switching the processor to Protected Mode and a 32-bit component which handles everything else. 

     

    When NTLDR first begins execution, the processor is in an x86-operating mode called Real-Mode.  In Real-Mode, no virtual to physical translation of memory addresses occurs, which means that all memory addresses use the 16-bit Segment:Offset format, and only the first 1MB of the computer's physical memory is accessible.  However, we don’t have to concern ourselves with this too much because the first action NTLDR takes is to switch the CPU to Protected-Mode. 

     

    *Trivia Alert*  before we enter Protected-Mode, we enable the A20 Line – you IBM guys know what I’m talking about… 

     

    Still no virtual-to-physical translation occurs at this point in the boot process, but a full 32 bits of memory addressing becomes accessible, and now NTLDR can access up to 4GB of physical memory. 

     

    2.    NTLDR creates enough page tables to make memory below 16MB accessible with paging turned on.  This number has been increased with Windows Server 2003. 

     

    3.    NTLDR enables paging.  “Protected-Mode with Paging Enabled” is the mode in which Windows executes during normal operation.

     

    4.    NTLDR then starts the FAT and NTFS mini file system drivers contained within its own code.  Unlike the Boot Sector's file system code, NTLDR's code is sophisticated enough to read subdirectories.

     

    Note:   All disk access from this point until the Kernel Phase begins uses BIOS Interrupt 13h, Extended Interrupt 13h or NTBOOTDD.SYS

     

    5.    NTLDR locates and parses Hiberfil.sys to determine if it contains a valid hibernator file.  If a valid hibernator file is found, then that file is loaded into memory and execution continues from the point the computer was placed into hibernation.

     

    Hibernation files are interesting.  The file either contains a full memory image, or it contains a pointer (ARC path) to the boot partition of the last operating system that entered hibernation.  That boot partition will contain another Hiberfil.sys file, which contains a full memory image of the hibernating operating system.  If the Hiberfil.sys file is a pointer, the file will contain an ARC path using one of the following syntaxes:

     

                             - linkmulti()disk()rdisk()partition()

                             - linksignature()disk()rdisk()partition()

                             - linkscsi()disk()rdisk()partition()

     

    6.    If Hiberfil.sys is not found, NTLDR parses the Boot.ini file.  If the Boot.ini contains a reference to more than one operating system or boot option, then the boot loader screen is displayed. 

     

    This is a sample of a default Boot.ini file from a Windows 2000 Server-based computer:

     

    [boot loader]

    timeout=30

    default=multi(0)disk(0)rdisk(0)partition(1)\WINNT

    [operating systems]

    multi(0)disk(0)rdisk(0)partition(1)\WINNT="Windows 2000 Server" /fastdetect

     

    Let’s talk a bit about what’s in the Boot.ini file.  Boot.ini contains a list of ARC paths.  Ok, quick – five bonus points to the first person who can tell me what ‘ARC’ stands for…  come on…  Yes, you there in the red shirt… correct!  It stands for ‘Advanced RISC Computing’; now beam down with the captain.  The rest of you come with me…

     

    ARC paths may use the “Multi”, “SCSI” or “Signature” syntax.

     

    The “Multi” syntax instructs Windows to rely on the system BIOS to load system files.  This means that NTLDR will be using INT13 BIOS calls to find and load NTOSKRNL and any other files on the boot partition.

     

    The “SCSI” syntax tells Windows to load a SCSI device driver to access the boot partition.  A copy of the driver must be renamed to NTBOOTDD.SYS and placed in the root of the system partition.

     

    The “Signature” syntax is equivalent to the SCSI syntax, but is used instead to support the Plug and Play architecture in Windows 2000 or later.  Because Windows 2000 is a PnP OS, the SCSI controller number instance might vary each time you start Windows 2000, especially if you add new SCSI controller hardware after setup.

     

    The Signature syntax instructs NTLDR to locate the drive with a disk signature that matches the value in the parentheses, regardless of which SCSI controller number that the drive is connected to.  Keep in mind that the Signature syntax also requires the presence of an Ntbootdd.sys file in the root of the system partition.  If you recall from the previous post, the disk signature value is located in the MBR.

     

    Example:

    signature(8b467c12)disk(1)rdisk(0)partition(2)\winnt="Windows 2000"

     

          Signature() syntax is used only if one of the following conditions exists:

     

    ·         The System BIOS or controller hosting the boot partition on which Windows is installed, does not support INT13 Extensions, or has the INT13 Extensions option disabled, and the partition on which you installed Windows is either larger than ~7.8 gigabytes (GB) in size, or the ending cylinder number for that partition is higher than 1024.

    ·         The drive on which you installed Windows is connected to a SCSI controller with SCSI BIOS disabled, so INT13 BIOS calls cannot be used during the boot process.

     

                            227704 Windows May Use Signature() Syntax in the Boot.ini File

                            http://support.microsoft.com/default.aspx?scid=kb;EN-US;227704

     

    7.    If Boot.ini is not found, NTLDR attempts to continue the boot process by assuming that Windows was installed in the default location on the same partition (c:\windows).  If the c:\windows directory is not found, NTLDR displays this error message:  Windows could not start because the following file is missing or corrupt: \winnt root\system32\ntoskrnl.exe

     

    8.    If the boot loader screen is displayed, the user must select an operating system to boot.

    a.    If the user chooses to run an operating system other than Windows 2003/XP such as DOS or Windows 95/98, NTLDR loads and executes the Bootsect.dos file located in the root of the boot drive.

    (1)  If the Bootsect.dos file is found, NTLDR loads it into RAM and then passes control to it.

    (2)  Bootsect.dos then loads IO.sys, MSDOS.sys, Command.com, etc., as necessary and proceeds to boot the downlevel operating system.

    (3)  If the file is not found in the root, NTLDR displays:

    I/O Error accessing boot sector file multi(0)disk(0)rdisk(0)partition(1): \bootsect.dos

    b.    If the user selects a Windows NT-based operating system, NTLDR executes Ntdetect.com.

    (1)  If the ARC path points to a hard drive or partition that is not on the system, the following is displayed:

    Windows could not start because of a computer disk hardware configuration problem.  Could not read from the selected boot disk.  Check boot path and disk hardware.  Please check the Windows documentation about hardware disk configuration and your hardware reference manuals for additional information.

    (2)  The above message may also be displayed if you are using Signature() ARC path notation and the disk signature in the MBR is overwritten (for example, by a virus) or changed (corrupted) so that it no longer matches the Signature() value in the Boot.ini file.  The following error message may also be displayed when there are problems with the Signature () value:

    Windows could not start because of the following ARC firmware boot configuration problem: did not properly generate ARC name for HAL and system paths. Please check the Windows documentation about ARC configuration options and your hardware reference manuals for additional information.

     

    9.    Ntdetect.com initializes and detects all supported hardware that it can find.  (Troubleshooting tip:  There is an Ntdetect.chk on the Windows Resource Kit CD in the Debug.cab file, which can be renamed and used in place of Ntdetect.com to view what Ntdetect.com is detecting during startup.  You know you want to – go ahead, I’ll wait…)

     

    Ntdetect.com detects the following components:

     

    a)    Computer ID

    b)    Bus/adapter type

    c)    Video adapter

    d)    Keyboard

    e)    Communications ports

    f)     Floppy disks

    g)    Mouse or other pointing devices

    h)    Parallel ports

     

    10.  Ntdetect.com passes its list of hardware components to NTLDR.

     

    11.  If the /SOS switch was used in the Boot.ini for the OS that is loading, NTLDR will display the files it loads into memory from this point on.  Keep in mind – none of these files are being initialized.  They are just being loaded into RAM at this point.  I can’t tell you how many times someone has incorrectly concluded that there is a problem with MUP.SYS because it’s the last file displayed on this screen before their computer hangs.  Poor MUP – always getting a bad rap.  This list can be useful, however; it can help you identify potential malware that may be loading.

     

    12.  NTLDR loads Ntoskrnl.exe and Hal.dll into memory.  Again, *loading*, not initializing.

     

    13.  NTLDR loads the System registry hive into memory.  If Windows cannot load the registry hive, it may be necessary to restore the System hive from the ERD or by using the Windows Recovery Console to copy the files from the Windows\Repair or Windows\Repair\RegBack folder.

     

    14.  NTLDR reads the HKLM\System\Select registry key.

    a.    If the Last Known Good option was not selected, NTLDR looks for the “Default” value in the SELECT key to select which ControlSet to use.  (0x1 = ControlSet001, 0x2 = ContolSet002, 0x3 = ControlSet003) 

    b.    If the Last Known Good option was selected, NTLDR looks for the “LastKnownGood” value in the SELECT key to select the ControlSet to use.

     

    15.  NTLDR sets the “Current” value in the HKLM\System\Select key to the ControlSet being used.

     

    Note:  At this point, NTLDR has determined which ControlSet it needs to use, but we don’t actually create the “CurrentControlSet” symbolic link until the kernel initializes (more on that in our next installment).  Therefore, for the purposes of this discussion, I’ll use *ActiveControlSet* to refer to whichever registry ControlSet has been identified as the one NTLDR needs to use.

     

    16.  NTLDR scans HKLM\System\*ActiveControlSet*\Control\Nls\Codepage for the “Acp”, “Oemcp”, and “Oemhal” values.  These entries are pointers to the table of codepage information stored in the same key.  The table has the filenames that correspond to the codepage numbers.  NTLDR loads these files into memory.  The defaults are C_1252.nls, C_437.nls, and Vgaoem.fon.

     

    A code page is a character set specific to an operating system or hardware platform.  It maps character codes to individual characters.  Different code pages include different special characters, typically customized for a language or a group of languages.  The system uses code pages to translate keyboard input into character values for non-Unicode based applications, and to translate character values into characters for non-Unicode based output displays.

     

    17.  NTLDR scans HKLM\System\*ActiveControlSet*\Control\Nls\Language for the “Default” value.  This entry is a pointer to the table of language information stored in the same key.  NTLDR loads this file into memory.  The default is L_intl.nls.

     

    18.  NTLDR scans HKLM\System\*ActiveControlSet*\Services for device drivers with a “Start” value of 0x0.  It finds their “Group” and “Tag” values also.

     

    19.  NTLDR scans HKLM\System\*ActiveControlSet*\Control\ServiceGroupOrder for the order that service groups should be loaded.

     

    20.  NTLDR sorts the drivers by “Group” and then by their “Tag” value.  Each service or driver has a particular group to which it belongs.  All drivers within a specific group are ordered by tag. 

     

    21.  NTLDR loads the 0x0 device drivers into memory.  It does not initialize or start these drivers.

     

    22.  The board is set, and the pieces are moving.  NTLDR executes Ntoskrnl.exe.

     

    Troubleshooting the Boot Loader Phase

     

    Okay, now that we’ve seen what is supposed to happen, what could possibly go wrong?  Well, several things…

     

    First, all file-system activity during this phase uses BIOS Int13, Extended Int13 or Ntbootdd.sys. 

     

    1.    If your computer uses Extended Int 13, verify that Ext. Int13 is enabled in your system’s BIOS or SCSI BIOS as appropriate.

     

    2.    If you are using a SCSI driver, verify that you are using the correct SCSI driver (Ntbootdd.sys).  If you are using "SCSI()" syntax in the Boot.ini file, copy the correct device driver for the SCSI controller to the root and then rename it to Ntbootdd.sys.  This is the driver that is stored in the Winnt.sbs\System32\Drivers folder (for example, the Aic78xx.sys file).  If you use "MULTI()" syntax in the Boot.ini file, you do not need to do this.  Use the Windows Recovery Console or a Windows NT boot floppy to complete this step.

     

                301680 HOW TO: Create a Boot Disk for an NTFS or FAT Partition

                http://support.microsoft.com/default.aspx?scid=kb;EN-US;301680

     

    307654 How to install and use the Recovery Console in Windows XP

    http://support.microsoft.com/default.aspx?scid=kb;EN-US;307654

     

     

    Significant Errors during this phase:

     

    ·         Windows could not start because of a computer disk hardware configuration problem.  Could not read from the selected boot disk.  Check boot path and disk hardware.  Please check the Windows documentation about hardware disk configuration and your hardware reference manuals for additional information.

     

    This error indicates that the ARC path is pointing to a hard drive or partition that is not on the system.  This message may also be displayed if you are using Signature() ARC path notation and the disk signature in the MBR is overwritten (for example, by a virus) or changed (corrupted) so that it no longer matches the Signature() value in the Boot.ini file. 

     

    ·         Windows could not start because of the following ARC firmware boot configuration problem: did not properly generate ARC name for HAL and system paths. Please check the Windows documentation about ARC configuration options and your hardware reference manuals for additional information.

     

    This error message may be displayed when there are problems with the Signature () ARC path syntax.

     

    Ntoskrnl.exe missing or corrupt.  This error could be the result of an incorrect ARC path in the Boot.ini file.  Use the information in the following knowledge base articles to correct the Boot.ini information.

     

                102873 BOOT.INI and ARC Path Naming Conventions and Usage

                http://support.microsoft.com/default.aspx?scid=kb;EN-US;102873

     

                155222 How to Determine the ARC Path

                http://support.microsoft.com/default.aspx?scid=kb;EN-US;155222

     

                130921 Creating an FT Boot Disk With Scsi( ) and Multi( ) Identifiers

                http://support.microsoft.com/default.aspx?scid=kb;EN-US;130921

     

    It is also possible to use the Windows XP Recovery Console to create a new Boot.ini for a Windows NT, Windows 2000 or .NET installation.  Please see the following knowledge base article for a description of this process:

     

                291980 A Discussion About the Bootcfg Command and Its Uses

                http://support.microsoft.com/default.aspx?scid=kb;EN-US;Q291980

     

    Specify the /SOS switch in the Boot.ini file and monitor the output to the screen.  If the system stops responding before anything is output to the screen, then the problem may be with hardware detection.  Replace Ntdetect.com with Ntdetect.chk to view the hardware detected by the system during this phase.

     

                103659 Setup Hangs While Inspecting Hardware; How to Use Ntdetect.com

                http://support.microsoft.com/default.aspx?scid=kb;EN-US;103659

     

                320040 HOW TO: Use Ntdetect.chk to Identify Hardware Problems

    http://support.microsoft.com/default.aspx?scid=kb;EN-US;320040

     

     

    Incorrect HAL

    If after specifying the /SOS option, you see that an incorrect version of the HAL.DLL is loading, use the following KB articles to replace the HAL or troubleshoot other HAL-related issues:

     

                314477 "Hardware Configuration Problem" Err Msg Starting Windows

                http://support.microsoft.com/default.aspx?scid=kb;EN-US;314477

     

                237556 Troubleshooting Windows 2000 Hardware Abstraction Layer Issues

                http://support.microsoft.com/default.aspx?scid=kb;EN-US;237556

     

    330184  "Invalid Boot.ini" or "Windows could not start" error messages

    http://support.microsoft.com/default.aspx?scid=kb;EN-US;330184

     

     

    Unable to load System hive

    Several problems could happen here.  First, consider that the amount of memory we’re dealing with is only 16MB.  We have to squeeze NTLDR, the System Registry Hive, the kernel, the HAL, all the boot drivers and the PFN database into a relatively small area.  So, with Windows Server 2003, we’ve given ourselves a bit more room. 

     

                302594 The System hive memory limitation is improved in Windows Server 2003

    http://support.microsoft.com/default.aspx?scid=kb;EN-US;302594

     

    If Windows is unable to load the System registry hive into memory, then it may be necessary to restore the System hive from the ERD or use the Windows Recovery Console to copy the files from the Winnt\Repair or Winnt\Repair\RegBack folder whichever version is most recent.

     

                307545 How to Recover from a Corrupted Registry

                http://support.microsoft.com/default.aspx?scid=kb;EN-US;307545

     

     

    Table of Startup Type Values

    HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services

    0x0 = Boot

    Loaded into RAM by the Boot loader (NTLDR) during Boot Loader Phase, then initialized by the Kernel (NTOSKRNL) during Kernel Phase

    0x1 = System

    Loaded and initialized by the Kernel during Kernel Phase

    0x2 = Auto

    Loaded or started automatically at system startup

    0x3 = Manual

    Driver is manually started by the user or another process

    0x4 = Disabled

    Driver is not to be started (There are exceptions, such as NTFS)

     

     

    Conclusion

    Well folks, we certainly have covered a lot of ground so far.  Any questions?  Yes, you there in the front…  Ah, why have we not discussed any blue screens?  Very good question.  We haven’t seen any BSOD’s up to this point in the boot sequence because the kernel isn’t running yet and STOP errors are generated by the kernel.  No kernel, no blue screens. 

     

    That’s all the time we have this week boys & girls.  Join us next time for How Windows Starts Up, The Kernel Phase.

     

  • Ntdebugging Blog

    How it Works: DLL Injection

    • 7 Comments

     

    Introduction

     

    Hi everyone, this is Bob again.  I recently worked on an issue where the interaction of two threads in Winlogon led to a bugcheck.  One thread was a Winlogon thread initializing GDI.  The interesting thing about this scenario is how the other thread ended up in this process.

     

     

     

    What was the thread doing?

     

    Below is the user half of the thread stack.  The thread attempted to load a DLL.  

     

    ChildEBP RetAddr  Args to Child

    0058eaec 773901ad 773901d9 0058eafc 00240022 ntdll!KiFastSystemCallRet

    0058eb0c 775d96f3 775d1808 00000000 77e6f032 USER32!NtUserRegisterWindowMessage+0xc

    0058ed24 775e4755 00000000 00000001 7c837512 comctl32!InitGlobalMetrics+0x44

    0058ed3c 775e426a 00000031 0058ed68 7763490c comctl32!_ProcessAttach+0x98

    0058ed48 7763490c 775d0000 00000001 00000000 comctl32!DllMain+0x21

    0058ed68 7c81a352 775d0000 00000001 00000000 comctl32!_DllMainCRTStartup+0x52

    0058ed88 7c833465 776348ba 775d0000 00000001 ntdll!LdrpCallInitRoutine+0x14

    0058ee90 7c834311 00000000 00000000 7c8e2e58 ntdll!LdrpRunInitializeRoutines+0x367

    0058f124 7c834065 00000000 00080e98 0058f3ec ntdll!LdrpLoadDll+0x3cd

    0058f3a0 77e41bf3 00080e98 0058f3ec 0058f3cc ntdll!LdrLoadDll+0x198

    0058f408 77e5c70b 7c8e2e58 00000000 00000000 kernel32!LoadLibraryExW+0x1b2

    0058f41c 7c92a6a1 7c8e2e58 00000000 7c8e2e58 kernel32!LoadLibraryW+0x11

    0058f454 7c92a65f 7c8e2e58 7c8d0000 7c9297b6 SHELL32!SHFusionLoadLibrary+0x2a

    0058f460 7c9297b6 00000020 00000008 0058f6a8 SHELL32!DelayLoadCC+0x15

    0058f694 7c929728 0058f6a8 0000007c 00000001 SHELL32!SHFusionInitializeIDCC+0x92

    0058f8b4 7c92966f 7c8d0000 0000007c 00000001 SHELL32!SHFusionInitializeFromModuleID+0x3a

    0058f8c8 7c92962c 7c8d0000 00000001 0058f8f8 SHELL32!_ProcessAttach+0x34

    0058f8d8 7c92bb63 7c8d0000 00000001 00000000 SHELL32!DllMain+0x27

    0058f8f8 7c81a352 7c8d0000 00000001 00000000 SHELL32!_DllMainCRTStartup+0x52

    0058f918 7c833465 7c92bb1b 7c8d0000 00000001 ntdll!LdrpCallInitRoutine+0x14

    0058fa20 7c834311 00000000 00000000 00000004 ntdll!LdrpRunInitializeRoutines+0x367    ß This function is

          loading and calling init

          functions of dependent DLL’s

    0058fcb4 7c834065 00000000 00080760 0058ff7c ntdll!LdrpLoadDll+0x3cd

    0058ff30 77e41bf3 00080760 0058ff7c 0058ff5c ntdll!LdrLoadDll+0x198        ß 0058ff5c is the Unicode string

       pointer to DLL name

    0058ff98 77e5c70b 00570254 00000000 00000000 kernel32!LoadLibraryExW+0x1b2

    0058ffac 0057017e 00570254 00000000 00200008 kernel32!LoadLibraryW+0x11

    WARNING: Frame IP not in any known module. Following frames may be wrong.

    0058fff4 00000000 00570228 00905a4d 00000003 0x57017e

     

     

    The DLL being loaded depends on other DLLs.  These DLLs are loaded and initialized when the first DLL is loaded.  So if DLL ‘A’ has a call to DLL ‘B’, then DLL ‘B’ is loaded by the loader when DLL ‘A’ is loaded.

     

     

    What is so unusual about this thread?

    If you examine the IP for that start address taken from !thread and where it is calling LoadLibraryW, the IP is not in a range of any loaded module.

     

    1: kd> !thread

    THREAD 86edd020  Cid 7884.7528  Teb: 7ffdc000 Win32Thread: bc1adb48 RUNNING on processor 1

    Not impersonating

    DeviceMap                 e10018c0

    Owning Process            87c51d88       Image:         winlogon.exe

    Wait Start TickCount      2567944        Ticks: 0

    Context Switch Count      4                 LargeStack

    UserTime                  00:00:00.015

    KernelTime                00:00:00.000

    Start Address 0x00570000   ß Starting address.  This is not in any module displayed by “!peb”

     

    The !PEB extension will display the loaded module list and address range for the process.  It is not shown here because of space.  However this address is not in any loaded module.

     

    Let’s look at the function:

     

    00570000 55              push    ebp

    00570001 8bec            mov     ebp,esp

    00570003 83ec3c          sub     esp,3Ch

    00570006 8365e800        and     dword ptr [ebp-18h],0

    0057000a 8365ec00        and     dword ptr [ebp-14h],0

    0057000e 8365f800        and     dword ptr [ebp-8],0

    00570012 8365dc00        and     dword ptr [ebp-24h],0

    00570016 8365f000        and     dword ptr [ebp-10h],0

     

    1: kd> u

    0057001a 8365e000        and     dword ptr [ebp-20h],0

    0057001e 8365f400        and     dword ptr [ebp-0Ch],0

    00570022 6a01            push    1

    00570024 8b4508          mov     eax,dword ptr [ebp+8]        ß The 1st argument is a pointer to a list of functions.

    00570027 ff5004          call    dword ptr [eax+4]

    0057002a 8945fc          mov     dword ptr [ebp-4],eax

    0057002d 8b4508          mov     eax,dword ptr [ebp+8]        ß Function block.

    00570030 ff5010          call    dword ptr [eax+10h]

     

    1: kd> u

    00570033 8945e4          mov     dword ptr [ebp-1Ch],eax

    00570036 837de400        cmp     dword ptr [ebp-1Ch],0

    0057003a 0f84c0010000    je      00570200

     

    The first argument is a block of functions.  That is what was passed as the initial parameter.  What functions were passed?

     

    1: kd> dds 570228 l 5

    00570228  77e5c6fa kernel32!LoadLibraryW

    0057022c  77e6c2dc kernel32!SetErrorMode

    00570230  77e70531 kernel32!GetCurrentDirectoryW

    00570234  77e70d67 kernel32!SetCurrentDirectoryW

    00570238  77e63ec7 kernel32!GetProcessHeap

     

    These functions are standard kernel32 calls.  So, the question is why is it doing that?

     

     

    What is this thread doing?

    Based on the fact that the IP is not in any module, the IP is page aligned, and the thread was passed function addresses as it initial parameter, it looks like this thread was “injected” into this process. 

     

     

    How was this thread injected?

    Another process, (I do not know which one) allocated a block of memory in the Winlogon process via VirtualAllocEx.  This function takes a process handle as input and it can be a different process.  Then code can be moved into the process via WriteProcessMemory.  Then a thread can be created using the starting address of the memory address returned from VirtualAllocEx.

     

     

    Are we done?

    Nope – remember the block of addresses?  This is needed because the module was not loaded by the loader.  So the functions did not get resolved by the linker.  So the addresses of functions outside of the code moved are unknown.  Since in Windows Server 2003 the functions of certain DLLs stay at the same address, they can be passed to another process.  Vista and beyond does not do this, so this method will not work.

     

     

    Conclusion

    Hope you found this interesting.  More on DLL injection can be found here: http://www.codeproject.com/dll/DLL_Injection_tutorial.asp and here: http://en.wikipedia.org/wiki/DLL_injection.

     

    AppInit_DLLs is another method of getting a DLL to load (in all process that rely on user32.dll), documented in KB 197571.  This can lead to problems though, as Raymond Chen discussed here.

     

  • Ntdebugging Blog

    Tracking down MmSt paged pool usage

    • 6 Comments

     

    A trend that I’ve noticed recently are cases involving paged pool depletion with high MmSt tag usage that remains after trying KB304101 (PoolUsageMaximum). These pool allocations are used by the memory manager for section object prototype PTEs. There are generally only two options when this happens: 1) upgrade to a 64-bit platform, or 2) reduce the size of the volumes. But we may want to know what mapped files are using this memory. Here is how it can be done. Start with !memusage.

     

    5: kd> !memusage

     loading PFN database

    loading (100% complete)

    Compiling memory usage data (99% Complete).

                 Zeroed:  19073 ( 76292 kb)

                   Free:      0 (     0 kb)

                Standby: 1468824 (5875296 kb)

               Modified:    368 (  1472 kb)

        ModifiedNoWrite:   1927 (  7708 kb)

           Active/Valid: 605772 (2423088 kb)

             Transition:      0 (     0 kb)

                    Bad:      0 (     0 kb)

                Unknown:      0 (     0 kb)

                  TOTAL: 2095964 (8383856 kb)

      Building kernel map

      Finished building kernel map

    Scanning PFN database - (100% complete)

     

    Following this you will see the list of mapped files and their control areas.

     

      Usage Summary (in Kb):

    Control Valid Standby Dirty Shared Locked PageTables  name

    8c62a638  1108  945868  3064     0     0     0  mapped_file( $Mft )

     

    The control area is the address at the far left and has a Segment field that contains the total number of PTEs.

     

    5: kd> dt 8c62a638 _CONTROL_AREA Segment->TotalNumberOfPtes

    nt!_CONTROL_AREA

       +0x000 Segment                    :

          +0x004 TotalNumberOfPtes          : 0x1e8b00

     

    The MmSt allocations contain these PTEs so all we need to do is multiply this by the size of a PTE to get the total size of the MmSt allocations for this control area. Note that there may be multiple allocations for this control area, but this number will reflect the total size all these allocations.

     

    5: kd> ?? 0x1e8b00 * sizeof(nt!_MMPTE)

    unsigned int 0xf45800

     

    So now we know the MmSt size in bytes for a single control area, or mapped file. What if we would like to see the totals for all mapped files from the !memusage output? First, place the !memusage output in a text file and remove all header information. You will also need to remove all tail information including the page file and process summaries. Every line should look like these.

     

    8c62a638  1108  945868  3064     0     0     0  mapped_file( $Mft )

    8b06ac18   516      0     0     0     0     0    No Name for File

     

    We want to include the “No Name for File” entries since those are valid mapped files even though the name could not be located. Next strip out everything but the control area address. You can use Excel or any other tool that allows you to select and delete columns in a text file. Now we have a file with a single column of all the control areas on the system. The following debugger command script can be used to process this file.

     

    $$ countptes.txt script

    r $t2 = 0;

     

    $$ Replace the memusage.txt file name with your file name.

    .foreach /f (ca "memusage.txt") {

        r $t1 = @@c++(((nt!_CONTROL_AREA *)(0x${ca}))->Segment->TotalNumberOfPtes);

        .printf "Control Area %p : %d\n", ${ca}, @$t1;

        r $t2 = @$t2 + @$t1;

    }

     

    .printf "Total PTEs : %d\n", @$t2;

    .printf "MmSt size  : %d bytes\n", (@$t2 * @@c++(sizeof(nt!_MMPTE)));

     

    The following command will execute the script.

     

    5: kd> $$><countptes.txt

     

    This will show the number of PTEs for each control area, followed by a summary.

     

    Total PTEs : 62790244

    MmSt size  : 502321952 bytes

     

    A common high user of MmSt allocations is $Mft. The cache manager will hold the MmSt allocations for these file system metadata files at a cost of up to 4 files per PTE. This technique can be used to determine how much $Mft is using MmSt pool memory by first using findstr at a command prompt to isolate just those values from the !memusage output.

     

    C:\Projects>findstr /c:"$Mft" memusage.txt >mftusage.txt

     

    After stripping out the control area addresses with Excel and running the command script you’ll have the size of the MmSt allocations for just the $Mft files. If this is consuming most of the MmSt bytes then you are limited to the options mentioned at the beginning of this article. There may be other options if something else is the primary user but it will likely involve reducing some heavy load on the system.

     

    -Bryan

  • Ntdebugging Blog

    How to track down High CPU in User Mode Applications - A live debug!

    • 6 Comments

     

    Written by Jeff Dailey.

     

    Hello NTDebuggers,  I’d like to talk about a common issue we deal with on a regular basis.   We are often tasked with finding what functions are using CPU within a user mode process / application.  Typically a user will find an application that is using more CPU then they expect it to, and this can affect overall system performance and responsiveness.   

     

    For this exercise I’ve written some contrived sample code called EATCPU.  It’s included at the bottom of the blog post.    The following image of task manager shows EATCPU consuming 41% CPU time.   A customer or user may tell you this does not “normally” happen.  It’s always good to ask what “normal” is.   In this case we will say normal is ~10%.

    clip_image002[6]

    The best case scenario is a live debug on the process that is running at high CPU levels.  If you’re lucky enough to have a customer / user that will let you do a remote debug, and the problem reproduces on demand, you can take the following action.

     

    You need to install the debugging tools for Windows, and set your symbol path.  If at all possible acquire the symbols for the application you are debugging.  We’ll assume that you are the expert that supports said program.  If it’s written in house, get the symbols from the developer.  If it’s from a third party, that vendor may be willing to provide you with symbols for their product.  Microsoft has most of the symbols for our products available on our symbol server on the internet (!sympath srv*DownstreamStore*http://msdl.microsoft.com/download/symbols.)

     

    The next thing is to attach windbg.exe to the process in question.

    From the debuggers directory, type TLIST, this will list your process.  Get the process id and then run WinDBG.EXE –p PROCESSID, or if your debugging a program like EATCPU, you can run WINDBG C:\program\EATCPU.EXE.

     

    After the debugger is attached or the process is started in the debugger, reproduce the problem.


     

    Microsoft (R) Windows Debugger  Version 6.8.0001.0

    Copyright (c) Microsoft Corporation. All rights reserved.

     

     

    ***** WARNING: Your debugger is probably out-of-date.

    *****          Check http://dbg for updates.

     

    CommandLine: eatcpu.exe

    Symbol search path is: srv*C:\symbols*\\symbols\symbols

    Executable search path is:

    ModLoad: 00400000 0041a000   eatcpu.exe

    ModLoad: 779b0000 77b00000   ntdll.dll

    ModLoad: 76780000 76890000   C:\Windows\syswow64\kernel32.dll

    ModLoad: 62bb0000 62cd1000   C:\Windows\WinSxS\x86_microsoft.vc80.debugcrt_1fc8b3b9a1e18e3b_8.0.50727.762_none_24c8a196583ff03b\MSVCR80D.dll

    ModLoad: 75cb0000 75d5a000   C:\Windows\syswow64\msvcrt.dll

    (1090.164): Break instruction exception - code 80000003 (first chance)

    eax=00000000 ebx=00000000 ecx=712b0000 edx=00000000 esi=fffffffe edi=77a90094

    eip=779c0004 esp=0017faf8 ebp=0017fb28 iopl=0         nv up ei pl zr na pe nc

    cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000246

    ntdll!DbgBreakPoint:

    779c0004 cc              int     3

    0:000> g

    (1090.11d4): Break instruction exception - code 80000003 (first chance)

    eax=7efa3000 ebx=00000000 ecx=00000000 edx=77a1d894 esi=00000000 edi=00000000

    eip=779c0004 esp=0109ff74 ebp=0109ffa0 iopl=0         nv up ei pl zr na pe nc

    cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000246

    ntdll!DbgBreakPoint:

    779c0004 cc              int     3

    0:007> .sympath SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols

    Symbol search path is: SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols

    0:007> g

    (1090.17d4): Break instruction exception - code 80000003 (first chance)

    eax=7efa3000 ebx=00000000 ecx=00000000 edx=77a1d894 esi=00000000 edi=00000000

    eip=779c0004 esp=0109ff74 ebp=0109ffa0 iopl=0         nv up ei pl zr na pe nc

    cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000246

    ntdll!DbgBreakPoint:

    779c0004 cc              int     3

     

    Once the problem has started run the !runaway command.

     

    0:007> !runaway

     User Mode Time

      Thread       Time

       2:c04       0 days 0:01:08.827  ß Note this thread, thread 2:c04 is using more CPU than any other.

       7:17d4      0 days 0:00:00.000  ß Note the other threads are using very little if any CPU.

       6:1a4c      0 days 0:00:00.000

       5:d20       0 days 0:00:00.000

       4:157c      0 days 0:00:00.000

       3:1b28      0 days 0:00:00.000

       1:1134      0 days 0:00:00.000

       0:164       0 days 0:00:00.000

     

    0:007> ~2s  ß Using the thread number 2, set the thread context with the ~Ns command.

    *** WARNING: Unable to verify checksum for eatcpu.exe

    eax=cccccccc ebx=00b93c48 ecx=0000002b edx=00b937a8 esi=00000000 edi=00d9fcf0

    eip=0041169c esp=00d9fcd0 ebp=00d9fd9c iopl=0         nv up ei pl nz na po nc

    cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010202

    eatcpu!checkSomething+0x1c:

    0041169c f3ab            rep stos dword ptr es:[edi]  es:002b:00d9fcf0=cccccccc

    0:002> k    ß Dump the call stack using k.

     

    If you look at the following call stack, the applications code in this thread starts where you see EATCPU, the code before that is C  Runtime code for begin thread. I want to trace all the code that is running under being thread.  The assumption here is that I’ll find something looping and eating CPU.  To do this I will use the WT command.   However first I need to specify a beginning address for WT to start trace at.

     

    Let's use UF to Unassemble the Function that started our code by taking the return address of  eatcpu!myThreadFunction.

     

    ChildEBP RetAddr 

    00d9fd9c 00411657 eatcpu!checkSomething+0x1c [c:\source\eatcpu\eatcpu\eatcpu.cpp @ 49]

    00d9fe74 004115a8 eatcpu!trySomething+0x27 [c:\source\eatcpu\eatcpu\eatcpu.cpp @ 45]

    00d9ff58 62bb4601 eatcpu!myThreadFunction+0x38 [c:\source\eatcpu\eatcpu\eatcpu.cpp @ 35]

    00d9ff94 62bb459c MSVCR80D!_beginthread+0x221

    00d9ffa0 768019f1 MSVCR80D!_beginthread+0x1bc

    00d9ffac 77a2d109 kernel32!BaseThreadInitThunk+0xe

    00d9ffec 00000000 ntdll!_RtlUserThreadStart+0x23 [d:\vistartm\base\ntos\rtl\rtlexec.c @ 2695]

    0:002> uf 004115a8 ß This command will unassemble the function at this address beginning to end.  

     

    0:007> uf 004115a8

    eatcpu!myThreadFunction [c:\source\eatcpu\eatcpu\eatcpu.cpp @ 30]:

       30 00411570 55              push    ebp

       30 00411571 8bec            mov     ebp,esp

       30 00411573 81eccc000000    sub     esp,0CCh

       30 00411579 53              push    ebx

       30 0041157a 56              push    esi

       30 0041157b 57              push    edi

       30 0041157c 8dbd34ffffff    lea     edi,[ebp-0CCh]

       30 00411582 b933000000      mov     ecx,33h

       30 00411587 b8cccccccc      mov     eax,0CCCCCCCCh

       30 0041158c f3ab            rep stos dword ptr es:[edi]

       31 0041158e 8b4508          mov     eax,dword ptr [ebp+8]

       31 00411591 8945f8          mov     dword ptr [ebp-8],eax

     

    eatcpu!myThreadFunction+0x24 [c:\source\eatcpu\eatcpu\eatcpu.cpp @ 33]:

       33 00411594 b801000000      mov     eax,1

       33 00411599 85c0            test    eax,eax

       33 0041159b 7410            je      eatcpu!myThreadFunction+0x3d (004115ad)

     

    eatcpu!myThreadFunction+0x2d [c:\source\eatcpu\eatcpu\eatcpu.cpp @ 35]:

       35 0041159d 8b45f8          mov     eax,dword ptr [ebp-8]

       35 004115a0 8b08            mov     ecx,dword ptr [eax]

       35 004115a2 51              push    ecx

       35 004115a3 e880faffff      call    eatcpu!ILT+35(?trySomethingYAHHZ) (00411028)

       35 004115a8 83c404          add     esp,4

       36 004115ab ebe7            jmp     eatcpu!myThreadFunction+0x24 (00411594)

     

    eatcpu!myThreadFunction+0x3d [c:\source\eatcpu\eatcpu\eatcpu.cpp @ 37]:

       37 004115ad 5f              pop     edi

       37 004115ae 5e              pop     esi

       37 004115af 5b              pop     ebx

       37 004115b0 81c4cc000000    add     esp,0CCh

       37 004115b6 3bec            cmp     ebp,esp

       37 004115b8 e8a1fbffff      call    eatcpu!ILT+345(__RTC_CheckEsp) (0041115e)

       37 004115bd 8be5            mov     esp,ebp

       37 004115bf 5d              pop     ebp

       37 004115c0 c3              ret

     

    0:002> wt -or 00411570 ß We will use WT to Watch Trace this code.  I’ve selected the starting address of the myThreadFunction function.  I’ve also specified –or to print the return value of each function.  Wt produces very visual output.  It allows you to quickly identify patterns in the way the code executes that would be much more difficult just using the T (TRACE) command. 

     

        8     0 [  0] ntdll!RtlSetLastWin32Error eax = 0

    >> No match on ret

        8     0 [  0] ntdll!RtlSetLastWin32Error

        2     0 [  0] eatcpu!checkSomething

        1     0 [  1]   eatcpu!ILT+345(__RTC_CheckEsp)

        2     0 [  1]   eatcpu!_RTC_CheckEsp eax = 0

        9     3 [  0] eatcpu!checkSomething

        1     0 [  1]   eatcpu!ILT+345(__RTC_CheckEsp)

        2     0 [  1]   eatcpu!_RTC_CheckEsp eax = 0

       12     6 [  0] eatcpu!checkSomething eax = 0

    >> No match on ret

       12     6 [  0] eatcpu!checkSomething

        7     0 [  0] eatcpu!trySomething

        1     0 [  1]   eatcpu!ILT+345(__RTC_CheckEsp)

        2     0 [  1]   eatcpu!_RTC_CheckEsp eax = 0

       10     3 [  0] eatcpu!trySomething eax = 0

    >> No match on ret

       10     3 [  0] eatcpu!trySomething

        9     0 [  0] eatcpu!myThreadFunction  ß I see a pattern,  a loop.   Beginning of loop.

        1     0 [  1]   eatcpu!ILT+35(?trySomethingYAHHZ)

       60     0 [  1]   eatcpu!trySomething

        1     0 [  2]     eatcpu!ILT+180(?checkSomethingYAHHZ)

       62     0 [  2]     eatcpu!checkSomething

        5     0 [  3]       kernel32!SetLastError

       16     0 [  3]       ntdll!RtlSetLastWin32Error eax = 0

       64    21 [  2]     eatcpu!checkSomething

        1     0 [  3]       eatcpu!ILT+345(__RTC_CheckEsp)

        2     0 [  3]       eatcpu!_RTC_CheckEsp eax = 0

       71    24 [  2]     eatcpu!checkSomething

        1     0 [  3]       eatcpu!ILT+345(__RTC_CheckEsp)

        2     0 [  3]       eatcpu!_RTC_CheckEsp eax = 0

       74    27 [  2]     eatcpu!checkSomething eax = 0

       67   102 [  1]   eatcpu!trySomething

        1     0 [  2]     eatcpu!ILT+345(__RTC_CheckEsp)

        2     0 [  2]     eatcpu!_RTC_CheckEsp eax = 0

       70   105 [  1]   eatcpu!trySomething eax = 0

       18   176 [  0] eatcpu!myThreadFunction  ß End of loop / beginning of loop

        1     0 [  1]   eatcpu!ILT+35(?trySomethingYAHHZ)

       60     0 [  1]   eatcpu!trySomething

        1     0 [  2]     eatcpu!ILT+180(?checkSomethingYAHHZ)

       62     0 [  2]     eatcpu!checkSomething

        5     0 [  3]       kernel32!SetLastError  ß Always look for what might be going wrong! Last error can give you a clue.  We are setting last error at the low level of the loop

     16     0 [  3]       ntdll!RtlSetLastWin32Error eax = 0

       64    21 [  2]     eatcpu!checkSomething

        1     0 [  3]       eatcpu!ILT+345(__RTC_CheckEsp)

        2     0 [  3]       eatcpu!_RTC_CheckEsp eax = 0

       71    24 [  2]     eatcpu!checkSomething

        1     0 [  3]       eatcpu!ILT+345(__RTC_CheckEsp)

        2     0 [  3]       eatcpu!_RTC_CheckEsp eax = 0

       74    27 [  2]     eatcpu!checkSomething eax = 0  ß Also note checkSomething is returning ZERO, this might indicate a problem.  You need to look at the code or assembler.

       67   102 [  1]   eatcpu!trySomething

        1     0 [  2]     eatcpu!ILT+345(__RTC_CheckEsp)

        2     0 [  2]     eatcpu!_RTC_CheckEsp eax = 0

       70   105 [  1]   eatcpu!trySomething eax = 0

       27   352 [  0] eatcpu!myThreadFunction ß End of loop / beginning of loop

        1     0 [  1]   eatcpu!ILT+35(?trySomethingYAHHZ)

       60     0 [  1]   eatcpu!trySomething

        1     0 [  2]     eatcpu!ILT+180(?checkSomethingYAHHZ)

       62     0 [  2]     eatcpu!checkSomething

        5     0 [  3]       kernel32!SetLastError

       16     0 [  3]       ntdll!RtlSetLastWin32Error eax = 0

       64    21 [  2]     eatcpu!checkSomething

        1     0 [  3]       eatcpu!ILT+345(__RTC_CheckEsp)

        2     0 [  3]       eatcpu!_RTC_CheckEsp eax = 0

       71    24 [  2]     eatcpu!checkSomething

        1     0 [  3]       eatcpu!ILT+345(__RTC_CheckEsp)

        2     0 [  3]       eatcpu!_RTC_CheckEsp eax = 0

       74    27 [  2]     eatcpu!checkSomething eax = 0

       67   102 [  1]   eatcpu!trySomething

        1     0 [  2]     eatcpu!ILT+345(__RTC_CheckEsp)

        2     0 [  2]     eatcpu!_RTC_CheckEsp eax = 0

       70   105 [  1]   eatcpu!trySomething eax = 0

       36   528 [  0] eatcpu!myThreadFunction ß End of loop / beginning of loop

        1     0 [  1]   eatcpu!ILT+35(?trySomethingYAHHZ)

       60     0 [  1]   eatcpu!trySomething

        1     0 [  2]     eatcpu!ILT+180(?checkSomethingYAHHZ)

       62     0 [  2]     eatcpu!checkSomething

        5     0 [  3]       kernel32!SetLastError

       16     0 [  3]       ntdll!RtlSetLastWin32Error eax = 0

       64    21 [  2]     eatcpu!checkSomething

        1     0 [  3]       eatcpu!ILT+345(__RTC_CheckEsp)

        2     0 [  3]       eatcpu!_RTC_CheckEsp eax = 0

       71    24 [  2]     eatcpu!checkSomething

        1     0 [  3]       eatcpu!ILT+345(__RTC_CheckEsp)

        2     0 [  3]       eatcpu!_RTC_CheckEsp eax = 0

       74    27 [  2]     eatcpu!checkSomething eax = 0

       67   102 [  1]   eatcpu!trySomething

    12930 instructions were executed in 12929 events (0 from other threads)

     

    Function Name                               Invocations MinInst MaxInst AvgInst

    eatcpu!ILT+180(?checkSomethingYAHHZ)                 69       1       1       1

    eatcpu!ILT+345(__RTC_CheckEsp)                      210       1       1       1

    eatcpu!ILT+35(?trySomethingYAHHZ)                    70       1       1       1

    eatcpu!_RTC_CheckEsp                                210       2       2       2

    eatcpu!checkSomething                                70      60      74      73

    eatcpu!myThreadFunction                               1     630     630     630

    eatcpu!trySomething                                  71      10      70      68

    kernel32!SetLastError                                70       5       5       5

    ntdll!RtlSetLastWin32Error                           70      16      16      16

     

    0 system calls were executed

     

    eax=cccccccc ebx=00b93c48 ecx=00000002 edx=00b937a8 esi=00000000 edi=00d9fe6c

    eip=0041164c esp=00d9fda8 ebp=00d9fe74 iopl=0         nv up ei pl nz na pe nc

    cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010206

    eatcpu!trySomething+0x1c:

    0041164c f3ab            rep stos dword ptr es:[edi]  es:002b:00d9fe6c=cccccccc

     

    0:002> !gle ß Now that we have broken in let's check and see what the last error is using !GLE (Get Last Error)  This dumps out the last error from the TEB.

    LastErrorValue: (Win32) 0x57 (87) - The parameter is incorrect.

    LastStatusValue: (NTSTATUS) 0xc000000d - An invalid parameter was passed to a service or function.

     

    0:007> bp kernel32!SetLastError  ß Lets set a breakpoint on last error to examine what is going on in the function calling it.

    0:007> g

    Breakpoint 1 hit

    eax=cccccccc ebx=00b93c48 ecx=00000000 edx=00b937a8 esi=00d9fcd0 edi=00d9fd9c

    eip=767913dd esp=00d9fcc8 ebp=00d9fd9c iopl=0         nv up ei pl zr na pe nc

    cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000246

    kernel32!SetLastError:

    767913dd 8bff            mov     edi,edi

    0:002> kv ß Get the call stack

    ChildEBP RetAddr 

    0:002> kv

    ChildEBP RetAddr  Args to Child             

    00d9fcc4 004116cb 00000057 00d9fe74 00000000 kernel32!SetLastError (FPO: [Non-Fpo])  ß 0x57 Invalid parameter error,  Why?

    00d9fd9c 00411657 00000000 00d9ff58 00000000 eatcpu!checkSomething+0x4b (FPO: [Non-Fpo]) (CONV: cdecl) [c:\source\eatcpu\eatcpu\eatcpu.cpp @ 57]

    00d9fe74 004115a8 00000000 00000000 00000000 eatcpu!trySomething+0x27 (FPO: [Non-Fpo]) (CONV: cdecl) [c:\source\eatcpu\eatcpu\eatcpu.cpp @ 45]

    00d9ff58 62bb4601 0017ff34 4f9f12e9 00000000 eatcpu!myThreadFunction+0x38 (FPO: [Non-Fpo]) (CONV: cdecl) [c:\source\eatcpu\eatcpu\eatcpu.cpp @ 35]

    00d9ff94 62bb459c 00b937a8 00d9ffac 768019f1 MSVCR80D!_beginthread+0x221 (FPO: [Non-Fpo])

    00d9ffa0 768019f1 00b937a8 00d9ffec 77a2d109 MSVCR80D!_beginthread+0x1bc (FPO: [Non-Fpo])

    00d9ffac 77a2d109 00b93c48 00d926a6 00000000 kernel32!BaseThreadInitThunk+0xe (FPO: [Non-Fpo])

    00d9ffec 00000000 62bb4520 00b93c48 00000000 ntdll!_RtlUserThreadStart+0x23 (FPO: [Non-Fpo]) (CONV: stdcall) [d:\vistartm\base\ntos\rtl\rtlexec.c @ 2695]

    0:002> !error 00000057 ß double check, using !error, this will decode the error into a human readable string.

    Error code: (Win32) 0x57 (87) - The parameter is incorrect.

     

    0:002> uf checkSomething  ß lets disassemble the function calling SetLastError.

    eatcpu!checkSomething [c:\source\eatcpu\eatcpu\eatcpu.cpp @ 49]:

       49 00411680 55              push    ebp

       49 00411681 8bec            mov     ebp,esp

       49 00411683 81ecc0000000    sub     esp,0C0h

       49 00411689 53              push    ebx

       49 0041168a 56              push    esi

       49 0041168b 57              push    edi

       49 0041168c 8dbd40ffffff    lea     edi,[ebp-0C0h]

       49 00411692 b930000000      mov     ecx,30h

       49 00411697 b8cccccccc      mov     eax,0CCCCCCCCh

       49 0041169c f3ab            rep stos dword ptr es:[edi]

       50 0041169e 837d0800        cmp     dword ptr [ebp+8],0 ß Check what our first parameter is on the stack, EBP+8 remember  PLUS == Parameters.  Note looking at the stack it’s 00000000     

       50 004116a2 741d            je      eatcpu!checkSomething+0x41 (004116c1ß if parameter 1 ones 0 we are going to jump to this addres, else we execute the following code. (WE JUMP) eatcpu!checkSomething+0x24 [c:\source\eatcpu\eatcpu\eatcpu.cpp @ 52]:

       52 004116a4 8bf4            mov     esi,esp    ß The green would have been the good code path, non errant.

       52 004116a6 68fa000000      push    0FAh

       52 004116ab ff15a8814100    call    dword ptr [eatcpu!_imp__Sleep (004181a8)] ß Note we sleep or do some work other then eat CPU here if we are passed non ZERO

       52 004116b1 3bf4            cmp     esi,esp

       52 004116b3 e8a6faffff      call    eatcpu!ILT+345(__RTC_CheckEsp) (0041115e)

       53 004116b8 b801000000      mov     eax,1 ß We are setting EAX to 1, this means we have succeded

       53 004116bd eb15            jmp     eatcpu!checkSomething+0x54 (004116d4ß Now we jump to the clean up code for the fucntionn

     

    eatcpu!checkSomething+0x41 [c:\source\eatcpu\eatcpu\eatcpu.cpp @ 57]:

       57 004116c1 8bf4            mov     esi,esp  ß This appears to be a failure case.  We did not get an expected parameter so we report an error and return zero.

       57 004116c3 6a57            push    57h ß Pusing error 0x57 on the stack, invalid parameter.

       57 004116c5 ff15a4814100    call    dword ptr [eatcpu!_imp__SetLastError (004181a4)] ß Our call to setlasterror

       57 004116cb 3bf4            cmp     esi,esp

       57 004116cd e88cfaffff      call    eatcpu!ILT+345(__RTC_CheckEsp) (0041115e)

       58 004116d2 33c0            xor     eax,eax  ß XORing eax with eax will make EAX Zero.  This is an error condition.

     

    eatcpu!checkSomething+0x54 [c:\source\eatcpu\eatcpu\eatcpu.cpp @ 60]:

       60 004116d4 5f              pop     edi

       60 004116d5 5e              pop     esi

       60 004116d6 5b              pop     ebx

       60 004116d7 81c4c0000000    add     esp,0C0h

       60 004116dd 3bec            cmp     ebp,esp

       60 004116df e87afaffff      call    eatcpu!ILT+345(__RTC_CheckEsp) (0041115e)

       60 004116e4 8be5            mov     esp,ebp

       60 004116e6 5d              pop     ebp

       60 004116e7 c3              ret

     

    The key thing that should be observed by this scenario is that when dealing with a high CPU condition there is often some problem at the lower level of some loop condition that prevents the proper execution of code from happening.  If you’re lucky the problem is reported by some error facility in the OS or the application.  In either case you can use the above technique for isolation.

     

    The following is the sample code for EATCPU.     

     

    // eatcpu.cpp : Defines the entry point for the console application.

    //

     

    #include "stdafx.h"

    #include <windows.h>

    #include <process.h>

     

    void myThreadFunction(void *);

    int doSomething(int);

    int trySomething(int);

    int checkSomething(int);

     

    int _tmain(int argc, _TCHAR* argv[])

    {

          int truevalue = 1;

          int falsevalue = 0;

          // let's start some threads.  These should represent worker threads in a process.  Some of them will do “valid work” one will fail to do so.

          _beginthread(myThreadFunction,12000,(void *)&truevalue);

          _beginthread(myThreadFunction,12000,(void *)&truevalue);

          _beginthread(myThreadFunction,12000,(void *)&truevalue);

          _beginthread(myThreadFunction,12000,(void *)&falsevalue );

          _beginthread(myThreadFunction,12000,(void *)&truevalue);

          _beginthread(myThreadFunction,12000,(void *)&truevalue);

     

          Sleep(10*60000);

          return 0;

    }

     

    void myThreadFunction(void *value)

    {

          int *localvalue = (int *)value;

     

          while(1)

          {

                trySomething(*localvalue);

          }

    }

    int doSomething(int value)

    {

          return trySomething(value);

     

    }

    int trySomething(int value)

    {

          return checkSomething(value);

     

    }

    int checkSomething(int value)

    {

          if(value)

          { // Make sure we have have some valid input parameter. 

            //We will pretend we are doing work, this could be anything , file I/O etc.

                Sleep(250);

                return TRUE;

          }

          else

          { // This is an error condition, this function expects an NON Zero parameter and will report ZERO as an invalid parameter.

     

                SetLastError(ERROR_INVALID_PARAMETER);

                return FALSE;

          }

    }

     

     

     

  • Ntdebugging Blog

    !Search to the Rescue!

    • 6 Comments

    My name is Trey Nash and I am an Escalation Engineer on the Core OS team.  My experience is as a software developer, and therefore, my blog posts tend to be slanted in the direction of helping developers during the feature development, testing, and the support phases.

    Windbg is definitely a feature-rich debugger.  Sometimes, reading the debugger help during idle time can provide some great insight into the capabilities of the debugger.  However, the debugger help comes up short when you ask questions such as, “command huh-huh sure is cool, but when would I ever want to do that?!?”  Besides, if you’re caught reading the windbg help in your spare time, you may be on the receiving end of some ridicule form those in your social circle.

    In this post, I would like to speak a bit about the !search command, among others,  and when you would want to use it.  Additionally, I’ll be demonstrating some related techniques germane to when you would use !search in the first place.

    Not long ago, I was working with a dump from a machine that was hung and it was my job to find out why.  After applying many of the techniques in our hang dump blog post, I discovered that there was a thread in particular that was stuck, which I show below:

    0: kd> !thread fe016330
    THREAD fe016330  Cid 0004.02e0  Teb: 00000000 Win32Thread: 00000000 WAIT: (Unknown) KernelMode Non-Alertable
        fcf1a698  SynchronizationEvent
        fe0163a8  NotificationTimer
    IRP List:
       
    fcb47650
    : (0006,01d8) Flags: 00000404  Mdl: 00000000
    Not impersonating
    DeviceMap                 e18008e8
    Owning Process            fe790648       Image:         System
    Attached Process          N/A            Image:         N/A
    Wait Start TickCount      75337682       Ticks: 145 (0:00:00:02.265)
    Context Switch Count      31848752            
    UserTime                  00:00:00.000
    KernelTime               
    00:25:38.000
    Start Address Treyresearch (0xf45629e0)
    Stack Init f50e1000 Current f50e05e8 Base f50e1000 Limit f50de000 Call 0
    Priority 15 BasePriority 15 PriorityDecrement 0
    ChildEBP RetAddr  Args to Child             
    f50e0600 e103d5b1 fe016330 fe0163d8 00000000 nt!KiSwapContext+0x26 (FPO: [Uses EBP] [0,0,4])
    f50e062c e103df9e fe016330 fd0321f8 00000000 nt!KiSwapThread+0x2e5 (FPO: [0,7,0])
    f50e0674 e101e05b fcf1a698 0000001b 00000000 nt!KeWaitForSingleObject+0x346 (FPO: [5,13,4])
    f50e06b0 e102e00a e3faf6e0 f50e0900 00000000 nt!ExpWaitForResource+0xd5 (FPO: [0,5,4])
    f50e06d0 f5a988cb
    fd0321f8 00000001 f50e08e4 nt!ExAcquireResourceExclusiveLite
    +0x8d (FPO: [2,3,0])
    f50e06e0 f5ad81c4 f50e0900 e3faf6e0 00000001
    Ntfs!NtfsAcquirePagingResourceExclusive
    +0x20 (FPO: [3,0,0])
    f50e08e4 f5ad8909 f50e0900 fcb47650 fdcc3020 Ntfs!NtfsCommonCleanup+0x193 (FPO: [SEH])
    f50e0a54 e1040153 fe00d718
    fcb47650 fcb47650 Ntfs!NtfsFsdCleanup
    +0xcf (FPO: [SEH])
    f50e0a68 f5b4fd28 fddbc818 fe6d1a28 00000000 nt!IofCallDriver+0x45 (FPO: [0,0,4])
    f50e0a94 e1040153 fdcc3020 fcb47650 fcb47650 fltmgr!FltpDispatch+0x152 (FPO: [2,6,0])
    f50e0aa8 f5b4fb25 fdde0cb0 fcb47650 fdd8dc18 nt!IofCallDriver+0x45 (FPO: [0,0,4])
    f50e0acc f5b4fcf5 f50e0aec fdde0cb0 00000000 fltmgr!FltpLegacyProcessingAfterPreCallbacksCompleted+0x20b (FPO: [3,4,4])
    f50e0b04 e1040153 fdde0cb0 fcb47650 fcb47650 fltmgr!FltpDispatch+0x11f (FPO: [2,6,0])
    f50e0b18 e112ec0a fdfd9bd8 fe774730 fdfd9bf0 nt!IofCallDriver+0x45 (FPO: [0,0,4])
    f50e0b48 e112b6af fe790648 fdde0cb0 00010003 nt!IopCloseFile+0x2ae (FPO: [5,7,0])
    f50e0b78 e112b852 fe790648 00000001 fe774730 nt!ObpDecrementHandleCount+0xcc (FPO: [4,2,4])
    f50e0ba0 e112b776 e1802e48 fdfd9bf0 00006e54 nt!ObpCloseHandleTableEntry+0x131 (FPO: [5,1,0])
    f50e0be4 e112b7c1 00006e54 00000000 f50e0c00 nt!ObpCloseHandle+0x82 (FPO: [2,7,4])
    f50e0bf4 e1033bdf 00006e54 f50e0cfc e103b00c
    nt!NtClose
    +0x1b (FPO: [1,0,0])
    f50e0bf4 e103b00c 00006e54 f50e0cfc e103b00c nt!KiFastCallEntry+0xfc (FPO: [0,0] TrapFrame @ f50e0c00)
    f50e0c70 f4562119 00006e54 00030000 00000068 nt!ZwClose+0x11 (FPO: [1,0,0])
    WARNING: Stack unwind information not available. Following frames may be wrong.
    f50e0cfc f456229f f50e0d34 f50e0d2c f4577f50 Treyresearch+0x11119
    f50e0d38 f45626f9 fe016330 fc825368 00000000 Treyresearch+0x1129f
    f50e0d70 f45629ae f1ed8000 00002000 00000000 Treyresearch+0x116f9
    f50e0d90 f4562ba3 fc825318 fde59b38 00000003 Treyresearch+0x119ae
    f50e0dac e1120833 f4577e20 00000000 00000000 Treyresearch+0x11ba3
    f50e0ddc e103fe9f f45629e0 f4577e20 00000000 nt!PspSystemThreadStartup+0x2e (FPO: [SEH])
    00000000 00000000 00000000 00000000 00000000 nt!KiThreadStartup+0x16

    Note:    Eagle-eye readers may have noticed that the debugger states every frame in the above thread uses frame pointer optimization (FPO).  This is a bug in version 6.11.0001.402 of the debugger.

    I have highlighted the interesting bits above.  It seems that this thread is some sort of worker thread, probably created by the Treyresearch driver.  It is doing some work that includes closing a particular file.  In the process of closing the file, NTFS wants to acquire the paging resource for this particular file, and that is where this thread gets stuck.

    What is the paging resource?  Many file systems have a per-file lock that one acquires when performing paging I/O such that other destabilizing activity cannot occur at the same time as a paging operation.  The paging resource for the file is this lock.

    To further illustrate the paging resource, let’s check out the file in question.  One handy things that you can do is follow the stack down to where you see the most recent call to nt!IofCallDriver.  You can see in the MSDN documentation that IoCallDriver accepts two parameters, a DEVICE_OBJECT* and an IRP*.  However, nt!IofCallDriver is a fastcall function, so you cannot find its parameters on the stack.  But since you know that nt!IofCallDriver is calling a driver dispatch routine, and since driver dispatch routines have the same prototype as IoCallDriver, you can easily find the IRP in question which I have highlighted in the Ntfs!NtfsFsdCleanup frame of the  thread’s stack above and dumped out below:

    0: kd> !irp fcb47650
    Irp is active with 10 stacks 10 is current (= 0xfcb47804)
     No Mdl: No System Buffer: Thread fe016330:  Irp stack trace. 
         cmd  flg cl Device   File     Completion-Context
     [  0, 0]   0  0 00000000 00000000 00000000-00000000   

                         Args: 00000000 00000000 00000000 00000000
     [  0, 0]   0  0 00000000 00000000 00000000-00000000   

                         Args: 00000000 00000000 00000000 00000000
     [  0, 0]   0  0 00000000 00000000 00000000-00000000   

                         Args: 00000000 00000000 00000000 00000000
     [  0, 0]   0  0 00000000 00000000 00000000-00000000   

                         Args: 00000000 00000000 00000000 00000000
     [  0, 0]   0  0 00000000 00000000 00000000-00000000   

                         Args: 00000000 00000000 00000000 00000000
     [  0, 0]   0  0 00000000 00000000 00000000-00000000   

                         Args: 00000000 00000000 00000000 00000000
     [  0, 0]   0  0 00000000 00000000 00000000-00000000   

                         Args: 00000000 00000000 00000000 00000000
     [  0, 0]   0  0 00000000 00000000 00000000-00000000   

                         Args: 00000000 00000000 00000000 00000000
     [  0, 0]   0  0 00000000 00000000 00000000-00000000   

                         Args: 00000000 00000000 00000000 00000000
    >[ 12, 0]   0  0 fe00d718
    fdfd9bf0
    00000000-00000000   
                  \FileSystem\Ntfs
                         Args: 00000000 00000000 00000000 00000000

    And from the IRP above, we can find the real file that the thread above is trying to acquire the paging lock for:

    0: kd> !fileobj fdfd9bf0

    \Program Files\Treyresearch\Treyresearch.data

    Device Object: 0xfe6da738   \Driver\Ftdisk
    Vpb: 0xfe791818
    Access: Read Write Delete SharedRead SharedWrite SharedDelete

    Flags:  0x43062
           Synchronous IO
           Sequential Only
           Cache Supported
           Modified
           Size Changed
           Handle Created

    File Object is currently busy and has 0 waiters.

    FsContext:
    0xe3faf7a8      FsContext2: 0xe3faf8f0
    Private Cache Map: 0xfccf1fa0
    CurrentByteOffset: 6400164
    Cache Data:
      Section Object Pointers: fc956f3c
      Shared Cache Map: fccf1ec8         File Offset: 6400164
      Vacb: fe77bd80
      Your data is at: cbe80164

    The file object contains two fields named FsContext and FsContext2 shown above.  These fields are for the file system to store file system specific information.  Most file systems would store the paging resource in these context fields somewhere.  For example, NTFS uses FsContext to hold the stream control block (SCB) and you can surmise that somewhere down in the SCB is where NTFS stores the paging resource. (It’s actually more complicated than that, but that’s good enough for sake of this discussion)

    Now, let’s take a look at the paging resource itself.  You can see from the documentation of ExAcquireResourceExclusiveLite, the first parameter is an ERESOURCE and I have highlighted it in our thread stack above.  Given that, we can use the !locks command to get a better idea of what’s going on:

    0: kd> !locks -v fd0321f8

    Resource @ 0xfd0321f8    Shared 1 owning threads
        Contention Count = 2
        NumberOfSharedWaiters = 1
        NumberOfExclusiveWaiters = 1
         Threads:
    fe77f1e0-07<*>     
    ß This thread is the owner

         THREAD fe77f1e0  Cid 0004.0064  Teb: 00000000 Win32Thread: 00000000 WAIT: (Unknown) KernelMode Non-Alertable
             fc825320  NotificationEvent
         Not impersonating
         DeviceMap                 e18008e8
         Owning Process            fe790648       Image:         System
         Attached Process          fced2d88       Image:         store.exe
         Wait Start TickCount      75281842       Ticks: 55985 (0:00:14:34.765)
         Context Switch Count      4440231            
         UserTime                  00:00:00.000
         KernelTime                00:01:51.171
         Start Address nt!MiMappedPageWriter (0xe101962c)
         Stack Init f6137000 Current f61366ac Base f6137000 Limit f6134000 Call 0
         Priority 17 BasePriority 8 PriorityDecrement 0
         ChildEBP RetAddr 
         f61366c4 e103d5b1 nt!KiSwapContext+0x26 (FPO: [Uses EBP] [0,0,4])
         f61366f0 e103df9e nt!KiSwapThread+0x2e5 (FPO: [0,7,0])
         f6136738 f4562e28 nt!KeWaitForSingleObject+0x346 (FPO: [5,13,4])
    WARNING: Stack unwind information not available. Following frames may be wrong.
         f6136760 f4563229 Treyresearch+0x11e28
         f6136788 f4559d8f Treyresearch+0x12229
         f61367e4 f4570b95 Treyresearch+0x8d8f
         f613684c f4570e39 Treyresearch+0x1fb95
         f6136898 f4570f4b Treyresearch+0x1fe39
         f61368c4 f5b4cb73 Treyresearch+0x1ff4b
         f613692c f5b4efc2 fltmgr!FltpPerformPostCallbacks+0x1c5 (FPO: [1,17,4])
         f6136940 f5b4f4f1 fltmgr!FltpProcessIoCompletion+0x10 (FPO: [1,0,0])
         f6136950 f5b4fb83 fltmgr!FltpPassThroughCompletion+0x89 (FPO: [3,0,4])
         f6136980 f5b4fcf5 fltmgr!FltpLegacyProcessingAfterPreCallbacksCompleted+0x269 (FPO: [3,4,4])
         f61369b8 e1040153 fltmgr!FltpDispatch+0x11f (FPO: [2,6,0])
         f61369cc f452b2f8 nt!IofCallDriver+0x45 (FPO: [0,0,4])
         f6136a18 f452b6d3 exifs!NtSystemWrite+0x1ff (FPO: [Non-Fpo])
         f6136ab0 f452ae1d exifs!IfsInternalWrite+0x1a0 (FPO: [Non-Fpo])
         f6136b24 f4549e02 exifs!MRxIfsWrite+0x333 (FPO: [Non-Fpo])
         f6136b44 f4541a8e exifs!RxLowIoSubmit+0x180 (FPO: [Uses EBP] [2,2,4])
         f6136b54 f45427ed exifs!RxLowIoWriteShell+0x2e (FPO: [1,0,1])
         f6136c64 f452fbe3 exifs!RxCommonWrite+0xcc1 (FPO: [Non-Fpo])
         f6136cf8 f453dffd exifs!RxFsdCommonDispatch+0x2c4 (FPO: [Non-Fpo])
         f6136d24 f452439a exifs!RxFsdDispatch+0x93 (FPO: [Non-Fpo])
         f6136d40 e1040153 exifs!MRxIfsFsdDispatch+0x6c (FPO: [Non-Fpo])
         f6136d54 e101c5b4 nt!IofCallDriver+0x45 (FPO: [0,0,4])
         f6136d68 e101971d nt!IoAsynchronousPageWrite+0xd0 (FPO: [8,0,4])
         f6136dac e1120833
    nt!MiMappedPageWriter
    +0x12e (FPO: [1,4,0])
         f6136ddc e103fe9f nt!PspSystemThreadStartup+0x2e (FPO: [SEH])
         00000000 00000000 nt!KiThreadStartup+0x16

    fe78eb40-01   

         THREAD fe78eb40  Cid 0004.001c  Teb: 00000000 Win32Thread: 00000000 WAIT: (Unknown) KernelMode Non-Alertable
             fcea3890  Semaphore Limit 0x7fffffff
             fe78ebb8  NotificationTimer
         Not impersonating
         DeviceMap                 e18008e8
         Owning Process            fe790648       Image:         System
         Attached Process          N/A            Image:         N/A
         Wait Start TickCount      75337718       Ticks: 109 (0:00:00:01.703)
         Context Switch Count      3480314            
         UserTime                  00:00:00.000
         KernelTime                00:01:35.875
         Start Address nt!ExpWorkerThread (0xe102da4b)
         Stack Init f60ef000 Current f60eebf0 Base f60ef000 Limit f60ec000 Call 0
         Priority 14 BasePriority 13 PriorityDecrement 1
         ChildEBP RetAddr 
         f60eec08 e103d5b1 nt!KiSwapContext+0x26 (FPO: [Uses EBP] [0,0,4])
         f60eec34 e103df9e nt!KiSwapThread+0x2e5 (FPO: [0,7,0])
         f60eec7c e101e05b nt!KeWaitForSingleObject+0x346 (FPO: [5,13,4])
         f60eecb8 e1024ba8 nt!ExpWaitForResource+0xd5 (FPO: [0,5,4])
         f60eecd8 f5a98915 nt!ExAcquireResourceSharedLite+0xf5 (FPO: [2,3,4])
         f60eece8 f5ae198a Ntfs!NtfsAcquirePagingResourceShared+0x20 (FPO: [3,0,0])
         f60eed04 e1044997 Ntfs!NtfsAcquireScbForLazyWrite+0x7a (FPO: [2,0,0])
         f60eed40 e104328e nt!CcWriteBehind+0x27 (FPO: [0,8,4])
         f60eed80 e102db08 nt!CcWorkerThread+0x15a (FPO: [SEH])
         f60eedac e1120833 nt!ExpWorkerThread+0xeb (FPO: [1,5,0])
         f60eeddc e103fe9f nt!PspSystemThreadStartup+0x2e (FPO: [SEH])
         00000000 00000000 nt!KiThreadStartup+0x16

         Threads Waiting On Exclusive Access:

                  fe016330      

    Now this is some juicy output.  I used the –v option to also expand some of the threads related to this lock.  The owner thread is the one with the asterisk (*) next to it and you can see from the expanded thread listing and based on the fact that the function at the bottom of the stack is nt!MiMappedPageWriter, that the thread in question is the mapped page writer.  This thread is a system thread that periodically sweeps through a list of dirty pages flushing them out to disk.  Interestingly, the mapped page writer has acquired the ERESOURCE seven times.  That is shown by the -07 next to the owner thread in the above output.  The second thread is waiting for shared access and it is a system file cache thread.  And finally, the third thread is our initial hung thread that is waiting on exclusive access.

    As a sanity check, let’s make sure that the ERESOURCE and the file in question are related.  Previously, I stated that in the NTFS file system the FsContext field of the file object contains an SCB.  Let’s pass that pointer to !pool and get some more information about it:

    0: kd> !pool 0xe3faf7a8 2
    Pool page e3faf7a8 region is Paged pool
    *e3faf6d8 size:  330 previous size:   20  (Allocated) *Ntff
                  Pooltag Ntff : FCB_DATA, Binary : ntfs.sys

    Now, we can use the search command (s), to search the pool memory above and see if our ERESOURCE is in there.  If so, that would satisfy our sanity check:

    0: kd> s -d e3faf6d8 L 330/4 fd0321f8
    e3faf72c  fd0321f8 0c9013aa 01c9969b 42002f46  .!..........F/.B
    e3faf7b4  fd0321f8 06410000 00000000 06400164  .!....A.....d.@.

    Now that we are satisfied that we have matched up the ERESOURCE with the file that owns it, let’s move on.  At first glance of the mapped writer thread, it looks like the offending entity is exifs.  After all, it’s the most interesting component on the mapped page writer stack.  But don’t be fooled.  What you see in the mapped page writer stack is a snapshot of what it was doing when the dump was taken, and that’s not necessarily the work that caused things to go bad in the first place.  Even though exifs is a file system, it is not NTFS.  And we know an NTFS file’s paging resource is locked.  Keep in mind that the mapped page writer is processing a list.  So the items on the list that have caused the contention may have long been taken off the list and processed.

    So what do you do?  Unfortunately, the badness happened some time ago.  We don’t have a stack to look at to show who did this and when.  But what we can do is perform a search of memory to see if there are any references to the ERESOURCE elsewhere in memory.  If we find some hits, maybe they will shed some more light on what is going on.  So, let’s go ahead and do that:

    0: kd> !search fd0321f8
    Searching PFNs in range 0000000B - 000DFFF9 for [FFFFFFFFFD0321F8 - FFFFFFFFFD0321F8]

    Pfn      Offset   Hit      Va       Pte     
    - - - - - - - - - - - - - - - - - - - - - - - - - - -
    00007AB7 000004E8 FD0321F8 FCAB74E8 C03F2ADC
           fcab74e0+0x8     : Ntfr  -- ERESOURCE
    00007CDF 00000950 FD0321F8 FCCDF950 C03F337C
           fccdf938+0x18    : NpFc  -- CCB, client control block - Process: fdd70248
    0000803A 0000022C FD03A1F8 FD03A22C C03F40E8
           fd03a220+0xc     : Vad   -- Mm virtual address descriptors
    0000977F 0000003C FD0321F8
    FE77F03C
    C03F9DFC
           fe77f000+0x3c    :
    MmWe  -- Work entries for writing out modified filesystem pages.
    0000977F 00000514 FD0321F8 FE77F514
    C03F9DFC
           fe77f4d8+0x3c    :
    MmWe  -- Work entries for writing out modified filesystem pages.
    0000977F 000005BC FD0321F8 FE77F5BC
    C03F9DFC
           fe77f580+0x3c    :
    MmWe  -- Work entries for writing out modified filesystem pages.
    0000977F 00000904 FD0321F8 FE77F904
    C03F9DFC
           fe77f8c8+0x3c    :
    MmWe  -- Work entries for writing out modified filesystem pages.
    0000977F 00000BA4 FD0321F8 FE77FBA4
    C03F9DFC
           fe77fb68+0x3c    :
    MmWe  -- Work entries for writing out modified filesystem pages.
    0000978A 00000324 FD0321F8 FE78A324
    C03F9E28
           fe78a2e8+0x3c    :
    MmWe  -- Work entries for writing out modified filesystem pages.
    0000978A 0000051C FD0321F8 FE78A51C
    C03F9E28
           fe78a4e0+0x3c    :
    MmWe  -- Work entries for writing out modified filesystem pages.
    00060B7B 0000072C FD0321F8 E3FAF72C C038FEBC
           e3faf6d8+0x54    : Ntff  -- FCB_DATA
    00060B7B 000007B4 FD0321F8 E3FAF7B4 C038FEBC
           e3faf6d8+0xdc    : Ntff  -- FCB_DATA
    000D9653 00000638 FD0321F8 F50E0638 C03D4380
    000D9653 00000694 FD0321F8 F50E0694 C03D4380
    000D9653 000006D8 FD0321F8 F50E06D8 C03D4380
    000DE415 0000054C F50321F8 F503254C C03D40C8
    000DFBF4 00000C40 FD0321F8 F60EEC40 C03D83B8
    000DFBF4 00000C9C FD0321F8 F60EEC9C C03D83B8
    000DFBF4 00000CE0 FD0321F8 F60EECE0 C03D83B8
    Search done.

    One thing to note is that there are some hits that are not exactly what we were looking for.  That’s because !search also looks for values that are one bit off from what you requested.  Check out the help for how you can adjust this behavior.  Also, !search performs some extra work along the way.  If it notices that the virtual address found is in the pool, it displays information about that pool entry.

    Do you spot the curiosity?  Remember that the ERESOURCE had been acquired seven times.  Correspondingly, there are seven hits in the !search list with the MmWe tag!  And not surprisingly, the description of that tag pulled from pooltag.txt file in the triage directory where the debugger is installed reveals that these are paging work entries.  Now we’re on to something.

    Note:    Incidentally, if you want to determine where a virtual address in the list above resides, you can always pass it to !address.  The addresses at the end of the !seach output are often addresses on some thread’s stack.  If you pass those addresses to !thread, it is smart enough to find the thread that is associated with that stack and display it for you.

     “OK, but how do I find the real culprit?”, you then say.  Well, you have to continue to dig with what you have.  Unfortunately, the contents of the pool entries with MmWe tags are not documented, although, we definitely know what they are.  Let’s take a look at one of them using !pool:

    0: kd> !pool FE77F03C
    Pool page fe77f03c region is Nonpaged pool
    *
    fe77f000 size:   a8
    previous size:    0  (Allocated) *MmWe
                  Pooltag MmWe : Work entries for writing out modified filesystem pages., Binary : nt!mm

    Now, we see where the pool entry starts and how big the entry is.  So, let’s take a look at the contents of the memory:

    0: kd> dps fe77f000 L a8/@$ptrsize
    fe77f000  0a150000
    fe77f004  65576d4d
    fe77f008  fe78a4e8
    fe77f00c  e10b3af0 nt!MmMappedPageWriterList
    fe77f010  06140000
    fe77f014  00000000
    fe77f018  06150000
    fe77f01c  00000000
    fe77f020  fcc20788
    fe77f024  00000000
    fe77f028  e10b3f20 nt!MmMappedFileHeader
    fe77f02c  00000000
    fe77f030  00000000
    fe77f034  fdfd9bf0
    fe77f038  fdc6f008
    fe77f03c  fd0321f8
    fe77f040  00000000
    fe77f044  00000000
    fe77f048  00000000
    fe77f04c  0002005c
    fe77f050  00000000
    fe77f054  f1d20000
    fe77f058  00000000
    fe77f05c  00010000
    fe77f060  00000000
    fe77f064  0007036c
    fe77f068  000cb59d
    fe77f06c  000a97fe
    fe77f070  0000d7ef
    fe77f074  00021c90
    fe77f078  0005d6b1
    fe77f07c  000a5642
    fe77f080  0004d5c3
    fe77f084  000ae354
    fe77f088  00038249
    fe77f08c  000509ea
    fe77f090  0009c915
    fe77f094  00018dd6
    fe77f098  000b94d7
    fe77f09c  0006ca58
    fe77f0a0  00091e29
    fe77f0a4  00000000

    I used the dps command so that it would check to see if any of the values matched to any known symbols.  As you can see, there are a couple of symbols in there.  Notice that I divided the size of the block by the pseudo register $ptrsize as well.  The symbols that dps found validate that this block of memory is associated with the mapped page writer.

    You may also ask yourself, “are any of these values pointers to other pool blocks?”  If you wanted to know that, you could iterate over each one of them passing them to !pool or !address.  That sounds tedious to do manually.  But thankfully, the debugger has some nice command tokens such as .foreach that make this a breeze.  If you want to pass each of the above values to !pool, you can perform the following in the debugger:

    .foreach /pS 1 /ps 1 ( value { dp /c 1 fe77f000 L a8/@$ptrsize } ) { .if( value != 0 ) {.printf "**** %p ****\n", ${value}; !pool ${value} 0x2} }

    The address highlighted above is the address of the pool block revealed by the previous !pool command.  a8 is the size of the block in bytes and since dp lists the memory in units of pointer size, I divide a8 by $ptrsize.

    I won’t show the full output here, because it’s rather verbose.  But I have duplicated the output from the dps command above with added pool tags next to items that are pool entries:

    0: kd> dps fe77f000 L a8/@$ptrsize
    fe77f000  0a150000
    fe77f004  65576d4d
    fe77f008  fe78a4e8   Pooltag MmWe : Work entries for writing out modified filesystem pages., Binary : nt!mm
    fe77f00c  e10b3af0 nt!MmMappedPageWriterList
    fe77f010  06140000
    fe77f014  00000000
    fe77f018  06150000
    fe77f01c  00000000
    fe77f020 
    fcc20788   Pooltag Irp  : Io, IRP packets
    fe77f024  00000000
    fe77f028  e10b3f20 nt!MmMappedFileHeader
    fe77f02c  00000000
    fe77f030  00000000
    fe77f034 
    fdfd9bf0   Pooltag File : File objects
    fe77f038  fdc6f008   Pooltag MmCa : Mm control areas for mapped files, Binary : nt!mm
    fe77f03c 
    fd0321f8
       Pooltag Ntfr : ERESOURCE, Binary : ntfs.sys
    fe77f040  00000000
    fe77f044  00000000
    fe77f048  00000000
    fe77f04c  0002005c

    <snip>

     

    Sure enough, the file object above is identical to the file we identified earlier on as the one that the worker thread was attempting to close a handle to.  Also, you can see that our ERESOURCE is in there as well.  And even more, now we have an IRP that may reveal even more information.  Let’s see:

    0: kd> !irp fcc20788 1
    Irp is active with 10 stacks 12 is current (= 00000000)
     Mdl=fe77f048: No System Buffer: Thread fd089e6c: 
    Irp is completed
    .  Pending has been returned
    Flags = 00000003
    ThreadListEntry.Flink = fcc20798
    ThreadListEntry.Blink = fcc20798
    IoStatus.Status = 00000000
    IoStatus.Information = 00010000

    RequestorMode = 00000000
    Cancel = 00
    CancelIrql = 0
    ApcEnvironment = 00
    UserIosb = fe77f018
    UserEvent = 00000000
    Overlay.AsynchronousParameters.UserApcRoutine = e101c95b
    Overlay.AsynchronousParameters.UserApcContext = fe77f008
    Overlay.AllocationSize = 00000000 - 00000000
    CancelRoutine = 00000000   
    UserBuffer = 00000000
    &Tail.Overlay.DeviceQueueEntry = fcc207c8
    Tail.Overlay.Thread = fd089e6c
    Tail.Overlay.AuxiliaryBuffer = e101cb86
    Tail.Overlay.ListEntry.Flink = 00000000
    Tail.Overlay.ListEntry.Blink = 00000000
    Tail.Overlay.CurrentStackLocation = 00000000
    Tail.Overlay.OriginalFileObject = 00000000
    Tail.Apc = 00300012
    Tail.CompletionKey = 00300012
         cmd  flg cl Device   File     Completion-Context
     [  0, 0]   0  0 00000000 00000000 00000000-00000000   

                         Args: 00000000 00000000 00000000 00000000
     [  0, 0]   0  0 00000000 00000000 00000000-00000000   

                         Args: 00000000 00000000 00000000 00000000
     [  0, 0]   0  0 00000000 00000000 00000000-00000000   

                         Args: 00000000 00000000 00000000 00000000
     [  0, 0]   0  0 00000000 00000000 00000000-00000000   

                         Args: 00000000 00000000 00000000 00000000
     [  4, 0]   0  0 fe755ab8 00000000 f5ce857e-fe709df8   
                  \Driver\Disk  PartMgr!PmIoCompletion
                         Args: 00000000 00000000 00000000 0000000c
     [  4, 0]   0  0 fe709df8 00000000 f5c16558-fe6da7f0   
                  \Driver\PartMgr     ftdisk!FtpRefCountCompletionRoutine
                         Args: 00000000 00000000 00000000 0000000c
     [  4, 0]   0  0 fe6da738 00000000 f5bc0638-fe6d6a80   
                  \Driver\Ftdisk      volsnap!VspRefCountCompletionRoutine
                         Args: 00000000 00000000 00000000 0000000c
     [  4, 0]   0  0 fe6d69c8 00000000 f5d0af28-00000000   
                  \Driver\VolSnap     vsp
                         Args: 00000000 00000000 00000000 0000000c
     [  4, 0]   0  0 fe6d66c0 00000000 f5a9351c-fde38cd0   
                  \Driver\VSP   Ntfs!NtfsSingleAsyncCompletionRoutine
                         Args: 00000000 00000000 00000000 0000000c
     [  4, 0]   0  0 fe00d718 00000000 00000000-00000000   
                  \FileSystem\Ntfs
                         Args: 00000000 00000000 00000000 00000000

    What is apparent from looking at this IRP is that it is flagged as completed.  Moreover, the information in the IoStatus fields looks to be relevant as well.  But after studying the situation a little deeper, it appeared that the completion routine had never been fired.  We can find out more about the completion routine by dumping out the Tail.Apc portion of the IRP as shown below:

    0: kd> dt nt!_IRP fcc20788 Tail.Apc.
       +0x040 Tail      :
          +0x000 Apc       :
             +0x000 Type      : 0x12 ''
             +0x001 SpareByte0 : 0 ''
             +0x002 Size      : 0x30 '0'
             +0x003 SpareByte1 : 0 ''
             +0x004 SpareLong0 : 0
             +0x008 Thread    :
    0xfe77f1e0
    _KTHREAD
             +0x00c ApcListEntry : _LIST_ENTRY [ 0xfcde8ce4 - 0xfd089e6c ]
             +0x014
    KernelRoutine : 0xe101cb86           void  nt!IopCompletePageWrite+0
             +0x018 RundownRoutine : (null)
             +0x01c NormalRoutine : (null)
             +0x020 NormalContext : (null)
             +0x024 SystemArgument1 : (null)
             +0x028 SystemArgument2 : (null)
             +0x02c ApcStateIndex : 0 ''
             +0x02d ApcMode   : 0 ''
             +0x02e
    Inserted  : 0x1 ''

    Recall from the rules of IRP processing on Windows that IRPS like these have their completion routines called within the thread context that initiated the I/O.  As you can see above, that thread is the same thread that is running nt!MiMappedPageWriter.  Moreover, the Inserted flag is set above, which means that the APC has been placed in the APC queue for the thread.  Deductive reasoning would imply that if the completion routine has not run, then the APC has not been delivered.  And the APC will not be delivered if normal kernel mode APCs are disabled at the moment.  So, let’s check on that by looking in the nt!KTHREAD structure:

    0: kd> dt nt!_KTHREAD 0xfe77f1e0 KernelApcDisable
       +0x070 KernelApcDisable :
    -2

    Sure enough, kernel APCs are disabled for this thread at the moment.  How can that be?  Well, there are several ways to disable normal kernel APC delivery and it often involves either directly or indirectly entering a critical or guarded region.  Critical regions are entered directly via KeEnterCriticalRegion and guarded regions are entered via KeEnterGuardedRegion.  However, there are several means of indirectly entering critical regions including FsRtlEnterFileSystem.    Additionally, holding a mutex object automatically places the holder in a critical region.

    Therefore, the root cause in this case was that the file system drivers appear to have put the thread into a state where it cannot receive APCs and, therefore, I/O initiated on that thread could not be completed.  The APCs build up in the queue so that they can be delivered when kernel APC delivery is re-enabled.  Incidentally, the documentation for FsRtlEnterFileSystem states that file system filter drivers should never disable normal kernel APCs across calls to IoCallDriver.

    Conclusion

    Many times, when it looks like you are hitting up against a brick wall in determining what went wrong in a dump, you can get a lot further than you initially expect with a little bit of intuition and the right tools.  Of course, this intuition will grow as you become more and more familiar with the Windows operating system internals, or whatever platform you work on.  Using !search to search physical memory in the dump file can help find references (a.k.a pointers) to objects in hard to find places.  Additionally, pool tag information along with the helpful text in the pooltag.txt file displayed by !pool goes a long way when it comes to figuring out what a particular pool block is used for.  Armed with all of these tools, you can always get farther than one may initially expect.

    Happy debugging everyone!

    "The example companies, organizations, products, domain names, e-mail addresses, logos, people, places, and events depicted herein are fictitious.  No association with any real company, organization, product, domain name, email address, logo, person, places, or events is intended or should be inferred."

     

    Share this post :
  • Ntdebugging Blog

    Identifying Global Atom Table Leaks

    • 6 Comments

    Hi, it's the Debug Ninja back again with another debugging adventure.  Recently I have encountered several instances where processes fail to initialize, and a review of available resources showed that there was no obvious resource exhaustion.  A more in depth review found that there were no available string atoms in the global atom table.

     

    Global atoms are organized on a per-session basis.  If atoms cannot be allocated in session 0, services may fail to start or processes launched by various services may fail to start.  However, a user logged in to a different session will not experience any such failures.

     

    String atoms are numbered from 0xC000 through 0xFFFF, providing a maximum of 0x4000 atoms per session.  For more information on atoms, and atom tables, see http://technet.microsoft.com/en-us/query/ms649053.

     

    When there are no more string atoms available, calls to APIs that allocate string atoms will fail.  Because atoms are often allocated at process or dll init time, the most common symptom is that processes fail to initialize.  The process may cleanly exit without an error.  You are likely experiencing this problem if you debug your application and find that the failure originates from an API that allocates string atoms such as RegisterClass, RegisterClassEx, GlobalAddAtom, or AddAtom.

     

    To determine if the global string atom table is full you will need to perform a kernel debug.  This can be a live debug or a post-mortem debug using a dump.

     

    First identify the session where the failures have occurred and set the process context to a process in this session.  In my example, w3wp.exe was launching a process and this process failed to initialize.

     

    2: kd> !process 0 0 w3wp.exe

    PROCESS fffffa8005083060

        SessionId: 0  Cid: 1668    Peb: fffdf000  ParentCid: 08ec

        DirBase: 8a2df000  ObjectTable: fffff8a0128bbe40  HandleCount: 441.

        Image: w3wp.exe

    2: kd> .process /p /r fffffa8005083060

    Implicit process is now fffffa80`05083060

    Loading User Symbols

    .....

     

    Next we need to analyze the global atom table.  The pointer to the table is stored in the UserAtomTableHandle global.

     

    2: kd> dq win32k!UserAtomTableHandle l1

    fffff960`003bf7a8  fffff8a0`05e5bc70

     

    The UserAtomTableHandle has a pointer to a handle table at offset 0x10 in 64-bit, and offset 0x8 in 32-bit.  Note that although the atom table is defined as a _RTL_ATOM_TABLE, the format shown by dt is for user mode and does not apply to the UserAtomTableHandle in kernel mode.

     

    2: kd> dq fffff8a0`05e5bc70+10 l1

    fffff8a0`05e5bc80  fffff8a0`05db7740

    2: kd> dt nt!_HANDLE_TABLE fffff8a0`05db7740

       +0x000 TableCode        : 0xfffff8a0`109c8001

       +0x008 QuotaProcess     : (null)

       +0x010 UniqueProcessId  : 0x00000000`00000184 Void

       +0x018 HandleLock       : _EX_PUSH_LOCK

       +0x020 HandleTableList  : _LIST_ENTRY [ 0xfffff8a0`05db7760 - 0xfffff8a0`05db7760 ]

       +0x030 HandleContentionEvent : _EX_PUSH_LOCK

       +0x038 DebugInfo        : (null)

       +0x040 ExtraInfoPages   : 0n0

       +0x044 Flags            : 0

       +0x044 StrictFIFO       : 0y0

       +0x048 FirstFreeHandle  : 0x10004

       +0x050 LastFreeHandleEntry : 0xfffff8a0`10ca4ff0 _HANDLE_TABLE_ENTRY

       +0x058 HandleCount      : 0x3fc0

       +0x05c NextHandleNeedingPool : 0x10400

       +0x060 HandleCountHighWatermark : 0x3fc1

     

    The FirstFreeHandle contains the handle number that will be given to the next handle allocated from this table.  This value is encoded, to get the next handle number we need to right shift the FirstFreeHandle by 2 bits.

     

    2: kd> ?00010004>>2

    Evaluate expression: 16385 = 00000000`00004001

     

    The result from above, 0x4001, is greater than the number of possible string atoms.  As I mentioned earlier, there is a limit of 0x4000 string atoms.  Now we know that the session is out of string atoms.

     

    The next step is to dump the string atoms to identify whether there is an observable pattern in the leaked strings.  The !atom command only works in user mode, so we need to dump the kernel mode strings manually.  An atom table is comprised of multiple buckets.   Each bucket is the head of a list of atoms.  The buckets start at offset 0x20 in the atom table in 64-bit, and offset 0x10 in 32-bit.

     

    2: kd> dq fffff8a0`05e5bc70+20

    fffff8a0`05e5bc90  fffff8a0`05e5ba60 fffff8a0`05db7be0

    fffff8a0`05e5bca0  fffff8a0`08cf1770 fffff8a0`05e5b3d0

    fffff8a0`05e5bcb0  fffff8a0`05ea9020 fffff8a0`05e5b8e0

    fffff8a0`05e5bcc0  fffff8a0`05ea9b10 fffff8a0`05ea9910

    fffff8a0`05e5bcd0  fffff8a0`05ea9f00 fffff8a0`05e5b650

    fffff8a0`05e5bce0  fffff8a0`05cda290 fffff8a0`05ea9e80

    fffff8a0`05e5bcf0  fffff8a0`05e5b200 fffff8a0`05ea9e30

    fffff8a0`05e5bd00  fffff8a0`05e5b7e0 fffff8a0`06c56210

    2: kd> dq

    fffff8a0`05e5bd10  fffff8a0`06d6b5a0 fffff8a0`05ea9d50

    fffff8a0`05e5bd20  fffff8a0`05e5b790 fffff8a0`05e5b9d0

    fffff8a0`05e5bd30  fffff8a0`06bd9bc0 fffff8a0`05ea9c90

    fffff8a0`05e5bd40  fffff8a0`05e5b0c0 fffff8a0`06ae2020

    fffff8a0`05e5bd50  fffff8a0`05e5b930 fffff8a0`04d2af40

    fffff8a0`05e5bd60  fffff8a0`05e5b690 fffff8a0`05e5b980

    fffff8a0`05e5bd70  fffff8a0`05e5b490 fffff8a0`05e5b410

    fffff8a0`05e5bd80  fffff8a0`05e5ba20 fffff8a0`05e5b4f0

    2: kd> dq

    fffff8a0`05e5bd90  fffff8a0`05e5baa0 fffff8a0`05e5b390

    fffff8a0`05e5bda0  fffff8a0`05e5b840 fffff8a0`05ea9c50

    fffff8a0`05e5bdb0  fffff8a0`05e5b250 00000000`00000000

    fffff8a0`05e5bdc0  00000000`00000000 00000000`00000000

    fffff8a0`05e5bdd0  00000000`00000000 00000000`00000000

    fffff8a0`05e5bde0  00000000`00000000 00000000`00000000

    fffff8a0`05e5bdf0  00000000`00000000 00000000`00000000

    fffff8a0`05e5be00  00000000`00000000 00000000`00000000

     

    The quick and dirty way to dump the buckets is with !list.  I am sure that some will say it is tedious to dump each bucket list by hand and that there are easier ways to accomplish this.  To prevent this article from becoming a lesson on debugger scripting, I am leaving that as an exercise to the reader.

     

    2: kd> !list "-t nt!_RTL_ATOM_TABLE_ENTRY.HashLink -e -x \"du @$extret+10\" fffff8a0`05e5ba60"

    du @$extret+10

    fffff8a0`05e5ba70  "Native"

     

    <snip strings that don't match a pattern>

     

    du @$extret+10

    fffff8a0`0838a120  "ControlOfs0210000000000700"

     

    du @$extret+10

    fffff8a0`0f7ff430  "ControlOfs021A000000000C30"

     

    du @$extret+10

    fffff8a0`162168c0  "ControlOfs020E000000001774"

     

    du @$extret+10

    fffff8a0`08c33870  "ControlOfs01F70000000007F4"

     

    du @$extret+10

    fffff8a0`07c46910  "ControlOfs0202000000000BF8"

     

    du @$extret+10

    fffff8a0`062aab50  "ControlOfs01F5000000001274"

     

    du @$extret+10

    fffff8a0`0777b150  "ControlOfs0202000000000C80"

     

    du @$extret+10

    fffff8a0`07dd3410  "ControlOfs0207000000000F00"

     

    du @$extret+10

    fffff8a0`0f01d190  "ControlOfs0214000000000DAC"

     

    Dumping the atoms I found that there is a continuous pattern of the string ControlOfs followed by 16 hexadecimal numbers.  Some time spent with your favorite search engine should find other reports of atom leaks involving the string ControlOfs, and that these leaks have been identified as a problem in some specific software.  In this instance the programmer using that software needs to change their application to avoid the problem.

  • Ntdebugging Blog

    Critical Device Database TIP

    • 6 Comments

    On a fairly regular basis, Bob Golding, our resident GES storage guru, sends out debugging tips to our group. We thought our blog readers would find value in these tips so we’re posting one here. Let us know what you think of the nugget.

    Thanks,

    Ron

    Introduction

    Hi everyone, Bob here.  Today I thought we’d have a little discussion about the Critical Device Database (CDDB) in the registry and an issue that can be caused when a device is not contained there.  This database stores configuration data for new devices that must be installed and started before the Windows components that normally install devices have been started.  The idea behind the critical device database is to allow the installation of devices without using user mode plug-and-play manager.   If a device is determined to be new, this database is queried to see if the information needed to install it are present. 

    What was the issue?

    A customer was getting a Stop 0x7B (Inaccessible_Boot_Device) after they installed a BIOS update.  Further investigation via the debugger using the !devnode command showed the following issue with a few devices:

             DevNode 0x8631f008 for PDO 0x8631f8e0

               InstancePath is "PCI\VEN_15B3&DEV_5A44&SUBSYS_5A4415B3&REV_A1\6&38f4f1f2&0&00080310"

               State = DeviceNodeInitialized (0x302)

               Previous State = DeviceNodeUninitialized (0x301)

               Problem = CM_PROB_NOT_CONFIGURED

    The above device is a bridge, and according to the definition of CM_PROB_NOT_CONFIGURED, it does not have a ConfigFlags registry entry.  I saw that this same problem occurred with a few bridges.  If the bridge cannot be enumerated, devices on the bridge will not be discovered either.  The Instance ID is used to look up the particulars such as driver name and class in the HKLM\System\CurrentControlSet\ENUM key in the registry.  What happened here is the lookup failed and the system thought it was a new device.  Since based on the device class this device was needed for boot, a Stop 0x7B occurred.

    What caused the issue?

    When the BIOS was updated the Instance ID included the version number of the bridge.  The update increased the version number, so the Instance ID was changed.

           DevNode 0x8635ca40 for PDO 0x8634c4a8

              InstancePath is "PCI\VEN_8086&DEV_1A38&SUBSYS_02DD1014&REV_B1\3&11583659&0&40"

              State = DeviceNodeInitialized (0x302)

              Previous State = DeviceNodeUninitialized (0x301)

              Problem = CM_PROB_NOT_CONFIGURED

    Looking at the registry data we could see:

    PCI\VEN_8086&DEV_1A38&SUBSYS_02DD1014&REV_93

    So a revision change caused the issue.

    What stops this from happening?

    Certain devices in the system are required for boot.  The device class determines if the device is in the boot device family.  If so, the hardware ID is written to the CDDB in the registry, so that if the device is determined to be new, it can be found there during boot.

    Below is an example of a bridge entry.  The contents of the pci#ven_8086&dev_244e contain the driver and the class.  This is enough to get the device started for boot.  The user mode PNP manager will complete the installation.

    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\CriticalDeviceDatabase\pci#ven_8086&dev_244e

    How come this bridge was not there?

    When the INF was run, the device class was set to “unknown” class, so the OS did not know to write the information in the CDDB.  

    What was done to correct the problem?

    The BIOS update was temporarily reverted, and then the correct install package was found with the correct INF that has the bridges defined as a system device.  The device was re-installed (pre-update) so the device could be written properly in the CDDB, then the BIOS update was reapplied without a problem.

     

     

  • Ntdebugging Blog

    Understanding Storage Timeouts and Event 129 Errors

    • 6 Comments

    Greetings fellow debuggers, today I will be blogging about Event ID 129 messages.  These warning events are logged to the system event log with the storage adapter (HBA) driver’s name as the source.  Windows’ STORPORT.SYS driver logs this message when it detects that a request has timed out, the HBA driver’s name is used in the error because it is the miniport associated with storport.

     

    Below is an example 129 event:

     

    Event Type:       Warning

    Event Source:     <HBA_Name>

    Event Category:   None

    Event ID:         129

    Date:             4/9/2009

    Time:             1:15:45 AM

    User:             N/A

    Computer:         <Computer_Name>

    Description:

    Reset to device, \Device\RaidPort1, was issued.

     

    So what does this mean?  Let’s discuss the Windows I/O stack architecture to answer this.

     

    Windows I/O uses a layered architecture where device drivers are on a “device stack.”  In a basic model, the top of the stack is the file system.  Next is the volume manager, followed by the disk driver.  At the bottom of the device stack are the port and miniport drivers.  When an I/O request reaches the file system, it takes the block number of the file and translates it to an offset in the volume. The volume manager then translates the volume offset to a block number on the disk and passes the request to the disk driver.  When the request reaches the disk driver it will create a Command Descriptor Block (CDB) that will be sent to the SCSI device.  The disk driver imbeds the CDB into a structure called the SCSI_REQUEST_BLOCK (SRB).  This SRB is sent to the port driver as part of the I/O request packet (IRP).

     

    The port driver does most of the request processing.  There are different port drivers depending on the architecture.  For example, ATAPORT.SYS is the port driver for ATA devices, and STORPORT.SYS is the port driver for SCSI devices.  Some of the responsibilities for a port driver are:

    • Providing timing services for requests.
    • Enforcing queue depth (making sure that a device does not get more requests that it can handle).
    • Building scatter gather arrays for data buffers.

    The port driver interfaces with a driver called the “miniport”.  The miniport driver is designed by the hardware manufacturer to work with a specific adapter and is responsible for taking requests from the port driver and sending them to the target LUN.  The port driver calls the HwStorStartIo() function to send requests to the miniport, and the miniport will send the requests to the HBA so they can be sent over the physical medium (fibre, ethernet, etc) to the LUN.  When the request is complete, the miniport will call StorPortNotification() with the NotificationType parameter value set to RequestComplete, along with a pointer to the completed SRB.

     

    When a request is sent to the miniport, STORPORT will put the request in a pending queue.  When the request is completed, it is removed from this queue.  While requests are in the pending queue they are timed. 

     

    The timing mechanism is simple.  There is one timer per logical unit and it is initialized to -1.  When the first request is sent to the miniport the timer is set to the timeout value in the SRB.  The disk timeout value is a tunable parameter in the registry at: HKLM\System\CurrentControlSet\Services\Disk\TimeOutValue.  Some vendors will tune this value to best match their hardware, we do not advise changing this value without guidance from your storage vendor.

     

    The timer is decremented once per second.  When a request completes, the timer is refreshed with the timeout value of the head request in the pending queue.  So, as long as requests complete the timer will never go to zero.  If the timer does go to zero, it means the device has stopped responding.  That is when the STORPORT driver logs the Event ID 129 error.  STORPORT then has to take corrective action by trying to reset the unit.  When the unit is reset, all outstanding requests are completed with an error and they are retried.  When the pending queue empties, the timer is set to -1 which is its initial value.

    image002

    Each SRB has a timer value set.  As requests are completed the queue timer is refreshed with the timeout value of the SRB at the head of the list.

     

    The most common causes of the Event ID 129 errors are unresponsive LUNs or a dropped request.  Dropped requests can be caused by faulty routers or other hardware problems on the SAN.

     

    I have never seen software cause an Event ID 129 error.  If you are seeing Event ID 129 errors in your event logs, then you should start investigating the storage and fibre network.

  • Ntdebugging Blog

    Bcdedit Tips and Tricks For Debugging Part 1

    • 6 Comments

    Hello everyone, my name is Sean Walker, and I am on the Platforms OEM team in Washington.  This article is for those people who have had a hard time switching from the old boot.ini configuration to the new BCD store (myself included). Doing the simple tasks such as enabling kernel debugging over com1 are easy to do with bcdedit.exe or the msconfig GUI, you just enable them and reboot the computer. However, if you need to do something more advanced such as break into the early boot process during resume from hibernation, things get a lot more complicated.

     

    This article has some samples for enabling and disabling debug settings that you may not be familiar with, and a list of bcdedit debug settings for Windows Vista/Server 2008 and Windows 7/Server 2008 R2.  This information has been helpful to me for quickly and accurately getting to the debug at hand rather than fumbling around with bcdedit.  Much of the following information has been taken from various sources, including the windbg help files, the OEM team blog, the MSDN bcdedit reference, and the WHDC debugger site.

     

    NOTE: For the examples below, you will need to run bcdedit.exe from an administrator (UAC-elevated) command prompt.  To output a summary view of the current state of the BCD store, just run "bcdedit.exe" from the command prompt.  To get detailed information about all of the store(s) that Windows knows about, use the following command:

    bcdedit /enum all

     

    What is a BCD store?

    A BCD store is a binary file that contains boot configuration data for Windows, basically it is a small registry file.  Boot applications use the system BCD store, located on the system partition, during the boot process.  You can also create additional BCD stores in separate files but only one store at a time can be designated as the system store.

     

    NOTE: The "/store" switch can be used to specify a particular BCD store for bcdedit commands (instead of the default store).  To enumerate all the settings in another BCD store, in this case e:\bcd_store\BCD, use the following command:

    bcdedit /store e:\bcd_store\BCD /enum all

     

    This will show you which options are currently set, and what their values are.  When /store switch is omitted, the system store is used.

     

    Using bootdebug

    To enable debugging for early boot problems, you may need to enable the bootdebug switch.  This is easy to do with bcdedit:

    bcdedit /set bootdebug on

     

    However, this only sets bootdebug for the current "boot application", which is generally winload.exe, so it does not break into the very early boot process.  There are multiple applications used for booting, hibernating, and resuming (bootmgr.exe, winload.exe and winresume.exe are examples of these).  Each application (called BCD Objects) has its own settings (called BCD Elements) in the BCD store and each can be modified globally and/or individually.

     

    So, to deal with different (or multiple) debug scenarios, you just enable boot debugging based on the boot application you are concerned with.  For early debugging, you can enable bootdebug for bootmgr:

    bcdedit /set {bootmgr} bootdebug on

     

    To set bootdebug for winload.exe (which will most often be your current, and default, boot object) all three of the following will give you the same result:

    bcdedit /set bootdebug on

    bcdedit /set {current} bootdebug on

    bcdedit /set {default} bootdebug on

     

    If you are modifying the settings in another store, or are booted into another OS on the same computer (such as WinPE), you need to specify the location of the BCD store:

    bcdedit /store d:\Boot\BCD /set {default} bootdebug on

     

    Not all of the boot objects have "friendly" names, so you may need to specify the full GUID (Globally Unique ID) to modify it.  As an example, if you wanted to enable bootdebug on resume from hibernation, you would include the identifier (see figure 1) for the "Resume from Hibernate" object:

    bcdedit /set {89a932d0-d5bc-11e0-a0af-00215add5ebc} bootdebug on

     

    image001

    Figure 1: Color coded bcdedit output

     

    Why won't my USB or 1394 debug work?

    When there are multiple debug ports of a certain type in a computer Windows may not default to the correct one for your situation.  This happens most commonly when there are either multiple 1394 host controllers or USB EHCI controllers.  When this occurs it can range from a slight inconvenience (different port is used so the cable needs to be plugged into another port), to complete failure (internal port is used, which is not accessible).  In the case of USB debugging the Intel USB 2.0 specification only provides one debug port, so debugging is not possible if the wrong host controller is used.

     

    There are several caveats with USB debugging, not the least of which is that you need to buy a separate, expensive, debug cable.  Some of the difficulties and implementation details necessary to get USB debugging to work are encompassed in the WHDC USB FAQ and in Setting Up Kernel Debugging with USB 2.0.

     

    NOTE: A correction to the WHDC USB documentation for Windows 7/Windows 2008 R2 is that the busparams switch now takes decimal rather than hexadecimal values, and the "loadoptions" parameter is no longer required.  So, to enable the busparams element (for USB or 1394 debugging) in Vista/2008, you would use something like this:

    bcdedit /set {current} loadoptions busparams=0.1D.7

     

    And the Win7/2008 R2 example would be:

    bcdedit /set {current} busparams 0.29.7

     

    In the case of loadoptions or busparams, deleting the setting is not as easy as changing a flag from yes to no. You must specifically delete the value to get rid of it, and one of the examples below can be used:

     

    For Vista/2008:

    bcdedit /deletevalue {current} loadoptions

     

    And Windows 7/2008 R2:

    bcdedit /deletevalue {current} busparams

     

    Bcdedit settings and examples

    This is just scratching the surface of using bcdedit for your troubleshooting and/or debugging needs, so there are more articles to follow. Part 2 will include some more detailed debugging scenarios, such as Hyper-V guest and host debugging.  Below is a consolidated table with many of the debugging switches/settings as well as a number of different usage examples.

     

    Table of debug-related bcdedit settings

    Option

    Description

    bootdebug

    Enables or disables the boot debugger for a specified boot entry. Although this command works for any boot entry, it is effective only for boot applications.

    Enable value(s): on, 1

    Disable value(s): off, 0

    Bcdedit /set bootdebug on

    debug

    Enables or disables the kernel debugger for a specified boot entry.

    Enable value(s): on, 1

    Disable value(s): off, 0

    /dbgsettings

    Used to modify the global settings for the debug connection (does not include hypervisor).  Values:

    Can change all settings at once instead of using the /set command to change them individually. Usage example:

    bcdedit /dbgsettings 1394 channel:30

    debugport

    Used to specify the debugger type.

    Values:

    Serial port – com1, com2, comx

    1394 port – 1394

    USB port - USB

    channel

    Specifies 1394 channel used.

    Values:

    Decimal integer between 0 and 62, inclusive.

    baudrate

    Used to specify the baud rate of a serial debug port.

    Values: 9600, 19200, 38400, 57600, 115200

    targetname

    Specifies a string to use as the identification for the USB 2.0 connection. This string can be any value.

    Usage example:

    bcdedit /dbgsettings usb targetname:usbdebug

    /hypervisorsettings

    Used the same way as /dbgsettings to configure all settings at once.

    Usage example:

    bcdedit /hypervisorsettings 1394 channel:10

    hypervisordebug

    Enables or disables hypervisor debug mode. This is for debugging a Hyper-V host system.

    Enable value(s): on, 1

    Disable value(s): off, 0

    Usage example:

    bcdedit /set {current} hypervisordebug on

    /noumex

    Specifies that the kernel debugger ignores user-mode exceptions. By default, the kernel debugger breaks for certain user-mode exceptions, such as STATUS_BREAKPOINT and STATUS_SINGLE_STEP. The /noumex parameter is effective only when there is no user-mode debugger attached to the process.

    /start

    This option specifies the debugger start policy. If a start policy is not specified, ACTIVE is the default.

    Values: active, disable, autoenable

    loadoptions

    Used to describe settings that are not covered by other types. One setting that is relevant here is busparams.

    Values: Any value followed by the setting.

    Usage example (Vista/2008):

    bcdedit /set {current} loadoptions busparams=0.1d.0

    busparams

    A boot setting (specified with loadoptions key word) used to point to the PCI address of the debugger in use. The PCI bus, device, and function are used, in the format bb.dd.ff. This is generally used to identify the location of a 1394 or USB debug port. In Vista/2008, hexadecimal values are used, whereas decimal values are used for Win7.

    Values: Decimal values between 0 and 255.

    Usage example:

    In Win7 - bcdedit /set busparams 0.29.0

    In Vista - bcdedit /set loadoptions busparams=0.1d.0

    kernel

    The loadoptions parameter used to point to a different kernel binary. This can be used to test with a checked or instrumented version of the kernel without replacing the existing one. The updated binary MUST be placed in the %windir%\system32 folder to be used

    Values: The 8.3 filename of the replacement kernel include the exe extension.

    Usage examples:

    In Win7 – bcdedit /set kernel kernchk.exe

    In Vista - bcdedit /set loadoptions kernel=kernchk.exe

    hal

    The loadoptions parameter used to point to a different hal binary. This can be used to test with a checked or instrumented version of the kernel without replacing the existing one. The updated binary MUST be placed in the %windir%\system32 folder to be used

    Values: the 8.3 filename of the replacement kernel include the .dll extension.

    Usage examples:

    In Win7 – bcdedit /set hal halchk.dll

    In Vista - bcdedit /set loadoptions hal=halchk.dll

    testsigning

    Controls whether Windows 7, Windows Server 2008, or Windows Vista will load any type of test-signed kernel-mode code. This option is not set by default, which means test-signed kernel-mode drivers on 64-bit versions of Windows 7, Windows Server 2008, and Windows Vista will not load without setting the testsigning switch

    Enable value(s): on, 1

    Disable value(s): off, 0

    Usage example:

    Bcdedit /set testsigning on

     

Page 5 of 24 (238 items) «34567»