• Ntdebugging Blog

    How Windows Starts Up (part 1 of 4)

    • 11 Comments

    Hi folks, my name is David and I’m an Escalation Engineer for Microsoft.  Since Bryan wrote about How Windows Shuts Down, I thought it would be a good idea to cover How Windows Starts Up.

     

    This information applies specifically to Windows 2000, Windows XP, and Windows Server 2003.  I will blog separately on the boot changes in Windows Vista.

     

    Additional information about this topic may be found in Chapter 5 of Microsoft Windows Internals by Russinovich and Solomon and also on TechNet:  http://technet.microsoft.com/en-us/library/bb457123.aspx

     

    Methodology

    The key to understanding and troubleshooting system startup issues is to accurately ascertain the point of failure.  To facilitate this determination, I have divided the boot process into the following four phases.

     

                    1. Initial

                    2. Boot Loader

                    3. Kernel

                    4. Logon

     

    Over the next few weeks, I’ll be describing each of these phases in detail and providing appropriate guidance for troubleshooting relevant issues for each phase. 

     

    Initial Phase

    The Initial Phase of boot is divided into the Power-On Self Test (POST) and Initial Disk Access. 

     

    Power-On Self Test

    POST activities are fully implemented by the computer’s BIOS and vary by manufacturer.  Please refer to the technical documentation provided with your hardware for details.  However, regardless of manufacturer, certain generic actions are performed to ensure stable voltage, check RAM, enable interrupts for system usage, initialize the video adapter, scan for peripheral cards and perform a memory test if necessary.  Depending on manufacturer and configuration, a single beep usually indicates a successful POST. 

     

    Troubleshooting the POST

    ü  Make sure you have the latest BIOS and firmware updates for the hardware installed in the system.

    ü  Replace the CMOS battery if it has failed.

    ü  Investigate recently added hardware (RAM, Video cards, SCSI adapters, etc.)

    ü  Remove recently added RAM modules.

    ü  Remove all adapter cards, then replace individually, ensuring they are properly seated.

    ü  Move adapter cards to other slots on the motherboard.

    ü  If the computer still will not complete POST, contact your manufacturer.

     

    Initial Disk Access

    Depending on the boot device order specified in your computer’s BIOS, your computer may attempt to boot from a CD-ROM, Network card, Floppy disk, USB device or a hard disk.  For the purposes of this document, we’ll assume that we’re booting to a hard disk since that is the most common scenario.

     

    Before we discuss the sequence of events that occur during this phase of startup, we need to understand a little bit about the layout of the boot disk.  The structure of the hard disk can be visualized this way:  (Obviously, these data areas are not to scale)

     

     

     

    Hard disks are divided into Cylinders, Heads and Sectors.  A sector is the smallest physical storage unit on a disk and is almost always 512 bytes in size.  For more information about the physical structure of a hard disk, please refer to the following Resource Kit chapter:  http://www.microsoft.com/resources/documentation/windowsnt/4/server/reskit/en-us/resguide/diskover.mspx

     

    There are two disk sectors critical to starting the computer that we’ll be discussing in detail:  

     

    ·         Master Boot Record (MBR)

    ·         Boot Sector

     

    The MBR is always located at Sector 1 of Cylinder 0, Head 0 of each physical disk.   The Boot Sector resides at Sector 1 of each partition.  These sectors contain both executable code and the data required to run the code.

     

    Please note that there is some ambiguity involved in sector numbering.  Cylinder/Head/Sector (CHS) notation begins numbering at C0/H0/S1.  However, Absolute Sector numbering begins numbering at zero.  Absolute Sector numbering is often used in disk editing utilities such as DskProbe.  These differences are discussed in the following knowledge base article:

    Q97819  Ambiguous References to Sector One and Sector Zero

    http://support.microsoft.com/default.aspx?scid=kb;EN-US;Q97819

     

    Now that we have that straight, when does this information get written to disk?  The MBR is created when the disk is partitioned.  The Boot Sector is created when you format a volume.  The MBR contains a small amount of executable code called the Master Boot Code, the Disk Signature and the partition table for the disk.  At the end of the MBR is a 2-byte structure called a Signature Word or End of Sector marker, which should always be set to 0x55AA.  A Signature Word also marks the end of an Extended Boot Record (EBR) and the Boot Sector.

     

    The Disk Signature, a unique number at offset 0x1B8, identifies the disk to the operating system.  Windows 2000 and higher operating systems use the disk signature as an index to store and retrieve information about the disk in the registry subkey HKLM\System\MountedDevices.

     

    The Partition Table is a 64-byte data structure within the MBR used to identify the type and location of partitions on a hard disk.  Each partition table entry is 16 bytes long (four entries max).  Each entry starts at a predetermined offset from the beginning of the sector as follows:

     

                                    Partition 1                           0x1BE    (446)

                                    Partition 2                           0x1CE    (462)

                                    Partition 3                           0x1DE    (478)

                                    Partition 4                           0x1EE    (494)

     

    The following is a partial example of a sample MBR showing three partition table entries in-use and one empty:

     

    000001B0:                                             80 01

    000001C0: 01 00 07 FE BF 09 3F 00 - 00 00 4B F5 7F 00 00 00

    000001D0: 81 0A 07 FE FF FF 8A F5 – 7F 00 3D 26 9C 00 00 00

    000001E0: C1 FF 05 FE FF FF C7 1B – 1C 01 D6 96 92 00 00 00

    000001F0: 00 00 00 00 00 00 00 00 – 00 00 00 00 00 00

     

    Let’s take a look at each of the fields of a partition table entry individually.  For each of these explanations, I’ll use the first partition table entry above and highlight the relevant section.  Keep in mind that these values are little-endian.

     

    Byte Offset

    Field Length

    Sample Value

    Field Description

    0x1BE

    8 Bits

    0x80

    Boot Indicator

                    00=Do Not Use for Booting

                    80=Active partition (Use for Booting)

    000001B0:                                             80 01

    000001C0: 01 00 07 FE BF 09 3F 00 - 00 00 4B F5 7F 00     

     

    0x1BF

    8 Bits

    0x01

    Starting Head

    000001B0:                                             80 01

    000001C0: 01 00 07 FE BF 09 3F 00 - 00 00 4B F5 7F 00

     

    0x1C0

    0x1C1

    2 Byte

    Word

    0x01

    Starting Sector

    Only the first 6 bits are used.  The upper 2 bits of this byte are used by the Starting Cylinder field.

    0x00

    Starting Cylinder

    Uses 1 byte + 2 bits from the Starting Sector field to make up the cylinder value.  The Starting Cylinder is a 10-bit number with a maximum value of 1023.

    000001B0:                                             80 01

    000001C0: 01 00 07 FE BF 09 3F 00 - 00 00 4B F5 7F 00

     

     

     

    0x1C2

    8 Bits

    0x07

    System ID

    Defines the volume type.  0x07=NTFS

    000001B0:                                             80 01

    000001C0: 01 00 07 FE BF 09 3F 00 - 00 00 4B F5 7F 00

     

    Other Possible System ID Values:  

    Partition Type

    ID Value

    0x01

    FAT12 primary partition or logical drive (fewer than 32,680 sectors in the volume)

    0x04

    FAT16 partition or logical drive (32,680–65,535 sectors or 16 MB–33 MB)

    0x05

    Extended partition

    0x06

    BIGDOS FAT16 partition or logical drive (33 MB–4 GB)

    0x07

    Installable File System (NTFS partition or logical drive)

    0x0B

    FAT32 partition or logical drive

    0x0C

    FAT32 partition or logical drive using BIOS INT 13h extensions

    0x0E

    BIGDOS FAT16 partition or logical drive using BIOS INT 13h extensions

    0x0F

    Extended partition using BIOS INT 13h extensions

    0x12

    EISA partition

    0x42

    Dynamic disk volume

    0x86

    Legacy FT FAT16 disk *

    0x87

    Legacy FT NTFS disk *

    0x8B

    Legacy FT volume formatted with FAT32 *

    0x8C

    Legacy FT volume using BIOS INT 13h extensions formatted with FAT32 *

     

    Partition types denoted with an asterisk (*) indicate that they are also used to designate non-FT configurations such as striped and spanned volumes.

     

    0x1C3

    8 Bits

    0xFE

    Ending Head      (0xFE=254 decimal)

    000001B0:                                             80 01

    000001C0: 01 00 07 FE BF 09 3F 00 - 00 00 4B F5 7F 00

     

    0x1C4

    0x1C5

    2 Byte

    Word

    0xBF

    Ending Sector

    As with the Starting Sector, it only uses the first 6 bits of the byte.  The upper 2 bits are used by the Ending Cylinder field.

    0x09

    Ending Cylinder

    Uses 1 byte in addition to the upper 2 bits from the Ending Sector field to make up the cylinder value.  The Ending Cylinder is a 10-bit number with a maximum value of 1023.

    000001B0:                                             80 01

    000001C0: 01 00 07 FE BF 09 3F 00 - 00 00 4B F5 7F 00

     

    0x1C6

    32 Bits

    0x3F000000

    Relative Sectors

    The offset from the beginning of the disk to the beginning of the volume, counting by sectors.

    0x0000003F = 63

    000001B0:                                             80 01

    000001C0: 01 00 07 FE BF 09 3F 00 - 00 00 4B F5 7F 00

     

     

    0x1CA

    32 Bits

    0X4BF57F00

    Total Sectors

    The total number of sectors in the volume.

    0x007FF54B = 8,385,867 Sectors = 4GB

    000001B0:                                             80 01

    000001C0: 01 00 07 FE BF 09 3F 00 - 00 00 4B F5 7F 00

     

    Are you with me so far?  Good!  Now, Cylinder/Sector encoding can be a bit tricky, so let’s take a closer look. 

     

    Cylinder - Sector Encoding

    16

    15

    14

    13

    12

    11

    10

    9

    8

    7

    6

    5

    4

    3

    2

    1

    Cylinder bits 1-8

    Cyl bits

    9 & 10

    Sector value

    Bits 1-6

     

    As you can see, the Sector value occupies the lower 6 bits of the word and the Cylinder occupies the upper 10 bits of the word.  In our example, the starting values for Cylinder and Sector are 01 00.  Values in the MBR use reverse-byte ordering, which is also referred to as ‘little-endian’ notation.  Therefore, we swap the bytes and find that our starting values are Cyl 0, Sec 1.

     

    Our ending values are more interesting:  BF 09.  First, we swap the bytes and obtain a hex value of 0x09BF.  This value in binary notation is 100110111111.  The following table illustrates how we derive the correct partition table values from these bytes:

     

    Example:  BF 09

    8

    7

    6

    5

    4

    3

    2

    1

    10

    9

    6

    5

    4

    3

    2

    1

    0

    0

    0

    0

    1

    0

    0

    1

    1

    0

    1

    1

    1

    1

    1

    1

    10 Cylinder value bits 1-8

    Cyl bits

    9 & 10

    Sector value

    bits 1-6

     

    The 6 low bits are all set to 1, therefore our Sector value is 111111 or 63.   You can see above how the bits are arranged for the Cylinder value.  The value above is 1000001001 (521).  Since both Cylinder and Head values begin numbering at zero, we have a total of 522 Cylinders and 255 Heads represented here.  This gives us an ending CHS value of:  522 x 255 x 63 = 8,385,930 sectors. 

     

    Subtracting the starting CHS address (Cylinder 0, Head 1, Sector 1) (63) gives us the total size of this partition:  8,385,867 sectors or 4GB.  We can verify this number by comparing it to the Total Sectors represented in the partition table:  4B F5 7F 00.  Applying reverse-byte ordering gives us 00 7F F5 4B which equals 8,385,867 sectors.

     

    So, now that we have an understanding of what is contained within the structures on the disk, let’s look at the sequence of events that occur.  Remember, this is just after POST has successfully completed.

     

    1.       The motherboard ROM BIOS attempts to access the first boot device specified in the BIOS.  (This is typically user configurable and can be edited using the BIOS configuration utility.)

     

    2.       The ROM BIOS reads Cylinder 0, Head 0, and Sector 1 of the first boot device.

     

    3.       The ROM BIOS loads that sector into memory and tests it.

    a.       For a floppy drive, the first sector is a FAT Boot Sector.

    b.      For a hard drive, the first sector is the Master Boot Record (MBR).

     

    4.       When booting to the hard drive, the ROM BIOS looks at the last two signature bytes of Sector 1 and verifies they are equal to 55AA.

    a.       If the signature bytes do not equal 55AA the system assumes that the MBR is corrupt or the hard drive has never been partitioned.  This invokes BIOS Interrupt 18, which displays an error that is BIOS-vendor specific such as “Operating System not found”.

    b.      If the BIOS finds that the last two bytes are 55AA, the MBR program executes.

     

    5.       The MBR searches the partition table for a boot indicator byte 0x80 indicating an active partition.

    a.       If no active partition is found, BIOS Interrupt 18 is invoked and a BIOS error is displayed such as “Operating System not found”.

    b.      If any boot indicator in the MBR has a value other than 0x80 or 0x00, or if more than one boot indicator indicates an active partition (0x80), the system stops and displays “Invalid partition table”.

    c.       If an active partition is found, that partition’s Boot Sector is loaded and tested.

     

    6.       The MBR loads the active partition’s Boot Sector and tests it for 55AA.

    a.       If the Boot Sector cannot be read after five retries, the system halts and displays “Error loading operating system”.

    b.      If the Boot Sector can be read but is missing its 55AA marker, “Missing operating system” is displayed and the system halts.

    c.       The bootstrap area is made up of a total of 16 sectors (0-15).  If sector 0 for the bootstrap code is valid, but there is corruption in sectors 1-15, you may get a black screen with no error.  In this case, it may be possible to transplant sectors 1-15 (not Sector 0) from another NTFS partition using DskProbe.

    d.      If the Boot Sector is loaded and it passes the 55AA test, the MBR passes control over to the active partition’s Boot Sector.

     

    7.       The active partition’s Boot Sector begins executing and looks for NTLDR.  The contents of the Boot Sector are entirely dependent on the format of the partition.   For example, if the boot partition is a FAT partition, Windows writes code to the Boot Sector that understands the FAT file system.  However, if the partition is NTFS, Windows writes NTFS-capable code.  The role of the Boot Sector code is to give Windows just enough information about the structure and format of a logical drive to enable it to read the NTLDR file from the root directory.   The errors at this point of the boot process are file system specific.  The following are possible errors with FAT and NTFS Boot Sectors.

    a.       FAT:  In all cases it displays “press any key to restart” after either of the following errors.

    (1)    If NTLDR is not found “NTLDR is missing” is displayed

    (2)    If NTLDR is on a bad sector, “Disk Error” displays.

    b.      NTFS:  In all cases it displays “Press CTRL+ALT+DEL to restart” after any of the following errors.

    (1)    If NTLDR is on a bad sector, “A disk read error occurred” is displayed.  This message may also be displayed on Windows 2000 or higher systems if Extended Int13 calls are required, but have been disabled in the CMOS or SCSI BIOS.  This behavior may also be seen if an FRS (File Record Segment) cannot be loaded or if any NTFS Metadata information is corrupt.

    (2)    If NTLDR is not found, “NTLDR is missing” displays.

    (3)    If NTLDR is compressed, “NTLDR is compressed” displays.

     

    8.       Once NTLDR is found, it is loaded into memory and executed.  At this point the Boot Loader phase begins.

     

    Troubleshooting the Initial Phase (Initial Disk Access)

    If you are unable to boot Windows and the failure is occurring very early in the boot process, begin by creating a Windows boot floppy and using it to attempt to boot the system.

     

                    Q119467 How to Create a Bootable Disk for an NTFS or FAT Partition

                    http://support.microsoft.com/default.aspx?scid=kb;EN-US;Q119467

     

    If you are unable to boot to Windows NT with a floppy, skip to the section titled “Unable to boot using floppy”.

     

    If you can successfully start the computer from the boot disk, the problem is limited to the Master Boot Record, the Boot Sector, or one or more of the following files:  NTLDR, NTDetect.com, or Boot.ini.  After Windows is running, you should immediately back up all data before you attempt to fix the Boot Sector or Master Boot Record.

     

    1.       Run a current virus scanning program to verify that no virus is present.

    2.       Select the method of troubleshooting based upon the error generated.

    ·         Operating System not found

    Typical BIOS-specific error indicating no 55AA signature on the MBR.  This must be repaired manually using a disk editor such as DskProbe.

    Warning:  Running FixMBR or FDISK /MBR in this case will clear the partition table.

    ·         Invalid Partition Table

    May be repaired using DiskProbe – check for multiple 0x80 boot indicators.

    ·         Error Loading Operating System

    This error indicates an invalid Boot Sector.  Use the “Repairing the Boot Sector” section below.

    ·         Missing Operating System

    This error indicates that the 55AA signature is missing from the Boot Sector.  This may be replaced using DskProbe or by using the “Repairing the Boot Sector” section below.

    ·         NTLDR is missing

    Use the “Repairing the NTLDR file” section below.

    ·         Disk error (FAT only)

    Indicates that NTLDR is located on a bad sector.

    ·         A disk read error occurred (NTFS only)

    Indicates that NTLDR is on a bad sector.  Use the “Repairing the NTLDR file” section below.

    ·         NTLDR is compressed

    Indicates that NTLDR is compressed.  Uncompress or replace NTLDR.  Note:  This error is rare.

    ·         Computer boots to a black screen with blinking cursor

    Could be an issue with the MBR or the active partition’s boot code (Q228734).  Use the appropriate section below.

     

    Repairing the MBR

    Use the FIXMBR command from the Windows Recovery Console to re-write the boot code located in the first 446 bytes of the MBR, but to leave the partition table intact.

     

    326215  How To Use the Recovery Console on a Windows Server 2003-Based Computer That Does Not Start

    http://support.microsoft.com/default.aspx?scid=kb;EN-US;326215

     

    WARNING:  DO NOT use this method if you are receiving “Operating System not found” or other BIOS-specific error.  FIXMBR will zero out the partition table if the 55AA signature is missing from the MBR.  The data will be unrecoverable.

     

    Repairing the Boot Sector

    Use the FIXBOOT command from the Windows Recovery Console to write a new Boot Sector to the system partition.

     

    326215  How To Use the Recovery Console on a Windows Server 2003-Based Computer That Does Not Start

    http://support.microsoft.com/default.aspx?scid=kb;EN-US;326215

     

    If you are not able to repair the Boot Sector by using this method, you may repair it manually using the following knowledge base article:

     

    Q153973 Recovering NTFS Boot Sector on NTFS Partitions

    http://support.microsoft.com/default.aspx?scid=kb;EN-US;Q153973

     

    Repairing the NTLDR file

    Use the Recovery Console to copy NTLDR from \i386 directory of the Windows CD-ROM to the root of the hard drive.

     

    Unable to boot using floppy

    If you are unable to start the computer with a boot floppy and you are not receiving an error message, then the problem is most likely with your partition tables.  If invalid partitions are present and you are unable to start your computer with a boot disk, formatting the disk, reinstalling Windows and restoring from backup may be your only option.  It may be possible to recreate the partition table using DskProbe if a backup of the partition information is available.  It also may be possible to manually reconstruct the partition table however this is outside the scope of this blog post.  We may cover this later.

     

    If you do not have a current backup of the data on the computer, as a last resort a data recovery service may be able to recover some of your information. 

     

    Helpful knowledge base articles:

                    Q272395 Error Message: Boot Record Signature AA55 Not Found

                    http://support.microsoft.com/default.aspx?scid=kb;EN-US;Q272395

     

                    Q155892 Windows NT Boot Problem: Kernel File Is Missing From the Disk

                    http://support.microsoft.com/default.aspx?scid=kb;EN-US;Q155892

     

                    Q228004 Changing Active Partition Can Make Your System Unbootable

                    http://support.microsoft.com/default.aspx?scid=kb;EN-US;Q228004

     

                    Q155053 Black Screen on Boot

                    http://support.microsoft.com/default.aspx?scid=kb;EN-US;Q155053

     

                    Q314503 Computer Hangs with a Black Screen When You Start Windows

                    http://support.microsoft.com/default.aspx?scid=kb;EN-US;Q314503

     

                    Q153973 Recovering NTFS Boot Sector on NTFS Partitions

                    http://support.microsoft.com/default.aspx?scid=kb;EN-US;Q153973

     

    Okay boys & girls, that’s it for now.  Next time, I’ll continue with the boot sequence and cover the Boot Loader phase.  This is where things really get interesting… J

     

  • Ntdebugging Blog

    This button doesn’t do anything!

    • 8 Comments

     

    Hello - Matthew here again.  Today I'll be discussing in detail hang scenario #1 that Tate first mentioned a few blogs posts ago.  From a debugging perspective, in an ideal world an application would always provide some kind of feedback when a failure occurs.  The reality is that sometimes an application just doesn't do anything in response to a user-initiated action (such as a mouse-click) that results in a failure of some kind.  When this happens, how can you determine what is going wrong?

     

    In general, Process Monitor from Sysinternals is a great tool for gathering clues about this kind of failure.  If the failure is due to a file system or registry problem, Process Explorer will catch the problematic event.  Or if you suspect that the problem is due to something failing over the network, a network trace could help.

     

    Let's assume that the techniques described above do not move us any closer to understanding the failure.  What next?  If you don't know what the app does in the background, and you don't have source code for the app, how do you proceed?  Let's answer those questions by debugging a sample application...

     

    The Nothing button

     

    Download the sample application here.

     

     

    Here's what we know:

    1. Clicking the “Button 1” button is supposed to display a useful dialog box

    2. For certain users, clicking “Button 1” results in no observable response

    3. You don't have access to the source code or debug symbols for the app

    4. No one seems to know what the “Button 1” button does behind the scenes

    5. The developer of the application hasn't been heard from in 3 years

     

    We need to understand what exactly happens when the user clicks the “Button 1” button.  Every window has a WindowProc function that receives messages upon user input.  Since a button is considered a "control", the WindowProc for the main application window will get a WM_COMMAND message indicating that the button was clicked.  To differentiate which specific control sent the message, each WM_COMMAND message has a control identifier associated with it.

     

    So we need to:

     

    a)     Find the control identifier for “Button 1”

    b)    Find the WindowProc for the main application window

    c)     Determine where WM_COMMAND messages for the “Button 1” button are handled

    d)    Figure out what is failing in that code

     

    Let’s get started...

     

     

     

    Find the control identifier for “Button 1”      

     

    Spy++ (available as part of Visual Studio) is the tool for this job.  Click Spy -> Log Messages.  Use the Finder Tool to select the main window of ntdbghang1.exe.  On the messages tab, click “Clear All” then select “WM_COMMAND”.  Click “OK” and SPY++ will start logging.  Click “Button 1” in ntdbghang1.exe and you should see one line of text appear in the output window.  For good measure, click “Button 2” also, and you should see a second line of text.  At this point, your SPY++ window should look something like this....

     

    Spy Windows Msg

     

    We can tell from this output, that the control identifier for “Button 1” is 257 (0x101), and the control identifier for “Button 2” is 258 (0x102).  We’ll need this information later.

     

     

     

    Find the WindowProc for the main application window

     

    We can find the WindowProc address by using SPY++ again.  Click Spy -> Find Window and use the Finder Tool to select the main window of ntdbghang1.exe.  Make sure “Show properties” is selected, and then click “OK”.  You will get a dialog that looks like this....

     

    Spy Window Properties

     

    Note the value of Window Proc.  This is the address of the window procedure for the main window (the exact value may differ on your system).   We know that this function is a WindowProc, which means it is of the form...

     

    LRESULT CALLBACK WindowProc(         

        HWND hwnd,

        UINT uMsg,

        WPARAM wParam,

        LPARAM lParam

    );

     

    The meaning of wParam and lParam differ depending on what uMsg is being processed.  When a WM_COMMAND message is passed, the low-word of wParam is the control identifier, which we determined to be 0x101.  The high-word of wParam is the control notification code, which in the case of a button being clicked is BN_CLICKED (literally 0).

     

    So, we specifically are interested in the case where:

               

                uMsg = WM_COMMAND (literally 0x111)

                wParam = 0x101

     

    While looking at assembly language for the WindowProc, the stack frame will look like this...

     

    ebp    =  “old ebp”

    ebp+4  =  “return address”

    ebp+8  =  hwnd

    ebp+c  =  uMsg

    ebp+10 =  wParam

    ebp+14 =  lParam

     

     

     

    Determine where WM_COMMAND messages for the “Button 1” button are handled

     

    Once you have the address of the WindowProc, we can examine the assembly code for that function using Windbg.  Launch windbg.exe and select “File -> Attach to a Process”.  Select “ntdbghang1.exe” from the list and click “OK”.  When the debugger breaks in, you can unassemble the beginning of the WindowProc by typing “u <address>”.  Based on the address on my system, that command will be “u 01002830”.  To unassemble more just type “u” again.  I’m now going to unassemble the relevant code and explain what it means along the way...

     

     

    0:001> u 01002830

    ntdbghang1+0x2830:

     

    The first 3 instructions are a function prologue, just getting everything set up.

    01002830 8bff            mov     edi,edi

    01002832 55              push    ebp

    01002833 8bec            mov     ebp,esp

     

    Here is the code that moves the value of uMsg into ecx

    01002835 8b4d0c          mov     ecx,dword ptr [ebp+0Ch]

     

    Now the app enters the assembly equivalent of a switch statement.

    We are interested in the check for uMsg = WM_COMMAND

     

    if uMsg = WM_CREATE (1) goto 01002893

    01002838 49              dec     ecx

    01002839 7458            je      ntdbghang1+0x2893 (01002893)

    if uMsg = WM_ DESTROY (2) goto 01002889

    0100283b 49              dec     ecx

    0100283c 744b            je      ntdbghang1+0x2889 (01002889)

    if uMsg = WM_CLOSE (0x10) goto 01002889

    0100283e 83e90e          sub     ecx,0Eh

    01002841 743b            je      ntdbghang1+0x287e (0100287e)

    if uMsg = WM_COMMAND (0x111) goto 01002853

    01002843 b801010000      mov     eax,101h

    01002848 2bc8            sub     ecx,eax

    0100284a 7407            je      ntdbghang1+0x2853 (01002853)

     

    From the above switch statement we see that when uMsg = WM_COMMAND, execution moves to 01002853.  So let’s go there...

     

    0:001> u 01002853

    ntdbghang1+0x2853:

     

    Move the value of wParam into edx

    01002853 8b5510          mov     edx,dword ptr [ebp+10h]

    If LOWORD(wParam) == 0x101 (control ID for button 1) goto 0100286f

    01002856 0fb7ca          movzx   ecx,dx

    01002859 2bc8            sub     ecx,eax

    0100285b 7412            je      ntdbghang1+0x286f (0100286f)

     

    From the above assembly we can see that right away we are checking for the control ID for button 1.  This is the code path we want to follow....

     

    0:001> u 0100286f

    ntdbghang1+0x286f:

     

    If HIWORD(wparam) != BN_CLICKED (0) then goto 0100289b

    Otherwise, call the function at 010027f6

    0100286f c1ea10          shr     edx,10h

    01002872 6685d2          test    dx,dx

    01002875 7524            jne     ntdbghang1+0x289b (0100289b)

    01002877 e87affffff      call    ntdbghang1+0x27f6 (010027f6)

     

    Now it is clear that when the BN_CLICKED occurs for control id 0x101, we are going to execute the function at address 010027f6. 

     

     

     

    Figure out what is failing

     

    So now we know what code is going to run every time the button is clicked (the function at 010027f6).  Let’s examine that code and figure out what may be failing.  We’ll use “uf” to unassemble the entire function, as it is rather short.  During this first pass we’ll make up names for our local variables and functions that are called, and then we’ll revisit and figure out what they are.

     

     

    0:001> uf 010027f6

    ntdbghang1+0x27f6:

     

    Function Prologue

    010027f6 8bff            mov     edi,edi

    010027f8 55              push    ebp

    010027f9 8bec            mov     ebp,esp

     

    010027fb 51              push    ecx

    010027fc 56              push    esi

     

    call func1(0x20) [func1 address at 01002dca]

    set localvar1=0x10

    010027fd 6a20            push    20h

    010027ff c745fc10000000  mov     dword ptr [ebp-4],10h

    01002806 e8bf050000      call    ntdbghang1+0x2dca (01002dca)

     

    Store the result of func1 in esi (call it localvar2)

    0100280b 8bf0            mov     esi,eax

     

    Call func2(localvar2, &localvar1) [func2 address at 0100b484]

    0100280d 59              pop     ecx

    0100280e 8d45fc          lea     eax,[ebp-4]

    01002811 50              push    eax

    01002812 56              push    esi

    01002813 e86c8c0000      call    ntdbghang1+0xb484 (0100b484)

     

    If func2 returns 0, then goto 01002821

    01002818 85c0            test    eax,eax

    0100281a 7405            je      ntdbghang1+0x2821 (01002821)

     

    Call func3() [func3 address at 0100278d]

    ntdbghang1+0x281c:

    0100281c e86cffffff      call    ntdbghang1+0x278d (0100278d)

     

    Call func4(localvar2) [func4 address at 01002ce3]

    ntdbghang1+0x2821:

    01002821 56              push    esi

    01002822 e8bc040000      call    ntdbghang1+0x2ce3 (01002ce3)

     

    Clean up and exit the function

    01002827 59              pop     ecx

    01002828 5e              pop     esi

    01002829 c9              leave

    0100282a c3              ret

     

     

    So let’s translate this to some pseudo-code...

     

    localvar1 = 0x10;

    localvar2 = func1(0x20);

     

    if(func2(localvar2, &localvar1)

    {

           func3();

     

    func4(localvar2);

     

    So the behavior of the function will differ, depending on the results of func2.  Let’s see if we can figure out what func2 does...

     

    0:001> u 0100b484

    ntdbghang1+0xb484:

    0100b484 ff2568110001    jmp     dword ptr [ntdbghang1+0x1168 (01001168)]

    0100b48a cc              int     3

    0100b48b cc              int     3

    0100b48c cc              int     3

     

    0:001> dps 01001168 L1

    01001168  70b88cb1 WINSPOOL!GetDefaultPrinterW

     

     

    GetDefaultPrinterW is a public API function.  Here’s the function prototype from MSDN...

     

    BOOL GetDefaultPrinter(

      LPTSTR pszBuffer,   // printer name buffer

      LPDWORD pcchBuffer  // size of name buffer

    );

     

    This fits our pseudo-code, since the function takes two parameters and returns a BOOL.  Let’s update our pseudo code to match what we know, replacing func2 with GetDefaultPrinter, etc...

     

    DWORD cchBuffer = 0x10;

    LPWSTR pszBuffer = func1(0x20);

     

    if(GetDefaultPrinterW(pszBuffer,  &cchBuffer))

    {

           func3();

    }

     

    func4(pszBuffer);

     

    Based on the above, we can make a couple of assumptions.  Func1 is likely an allocator function of some kind (such as malloc) and func4 is likely a memory free function (such as free). Since the app is using the Unicode version of GetDefaultPrinter (it ends with a W) it make sense that func1 is an allocator asking for 0x20 bytes, and the size of the buffer, in characters, passed to GetDefaultPrinterW is 0x10.  So that accounts for all of the functions, except for func3.  We could unassemble func3 to see what it does, but that may not be needed.  From the pseudo-code we’ve already created, we can see that a failure in GetDefaultPrinterW will prevent the button click function from doing anything except allocating and freeing some memory.  So GetDefaultPrinterW is a likely potential point of failure.

     

    Let’s test this theory out.  We’ll set a breakpoint on GetDefaultPrinterW and see if it is failing upon the button click.

     

    0:001> bp WINSPOOL!GetDefaultPrinterW

     

    0:001> g

     

    <Now click Button 1>

     

    Breakpoint 0 hit

    eax=0006fb84 ebx=00000000 ecx=00000020 edx=00dc0e98 esi=00dc0e70 edi=0006fc0c

    eip=70b88cb1 esp=0006fb74 ebp=0006fb88 iopl=0         nv up ei pl nz na po nc

    cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000202

    WINSPOOL!GetDefaultPrinterW:

    70b88cb1 8bff            mov     edi,edi

     

    // Dump out the call stack...

    0:000> kb

    ChildEBP RetAddr  Args to Child             

    0006fb70 01002818 00dc0e70 0006fb84 00000111 WINSPOOL!GetDefaultPrinterW

    WARNING: Stack unwind information not available. Following frames may be wrong.

    0006fb88 0100287c 0006fbbc 75d41a10 00320f78 ntdbghang1+0x2818

    0006fb90 75d41a10 00320f78 00000111 00000101 ntdbghang1+0x287c

    0006fbbc 75d41ae8 01002830 00320f78 00000111 USER32!InternalCallWinProc+0x23

    0006fc34 75d4286a 00000000 01002830 00320f78 USER32!UserCallWinProcCheckWow+0x14b

    0006fc74 75d42bba 00a90b80 0095ee08 00000101 USER32!SendMessageWorker+0x4b7

     

    // Go to the return address.  Note that eax=0, meaning GetDefaultPrinterW returned FALSE.

    0:000> gu

    eax=00000000 ebx=00000000 ecx=76f22033 edx=00e10178 esi=00dc0e70 edi=0006fc0c

    eip=01002818 esp=0006fb80 ebp=0006fb88 iopl=0         nv up ei pl zr na pe nc

    cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000246

    ntdbghang1+0x2818:

    01002818 85c0            test    eax,eax

     

    // Check the last error...

    0:000> !gle

    LastErrorValue: (Win32) 0x7a (122) - The data area passed to a system call is too small.

     

           // Resume execution

    0:000> g

     

    The error returned was 0x7a = ERROR_INSUFFICIENT_BUFFER.  It looks like the buffer of 0x20 bytes being passed to GetDefaultPrinterW is too small to hold the name of the default printer for that user.  This explains why the app works for some users and fails for others.  A quick change of the default printer name to something less than 0x10 characters (16 decimal) will work around this issue.

     

    To wrap things up, here is the C source for the Button1_OnClick function (at address 010027f6 in the assembly) so you can see how this was actually written...

     

    VOID Button1_OnClick()

    {

          DWORD cch = 16;

          LPTSTR pPrinterName;

     

          pPrinterName = (LPTSTR) malloc(16 * sizeof(TCHAR));

     

          if(GetDefaultPrinter(pPrinterName, &cch))

          {

                DisplayGoButtonMessage();

          }

     

          free(pPrinterName);

     

          return;

    }

     

    Look for an upcoming blog post that shows another way this problem could have been approached – using the “wt” command in the debugger.  I hope you found this useful and can apply to real-world debugging scenarios.  Please post any questions / comments!

  • Ntdebugging Blog

    How Windows Shuts Down

    • 7 Comments

    Hi my name is Bryan,  I'm a escalation engineer on the Microsoft CPR platforms team.   A common problem scenario involves shutting down Windows. When troubleshooting problems during shut down we have to look at the Winlogon.exe process which can be tricky and must be done correctly.

     

    Troubleshooting Techniques

     

    Common Settings

     

    Settings that would affect how Windows shuts down are mostly in HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Winlogon, and are documented thoroughly at http://www.microsoft.com/technet/prodtechnol/windows2000serv/reskit/regentry/12314.mspx. The setting to really pay attention is:

     

    DisableCAD (REG_DWORD). When set to 1 the GINA will bypass the Ctrl-Alt-Del dialog and go straight to the logon prompt. The system cannot be shutdown when the logon prompt at the console is displayed.

     

    Another common setting that is sometimes needed for security reasons is in HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management.

     

    ClearPageFileAtShutdown (REG_DWORD). When set to 1 the memory manager will clear the paging file on shutdown. If the paging file is large this can significantly the time it takes for the system to shut down.

     

    Winlogon Event Notification

     

    Prior to Windows Vista/2008 Server, Winlogon Notification Packages can be used to reliably monitor the state of Winlogon. These are simple export DLLs that are registered with entry points for each event monitored by the package. Winlogon explicitly calls these entry points when a specific event occurs. Each package handles the notification within the context of the Winlogon process. Keep in mind that a bug in a notification package can cause problems in Winlogon and render the system inaccessible. You can find more information about Winlogon Notification Packages at http://msdn2.microsoft.com/en-us/library/aa380545.aspx.

     

    Starting with Windows Vista support for Winlogon Notification Packages has been pulled. Most of this functionality is still provided through SCM event notifications. See the following link for more information about this.

     

    http://technet2.microsoft.com/WindowsVista/en/library/6ec4ec6d-6b84-44c9-b3af-116589a42b861033.mspx?mfr=true

     

    You can find information about how to write a service here.

     

    http://msdn2.microsoft.com/en-us/library/ms685969.aspx

     

    However these notifications are performed asynchronously so the exact moment of each state within Winlogon cannot be reliably known.

     

    Winlogon Logging

     

    Winlogon logging is obtained by applying the checked build of Winlogon and setting the following registry values under HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Winlogon.

     

    DebugFlags (REG_SZ). This is a comma-separated list of debug flags to capture. The list of flags is Error, Warning, Trace, Init, Timeout, Sas, State, MPR, CoolSwitch, Profile, DebugLsa, DebugSpm, DebugMpr, DebugGo, Migrate, DebugServices, Setup, SC, Notify, and Job.

     

    LogFile (REG_SZ). This is the path to the log file where the events should be written. If this value is missing then the events are written to the debug port.

     

    These options can also be placed in the win.ini file. This option and other debug logging options for Winlogon can be found in these articles.

    http://support.microsoft.com/default.aspx?scid=kb;EN-US;232575

    http://support.microsoft.com/default.aspx?scid=kb;EN-US;221833

     

    Checked builds of Windows binaries are available through MSDN subscriptions. The checked binary applied to the system must match the service pack level of the system.

     

    When looking at the Winlogon logs there are three check points to identify. They are the three ExitWindowsEx calls made during shutdown. These log entries were made during a shutdown initiated with “shutdown –r –f –t 0”.

     

    328.372> Winlogon-Trace: Doing remote-initiated (Administrator) Reboot=true, Force=true

    328.372> Winlogon-Trace: Starting user thread for Logoff, flags = 3807

    328.764> Winlogon-Trace: Doing immediate shutdown, LastGinaRet = b, Flags = 3804

    328.764> Winlogon-Trace: Calling ExitWindowsEx(0x3804, 0)

    .

    .

    .

    328.1528> Winlogon-Trace: ExitWindowsEx called to shut down COM processes

    .

    .

    .

    328.332> Winlogon-Trace: Starting shutdown

    328.332> Winlogon-Trace: Starting user thread for Logoff, flags = 7

    .

    .

    .

    328.284> Winlogon-Trace: Calling ExitWindowsEx(0x7, 0)

     

    If shutdown stalls then based on the log entries you should be able to identify which part of shutdown it is stuck in.

     

    Debugging

     

    In order to debug Winlogon you will need to first install the debugging tools for Windows (http://www.microsoft.com/whdc/devtools/debugging/default.mspx). It contains information about how to set up the debugger for Winlogon debugging. If the debugger is not set up correctly on Winlogon the system can easily get in an unusable state and the system will need to be reinstalled.

     

    You should debug Winlogon whenever a problem manifests itself on the Winlogon desktop where the SAS and logon prompt dialogs are displayed. When debugging Winlogon you should start by looking at the main process thread. If shutdown (or any other Winlogon activity) is hung that thread will show what happened to prevent shutdown from completing. Also look at these Winlogon flags which provide some indication of whether or not the main thread is even processing a shutdown.

     

    0105fe8c winlogon!g_fReadyForShutdown

    0105fdf8 winlogon!ShutdownHasBegun

    01062b3c winlogon!ShutdownInProgress

    01062b30 winlogon!ShutdownTime

     

    I will point out how these variables are modified.

     

    Shutdown Sequence of Events

     

    If you are debugging a shutdown issue you first need to understand the sequence of events that take place during shutdown. This will help isolate what activity may be failing.

     

    RPC Call

     

    An application, local or remote, tells Windows to shut down by calling the Win32 API InitiateSystemShutdownEx (http://msdn2.microsoft.com/en-us/library/aa376874.aspx). This API creates an RPC call over the named pipe InitShutdown. The remote connect and RPC call looks like this in a network trace.

     

    11:08:40.025                 CLIENT         SERVER        SMB              SMB: C; Nt Create Andx, FileName = \InitShutdown

    11:08:40.027                 CLIENT         SERVER        MSRPC        MSRPC: c/o Request: unknown Call=0x1 Opnum=0x2 Context=0x0 Hint=0x20

     

    A local connect would not need to go over the network, but it uses the same mechanism to make this call.

     

    The server side of this RPC call is handled by the function BaseInitiateShutdownEx in the Winlogon.exe process. This RPC worker performs the following actions.

     

    1.       Checks the privilege level of the client. If the client privilege test fails then the error status code is returned. This will typically be ERROR_ACCESS_DENIED (0x5).

    2.       Parses the shutdown command sent by the RPC client. If the command is malformed then the status error code of ERROR_INVALID_PARAMETER (0x57) is returned.

    3.       Checks the Winlogon flags ShutdownInProgress, ShutdownHasBegun, g_fReadyForShutdown, and the current state of the terminal desktop to see if we can shutdown. Winlogon cannot begin the shutdown if a shutdown is in progress or if it is not ready. Winlogon also will not start a shutdown if the force flag was not specified in the command and the desktop is locked. If the ShutdownInProgress or ShutdownHasBegun flag is set then the return value is ERROR_SHUTDOWN_IN_PROGRESS (0x45B). If the g_fReadyForShutdown is not set then the return value is ERROR_NOT_READY (0x15).

     

    Debugger output:

    dd winlogon!g_fReadyForShutdown l 1

    0105fe8c  00000001

     

    dd winlogon!ShutdownInProgress l 1

    01062b3c  00000000

     

    dd winlogon!ShutdownHasBegun l 1

    0105fdf8  00000000

     

    4.       Winlogon initializes the shutdown data including the global variable ShutdownTime. If this variable has been set then we know we’ve gotten this far.

     

    Debugger Output:

    dq winlogon!ShutdownTime l 1

    01062b30  01c7a859`baee0060

     

    .formats 01c7a859`baee0060

    Evaluate expression:

      Time:    Wed Jun  6 12:42:54.506 2007 (GMT-4)

     

    5.       If Winlogon has determined that it can proceed then the ShutdownInProgress flag is set. Subsequent shutdown requests from this point will fail in step 3 when it checks this flag.

    6.       Generates the shutdown audit event if auditing was enabled. Note that shutdown may fail at some point after this even though the audit log is generated.

    7.       Creates another thread to continue working on the shutdown. At this point the RPC worker thread returns to the caller.

     

    To this point there has been no visible indication on the server being shutdown. If the RPC worker fails for some reason then the client application will get an indication of the failure. If the RPC worker successfully hands the shutdown request to the next thread then the client application will get ERROR_SUCCESS (0) as the return code. The client application will get no indication of a failure after that point.

     

    Worker Thread

     

    The RPC worker thread hands control to the LogoffThreadProc thread. If the delay specified in the shutdown command was 0 then the ShutdownInProgress flag is cleared and the ShutdownHasBegun is set. In the Winlogon log you will see a line that starts with “Doing immediate shutdown”. Otherwise you will not see any visible indication at this point that a shutdown is occurring.

     

    If the delay specified in the shutdown command was greater than 0 then the countdown shutdown dialog is displayed.

     

    Shutdown

     

    When that dialog completes the AbortShutdown flag is checked. This flag would get set as a result of a call to AbortSystemShutdown (http://msdn2.microsoft.com/en-us/library/aa376630.aspx). If it is set then the shutdown is aborted. Otherwise the ShutdownInProgress flag is cleared, the ShutdownHasBegun flag is set and if the shutdown was initiated by a system process then we shut the system down immediately at this point with no further clean up.

     

    After these flags are updated the ExitWindowsInProgress flag is set and we call the Win32 API ExitWindowsEx (http://msdn2.microsoft.com/en-us/library/ms893047.aspx). In the Winlogon log you will see a line that starts with “Calling ExitWindowsEx”.

     

    Debugger Output:

    dd winlogon!ExitWindowsInProgress l 1

    0105fd84  00000001

     

    The Win32 API ExitWindowsEx makes an RPC call to CSRSS.EXE. CSRSS synchronously sends a WM_QUERYENDSESSION message to all Windows applications. When an application gets this message it indicates that shutdown can continue and CSRSS then sends the WM_ENDSESSION message. After that the process is terminated. If the application indicates that it cannot be terminated then CSRSS stops processing any further applications and waits for the interactive user to close the application. The ExitWindowsEx call will fail with error ERROR_OPERATION_ABORTED (0x3E3) and the Winlogon flags are reset so that a new shutdown request can be processed.

     

    An application that prevents shutdown from proceeding in this manner can be seen visual since it will be the foreground window on the desktop. To confirm which application returned did this you will need to live debug CSRSS. A return code of 3 from either winsrv!ConsoleClientShutdown or winsrv!UserClientShutdown will indicate the application did this.

     

    Debugger Output:

    0:002> pc

    eax=00000000 ebx=7c81a3ab ecx=7ffdb000 edx=75a58ca0 esi=75a58ca0 edi=00164600

    eip=75a564de esp=0052fe40 ebp=0052fe68 iopl=0         nv up ei pl zr na po nc

    cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000246

    CSRSRV!CsrShutdownProcesses+7e:

    75a564de ff5740 call dword ptr [edi+0x40]{winsrv!UserClientShutdown (75a9db1f)} ds:0023:00164640=75a9db1f

     

    ; Step past the call.

    0:002> p

    eax=00000003 ebx=7c81a3ab ecx=7ffdb000 edx=75a58ca0 esi=75a58ca0 edi=00164600

    eip=75a564e1 esp=0052fe4c ebp=0052fe68 iopl=0         nv up ei pl zr na po nc

    cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000246

    CSRSRV!CsrShutdownProcesses+81:

    75a564e1 8bf8             mov     edi,eax

     

    ; The first parameter is a structure that contains the process ID.

    75a564d5 ff75f4           push    dword ptr [ebp-0xc]

    75a564d8 ff750c           push    dword ptr [ebp+0xc]

    75a564db ff75f8           push    dword ptr [ebp-0x8]

    75a564de ff5740           call    dword ptr [edi+0x40]

     

    ; Get the pointer to this structure.

    0:002> dd ebp-8 l 1

    dd ebp-8 l 1

    0052fe60  0018a530

     

    ; The first DWORD is the process ID. The second DWORD is the thread ID.

    0:002> dd 0018a530 l 1

    dd 0018a530

    0018a530  0000066c

     

    ; Break into kernel mode so we can look at all the processes.

    0:002> .breakin

    .breakin

    Break instruction exception - code 80000003 (first chance)

    nt!RtlpBreakWithStatusInstruction:

    8081db0e cc              int     3

     

    ; Get the process object with that process ID.

    kd> !process 0000066c 0

    Searching for Process with Cid == 66c

    PROCESS ff62a638  SessionId: 0  Cid: 066c    Peb: 7ffdf000  ParentCid: 0108

        DirBase: 0390d000  ObjectTable: e1658e38  HandleCount:  51.

        Image: test.exe

     

    Console (text-based) applications are asynchronously sent a separate CTRL_SHUTDOWN_EVENT notification. This means that ExitWindowsEx will proceed no matter how the application chooses to handle this notification.

     

    Services.exe (and all Windows services) is a console application which receives this notification from CSRSS. Services.exe registers a control handler ScShutdownNotificationRoutine which calls ScShutdownAllServices on shutdown. This function traverses through all Windows services that are not stopped or stopping and that accept the shutdown notification and sends them the SERVICE_CONTROL_SHUTDOWN notification. Each service has 20 seconds by default to shutdown. However a service may request more time by calling SetServiceStatus with a wait hint and updated check point. It can do this so long as it continues to respond within the current timeout period. However since Services.exe received this as an asynchronous message from CSRSS.EXE it will not prevent the system from shutting down.

     

    After ExitWindowsEx returns control to Winlogon the ExitWindowsInProgress flag is cleared and LogoffThreadProc exits.

     

    MainLoop

     

    The state of Winlogon is controlled by the main process thread in a function called MainLoop. As LogoffThreadProc sets the ShutdownHasBegun flag and calls ExitWindowsEx the MainLoop function picks up on this change of state and begins executing its shutdown code. Since MainLoop is responsible for interaction with the user this is the first place where the user will get visible confirmation from Winlogon that the system is shutting down.

     

    When MainLoop sees that the Winlogon state has changed to shutting down it takes the following actions.

     

    1.       Signal the shell that we are shutting down. This causes the Explorer shell to disappear.

    2.       Checks to see if there are any updates to the user’s profile.

    3.       Send out the logoff notification event.

    4.       Delete network connections.

    5.       Play the logoff sound.

    6.       Play the system exit sound.

    7.       Creates a KillComProcesses thread. This calls ExitWindowsEx and will wait up to 15 minutes for this to complete.

    8.       Save and unload the user’s profile.

    9.       Delete RAS connections.

    10.    Send out the shutdown notification event.

    11.    Stop Windows file protection.

    12.    Creates another LogoffThreadProc thread which again calls ExitWindowsEx.

    13.    Call the shutdown function in the GINA. This displays the Windows is shutting down dialog.

    14.    Wait for any remaining system processes to complete. If we are stuck here then we would need to look at System, smss.exe, or csrss.exe. One of those will be stuck in some action.

    15.    Shut down the system.

     

    Winlogon uses the native API NtShutdownSystem in step 15. If there is a hang after this point then you will see the main thread in Winlogon stuck in this call and the real problem is likely the result of a device driver in the System process. This call will never return.

  • Ntdebugging Blog

    How Windows Starts Up (Part the second)

    • 7 Comments

    Howdy folks, David here again with part two of How Windows Starts Up.  Today we’ll be covering the Boot Loader Phase.  Let’s take a moment to recap where we are at this point.  So far, the computer has completed POST, executed the MBR code, located the active partition, executed the Boot Sector code in that partition, found NTLDR and loaded it into memory.  So what happens next?  During this phase, we’ll be discussing the following files located in the root of your system volume:  NTLDR, ntdetect.com and boot.ini.

     

    The Boot Loader Phase begins when NTLDR executes:

     

    1.    NTLDR switches the processor from real mode to 32-bit flat memory mode.

    How does it do this you ask?  NTLDR contains two parts – a 16-bit component which handles switching the processor to Protected Mode and a 32-bit component which handles everything else. 

     

    When NTLDR first begins execution, the processor is in an x86-operating mode called Real-Mode.  In Real-Mode, no virtual to physical translation of memory addresses occurs, which means that all memory addresses use the 16-bit Segment:Offset format, and only the first 1MB of the computer's physical memory is accessible.  However, we don’t have to concern ourselves with this too much because the first action NTLDR takes is to switch the CPU to Protected-Mode. 

     

    *Trivia Alert*  before we enter Protected-Mode, we enable the A20 Line – you IBM guys know what I’m talking about… 

     

    Still no virtual-to-physical translation occurs at this point in the boot process, but a full 32 bits of memory addressing becomes accessible, and now NTLDR can access up to 4GB of physical memory. 

     

    2.    NTLDR creates enough page tables to make memory below 16MB accessible with paging turned on.  This number has been increased with Windows Server 2003. 

     

    3.    NTLDR enables paging.  “Protected-Mode with Paging Enabled” is the mode in which Windows executes during normal operation.

     

    4.    NTLDR then starts the FAT and NTFS mini file system drivers contained within its own code.  Unlike the Boot Sector's file system code, NTLDR's code is sophisticated enough to read subdirectories.

     

    Note:   All disk access from this point until the Kernel Phase begins uses BIOS Interrupt 13h, Extended Interrupt 13h or NTBOOTDD.SYS

     

    5.    NTLDR locates and parses Hiberfil.sys to determine if it contains a valid hibernator file.  If a valid hibernator file is found, then that file is loaded into memory and execution continues from the point the computer was placed into hibernation.

     

    Hibernation files are interesting.  The file either contains a full memory image, or it contains a pointer (ARC path) to the boot partition of the last operating system that entered hibernation.  That boot partition will contain another Hiberfil.sys file, which contains a full memory image of the hibernating operating system.  If the Hiberfil.sys file is a pointer, the file will contain an ARC path using one of the following syntaxes:

     

                             - linkmulti()disk()rdisk()partition()

                             - linksignature()disk()rdisk()partition()

                             - linkscsi()disk()rdisk()partition()

     

    6.    If Hiberfil.sys is not found, NTLDR parses the Boot.ini file.  If the Boot.ini contains a reference to more than one operating system or boot option, then the boot loader screen is displayed. 

     

    This is a sample of a default Boot.ini file from a Windows 2000 Server-based computer:

     

    [boot loader]

    timeout=30

    default=multi(0)disk(0)rdisk(0)partition(1)\WINNT

    [operating systems]

    multi(0)disk(0)rdisk(0)partition(1)\WINNT="Windows 2000 Server" /fastdetect

     

    Let’s talk a bit about what’s in the Boot.ini file.  Boot.ini contains a list of ARC paths.  Ok, quick – five bonus points to the first person who can tell me what ‘ARC’ stands for…  come on…  Yes, you there in the red shirt… correct!  It stands for ‘Advanced RISC Computing’; now beam down with the captain.  The rest of you come with me…

     

    ARC paths may use the “Multi”, “SCSI” or “Signature” syntax.

     

    The “Multi” syntax instructs Windows to rely on the system BIOS to load system files.  This means that NTLDR will be using INT13 BIOS calls to find and load NTOSKRNL and any other files on the boot partition.

     

    The “SCSI” syntax tells Windows to load a SCSI device driver to access the boot partition.  A copy of the driver must be renamed to NTBOOTDD.SYS and placed in the root of the system partition.

     

    The “Signature” syntax is equivalent to the SCSI syntax, but is used instead to support the Plug and Play architecture in Windows 2000 or later.  Because Windows 2000 is a PnP OS, the SCSI controller number instance might vary each time you start Windows 2000, especially if you add new SCSI controller hardware after setup.

     

    The Signature syntax instructs NTLDR to locate the drive with a disk signature that matches the value in the parentheses, regardless of which SCSI controller number that the drive is connected to.  Keep in mind that the Signature syntax also requires the presence of an Ntbootdd.sys file in the root of the system partition.  If you recall from the previous post, the disk signature value is located in the MBR.

     

    Example:

    signature(8b467c12)disk(1)rdisk(0)partition(2)\winnt="Windows 2000"

     

          Signature() syntax is used only if one of the following conditions exists:

     

    ·         The System BIOS or controller hosting the boot partition on which Windows is installed, does not support INT13 Extensions, or has the INT13 Extensions option disabled, and the partition on which you installed Windows is either larger than ~7.8 gigabytes (GB) in size, or the ending cylinder number for that partition is higher than 1024.

    ·         The drive on which you installed Windows is connected to a SCSI controller with SCSI BIOS disabled, so INT13 BIOS calls cannot be used during the boot process.

     

                            227704 Windows May Use Signature() Syntax in the Boot.ini File

                            http://support.microsoft.com/default.aspx?scid=kb;EN-US;227704

     

    7.    If Boot.ini is not found, NTLDR attempts to continue the boot process by assuming that Windows was installed in the default location on the same partition (c:\windows).  If the c:\windows directory is not found, NTLDR displays this error message:  Windows could not start because the following file is missing or corrupt: \winnt root\system32\ntoskrnl.exe

     

    8.    If the boot loader screen is displayed, the user must select an operating system to boot.

    a.    If the user chooses to run an operating system other than Windows 2003/XP such as DOS or Windows 95/98, NTLDR loads and executes the Bootsect.dos file located in the root of the boot drive.

    (1)  If the Bootsect.dos file is found, NTLDR loads it into RAM and then passes control to it.

    (2)  Bootsect.dos then loads IO.sys, MSDOS.sys, Command.com, etc., as necessary and proceeds to boot the downlevel operating system.

    (3)  If the file is not found in the root, NTLDR displays:

    I/O Error accessing boot sector file multi(0)disk(0)rdisk(0)partition(1): \bootsect.dos

    b.    If the user selects a Windows NT-based operating system, NTLDR executes Ntdetect.com.

    (1)  If the ARC path points to a hard drive or partition that is not on the system, the following is displayed:

    Windows could not start because of a computer disk hardware configuration problem.  Could not read from the selected boot disk.  Check boot path and disk hardware.  Please check the Windows documentation about hardware disk configuration and your hardware reference manuals for additional information.

    (2)  The above message may also be displayed if you are using Signature() ARC path notation and the disk signature in the MBR is overwritten (for example, by a virus) or changed (corrupted) so that it no longer matches the Signature() value in the Boot.ini file.  The following error message may also be displayed when there are problems with the Signature () value:

    Windows could not start because of the following ARC firmware boot configuration problem: did not properly generate ARC name for HAL and system paths. Please check the Windows documentation about ARC configuration options and your hardware reference manuals for additional information.

     

    9.    Ntdetect.com initializes and detects all supported hardware that it can find.  (Troubleshooting tip:  There is an Ntdetect.chk on the Windows Resource Kit CD in the Debug.cab file, which can be renamed and used in place of Ntdetect.com to view what Ntdetect.com is detecting during startup.  You know you want to – go ahead, I’ll wait…)

     

    Ntdetect.com detects the following components:

     

    a)    Computer ID

    b)    Bus/adapter type

    c)    Video adapter

    d)    Keyboard

    e)    Communications ports

    f)     Floppy disks

    g)    Mouse or other pointing devices

    h)    Parallel ports

     

    10.  Ntdetect.com passes its list of hardware components to NTLDR.

     

    11.  If the /SOS switch was used in the Boot.ini for the OS that is loading, NTLDR will display the files it loads into memory from this point on.  Keep in mind – none of these files are being initialized.  They are just being loaded into RAM at this point.  I can’t tell you how many times someone has incorrectly concluded that there is a problem with MUP.SYS because it’s the last file displayed on this screen before their computer hangs.  Poor MUP – always getting a bad rap.  This list can be useful, however; it can help you identify potential malware that may be loading.

     

    12.  NTLDR loads Ntoskrnl.exe and Hal.dll into memory.  Again, *loading*, not initializing.

     

    13.  NTLDR loads the System registry hive into memory.  If Windows cannot load the registry hive, it may be necessary to restore the System hive from the ERD or by using the Windows Recovery Console to copy the files from the Windows\Repair or Windows\Repair\RegBack folder.

     

    14.  NTLDR reads the HKLM\System\Select registry key.

    a.    If the Last Known Good option was not selected, NTLDR looks for the “Default” value in the SELECT key to select which ControlSet to use.  (0x1 = ControlSet001, 0x2 = ContolSet002, 0x3 = ControlSet003) 

    b.    If the Last Known Good option was selected, NTLDR looks for the “LastKnownGood” value in the SELECT key to select the ControlSet to use.

     

    15.  NTLDR sets the “Current” value in the HKLM\System\Select key to the ControlSet being used.

     

    Note:  At this point, NTLDR has determined which ControlSet it needs to use, but we don’t actually create the “CurrentControlSet” symbolic link until the kernel initializes (more on that in our next installment).  Therefore, for the purposes of this discussion, I’ll use *ActiveControlSet* to refer to whichever registry ControlSet has been identified as the one NTLDR needs to use.

     

    16.  NTLDR scans HKLM\System\*ActiveControlSet*\Control\Nls\Codepage for the “Acp”, “Oemcp”, and “Oemhal” values.  These entries are pointers to the table of codepage information stored in the same key.  The table has the filenames that correspond to the codepage numbers.  NTLDR loads these files into memory.  The defaults are C_1252.nls, C_437.nls, and Vgaoem.fon.

     

    A code page is a character set specific to an operating system or hardware platform.  It maps character codes to individual characters.  Different code pages include different special characters, typically customized for a language or a group of languages.  The system uses code pages to translate keyboard input into character values for non-Unicode based applications, and to translate character values into characters for non-Unicode based output displays.

     

    17.  NTLDR scans HKLM\System\*ActiveControlSet*\Control\Nls\Language for the “Default” value.  This entry is a pointer to the table of language information stored in the same key.  NTLDR loads this file into memory.  The default is L_intl.nls.

     

    18.  NTLDR scans HKLM\System\*ActiveControlSet*\Services for device drivers with a “Start” value of 0x0.  It finds their “Group” and “Tag” values also.

     

    19.  NTLDR scans HKLM\System\*ActiveControlSet*\Control\ServiceGroupOrder for the order that service groups should be loaded.

     

    20.  NTLDR sorts the drivers by “Group” and then by their “Tag” value.  Each service or driver has a particular group to which it belongs.  All drivers within a specific group are ordered by tag. 

     

    21.  NTLDR loads the 0x0 device drivers into memory.  It does not initialize or start these drivers.

     

    22.  The board is set, and the pieces are moving.  NTLDR executes Ntoskrnl.exe.

     

    Troubleshooting the Boot Loader Phase

     

    Okay, now that we’ve seen what is supposed to happen, what could possibly go wrong?  Well, several things…

     

    First, all file-system activity during this phase uses BIOS Int13, Extended Int13 or Ntbootdd.sys. 

     

    1.    If your computer uses Extended Int 13, verify that Ext. Int13 is enabled in your system’s BIOS or SCSI BIOS as appropriate.

     

    2.    If you are using a SCSI driver, verify that you are using the correct SCSI driver (Ntbootdd.sys).  If you are using "SCSI()" syntax in the Boot.ini file, copy the correct device driver for the SCSI controller to the root and then rename it to Ntbootdd.sys.  This is the driver that is stored in the Winnt.sbs\System32\Drivers folder (for example, the Aic78xx.sys file).  If you use "MULTI()" syntax in the Boot.ini file, you do not need to do this.  Use the Windows Recovery Console or a Windows NT boot floppy to complete this step.

     

                301680 HOW TO: Create a Boot Disk for an NTFS or FAT Partition

                http://support.microsoft.com/default.aspx?scid=kb;EN-US;301680

     

    307654 How to install and use the Recovery Console in Windows XP

    http://support.microsoft.com/default.aspx?scid=kb;EN-US;307654

     

     

    Significant Errors during this phase:

     

    ·         Windows could not start because of a computer disk hardware configuration problem.  Could not read from the selected boot disk.  Check boot path and disk hardware.  Please check the Windows documentation about hardware disk configuration and your hardware reference manuals for additional information.

     

    This error indicates that the ARC path is pointing to a hard drive or partition that is not on the system.  This message may also be displayed if you are using Signature() ARC path notation and the disk signature in the MBR is overwritten (for example, by a virus) or changed (corrupted) so that it no longer matches the Signature() value in the Boot.ini file. 

     

    ·         Windows could not start because of the following ARC firmware boot configuration problem: did not properly generate ARC name for HAL and system paths. Please check the Windows documentation about ARC configuration options and your hardware reference manuals for additional information.

     

    This error message may be displayed when there are problems with the Signature () ARC path syntax.

     

    Ntoskrnl.exe missing or corrupt.  This error could be the result of an incorrect ARC path in the Boot.ini file.  Use the information in the following knowledge base articles to correct the Boot.ini information.

     

                102873 BOOT.INI and ARC Path Naming Conventions and Usage

                http://support.microsoft.com/default.aspx?scid=kb;EN-US;102873

     

                155222 How to Determine the ARC Path

                http://support.microsoft.com/default.aspx?scid=kb;EN-US;155222

     

                130921 Creating an FT Boot Disk With Scsi( ) and Multi( ) Identifiers

                http://support.microsoft.com/default.aspx?scid=kb;EN-US;130921

     

    It is also possible to use the Windows XP Recovery Console to create a new Boot.ini for a Windows NT, Windows 2000 or .NET installation.  Please see the following knowledge base article for a description of this process:

     

                291980 A Discussion About the Bootcfg Command and Its Uses

                http://support.microsoft.com/default.aspx?scid=kb;EN-US;Q291980

     

    Specify the /SOS switch in the Boot.ini file and monitor the output to the screen.  If the system stops responding before anything is output to the screen, then the problem may be with hardware detection.  Replace Ntdetect.com with Ntdetect.chk to view the hardware detected by the system during this phase.

     

                103659 Setup Hangs While Inspecting Hardware; How to Use Ntdetect.com

                http://support.microsoft.com/default.aspx?scid=kb;EN-US;103659

     

                320040 HOW TO: Use Ntdetect.chk to Identify Hardware Problems

    http://support.microsoft.com/default.aspx?scid=kb;EN-US;320040

     

     

    Incorrect HAL

    If after specifying the /SOS option, you see that an incorrect version of the HAL.DLL is loading, use the following KB articles to replace the HAL or troubleshoot other HAL-related issues:

     

                314477 "Hardware Configuration Problem" Err Msg Starting Windows

                http://support.microsoft.com/default.aspx?scid=kb;EN-US;314477

     

                237556 Troubleshooting Windows 2000 Hardware Abstraction Layer Issues

                http://support.microsoft.com/default.aspx?scid=kb;EN-US;237556

     

    330184  "Invalid Boot.ini" or "Windows could not start" error messages

    http://support.microsoft.com/default.aspx?scid=kb;EN-US;330184

     

     

    Unable to load System hive

    Several problems could happen here.  First, consider that the amount of memory we’re dealing with is only 16MB.  We have to squeeze NTLDR, the System Registry Hive, the kernel, the HAL, all the boot drivers and the PFN database into a relatively small area.  So, with Windows Server 2003, we’ve given ourselves a bit more room. 

     

                302594 The System hive memory limitation is improved in Windows Server 2003

    http://support.microsoft.com/default.aspx?scid=kb;EN-US;302594

     

    If Windows is unable to load the System registry hive into memory, then it may be necessary to restore the System hive from the ERD or use the Windows Recovery Console to copy the files from the Winnt\Repair or Winnt\Repair\RegBack folder whichever version is most recent.

     

                307545 How to Recover from a Corrupted Registry

                http://support.microsoft.com/default.aspx?scid=kb;EN-US;307545

     

     

    Table of Startup Type Values

    HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services

    0x0 = Boot

    Loaded into RAM by the Boot loader (NTLDR) during Boot Loader Phase, then initialized by the Kernel (NTOSKRNL) during Kernel Phase

    0x1 = System

    Loaded and initialized by the Kernel during Kernel Phase

    0x2 = Auto

    Loaded or started automatically at system startup

    0x3 = Manual

    Driver is manually started by the user or another process

    0x4 = Disabled

    Driver is not to be started (There are exceptions, such as NTFS)

     

     

    Conclusion

    Well folks, we certainly have covered a lot of ground so far.  Any questions?  Yes, you there in the front…  Ah, why have we not discussed any blue screens?  Very good question.  We haven’t seen any BSOD’s up to this point in the boot sequence because the kernel isn’t running yet and STOP errors are generated by the kernel.  No kernel, no blue screens. 

     

    That’s all the time we have this week boys & girls.  Join us next time for How Windows Starts Up, The Kernel Phase.

     

  • Ntdebugging Blog

    Hung Window?, No Source?, No Problem!! Part 2

    • 4 Comments

    Written by Jeff Dailey

     

    Hello, my name is Jeff, I’m a escalation engineer on the Microsoft CPR (critical problem resolution) platforms team.   This blog entry is part 2 of my Hung Window?, No source?, No problem!! Part 1 blog.   In this lab we will be debugging a problem involving multi threaded applications and synchronization objects and the types of things that can go wrong, and how to track them down. This process and training lab is right out of our CPR Training curriculum.  In order to do the lab I have prepared for you, you will need to have downloaded the dumphungwindow and then badwindow.exe from my earlier blog post.  You will also need to install the debugging tools for windows.

     

    Debugging tools:

     http://www.microsoft.com/whdc/devtools/debugging/installx86.mspx

    Previous blog http://blogs.msdn.com/ntdebugging/archive/2007/05/29/detecting-and-automatically-dumping-hung-gui-based-windows-applications.aspx

     

    After you have both of these installed we can get started.  We are going to debug and figure out why the window stops repainting and does not respond.

     

    Step 1 start badwindow.exe

    Step 2 run dumphungwindow.exe

    Step 3 select Hang \ Hang Type 2 from  the BadWindow.exe menu.

    You should see dump hung window detect your window no processing messages and as a result it will dump the badwindow.exe process

     

    ************ OUTPUT *************

    C:\source\dumphungwindow\release>dumphungwindow.exe
    Dumps will be saved in C:\Users\jeffda\AppData\Local\Temp\
    scanning for hung windows

    **

    Hung Window found dumping process (12912) badwindow.exe
    Dumping unresponsive process
    C:\Users\jeffda\AppData\Local\Temp\HWNDDump_Day6_14_2007_Time7_34_5_Pid12912_badwindow.exe.dmp

    Dump complete

     

    Hung Window found dumping process (12912) badwindow.exe

    Dumping unresponsive process
    C:\Users\jeffda\AppData\Local\Temp\HWNDDump_Day6_14_2007_Time7_34_24_Pid12912_badwindow.exe.dmp\jeffda\AppData\Local\Temp\HWNDDump_Day6_12_2007_Time9_53_56_Pid7924_badwindow.exe.dmp

    Dump complete*

    ************ OUTPUT *************

     

    Step 4 create a local symbol directory at C:\websymbols

    Step 5 set your symbol path under file \ symbols in windbg to SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols

    See http://www.microsoft.com/whdc/devtools/debugging/debugstart.mspx for details.


    Step 6 start windbg select file\open crash dump and select the first dump file.

    Your initial output should look like this.

     

    Microsoft (R) Windows Debugger  Version 6.7.0001.0

    Copyright (c) Microsoft Corporation. All rights reserved.

     

    ***** WARNING: Your debugger is probably out-of-date.

    *****          Check http://dbg for updates.

     

    Loading Dump File [C:\Users\jeffda\AppData\Local\Temp\HWNDDump_Day6_12_2007_Time9_53_34_Pid7924_badwindow.exe.dmp]

    User Mini Dump File with Full Memory: Only application data is available

     

    Symbol search path is: SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols;srv

    Executable search path is:

    Windows Vista Version 6000 MP (2 procs) Free x86 compatible

    Product: WinNt, suite: SingleUserTS

    Debug session time: Tue Jun 12 09:53:35.000 2007 (GMT-4)

    System Uptime: 11 days 18:41:43.089

    Process Uptime: 0 days 0:00:32.000

    ....................................

    Loading unloaded module list

    .

    eax=00000000 ebx=00000002 ecx=00000000 edx=00000000 esi=00000000 edi=00000000
    eip=777faec5 esp=0017faf4 ebp=0017fb8c iopl=0         nv up ei pl nz na po nc
    cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000202
    ntdll!ZwWaitForMultipleObjects+0x15:
    777faec5 c21400          ret     14h

    0:000> !reload

     

    Step 7 from the debugger prompt (Locate a prompt at the bottom of windbg that has a 0:000> next to it.
    Type ~* k

     

     

    You will most likely see output similar to this.

    .  0  Id: 3270.2b10 Suspend: 0 Teb: 7efdd000 Unfrozen

    ChildEBP RetAddr 

    0017faf0 76e4edb5 ntdll!ZwWaitForMultipleObjects+0x15

    0017fb8c 76e430c3 kernel32!WaitForMultipleObjectsEx+0x11d

    0017fba8 00401502 kernel32!WaitForMultipleObjects+0x18

    0017fbc8 0040139b badwindow!hangtype2+0x42 [c:\source\badwindow\badwindow\badwindow.cpp @ 340]

    0017fc24 772a87af badwindow!WndProc+0x17b [c:\source\badwindow\badwindow\badwindow.cpp @ 274]

    0017fc50 772a8936 user32!InternalCallWinProc+0x23

    0017fcc8 772a8a7d user32!UserCallWinProcCheckWow+0x109

    0017fd2c 772a8ad0 user32!DispatchMessageWorker+0x380

    0017fd3c 004010fb user32!DispatchMessageW+0xf

    0017ff0c 00401817 badwindow!wWinMain+0xfb [c:\source\badwindow\badwindow\badwindow.cpp @ 124]

    0017ffa0 76eb19f1 badwindow!__tmainCRTStartup+0x150 [f:\sp\vctools\crt_bld\self_x86\crt\src\crtexe.c @ 589]

    0017ffac 7782d109 kernel32!BaseThreadInitThunk+0xe

    0017ffec 00000000 ntdll!_RtlUserThreadStart+0x23

     

       1  Id: 3270.2cd0 Suspend: 0 Teb: 7efda000 Unfrozen

    ChildEBP RetAddr 

    026ffebc 777ecfad ntdll!ZwWaitForSingleObject+0x15

    026fff20 777ecf78 ntdll!RtlpWaitOnCriticalSection+0x154

    026fff48 0040153c ntdll!RtlEnterCriticalSection+0x152

    026fff64 757c2848 badwindow!hangtype2threada+0x2c [c:\source\badwindow\badwindow\badwindow.cpp @ 358]

    026fff9c 757c28c8 msvcr80!_endthread+0x4b

    026fffa0 76eb19f1 msvcr80!_endthread+0xcb

    026fffac 7782d109 kernel32!BaseThreadInitThunk+0xe

    026fffec 00000000 ntdll!_RtlUserThreadStart+0x23

    Note the [NUMBER] Id: indicates the thread number, to the right of this you have the process id and thread id > PROCESS 3270.2b10 < THREAD | THREAD STATE > Suspend: 0 Teb: 7efdd000 Unfrozen

     

    Each of these threads represents a call stack.  The most recent call is at the TOP of the stack.  As each call is made the stack grows larger.   Looking at thread 0 you will see that our winproc appears to be blocked on a call to hangtype2, hangtype2 is making a call to WaitForMultipleObjects Lets look more closely at WaitForMultipleObjects

     

    Docs for WaitForMultipleObjects

    http://msdn2.microsoft.com/en-us/library/ms687025.aspx

     

    DWORD WINAPI WaitForMultipleObjects( DWORD nCount, const HANDLE* lpHandles, BOOL bWaitAll, DWORD dwMilliseconds

     

    Lets look at the parameters passed to

     

    0:000> kv

    ChildEBP RetAddr  Args to Child             

    0017faf0 76e4edb5 00000002 0017fb40 00000000 ntdll!ZwWaitForMultipleObjects+0x15 (FPO: [5,0,0])

    0017fb8c 76e430c3 0017fb40 0017fbc4 00000001 kernel32!WaitForMultipleObjectsEx+0x11d (FPO: [Non-Fpo])

    0017fba8 00401502 00000002 0017fbc4 00000001 kernel32!WaitForMultipleObjects+0x18 (FPO: [Non-Fpo])

    0017fbc8 0040139b 00401220 0017fbfc 00401220 badwindow!hangtype2+0x42 (FPO: [0,2,0]) (CONV: cdecl) [c:\source\badwindow\badwindow\badwindow.cpp @ 340]

    0017fc24 772a87af 00063d36 00000111 00008004 badwindow!WndProc+0x17b (CONV: stdcall) [c:\source\badwindow\badwindow\badwindow.cpp @ 274]

    0017fc50 772a8936 00401220 00063d36 00000111 user32!InternalCallWinProc+0x23

    0017fcc8 772a8a7d 00000000 00401220 00063d36 user32!UserCallWinProcCheckWow+0x109 (FPO: [Non-Fpo])

    0017fd2c 772a8ad0 00401220 00000000 0017ff0c user32!DispatchMessageWorker+0x380 (FPO: [Non-Fpo])

    0017fd3c 004010fb 0017fd54 00403938 00000001 user32!DispatchMessageW+0xf (FPO: [Non-Fpo])

    0017ff0c 00401817 00400000 00000000 00280f8c badwindow!wWinMain+0xfb (CONV: stdcall) [c:\source\badwindow\badwindow\badwindow.cpp @ 124]

    0017ffa0 76eb19f1 7efde000 0017ffec 7782d109 badwindow!__tmainCRTStartup+0x150 (FPO: [Non-Fpo]) (CONV: cdecl) [f:\sp\vctools\crt_bld\self_x86\crt\src\crtexe.c @ 589]

    0017ffac 7782d109 7efde000 0017fb9e 00000000 kernel32!BaseThreadInitThunk+0xe (FPO: [Non-Fpo])

    0017ffec 00000000 00401987 7efde000 00000000 ntdll!_RtlUserThreadStart+0x23 (FPO: [Non-Fpo])

     

    The first parameter is 00000002  this is the number of objects we are waiting on.

    The second parmeter is the address of the array of objects,  Lets dump it out and take a look at the objects

     

    0:000> dd 0017fbc4

    0017fbc4  000000c4 000000c8 0040139b 00401220

    0017fbd4  0017fbfc 00401220 00063d36 0017fc48

    0017fbe4  772a8989 772a894d 53ca28e7 00000000

    0017fbf4  00063d36 00401220 00000000 00000000

    0017fc04  00000000 0017fca0 00000001 00000000

    0017fc14  ffffffff 772a88e5 53c4f4b4 75c12459

    0017fc24  0017fc50 772a87af 00063d36 00000111

    0017fc34  00008004 00000000 00401220 dcbaabcd

     

    0:000> !handle 000000c4

    Handle 000000c4

      Type            Thread

    0:000> !handle 000000c8

    Handle 000000c8

      Type            <Error retrieving type>

     

    Looking at the second value it would appear as if all the info needed to get the handle type info is not in the dump for some reason.  Handles are a index into the handle table in kernel.  It’s possible when the dump was created that no all the handle info was included.  However that’s ok.  We have a simple way to work around this and see what happened.

     

    We can use UF from part 1 of this blogs on badwindow.exe, All we need to do is UF the return address of

    WaitForMultipleObjects.  Lets run through the assembly and see what we are waiting on.

     

    0:000> uf 00401502

    badwindow!hangtype2 [c:\source\badwindow\badwindow\badwindow.cpp @ 334]:

     

    Reserving space on the stack by decrementing ESP (The stack pointer, remember the stack grows down in memory)

      334 004014c0 83ec08          sub     esp,8

     

    Save the state of ESI so it can be restored later.

      334 004014c3 56              push    esi

     

    Get the pointer to _beginthread from the import table and store it in ESI
    Docs on being thread http://msdn2.microsoft.com/en-us/library/kdzttdcb(VS.80).aspx 1 (start address), 2 (stack size), 3 (arglist)

      337 004014c4 8b3580204000    mov     esi,dword ptr [badwindow!_imp___beginthread (00402080)]

     

     

    Push the last arg to _beginthread on the stack this is the arg list for _beginthread in this case 0 we are passing no args.

      337 004014ca 6a00            push    0


    This is our stack space.  Note in the debugger you can do a ? 2EE0h and it will show value in Hex and Dec, this value is 12000 dec.

      337 004014cc 68e02e0000      push    2EE0h

     

    This is the start address for our thread function, in this case hangtype2threada

      337 004014d1 6810154000      push    offset badwindow!hangtype2threada (00401510)

     

    Here we call _beginthread this starts the thread up.  The return value is a thread handle.

      337 004014d6 ffd6            call    esi

     

    Push the last arg to _beginthread on the stack this is the arg list for _beginthread in this case 0 we are passing no args.

      338 004014d8 6a00            push    0

     

    This is our stack space arg for _beingthread

      338 004014da 68e02e0000      push    2EE0h

     

    This is the start address for our thread function, in this case hangtype2threadb

      338 004014df 6870154000      push    offset badwindow!hangtype2threadb (00401570)

     

    We are now storing EAX (The return from the first _beginthreadcall ) into ESP+1ch (on our stack)

      338 004014e4 8944241c        mov     dword ptr [esp+1Ch],eax

     

    Here we call _beginthread this starts the thread up.  The return value is a thread handle.

      338 004014e8 ffd6            call    esi

     

    Any time we add to ESP We are shrinking or cleaning up the stack.

      338 004014ea 83c418          add     esp,18h

     

    We are pushing our wait time for WaitForMultipleObjects in this case 0FFFFFFFFh (-1) Wait forever.

      340 004014ed 6aff            push    0FFFFFFFFh

     

    Storing EAX on the stack, this is the thread handle from our last _beginthread call.

      340 004014ef 8944240c        mov     dword ptr [esp+0Ch],eax

     

    This is our wait logic,  in this case it’s WaitAll, so we will only unblock once all handles are signaled or in this case threads complete running.

      340 004014f3 6a01            push    1

     

    Here we are loading the pointer of the stack location that contains our handles that we will wait on into EAX.

      340 004014f5 8d44240c        lea     eax,[esp+0Ch]

     

    Now we push the pointer to our handles / objects on the to the stack.

      340 004014f9 50              push    eax

     

    And this is the count of objects, 2 in this case both of them threads.

      340 004014fa 6a02            push    2

     

    Now we call our WaitForMultipleObjects call to wait on hangtype2threadb and hangtype2threada to finish executing.

      340 004014fc ff1510204000    call    dword ptr [badwindow!_imp__WaitForMultipleObjects (00402010)]

     

    Restore our ESI register, this will happen when we return. 

      340 00401502 5e              pop     esi 

     

    Dec our stack pointer.

      342 00401503 83c408          add     esp,8

     

    Return we are done.

      342 00401506 c3              ret

     

    Here is the source.

     

    void hangtype2(void)

    {

          HANDLE handles[2];

     

          handles[0] = (HANDLE)_beginthread(hangtype2threada, 12000, NULL);

          handles[1] = (HANDLE)_beginthread(hangtype2threadb, 12000, NULL);

         

          WaitForMultipleObjects(2,handles,1,INFINITE);

     

    }

     

     

    So what went wrong?  Let’s look at our threads again. 

     

    We have our main message pump thread thread 0 waiting on two threads, One is still running badwindow!hangtype2threada and the other one is gone or has completed hangtype2threadb.

     

    0  Id: 3270.2b10 Suspend: 0 Teb: 7efdd000 Unfrozen

    ChildEBP RetAddr 

    0017faf0 76e4edb5 ntdll!ZwWaitForMultipleObjects+0x15

    0017fb8c 76e430c3 kernel32!WaitForMultipleObjectsEx+0x11d

    0017fba8 00401502 kernel32!WaitForMultipleObjects+0x18

    0017fbc8 0040139b badwindow!hangtype2+0x42 [c:\source\badwindow\badwindow\badwindow.cpp @ 340]

    0017fc24 772a87af badwindow!WndProc+0x17b [c:\source\badwindow\badwindow\badwindow.cpp @ 274]

    0017fc50 772a8936 user32!InternalCallWinProc+0x23

    0017fcc8 772a8a7d user32!UserCallWinProcCheckWow+0x109

    0017fd2c 772a8ad0 user32!DispatchMessageWorker+0x380

    0017fd3c 004010fb user32!DispatchMessageW+0xf

    0017ff0c 00401817 badwindow!wWinMain+0xfb [c:\source\badwindow\badwindow\badwindow.cpp @ 124]

    0017ffa0 76eb19f1 badwindow!__tmainCRTStartup+0x150 [f:\sp\vctools\crt_bld\self_x86\crt\src\crtexe.c @ 589]

    0017ffac 7782d109 kernel32!BaseThreadInitThunk+0xe

    0017ffec 00000000 ntdll!_RtlUserThreadStart+0x23

     

    Looking at hangtype2threada it would seem that it is blocked on RtlEnterCriticalSection.

     

       1  Id: 3270.2cd0 Suspend: 0 Teb: 7efda000 Unfrozen

    ChildEBP RetAddr 

    026ffebc 777ecfad ntdll!ZwWaitForSingleObject+0x15

    026fff20 777ecf78 ntdll!RtlpWaitOnCriticalSection+0x154

    026fff48 0040153c ntdll!RtlEnterCriticalSection+0x152

    026fff64 757c2848 badwindow!hangtype2threada+0x2c [c:\source\badwindow\badwindow\badwindow.cpp @ 358]

    026fff9c 757c28c8 msvcr80!_endthread+0x4b

    026fffa0 76eb19f1 msvcr80!_endthread+0xcb

    026fffac 7782d109 kernel32!BaseThreadInitThunk+0xe

    026fffec 00000000 ntdll!_RtlUserThreadStart+0x23

     

    Lets look and see what is happening with this critical section call..

     

    First lets set our thread context to thread id 1

     

    0:000> ~1s

    eax=00000000 ebx=00000000 ecx=00000000 edx=00000000 esi=00403780 edi=00000000

    eip=777fa69d esp=026ffec0 ebp=026fff20 iopl=0         nv up ei pl nz na po nc

    cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000202

    ntdll!ZwWaitForSingleObject+0x15:

    777fa69d c20c00          ret     0Ch

     

    Lets get our call stack and get the first and only arg for entercriticalsection.

     

    0:001> kv

    ChildEBP RetAddr  Args to Child              

    026ffebc 777ecfad 000000cc 00000000 00000000 ntdll!ZwWaitForSingleObject+0x15 (FPO: [3,0,0])

    026fff20 777ecf78 00000000 00000000 76e61d5a ntdll!RtlpWaitOnCriticalSection+0x154 (FPO: [Non-Fpo])

    026fff48 0040153c 00403780 00000000 00000000 ntdll!RtlEnterCriticalSection+0x152 (FPO: [Non-Fpo])

    026fff64 757c2848 00000000 51b22bb2 00000000 badwindow!hangtype2threada+0x2c (FPO: [Uses EBP] [1,1,0]) (CONV: cdecl) [c:\source\badwindow\badwindow\badwindow.cpp @ 358]

    026fff9c 757c28c8 76eb19f1 02274358 026fffec msvcr80!_endthread+0x4b (FPO: [Non-Fpo])

    026fffa0 76eb19f1 02274358 026fffec 7782d109 msvcr80!_endthread+0xcb (FPO: [Non-Fpo])

    026fffac 7782d109 02274358 026ffb9e 00000000 kernel32!BaseThreadInitThunk+0xe (FPO: [Non-Fpo])

    026fffec 00000000 757c286e 02274358 00000000 ntdll!_RtlUserThreadStart+0x23 (FPO: [Non-Fpo])

     

    We can do a couple of things at this point first lets look at the CS. (Critical Section)

     

    0:001> !cs 00403780

    -----------------------------------------

    Critical section   = 0x00403780 (badwindow!csCritSec1+0x0)

    DebugInfo          = 0x0029bd40

    LOCKED             < It’s LOCKED.

    LockCount          = 0x1

    WaiterWoken        = No

    OwningThread       = 0x00002048  < This is the owning thread.

    RecursionCount     = 0x14

    LockSemaphore      = 0xCC

    SpinCount          = 0x00000000

     

    Are there any other locked critical sections?  !locks will tell us and no this is the only one.

     

    0:001> !locks

     

    CritSec badwindow!csCritSec1+0 at 00403780

    WaiterWoken        No

    LockCount          1

    RecursionCount     20

    OwningThread       2048

    EntryCount         0

    ContentionCount    1

    *** Locked

     

    Scanned 156 critical sections

     

    What thread are running in our process and what is 2048 doing?

     

    0:001> ~

    #  0  Id: 3270.2b10 Suspend: 0 Teb: 7efdd000 Unfrozen

    .  1  Id: 3270.2cd0 Suspend: 0 Teb: 7efda000 Unfrozen

     

    Ok here is our problem.  Apparently both threads hangtype2threada and hangtype2threadb were using this same critical section however something happened to hangtype2threadb.  We need to figure out what happened so let’s go take a look at that function.

     

    Looking back where we unassembled badwindow!hangtype2 we got it’s address, lets verify it with a ln (list near), we are lucky enough to have symbols in this case.

     

    0:001> ln 00401570

    c:\source\badwindow\badwindow\badwindow.cpp(370)

    (00401570)   badwindow!hangtype2threadb   |  (004015e0)   badwindow!hangtype3thread

    Exact matches:

        badwindow!hangtype2threadb (void *)

     

    Looks like we have an exact match.  Now lets unassemble it and see what went wrong.

     

     

    0:001> uf 00401570

    badwindow!hangtype2threadb [c:\source\badwindow\badwindow\badwindow.cpp @ 370]:

     

    Save ECX

      370 00401570 51              push    ecx

     

    Save EBX

      370 00401571 53              push    ebx

     

    Move the pointer to sprint into EBX

      371 00401572 8b1d7c204000    mov     ebx,dword ptr [badwindow!_imp__sprintf (0040207c)]

     

    Save EBP

      371 00401578 55              push    ebp

     

    Move the pointer to outputdebugstring into ebp

      371 00401579 8b2d14204000    mov     ebp,dword ptr [badwindow!_imp__OutputDebugStringA (00402014)]

     

    Save ESI

      371 0040157f 56              push    esi

     

    Move the pointer to EnterCriticalSection into ESI

      371 00401580 8b351c204000    mov     esi,dword ptr [badwindow!_imp__EnterCriticalSection (0040201c)]

     

    Save EDI

      371 00401586 57              push    edi

     

    Move the pointer for Sleep into EDI

      371 00401587 8b3d0c204000    mov     edi,dword ptr [badwindow!_imp__Sleep (0040200c)]

     

    Save  14h or 20dec to ESP+10h (A local on the stack)  Maybe this is a counter?

      371 0040158d c744241014000000 mov     dword ptr [esp+10h],14h

     

    Push the address of the critical section csCritSec1 00403780 onto the stack.

      374 00401595 6880374000      push    offset badwindow!csCritSec1 (00403780)

     

    Call entercriticalsection

      374 0040159a ffd6            call    esi

     

    Push 0xFA, 250Dec on the stack

      376 0040159c 68fa000000      push    0FAh

    Call Sleep (Wait for 250ms)

      376 004015a1 ffd7            call    edi

    Push pointer to value on the stack. When in doubt dump it out.

    0:001> db 004021e4

    004021e4  57 65 20 61 72 65 20 69-6e 20 68 61 6e 67 74 79  We are in hangty

    004021f4  70 65 32 74 68 72 65 61-64 62 00 00 48 00 00 00  pe2threadb..H...

    00402204  00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00  ................

    00402214  00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00  ................

      377 004015a3 68e4214000      push    offset badwindow!`string' (004021e4)

     

     

    Push pointer on the stack what is it?

    0:001> db 00403380

    00403380  57 65 20 61 72 65 20 69-6e 20 68 61 6e 67 74 79  We are in hangty

    00403390  70 65 32 74 68 72 65 61-64 62 00 00 00 00 00 00  pe2threadb......

    004033a0  00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00  ................

    004033b0  00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00  ................

      377 004015a8 6880334000      push    offset badwindow!szTrace (00403380)

     

    Call spirintf

      377 004015ad ffd3            call    ebx

     

    Clean up the stack

      377 004015af 83c408          add     esp,8

     

    This is out buffer we just did the sprintf to

      378 004015b2 6880334000      push    offset badwindow!szTrace (00403380)


    Call outputdebugstring

      378 004015b7 ffd5            call    ebp

     

    Push csCritSec2’s address on the stack

      380 004015b9 6868374000      push    offset badwindow!csCritSec2 (00403768)

     

    Call LeaveCriticalSection for csCritSec2’s

      380 004015be ff1524204000    call    dword ptr [badwindow!_imp__LeaveCriticalSection (00402024)]

     

    Decrement a counter on the stack this is a local counting down to zero..

      382 004015c4 836c241001      sub     dword ptr [esp+10h],1

     

    Check counter local counting down to zero if we are not ZERO yet dump to the top of the loop.

      382 004015c9 75ca            jne     badwindow!hangtype2threadb+0x25 (00401595)

     

     

    Restore all the registers and then return

      382 004015cb 5f              pop     edi

      382 004015cc 5e              pop     esi

      382 004015cd 5d              pop     ebp

      382 004015ce 5b              pop     ebx

      387 004015cf 59              pop     ecx

      387 004015d0 c3              ret

     

     

    Did you see the BUG?,  Look closely,   If you need it here is the source.

    void __cdecl hangtype2threadb(void *)

    {

          int i=0;

          while(1)

          {

                EnterCriticalSection(&csCritSec1);

               

                Sleep(250);

                sprintf(szTrace, "We are in hangtype2threadb");

                OutputDebugStringA(szTrace);

               

                LeaveCriticalSection(&csCritSec2);

                i++;

                if(i==20)

                {

                      break;

                }

          }

    }

     

    We are entering one critical section and leaving another.  Then we drop out of the function once we dec our counter to zero and the thread terminates leaving csCritSec1 entered but never left.  The fix for this seems rather simple,  we just need to leave critsec1 vis leave creatsec2.  That should fix it.  But it we don’t have the source how can we verify that?

    SIMPLE! We just modify the machine code in the debugger!   Often if we think we know how to fix something we will edit the code bytes to make the machine code do the right thing and let it run. 

     

    Do do this, from the command line in your debuggers directory run windbg.exe C:\source\badwindow\release\badwindow.exe   asuming you have your bad window sample in the same directory, I do.   When the debugger fires up make sure you have your symbol path set.  .sympath SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols.

     

    Our bad funtion call was

     

    Push csCritSec2’s address on the stack  << WRONG CRITICALSECTION

      380 004015b9 6868374000      push    offset badwindow!csCritSec2 (00403768)

     

    Call LeaveCriticalSection for csCritSec2’s

      380 004015be ff1524204000    call    dword ptr [badwindow!_imp__LeaveCriticalSection (00402024)]

     

    Push the address of the critical section csCritSec1 00403780onto the stack.   << CORRECT CRITICALSECTION

      374 00401595 6880374000      push    offset badwindow!csCritSec1 (00403780)

     

    Call entercriticalsection

      374 0040159a ffd6            call    esi

     

    Remember all we need to do is change what criticalsection was pushed on the stack for the leavecriticalsection call.

     

    004015b9 6868374000  (BAD)   
    00401595 6880374000  (GOOD)

    Now we just do a edit bytes on the bad instruction.

    0:001> eb 004015b9

    004015b9 68 68  << ENTER 68

    68

    004015ba 68 80  << We don’t want 68 enter 80

    80

    004015bb 37  << Now just hit enter to finish editing memory.

     

    Here is the fixed code.

     

    0:001> uf 00401570

    badwindow!hangtype2threadb [c:\source\badwindow\badwindow\badwindow.cpp @ 370]:
      370 00401570 51              push    ecx
      370 00401571 53              push    ebx
      371 00401572 8b1d7c204000    mov     ebx,dword ptr [badwindow!_imp__sprintf (0040207c)]
      371 00401578 55              push    ebp
      371 00401579 8b2d14204000    mov     ebp,dword ptr [badwindow!_imp__OutputDebugStringA (00402014)]
      371 0040157f 56              push    esi
      371 00401580 8b351c204000    mov     esi,dword ptr [badwindow!_imp__EnterCriticalSection (0040201c)]
      371 00401586 57              push    edi
      371 00401587 8b3d0c204000    mov     edi,dword ptr [badwindow!_imp__Sleep (0040200c)]
      371 0040158d c744241014000000 mov     dword ptr [esp+10h],14h

    ENTERING CORRET CRITICAL SECTION csCritSec1

      374 00401595 6880374000      push    offset badwindow!csCritSec1 (00403780)
      374 0040159a ffd6            call    esi
      376 0040159c 68fa000000      push    0FAh
      376 004015a1 ffd7            call    edi
      377 004015a3 68e4214000      push    offset badwindow!`string' (004021e4)
      377 004015a8 6880334000      push    offset badwindow!szTrace (00403380)
      377 004015ad ffd3            call    ebx
      377 004015af 83c408          add     esp,8
      378 004015b2 6880334000      push    offset badwindow!szTrace (00403380)
      378 004015b7 ffd5            call    ebp

    LEAVING CORRECT CRITICAL SECTION csCritSec1

      380 004015b9 6880374000      push    offset badwindow!csCritSec1 (00403780) 
      380 004015be ff1524204000    call    dword ptr [badwindow!_imp__LeaveCriticalSection
    (00402024)]

      382 004015c4 836c241001      sub     dword ptr [esp+10h],1
      382 004015c9 75ca            jne     badwindow!hangtype2threadb+0x25 (00401595)
      382 004015cb 5f              pop     edi
      382 004015cc 5e              pop     esi
      382 004015cd 5d              pop     ebp
      382 004015ce 5b              pop     ebx
      387 004015cf 59              pop     ecx
      387 004015d0 c3              ret

     

    Then just run the code (PRESS G then enter in the debugger) that’s it, it will work!

    Once you have proven this you can go to the developer of the application and recommend they change their code, remember to provide your debug notes.

     

    I hope you found this helpful and I welcome your feedback.

    Thank you,  Jeff-

     

     

  • Ntdebugging Blog

    Server Hangs with Event ID: 2021 and 2022

    • 3 Comments

    Hi again!  This is Tate from the CPR team and I’m going to show you how to debug a Server Service hang and the sometimes dreaded Event ID:  2021 and Event ID:  2022.  There is much Voodoo about troubleshooting these two events but never fear, it’s possible to debug quickly given the right approach.

     

     

    Recall, I’ve talked about two other common Server Service events prior, 2019 and 2020.  (Event ID: 20xx usually denotes a srv.sys series event) Also, from now on I’m going to refer to the Server Service as srv. It’s in the driver srv.sys that the bulk of the work is happening and as such that’s just what we call it.

     

    Anyway, if you have seen these events before you likely have queried the web with the ID or description and have seen the big KB317249 article.

    This is an oldie but goodie, but the problem is that, well it’s big and it would help to have perspective on how srv is supposed to work because all the root causes for the event seem quite varied…but often there is only one of two things happening here.

     

    1.)  Blocking in the I/O stack preventing normal amounts of SMB requests from getting processed in a timely fashion. (this is the most common that we see, the SMB pump failure case)

     

    2.)  Abnormally large amount of requests directed at srv(this is usually a really chatty SMB application or usually accumulation of ChangeNotify requests).

     

    I’ll show you how to determine which of these two are happening but first let’s start with the event text as a refresher…

     

    Event ID: 2022
    Source: Srv
    Description: Server was unable to find a free connection n times in the last s seconds.

    Event ID: 2021
    Source: Srv
    Description: Server was unable to create a work item %2 times in the last %3 seconds.

     

     

    What does this mean?

     

    We really have to start with an explanation of Connections and Work Items.  So, internally to srv these are just structures used to represent a connection into srv and an item of work…couldn’t be that straightforward right?  It is, when a request comes into srv it has to use an Endpoint, Connection, a Work Item, and other structures to know about where the request came from, what the work is, and how to get the response back to what client.

     

    At its core, srv is just a Kernel Mode Driver that does SMB processing.  It takes in network SMB/CIFS requests, converts them into local I/O, and pumps the return back out to the client.  But the problem here is why would you be “unable to find” or “unable to create” one of these or what breaks this pump/processing?

     

     

    But first, let’s comment on storage.  Kernel memory is not free and as mentioned in prior blogs these structures take up space in the kernel pools and or may have preconfigured srv nominal limits (see MaxWorkItems in KB317249, etc.) if you hit either of these limits the event will be logged.  So as we learned prior running out of Paged or NonPaged pool can hang a machine no problem but can also log these events should Srv not be able to allocate memory on behalf of client requests.  (See the Event ID:  2019 2020 blog for that info and troubleshooting).  What usually happens here though are these WorkItems build up due to blocking or high load and cause the Event ID:  2019 or 2020…(with the top consumer usually LSwn – LanmanServer WorkItems) and then you will see these along with the 2021/2022s.  An event party if you will with the 2021/2022 issue in srv actually exhausting kernel memory pools.

     

    Could you bump up the limits of everything in \LanmanServer via KB317249 not read further and be okay?

     

    Maybe, but it depends how much blocking you may have or how high the load gets before the point of failure.  If you are logging this even after the server has been working fine under your normal ebb and flow of load in the past, it may be likely that this article will not completely fix your problem so further work is needed.  Keep in mind too that if you bump up the limits in the registry per the above explanation you could actually hit the pool limits sooner!  The values in KB317249 are pretty conservative so they shouldn’t hurt but again it might not completely help.  Also, if you don’t like changing things until you understand more…read on!

     

     

     

    REASON #1 for Event ID:  2021 and 2022

     

    The pump mentioned above is really the Srv!WorkerThread and KQUEUE architecture.  When requests come in they are put in KQUEUEs via KeInsertQueue and as you might imagine, the processing/draining of those queues is done by the Srv!WorkerThread(s) via a call to KeRemoveQueue.  These threads are the heart of Srv and are the key to determining what is usually happening to break the SMB processing here, so we usually look there for the answer.

     

    To check in on srv you gather a kernel memory.dmp via KB244139 during the problem or with livekd (Sysinternals) perhaps…but the best way is KB244139 or by attaching the kernel debugger.  The best time to gather this dump is when the server is somewhat unresponsive via the network (SMB) or you are constantly logging the 2021/2022 events…or if the machine is completely hung after hitting these Events.

     

    With memory.dmp in hand, just load it up and do the following to get the system process’ EPROCESS address via !process 0 0 system

     

    0: kd> !process 0 0 system

    PROCESS 89067520  SessionId: 0  Cid: 0008    Peb: 00000000  ParentCid: 0000

        DirBase: 00030000  ObjectTable: 89067008  TableSize: 490.

        Image: System

     

    Now change process context using that address with .process /r /p <system EPROCESS address>

     

    0: kd> .process /r /p 89067520

    Implicit process is now 89067520

    Loading User Symbols

     

    Now let’s dump out all the system threads using !process <system EPROCESS address>

     

    0: kd> !process 89067520

    PROCESS 89067520  SessionId: 0  Cid: 0008    Peb: 00000000  ParentCid: 0000

        DirBase: 00030000  ObjectTable: 89067008  TableSize: 490.

        Image: System

    .

    .

    Dumping out the threads…

    .

    .

     

    Once the output of threads is finished, I like to search up from the bottom for “Start Address srv!WorkerThread” looking at the call stacks…

     

    Note:  There’s a shortcut here for Windows 2003 to just do a “!stacks 2 Srv!WorkerThread” right away.  This output will give you a really quick look at the threads without having to .process into system, etc.

     

    Here’s an example of a “good thread” just waiting for work to come in…sitting in KeRemoveQueue

     

     THREAD 887c14e0  Cid 8.474  Teb: 00000000  Win32Thread: 00000000 WAIT: (WrEventPairLow) KernelMode Non-Alertable

                be9fd324  Unknown

            Not impersonating

            Owning Process 89067520

            Wait Start TickCount    160832116     Elapsed Ticks: 203335

            Context Switch Count    134294              

            UserTime                  0:00:00.0000

            KernelTime                0:00:09.0343

            Start Address srv!WorkerThread (0xbea06880)

            Stack Init beb11000 Current beb10d3c Base beb11000 Limit beb0e000 Call 0

            Priority 10 BasePriority 9 PriorityDecrement 0 DecrementCount 0

     

            ChildEBP RetAddr  Args to Child

            beb10d54 8042ef5f 88a10eb8 be9fd320 00000001 nt!KiSwapThread+0x1b1

            beb10d78 bea068ec 00000000 00000000 00000000 nt!KeRemoveQueue+0x197

            beb10da8 804578c2 be9fd300 00000000 00000000 srv!WorkerThread+0x6c

            beb10ddc 8046c966 bea06880 be9fd320 00000000 nt!PspSystemThreadStartup+0x54

            00000000 00000000 00000000 00000000 00000000 nt!KiThreadStartup+0x16

     

     

    What does a “bad thread” look like?

    Here is a blocked Srv!WorkerThread that is trying to process a FindFirst SMB request…

     

    THREAD 887dd4e0  Cid 8.45c  Teb: 00000000  Win32Thread: 00000000 WAIT: (UserRequest) KernelMode Non-Alertable

                88255db8  SynchronizationEvent

            IRP List:

                86941288: (0006,01fc) Flags: 00000884  Mdl: 00000000

            Impersonation token:  e5d56990 (Level Impersonation)

            Owning Process 89067520

            Wait Start TickCount    160783273     Elapsed Ticks: 252178

    NOTE the Tick count here on this thread….it’s been waiting a while, not good.

            Context Switch Count    12993              

            UserTime                  0:00:00.0000

            KernelTime                0:00:00.0843

            Start Address srv!WorkerThread (0xbea06880)

            Stack Init bec29000 Current bec284f4 Base bec29000 Limit bec26000 Call 0

            Priority 10 BasePriority 9 PriorityDecrement 0 DecrementCount 0

     

            ChildEBP RetAddr  Args to Child

            bec2850c 8042d893 00000000 882549c8 00000000 nt!KiSwapThread+0x1b1

            bec28534 f5ae3408 88255db8 00000006 00000000 nt!KeWaitForSingleObject+0x1a3

            bec285bc f5ae4066 bec285e4 00000000 00000000 SomeFiltr+0x3408<<<<<<<making a blocking call here…

            bec285fc f5ae19fb 884ce3a0 01941288 8041eecb SomeFiltr+0x4066

            bec287a4 804c4994 884ce3a0 00000000 bec28894 SomeFiltr+0x19fb

            bec287dc 804531c2 884766c8 00000000 bec28894 nt!IopParseFile+0x44

            bec28854 804da4d8 0000051c bec28950 00000040 nt!ObpLookupObjectName+0xf8

            bec28964 804a4495 00000000 00000000 00000000 nt!ObOpenObjectByName+0xc8

            bec28a40 804a403a 8690d008 00000001 bec28b4c nt!IopCreateFile+0x407

            bec28a88 bea0282c 8690d008 00000001 bec28b4c nt!IoCreateFile+0x36

            bec28af8 bea050f6 88a0d2f8 8690d008 00000001 srv!SrvIoCreateFile+0x32a

            bec28b8c bea0acdb 88a0d2f8 00000001 8046b000 srv!SrvQueryDirectoryFile+0x1c8

            bec28c5c bea0936d 88a0d2f8 00000001 00000000 srv!SrvFind2Loop+0x171

            bec28cd0 bea07d2f 88a0d201 bea0a9c8 88a0d2f8 srv!DoFindFirst2+0x373

            bec28cd8 bea0a9c8 88a0d2f8 88a0d2f8 804636f0 srv!SrvSmbFindFirst2+0x31

            bec28cf0 bea089b7 88994f38 88a0d2f8 00000000 srv!ExecuteTransaction+0xdc

            bec28d68 be9f27ae 88a0d2f8 be9f2cde 88a0d300 srv!SrvSmbTransaction+0x5af

            bec28d70 be9f2cde 88a0d300 88a144a0 bec28da8 srv!SrvProcessSmb+0x46

            bec28d84 bea069a9 00000000 00000000 00000000 srv!SrvRestartReceive+0xa2

            bec28da8 804578c2 88a14400 00000000 00000000 srv!WorkerThread+0x129

            bec28ddc 8046c966 bea06880 88a144a0 00000000 nt!PspSystemThreadStartup+0x54

            00000000 00000000 00000000 00000000 00000000 nt!KiThreadStartup+0x16

     

    Note:  The tick count (Elapsed Ticks) was not bad for the prior “good thread” because we’ve just been waiting in the call to KeRemoveQueue for that long for work to arrive.  No work to do in that queue, no problem.  However, in this “bad thread “case we’ve grabbed a WorkItem to process and been waiting for a long time (about 1hr 5 min.) stuck while processing it, yikes.

     

     

    Why is this bad?

     

    So if the Srv!WorkerThread is the worker thread that drains the WorkItems from this list of request in the KQUEUEs and number of these threads are finite(they are), then we have a problem!  As more requests come into srv they will continue to get placed in the KQUEUEs by different event handler threads at the transport layer but with no WorkerThread available to process the queue on the back end (and considering these Connections and WorkItems are stored in Pool and the number is finite as well) this is how and usually why the pump stops and we log these events!  So not only might we run out of the configured limits in the registry (KB317249) for srv, we might run out of Non Paged Pool, and we are definitely not giving clients a timely response if at all in this state.

     

    What if I see “good” and “bad” threads in my !process of the system process?

    This is likely because of the two queues that we have in service.  There are blocking and a non-blocking queues so you may see the threads that are waiting in KeRemoveQueue are actually for the blocking queue whereas your other bad threads are blocked on things like CreateFile, Find, etc. requests…supposed to be shorter time requests, appropriately so, on the the non-blocking work queues.  In general, if you see any Srv!WorkerThread blocked in anything but KeRemoveQueue for very long, it’s bad.

     

     

    So what is the solution?

     

    Referencing the above “bad thread” for SomeFiltr.sys note that the filter driver in question has chosen to make a blocking call to KeWaitForSingleObject.  This is blocking a FindFirst request from a client but more importantly it’s consuming an Srv!WorkerThread in the process as stated….so it’s actually the filter driver making the decision to stop the processing so the fault lies with it in this case.

     

    Now it’s worth mentioning that this is a fairly simple 1st tier blocking example but suffice to say any blocking call taking too long in a srv!WorkerThread can have bad consequences.  This could have easily been any filter driver or device driver in the I/O stack (between the network and srv and the disk) which is blocking doing some work to “help” process the I/O or more difficult to debug the filter or function driver could be waiting on some other queuing mechanism that it implements…so you see how srv can encounter a dependency failure here.

    Another trick that we use is looking at all other thread stacks containing the name of the filter driver to see if it’s doing something recognizable like waiting on another Kernel api, filter, etc…in other words guessing through the black box of it’s decision to wait on this thread and making an educated guess of what work it is trying to do on another…A common case is when the blocked thread may be waiting on a response from an LPC to LSASS which is performing some security check. Then it’s necessary to jump …..

     

    In summary, since we can’t see further into the blocking here through SomeFiltr.sys and there’s no other threads in process with this driver in the call stack we’d recommend testing updating, removing, or contact the vendor of the SomeFiltr.sys filter to see if this is fixed, etc.  You can do an “lmvm SomeFiltr”  to get the time/date information for the file.

     

     

    So, what if all the threads are just waiting like in the good thread example?

     

    This could be the unfortunate side of gathering a single memory.dmp to represent a problem that may be transient.  If a fraction of a second before you hit the Ctrl+Scroll+Scroll to issue the dump the Srv!WorkerThread unblocks, it may look completely normal.  What you can do for this class of issue that has temporary blocking is either try again to get it in state, or have a look at the device stack that srv has to work through to get things done(see devstack below).  Usually you will find out what filter drivers you are running, etc. and since that is the #1 reason for this event updating them is usually the solution because we or the vendor have likely already debugged this for you and with the filter vendor’s help have likely fixed it in a later version.

     

    This is why usually the first thing we target is filter drivers before debugging anything for hangs or if a dump is not available.  By frequency this is far and away the usual root cause for hangs of many many (did I say many?) symptom types.

     

    How do I check the members of device stack that Srv.sys has to work through to get I/O requests processed, filter drivers I need to update, etc?

     

    This is cool…so did you notice that there was an Irp listed in the prior “bad thread”?

     

    Dump it out…

     

    0: kd> !irp 86941288

    Irp is active with 11 stacks 11 is current (= 0x86941460)

     No Mdl Thread 887dd4e0:  Irp stack trace. 

         cmd  flg cl Device   File     Completion-Context

     [  0, 0]   0  0 00000000 00000000 00000000-00000000   

     

                                                    Args: 00000000 00000000 00000000 00000000

     [  0, 0]   0  0 00000000 00000000 00000000-00000000   

     

                                                    Args: 00000000 00000000 00000000 00000000

     [  0, 0]   0  0 00000000 00000000 00000000-00000000   

     

                                                    Args: 00000000 00000000 00000000 00000000

     [  0, 0]   0  0 00000000 00000000 00000000-00000000   

     

                                                    Args: 00000000 00000000 00000000 00000000

     [  0, 0]   0  0 00000000 00000000 00000000-00000000   

     

                                                    Args: 00000000 00000000 00000000 00000000

     [  0, 0]   0  0 00000000 00000000 00000000-00000000   

     

                                                    Args: 00000000 00000000 00000000 00000000

     [  0, 0]   0  0 00000000 00000000 00000000-00000000   

     

                                                    Args: 00000000 00000000 00000000 00000000

     [  0, 0]   0  0 00000000 00000000 00000000-00000000   

     

                                                    Args: 00000000 00000000 00000000 00000000

     [  0, 0]   0  0 00000000 00000000 00000000-00000000   

     

                                                    Args: 00000000 00000000 00000000 00000000

     [  0, 0]   0  0 00000000 00000000 00000000-00000000   

     

                                                    Args: 00000000 00000000 00000000 00000000

    >[  0, 0]   1  0 884ce3a0 878977a8 00000000-00000000   

                           \FileSystem\SomeFitr

                                                    Args: bec28778 01000000 00070000 00000000

     

    Now grab the Device Object for SomeFiltr…

     

    0: kd> !devstack 884ce3a0

      !DevObj   !DrvObj            !DevExt   ObjectName

    > 884ce3a0  \FileSystem\SomeFiltr884ce458 

      88ba5c80  \Driver\InnocentDriver       88ba5d38 

      88ba5460  \FileSystem\Ntfs   88ba5518 

     

    Now you can see how this thing works….ntfs talks to InnocentDriver and it to SomeFiltr, etc….all these guys have to work flawlessly to not kill the srv request dependent upon this device stack.

     

    What if you didn’t have an Irp, no problem…everthing hangs off Object Manager so start at Ntfs’s device objects…This is cooler…

     

     

    0: kd> !drvobj \FileSystem\Ntfs  <<<<note that’s !DRVobj for the Driver Object

    Driver object (88dd2710) is for:

     \FileSystem\Ntfs

    Driver Extension List: (id , addr)

     

    Device Object list:

    882b4120  8717a620  88b91800  88b981e0

    88b9e600  88ba5460  88f583e0  88dd25f0

     

    0: kd> !devstack 882b4120

      !DevObj   !DrvObj            !DevExt   ObjectName

      86576020  \FileSystem\SomeFiltr865760d8  <<<<<<there you are Mr. SomeFiltr

    > 882b4120  \FileSystem\Ntfs   882b41d8 

     

    0: kd> !devstack 8717a620

      !DevObj   !DrvObj            !DevExt   ObjectName

      8792ae00  \FileSystem\SomeFiltr8792aeb8  <<<<<<and here

    > 8717a620  \FileSystem\Ntfs   8717a6d8 

    .

    .etc…

    .

    0: kd> !devstack 88dd25f0

      !DevObj   !DrvObj            !DevExt   ObjectName

      886c9020  \FileSystem\SomeFiltr886c90d8  <<<<<<<and here

      88f74f00  \FileSystem\DfsDriver88f74fb8 

      88f74020  \Driver\InnocentDriver       88f740d8 

    > 88dd25f0  \FileSystem\Ntfs   00000000  Ntfs

     

    The prudent thing to do if you aren’t sure of root cause by debugging the Srv!WorkerThreads you have captured in a memory.dmp is at least make sure all the filter drivers in these stacks are updated and perhaps even your storage drivers that are being used under ntfs serving your file share(s).

     

     

    REASON #2 for Event ID:  2021 and 2022

     

    What we have just discussed was the most common reason for hitting the Event ID:  2021 and 2022 event in blocking by a filter or function driver on the I/O stack and how to identify.  There are of course other reasons like bursts of demand and a lower than the maximum configured Srv (the nominal limits mentioned above and discussed in KB317249) or just simply a lot of long term requests that are on the system such as ChangeNotify requests.  How would you find out about these?  Analyze the network traffic.

     

    There are tons of network sniffers which allow you to gauge what type of SMB traffic is going in between servers so I won’t get into that here.  I will only say that if you have the ability to generate Statistics of top SMB commands, that’s the way to go.  You could however just filter on SMB and ballpark what kind of request and from which clients are occurring most and given your client configuration if that’s normal.  What we see here usually is a misbehaving network application in a loop, etc. hitting the file server with repeated requests for the same file, etc.  This is also a great strategy for high system process CPU issues where when you look at the thread call stacks generating the CPU via Process Explorer (sysinternals) you find the base of the activity in Srv!WorkerThreads coming and going rapidly.

     

    I will mention however there are a few known ways to reduce the cumulative load of ChangeNotifications and QueryFile and QueryPathInfo requests that Windows Clients actually generate.  You can check out the following KBs for these configuration changes on the client side:

     

    For the client SMB side’s shell:  NoRemoteChangeNotify.

    http://support.microsoft.com/kb/812669/en-us

    “An update is available in Windows to turn off directory notification SMB requests”

     

     

    For the client SMB side (redirector):  InfoCacheLevel

    (also has shell change NoRemoteRecursiveEvents )

    http://support.microsoft.com/kb/816375/en-us

    “Windows XP Explorer Pane flickers on mapped network drives”

     

     

     

    I hope you’ve enjoyed this post and hopefully the next time you see these srv events you can nail the root cause with windbg no problem!

     

    Keep that device stack updated and until next time…

  • Ntdebugging Blog

    Hung Window?, No source?, No problem!! Part 1

    • 2 Comments

    Written by Jeff Dailey

     

    Hello, my name is Jeff, I’m a escalation engineer on the Microsoft CPR Platforms team.   This blog entry is a follow on for how to detect a hung window.   This process and training lab is right out of our CPR Training curriculum.  In order to do the lab I have prepared for you, you will need to have downloaded the dumphungwindow and then badwindow.exe from my earlier blog post.  You will also need to install the debugging tools for windows.  

     

    Debugging tools:  http://www.microsoft.com/whdc/devtools/debugging/installx86.mspx

    Previous blog http://blogs.msdn.com/ntdebugging/archive/2007/05/29/detecting-and-automatically-dumping-hung-gui-based-windows-applications.aspx

     

    After you have both of these installed we can get started.  We are going to debug and figure out why the window stops repainting and does not respond.

     

    Step 1 start badwindow.exe

    Step 2 run dumphungwindow.exe

    Step 3 select Hang \ Hang Type 1 from  the BadWindow.exe menu.

     

    You should see dump hung window detect your window no processing messages and as a result it will dump the badwindow.exe process

     

    ************ OUTPUT *************

    C:\source\dumphungwindow\release>dumphungwindow.exe
    Dumps will be saved in C:\Users\jeffda\AppData\Local\Temp\
    scanning for hung windows

    **

    Hung Window found dumping process (7924) badwindow.exe
    Dumping unresponsive process
    C:\Users\jeffda\AppData\Local\Temp\HWNDDump_Day6_12_2007_Time9_53_34_Pid7924_badwindow.exe.dmp

    Dump complete

    Hung Window found dumping process (7924) badwindow.exe
    Dumping unresponsive process
    C:\Users\jeffda\AppData\Local\Temp\HWNDDump_Day6_12_2007_Time9_53_46_Pid7924_badwindow.exe.dmp

    Dump complete

    Hung Window found dumping process (7924) badwindow.exe
    Dumping unresponsive process
    C:\Users\jeffda\AppData\Local\Temp\HWNDDump_Day6_12_2007_Time9_53_56_Pid7924_badwindow.exe.dmp

    Dump complete*

    ************ OUTPUT *************

     

    Step 4 create a local symbol directory at C:\websymbols

    Step 5 set your symbol path under file \ symbols in windbg to SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols

    See http://www.microsoft.com/whdc/devtools/debugging/debugstart.mspx for details.

     

    Step 6 start windbg select file\open crash dump and select the first dump file.

    Your initial output should look like this.

     

    Microsoft (R) Windows Debugger  Version 6.7.0001.0

    Copyright (c) Microsoft Corporation. All rights reserved.

     

    ***** WARNING: Your debugger is probably out-of-date.

    *****          Check http://dbg for updates.

     

    Loading Dump File [C:\Users\jeffda\AppData\Local\Temp\HWNDDump_Day6_12_2007_Time9_53_34_Pid7924_badwindow.exe.dmp]

    User Mini Dump File with Full Memory: Only application data is available

     

    Symbol search path is: Symbol search path is: SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols

    Executable search path is:

    Windows Vista Version 6000 MP (2 procs) Free x86 compatible

    Product: WinNt, suite: SingleUserTS

    Debug session time: Tue Jun 12 09:53:35.000 2007 (GMT-4)

    System Uptime: 11 days 18:41:43.089

    Process Uptime: 0 days 0:00:32.000

    ....................................

    Loading unloaded module list

    .

    eax=00000000 ebx=00000000 ecx=00000000 edx=00000000 esi=004e3750 edi=004d9c60

    eip=777fb2b2 esp=0017ee18 ebp=0017ee44 iopl=0         nv up ei pl nz na po nc

    cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000202

    ntdll!NtAlpcSendWaitReceivePort+0x12:

    777fb2b2 c22000          ret     20h

     

    Step 7 from the debugger prompt (Locate a prompt at the bottom of windbg that has a 0:000> next to it.
    Type ~* k

    You will most likely see output similar to this.

     

    .  0  Id: 1ef4.17a8 Suspend: 0 Teb: 7efdd000 Unfrozen

    ChildEBP RetAddr 
    0017ee14 76f584c5 ntdll!NtAlpcSendWaitReceivePort+0x12
    0017ee44 76f5899d rpcrt4!LRPC_CASSOCIATION::AlpcSendWaitReceivePort+0x2f
    0017ee78 76f58514 rpcrt4!LRPC_BASE_CCALL::DoSendReceive+0x48
    0017ee90 76f57785 rpcrt4!LRPC_BASE_CCALL::SendReceive+0x2c
    0017eea0 76f5766b rpcrt4!I_RpcSendReceive+0x25
    0017eeb4 76f5763a rpcrt4!NdrSendReceive+0x31
    0017eec0 76ff0146 rpcrt4!NdrpSendReceive+0x9
    0017f2c8 756e5ff3 rpcrt4!NdrClientCall2+0x18f
    0017f2e0 756e5ef9 dnsapi!R_ResolverQuery+0x19
    0017f34c 756e5d34 dnsapi!Query_PrivateExW+0x257
    0017f388 756e6025 dnsapi!Query_Shim+0xbb
    0017f3b0 74f461b4 dnsapi!DnsQuery_W+0x20
    0017f3e0 74f4611c mswsock!SaBlob_Query+0x2d
    0017f424 74f459db mswsock!Rnr_DoDnsLookup+0x129
    0017f6c0 76d75a72 mswsock!Dns_NSPLookupServiceNext+0x25c
    0017f6d8 76d75a52 ws2_32!NSPROVIDER::NSPLookupServiceNext+0x17
    0017f6f4 76d75994 ws2_32!NSPROVIDERSTATE::LookupServiceNext+0x1c
    0017f720 76d7590a ws2_32!NSQUERY::LookupServiceNext+0xa4
    0017f764 76d7b11f ws2_32!WSALookupServiceNextW+0x74
    0017f7b4 76d7ab77 ws2_32!WSALookupServiceNextA+0x52

       1  Id: 1ef4.f80 Suspend: 0 Teb: 7efda000 Unfrozen

    ChildEBP RetAddr 
    0275fef0 76e41220 ntdll!ZwWaitForSingleObject+0x15
    0275ff60 76e41188 kernel32!WaitForSingleObjectEx+0xbe
    *** ERROR: Symbol file could not be found.  Defaulted to export symbols for FwcWsp.dll -
    0275ff74 55620611 kernel32!WaitForSingleObject+0x12
    WARNING: Stack unwind information not available. Following frames may be wrong.
    0275ff88 5561f10c FwcWsp!RatDumpStack+0x1bd
    0275ffa0 76eb19f1 FwcWsp!RatDbgPrint+0x156
    0275ffac 7782d109 kernel32!BaseThreadInitThunk+0xe
    0275ffec 00000000 ntdll!_RtlUserThreadStart+0x23

     

       2  Id: 1ef4.1d04 Suspend: 0 Teb: 7efd7000 Unfrozen
    ChildEBP RetAddr 
    028cfe90 76e4edb5 ntdll!ZwWaitForMultipleObjects+0x15
    028cff2c 76e430c3 kernel32!WaitForMultipleObjectsEx+0x11d
    028cff48 5560c841 kernel32!WaitForMultipleObjects+0x18
    WARNING: Stack unwind information not available. Following frames may be wrong.
    028cff84 5560c9c0 FwcWsp!WspDisable+0x1b47
    028cffa0 76eb19f1 FwcWsp!WspDisable+0x1cc6
    028cffac 7782d109 kernel32!BaseThreadInitThunk+0xe
    028cffec 00000000 ntdll!_RtlUserThreadStart+0x23

     

    Note the [NUMBER] Id: indicates the thread number, to the right of this you have the process id and thread id > PROCESS 1ef4.1d04 < THREAD | THREAD STATE > Suspend: 0 Teb: 7efd7000 Unfrozen

     

    Each of these threads represents a call stack.  The most recent call is at the TOP of the stack.  As each call is made the stack grows larger.  In this case thread 0 is not showing us all it’s stack.  We can get more of the stack be specifying the number of frames we would like to see.

     

    In this case my current thread is 0 so I will issue a k 123 command asking for as many 123 calls to be displayed.

     

    0:000> k 123

    ChildEBP RetAddr 

    0017ee14 76f584c5 ntdll!NtAlpcSendWaitReceivePort+0x12

    0017ee44 76f5899d rpcrt4!LRPC_CASSOCIATION::AlpcSendWaitReceivePort+0x2f

    0017ee78 76f58514 rpcrt4!LRPC_BASE_CCALL::DoSendReceive+0x48

    0017ee90 76f57785 rpcrt4!LRPC_BASE_CCALL::SendReceive+0x2c

    0017eea0 76f5766b rpcrt4!I_RpcSendReceive+0x25

    0017eeb4 76f5763a rpcrt4!NdrSendReceive+0x31

    0017eec0 76ff0146 rpcrt4!NdrpSendReceive+0x9

    0017f2c8 756e5ff3 rpcrt4!NdrClientCall2+0x18f

    0017f2e0 756e5ef9 dnsapi!R_ResolverQuery+0x19

    0017f34c 756e5d34 dnsapi!Query_PrivateExW+0x257

    0017f388 756e6025 dnsapi!Query_Shim+0xbb

    0017f3b0 74f461b4 dnsapi!DnsQuery_W+0x20

    0017f3e0 74f4611c mswsock!SaBlob_Query+0x2d

    0017f424 74f459db mswsock!Rnr_DoDnsLookup+0x129

    0017f6c0 76d75a72 mswsock!Dns_NSPLookupServiceNext+0x25c

    0017f6d8 76d75a52 ws2_32!NSPROVIDER::NSPLookupServiceNext+0x17

    0017f6f4 76d75994 ws2_32!NSPROVIDERSTATE::LookupServiceNext+0x1c

    0017f720 76d7590a ws2_32!NSQUERY::LookupServiceNext+0xa4

    0017f764 76d7b11f ws2_32!WSALookupServiceNextW+0x74

    0017f7b4 76d7ab77 ws2_32!WSALookupServiceNextA+0x52

    0017f7e0 76d7dba3 ws2_32!getxyDataEnt+0xbd

    0017fa1c 0040147d ws2_32!gethostbyname+0xb4

    0017fbc8 004012fc badwindow!hangtype1+0x3d [c:\source\badwindow\badwindow\badwindow.cpp @ 310]

    0017fc24 772a87af badwindow!WndProc+0xdc [c:\source\badwindow\badwindow\badwindow.cpp @ 274]

    0017fc50 772a8936 user32!InternalCallWinProc+0x23

    0017fcc8 772a8a7d user32!UserCallWinProcCheckWow+0x109

    0017fd2c 772a8ad0 user32!DispatchMessageWorker+0x380

    0017fd3c 004010fb user32!DispatchMessageW+0xf

    0017ff0c 00401817 badwindow!wWinMain+0xfb [c:\source\badwindow\badwindow\badwindow.cpp @ 124]

    0017ffa0 76eb19f1 badwindow!__tmainCRTStartup+0x150 [f:\sp\vctools\crt_bld\self_x86\crt\src\crtexe.c @ 589]

    0017ffac 7782d109 kernel32!BaseThreadInitThunk+0xe

    0017ffec 00000000 ntdll!_RtlUserThreadStart+0x23

     

    As you can see our current thread is the WinMain thread.   Note this from the call badwindow!wWinMain,  this winmain is processing messages, you can see it calling DispatchMessageW this goes back into User32, and user32 has to rout this message for the proper WinProc that is registered for the associated Window.  

     

    Looking further up the stack you see our winproc being called badwindow!WndProc.  Remember it’s critical that your WinProc stays available to process incoming messages.  If it makes a blocking call or runs in a loop you will not be able to process incoming messages and your window will not repaint and your application will appear to be hung.

     

    If you look deeper yet into the call stack you can see that the WndProc calls hangtype1, this calls gethostbyname.    http://msdn2.microsoft.com/en-us/library/ms738524.aspx 

     

    Looking at the docs, Gethostbyname takes one parameter being the server name.  Looks dump it out and see what we are looking for.

     

    0:000> kv 123

    ChildEBP RetAddr  Args to Child             

    0017ee14 76f584c5 00000424 40020000 004d9c60 ntdll!NtAlpcSendWaitReceivePort+0x12 (FPO: [8,0,0])

    0017ee44 76f5899d 00020000 004d9c60 004e3638 rpcrt4!LRPC_CASSOCIATION::AlpcSendWaitReceivePort+0x2f (FPO: [Non-Fpo])

    0017ee78 76f58514 0017eed8 0017ef04 0017f2e8 rpcrt4!LRPC_BASE_CCALL::DoSendReceive+0x48 (FPO: [Non-Fpo])

    0017ee90 76f57785 0017eed8 0017ef04 0017eeb4 rpcrt4!LRPC_BASE_CCALL::SendReceive+0x2c (FPO: [Non-Fpo])

    0017eea0 76f5766b 0017eed8 756e54f0 0017ef04 rpcrt4!I_RpcSendReceive+0x25 (FPO: [Non-Fpo])

    0017eeb4 76f5763a 0017ef04 004d9d18 76ff0146 rpcrt4!NdrSendReceive+0x31

    0017eec0 76ff0146 59a28201 00000000 00000000 rpcrt4!NdrpSendReceive+0x9 (FPO: [0,0,0])

    0017f2c8 756e5ff3 756e54f0 756e5aec 0017f2e8 rpcrt4!NdrClientCall2+0x18f

    0017f2e0 756e5ef9 00000000 0279091c 00000001 dnsapi!R_ResolverQuery+0x19 (FPO: [Non-Fpo])

    0017f34c 756e5d34 0279091c 00000001 14000000 dnsapi!Query_PrivateExW+0x257 (FPO: [Non-Fpo])

    0017f388 756e6025 00000001 0279091c 00000001 dnsapi!Query_Shim+0xbb (FPO: [Non-Fpo])

    0017f3b0 74f461b4 0279091c 00000001 14000000 dnsapi!DnsQuery_W+0x20 (FPO: [Non-Fpo])

    0017f3e0 74f4611c 0279091c 00000001 14000000 mswsock!SaBlob_Query+0x2d (FPO: [Non-Fpo])

    0017f424 74f459db 0279091c 0031c2e0 00307118 mswsock!Rnr_DoDnsLookup+0x129 (FPO: [Non-Fpo])

    0017f6c0 76d75a72 027908b8 00000000 0017f794 mswsock!Dns_NSPLookupServiceNext+0x25c (FPO: [Non-Fpo])

    0017f6d8 76d75a52 00307270 027908b8 00000000 ws2_32!NSPROVIDER::NSPLookupServiceNext+0x17 (FPO: [Non-Fpo])

    0017f6f4 76d75994 0031bc18 00000000 0017f794 ws2_32!NSPROVIDERSTATE::LookupServiceNext+0x1c (FPO: [Non-Fpo])

    0017f720 76d7590a 00307130 00000000 0017f794 ws2_32!NSQUERY::LookupServiceNext+0xa4 (FPO: [Non-Fpo])

    0017f764 76d7b11f 00307118 00000000 0017f794 ws2_32!WSALookupServiceNextW+0x74 (FPO: [Non-Fpo])

    0017f7b4 76d7ab77 00307118 00000000 0017f7ec ws2_32!WSALookupServiceNextA+0x52 (FPO: [Non-Fpo])

    0017f7e0 76d7dba3 00307118 0000013c 0017fa38 ws2_32!getxyDataEnt+0xbd (FPO: [Non-Fpo])

    0017fa1c 0040147d 0017fa38 00000000 00070520 ws2_32!gethostbyname+0xb4 (FPO: [Non-Fpo])

    0017fbc8 004012fc 00401220 0017fbfc 00401220 badwindow!hangtype1+0x3d (FPO: [Uses EBP] [0,101,0]) (CONV: cdecl) [c:\source\badwindow\badwindow\badwindow.cpp @ 310]

    0017fc24 772a87af 00070520 00000111 00008003 badwindow!WndProc+0xdc (CONV: stdcall) [c:\source\badwindow\badwindow\badwindow.cpp @ 274]

    0017fc50 772a8936 00401220 00070520 00000111 user32!InternalCallWinProc+0x23

    0017fcc8 772a8a7d 00000000 00401220 00070520 user32!UserCallWinProcCheckWow+0x109 (FPO: [Non-Fpo])

    0017fd2c 772a8ad0 00401220 00000000 0017ff0c user32!DispatchMessageWorker+0x380 (FPO: [Non-Fpo])

    0017fd3c 004010fb 0017fd54 00403938 00000001 user32!DispatchMessageW+0xf (FPO: [Non-Fpo])

    0017ff0c 00401817 00400000 00000000 004a0f84 badwindow!wWinMain+0xfb (CONV: stdcall) [c:\source\badwindow\badwindow\badwindow.cpp @ 124]

    0017ffa0 76eb19f1 7efde000 0017ffec 7782d109 badwindow!__tmainCRTStartup+0x150 (FPO: [Non-Fpo]) (CONV: cdecl) [f:\sp\vctools\crt_bld\self_x86\crt\src\crtexe.c @ 589]

    0017ffac 7782d109 7efde000 0017206b 00000000 kernel32!BaseThreadInitThunk+0xe (FPO: [Non-Fpo])

    0017ffec 00000000 00401987 7efde000 00000000 ntdll!_RtlUserThreadStart+0x23 (FPO: [Non-Fpo])

    0:000> da 0017fa38

    0017fa38  "AServerThatDoesNotExist14"

     

    Generally speaking it’s not a good idea to make any call that could possibly block in your winproc. 

     

    Let’s look deeper into what the caller (hangtype1) is really doing, Feel free to look at the C code as it was included with the project.  I like to look at the assembler,  it’s nice practice for those times you don’t have the source.  We can take the return address from the gethostbyname call 0040147d to figure out how it was called.  A neat trick for disassembling the entire function based on an address within that functions code range is to use the unassembled function or UF command.  In the following sample output I’ll simply UF the return address of the gethostbyname call.   I’ll then walk trough and annotate the disassembly.  Then we will look at the C source.

     

    For reference on intel cpu programming see http://developer.intel.com/design/pentium/manuals/24319001.pdf

     

    0:000> uf 0040147d

    We are making run for locals on the stack, 402 dec bytes.  If you look at the source you will see this is a sznamebuff[400], int I, and pointer to a host structure.
    badwindow!hangtype1 [c:\source\badwindow\badwindow\badwindow.cpp @ 298]:
    298 00401440 81ec94010000    sub     esp,194h 

     

    This is done for  stack checking to ensure none is trying to exploit our stack.
    298 00401446 a100304000      mov     eax,dword ptr [badwindow!__security_cookie (00403000)]
    298 0040144b 33c4            xor     eax,esp
    298 0040144d 89842490010000  mov     dword ptr [esp+190h],eax

    Save ebx register state on the stack
    298 00401454 53              push    ebx 

    Get the function pointer for sprint from the import table.
    299 00401455 8b1d7c204000    mov     ebx,dword ptr [badwindow!_imp__sprintf (0040207c)]

    Save the stack base / frame pointer.
    299 0040145b 55              push    ebp

    Get the address of gethostbyname from the import table and save it in EBP.
    299 0040145c 8b2d20214000    mov     ebp,dword ptr [badwindow!_imp__gethostbyname (00402120)]

    Save ESI and EDI on the stack.
    299 00401462 56              push    esi
    299 00401463 57              push    edi

    Zero out ESI

    299 00401464 33f6            xor     esi,esi
    badwindow!hangtype1+0x26 [c:\source\badwindow\badwindow\badwindow.cpp @ 306]:

    Push ESI on the stack and the first pass it will be zero == 0, this is the 3rd parameter to the call we are about to make.
    306 00401466 56              push    esi

    We are loading the effective address of something on the stack, we will see what it is used for when we use EAX, remember that.
    306 00401467 8d442414        lea     eax,[esp+14h]

    Hum what is this?.
    badwindow!`string' (004021ac) Well when in doubt dump it out.
    0:000> db 004021ac
    004021ac  41 53 65 72 76 65 72 54-68 61 74 44 6f 65 73 4e  AServerThatDoesN
    004021bc  6f 74 45 78 69 73 74 25-64 00 00 00 57 65 20 61  otExist%d...We a
    004021cc  72 65 20 69 6e 20 68 61-6e 67 74 79 70 65 32 74  re in hangtype2t
    004021dc  68 72 65 61 64 61 00 00-57 65 20 61 72 65 20 69  hreada..We are i
    004021ec  6e 20 68 61 6e 67 74 79-70 65 32 74 68 72 65 61  n hangtype2threa
    004021fc  64 62 00 00 48 00 00 00-00 00 00 00 00 00 00 00  db..H...........
    0040220c  00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00  ................
    0040221c  00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00  ................

    Ok so we are pushing our print formatting string for sprintf this is the second parameter to the call we are about to make.

    306 0040146b 68ac214000      push    offset badwindow!`string' (004021ac)

    Ok here we are pushing EAX,  remember eax is just an address on our stack, and in this case the 1st parameter passed to sprint,  it’s our BUFFER.  szNameBuffer to be exact.
    306 00401470 50              push    eax

     

    Call sprint (Remember we stored it’s address in ebx)
    306 00401471 ffd3            call    ebx

     

    Stacks grown down or negative as they get larger, so adding in this case is cleaning up the stack from our sprintf call
    306 00401473 83c40c          add     esp,0Ch

     

    Get the address of the szNameBuffer
    307 00401476 8d4c2410        lea     ecx,[esp+10h]

     

    Push the address of the name buffer on the stack.
    307 0040147a 51              push    ecx

     

    Call gethostbyname (Rember we stored it’s address in ebp the)  NOTE when you do a return the value is stored in EAX (REMBER THIS)
    307 0040147b ffd5            call    ebp

     

    Push 250 dec FA hex on the stack.  (This is our return address from get host by name we see on the stack)
    310 0040147d 68fa000000      push    0FAh 

     

    Store EAX in EDI We need to save it because EAX will be over written when we come back from Sleep and we may want to use it later.  Eax is the return value from gethostbyname.
    310 00401482 8bf8            mov     edi,eax

     

    Call sleep it takes 1 parameter and that values was just pushed on the stack 250ms ¼ second.
    310 00401484 ff150c204000    call    dword ptr [badwindow!_imp__Sleep (0040200c)]

    Increment ESI this is where we are storing the value int i.   It was zero the first time through now it’s ESI+1 if we loop it will grow
    311 0040148a 83c601          add     esi,1

     

    Compare ESI we know this is in I to 0x64h or 100dec.
    312 0040148d 83fe64          cmp     esi,64h

    Where we have a jump if greater then, that is to say if ESI > 64h or 100dec jump to 00401496
    312 00401490 7f04            jg      badwindow!hangtype1+0x56 (00401496)

    Now we test EDI, this is the return value from gethostbyname.  TEST is a logical compare.  So if you test EDI and EDI and EDI = 0
    304 00401492 85ff            test    edi,edi

    If EDI is ZERO 0 jump to this address je is jump if ZERO this take us to the TOP of the loop.

    304 00401494 74d0            je      badwindow!hangtype1+0x26 (00401466)

     

    The rest of this is us cleaning up the stack and getting ready to drop out of the function.  We can stop here.
    318 00401496 8b8c24a0010000  mov     ecx,dword ptr [esp+1A0h]
    318 0040149d 5f              pop     edi
    318 0040149e 5e              pop     esi
    318 0040149f 5d              pop     ebp
    318 004014a0 5b              pop     ebx
    318 004014a1 33cc            xor     ecx,esp
    318 004014a3 e8c5010000      call    badwindow!__security_check_cookie (0040166d)
    318 004014a8 81c494010000    add     esp,194h
    318 004014ae c3              ret

     

    Here is the C code snipped for hangtype1

     

    void hangtype1(void)

    {

    int i =0;

    char szNameBuff[400];

    struct hostent * myhost=NULL;

     

          // Lets see if any of my favorte machines are on the network.

          while (!myhost)

          {

                sprintf(szNameBuff,"AServerThatDoesNotExist%d",i);

                myhost = gethostbyname (szNameBuff);

                // We are hitting network resource in a loop.

                // Let's not hit the server with to many request at once. We will pause 250ms.

                Sleep(250);

                i++;

                if(i>100)

                {

                      break;

                }

          }

         

    }

     

    So what is happening,  if you recall we saw AServerThatDoesNotExist14 on the stack.  We know from this loop and the format string passed to sprint that this loop will run to 100 and will sleep 250ms with each pass.  That’s 25 seconds total just in sleeps not to mention the network stall on a bad server name.   I realize this code is somewhat contrived but it’s based on the types of things we actually see and troubleshoot every day.

     

    So how do we get this work done without stalling the application?.  The answer is rather simple.  We go multithreaded.   You can look for hangtype1fix in the project.  Instread of calling hangtype1 from our windproc directly we declare a threadproc called hangtype1fixthread

     

    void __cdecl hangtype1fixthread(void *)

    {

           // our blocking function.

           hangtype1();

    }

     

    Then from our winproc instead of calling hangtype1 we call another function called hangtype1fix that creates a new thread that in turn executes out hangtype1 code in another thread context so our winproc can continue to execute and process incoming messages.  The being thread call is asynchronously .  The thread is create immediately and your thread continues to run.

     

    void hangtype1fix(void)

    {

           _beginthread(hangtype1fixthread, 12000, NULL);

    }

     

    I hope you found this blog informative and I welcome your feedback.

     

    Thanks  Jeff-

     

     

  • Ntdebugging Blog

    Where the rubber meets the road, or in this case the hardware meets the probe.

    • 1 Comments

    Hi my name is Bob, I’m an Escalation engineer with the Microsoft critical problem resolution team.  We had one of our readers ask how much we deal with hardware.  Well in response we recently worked on an interesting problem I thought I would share with you.  In this case it was interesting because it demonstrated an issue specific to multi-processor machines and something that probably sounded innocuous to the driver writer who caused the  problem.

     

    What was the problem?

     

    The problem was that everything worked except the time in the system was not being updated. The RTC had stopped.   We spent time reviewing the semantics of how the RTC should function and even hooked up an oscilloscope to the RTC on the motherboard and were able to turn it off and on with the debugger by writing out the correct port.  The trace on the scope validated our understanding of what had to be written to the port to turn the clock off.    One we had a clear understanding of this we understood what we were looking for in a driver that might cause the problem.  Note the clock typically fires every 10ms so you do not need a fast scope to do this.

     

     

    Special keyboard driver written

     

    In order to catch a dump in state we had to modify the keyboard driver.  It would cause an “Int 3” in its ISR instead of calling “bug check” for an E2 stop.    Because the RTC was not running the idle thread was not getting quantums and as a result a normal break in would not work.  However the system would respond to ISRs. 

     

    What was found

     

    All RTC interrupts were stopped - the clock was not running.  We checked all the obvious places to see if the RTC was disabled.

     

    We looked at the ICR in the I/O APIC.  This is the interrupt configuration register. There is a register for every interrupt pin on the APIC. These registers are used to tell the APIC what vector to send the processor so the processor can service the interrupt.  It also has configuration information about level and if it edge triggered and a mask bit.  The mask bit was not set.

    Below is a reenactment.

     

    0: kd> ed ffd01000

    ffd01000 00000034 20 ß Select register 20 which is pin 8.

     

     

    0: kd> ed ffd01010

    ffd01010 000008d1    ß Contents ß Vect D1 Bit 16 the interrupt mask bit is not set so it is OK.

     

     

    Next check the RTC status register which are I/O ports 70 and 71.  Port 70 is the address port.  Port 71 is the data port.  This information is from an old BIOS book.

     

    0: kd> ob 70 b       ß ‘B’ is a control register.   

    0: kd> ib 71

    00000071: 42         ß The value 42 means that the RTC is enabled.  Bit 6 is the enable.

     

     

    So what was it?

     

    The way the RTC works is it will interrupt at a certain interval.  When the interrupt is serviced, the status register has to be read to start it again.

     

    We discovered another driver that was reading  the clock, this was done by disassembly various drivers in the dump and looking for the I/O operation to ports 70 or 71.  The lower addresses selected by port 70 will yield the time when read.  That is what the driver was doing.

     

    You would think that simply reading the time in this way would not hurt anything.  However, in a multi-processor system, access has to be serialized.  There is only one set of I/O ports for the system.

     

    Since it takes two accesses to perform an operation on the clock, one address & one data, a collision between two processors can cause undetermined results.

     

    Below is a timing diagram of the issue;

     

     

    Proc 0 running OS RTC handler                         Proc 1 running XYZ driver

     

      T1          Set register select to status register

     

      T2                                                                                          Set register select to read time

     

      T3          Read status register to restart clock

     

     

    So at T3 the OS RTC handler reads the wrong register so the clock does not start. 

     

     

    Conclusion

     

    I thought this was an interesting problem that illustrates the need for serialization.  And it demonstrates what to look out for in a multi-proc environment.  You always have to think “What happens if the other processor does…….”

     

    For more information consult any BIOS developer manuals you may have lying around or this link we found http://www.geocities.com/SiliconValley/Campus/1671/docs/rtc.htm

    See the “Status Register C” section, “All future interrupts are disabled until this register is read - your interrupt handler *must* do it.”

Page 1 of 1 (8 items)