• Ntdebugging Blog

    How Windows Starts Up (part 1 of 4)

    • 11 Comments

    Hi folks, my name is David and I’m an Escalation Engineer for Microsoft.  Since Bryan wrote about How Windows Shuts Down, I thought it would be a good idea to cover How Windows Starts Up.

     

    This information applies specifically to Windows 2000, Windows XP, and Windows Server 2003.  I will blog separately on the boot changes in Windows Vista.

     

    Additional information about this topic may be found in Chapter 5 of Microsoft Windows Internals by Russinovich and Solomon and also on TechNet:  http://technet.microsoft.com/en-us/library/bb457123.aspx

     

    Methodology

    The key to understanding and troubleshooting system startup issues is to accurately ascertain the point of failure.  To facilitate this determination, I have divided the boot process into the following four phases.

     

                    1. Initial

                    2. Boot Loader

                    3. Kernel

                    4. Logon

     

    Over the next few weeks, I’ll be describing each of these phases in detail and providing appropriate guidance for troubleshooting relevant issues for each phase. 

     

    Initial Phase

    The Initial Phase of boot is divided into the Power-On Self Test (POST) and Initial Disk Access. 

     

    Power-On Self Test

    POST activities are fully implemented by the computer’s BIOS and vary by manufacturer.  Please refer to the technical documentation provided with your hardware for details.  However, regardless of manufacturer, certain generic actions are performed to ensure stable voltage, check RAM, enable interrupts for system usage, initialize the video adapter, scan for peripheral cards and perform a memory test if necessary.  Depending on manufacturer and configuration, a single beep usually indicates a successful POST. 

     

    Troubleshooting the POST

    ü  Make sure you have the latest BIOS and firmware updates for the hardware installed in the system.

    ü  Replace the CMOS battery if it has failed.

    ü  Investigate recently added hardware (RAM, Video cards, SCSI adapters, etc.)

    ü  Remove recently added RAM modules.

    ü  Remove all adapter cards, then replace individually, ensuring they are properly seated.

    ü  Move adapter cards to other slots on the motherboard.

    ü  If the computer still will not complete POST, contact your manufacturer.

     

    Initial Disk Access

    Depending on the boot device order specified in your computer’s BIOS, your computer may attempt to boot from a CD-ROM, Network card, Floppy disk, USB device or a hard disk.  For the purposes of this document, we’ll assume that we’re booting to a hard disk since that is the most common scenario.

     

    Before we discuss the sequence of events that occur during this phase of startup, we need to understand a little bit about the layout of the boot disk.  The structure of the hard disk can be visualized this way:  (Obviously, these data areas are not to scale)

     

     

     

    Hard disks are divided into Cylinders, Heads and Sectors.  A sector is the smallest physical storage unit on a disk and is almost always 512 bytes in size.  For more information about the physical structure of a hard disk, please refer to the following Resource Kit chapter:  http://www.microsoft.com/resources/documentation/windowsnt/4/server/reskit/en-us/resguide/diskover.mspx

     

    There are two disk sectors critical to starting the computer that we’ll be discussing in detail:  

     

    ·         Master Boot Record (MBR)

    ·         Boot Sector

     

    The MBR is always located at Sector 1 of Cylinder 0, Head 0 of each physical disk.   The Boot Sector resides at Sector 1 of each partition.  These sectors contain both executable code and the data required to run the code.

     

    Please note that there is some ambiguity involved in sector numbering.  Cylinder/Head/Sector (CHS) notation begins numbering at C0/H0/S1.  However, Absolute Sector numbering begins numbering at zero.  Absolute Sector numbering is often used in disk editing utilities such as DskProbe.  These differences are discussed in the following knowledge base article:

    Q97819  Ambiguous References to Sector One and Sector Zero

    http://support.microsoft.com/default.aspx?scid=kb;EN-US;Q97819

     

    Now that we have that straight, when does this information get written to disk?  The MBR is created when the disk is partitioned.  The Boot Sector is created when you format a volume.  The MBR contains a small amount of executable code called the Master Boot Code, the Disk Signature and the partition table for the disk.  At the end of the MBR is a 2-byte structure called a Signature Word or End of Sector marker, which should always be set to 0x55AA.  A Signature Word also marks the end of an Extended Boot Record (EBR) and the Boot Sector.

     

    The Disk Signature, a unique number at offset 0x1B8, identifies the disk to the operating system.  Windows 2000 and higher operating systems use the disk signature as an index to store and retrieve information about the disk in the registry subkey HKLM\System\MountedDevices.

     

    The Partition Table is a 64-byte data structure within the MBR used to identify the type and location of partitions on a hard disk.  Each partition table entry is 16 bytes long (four entries max).  Each entry starts at a predetermined offset from the beginning of the sector as follows:

     

                                    Partition 1                           0x1BE    (446)

                                    Partition 2                           0x1CE    (462)

                                    Partition 3                           0x1DE    (478)

                                    Partition 4                           0x1EE    (494)

     

    The following is a partial example of a sample MBR showing three partition table entries in-use and one empty:

     

    000001B0:                                             80 01

    000001C0: 01 00 07 FE BF 09 3F 00 - 00 00 4B F5 7F 00 00 00

    000001D0: 81 0A 07 FE FF FF 8A F5 – 7F 00 3D 26 9C 00 00 00

    000001E0: C1 FF 05 FE FF FF C7 1B – 1C 01 D6 96 92 00 00 00

    000001F0: 00 00 00 00 00 00 00 00 – 00 00 00 00 00 00

     

    Let’s take a look at each of the fields of a partition table entry individually.  For each of these explanations, I’ll use the first partition table entry above and highlight the relevant section.  Keep in mind that these values are little-endian.

     

    Byte Offset

    Field Length

    Sample Value

    Field Description

    0x1BE

    8 Bits

    0x80

    Boot Indicator

                    00=Do Not Use for Booting

                    80=Active partition (Use for Booting)

    000001B0:                                             80 01

    000001C0: 01 00 07 FE BF 09 3F 00 - 00 00 4B F5 7F 00     

     

    0x1BF

    8 Bits

    0x01

    Starting Head

    000001B0:                                             80 01

    000001C0: 01 00 07 FE BF 09 3F 00 - 00 00 4B F5 7F 00

     

    0x1C0

    0x1C1

    2 Byte

    Word

    0x01

    Starting Sector

    Only the first 6 bits are used.  The upper 2 bits of this byte are used by the Starting Cylinder field.

    0x00

    Starting Cylinder

    Uses 1 byte + 2 bits from the Starting Sector field to make up the cylinder value.  The Starting Cylinder is a 10-bit number with a maximum value of 1023.

    000001B0:                                             80 01

    000001C0: 01 00 07 FE BF 09 3F 00 - 00 00 4B F5 7F 00

     

     

     

    0x1C2

    8 Bits

    0x07

    System ID

    Defines the volume type.  0x07=NTFS

    000001B0:                                             80 01

    000001C0: 01 00 07 FE BF 09 3F 00 - 00 00 4B F5 7F 00

     

    Other Possible System ID Values:  

    Partition Type

    ID Value

    0x01

    FAT12 primary partition or logical drive (fewer than 32,680 sectors in the volume)

    0x04

    FAT16 partition or logical drive (32,680–65,535 sectors or 16 MB–33 MB)

    0x05

    Extended partition

    0x06

    BIGDOS FAT16 partition or logical drive (33 MB–4 GB)

    0x07

    Installable File System (NTFS partition or logical drive)

    0x0B

    FAT32 partition or logical drive

    0x0C

    FAT32 partition or logical drive using BIOS INT 13h extensions

    0x0E

    BIGDOS FAT16 partition or logical drive using BIOS INT 13h extensions

    0x0F

    Extended partition using BIOS INT 13h extensions

    0x12

    EISA partition

    0x42

    Dynamic disk volume

    0x86

    Legacy FT FAT16 disk *

    0x87

    Legacy FT NTFS disk *

    0x8B

    Legacy FT volume formatted with FAT32 *

    0x8C

    Legacy FT volume using BIOS INT 13h extensions formatted with FAT32 *

     

    Partition types denoted with an asterisk (*) indicate that they are also used to designate non-FT configurations such as striped and spanned volumes.

     

    0x1C3

    8 Bits

    0xFE

    Ending Head      (0xFE=254 decimal)

    000001B0:                                             80 01

    000001C0: 01 00 07 FE BF 09 3F 00 - 00 00 4B F5 7F 00

     

    0x1C4

    0x1C5

    2 Byte

    Word

    0xBF

    Ending Sector

    As with the Starting Sector, it only uses the first 6 bits of the byte.  The upper 2 bits are used by the Ending Cylinder field.

    0x09

    Ending Cylinder

    Uses 1 byte in addition to the upper 2 bits from the Ending Sector field to make up the cylinder value.  The Ending Cylinder is a 10-bit number with a maximum value of 1023.

    000001B0:                                             80 01

    000001C0: 01 00 07 FE BF 09 3F 00 - 00 00 4B F5 7F 00

     

    0x1C6

    32 Bits

    0x3F000000

    Relative Sectors

    The offset from the beginning of the disk to the beginning of the volume, counting by sectors.

    0x0000003F = 63

    000001B0:                                             80 01

    000001C0: 01 00 07 FE BF 09 3F 00 - 00 00 4B F5 7F 00

     

     

    0x1CA

    32 Bits

    0X4BF57F00

    Total Sectors

    The total number of sectors in the volume.

    0x007FF54B = 8,385,867 Sectors = 4GB

    000001B0:                                             80 01

    000001C0: 01 00 07 FE BF 09 3F 00 - 00 00 4B F5 7F 00

     

    Are you with me so far?  Good!  Now, Cylinder/Sector encoding can be a bit tricky, so let’s take a closer look. 

     

    Cylinder - Sector Encoding

    16

    15

    14

    13

    12

    11

    10

    9

    8

    7

    6

    5

    4

    3

    2

    1

    Cylinder bits 1-8

    Cyl bits

    9 & 10

    Sector value

    Bits 1-6

     

    As you can see, the Sector value occupies the lower 6 bits of the word and the Cylinder occupies the upper 10 bits of the word.  In our example, the starting values for Cylinder and Sector are 01 00.  Values in the MBR use reverse-byte ordering, which is also referred to as ‘little-endian’ notation.  Therefore, we swap the bytes and find that our starting values are Cyl 0, Sec 1.

     

    Our ending values are more interesting:  BF 09.  First, we swap the bytes and obtain a hex value of 0x09BF.  This value in binary notation is 100110111111.  The following table illustrates how we derive the correct partition table values from these bytes:

     

    Example:  BF 09

    8

    7

    6

    5

    4

    3

    2

    1

    10

    9

    6

    5

    4

    3

    2

    1

    0

    0

    0

    0

    1

    0

    0

    1

    1

    0

    1

    1

    1

    1

    1

    1

    10 Cylinder value bits 1-8

    Cyl bits

    9 & 10

    Sector value

    bits 1-6

     

    The 6 low bits are all set to 1, therefore our Sector value is 111111 or 63.   You can see above how the bits are arranged for the Cylinder value.  The value above is 1000001001 (521).  Since both Cylinder and Head values begin numbering at zero, we have a total of 522 Cylinders and 255 Heads represented here.  This gives us an ending CHS value of:  522 x 255 x 63 = 8,385,930 sectors. 

     

    Subtracting the starting CHS address (Cylinder 0, Head 1, Sector 1) (63) gives us the total size of this partition:  8,385,867 sectors or 4GB.  We can verify this number by comparing it to the Total Sectors represented in the partition table:  4B F5 7F 00.  Applying reverse-byte ordering gives us 00 7F F5 4B which equals 8,385,867 sectors.

     

    So, now that we have an understanding of what is contained within the structures on the disk, let’s look at the sequence of events that occur.  Remember, this is just after POST has successfully completed.

     

    1.       The motherboard ROM BIOS attempts to access the first boot device specified in the BIOS.  (This is typically user configurable and can be edited using the BIOS configuration utility.)

     

    2.       The ROM BIOS reads Cylinder 0, Head 0, and Sector 1 of the first boot device.

     

    3.       The ROM BIOS loads that sector into memory and tests it.

    a.       For a floppy drive, the first sector is a FAT Boot Sector.

    b.      For a hard drive, the first sector is the Master Boot Record (MBR).

     

    4.       When booting to the hard drive, the ROM BIOS looks at the last two signature bytes of Sector 1 and verifies they are equal to 55AA.

    a.       If the signature bytes do not equal 55AA the system assumes that the MBR is corrupt or the hard drive has never been partitioned.  This invokes BIOS Interrupt 18, which displays an error that is BIOS-vendor specific such as “Operating System not found”.

    b.      If the BIOS finds that the last two bytes are 55AA, the MBR program executes.

     

    5.       The MBR searches the partition table for a boot indicator byte 0x80 indicating an active partition.

    a.       If no active partition is found, BIOS Interrupt 18 is invoked and a BIOS error is displayed such as “Operating System not found”.

    b.      If any boot indicator in the MBR has a value other than 0x80 or 0x00, or if more than one boot indicator indicates an active partition (0x80), the system stops and displays “Invalid partition table”.

    c.       If an active partition is found, that partition’s Boot Sector is loaded and tested.

     

    6.       The MBR loads the active partition’s Boot Sector and tests it for 55AA.

    a.       If the Boot Sector cannot be read after five retries, the system halts and displays “Error loading operating system”.

    b.      If the Boot Sector can be read but is missing its 55AA marker, “Missing operating system” is displayed and the system halts.

    c.       The bootstrap area is made up of a total of 16 sectors (0-15).  If sector 0 for the bootstrap code is valid, but there is corruption in sectors 1-15, you may get a black screen with no error.  In this case, it may be possible to transplant sectors 1-15 (not Sector 0) from another NTFS partition using DskProbe.

    d.      If the Boot Sector is loaded and it passes the 55AA test, the MBR passes control over to the active partition’s Boot Sector.

     

    7.       The active partition’s Boot Sector begins executing and looks for NTLDR.  The contents of the Boot Sector are entirely dependent on the format of the partition.   For example, if the boot partition is a FAT partition, Windows writes code to the Boot Sector that understands the FAT file system.  However, if the partition is NTFS, Windows writes NTFS-capable code.  The role of the Boot Sector code is to give Windows just enough information about the structure and format of a logical drive to enable it to read the NTLDR file from the root directory.   The errors at this point of the boot process are file system specific.  The following are possible errors with FAT and NTFS Boot Sectors.

    a.       FAT:  In all cases it displays “press any key to restart” after either of the following errors.

    (1)    If NTLDR is not found “NTLDR is missing” is displayed

    (2)    If NTLDR is on a bad sector, “Disk Error” displays.

    b.      NTFS:  In all cases it displays “Press CTRL+ALT+DEL to restart” after any of the following errors.

    (1)    If NTLDR is on a bad sector, “A disk read error occurred” is displayed.  This message may also be displayed on Windows 2000 or higher systems if Extended Int13 calls are required, but have been disabled in the CMOS or SCSI BIOS.  This behavior may also be seen if an FRS (File Record Segment) cannot be loaded or if any NTFS Metadata information is corrupt.

    (2)    If NTLDR is not found, “NTLDR is missing” displays.

    (3)    If NTLDR is compressed, “NTLDR is compressed” displays.

     

    8.       Once NTLDR is found, it is loaded into memory and executed.  At this point the Boot Loader phase begins.

     

    Troubleshooting the Initial Phase (Initial Disk Access)

    If you are unable to boot Windows and the failure is occurring very early in the boot process, begin by creating a Windows boot floppy and using it to attempt to boot the system.

     

                    Q119467 How to Create a Bootable Disk for an NTFS or FAT Partition

                    http://support.microsoft.com/default.aspx?scid=kb;EN-US;Q119467

     

    If you are unable to boot to Windows NT with a floppy, skip to the section titled “Unable to boot using floppy”.

     

    If you can successfully start the computer from the boot disk, the problem is limited to the Master Boot Record, the Boot Sector, or one or more of the following files:  NTLDR, NTDetect.com, or Boot.ini.  After Windows is running, you should immediately back up all data before you attempt to fix the Boot Sector or Master Boot Record.

     

    1.       Run a current virus scanning program to verify that no virus is present.

    2.       Select the method of troubleshooting based upon the error generated.

    ·         Operating System not found

    Typical BIOS-specific error indicating no 55AA signature on the MBR.  This must be repaired manually using a disk editor such as DskProbe.

    Warning:  Running FixMBR or FDISK /MBR in this case will clear the partition table.

    ·         Invalid Partition Table

    May be repaired using DiskProbe – check for multiple 0x80 boot indicators.

    ·         Error Loading Operating System

    This error indicates an invalid Boot Sector.  Use the “Repairing the Boot Sector” section below.

    ·         Missing Operating System

    This error indicates that the 55AA signature is missing from the Boot Sector.  This may be replaced using DskProbe or by using the “Repairing the Boot Sector” section below.

    ·         NTLDR is missing

    Use the “Repairing the NTLDR file” section below.

    ·         Disk error (FAT only)

    Indicates that NTLDR is located on a bad sector.

    ·         A disk read error occurred (NTFS only)

    Indicates that NTLDR is on a bad sector.  Use the “Repairing the NTLDR file” section below.

    ·         NTLDR is compressed

    Indicates that NTLDR is compressed.  Uncompress or replace NTLDR.  Note:  This error is rare.

    ·         Computer boots to a black screen with blinking cursor

    Could be an issue with the MBR or the active partition’s boot code (Q228734).  Use the appropriate section below.

     

    Repairing the MBR

    Use the FIXMBR command from the Windows Recovery Console to re-write the boot code located in the first 446 bytes of the MBR, but to leave the partition table intact.

     

    326215  How To Use the Recovery Console on a Windows Server 2003-Based Computer That Does Not Start

    http://support.microsoft.com/default.aspx?scid=kb;EN-US;326215

     

    WARNING:  DO NOT use this method if you are receiving “Operating System not found” or other BIOS-specific error.  FIXMBR will zero out the partition table if the 55AA signature is missing from the MBR.  The data will be unrecoverable.

     

    Repairing the Boot Sector

    Use the FIXBOOT command from the Windows Recovery Console to write a new Boot Sector to the system partition.

     

    326215  How To Use the Recovery Console on a Windows Server 2003-Based Computer That Does Not Start

    http://support.microsoft.com/default.aspx?scid=kb;EN-US;326215

     

    If you are not able to repair the Boot Sector by using this method, you may repair it manually using the following knowledge base article:

     

    Q153973 Recovering NTFS Boot Sector on NTFS Partitions

    http://support.microsoft.com/default.aspx?scid=kb;EN-US;Q153973

     

    Repairing the NTLDR file

    Use the Recovery Console to copy NTLDR from \i386 directory of the Windows CD-ROM to the root of the hard drive.

     

    Unable to boot using floppy

    If you are unable to start the computer with a boot floppy and you are not receiving an error message, then the problem is most likely with your partition tables.  If invalid partitions are present and you are unable to start your computer with a boot disk, formatting the disk, reinstalling Windows and restoring from backup may be your only option.  It may be possible to recreate the partition table using DskProbe if a backup of the partition information is available.  It also may be possible to manually reconstruct the partition table however this is outside the scope of this blog post.  We may cover this later.

     

    If you do not have a current backup of the data on the computer, as a last resort a data recovery service may be able to recover some of your information. 

     

    Helpful knowledge base articles:

                    Q272395 Error Message: Boot Record Signature AA55 Not Found

                    http://support.microsoft.com/default.aspx?scid=kb;EN-US;Q272395

     

                    Q155892 Windows NT Boot Problem: Kernel File Is Missing From the Disk

                    http://support.microsoft.com/default.aspx?scid=kb;EN-US;Q155892

     

                    Q228004 Changing Active Partition Can Make Your System Unbootable

                    http://support.microsoft.com/default.aspx?scid=kb;EN-US;Q228004

     

                    Q155053 Black Screen on Boot

                    http://support.microsoft.com/default.aspx?scid=kb;EN-US;Q155053

     

                    Q314503 Computer Hangs with a Black Screen When You Start Windows

                    http://support.microsoft.com/default.aspx?scid=kb;EN-US;Q314503

     

                    Q153973 Recovering NTFS Boot Sector on NTFS Partitions

                    http://support.microsoft.com/default.aspx?scid=kb;EN-US;Q153973

     

    Okay boys & girls, that’s it for now.  Next time, I’ll continue with the boot sequence and cover the Boot Loader phase.  This is where things really get interesting… J

     

  • Ntdebugging Blog

    Easily Resolving an Event Viewer Error using a Process Memory Dump

    • 11 Comments

    My name is Ryan Mangipano (ryanman) and I am a Sr. Support Escalation Engineer at Microsoft.  Today I will be blogging about how I used the SOS  .Net Framework debugging extension (and !analyze -v) to easily troubleshoot a .Net Framework exception. This exception was preventing Event Viewer from displaying properly. Event Viewer was returning an error that provided very little information about what was actually causing the issue. I will demonstrate how, in this case, it was very easy to use windbg to obtain information about what went wrong. I did not have to perform a live debug on the issue.  Instead, I used a process dump to obtain very exact information, which was returned by the debugger, relating to the root cause. I was then able to use Process Monitor to identify the file that needed to be examined. These actions led me to the source of the problem, which was easily corrected. Also, the issue discussed in this blog was easily reproduced on Windows Server 2008. This means that you should be able to practice this debug exercise, on your non-production Windows 2008 SP1 Server, for learning purposes if you are interested in doing so.

     

    Issue Reported:  The following error was encountered when opening eventvwr.msc (Event Viewer) on a Windows 2008 Server system:

     

                           "MMC could not create the snap-in."

    MMC could not create the snap-in.

    The snap-in might not have been installed correctly

    Name: Event Viewer

                   CLSID: FX:{b05566ad-fe9c-4363-be05-7a4cbb7cb510}

    clip_image001

    First Step- Research & Data Gathering: After ensuring I first understood the problem reported, I searched for known issues. I found out that we have seen this error before.  It may occur when the following registry key gets deleted or corrupted:

     

                    HKLM\software\Microsoft\MMC\SnapIns\FX:{ b05566ad-fe9c-4363-be05-7a4cbb7cb510}

     

    I had the customer export this key and found that it was not corrupted in any way. I verified that all data was as expected

     

    Next,  a memory dump of the mmc.exe process was collected.  The mmc.exe process is used to host the eventvwr.msc snap-in. This was easily obtained using the built in Windows 2008 Server "Windows Task Manager" feature: "Create Dump File" . If you have several mmc console instances executing on your system, you can use the Task Manager context menu shortcuts "Switch To" and "Go To Process" to help you to identify the correct instance.

    clip_image002clip_image003

    clip_image004

     

    Note: We also collected a process monitor logfile during the startup of eventvwr.msc. This log file later proved very helpful in resolving the issue (as I will show below). Process monitor can be obtained at the following URL:  http://technet.microsoft.com/en-us/sysinternals/bb896645.aspx

     

     

     Now let's have a look at the debug.

     

    1.          First, I navigated Windows Explorer to the location of the dump file and double-clicked it to open it in windbg.exe.

    clip_image005

    It opened in windbg because I had previously run the command windbg -IA, which associates .dmp files with windbg. You can read more about the command line options in windbg in the help file that is included with the debugging tools.

    clip_image002[5]

    clip_image002[9]

    2.            I noticed the following output from the debugger after it loaded the dump file:

     

    This dump file has an exception of interest stored in it.

    The stored exception information can be accessed via .ecxr.

    (ff8.a2c): CLR exception - code e0434f4d (first/second chance not available)

     

     

    3.            Next, I wanted to ensure my symbol path was set correctly. I could have set it using the .sympath command:

     

    0:011> .sympath SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols

    Symbol search path is: SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols

    Expanded Symbol search path is: srv*c:\websymbols*http://msdl.microsoft.com/download/symbols

    0:011> .sympath

    Symbol search path is: SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols

    Expanded Symbol search path is: srv*c:\websymbols*http://msdl.microsoft.com/download/symbols

     

    However, when your goal is to simply point to the default symbol server, .symfix is a very nice shortcut. It prevents one from having to try to remember the URL.  Here’s the syntax: 

     

    0:011> .symfix c:\websymbols

    0:011> .sympath

    Symbol search path is: SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols

     

    4.            To ensure that I didn't waste time reviewing the wrong data, I performed a quick check to ensure that we collected a dump of the requested snap-in.

     

    0:005> !peb

    PEB at 000007fffffdb000

    ....

       CommandLine:  '"C:\Windows\system32\mmc.exe" "C:\Windows\system32\eventvwr.msc" '

     

    You could alternatively dump the  CommandLine from the nt!_PEB using the dt command

     

    0:005> dt nt!_PEB ProcessParameters->CommandLine 000007fffffdb000

    ntdll!_PEB

       +0x020 ProcessParameters              :

          +0x070 CommandLine                    : _UNICODE_STRING ""C:\Windows\system32\mmc.exe" "C:\Windows\system32\eventvwr.msc" "

     

    5.            Next, I then dumped out all of the threads in this process and found the following thread contained a stack that was raising a .Net Framework exception

     

    0:011> ~* kL

    ... (ommitted the non-relevent threads)

     

    # 11  Id: ff8.a2c Suspend: 1 Teb: 7ffd3000 Unfrozen

    ChildEBP RetAddr 

    0691f03c 7343a91c kernel32!RaiseException+0x58

    0691f09c 7343d81a mscorwks!RaiseTheExceptionInternalOnly+0x2a8

    *** WARNING: Unable to verify checksum for MMCEx.ni.dll

    0691f140 6bfe0b5a mscorwks!JIT_Rethrow+0xbf

    *** WARNING: Unable to verify checksum for mscorlib.ni.dll

    0691f1e8 69926cf6 MMCEx_ni+0xd0b5a

    0691f1f4 6993019f mscorlib_ni+0x216cf6

    0691f208 69926c74 mscorlib_ni+0x22019f

    0691f220 733d1b4c mscorlib_ni+0x216c74

    0691f230 733e21b1 mscorwks!CallDescrWorker+0x33

    0691f2b0 733f6501 mscorwks!CallDescrWorkerWithHandler+0xa3

    0691f3e8 733f6534 mscorwks!MethodDesc::CallDescr+0x19c

    0691f404 733f6552 mscorwks!MethodDesc::CallTargetWorker+0x1f

    0691f41c 7349d803 mscorwks!MethodDescCallSite::CallWithValueTypes+0x1a

    0691f604 733f845f mscorwks!ThreadNative::KickOffThread_Worker+0x192

    0691f618 733f83fb mscorwks!Thread::DoADCallBack+0x32a

    0691f6ac 733f8321 mscorwks!Thread::ShouldChangeAbortToUnload+0xe3

    0691f6e8 733f84ad mscorwks!Thread::ShouldChangeAbortToUnload+0x30a

    0691f710 7349d5d4 mscorwks!Thread::ShouldChangeAbortToUnload+0x33e

     

    6.            Out of curiosity, I also ran the Get Last Error command

     

    0:011> !gle

    LastErrorValue: (Win32) 0 (0) - The operation completed successfully.

    LastStatusValue: (NTSTATUS) 0 - STATUS_WAIT_0

     

    7.            After this, I ran analyze -v to see what helpful information the debugger would provide. The debugger did output exception information but informed me that I needed to use the x86 debugger instead.

     

    0:011> !analyze -v

    ...

                            EXCEPTION_RECORD:  ffffffff -- (.exr 0xffffffffffffffff)

                            ExceptionAddress: 771a42eb (kernel32!RaiseException+0x00000058)

                              ExceptionCode: e0434f4d (CLR exception)

                              ExceptionFlags: 00000001

                            NumberParameters: 1

                            Parameter[0]: 80131604

                            MANAGED_BITNESS_MISMATCH:

                                                    Managed code needs matching platform of sos.dll for proper analysis. Use 'x86' debugger.

                            FAULTING_THREAD:  00000a2c

                                                    PRIMARY_PROBLEM_CLASS:  CLR_EXCEPTION

     

    8.            I fired up the x86 debugger and loaded the appropriate version of the SOS .Net Framework debugger extension.  This extension ships in the Operating System along with the .Net Framework.  On most occasions, I would have initiated the loading of the extension through the use of the following syntax:

     

     0:011> .load C:\Windows\Microsoft.NET\Framework\v2.0.50727\sos.dll

     

    OR

     

    0:011> .load  c:\Windows\Microsoft.NET\Framework64\v2.0.50727\sos.dll

     

    However, once you realize that managed debugging will be necessary and that you need the services of the SOS extension, it’s best to use the .loadby command rather than .load.  This is due to the fact that the version of SOS must match the version of the CLR loaded into that process.  Here’s the recommended syntax: 

     

                            0:011>.loadby sos mscorwks

     

    I always verify that my extensions are loaded properly by using the  .chain command.

     

    0:011> .chain

    ...   Extension DLL chain:

                            C:\Windows\Microsoft.NET\Framework\v2.0.50727\sos.dll: image 2.0.50727.1434, API 1.0.0, built Wed Dec 05 22:42:38 2007

     

    9.            Running !help printed out the following helpful information about the SOS extension since sos.dll was at the top of the .chain output:

     

    0:011> !help

    -------------------------------------------------------------------------------

    SOS is a debugger extension DLL designed to aid in the debugging of managed

    programs. Functions are listed by category, then roughly in order of

    importance. Shortcut names for popular functions are listed in parenthesis.

    Type "!help <functionname>" for detailed info on that function.

     

    Object Inspection                  Examining code and stacks

    -----------------------------      -----------------------------

    DumpObj (do)                       Threads

    DumpArray (da)                     CLRStack

    DumpStackObjects (dso)             IP2MD

    DumpHeap                           U

    DumpVC                             DumpStack

    GCRoot                             EEStack

    ObjSize                            GCInfo

    FinalizeQueue                      EHInfo

    PrintException (pe)                COMState

    TraverseHeap                       BPMD

     

    10.            Using the exception address, displayed by the debugger when opening the dump, and the !pe command listed above, I obtained more information about the exception:

                   

    0:011> !pe 771a42eb

    Invalid object

    There are nested exceptions on this thread. Run with -nested for details

     

    0:011> !pe -nested 771a42eb

    Invalid object

     

    Nested exception -------------------------------------------------------------

    Exception object: 040a676c

    Exception type: System.Reflection.TargetInvocationException

    Message: Exception has been thrown by the target of an invocation.

    InnerException: System.Reflection.TargetInvocationException, use !PrintException 040a6a20 to see more

    StackTrace (generated):

        SP       IP       Function

     

    StackTraceString: <none>

    HResult: 80131604

     

    0:011> !PrintException 040a6a20

    Exception object: 040a6a20

    Exception type: System.Reflection.TargetInvocationException

    Message: Exception has been thrown by the target of an invocation.

    InnerException: System.Configuration.ConfigurationErrorsException, use !PrintException 040a6cf8 to see more

    StackTrace (generated):

    <none>

    StackTraceString: <none>

    HResult: 80131604

    There are nested exceptions on this thread. Run with -nested for details

     

    0:011> !PrintException 040a6cf8

    Exception object: 040a6cf8

    Exception type: System.Configuration.ConfigurationErrorsException

    Message: Configuration system failed to initialize

    InnerException: System.Configuration.ConfigurationErrorsException, use !PrintException 040a7174 to see more

    StackTrace (generated):

    <none>

    StackTraceString: <none>

    HResult: 80131902

    There are nested exceptions on this thread. Run with -nested for details

     

    0:011> !PrintException 040a7174

    Exception object: 040a7174

    Exception type: System.Configuration.ConfigurationErrorsException

    Message: Unrecognized configuration section system.web/myInvalidData  

    InnerException: <none>

    StackTrace (generated):

    <none>

    StackTraceString: <none>

    HResult: 80131902

    There are nested exceptions on this thread. Run with -nested for details

     

    11.            Based on the exception information listed above, it appeared that a .Net Framework configuration section, system.web, contained an invalid configuration section named myInvalidData  inside of it. I then re-ran !analyze -v against the dump again (now that I had loaded the x86 debugger) and found that !analyze -v will load the sos.dll extension and even run the !pe extension automatically. It then automatically displayed the exception record information for me as well. Also, notice that the thread listed by !analyze -v matches the thread I examined earlier.

     

                0:011> !analyze -v

                    ...

                            EXCEPTION_MESSAGE:  Unrecognized configuration section system.web/myInvalidData.

                                                    MANAGED_OBJECT_NAME:  System.Configuration.ConfigurationErrorsException

                                                    FAULTING_THREAD:  00000a2c

     

    0:011> ~

       0  Id: ff8.c84 Suspend: 1 Teb: 7ffdf000 Unfrozen

       1  Id: ff8.96c Suspend: 1 Teb: 7ffde000 Unfrozen

       2  Id: ff8.d10 Suspend: 1 Teb: 7ffdd000 Unfrozen

       3  Id: ff8.d94 Suspend: 1 Teb: 7ffdc000 Unfrozen

       4  Id: ff8.a14 Suspend: 1 Teb: 7ffda000 Unfrozen

       5  Id: ff8.fbc Suspend: 1 Teb: 7ffd9000 Unfrozen

       6  Id: ff8.f88 Suspend: 1 Teb: 7ffd8000 Unfrozen

       7  Id: ff8.a64 Suspend: 1 Teb: 7ffd6000 Unfrozen

       8  Id: ff8.bf8 Suspend: 1 Teb: 7ffd5000 Unfrozen

       9  Id: ff8.d24 Suspend: 1 Teb: 7ffd4000 Unfrozen

      10  Id: ff8.ff0 Suspend: 1 Teb: 7ffd7000 Unfrozen

    . 11  Id: ff8.a2c Suspend: 1 Teb: 7ffd3000 Unfrozen

     

    12. At this point I was interested in identifying the source of this unrecognized configuration. Instead of engaging our .Net support team, I started with a quick search using www.live.com for     

     

                "unrecognized configuration section" system.web site:microsoft.com

                 

    This returned the following results http://search.live.com/results.aspx?q=%22unrecognized+configuration+section%22+system.web+site%3Amicrosoft.com&form=QBRE

     

    By quickly reviewing some of the hits returned, I found that others had encountered this exception in their own applications.  This is due to invalid entries in the various .config  files used in .Net. Looking through the posts, different configuration file names and paths were observed.

     

    So, I opened up the process monitor logfile to see which configuration files we were reading data from. I added filter criterion to match entries from the mmc.exe process, the TID from the FAULTING_THREAD listed in the exception data, path data containing .config,  and a successful status result. It's best to be as specific as possible.

     

    clip_image002[7]

    I found that we were reading in a large amount of settings over and over again from the .net Framework global configuration file:             

                                    c:\Windows\Microsoft.NET\Framework\v2.0.50727\CONFIG\machine.config

                    (on x64 this would be     C:\Windows\Microsoft.NET\Framework64\v2.0.50727\CONFIG\machine.config)

    clip_image002[11]

    Final Step- Putting it all together, Reproducing the issue, & confirming resolution : Using notepad, a quick search of the suspect xml file (C:\Windows\Microsoft.NET\Framework64\v2.0.50727\CONFIG\machine.config) on my system revealed a <system.web> section. At this point, I suspected that this section contained an invalid section which may have been related to the problem. To verify this, and since I like to break things, I added an invalid configuration setting <myInvalidData/> to my global configuration file. Doing so, I successfully reproduced the issue on my system. I then contacted the customer and asked if they had by any chance added any settings under the <system.web> in the configuration file: c:\Windows\Microsoft.NET\Framework\v2.0.50727\CONFIG\machine.config.

     

    The customer informed me that, per the request of their ASP.net developer, they had in fact added settings to that section of the file.  By researching http://msdn.microsoft.com/en-us/library/system.web.aspx and the schema documentation at http://msdn.microsoft.com/en-us/library/dayb112d.aspx, we were able to determine that the settings that were present in this file should not have been present inside of <system.web> . The settings were moved to the proper location per the developer and the issue was resolved.

     

    Here are the steps I used to reproduce the issue in case you are attempting to replicate this at home-

     

    A. Using notepad, open the following configuration file on a non-production Windows Server 2008 SP1 system:

         (please make a backup copy first in case you make a mistake)  

       c:\Windows\Microsoft.NET\Framework\v2.0.50727\CONFIG\machine.config

         OR   (Open the version that matches the architecture of your platform )

       C:\Windows\Microsoft.NET\Framework64\v2.0.50727\CONFIG\machine.config

     

    clip_image012

    B. Find the section <system.web> in this file (you can use the find function in notepad):

    clip_image013

    C.  Add the following line directly after <system.web> as shown in the example below:

                         <myInvalidData/>   

    clip_image014

    D. Save the file and then open eventvwr.msc and verify that the following error is displayed:

    clip_image015

    Conclusion

    Hopefully this blog has demonstrated an example of how you can use the "create dump file" feature of Windows 2008, windbg, and other related tools in an attempt to gain more specific data when your error message is not revealing the source of the problem. Feel free to post any questions, comments, or concerns.

     

  • Ntdebugging Blog

    NTDebugging Puzzler 0x00000005 (Better late than never)

    • 10 Comments

    Hello NTDebuggers, from time to time we see the following problem.  It’s another access violation, and the debug notes below are from a minidump.   

     

    Here is what we need to know…

     

    ·         Generally speaking what happened to cause this AV?

    ·         What method you would use to isolate root cause of the failure?

     

    There are a lot of ways to do this.  We look forward to hearing your approach.

     

    We will post our methods and answer at the end of the week.   If you need anything please let us know.

     

    -------------------------------------------

     

    Microsoft (R) Windows Debugger  Version 6.8.0001.0

    Copyright (c) Microsoft Corporation. All rights reserved.

     

     

    Loading Dump File [D:\test123.dmp]

    User Mini Dump File: Only registers, stack and portions of memory are available

     

    0:000> k 123

    ChildEBP RetAddr 

    0017f93c 75e4edb5 ntdll!ZwWaitForMultipleObjects+0x15

    0017f9d8 75e430c3 kernel32!WaitForMultipleObjectsEx+0x11d

    0017f9f4 75ef2084 kernel32!WaitForMultipleObjects+0x18

    0017fa60 75ef22b1 kernel32!WerpReportFaultInternal+0x16c

    0017fa74 75ebbe60 kernel32!WerpReportFault+0x70

    0017fb00 7732d15a kernel32!UnhandledExceptionFilter+0x1c1

    0017fb08 773000c4 ntdll!_RtlUserThreadStart+0x6f

    0017fb1c 77361d05 ntdll!_EH4_CallFilterFunc+0x12

    0017fb44 772eb6d1 ntdll!_except_handler4+0x8e

    0017fb68 772eb6a3 ntdll!ExecuteHandler2+0x26

    0017fc10 772cee57 ntdll!ExecuteHandler+0x24

    0017fc10 10011127 ntdll!KiUserExceptionDispatcher+0xf

    *** ERROR: Module load completed but symbols could not be loaded for crash3.exe

    WARNING: Frame IP not in any known module. Following frames may be wrong.

    0017ff40 0040104a 0x10011127

    0017ffa0 75eb19f1 crash3+0x104a

    0017ffac 7732d109 kernel32!BaseThreadInitThunk+0xe

    0017ffec 00000000 ntdll!_RtlUserThreadStart+0x23

     

    0:000> lm

    start    end        module name

    00400000 0040d000   crash3     (no symbols)          

    6c250000 6c288000   odbcint    (deferred)            

    6c290000 6c2f5000   odbc32     (deferred)            

    72a00000 72a86000   comctl32   (deferred)            

    74820000 749b4000   comctl32_74820000   (deferred)            

    75240000 75251000   samlib     (deferred)            

    75260000 75281000   ntmarta    (deferred)            

    754b0000 75510000   secur32    (deferred)            

    75510000 75570000   imm32      (deferred)            

    75700000 75790000   gdi32      (deferred)            

    757a0000 75870000   user32     (deferred)            

    758a0000 758a6000   nsi        (deferred)            

    758b0000 759f4000   ole32      (deferred)            

    75a00000 75aaa000   msvcrt     (deferred)            

    75ab0000 75ba0000   rpcrt4     (deferred)            

    75ba0000 75c1d000   usp10      (deferred)            

    75c20000 75c75000   shlwapi    (deferred)            

    75d60000 75e27000   msctf      (deferred)            

    75e30000 75f40000   kernel32   (pdb symbols)  

    76140000 76189000   Wldap32    (deferred)            

    76190000 7624f000   advapi32   (deferred)            

    76250000 76d1e000   shell32    (deferred)             

    76d20000 76d94000   comdlg32   (deferred)            

    76da0000 76dcd000   ws2_32     (deferred)            

    77280000 77287000   psapi      (deferred)            

    77290000 77299000   lpk        (deferred)            

    772b0000 77400000   ntdll      (pdb symbols)     

     

    Good luck and happy debugging.

     

    Jeff-

     

     


    [Update: our answer. Posted 5/13/2008]

     

    We enjoyed seeing different people’s approaches on this week’s puzzler.   This was a simple module unload.  We loaded a lib, did a GetProcAddress, freed the lib, and called the function.  The dump was a mini dump created via .dump /m C:\dump file.   There are various ways this type of scenario may arise.   Obviously someone could unload a lib, but why?    In most cases I’ve seen, it was due to a ref count problem in a com object.  Poor accounting leading to one too many decrements, and the dll will get unloaded causing a simple crash footprint.

     

    There are quite a few ways to track this down.  First of all, if you had the debugger attached and got a full dump or /ma dump you would have seen the loaded module list.  This would have been a dead giveaway and part of why we did the .dump /m.  There are other options you can enable that make tracking of module loads easy under the debugger.  I personally like “loader snaps” if I’m trying to track down module load shenanigans.   To enable this, just go into the image section of the gflags tool and enable loader snaps for the exe in question.  Now attach a debugger and watch the mode load and GetProcAddress details scroll by.

     

    Yet another popular approach is to use process monitor.  This tool is not only easy to set up, but it also gives you great logs with call stacks and other details such as registry accesses.  

     

    This puzzler provided the bare minumum data required.  We did not give you much to go on because sometimes in real debugging scenarios you have to work with a lack of data.   I really liked how many people questioned the source of the dump file.  It really shows how familiar you all are with the various dump types.

     

    Great work!

     





  • Ntdebugging Blog

    Storport ETW Logging to Measure Requests Made to a Disk Unit

    • 10 Comments

    Greetings NTDEBUGGING community, Bob here again, and today I would like to let everyone know about a new feature implemented in the STORPORT.SYS binaries released in the following KB articles:

    ·         KB 979764 (Windows Server 2008)

    ·         KB 978000 (Windows Server 2008 R2)

     

    Beginning with these new versions, it is now possible to measure timing statistics for requests made to a system’s disk unit.  These measurements are taken at the lowest possible level of OS interaction with the storage adapter hardware, making it much easier to diagnose storage performance issues. The measurements are taken from the port driver, STORPORT.SYS, via ETW (Event Tracing for Windows).  Windows drivers are implemented in a layered architecture, so in our case the STORPORT driver interfaces directly with the adapter or miniport driver.  The STORPORT driver is responsible for managing requests and queues, and providing other upper level functions.  Each miniport driver is specific to a particular storage adapter card, and knows how to deliver requests to the disk unit over the transport medium, like fibre channel for instance.

    Configuring and Starting STORPORT ETW Logging

    To measure storage performance, the user enters a threshold value via the GUI interface. (This can be scripted as well; see KB 978000 for details.)  When an I/O time is equal to or greater than the entered threshold, an event 13 is written to the ETW log.  

    Here is a walkthrough to enable this feature via the GUI:

    Click Start, type “perfmon”, and press ENTER to start Performance Monitor.

    image

     

    Expand “Data Collector Sets” and select “Event Trace Sessions”.

    image

     

    Right-Click on “Event Trace Sessions” and select “New -> Data Collector Set”. 

    image

     

    The following dialog will appear:

    image

     

    Give the new data collector set a name in the dialog box.  In this example I called it “Storport Performance Logging”. Choose the “Create manually (Advanced)” option then click Next to see the following dialog.

    image

     

    After clicking “Add...” on the dialog above the following list of providers will appear. Select “Microsoft-Windows-StorPort” and click OK.

    image

     

    When the dialog below opens, select “Keywords (Any)” then click Edit.

    image

     

    Check the box for IOPerfNotification, and then click OK. Note:  For Windows Server 2008 SP1 and SP2, some properties may not be present (e. g. Reset, PortNotification and PauseResume). 

    image

     

     

    Next, select “Filter” and click Edit…

    image

     

    This is where we enter our time threshold.  Remember that timing values greater than or equal to this filter value will be logged to the trace file.

     

    Select “Binary” and check the “Filter enabled” box

    image

    In the “Filter data” field, enter the threshold value in the number of milliseconds. This must be entered in little endian format. Please refer to this table for the most common entries.

    Decimal

    Hexadecimal

    Binary(little endian)

    1ms

    5ms

    10ms

    15ms

    1

    5

    A

    F

    01 00 00 00 00 00 00 00

    05 00 00 00 00 00 00 00

    0A 00 00 00 00 00 00 00

    0F 00 00 00 00 00 00 00

     

    Note: The “Filter type” value will always remain 0 as in the example above showing an entry for 1 ms. 

     

    WARNING - The whole data line needs to be filled in when entering the threshold. For demonstration purposes here is the WRONG WAY to enter 256 ms. This will not work because of the missing 00 entries on the line.

    image 

    Below is the correct way to enter the value (256ms) Note: The Filter value may have to be re-entered for next run because it could get reset.

    image

    After clicking OK, you will see a new Event Trace session (screenshot below).  Right-Click the new Event Trace session and select “Properties”

    image

     

    Next we need to set a location for the trace file.  Ensure you have sufficient disk space in the target folder because logs can be many Megabytes in size depending on the number of hits. 

    image

     

    The target folder may be changed in the “Directory” tab

    image

     

    I recommend choosing a directory that is more easily located, like C:\PerfLogs, in which case you can click “Browse”

    image

     

    Note:  It is not recommended to store the trace file on the same storage you are troubleshooting.

    image

    The name for the file is selected on the “File” tab.

    image

    When you are done setting the file and directory information click OK.

    Now right-click the log and select “Start” to begin tracing.

    image

    The “Status” should indicate that it is running.

    image

    How long to run the log depends on you issue.  You should let the log run until the issue or slow down occur then check the log for delays.

    Viewing the Log File

    First right-click and select “Stop” to flush any cached data and stop the logging.

    image

    Then use Event Viewer to examine the log.

    image

     

    Next choose "Open the saved log…"

    image

     

    image

    image

     

    After clicking OK a popup dialog will be displayed to ask if you want to convert the log. Select “No”. Here is some sample event data that you will likely see in the log.

    image

     

    Example: Event ID 13

    image

     

    Let's examine the “Details” tab which shows information about the I/O that exceeded or equaled the threshold we had set.  If you do not set a threshold, then all I/O events are logged, but for performance reasons we recommend setting a "reasonable" threshold like 15ms.

    image

    Breaking down the Event Data we have …

     

    Miniport Extension: (Kernel address of the miniport extension, not of use to performance logging, only debugging)

    PortNumber:  This is the adapter port number (i.e. RaidPort1, etc.)

    PathID:  Path number if multi-pathing is in place

    TargetID:  Target LUN exposed to the Operating System

    LUN:  The Logical Unit Number of the physical storage where the particular I/O is directed

    *RequestDuration:  How long the request took in milliseconds (the number of note here for storage latency diagnostics)

    Command:  the SCSI command sent to the unit. (Usually read or write)  This is listed in decimal.  Convert this value to hexadecimal with Calc.exe for referencing SCSI commands http://en.wikipedia.org/wiki/SCSI_command

    ScsiStatus:  the status that came back from the drive

    SRBStatus:  the status as returned from adapter  (see srb.h and scsi.h in the Microsoft WDK or http://en.wikipedia.org/wiki/SCSI_Status_Code)

     

     Example: Command 42 equals 2A which is a Write of 10 bytes.

    image

    image

    Here’s the respective SCSI command:

    image

    When viewing the log, the commands will be mostly reads and writes.  If I/O is determined to be slow at this level, then the issue is most likely disk or SAN related.

    Hope you enjoyed the walkthrough!

    -Bob

     

     

     

     

     

     

     

  • Ntdebugging Blog

    Understanding ARM Assembly Part 1

    • 10 Comments

    My name is Marion Cole, and I am a Sr. EE in Microsoft Platforms Serviceability group.  You may be wondering why Microsoft support would need to know ARM assembly.  Doesn’t Windows only run on x86 and x64 machines?  No.  Windows has ran on a variety of processors in the past.  Those include i860, Alpha, MIPS, Fairchild Clipper, PowerPC, Itanium, SPARC, 286, 386, IA-32, x86, x64, and the newest one is ARM.  Most of these processors are antiquated now.  The common ones now are IA-32, x86, x64.  However Windows has started supporting ARM processors in order to jump into the portable devices arena.  You will find them in the Microsoft Surface RT, Windows Phones, and other things in the future I am sure.  So you may be saying that these devices are locked, and cannot be debugged.  That is true from a live debug perspective, but you can get memory dumps and application dumps from them and those can be debugged.

     

    Processor

     

    There are limitations on ARM processors that Windows supports.  There are 3 System on Chip (SOC) vendors that are supported.  nVidia, Texas-Instruments, and Qualcomm. Windows only supports the ARMv7 (Cortex, Scorpion) architecture in ARMv7-A in (Application Profile) mode.  This implements a traditional ARM architecture with multiple modes and supporting a Virtual Memory System Architecture (VMSA) based on an MMU.  It supports the ARM and Thumb-2 instruction sets which allows for a mixture of 16 (Thumb) and 32 (ARM) bit opcodes.  So it will look strange in the assembly.  Luckily the debuggers know this and handle it for you.  This also helps to shrink the size of the assembly code in memory.  The processor also has to have the Optional ISA extensions of VFP (Hardware Floating Point) and NEON (128-bit SIMD Architecture).

     

    In order to understand the assembly that you will see you need to understand the processor internals.

     

    ARM is a Reduced Instruction Set Computer (RISC) much like some of the previous processors that Windows ran on.  It is a 32 bit load/store style processor.  It has a “Weakly-ordered” memory model: similar to Alpha and IA64, and it requires specific memory barriers to enforce ordering.  In ARM devices these as ISB, DSB, and DMB instructions.

     

    Registers

     

    The processor has 16 available registers r0 – r15.

    0: kd> r

    r0=00000001  r1=00000000  r2=00000000  r3=00000000  r4=e1820044  r5=e17d0580

    r6=00000001  r7=e17f89b9  r8=00000002  r9=00000000 r10=1afc38ec r11=e1263b78

    r12=e127813c  sp=e1263b20  lr=e16c12c3  pc=e178b6d0 psr=00000173 ----- Thumb

     

    r0, r1, r2, r3, and r12 are volatile registers.  Volatile registers are scratch registers presumed by the caller to be destroyed across a call.  Nonvolatile registers are required to retain their values across a function call and must be saved by the callee if used. 

     

    On Windows four of these registers have a designated purpose.  Those are:

    • PC (r15) – Program Counter (EIP on x86)
    • LR (r14) – Link Register.  Used as a return address to the caller.
    • SP (r13) – Stack Pointer (ESP on x86).
    • R11 – Frame Pointer (EBP on x86).
    • CPSR – Current Program Status Register (Flags on x86).

     

    In Windbg all but r11 will be labeled appropriately for you.  So you may be asking why r11 is not labeled “fp” in the debugger.  That is because r11 is only used as a frame pointer when you are calling a non-leaf subroutine.  The way it works is this: when a call to a non-leaf subroutine is made, the called subroutine pushes the value of the previous frame pointer (in r11) to the stack (right after the lr) and then r11 is set to point to this location in the stack, so eventually we end up with a linked list of frame pointers in the stack that easily enables the construction of the call stack. The frame pointer is not pushed to the stack in leaf functions.  Will discuss leaf functions later.

     

    CPSR (Current Program Status Register)

     

    Now we need to understand some about the CPSR register.  Here is the bit breakdown:

    31

    30

    29

    28

    27

    26

    25

    24

    23

    22

    21

    20

    19

    18

    17

    16

    15

    14

    13

    12

    11

    10

    9

    8

    7

    6

    5

    4

    3

    2

    1

    0

    N

    Z

    C

    V

    Q

    IT

    J

    Reserved

    GE

    IT

    E

    A

    I

    F

    T

    M

     

    • Bits [31:28] – Condition Code Flags
      • N – bit 31 – If this bit is set, the result was negative.  If bit is cleared the result was positive or zero.
      • Z – bit 30 – If set this bit indicates the result was zero or values compared were equal.  If it is cleared, the value is non-zero or the compared values are not equal.
      • C – bit 29 – If this bit is set the instruction resulted in a carry condition.  E.g. Adding two unsigned values resulted in a value too large to be strored.
      • V – bit 28 – If this bit is set then the instruction resulted in an overflow condition.  E.g. An overflow of adding two signed values.
    • Instructions variants ending with ‘s’ set the condition codes (mov/movs)
    • E – bit 9 – Endianness (big = 1/Little = 0)
    • T – bit 5 – Set if executing Thumb instructions
    • M – bits [4:0] – CPU Mode (User 10000/Supervisor 10011)

     

    So why do I need to know about the CPSR (Current Program Status Register)?  You will need to know where some of these bits are due to how some of the assembly instruction affect these flags.  Example of this is:

     

    ADD will add two registers together, or add an immediate value to a register.  However it will not affect the flags.

     

    ADDS will do the same as ADD, but it does affect the flags.

     

    MOV will allow you to move a value into a register, and a value between registers.  This is not like the x86/x64.  MOV will not let you read or write to memory.  This does not affect the flags.

     

    MOVS does the same thing as MOV, but it does affect the flags.

     

    I hope you are seeing a trend here.  There are instructions that will look the same.  However if they end in “S” then you need to know that this will affect the flags.  I am not going to list all of those assembly instructions here.  Those are already listed in the ARM Architecture Reference Manual ARMv7-A and ARMv7-R edition at http://infocenter.arm.com/help/topic/com.arm.doc.ddi0406b/index.html.

     

    So now we have an idea of what can set the flags.  Now we need to understand what the flags are used for.  They are mainly used for branching instructions.  Here is an example:

    003a11d2 429a     cmp         r2,r3

    003a11d4 d104     bne         |MyApp!FirstFunc+0x28 (003a11e0)|

     

    The first instruction in this code (cmp) compares the value stored in register r2 to the value stored in register r3. This comparison instruction sets or resets the Z flag in the CPSR register. The second instruction is a branch instruction (b) with the condition code ne which means that if the result of the previous comparison was that the values are not equal (the CPSR flag Z is zero) then branch to the address MyApp!FirstFunc+0x28 (003a11e0). Otherwise the execution continues.

     

    There are a few compare instructions.  “cmp” subtracts two register values, sets the flags, and discards the result.  “cmn” adds two register values, sets the flags, and discards the results.  “tst” does a bit wise AND of two register values, sets the flags, and discards the results.  There is even an If Then (it) instruction.  I am not going to discuss that one here as I have never seen it in any of the Windows code.

     

    So is “bne” the only branch instruction?  No.  There is a lot of them.  Here is a table of things that can be seen beside “b”, and what they check the CPSR register:

    Mnemonic

    Extension

    Meaning (Integer)

    Condition Flags (in CPSR)

    EQ

    Equal

    Z==1

    NE

    Not Equal

    Z==0

    MI

    Negative (Minus)

    N==1

    PL

    Positive or Zero (Plus)

    N==0

    HI

    Unsigned higher

    C==1 and Z==0

    LS

    Unsigned lower or same

    C==0 or Z==1

    GE

    Signed greater than or equal

    N==V

    LT

    Signed less than

    N!=V

    GT

    Signed greater than

    Z==0 and N==V

    LE

    Signed less than or equal

    Z==1 or N!=V

    VS

    Overflow

    V==1

    VC

    No overflow

    V==0

    CS

    Carry set

    C==1

    CC

    Carry clear

    C==0

    None (AL)

    Execute always

     

     

    Floating Point Registers

     

    As mentioned earlier the processor also has to have the ISA extensions of VFP (Hardware Floating Point) and NEON (128-bit SIMD Architecture).  Here is what they are.

    Floating Point

     

    As you can see this is 16 – 64bit regiters (d0-d15) that is overlaid with 32 – 32bit registers (s0-s31).  There are varieties of the ARM processor that has 32 – 64bit registers and 64 – 32bit registers.  Windows 8 will support both 16 and 32 register variants.  You have to be careful when using these, because if you access unaligned floats you may cause an exception.

     

    SIMD (NEON)

     

    As you can see here the SIMD (NEON) extension adds 16 – 128 bit registers (q0-q15) onto the floating point registers.  So if you reference Q0 it is the same as referencing D0-D1 or S0-S1-S2-S3.

     

    In part 2 we will discuss how Windows utilizes this processor.

  • Ntdebugging Blog

    Event ID 157 "Disk # has been surprise removed"

    • 9 Comments

    Hello my name is Bob Golding and I would like to share information on a new error you may see in the system event log. It is Event ID 157 "Disk <n> has been surprise removed" with Source: disk.  This error indicates that the CLASSPNP driver has received a “surprise removal” request from the plug and play manager (PNP) for a non-removable disk.

     

    What does this error mean?

    The PNP manager does what is called enumerations.  An enumeration is a request sent to a driver that controls a bus, such as PCI, to take an inventory of devices on the bus and report back a list of the devices.  The SCSI bus is enumerated in a similar manner, as are devices on the IDE bus.

     

    These enumerations can happen for a number of reasons.  For example, hardware can request an enumeration when it detects a change in configuration.  Also a user can initiate an enumeration by selecting “scan for new devices” in device manager.  

     

    When an enumeration request is received, the bus driver will rescan the bus for all devices.  It will issue commands to the existing devices as though it was looking for new ones.  If these commands fail on an existing unit, the driver will mark the device as “missing”.  When the device is marked “missing”, it will not be reported back to PNP in the inventory.  When PNP determines that the device is not in the inventory it will send a surprise removal request to the bus driver so the bus driver can remove the device object.

     

    Since the CLASSPNP driver sits in the device stack and receives requests that are destined for disks, it sees the surprise removal request and logs an event if the disk is supposed to be non-removable.  An example of a non-removable disk is a hard drive on a SCSI or IDE bus.  An example of a removable disk is a USB thumb drive.

     

    Previously nothing was logged when a non-removable disk was removed, as a result disks would disappear from the system with no indication.  The event id 157 error was implemented in Windows 8.1 and Windows Server 2012 R2 to log a record of a disk disappearing.

     

    Why does this error happen?

    These errors are most often caused when something disrupts the system’s communication with a disk, such as a SAN fabric error or a SCSI bus problem.  The errors can also be caused by a disk that fails, or when a user unplugs a disk while the system is running.  An administrator that sees these errors needs to verify the heath of the disk subsystem.

     

    Event ID 157 Example:

     

  • Ntdebugging Blog

    Interpreting Event 153 Errors

    • 9 Comments

    Hello my name is Bob Golding and I would like to share with you a new event that you may see in the system event log.  Event ID 153 is an error associated with the storage subsystem. This event was new in Windows 8 and Windows Server 2012 and was added to Windows 7 and Windows Server 2008 R2 starting with hot fix KB2819485.

     

    An event 153 is similar to an event 129.  An event 129 is logged when the storport driver times out a request to the disk; I described event 129 messages in a previous article.  The difference between a 153 and a 129 is that a 129 is logged when storport times out a request, a 153 is logged when the storport miniport driver times out a request.  The miniport driver may also be referred to as an adapter driver or HBA driver, this driver is typically written the hardware vendor.

     

    Because the miniport driver has a better knowledge of the request execution environment, some miniport drivers time the request themselves instead of letting storport handle request timing.  This is because the miniport driver can abort the individual request and return an error rather than storport resetting the drive after a timeout.  Resetting the drive is disruptive to the I/O subsystem and may not be necessary if only one request has timed out.  The error returned from the miniport driver is bubbled up to the class driver who can log an event 153 and retry the request.

     

    Below is an example event 153:

     

    Event 153 Example

     

    This error means that a request failed and was retried by the class driver.  In the past no message would be logged in this situation because storport did not timeout the request.  The lack of messages resulted in confusion when troubleshooting disk errors because timeouts would occur but there would be no evidence of the error.

     

    The details section of the event the log record will present what error caused the retry and whether the request was a read or write. Below is the details output:

     

    Event 153 Details

     

    In the example above at byte offset 29 is the SCSI status, at offset 30 is the SRB status that caused the retry, and at offset 31 is the SCSI command that is being retried.  In this case the SCSI status was 00 (SCSISTAT_GOOD), the SRB status was 09 (SRB_STATUS_TIMEOUT), and the command was 28 (SCSIOP_READ). 

     

    The most common SCSI commands are:

    SCSIOP_READ - 0x28

    SCSIOP_WRITE - 0x2A

     

    The most common SRB statuses are below:

    SRB_STATUS_TIMEOUT - 0x09

    SRB_STATUS_BUS_RESET - 0x0E

    SRB_STATUS_COMMAND_TIMEOUT - 0x0B

     

    A complete list of SCSI operations and statuses can be found in scsi.h in the WDK.  A list of SRB statuses can be found in srb.h.

     

    The timeout errors (SRB_STATUS_TIMEOUT and SRB_STATUS_COMMAND_TIMEOUT) indicate a request timed out in the adapter. In other words a request was sent to the drive and there was no response within the timeout period.  The bus reset error (SRB_STATUS_BUS_RESET) indicates that the device was reset and that the request is being retried due to the reset since all outstanding requests are aborted when a drive receives a reset.

     

    A system administrator who encounters event 153 errors should investigate the health of the computer’s disk subsystem.  Although an occasional timeout may be part of the normal operation of a system, the frequent need to retry requests indicates a performance issue with the storage that should be corrected.

  • Ntdebugging Blog

    Would you be interested in a live chat session on debugging techniques?

    • 9 Comments

    We’ve been thinking about hosting a live group chat session to talk with the debugging community.  If we had such a chat, the discussion would focus on debugging techniques and any questions you may have about anything we’ve previously blogged about.  If you’d be interested in participating in a chat session, please answer the survey question below.  Also, feel free to leave a comment on this page if there’s a particular debugging topic you would like for us to cover in the chat.

     


    Thanks everyone for your feedback! This survey is now closed.

  • Ntdebugging Blog

    Tracking Down a Multi-Process Deadlock

    • 9 Comments

     

    Hello, my name is Louis Shanks and my story starts off with a tricky little deadlock issue that I worked on recently. Let me set the stage by explaining the surface level symptom, and then we will get to the good stuff. Randomly when users running Vista would attempt to launch Internet Explorer nothing would happen. So they would again double-click the icon, and still nothing happened. So the users would reboot and go back to their web activities.

     

    After some basic trouble shooting it was discovered that Iexplore.exe was running multiple times and every attempt to open the browser would result in another Iexplore.exe instance.  Since no UI was ever presented I went ahead and collected a few Iexplore.exe process dumps along with a complete system dump.

     

     The process dumps didn’t prove to be terribly useful so I decided to move over to the complete memory dump.

     

    Since I knew we had multiple IExplore.exe processes running I wanted to investigate them first.

     

    0: kd>!process 0 0

     

    <snip>

    PROCESS 87289d90  SessionId: 5  Cid: 0fdc    Peb: 7ffdc000  ParentCid: 0968

        DirBase: 7d66c520  ObjectTable: a203a178  HandleCount: 205.

        Image: iexplore.exe

     

    PROCESS 861a9d90  SessionId: 5  Cid: 0d34    Peb: 7ffd9000  ParentCid: 0fdc

        DirBase: 7d66c7a0  ObjectTable: a5651a98  HandleCount: 170.

        Image: ieuser.exe

     

    PROCESS 86c175b0  SessionId: 5  Cid: 1250    Peb: 7ffd9000  ParentCid: 0968

        DirBase: 7d66c940  ObjectTable: a0853f10  HandleCount:  91.

        Image: iexplore.exe


    PROCESS 861ac7a8  SessionId: 5  Cid: 1024    Peb: 7ffde000  ParentCid: 0968

        DirBase: 7d66c620  ObjectTable: a83f7898  HandleCount:  91.

    Image: iexplore.exe
    </snip>

    I decided to just start with the first one in the list (87289d90)

     

    0: kd> !process 87289d90
        Image: iexplore.exe

        ElapsedTime                       00:24:06.666

     

    It appears this process has been around a while, so I wanted to look at the first thread in this process as it is responsible for the UI in Internet Explorer.  I am looking at this thread because the surface level problem illustrates that when we launch Iexplore.exe the UI is never displayed. Here I used the !thread command to display the specific thread.

     

            THREAD 870b6398  Cid 0fdc.16a4  Teb: 7ffdf000 Win32Thread: fd850b78 WAIT: (UserRequest) UserMode Non-Alertable

                85e86cf0  SynchronizationEvent

                85cb35e0  SynchronizationEvent

            Not impersonating

            Attached Process          87289d90       Image:         iexplore.exe

            Wait Start TickCount      10817197       Ticks: 69571 (0:00:18:05.314)

            UserTime                  00:00:00.015

            KernelTime                00:00:00.093  

     

    Based on the tick count this thread has been waiting for 18 minutes.

     

    0: kd> kvn

    # ChildEBP RetAddr  Args to Child             

    00 931208d8 81a940a1 870b6398 803d2120 870b6420 nt!KiSwapContext+0x26

    01 9312091c 81a8dacc 870b6398 00000000 00000002 nt!KiSwapThread+0x44f

    02 93120970 81bedb4f 00000002 93120aa8 00000001 nt!KeWaitForMultipleObjects+0x53d

    03 93120bfc 81bee2e3 00000002 00000001 00000000 nt!ObpWaitForMultipleObjects+0x256

    04 93120d48 81a30b3a 00000002 0027e9c4 00000001 nt!NtWaitForMultipleObjects+0xcc

    05 93120d48 77568254 00000002 0027e9c4 00000001 nt!KiFastCallEntry+0x12a

    06 0027e970 775679fc 7708c973 00000002 0027e9c4 ntdll!KiFastSystemCallRet

    07 0027e974 7708c973 00000002 0027e9c4 00000001 ntdll!NtWaitForMultipleObjects+0xc

    08 0027ea10 7608ff95 0027e9c4 0027ea38 00000000 kernel32!WaitForMultipleObjectsEx+0x11d

    09 0027ea64 773d3a46 00000028 0027eaac ffffffff USER32!RealMsgWaitForMultipleObjectsEx+0x13c

    0a 0027ea8c 773d3b7f 0027eaac ffffffff 0027eabc ole32!CCliModalLoop::BlockFn+0x97

    0b 0027eab4 774e75be ffffffff 003ddf90 0027ebc0 ole32!ModalLoop+0x5b

    0c 0027ead0 774e9294 00000000 0027ebd4 00000000 ole32!ThreadSendReceive+0x12c

    0d 0027eaf8 774e9100 0027ebc0 003d3538 0027ec1c ole32!CRpcChannelBuffer::SwitchAptAndDispatchCall+0x194

    0e 0027ebd8 773d3c25 003d3538 0027ecfc 0027ece4 ole32!CRpcChannelBuffer::SendReceive2+0xef

    0f 0027ebf4 773d3bd7 0027ecfc 0027ece4 003d3538 ole32!CCliModalLoop::SendReceive+0x1e

    10 0027ec6c 77403efb 003d3538 0027ecfc 0027ece4 ole32!CAptRpcChnl::SendReceive+0x73

    11 0027ecc0 76052ff4 003d3538 0027ecfc 0027ece4 ole32!CCtxComChnl::SendReceive+0x1c5

    12 0027ecd8 76053055 003e50a4 0027ede4 76052e0d RPCRT4!NdrProxySendReceive+0x43

    13 0027ece4 76052e0d 09ea0d59 0027f1fc 070001f3 RPCRT4!NdrpProxySendReceive+0xc

     

    Based on the call stack it appears this thread made an OLE call and is now waiting for some objects to signal.

     

    What is IExplore doing in this thread?  It is waiting on another process to service this OLE request.   We can find the PID of the other process by digging into the ole32 SendReceive call.

     

    I looked at the first argument:

     

    0: kd> dd 003d3538

    003d3538  77403a50 773f57b4 00000003 00000002

    003d3548  00000000 00000000 003d2ef0 003d8840

    003d3558  0038b1c0 003e05f8 77403980 00070005

     

    This gives us a pointer to another data structure that contains the information we are attempting to locate.


    0: kd> dd 003d2ef0

    003d2ef0  774ee600 003d2e70 00000d34 00000000

    003d2f00  862941fa c94cde09 55a88424 801a2601

    003d2f10  55a88424 801a2601 0000c400 00000d34

     

    Ah there it is! 0d34 is the process id servicing our OLE request.

     

    I went back and reviewed the !process 0 0 output to locate the process containing PID 0D34. I’m dumping it here.

     

    PROCESS 861a9d90  SessionId: 5  Cid: 0d34    Peb: 7ffd9000  ParentCid: 0fdc

        DirBase: 7d66c7a0  ObjectTable: a5651a98  HandleCount: 170.

        Image: ieuser.exe

     

    Ok, so the process handing this request is IEUSER.exe, which is the broker process used by Internet Explorer in Vista when running in protect mode.  This is new in Vista.

     

    More on IEUser and Protected mode can be seen at the following location:
    http://msdn.microsoft.com/en-us/library/bb250462.aspx

     

    Now I needed to dig into the IEUser.exe process to find out what it was working on causing Iexplore to wait.

     

    0: kd> !process 861a9d90 

    PROCESS 861a9d90  SessionId: 5  Cid: 0d34    Peb: 7ffd9000  ParentCid: 0fdc

        DirBase: 7d66c7a0  ObjectTable: a5651a98  HandleCount: 170.

        Image: ieuser.exe


    I discovered the thread in the ieuser.exe process working on our call from Iexplore.exe. However this thread is also waiting on access to something before it can complete.


    0: kd> kvn+200

     # ChildEBP RetAddr  Args to Child             

    00 a41b0c50 81a940a1 85ecbac0 85ecbb48 81af9920 nt!KiSwapContext+0x26

    01 a41b0c94 81a2ddc8 85ecbac0 00000000 85e18510 nt!KiSwapThread+0x44f

    02 a41b0ce8 81bf0d9f 85e18510 00000006 00000001 nt!KeWaitForSingleObject+0x492

    03 a41b0d50 81a30b3a 00000224 00000000 00000000 nt!NtWaitForSingleObject+0xbe

    04 a41b0d50 77568254 00000224 00000000 00000000 nt!KiFastCallEntry+0x12a

    05 0163e8a4 77567a0c 7708c7c4 00000224 00000000 ntdll!KiFastSystemCallRet

    06 0163e8a8 7708c7c4 00000224 00000000 00000000 ntdll!NtWaitForSingleObject+0xc

    07 0163e918 7708c732 00000224 ffffffff 00000000 kernel32!WaitForSingleObjectEx+0xbe

    08 0163e92c 747e60aa 00000224 ffffffff 00000000 kernel32!WaitForSingleObject+0x12

    09 0163e948 747e5fe0 0163ec48 00497b38 00000000 RASAPI32!ReadPhonebookFileEx+0x1a4

    0a 0163e968 747e4e9d 0163ec48 00000000 00000000 RASAPI32!ReadPhonebookFile+0x1b

    0b 0163e9a8 747e52d3 0163ec48 0048f660 0163e9e0 RASAPI32!DwEnumEntriesFromPhonebook+0x35

    0c 0163ee58 747e4fd4 0163ee9c 00000001 0048f660 RASAPI32!DwEnumEntriesInDir+0x1cc

    0d 0163f0ac 747e677e 00000001 0048f660 0163f0f4 RASAPI32!DwEnumEntriesForPbkMode+0xab

    0e 0163f0e0 76e534b9 00000000 00004180 0048f660 RASAPI32!RasEnumEntriesW+0xb0

    0f 0163f10c 76e5342a 76ecc2dc 00000000 00000000 WININET!RasEnumHelp::RasEnumHelp+0x55

    10 0163f120 76e533a3 00000001 00000000 00000001 WININET!DoConnectoidsExist+0x2b

    11 0163f14c 76e534ff 76ecf02c 76ecc2dc 00000000 WININET!GetRasConnections+0x34

    12 0163f168 76e4196e 00000000 0163f180 6ee3228c WININET!IsDialUpConnection+0xa9

    13 0163f184 76e70564 00000001 00000000 00000000 WININET!FixProxySettingsForCurrentConnection+0x31

    14 0163f1a8 6ec2db3b 00000000 0163f1bc 0163f3d4 WININET!InternetInitializeAutoProxyDll+0x6c

    15 0163f1b8 75fe9142 004884a0 00000202 0047e900 ieframe!CShdocvwBroker::StartAutoProxyDetection+0x4d

    16 0163f1d0 76054294 6ec2daf8 0163f3d8 00000001 RPCRT4!Invoke+0x2a

    17 0163f5fc 76054a49 004858d0 0046e4e0 0047e900 RPCRT4!NdrStubCall2+0x27b

    18 0163f64c 774e86d7 004858d0 0047e900 0046e4e0 RPCRT4!CStdStubBuffer_Invoke+0xa0

    19 0163f694 774e877d 0047e900 0047f968 00480b38 ole32!SyncStubInvoke+0x3c

    1a 0163f6e0 7740249e 0047e900 0047f860 004858d0 ole32!StubInvoke+0xb9

    1b 0163f7bc 774023af 0046e4e0 00000000 004858d0 ole32!CCtxComChnl::ContextInvoke+0xfa

    1c 0163f7d8 774e85df 0047e900 00000001 004858d0 ole32!MTAInvoke+0x1a

    1d 0163f808 774e8a18 d0908070 0046e4e0 004858d0 ole32!AppInvoke+0xaa

    1e 0163f8e4 774e85ef 0047e8a8 00474118 0047ebc0 ole32!ComInvokeWithLockAndIPID+0x32c

    1f 0163f930 75fe94b0 0047ebc0 08ac70b7 0047ebc0 ole32!ThreadInvoke+0x2fd

     

    I’m dumping the first parameter passed to KeWaitForSingleObject to determine the object type.

     

    0: kd> !object 85e18510

    Object: 85e18510  Type: (85484d40) Mutant

        ObjectHeader: 85e184f8 (old version)

        HandleCount: 3  PointerCount: 5

        Directory Object: 881f3030  Name: RasPbFile


    Ok, we are waiting on a Mutex named RasPbFile. The big question in your mind now is, “who owns this lock”, right?

    To determine this I’ll dump the structure for the mutex object:

     

    0: kd> dt _KMUTANT 85e18510

    nt!_KMUTANT

       +0x000 Header           : _DISPATCHER_HEADER

       +0x010 MutantListEntry  : _LIST_ENTRY [ 0x870b6564 - 0x870b6564 ]

       +0x018 OwnerThread      : 0x870b6398 _KTHREAD

       +0x01c Abandoned        : 0 ''

       +0x01d ApcDisable       : 0 ''

     

    The owner of the mutex is 870b6398. The bad news is this is the same thread I started with in the Iexplore.exe process so we’re looking at a classic deadlock scenario. Thread 870b6398 running under Iexplore.exe has made an OLE call over to the IEuser.exe broker process. The IEuser process needs the mutex owned by the thread that called it in the first place.

    This turned out to be a bug in the RasApi code, not the browser or IEuser process.  The RasAPI code was written in such a way that it would wait for an infinite period of time to gain the Mutex.

     

    0163e92c 747e60aa 00000224 ffffffff 00000000 kernel32!WaitForSingleObject+0x12

     

     The good news is that we have created hotfix 951738 to address this issue.  The official KB article is still pending.

  • Ntdebugging Blog

    This button doesn’t do anything!

    • 8 Comments

     

    Hello - Matthew here again.  Today I'll be discussing in detail hang scenario #1 that Tate first mentioned a few blogs posts ago.  From a debugging perspective, in an ideal world an application would always provide some kind of feedback when a failure occurs.  The reality is that sometimes an application just doesn't do anything in response to a user-initiated action (such as a mouse-click) that results in a failure of some kind.  When this happens, how can you determine what is going wrong?

     

    In general, Process Monitor from Sysinternals is a great tool for gathering clues about this kind of failure.  If the failure is due to a file system or registry problem, Process Explorer will catch the problematic event.  Or if you suspect that the problem is due to something failing over the network, a network trace could help.

     

    Let's assume that the techniques described above do not move us any closer to understanding the failure.  What next?  If you don't know what the app does in the background, and you don't have source code for the app, how do you proceed?  Let's answer those questions by debugging a sample application...

     

    The Nothing button

     

    Download the sample application here.

     

     

    Here's what we know:

    1. Clicking the “Button 1” button is supposed to display a useful dialog box

    2. For certain users, clicking “Button 1” results in no observable response

    3. You don't have access to the source code or debug symbols for the app

    4. No one seems to know what the “Button 1” button does behind the scenes

    5. The developer of the application hasn't been heard from in 3 years

     

    We need to understand what exactly happens when the user clicks the “Button 1” button.  Every window has a WindowProc function that receives messages upon user input.  Since a button is considered a "control", the WindowProc for the main application window will get a WM_COMMAND message indicating that the button was clicked.  To differentiate which specific control sent the message, each WM_COMMAND message has a control identifier associated with it.

     

    So we need to:

     

    a)     Find the control identifier for “Button 1”

    b)    Find the WindowProc for the main application window

    c)     Determine where WM_COMMAND messages for the “Button 1” button are handled

    d)    Figure out what is failing in that code

     

    Let’s get started...

     

     

     

    Find the control identifier for “Button 1”      

     

    Spy++ (available as part of Visual Studio) is the tool for this job.  Click Spy -> Log Messages.  Use the Finder Tool to select the main window of ntdbghang1.exe.  On the messages tab, click “Clear All” then select “WM_COMMAND”.  Click “OK” and SPY++ will start logging.  Click “Button 1” in ntdbghang1.exe and you should see one line of text appear in the output window.  For good measure, click “Button 2” also, and you should see a second line of text.  At this point, your SPY++ window should look something like this....

     

    Spy Windows Msg

     

    We can tell from this output, that the control identifier for “Button 1” is 257 (0x101), and the control identifier for “Button 2” is 258 (0x102).  We’ll need this information later.

     

     

     

    Find the WindowProc for the main application window

     

    We can find the WindowProc address by using SPY++ again.  Click Spy -> Find Window and use the Finder Tool to select the main window of ntdbghang1.exe.  Make sure “Show properties” is selected, and then click “OK”.  You will get a dialog that looks like this....

     

    Spy Window Properties

     

    Note the value of Window Proc.  This is the address of the window procedure for the main window (the exact value may differ on your system).   We know that this function is a WindowProc, which means it is of the form...

     

    LRESULT CALLBACK WindowProc(         

        HWND hwnd,

        UINT uMsg,

        WPARAM wParam,

        LPARAM lParam

    );

     

    The meaning of wParam and lParam differ depending on what uMsg is being processed.  When a WM_COMMAND message is passed, the low-word of wParam is the control identifier, which we determined to be 0x101.  The high-word of wParam is the control notification code, which in the case of a button being clicked is BN_CLICKED (literally 0).

     

    So, we specifically are interested in the case where:

               

                uMsg = WM_COMMAND (literally 0x111)

                wParam = 0x101

     

    While looking at assembly language for the WindowProc, the stack frame will look like this...

     

    ebp    =  “old ebp”

    ebp+4  =  “return address”

    ebp+8  =  hwnd

    ebp+c  =  uMsg

    ebp+10 =  wParam

    ebp+14 =  lParam

     

     

     

    Determine where WM_COMMAND messages for the “Button 1” button are handled

     

    Once you have the address of the WindowProc, we can examine the assembly code for that function using Windbg.  Launch windbg.exe and select “File -> Attach to a Process”.  Select “ntdbghang1.exe” from the list and click “OK”.  When the debugger breaks in, you can unassemble the beginning of the WindowProc by typing “u <address>”.  Based on the address on my system, that command will be “u 01002830”.  To unassemble more just type “u” again.  I’m now going to unassemble the relevant code and explain what it means along the way...

     

     

    0:001> u 01002830

    ntdbghang1+0x2830:

     

    The first 3 instructions are a function prologue, just getting everything set up.

    01002830 8bff            mov     edi,edi

    01002832 55              push    ebp

    01002833 8bec            mov     ebp,esp

     

    Here is the code that moves the value of uMsg into ecx

    01002835 8b4d0c          mov     ecx,dword ptr [ebp+0Ch]

     

    Now the app enters the assembly equivalent of a switch statement.

    We are interested in the check for uMsg = WM_COMMAND

     

    if uMsg = WM_CREATE (1) goto 01002893

    01002838 49              dec     ecx

    01002839 7458            je      ntdbghang1+0x2893 (01002893)

    if uMsg = WM_ DESTROY (2) goto 01002889

    0100283b 49              dec     ecx

    0100283c 744b            je      ntdbghang1+0x2889 (01002889)

    if uMsg = WM_CLOSE (0x10) goto 01002889

    0100283e 83e90e          sub     ecx,0Eh

    01002841 743b            je      ntdbghang1+0x287e (0100287e)

    if uMsg = WM_COMMAND (0x111) goto 01002853

    01002843 b801010000      mov     eax,101h

    01002848 2bc8            sub     ecx,eax

    0100284a 7407            je      ntdbghang1+0x2853 (01002853)

     

    From the above switch statement we see that when uMsg = WM_COMMAND, execution moves to 01002853.  So let’s go there...

     

    0:001> u 01002853

    ntdbghang1+0x2853:

     

    Move the value of wParam into edx

    01002853 8b5510          mov     edx,dword ptr [ebp+10h]

    If LOWORD(wParam) == 0x101 (control ID for button 1) goto 0100286f

    01002856 0fb7ca          movzx   ecx,dx

    01002859 2bc8            sub     ecx,eax

    0100285b 7412            je      ntdbghang1+0x286f (0100286f)

     

    From the above assembly we can see that right away we are checking for the control ID for button 1.  This is the code path we want to follow....

     

    0:001> u 0100286f

    ntdbghang1+0x286f:

     

    If HIWORD(wparam) != BN_CLICKED (0) then goto 0100289b

    Otherwise, call the function at 010027f6

    0100286f c1ea10          shr     edx,10h

    01002872 6685d2          test    dx,dx

    01002875 7524            jne     ntdbghang1+0x289b (0100289b)

    01002877 e87affffff      call    ntdbghang1+0x27f6 (010027f6)

     

    Now it is clear that when the BN_CLICKED occurs for control id 0x101, we are going to execute the function at address 010027f6. 

     

     

     

    Figure out what is failing

     

    So now we know what code is going to run every time the button is clicked (the function at 010027f6).  Let’s examine that code and figure out what may be failing.  We’ll use “uf” to unassemble the entire function, as it is rather short.  During this first pass we’ll make up names for our local variables and functions that are called, and then we’ll revisit and figure out what they are.

     

     

    0:001> uf 010027f6

    ntdbghang1+0x27f6:

     

    Function Prologue

    010027f6 8bff            mov     edi,edi

    010027f8 55              push    ebp

    010027f9 8bec            mov     ebp,esp

     

    010027fb 51              push    ecx

    010027fc 56              push    esi

     

    call func1(0x20) [func1 address at 01002dca]

    set localvar1=0x10

    010027fd 6a20            push    20h

    010027ff c745fc10000000  mov     dword ptr [ebp-4],10h

    01002806 e8bf050000      call    ntdbghang1+0x2dca (01002dca)

     

    Store the result of func1 in esi (call it localvar2)

    0100280b 8bf0            mov     esi,eax

     

    Call func2(localvar2, &localvar1) [func2 address at 0100b484]

    0100280d 59              pop     ecx

    0100280e 8d45fc          lea     eax,[ebp-4]

    01002811 50              push    eax

    01002812 56              push    esi

    01002813 e86c8c0000      call    ntdbghang1+0xb484 (0100b484)

     

    If func2 returns 0, then goto 01002821

    01002818 85c0            test    eax,eax

    0100281a 7405            je      ntdbghang1+0x2821 (01002821)

     

    Call func3() [func3 address at 0100278d]

    ntdbghang1+0x281c:

    0100281c e86cffffff      call    ntdbghang1+0x278d (0100278d)

     

    Call func4(localvar2) [func4 address at 01002ce3]

    ntdbghang1+0x2821:

    01002821 56              push    esi

    01002822 e8bc040000      call    ntdbghang1+0x2ce3 (01002ce3)

     

    Clean up and exit the function

    01002827 59              pop     ecx

    01002828 5e              pop     esi

    01002829 c9              leave

    0100282a c3              ret

     

     

    So let’s translate this to some pseudo-code...

     

    localvar1 = 0x10;

    localvar2 = func1(0x20);

     

    if(func2(localvar2, &localvar1)

    {

           func3();

     

    func4(localvar2);

     

    So the behavior of the function will differ, depending on the results of func2.  Let’s see if we can figure out what func2 does...

     

    0:001> u 0100b484

    ntdbghang1+0xb484:

    0100b484 ff2568110001    jmp     dword ptr [ntdbghang1+0x1168 (01001168)]

    0100b48a cc              int     3

    0100b48b cc              int     3

    0100b48c cc              int     3

     

    0:001> dps 01001168 L1

    01001168  70b88cb1 WINSPOOL!GetDefaultPrinterW

     

     

    GetDefaultPrinterW is a public API function.  Here’s the function prototype from MSDN...

     

    BOOL GetDefaultPrinter(

      LPTSTR pszBuffer,   // printer name buffer

      LPDWORD pcchBuffer  // size of name buffer

    );

     

    This fits our pseudo-code, since the function takes two parameters and returns a BOOL.  Let’s update our pseudo code to match what we know, replacing func2 with GetDefaultPrinter, etc...

     

    DWORD cchBuffer = 0x10;

    LPWSTR pszBuffer = func1(0x20);

     

    if(GetDefaultPrinterW(pszBuffer,  &cchBuffer))

    {

           func3();

    }

     

    func4(pszBuffer);

     

    Based on the above, we can make a couple of assumptions.  Func1 is likely an allocator function of some kind (such as malloc) and func4 is likely a memory free function (such as free). Since the app is using the Unicode version of GetDefaultPrinter (it ends with a W) it make sense that func1 is an allocator asking for 0x20 bytes, and the size of the buffer, in characters, passed to GetDefaultPrinterW is 0x10.  So that accounts for all of the functions, except for func3.  We could unassemble func3 to see what it does, but that may not be needed.  From the pseudo-code we’ve already created, we can see that a failure in GetDefaultPrinterW will prevent the button click function from doing anything except allocating and freeing some memory.  So GetDefaultPrinterW is a likely potential point of failure.

     

    Let’s test this theory out.  We’ll set a breakpoint on GetDefaultPrinterW and see if it is failing upon the button click.

     

    0:001> bp WINSPOOL!GetDefaultPrinterW

     

    0:001> g

     

    <Now click Button 1>

     

    Breakpoint 0 hit

    eax=0006fb84 ebx=00000000 ecx=00000020 edx=00dc0e98 esi=00dc0e70 edi=0006fc0c

    eip=70b88cb1 esp=0006fb74 ebp=0006fb88 iopl=0         nv up ei pl nz na po nc

    cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000202

    WINSPOOL!GetDefaultPrinterW:

    70b88cb1 8bff            mov     edi,edi

     

    // Dump out the call stack...

    0:000> kb

    ChildEBP RetAddr  Args to Child             

    0006fb70 01002818 00dc0e70 0006fb84 00000111 WINSPOOL!GetDefaultPrinterW

    WARNING: Stack unwind information not available. Following frames may be wrong.

    0006fb88 0100287c 0006fbbc 75d41a10 00320f78 ntdbghang1+0x2818

    0006fb90 75d41a10 00320f78 00000111 00000101 ntdbghang1+0x287c

    0006fbbc 75d41ae8 01002830 00320f78 00000111 USER32!InternalCallWinProc+0x23

    0006fc34 75d4286a 00000000 01002830 00320f78 USER32!UserCallWinProcCheckWow+0x14b

    0006fc74 75d42bba 00a90b80 0095ee08 00000101 USER32!SendMessageWorker+0x4b7

     

    // Go to the return address.  Note that eax=0, meaning GetDefaultPrinterW returned FALSE.

    0:000> gu

    eax=00000000 ebx=00000000 ecx=76f22033 edx=00e10178 esi=00dc0e70 edi=0006fc0c

    eip=01002818 esp=0006fb80 ebp=0006fb88 iopl=0         nv up ei pl zr na pe nc

    cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000246

    ntdbghang1+0x2818:

    01002818 85c0            test    eax,eax

     

    // Check the last error...

    0:000> !gle

    LastErrorValue: (Win32) 0x7a (122) - The data area passed to a system call is too small.

     

           // Resume execution

    0:000> g

     

    The error returned was 0x7a = ERROR_INSUFFICIENT_BUFFER.  It looks like the buffer of 0x20 bytes being passed to GetDefaultPrinterW is too small to hold the name of the default printer for that user.  This explains why the app works for some users and fails for others.  A quick change of the default printer name to something less than 0x10 characters (16 decimal) will work around this issue.

     

    To wrap things up, here is the C source for the Button1_OnClick function (at address 010027f6 in the assembly) so you can see how this was actually written...

     

    VOID Button1_OnClick()

    {

          DWORD cch = 16;

          LPTSTR pPrinterName;

     

          pPrinterName = (LPTSTR) malloc(16 * sizeof(TCHAR));

     

          if(GetDefaultPrinter(pPrinterName, &cch))

          {

                DisplayGoButtonMessage();

          }

     

          free(pPrinterName);

     

          return;

    }

     

    Look for an upcoming blog post that shows another way this problem could have been approached – using the “wt” command in the debugger.  I hope you found this useful and can apply to real-world debugging scenarios.  Please post any questions / comments!

Page 3 of 24 (238 items) 12345»