This information applies specifically to Windows 2000, Windows XP, and Windows Server 2003. I will blog separately on the boot changes in Windows Vista.
Additional information about this topic may be found in Chapter 5 of Microsoft Windows Internals by Russinovich and Solomon and also on TechNet: http://technet.microsoft.com/en-us/library/bb457123.aspx
The key to understanding and troubleshooting system startup issues is to accurately ascertain the point of failure. To facilitate this determination, I have divided the boot process into the following four phases.
2. Boot Loader
Over the next few weeks, I’ll be describing each of these phases in detail and providing appropriate guidance for troubleshooting relevant issues for each phase.
The Initial Phase of boot is divided into the Power-On Self Test (POST) and Initial Disk Access.
Power-On Self Test
POST activities are fully implemented by the computer’s BIOS and vary by manufacturer. Please refer to the technical documentation provided with your hardware for details. However, regardless of manufacturer, certain generic actions are performed to ensure stable voltage, check RAM, enable interrupts for system usage, initialize the video adapter, scan for peripheral cards and perform a memory test if necessary. Depending on manufacturer and configuration, a single beep usually indicates a successful POST.
Troubleshooting the POST
ü Make sure you have the latest BIOS and firmware updates for the hardware installed in the system.
ü Replace the CMOS battery if it has failed.
ü Investigate recently added hardware (RAM, Video cards, SCSI adapters, etc.)
ü Remove recently added RAM modules.
ü Remove all adapter cards, then replace individually, ensuring they are properly seated.
ü Move adapter cards to other slots on the motherboard.
ü If the computer still will not complete POST, contact your manufacturer.
Initial Disk Access
Depending on the boot device order specified in your computer’s BIOS, your computer may attempt to boot from a CD-ROM, Network card, Floppy disk, USB device or a hard disk. For the purposes of this document, we’ll assume that we’re booting to a hard disk since that is the most common scenario.
Before we discuss the sequence of events that occur during this phase of startup, we need to understand a little bit about the layout of the boot disk. The structure of the hard disk can be visualized this way: (Obviously, these data areas are not to scale)
Hard disks are divided into Cylinders, Heads and Sectors. A sector is the smallest physical storage unit on a disk and is almost always 512 bytes in size. For more information about the physical structure of a hard disk, please refer to the following Resource Kit chapter: http://www.microsoft.com/resources/documentation/windowsnt/4/server/reskit/en-us/resguide/diskover.mspx
There are two disk sectors critical to starting the computer that we’ll be discussing in detail:
· Master Boot Record (MBR)
· Boot Sector
The MBR is always located at Sector 1 of Cylinder 0, Head 0 of each physical disk. The Boot Sector resides at Sector 1 of each partition. These sectors contain both executable code and the data required to run the code.
Please note that there is some ambiguity involved in sector numbering. Cylinder/Head/Sector (CHS) notation begins numbering at C0/H0/S1. However, Absolute Sector numbering begins numbering at zero. Absolute Sector numbering is often used in disk editing utilities such as DskProbe. These differences are discussed in the following knowledge base article:
Q97819 Ambiguous References to Sector One and Sector Zero
Now that we have that straight, when does this information get written to disk? The MBR is created when the disk is partitioned. The Boot Sector is created when you format a volume. The MBR contains a small amount of executable code called the Master Boot Code, the Disk Signature and the partition table for the disk. At the end of the MBR is a 2-byte structure called a Signature Word or End of Sector marker, which should always be set to 0x55AA. A Signature Word also marks the end of an Extended Boot Record (EBR) and the Boot Sector.
The Disk Signature, a unique number at offset 0x1B8, identifies the disk to the operating system. Windows 2000 and higher operating systems use the disk signature as an index to store and retrieve information about the disk in the registry subkey HKLM\System\MountedDevices.
The Partition Table is a 64-byte data structure within the MBR used to identify the type and location of partitions on a hard disk. Each partition table entry is 16 bytes long (four entries max). Each entry starts at a predetermined offset from the beginning of the sector as follows:
Partition 1 0x1BE (446)
Partition 2 0x1CE (462)
Partition 3 0x1DE (478)
Partition 4 0x1EE (494)
The following is a partial example of a sample MBR showing three partition table entries in-use and one empty:
000001B0: 80 01
000001C0: 01 00 07 FE BF 09 3F 00 - 00 00 4B F5 7F 00 00 00
000001D0: 81 0A 07 FE FF FF 8A F5 – 7F 00 3D 26 9C 00 00 00
000001E0: C1 FF 05 FE FF FF C7 1B – 1C 01 D6 96 92 00 00 00
000001F0: 00 00 00 00 00 00 00 00 – 00 00 00 00 00 00
Let’s take a look at each of the fields of a partition table entry individually. For each of these explanations, I’ll use the first partition table entry above and highlight the relevant section. Keep in mind that these values are little-endian.
00=Do Not Use for Booting
80=Active partition (Use for Booting)
000001C0: 01 00 07 FE BF 09 3F 00 - 00 00 4B F5 7F 00
000001C0: 01 00 07 FE BF 09 3F 00 - 00 00 4B F5 7F 00
Only the first 6 bits are used. The upper 2 bits of this byte are used by the Starting Cylinder field.
Uses 1 byte + 2 bits from the Starting Sector field to make up the cylinder value. The Starting Cylinder is a 10-bit number with a maximum value of 1023.
Defines the volume type. 0x07=NTFS
Other Possible System ID Values:
FAT12 primary partition or logical drive (fewer than 32,680 sectors in the volume)
FAT16 partition or logical drive (32,680–65,535 sectors or 16 MB–33 MB)
BIGDOS FAT16 partition or logical drive (33 MB–4 GB)
Installable File System (NTFS partition or logical drive)
FAT32 partition or logical drive
FAT32 partition or logical drive using BIOS INT 13h extensions
BIGDOS FAT16 partition or logical drive using BIOS INT 13h extensions
Extended partition using BIOS INT 13h extensions
Dynamic disk volume
Legacy FT FAT16 disk *
Legacy FT NTFS disk *
Legacy FT volume formatted with FAT32 *
Legacy FT volume using BIOS INT 13h extensions formatted with FAT32 *
Partition types denoted with an asterisk (*) indicate that they are also used to designate non-FT configurations such as striped and spanned volumes.
Ending Head (0xFE=254 decimal)
As with the Starting Sector, it only uses the first 6 bits of the byte. The upper 2 bits are used by the Ending Cylinder field.
Uses 1 byte in addition to the upper 2 bits from the Ending Sector field to make up the cylinder value. The Ending Cylinder is a 10-bit number with a maximum value of 1023.
The offset from the beginning of the disk to the beginning of the volume, counting by sectors.
0x0000003F = 63
The total number of sectors in the volume.
0x007FF54B = 8,385,867 Sectors = 4GB
Are you with me so far? Good! Now, Cylinder/Sector encoding can be a bit tricky, so let’s take a closer look.
Cylinder - Sector Encoding
Cylinder bits 1-8
9 & 10
As you can see, the Sector value occupies the lower 6 bits of the word and the Cylinder occupies the upper 10 bits of the word. In our example, the starting values for Cylinder and Sector are 01 00. Values in the MBR use reverse-byte ordering, which is also referred to as ‘little-endian’ notation. Therefore, we swap the bytes and find that our starting values are Cyl 0, Sec 1.
Our ending values are more interesting: BF 09. First, we swap the bytes and obtain a hex value of 0x09BF. This value in binary notation is 100110111111. The following table illustrates how we derive the correct partition table values from these bytes:
Example: BF 09
10 Cylinder value bits 1-8
The 6 low bits are all set to 1, therefore our Sector value is 111111 or 63. You can see above how the bits are arranged for the Cylinder value. The value above is 1000001001 (521). Since both Cylinder and Head values begin numbering at zero, we have a total of 522 Cylinders and 255 Heads represented here. This gives us an ending CHS value of: 522 x 255 x 63 = 8,385,930 sectors.
Subtracting the starting CHS address (Cylinder 0, Head 1, Sector 1) (63) gives us the total size of this partition: 8,385,867 sectors or 4GB. We can verify this number by comparing it to the Total Sectors represented in the partition table: 4B F5 7F 00. Applying reverse-byte ordering gives us 00 7F F5 4B which equals 8,385,867 sectors.
So, now that we have an understanding of what is contained within the structures on the disk, let’s look at the sequence of events that occur. Remember, this is just after POST has successfully completed.
1. The motherboard ROM BIOS attempts to access the first boot device specified in the BIOS. (This is typically user configurable and can be edited using the BIOS configuration utility.)
2. The ROM BIOS reads Cylinder 0, Head 0, and Sector 1 of the first boot device.
3. The ROM BIOS loads that sector into memory and tests it.
a. For a floppy drive, the first sector is a FAT Boot Sector.
b. For a hard drive, the first sector is the Master Boot Record (MBR).
4. When booting to the hard drive, the ROM BIOS looks at the last two signature bytes of Sector 1 and verifies they are equal to 55AA.
a. If the signature bytes do not equal 55AA the system assumes that the MBR is corrupt or the hard drive has never been partitioned. This invokes BIOS Interrupt 18, which displays an error that is BIOS-vendor specific such as “Operating System not found”.
b. If the BIOS finds that the last two bytes are 55AA, the MBR program executes.
5. The MBR searches the partition table for a boot indicator byte 0x80 indicating an active partition.
a. If no active partition is found, BIOS Interrupt 18 is invoked and a BIOS error is displayed such as “Operating System not found”.
b. If any boot indicator in the MBR has a value other than 0x80 or 0x00, or if more than one boot indicator indicates an active partition (0x80), the system stops and displays “Invalid partition table”.
c. If an active partition is found, that partition’s Boot Sector is loaded and tested.
6. The MBR loads the active partition’s Boot Sector and tests it for 55AA.
a. If the Boot Sector cannot be read after five retries, the system halts and displays “Error loading operating system”.
b. If the Boot Sector can be read but is missing its 55AA marker, “Missing operating system” is displayed and the system halts.
c. The bootstrap area is made up of a total of 16 sectors (0-15). If sector 0 for the bootstrap code is valid, but there is corruption in sectors 1-15, you may get a black screen with no error. In this case, it may be possible to transplant sectors 1-15 (not Sector 0) from another NTFS partition using DskProbe.
d. If the Boot Sector is loaded and it passes the 55AA test, the MBR passes control over to the active partition’s Boot Sector.
7. The active partition’s Boot Sector begins executing and looks for NTLDR. The contents of the Boot Sector are entirely dependent on the format of the partition. For example, if the boot partition is a FAT partition, Windows writes code to the Boot Sector that understands the FAT file system. However, if the partition is NTFS, Windows writes NTFS-capable code. The role of the Boot Sector code is to give Windows just enough information about the structure and format of a logical drive to enable it to read the NTLDR file from the root directory. The errors at this point of the boot process are file system specific. The following are possible errors with FAT and NTFS Boot Sectors.
a. FAT: In all cases it displays “press any key to restart” after either of the following errors.
(1) If NTLDR is not found “NTLDR is missing” is displayed
(2) If NTLDR is on a bad sector, “Disk Error” displays.
b. NTFS: In all cases it displays “Press CTRL+ALT+DEL to restart” after any of the following errors.
(1) If NTLDR is on a bad sector, “A disk read error occurred” is displayed. This message may also be displayed on Windows 2000 or higher systems if Extended Int13 calls are required, but have been disabled in the CMOS or SCSI BIOS. This behavior may also be seen if an FRS (File Record Segment) cannot be loaded or if any NTFS Metadata information is corrupt.
(2) If NTLDR is not found, “NTLDR is missing” displays.
(3) If NTLDR is compressed, “NTLDR is compressed” displays.
8. Once NTLDR is found, it is loaded into memory and executed. At this point the Boot Loader phase begins.
Troubleshooting the Initial Phase (Initial Disk Access)
If you are unable to boot Windows and the failure is occurring very early in the boot process, begin by creating a Windows boot floppy and using it to attempt to boot the system.
Q119467 How to Create a Bootable Disk for an NTFS or FAT Partition
If you are unable to boot to Windows NT with a floppy, skip to the section titled “Unable to boot using floppy”.
If you can successfully start the computer from the boot disk, the problem is limited to the Master Boot Record, the Boot Sector, or one or more of the following files: NTLDR, NTDetect.com, or Boot.ini. After Windows is running, you should immediately back up all data before you attempt to fix the Boot Sector or Master Boot Record.
1. Run a current virus scanning program to verify that no virus is present.
2. Select the method of troubleshooting based upon the error generated.
· Operating System not found
Typical BIOS-specific error indicating no 55AA signature on the MBR. This must be repaired manually using a disk editor such as DskProbe.
Warning: Running FixMBR or FDISK /MBR in this case will clear the partition table.
· Invalid Partition Table
May be repaired using DiskProbe – check for multiple 0x80 boot indicators.
· Error Loading Operating System
This error indicates an invalid Boot Sector. Use the “Repairing the Boot Sector” section below.
· Missing Operating System
This error indicates that the 55AA signature is missing from the Boot Sector. This may be replaced using DskProbe or by using the “Repairing the Boot Sector” section below.
· NTLDR is missing
Use the “Repairing the NTLDR file” section below.
· Disk error (FAT only)
Indicates that NTLDR is located on a bad sector.
· A disk read error occurred (NTFS only)
Indicates that NTLDR is on a bad sector. Use the “Repairing the NTLDR file” section below.
· NTLDR is compressed
Indicates that NTLDR is compressed. Uncompress or replace NTLDR. Note: This error is rare.
· Computer boots to a black screen with blinking cursor
Could be an issue with the MBR or the active partition’s boot code (Q228734). Use the appropriate section below.
Repairing the MBR
Use the FIXMBR command from the Windows Recovery Console to re-write the boot code located in the first 446 bytes of the MBR, but to leave the partition table intact.
326215 How To Use the Recovery Console on a Windows Server 2003-Based Computer That Does Not Start
WARNING: DO NOT use this method if you are receiving “Operating System not found” or other BIOS-specific error. FIXMBR will zero out the partition table if the 55AA signature is missing from the MBR. The data will be unrecoverable.
Repairing the Boot Sector
Use the FIXBOOT command from the Windows Recovery Console to write a new Boot Sector to the system partition.
If you are not able to repair the Boot Sector by using this method, you may repair it manually using the following knowledge base article:
Q153973 Recovering NTFS Boot Sector on NTFS Partitions
Repairing the NTLDR file
Use the Recovery Console to copy NTLDR from \i386 directory of the Windows CD-ROM to the root of the hard drive.
Unable to boot using floppy
If you are unable to start the computer with a boot floppy and you are not receiving an error message, then the problem is most likely with your partition tables. If invalid partitions are present and you are unable to start your computer with a boot disk, formatting the disk, reinstalling Windows and restoring from backup may be your only option. It may be possible to recreate the partition table using DskProbe if a backup of the partition information is available. It also may be possible to manually reconstruct the partition table however this is outside the scope of this blog post. We may cover this later.
If you do not have a current backup of the data on the computer, as a last resort a data recovery service may be able to recover some of your information.
Helpful knowledge base articles:
Q272395 Error Message: Boot Record Signature AA55 Not Found
Q155892 Windows NT Boot Problem: Kernel File Is Missing From the Disk
Q228004 Changing Active Partition Can Make Your System Unbootable
Q155053 Black Screen on Boot
Q314503 Computer Hangs with a Black Screen When You Start Windows
Q153973 Recovering NTFS Boot Sector on NTFS Partitions
Okay boys & girls, that’s it for now. Next time, I’ll continue with the boot sequence and cover the Boot Loader phase. This is where things really get interesting… J
My name is Ryan Mangipano (ryanman) and I am a Sr. Support Escalation Engineer at Microsoft. Today I will be blogging about how I used the SOS .Net Framework debugging extension (and !analyze -v) to easily troubleshoot a .Net Framework exception. This exception was preventing Event Viewer from displaying properly. Event Viewer was returning an error that provided very little information about what was actually causing the issue. I will demonstrate how, in this case, it was very easy to use windbg to obtain information about what went wrong. I did not have to perform a live debug on the issue. Instead, I used a process dump to obtain very exact information, which was returned by the debugger, relating to the root cause. I was then able to use Process Monitor to identify the file that needed to be examined. These actions led me to the source of the problem, which was easily corrected. Also, the issue discussed in this blog was easily reproduced on Windows Server 2008. This means that you should be able to practice this debug exercise, on your non-production Windows 2008 SP1 Server, for learning purposes if you are interested in doing so.
Issue Reported: The following error was encountered when opening eventvwr.msc (Event Viewer) on a Windows 2008 Server system:
"MMC could not create the snap-in."
MMC could not create the snap-in.
The snap-in might not have been installed correctly
Name: Event Viewer
First Step- Research & Data Gathering: After ensuring I first understood the problem reported, I searched for known issues. I found out that we have seen this error before. It may occur when the following registry key gets deleted or corrupted:
I had the customer export this key and found that it was not corrupted in any way. I verified that all data was as expected
Next, a memory dump of the mmc.exe process was collected. The mmc.exe process is used to host the eventvwr.msc snap-in. This was easily obtained using the built in Windows 2008 Server "Windows Task Manager" feature: "Create Dump File" . If you have several mmc console instances executing on your system, you can use the Task Manager context menu shortcuts "Switch To" and "Go To Process" to help you to identify the correct instance.
Note: We also collected a process monitor logfile during the startup of eventvwr.msc. This log file later proved very helpful in resolving the issue (as I will show below). Process monitor can be obtained at the following URL: http://technet.microsoft.com/en-us/sysinternals/bb896645.aspx
Now let's have a look at the debug.
1. First, I navigated Windows Explorer to the location of the dump file and double-clicked it to open it in windbg.exe.
It opened in windbg because I had previously run the command windbg -IA, which associates .dmp files with windbg. You can read more about the command line options in windbg in the help file that is included with the debugging tools.
2. I noticed the following output from the debugger after it loaded the dump file:
This dump file has an exception of interest stored in it.
The stored exception information can be accessed via .ecxr.
(ff8.a2c): CLR exception - code e0434f4d (first/second chance not available)
3. Next, I wanted to ensure my symbol path was set correctly. I could have set it using the .sympath command:
0:011> .sympath SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols
Symbol search path is: SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols
Expanded Symbol search path is: srv*c:\websymbols*http://msdl.microsoft.com/download/symbols
However, when your goal is to simply point to the default symbol server, .symfix is a very nice shortcut. It prevents one from having to try to remember the URL. Here’s the syntax:
0:011> .symfix c:\websymbols
4. To ensure that I didn't waste time reviewing the wrong data, I performed a quick check to ensure that we collected a dump of the requested snap-in.
PEB at 000007fffffdb000
CommandLine: '"C:\Windows\system32\mmc.exe" "C:\Windows\system32\eventvwr.msc" '
You could alternatively dump the CommandLine from the nt!_PEB using the dt command
0:005> dt nt!_PEB ProcessParameters->CommandLine 000007fffffdb000
+0x020 ProcessParameters :
+0x070 CommandLine : _UNICODE_STRING ""C:\Windows\system32\mmc.exe" "C:\Windows\system32\eventvwr.msc" "
5. Next, I then dumped out all of the threads in this process and found the following thread contained a stack that was raising a .Net Framework exception
0:011> ~* kL
... (ommitted the non-relevent threads)
# 11 Id: ff8.a2c Suspend: 1 Teb: 7ffd3000 Unfrozen
0691f03c 7343a91c kernel32!RaiseException+0x58
0691f09c 7343d81a mscorwks!RaiseTheExceptionInternalOnly+0x2a8
*** WARNING: Unable to verify checksum for MMCEx.ni.dll
0691f140 6bfe0b5a mscorwks!JIT_Rethrow+0xbf
*** WARNING: Unable to verify checksum for mscorlib.ni.dll
0691f1e8 69926cf6 MMCEx_ni+0xd0b5a
0691f1f4 6993019f mscorlib_ni+0x216cf6
0691f208 69926c74 mscorlib_ni+0x22019f
0691f220 733d1b4c mscorlib_ni+0x216c74
0691f230 733e21b1 mscorwks!CallDescrWorker+0x33
0691f2b0 733f6501 mscorwks!CallDescrWorkerWithHandler+0xa3
0691f3e8 733f6534 mscorwks!MethodDesc::CallDescr+0x19c
0691f404 733f6552 mscorwks!MethodDesc::CallTargetWorker+0x1f
0691f41c 7349d803 mscorwks!MethodDescCallSite::CallWithValueTypes+0x1a
0691f604 733f845f mscorwks!ThreadNative::KickOffThread_Worker+0x192
0691f618 733f83fb mscorwks!Thread::DoADCallBack+0x32a
0691f6ac 733f8321 mscorwks!Thread::ShouldChangeAbortToUnload+0xe3
0691f6e8 733f84ad mscorwks!Thread::ShouldChangeAbortToUnload+0x30a
0691f710 7349d5d4 mscorwks!Thread::ShouldChangeAbortToUnload+0x33e
6. Out of curiosity, I also ran the Get Last Error command
LastErrorValue: (Win32) 0 (0) - The operation completed successfully.
LastStatusValue: (NTSTATUS) 0 - STATUS_WAIT_0
7. After this, I ran analyze -v to see what helpful information the debugger would provide. The debugger did output exception information but informed me that I needed to use the x86 debugger instead.
0:011> !analyze -v
EXCEPTION_RECORD: ffffffff -- (.exr 0xffffffffffffffff)
ExceptionAddress: 771a42eb (kernel32!RaiseException+0x00000058)
ExceptionCode: e0434f4d (CLR exception)
Managed code needs matching platform of sos.dll for proper analysis. Use 'x86' debugger.
8. I fired up the x86 debugger and loaded the appropriate version of the SOS .Net Framework debugger extension. This extension ships in the Operating System along with the .Net Framework. On most occasions, I would have initiated the loading of the extension through the use of the following syntax:
0:011> .load C:\Windows\Microsoft.NET\Framework\v2.0.50727\sos.dll
0:011> .load c:\Windows\Microsoft.NET\Framework64\v2.0.50727\sos.dll
However, once you realize that managed debugging will be necessary and that you need the services of the SOS extension, it’s best to use the .loadby command rather than .load. This is due to the fact that the version of SOS must match the version of the CLR loaded into that process. Here’s the recommended syntax:
0:011>.loadby sos mscorwks
I always verify that my extensions are loaded properly by using the .chain command.
... Extension DLL chain:
C:\Windows\Microsoft.NET\Framework\v2.0.50727\sos.dll: image 2.0.50727.1434, API 1.0.0, built Wed Dec 05 22:42:38 2007
9. Running !help printed out the following helpful information about the SOS extension since sos.dll was at the top of the .chain output:
SOS is a debugger extension DLL designed to aid in the debugging of managed
programs. Functions are listed by category, then roughly in order of
importance. Shortcut names for popular functions are listed in parenthesis.
Type "!help <functionname>" for detailed info on that function.
Object Inspection Examining code and stacks
DumpObj (do) Threads
DumpArray (da) CLRStack
DumpStackObjects (dso) IP2MD
PrintException (pe) COMState
10. Using the exception address, displayed by the debugger when opening the dump, and the !pe command listed above, I obtained more information about the exception:
0:011> !pe 771a42eb
There are nested exceptions on this thread. Run with -nested for details
0:011> !pe -nested 771a42eb
Nested exception -------------------------------------------------------------
Exception object: 040a676c
Exception type: System.Reflection.TargetInvocationException
Message: Exception has been thrown by the target of an invocation.
InnerException: System.Reflection.TargetInvocationException, use !PrintException 040a6a20 to see more
SP IP Function
0:011> !PrintException 040a6a20
Exception object: 040a6a20
InnerException: System.Configuration.ConfigurationErrorsException, use !PrintException 040a6cf8 to see more
0:011> !PrintException 040a6cf8
Exception object: 040a6cf8
Exception type: System.Configuration.ConfigurationErrorsException
Message: Configuration system failed to initialize
InnerException: System.Configuration.ConfigurationErrorsException, use !PrintException 040a7174 to see more
0:011> !PrintException 040a7174
Exception object: 040a7174
Message: Unrecognized configuration section system.web/myInvalidData
11. Based on the exception information listed above, it appeared that a .Net Framework configuration section, system.web, contained an invalid configuration section named myInvalidData inside of it. I then re-ran !analyze -v against the dump again (now that I had loaded the x86 debugger) and found that !analyze -v will load the sos.dll extension and even run the !pe extension automatically. It then automatically displayed the exception record information for me as well. Also, notice that the thread listed by !analyze -v matches the thread I examined earlier.
0:011> !analyze -v
EXCEPTION_MESSAGE: Unrecognized configuration section system.web/myInvalidData.
0 Id: ff8.c84 Suspend: 1 Teb: 7ffdf000 Unfrozen
1 Id: ff8.96c Suspend: 1 Teb: 7ffde000 Unfrozen
2 Id: ff8.d10 Suspend: 1 Teb: 7ffdd000 Unfrozen
3 Id: ff8.d94 Suspend: 1 Teb: 7ffdc000 Unfrozen
4 Id: ff8.a14 Suspend: 1 Teb: 7ffda000 Unfrozen
5 Id: ff8.fbc Suspend: 1 Teb: 7ffd9000 Unfrozen
6 Id: ff8.f88 Suspend: 1 Teb: 7ffd8000 Unfrozen
7 Id: ff8.a64 Suspend: 1 Teb: 7ffd6000 Unfrozen
8 Id: ff8.bf8 Suspend: 1 Teb: 7ffd5000 Unfrozen
9 Id: ff8.d24 Suspend: 1 Teb: 7ffd4000 Unfrozen
10 Id: ff8.ff0 Suspend: 1 Teb: 7ffd7000 Unfrozen
. 11 Id: ff8.a2c Suspend: 1 Teb: 7ffd3000 Unfrozen
12. At this point I was interested in identifying the source of this unrecognized configuration. Instead of engaging our .Net support team, I started with a quick search using www.live.com for
"unrecognized configuration section" system.web site:microsoft.com
This returned the following results http://search.live.com/results.aspx?q=%22unrecognized+configuration+section%22+system.web+site%3Amicrosoft.com&form=QBRE
By quickly reviewing some of the hits returned, I found that others had encountered this exception in their own applications. This is due to invalid entries in the various .config files used in .Net. Looking through the posts, different configuration file names and paths were observed.
So, I opened up the process monitor logfile to see which configuration files we were reading data from. I added filter criterion to match entries from the mmc.exe process, the TID from the FAULTING_THREAD listed in the exception data, path data containing .config, and a successful status result. It's best to be as specific as possible.
I found that we were reading in a large amount of settings over and over again from the .net Framework global configuration file:
(on x64 this would be C:\Windows\Microsoft.NET\Framework64\v2.0.50727\CONFIG\machine.config)
Final Step- Putting it all together, Reproducing the issue, & confirming resolution : Using notepad, a quick search of the suspect xml file (C:\Windows\Microsoft.NET\Framework64\v2.0.50727\CONFIG\machine.config) on my system revealed a <system.web> section. At this point, I suspected that this section contained an invalid section which may have been related to the problem. To verify this, and since I like to break things, I added an invalid configuration setting <myInvalidData/> to my global configuration file. Doing so, I successfully reproduced the issue on my system. I then contacted the customer and asked if they had by any chance added any settings under the <system.web> in the configuration file: c:\Windows\Microsoft.NET\Framework\v2.0.50727\CONFIG\machine.config.
The customer informed me that, per the request of their ASP.net developer, they had in fact added settings to that section of the file. By researching http://msdn.microsoft.com/en-us/library/system.web.aspx and the schema documentation at http://msdn.microsoft.com/en-us/library/dayb112d.aspx, we were able to determine that the settings that were present in this file should not have been present inside of <system.web> . The settings were moved to the proper location per the developer and the issue was resolved.
Here are the steps I used to reproduce the issue in case you are attempting to replicate this at home-
A. Using notepad, open the following configuration file on a non-production Windows Server 2008 SP1 system:
(please make a backup copy first in case you make a mistake)
OR (Open the version that matches the architecture of your platform )
B. Find the section <system.web> in this file (you can use the find function in notepad):
C. Add the following line directly after <system.web> as shown in the example below:
D. Save the file and then open eventvwr.msc and verify that the following error is displayed:
Hopefully this blog has demonstrated an example of how you can use the "create dump file" feature of Windows 2008, windbg, and other related tools in an attempt to gain more specific data when your error message is not revealing the source of the problem. Feel free to post any questions, comments, or concerns.
Hello NTDebuggers, from time to time we see the following problem. It’s another access violation, and the debug notes below are from a minidump.
Here is what we need to know…
· Generally speaking what happened to cause this AV?
· What method you would use to isolate root cause of the failure?
There are a lot of ways to do this. We look forward to hearing your approach.
We will post our methods and answer at the end of the week. If you need anything please let us know.
Microsoft (R) Windows Debugger Version 6.8.0001.0
Copyright (c) Microsoft Corporation. All rights reserved.
Loading Dump File [D:\test123.dmp]
User Mini Dump File: Only registers, stack and portions of memory are available
0:000> k 123
0017f93c 75e4edb5 ntdll!ZwWaitForMultipleObjects+0x15
0017f9d8 75e430c3 kernel32!WaitForMultipleObjectsEx+0x11d
0017f9f4 75ef2084 kernel32!WaitForMultipleObjects+0x18
0017fa60 75ef22b1 kernel32!WerpReportFaultInternal+0x16c
0017fa74 75ebbe60 kernel32!WerpReportFault+0x70
0017fb00 7732d15a kernel32!UnhandledExceptionFilter+0x1c1
0017fb08 773000c4 ntdll!_RtlUserThreadStart+0x6f
0017fb1c 77361d05 ntdll!_EH4_CallFilterFunc+0x12
0017fb44 772eb6d1 ntdll!_except_handler4+0x8e
0017fb68 772eb6a3 ntdll!ExecuteHandler2+0x26
0017fc10 772cee57 ntdll!ExecuteHandler+0x24
0017fc10 10011127 ntdll!KiUserExceptionDispatcher+0xf
*** ERROR: Module load completed but symbols could not be loaded for crash3.exe
WARNING: Frame IP not in any known module. Following frames may be wrong.
0017ff40 0040104a 0x10011127
0017ffa0 75eb19f1 crash3+0x104a
0017ffac 7732d109 kernel32!BaseThreadInitThunk+0xe
0017ffec 00000000 ntdll!_RtlUserThreadStart+0x23
start end module name
00400000 0040d000 crash3 (no symbols)
6c250000 6c288000 odbcint (deferred)
6c290000 6c2f5000 odbc32 (deferred)
72a00000 72a86000 comctl32 (deferred)
74820000 749b4000 comctl32_74820000 (deferred)
75240000 75251000 samlib (deferred)
75260000 75281000 ntmarta (deferred)
754b0000 75510000 secur32 (deferred)
75510000 75570000 imm32 (deferred)
75700000 75790000 gdi32 (deferred)
757a0000 75870000 user32 (deferred)
758a0000 758a6000 nsi (deferred)
758b0000 759f4000 ole32 (deferred)
75a00000 75aaa000 msvcrt (deferred)
75ab0000 75ba0000 rpcrt4 (deferred)
75ba0000 75c1d000 usp10 (deferred)
75c20000 75c75000 shlwapi (deferred)
75d60000 75e27000 msctf (deferred)
75e30000 75f40000 kernel32 (pdb symbols)
76140000 76189000 Wldap32 (deferred)
76190000 7624f000 advapi32 (deferred)
76250000 76d1e000 shell32 (deferred)
76d20000 76d94000 comdlg32 (deferred)
76da0000 76dcd000 ws2_32 (deferred)
77280000 77287000 psapi (deferred)
77290000 77299000 lpk (deferred)
772b0000 77400000 ntdll (pdb symbols)
Good luck and happy debugging.
[Update: our answer. Posted 5/13/2008]
We enjoyed seeing different people’s approaches on this week’s puzzler. This was a simple module unload. We loaded a lib, did a GetProcAddress, freed the lib, and called the function. The dump was a mini dump created via .dump /m C:\dump file. There are various ways this type of scenario may arise. Obviously someone could unload a lib, but why? In most cases I’ve seen, it was due to a ref count problem in a com object. Poor accounting leading to one too many decrements, and the dll will get unloaded causing a simple crash footprint.
There are quite a few ways to track this down. First of all, if you had the debugger attached and got a full dump or /ma dump you would have seen the loaded module list. This would have been a dead giveaway and part of why we did the .dump /m. There are other options you can enable that make tracking of module loads easy under the debugger. I personally like “loader snaps” if I’m trying to track down module load shenanigans. To enable this, just go into the image section of the gflags tool and enable loader snaps for the exe in question. Now attach a debugger and watch the mode load and GetProcAddress details scroll by.
Yet another popular approach is to use process monitor. This tool is not only easy to set up, but it also gives you great logs with call stacks and other details such as registry accesses.
This puzzler provided the bare minumum data required. We did not give you much to go on because sometimes in real debugging scenarios you have to work with a lack of data. I really liked how many people questioned the source of the dump file. It really shows how familiar you all are with the various dump types.
Greetings NTDEBUGGING community, Bob here again, and today I would like to let everyone know about a new feature implemented in the STORPORT.SYS binaries released in the following KB articles:
· KB 979764 (Windows Server 2008)
· KB 978000 (Windows Server 2008 R2)
Beginning with these new versions, it is now possible to measure timing statistics for requests made to a system’s disk unit. These measurements are taken at the lowest possible level of OS interaction with the storage adapter hardware, making it much easier to diagnose storage performance issues. The measurements are taken from the port driver, STORPORT.SYS, via ETW (Event Tracing for Windows). Windows drivers are implemented in a layered architecture, so in our case the STORPORT driver interfaces directly with the adapter or miniport driver. The STORPORT driver is responsible for managing requests and queues, and providing other upper level functions. Each miniport driver is specific to a particular storage adapter card, and knows how to deliver requests to the disk unit over the transport medium, like fibre channel for instance.
Configuring and Starting STORPORT ETW Logging
To measure storage performance, the user enters a threshold value via the GUI interface. (This can be scripted as well; see KB 978000 for details.) When an I/O time is equal to or greater than the entered threshold, an event 13 is written to the ETW log.
Here is a walkthrough to enable this feature via the GUI:
Click Start, type “perfmon”, and press ENTER to start Performance Monitor.
Expand “Data Collector Sets” and select “Event Trace Sessions”.
Right-Click on “Event Trace Sessions” and select “New -> Data Collector Set”.
The following dialog will appear:
Give the new data collector set a name in the dialog box. In this example I called it “Storport Performance Logging”. Choose the “Create manually (Advanced)” option then click Next to see the following dialog.
After clicking “Add...” on the dialog above the following list of providers will appear. Select “Microsoft-Windows-StorPort” and click OK.
When the dialog below opens, select “Keywords (Any)” then click Edit.
Check the box for IOPerfNotification, and then click OK. Note: For Windows Server 2008 SP1 and SP2, some properties may not be present (e. g. Reset, PortNotification and PauseResume).
Next, select “Filter” and click Edit…
This is where we enter our time threshold. Remember that timing values greater than or equal to this filter value will be logged to the trace file.
Select “Binary” and check the “Filter enabled” box
In the “Filter data” field, enter the threshold value in the number of milliseconds. This must be entered in little endian format. Please refer to this table for the most common entries.
01 00 00 00 00 00 00 00
05 00 00 00 00 00 00 00
0A 00 00 00 00 00 00 00
0F 00 00 00 00 00 00 00
Note: The “Filter type” value will always remain 0 as in the example above showing an entry for 1 ms.
WARNING - The whole data line needs to be filled in when entering the threshold. For demonstration purposes here is the WRONG WAY to enter 256 ms. This will not work because of the missing 00 entries on the line.
Below is the correct way to enter the value (256ms) Note: The Filter value may have to be re-entered for next run because it could get reset.
After clicking OK, you will see a new Event Trace session (screenshot below). Right-Click the new Event Trace session and select “Properties”
Next we need to set a location for the trace file. Ensure you have sufficient disk space in the target folder because logs can be many Megabytes in size depending on the number of hits.
The target folder may be changed in the “Directory” tab
I recommend choosing a directory that is more easily located, like C:\PerfLogs, in which case you can click “Browse”
Note: It is not recommended to store the trace file on the same storage you are troubleshooting.
The name for the file is selected on the “File” tab.
When you are done setting the file and directory information click OK.
Now right-click the log and select “Start” to begin tracing.
The “Status” should indicate that it is running.
How long to run the log depends on you issue. You should let the log run until the issue or slow down occur then check the log for delays.
Viewing the Log File
First right-click and select “Stop” to flush any cached data and stop the logging.
Then use Event Viewer to examine the log.
Next choose "Open the saved log…"
After clicking OK a popup dialog will be displayed to ask if you want to convert the log. Select “No”. Here is some sample event data that you will likely see in the log.
Example: Event ID 13
Let's examine the “Details” tab which shows information about the I/O that exceeded or equaled the threshold we had set. If you do not set a threshold, then all I/O events are logged, but for performance reasons we recommend setting a "reasonable" threshold like 15ms.
Breaking down the Event Data we have …
Miniport Extension: (Kernel address of the miniport extension, not of use to performance logging, only debugging)
PortNumber: This is the adapter port number (i.e. RaidPort1, etc.)
PathID: Path number if multi-pathing is in place
TargetID: Target LUN exposed to the Operating System
LUN: The Logical Unit Number of the physical storage where the particular I/O is directed
*RequestDuration: How long the request took in milliseconds (the number of note here for storage latency diagnostics)
Command: the SCSI command sent to the unit. (Usually read or write) This is listed in decimal. Convert this value to hexadecimal with Calc.exe for referencing SCSI commands http://en.wikipedia.org/wiki/SCSI_command
ScsiStatus: the status that came back from the drive
SRBStatus: the status as returned from adapter (see srb.h and scsi.h in the Microsoft WDK or http://en.wikipedia.org/wiki/SCSI_Status_Code)
Example: Command 42 equals 2A which is a Write of 10 bytes.
Here’s the respective SCSI command:
When viewing the log, the commands will be mostly reads and writes. If I/O is determined to be slow at this level, then the issue is most likely disk or SAN related.
Hope you enjoyed the walkthrough!
My name is Marion Cole, and I am a Sr. EE in Microsoft Platforms Serviceability group. You may be wondering why Microsoft support would need to know ARM assembly. Doesn’t Windows only run on x86 and x64 machines? No. Windows has ran on a variety of processors in the past. Those include i860, Alpha, MIPS, Fairchild Clipper, PowerPC, Itanium, SPARC, 286, 386, IA-32, x86, x64, and the newest one is ARM. Most of these processors are antiquated now. The common ones now are IA-32, x86, x64. However Windows has started supporting ARM processors in order to jump into the portable devices arena. You will find them in the Microsoft Surface RT, Windows Phones, and other things in the future I am sure. So you may be saying that these devices are locked, and cannot be debugged. That is true from a live debug perspective, but you can get memory dumps and application dumps from them and those can be debugged.
There are limitations on ARM processors that Windows supports. There are 3 System on Chip (SOC) vendors that are supported. nVidia, Texas-Instruments, and Qualcomm. Windows only supports the ARMv7 (Cortex, Scorpion) architecture in ARMv7-A in (Application Profile) mode. This implements a traditional ARM architecture with multiple modes and supporting a Virtual Memory System Architecture (VMSA) based on an MMU. It supports the ARM and Thumb-2 instruction sets which allows for a mixture of 16 (Thumb) and 32 (ARM) bit opcodes. So it will look strange in the assembly. Luckily the debuggers know this and handle it for you. This also helps to shrink the size of the assembly code in memory. The processor also has to have the Optional ISA extensions of VFP (Hardware Floating Point) and NEON (128-bit SIMD Architecture).
In order to understand the assembly that you will see you need to understand the processor internals.
ARM is a Reduced Instruction Set Computer (RISC) much like some of the previous processors that Windows ran on. It is a 32 bit load/store style processor. It has a “Weakly-ordered” memory model: similar to Alpha and IA64, and it requires specific memory barriers to enforce ordering. In ARM devices these as ISB, DSB, and DMB instructions.
The processor has 16 available registers r0 – r15.
0: kd> r
r0=00000001 r1=00000000 r2=00000000 r3=00000000 r4=e1820044 r5=e17d0580
r6=00000001 r7=e17f89b9 r8=00000002 r9=00000000 r10=1afc38ec r11=e1263b78
r12=e127813c sp=e1263b20 lr=e16c12c3 pc=e178b6d0 psr=00000173 ----- Thumb
r0, r1, r2, r3, and r12 are volatile registers. Volatile registers are scratch registers presumed by the caller to be destroyed across a call. Nonvolatile registers are required to retain their values across a function call and must be saved by the callee if used.
On Windows four of these registers have a designated purpose. Those are:
In Windbg all but r11 will be labeled appropriately for you. So you may be asking why r11 is not labeled “fp” in the debugger. That is because r11 is only used as a frame pointer when you are calling a non-leaf subroutine. The way it works is this: when a call to a non-leaf subroutine is made, the called subroutine pushes the value of the previous frame pointer (in r11) to the stack (right after the lr) and then r11 is set to point to this location in the stack, so eventually we end up with a linked list of frame pointers in the stack that easily enables the construction of the call stack. The frame pointer is not pushed to the stack in leaf functions. Will discuss leaf functions later.
CPSR (Current Program Status Register)
Now we need to understand some about the CPSR register. Here is the bit breakdown:
So why do I need to know about the CPSR (Current Program Status Register)? You will need to know where some of these bits are due to how some of the assembly instruction affect these flags. Example of this is:
ADD will add two registers together, or add an immediate value to a register. However it will not affect the flags.
ADDS will do the same as ADD, but it does affect the flags.
MOV will allow you to move a value into a register, and a value between registers. This is not like the x86/x64. MOV will not let you read or write to memory. This does not affect the flags.
MOVS does the same thing as MOV, but it does affect the flags.
I hope you are seeing a trend here. There are instructions that will look the same. However if they end in “S” then you need to know that this will affect the flags. I am not going to list all of those assembly instructions here. Those are already listed in the ARM Architecture Reference Manual ARMv7-A and ARMv7-R edition at http://infocenter.arm.com/help/topic/com.arm.doc.ddi0406b/index.html.
So now we have an idea of what can set the flags. Now we need to understand what the flags are used for. They are mainly used for branching instructions. Here is an example:
003a11d2 429a cmp r2,r3
003a11d4 d104 bne |MyApp!FirstFunc+0x28 (003a11e0)|
The first instruction in this code (cmp) compares the value stored in register r2 to the value stored in register r3. This comparison instruction sets or resets the Z flag in the CPSR register. The second instruction is a branch instruction (b) with the condition code ne which means that if the result of the previous comparison was that the values are not equal (the CPSR flag Z is zero) then branch to the address MyApp!FirstFunc+0x28 (003a11e0). Otherwise the execution continues.
There are a few compare instructions. “cmp” subtracts two register values, sets the flags, and discards the result. “cmn” adds two register values, sets the flags, and discards the results. “tst” does a bit wise AND of two register values, sets the flags, and discards the results. There is even an If Then (it) instruction. I am not going to discuss that one here as I have never seen it in any of the Windows code.
So is “bne” the only branch instruction? No. There is a lot of them. Here is a table of things that can be seen beside “b”, and what they check the CPSR register:
Condition Flags (in CPSR)
Positive or Zero (Plus)
C==1 and Z==0
Unsigned lower or same
C==0 or Z==1
Signed greater than or equal
Signed less than
Signed greater than
Z==0 and N==V
Signed less than or equal
Z==1 or N!=V
Floating Point Registers
As mentioned earlier the processor also has to have the ISA extensions of VFP (Hardware Floating Point) and NEON (128-bit SIMD Architecture). Here is what they are.
As you can see this is 16 – 64bit regiters (d0-d15) that is overlaid with 32 – 32bit registers (s0-s31). There are varieties of the ARM processor that has 32 – 64bit registers and 64 – 32bit registers. Windows 8 will support both 16 and 32 register variants. You have to be careful when using these, because if you access unaligned floats you may cause an exception.
As you can see here the SIMD (NEON) extension adds 16 – 128 bit registers (q0-q15) onto the floating point registers. So if you reference Q0 it is the same as referencing D0-D1 or S0-S1-S2-S3.
In part 2 we will discuss how Windows utilizes this processor.
Hello my name is Bob Golding and I would like to share information on a new error you may see in the system event log. It is Event ID 157 "Disk <n> has been surprise removed" with Source: disk. This error indicates that the CLASSPNP driver has received a “surprise removal” request from the plug and play manager (PNP) for a non-removable disk.
What does this error mean?
The PNP manager does what is called enumerations. An enumeration is a request sent to a driver that controls a bus, such as PCI, to take an inventory of devices on the bus and report back a list of the devices. The SCSI bus is enumerated in a similar manner, as are devices on the IDE bus.
These enumerations can happen for a number of reasons. For example, hardware can request an enumeration when it detects a change in configuration. Also a user can initiate an enumeration by selecting “scan for new devices” in device manager.
When an enumeration request is received, the bus driver will rescan the bus for all devices. It will issue commands to the existing devices as though it was looking for new ones. If these commands fail on an existing unit, the driver will mark the device as “missing”. When the device is marked “missing”, it will not be reported back to PNP in the inventory. When PNP determines that the device is not in the inventory it will send a surprise removal request to the bus driver so the bus driver can remove the device object.
Since the CLASSPNP driver sits in the device stack and receives requests that are destined for disks, it sees the surprise removal request and logs an event if the disk is supposed to be non-removable. An example of a non-removable disk is a hard drive on a SCSI or IDE bus. An example of a removable disk is a USB thumb drive.
Previously nothing was logged when a non-removable disk was removed, as a result disks would disappear from the system with no indication. The event id 157 error was implemented in Windows 8.1 and Windows Server 2012 R2 to log a record of a disk disappearing.
Why does this error happen?
These errors are most often caused when something disrupts the system’s communication with a disk, such as a SAN fabric error or a SCSI bus problem. The errors can also be caused by a disk that fails, or when a user unplugs a disk while the system is running. An administrator that sees these errors needs to verify the heath of the disk subsystem.
Event ID 157 Example:
Hello my name is Bob Golding and I would like to share with you a new event that you may see in the system event log. Event ID 153 is an error associated with the storage subsystem. This event was new in Windows 8 and Windows Server 2012 and was added to Windows 7 and Windows Server 2008 R2 starting with hot fix KB2819485.
An event 153 is similar to an event 129. An event 129 is logged when the storport driver times out a request to the disk; I described event 129 messages in a previous article. The difference between a 153 and a 129 is that a 129 is logged when storport times out a request, a 153 is logged when the storport miniport driver times out a request. The miniport driver may also be referred to as an adapter driver or HBA driver, this driver is typically written the hardware vendor.
Because the miniport driver has a better knowledge of the request execution environment, some miniport drivers time the request themselves instead of letting storport handle request timing. This is because the miniport driver can abort the individual request and return an error rather than storport resetting the drive after a timeout. Resetting the drive is disruptive to the I/O subsystem and may not be necessary if only one request has timed out. The error returned from the miniport driver is bubbled up to the class driver who can log an event 153 and retry the request.
Below is an example event 153:
This error means that a request failed and was retried by the class driver. In the past no message would be logged in this situation because storport did not timeout the request. The lack of messages resulted in confusion when troubleshooting disk errors because timeouts would occur but there would be no evidence of the error.
The details section of the event the log record will present what error caused the retry and whether the request was a read or write. Below is the details output:
In the example above at byte offset 29 is the SCSI status, at offset 30 is the SRB status that caused the retry, and at offset 31 is the SCSI command that is being retried. In this case the SCSI status was 00 (SCSISTAT_GOOD), the SRB status was 09 (SRB_STATUS_TIMEOUT), and the command was 28 (SCSIOP_READ).
The most common SCSI commands are:
SCSIOP_READ - 0x28
SCSIOP_WRITE - 0x2A
The most common SRB statuses are below:
SRB_STATUS_TIMEOUT - 0x09
SRB_STATUS_BUS_RESET - 0x0E
SRB_STATUS_COMMAND_TIMEOUT - 0x0B
A complete list of SCSI operations and statuses can be found in scsi.h in the WDK. A list of SRB statuses can be found in srb.h.
The timeout errors (SRB_STATUS_TIMEOUT and SRB_STATUS_COMMAND_TIMEOUT) indicate a request timed out in the adapter. In other words a request was sent to the drive and there was no response within the timeout period. The bus reset error (SRB_STATUS_BUS_RESET) indicates that the device was reset and that the request is being retried due to the reset since all outstanding requests are aborted when a drive receives a reset.
A system administrator who encounters event 153 errors should investigate the health of the computer’s disk subsystem. Although an occasional timeout may be part of the normal operation of a system, the frequent need to retry requests indicates a performance issue with the storage that should be corrected.
We’ve been thinking about hosting a live group chat session to talk with the debugging community. If we had such a chat, the discussion would focus on debugging techniques and any questions you may have about anything we’ve previously blogged about. If you’d be interested in participating in a chat session, please answer the survey question below. Also, feel free to leave a comment on this page if there’s a particular debugging topic you would like for us to cover in the chat.
Thanks everyone for your feedback! This survey is now closed.
Hello, my name is Louis Shanks and my story starts off with a tricky little deadlock issue that I worked on recently. Let me set the stage by explaining the surface level symptom, and then we will get to the good stuff. Randomly when users running Vista would attempt to launch Internet Explorer nothing would happen. So they would again double-click the icon, and still nothing happened. So the users would reboot and go back to their web activities.
After some basic trouble shooting it was discovered that Iexplore.exe was running multiple times and every attempt to open the browser would result in another Iexplore.exe instance. Since no UI was ever presented I went ahead and collected a few Iexplore.exe process dumps along with a complete system dump.
The process dumps didn’t prove to be terribly useful so I decided to move over to the complete memory dump.
Since I knew we had multiple IExplore.exe processes running I wanted to investigate them first.
0: kd>!process 0 0
PROCESS 87289d90 SessionId: 5 Cid: 0fdc Peb: 7ffdc000 ParentCid: 0968
DirBase: 7d66c520 ObjectTable: a203a178 HandleCount: 205.
PROCESS 861a9d90 SessionId: 5 Cid: 0d34 Peb: 7ffd9000 ParentCid: 0fdc
DirBase: 7d66c7a0 ObjectTable: a5651a98 HandleCount: 170.
PROCESS 86c175b0 SessionId: 5 Cid: 1250 Peb: 7ffd9000 ParentCid: 0968
DirBase: 7d66c940 ObjectTable: a0853f10 HandleCount: 91.
PROCESS 861ac7a8 SessionId: 5 Cid: 1024 Peb: 7ffde000 ParentCid: 0968
DirBase: 7d66c620 ObjectTable: a83f7898 HandleCount: 91.
I decided to just start with the first one in the list (87289d90)
0: kd> !process 87289d90 Image: iexplore.exe
It appears this process has been around a while, so I wanted to look at the first thread in this process as it is responsible for the UI in Internet Explorer. I am looking at this thread because the surface level problem illustrates that when we launch Iexplore.exe the UI is never displayed. Here I used the !thread command to display the specific thread.
THREAD 870b6398 Cid 0fdc.16a4 Teb: 7ffdf000 Win32Thread: fd850b78 WAIT: (UserRequest) UserMode Non-Alertable
Attached Process 87289d90 Image: iexplore.exe
Wait Start TickCount 10817197 Ticks: 69571 (0:00:18:05.314)
Based on the tick count this thread has been waiting for 18 minutes.
0: kd> kvn
# ChildEBP RetAddr Args to Child
00 931208d8 81a940a1 870b6398 803d2120 870b6420 nt!KiSwapContext+0x26
01 9312091c 81a8dacc 870b6398 00000000 00000002 nt!KiSwapThread+0x44f
02 93120970 81bedb4f 00000002 93120aa8 00000001 nt!KeWaitForMultipleObjects+0x53d
03 93120bfc 81bee2e3 00000002 00000001 00000000 nt!ObpWaitForMultipleObjects+0x256
04 93120d48 81a30b3a 00000002 0027e9c4 00000001 nt!NtWaitForMultipleObjects+0xcc
05 93120d48 77568254 00000002 0027e9c4 00000001 nt!KiFastCallEntry+0x12a
06 0027e970 775679fc 7708c973 00000002 0027e9c4 ntdll!KiFastSystemCallRet
07 0027e974 7708c973 00000002 0027e9c4 00000001 ntdll!NtWaitForMultipleObjects+0xc
08 0027ea10 7608ff95 0027e9c4 0027ea38 00000000 kernel32!WaitForMultipleObjectsEx+0x11d
09 0027ea64 773d3a46 00000028 0027eaac ffffffff USER32!RealMsgWaitForMultipleObjectsEx+0x13c
0a 0027ea8c 773d3b7f 0027eaac ffffffff 0027eabc ole32!CCliModalLoop::BlockFn+0x97
0b 0027eab4 774e75be ffffffff 003ddf90 0027ebc0 ole32!ModalLoop+0x5b
0c 0027ead0 774e9294 00000000 0027ebd4 00000000 ole32!ThreadSendReceive+0x12c
0d 0027eaf8 774e9100 0027ebc0 003d3538 0027ec1c ole32!CRpcChannelBuffer::SwitchAptAndDispatchCall+0x194
0e 0027ebd8 773d3c25 003d3538 0027ecfc 0027ece4 ole32!CRpcChannelBuffer::SendReceive2+0xef
0f 0027ebf4 773d3bd7 0027ecfc 0027ece4 003d3538 ole32!CCliModalLoop::SendReceive+0x1e
10 0027ec6c 77403efb 003d3538 0027ecfc 0027ece4 ole32!CAptRpcChnl::SendReceive+0x73
11 0027ecc0 76052ff4 003d3538 0027ecfc 0027ece4 ole32!CCtxComChnl::SendReceive+0x1c5
12 0027ecd8 76053055 003e50a4 0027ede4 76052e0d RPCRT4!NdrProxySendReceive+0x43
13 0027ece4 76052e0d 09ea0d59 0027f1fc 070001f3 RPCRT4!NdrpProxySendReceive+0xc
Based on the call stack it appears this thread made an OLE call and is now waiting for some objects to signal.
What is IExplore doing in this thread? It is waiting on another process to service this OLE request. We can find the PID of the other process by digging into the ole32 SendReceive call.
I looked at the first argument:
0: kd> dd 003d3538
003d3538 77403a50 773f57b4 00000003 00000002
003d3548 00000000 00000000 003d2ef0 003d8840
003d3558 0038b1c0 003e05f8 77403980 00070005
This gives us a pointer to another data structure that contains the information we are attempting to locate.
0: kd> dd 003d2ef0
003d2ef0 774ee600 003d2e70 00000d34 00000000
003d2f00 862941fa c94cde09 55a88424 801a2601
003d2f10 55a88424 801a2601 0000c400 00000d34
Ah there it is! 0d34 is the process id servicing our OLE request.
I went back and reviewed the !process 0 0 output to locate the process containing PID 0D34. I’m dumping it here.
PROCESS 861a9d90 SessionId: 5 Cid: 0d34 Peb: 7ffd9000 ParentCid: 0fdc
Ok, so the process handing this request is IEUSER.exe, which is the broker process used by Internet Explorer in Vista when running in protect mode. This is new in Vista.
More on IEUser and Protected mode can be seen at the following location:http://msdn.microsoft.com/en-us/library/bb250462.aspx
Now I needed to dig into the IEUser.exe process to find out what it was working on causing Iexplore to wait.
0: kd> !process 861a9d90
I discovered the thread in the ieuser.exe process working on our call from Iexplore.exe. However this thread is also waiting on access to something before it can complete.
0: kd> kvn+200
# ChildEBP RetAddr Args to Child
00 a41b0c50 81a940a1 85ecbac0 85ecbb48 81af9920 nt!KiSwapContext+0x26
01 a41b0c94 81a2ddc8 85ecbac0 00000000 85e18510 nt!KiSwapThread+0x44f
02 a41b0ce8 81bf0d9f 85e18510 00000006 00000001 nt!KeWaitForSingleObject+0x492
03 a41b0d50 81a30b3a 00000224 00000000 00000000 nt!NtWaitForSingleObject+0xbe
04 a41b0d50 77568254 00000224 00000000 00000000 nt!KiFastCallEntry+0x12a
05 0163e8a4 77567a0c 7708c7c4 00000224 00000000 ntdll!KiFastSystemCallRet
06 0163e8a8 7708c7c4 00000224 00000000 00000000 ntdll!NtWaitForSingleObject+0xc
07 0163e918 7708c732 00000224 ffffffff 00000000 kernel32!WaitForSingleObjectEx+0xbe
08 0163e92c 747e60aa 00000224 ffffffff 00000000 kernel32!WaitForSingleObject+0x12
09 0163e948 747e5fe0 0163ec48 00497b38 00000000 RASAPI32!ReadPhonebookFileEx+0x1a4
0a 0163e968 747e4e9d 0163ec48 00000000 00000000 RASAPI32!ReadPhonebookFile+0x1b
0b 0163e9a8 747e52d3 0163ec48 0048f660 0163e9e0 RASAPI32!DwEnumEntriesFromPhonebook+0x35
0c 0163ee58 747e4fd4 0163ee9c 00000001 0048f660 RASAPI32!DwEnumEntriesInDir+0x1cc
0d 0163f0ac 747e677e 00000001 0048f660 0163f0f4 RASAPI32!DwEnumEntriesForPbkMode+0xab
0e 0163f0e0 76e534b9 00000000 00004180 0048f660 RASAPI32!RasEnumEntriesW+0xb0
0f 0163f10c 76e5342a 76ecc2dc 00000000 00000000 WININET!RasEnumHelp::RasEnumHelp+0x55
10 0163f120 76e533a3 00000001 00000000 00000001 WININET!DoConnectoidsExist+0x2b
11 0163f14c 76e534ff 76ecf02c 76ecc2dc 00000000 WININET!GetRasConnections+0x34
12 0163f168 76e4196e 00000000 0163f180 6ee3228c WININET!IsDialUpConnection+0xa9
13 0163f184 76e70564 00000001 00000000 00000000 WININET!FixProxySettingsForCurrentConnection+0x31
14 0163f1a8 6ec2db3b 00000000 0163f1bc 0163f3d4 WININET!InternetInitializeAutoProxyDll+0x6c
15 0163f1b8 75fe9142 004884a0 00000202 0047e900 ieframe!CShdocvwBroker::StartAutoProxyDetection+0x4d
16 0163f1d0 76054294 6ec2daf8 0163f3d8 00000001 RPCRT4!Invoke+0x2a
17 0163f5fc 76054a49 004858d0 0046e4e0 0047e900 RPCRT4!NdrStubCall2+0x27b
18 0163f64c 774e86d7 004858d0 0047e900 0046e4e0 RPCRT4!CStdStubBuffer_Invoke+0xa0
19 0163f694 774e877d 0047e900 0047f968 00480b38 ole32!SyncStubInvoke+0x3c
1a 0163f6e0 7740249e 0047e900 0047f860 004858d0 ole32!StubInvoke+0xb9
1b 0163f7bc 774023af 0046e4e0 00000000 004858d0 ole32!CCtxComChnl::ContextInvoke+0xfa
1c 0163f7d8 774e85df 0047e900 00000001 004858d0 ole32!MTAInvoke+0x1a
1d 0163f808 774e8a18 d0908070 0046e4e0 004858d0 ole32!AppInvoke+0xaa
1e 0163f8e4 774e85ef 0047e8a8 00474118 0047ebc0 ole32!ComInvokeWithLockAndIPID+0x32c
1f 0163f930 75fe94b0 0047ebc0 08ac70b7 0047ebc0 ole32!ThreadInvoke+0x2fd
I’m dumping the first parameter passed to KeWaitForSingleObject to determine the object type.
0: kd> !object 85e18510
Object: 85e18510 Type: (85484d40) Mutant
ObjectHeader: 85e184f8 (old version)
HandleCount: 3 PointerCount: 5
Directory Object: 881f3030 Name: RasPbFile
Ok, we are waiting on a Mutex named RasPbFile. The big question in your mind now is, “who owns this lock”, right?
To determine this I’ll dump the structure for the mutex object:
0: kd> dt _KMUTANT 85e18510
+0x000 Header : _DISPATCHER_HEADER
+0x010 MutantListEntry : _LIST_ENTRY [ 0x870b6564 - 0x870b6564 ]
+0x018 OwnerThread : 0x870b6398 _KTHREAD
+0x01c Abandoned : 0 ''
+0x01d ApcDisable : 0 ''
The owner of the mutex is 870b6398. The bad news is this is the same thread I started with in the Iexplore.exe process so we’re looking at a classic deadlock scenario. Thread 870b6398 running under Iexplore.exe has made an OLE call over to the IEuser.exe broker process. The IEuser process needs the mutex owned by the thread that called it in the first place.This turned out to be a bug in the RasApi code, not the browser or IEuser process. The RasAPI code was written in such a way that it would wait for an infinite period of time to gain the Mutex.
0163e92c 747e60aa 00000224 ffffffff 00000000 kernel32!WaitForSingleObject+0x12
The good news is that we have created hotfix 951738 to address this issue. The official KB article is still pending.
Hello - Matthew here again. Today I'll be discussing in detail hang scenario #1 that Tate first mentioned a few blogs posts ago. From a debugging perspective, in an ideal world an application would always provide some kind of feedback when a failure occurs. The reality is that sometimes an application just doesn't do anything in response to a user-initiated action (such as a mouse-click) that results in a failure of some kind. When this happens, how can you determine what is going wrong?
In general, Process Monitor from Sysinternals is a great tool for gathering clues about this kind of failure. If the failure is due to a file system or registry problem, Process Explorer will catch the problematic event. Or if you suspect that the problem is due to something failing over the network, a network trace could help.
Let's assume that the techniques described above do not move us any closer to understanding the failure. What next? If you don't know what the app does in the background, and you don't have source code for the app, how do you proceed? Let's answer those questions by debugging a sample application...
Download the sample application here.
Here's what we know:
1. Clicking the “Button 1” button is supposed to display a useful dialog box
2. For certain users, clicking “Button 1” results in no observable response
3. You don't have access to the source code or debug symbols for the app
4. No one seems to know what the “Button 1” button does behind the scenes
5. The developer of the application hasn't been heard from in 3 years
We need to understand what exactly happens when the user clicks the “Button 1” button. Every window has a WindowProc function that receives messages upon user input. Since a button is considered a "control", the WindowProc for the main application window will get a WM_COMMAND message indicating that the button was clicked. To differentiate which specific control sent the message, each WM_COMMAND message has a control identifier associated with it.
So we need to:
a) Find the control identifier for “Button 1”
b) Find the WindowProc for the main application window
c) Determine where WM_COMMAND messages for the “Button 1” button are handled
d) Figure out what is failing in that code
Let’s get started...
Find the control identifier for “Button 1”
Spy++ (available as part of Visual Studio) is the tool for this job. Click Spy -> Log Messages. Use the Finder Tool to select the main window of ntdbghang1.exe. On the messages tab, click “Clear All” then select “WM_COMMAND”. Click “OK” and SPY++ will start logging. Click “Button 1” in ntdbghang1.exe and you should see one line of text appear in the output window. For good measure, click “Button 2” also, and you should see a second line of text. At this point, your SPY++ window should look something like this....
We can tell from this output, that the control identifier for “Button 1” is 257 (0x101), and the control identifier for “Button 2” is 258 (0x102). We’ll need this information later.
Find the WindowProc for the main application window
We can find the WindowProc address by using SPY++ again. Click Spy -> Find Window and use the Finder Tool to select the main window of ntdbghang1.exe. Make sure “Show properties” is selected, and then click “OK”. You will get a dialog that looks like this....
Note the value of Window Proc. This is the address of the window procedure for the main window (the exact value may differ on your system). We know that this function is a WindowProc, which means it is of the form...
LRESULT CALLBACK WindowProc(
The meaning of wParam and lParam differ depending on what uMsg is being processed. When a WM_COMMAND message is passed, the low-word of wParam is the control identifier, which we determined to be 0x101. The high-word of wParam is the control notification code, which in the case of a button being clicked is BN_CLICKED (literally 0).
So, we specifically are interested in the case where:
uMsg = WM_COMMAND (literally 0x111)
wParam = 0x101
While looking at assembly language for the WindowProc, the stack frame will look like this...
ebp = “old ebp”
ebp+4 = “return address”
ebp+8 = hwnd
ebp+c = uMsg
ebp+10 = wParam
ebp+14 = lParam
Determine where WM_COMMAND messages for the “Button 1” button are handled
Once you have the address of the WindowProc, we can examine the assembly code for that function using Windbg. Launch windbg.exe and select “File -> Attach to a Process”. Select “ntdbghang1.exe” from the list and click “OK”. When the debugger breaks in, you can unassemble the beginning of the WindowProc by typing “u <address>”. Based on the address on my system, that command will be “u 01002830”. To unassemble more just type “u” again. I’m now going to unassemble the relevant code and explain what it means along the way...
0:001> u 01002830
The first 3 instructions are a function prologue, just getting everything set up.
01002830 8bff mov edi,edi
01002832 55 push ebp
01002833 8bec mov ebp,esp
Here is the code that moves the value of uMsg into ecx
01002835 8b4d0c mov ecx,dword ptr [ebp+0Ch]
Now the app enters the assembly equivalent of a switch statement.
We are interested in the check for uMsg = WM_COMMAND
if uMsg = WM_CREATE (1) goto 01002893
01002838 49 dec ecx
01002839 7458 je ntdbghang1+0x2893 (01002893)
if uMsg = WM_ DESTROY (2) goto 01002889
0100283b 49 dec ecx
0100283c 744b je ntdbghang1+0x2889 (01002889)
if uMsg = WM_CLOSE (0x10) goto 01002889
0100283e 83e90e sub ecx,0Eh
01002841 743b je ntdbghang1+0x287e (0100287e)
if uMsg = WM_COMMAND (0x111) goto 01002853
01002843 b801010000 mov eax,101h
01002848 2bc8 sub ecx,eax
0100284a 7407 je ntdbghang1+0x2853 (01002853)
From the above switch statement we see that when uMsg = WM_COMMAND, execution moves to 01002853. So let’s go there...
0:001> u 01002853
Move the value of wParam into edx
01002853 8b5510 mov edx,dword ptr [ebp+10h]
If LOWORD(wParam) == 0x101 (control ID for button 1) goto 0100286f
01002856 0fb7ca movzx ecx,dx
01002859 2bc8 sub ecx,eax
0100285b 7412 je ntdbghang1+0x286f (0100286f)
From the above assembly we can see that right away we are checking for the control ID for button 1. This is the code path we want to follow....
0:001> u 0100286f
If HIWORD(wparam) != BN_CLICKED (0) then goto 0100289b
Otherwise, call the function at 010027f6
0100286f c1ea10 shr edx,10h
01002872 6685d2 test dx,dx
01002875 7524 jne ntdbghang1+0x289b (0100289b)
01002877 e87affffff call ntdbghang1+0x27f6 (010027f6)
Now it is clear that when the BN_CLICKED occurs for control id 0x101, we are going to execute the function at address 010027f6.
Figure out what is failing
So now we know what code is going to run every time the button is clicked (the function at 010027f6). Let’s examine that code and figure out what may be failing. We’ll use “uf” to unassemble the entire function, as it is rather short. During this first pass we’ll make up names for our local variables and functions that are called, and then we’ll revisit and figure out what they are.
0:001> uf 010027f6
010027f6 8bff mov edi,edi
010027f8 55 push ebp
010027f9 8bec mov ebp,esp
010027fb 51 push ecx
010027fc 56 push esi
call func1(0x20) [func1 address at 01002dca]
010027fd 6a20 push 20h
010027ff c745fc10000000 mov dword ptr [ebp-4],10h
01002806 e8bf050000 call ntdbghang1+0x2dca (01002dca)
Store the result of func1 in esi (call it localvar2)
0100280b 8bf0 mov esi,eax
Call func2(localvar2, &localvar1) [func2 address at 0100b484]
0100280d 59 pop ecx
0100280e 8d45fc lea eax,[ebp-4]
01002811 50 push eax
01002812 56 push esi
01002813 e86c8c0000 call ntdbghang1+0xb484 (0100b484)
If func2 returns 0, then goto 01002821
01002818 85c0 test eax,eax
0100281a 7405 je ntdbghang1+0x2821 (01002821)
Call func3() [func3 address at 0100278d]
0100281c e86cffffff call ntdbghang1+0x278d (0100278d)
Call func4(localvar2) [func4 address at 01002ce3]
01002821 56 push esi
01002822 e8bc040000 call ntdbghang1+0x2ce3 (01002ce3)
Clean up and exit the function
01002827 59 pop ecx
01002828 5e pop esi
01002829 c9 leave
0100282a c3 ret
So let’s translate this to some pseudo-code...
localvar1 = 0x10;
localvar2 = func1(0x20);
So the behavior of the function will differ, depending on the results of func2. Let’s see if we can figure out what func2 does...
0:001> u 0100b484
0100b484 ff2568110001 jmp dword ptr [ntdbghang1+0x1168 (01001168)]
0100b48a cc int 3
0100b48b cc int 3
0100b48c cc int 3
0:001> dps 01001168 L1
01001168 70b88cb1 WINSPOOL!GetDefaultPrinterW
GetDefaultPrinterW is a public API function. Here’s the function prototype from MSDN...
LPTSTR pszBuffer, // printer name buffer
LPDWORD pcchBuffer // size of name buffer
This fits our pseudo-code, since the function takes two parameters and returns a BOOL. Let’s update our pseudo code to match what we know, replacing func2 with GetDefaultPrinter, etc...
DWORD cchBuffer = 0x10;
LPWSTR pszBuffer = func1(0x20);
Based on the above, we can make a couple of assumptions. Func1 is likely an allocator function of some kind (such as malloc) and func4 is likely a memory free function (such as free). Since the app is using the Unicode version of GetDefaultPrinter (it ends with a W) it make sense that func1 is an allocator asking for 0x20 bytes, and the size of the buffer, in characters, passed to GetDefaultPrinterW is 0x10. So that accounts for all of the functions, except for func3. We could unassemble func3 to see what it does, but that may not be needed. From the pseudo-code we’ve already created, we can see that a failure in GetDefaultPrinterW will prevent the button click function from doing anything except allocating and freeing some memory. So GetDefaultPrinterW is a likely potential point of failure.
Let’s test this theory out. We’ll set a breakpoint on GetDefaultPrinterW and see if it is failing upon the button click.
0:001> bp WINSPOOL!GetDefaultPrinterW
<Now click Button 1>
Breakpoint 0 hit
eax=0006fb84 ebx=00000000 ecx=00000020 edx=00dc0e98 esi=00dc0e70 edi=0006fc0c
eip=70b88cb1 esp=0006fb74 ebp=0006fb88 iopl=0 nv up ei pl nz na po nc
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000202
70b88cb1 8bff mov edi,edi
// Dump out the call stack...
ChildEBP RetAddr Args to Child
0006fb70 01002818 00dc0e70 0006fb84 00000111 WINSPOOL!GetDefaultPrinterW
WARNING: Stack unwind information not available. Following frames may be wrong.
0006fb88 0100287c 0006fbbc 75d41a10 00320f78 ntdbghang1+0x2818
0006fb90 75d41a10 00320f78 00000111 00000101 ntdbghang1+0x287c
0006fbbc 75d41ae8 01002830 00320f78 00000111 USER32!InternalCallWinProc+0x23
0006fc34 75d4286a 00000000 01002830 00320f78 USER32!UserCallWinProcCheckWow+0x14b
0006fc74 75d42bba 00a90b80 0095ee08 00000101 USER32!SendMessageWorker+0x4b7
// Go to the return address. Note that eax=0, meaning GetDefaultPrinterW returned FALSE.
eax=00000000 ebx=00000000 ecx=76f22033 edx=00e10178 esi=00dc0e70 edi=0006fc0c
eip=01002818 esp=0006fb80 ebp=0006fb88 iopl=0 nv up ei pl zr na pe nc
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000246
01002818 85c0 test eax,eax
// Check the last error...
LastErrorValue: (Win32) 0x7a (122) - The data area passed to a system call is too small.
// Resume execution
The error returned was 0x7a = ERROR_INSUFFICIENT_BUFFER. It looks like the buffer of 0x20 bytes being passed to GetDefaultPrinterW is too small to hold the name of the default printer for that user. This explains why the app works for some users and fails for others. A quick change of the default printer name to something less than 0x10 characters (16 decimal) will work around this issue.
To wrap things up, here is the C source for the Button1_OnClick function (at address 010027f6 in the assembly) so you can see how this was actually written...
DWORD cch = 16;
pPrinterName = (LPTSTR) malloc(16 * sizeof(TCHAR));
Look for an upcoming blog post that shows another way this problem could have been approached – using the “wt” command in the debugger. I hope you found this useful and can apply to real-world debugging scenarios. Please post any questions / comments!