My name is Ryan Mangipano (ryanman) and I am a Sr. Support Escalation Engineer at Microsoft. Today I will be blogging about how I used the SOS .Net Framework debugging extension (and !analyze -v) to easily troubleshoot a .Net Framework exception. This exception was preventing Event Viewer from displaying properly. Event Viewer was returning an error that provided very little information about what was actually causing the issue. I will demonstrate how, in this case, it was very easy to use windbg to obtain information about what went wrong. I did not have to perform a live debug on the issue. Instead, I used a process dump to obtain very exact information, which was returned by the debugger, relating to the root cause. I was then able to use Process Monitor to identify the file that needed to be examined. These actions led me to the source of the problem, which was easily corrected. Also, the issue discussed in this blog was easily reproduced on Windows Server 2008. This means that you should be able to practice this debug exercise, on your non-production Windows 2008 SP1 Server, for learning purposes if you are interested in doing so.
Issue Reported: The following error was encountered when opening eventvwr.msc (Event Viewer) on a Windows 2008 Server system:
"MMC could not create the snap-in."
MMC could not create the snap-in.
The snap-in might not have been installed correctly
Name: Event Viewer
First Step- Research & Data Gathering: After ensuring I first understood the problem reported, I searched for known issues. I found out that we have seen this error before. It may occur when the following registry key gets deleted or corrupted:
I had the customer export this key and found that it was not corrupted in any way. I verified that all data was as expected
Next, a memory dump of the mmc.exe process was collected. The mmc.exe process is used to host the eventvwr.msc snap-in. This was easily obtained using the built in Windows 2008 Server "Windows Task Manager" feature: "Create Dump File" . If you have several mmc console instances executing on your system, you can use the Task Manager context menu shortcuts "Switch To" and "Go To Process" to help you to identify the correct instance.
Note: We also collected a process monitor logfile during the startup of eventvwr.msc. This log file later proved very helpful in resolving the issue (as I will show below). Process monitor can be obtained at the following URL: http://technet.microsoft.com/en-us/sysinternals/bb896645.aspx
Now let's have a look at the debug.
1. First, I navigated Windows Explorer to the location of the dump file and double-clicked it to open it in windbg.exe.
It opened in windbg because I had previously run the command windbg -IA, which associates .dmp files with windbg. You can read more about the command line options in windbg in the help file that is included with the debugging tools.
2. I noticed the following output from the debugger after it loaded the dump file:
This dump file has an exception of interest stored in it.
The stored exception information can be accessed via .ecxr.
(ff8.a2c): CLR exception - code e0434f4d (first/second chance not available)
3. Next, I wanted to ensure my symbol path was set correctly. I could have set it using the .sympath command:
0:011> .sympath SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols
Symbol search path is: SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols
Expanded Symbol search path is: srv*c:\websymbols*http://msdl.microsoft.com/download/symbols
However, when your goal is to simply point to the default symbol server, .symfix is a very nice shortcut. It prevents one from having to try to remember the URL. Here’s the syntax:
0:011> .symfix c:\websymbols
4. To ensure that I didn't waste time reviewing the wrong data, I performed a quick check to ensure that we collected a dump of the requested snap-in.
PEB at 000007fffffdb000
CommandLine: '"C:\Windows\system32\mmc.exe" "C:\Windows\system32\eventvwr.msc" '
You could alternatively dump the CommandLine from the nt!_PEB using the dt command
0:005> dt nt!_PEB ProcessParameters->CommandLine 000007fffffdb000
+0x020 ProcessParameters :
+0x070 CommandLine : _UNICODE_STRING ""C:\Windows\system32\mmc.exe" "C:\Windows\system32\eventvwr.msc" "
5. Next, I then dumped out all of the threads in this process and found the following thread contained a stack that was raising a .Net Framework exception
0:011> ~* kL
... (ommitted the non-relevent threads)
# 11 Id: ff8.a2c Suspend: 1 Teb: 7ffd3000 Unfrozen
0691f03c 7343a91c kernel32!RaiseException+0x58
0691f09c 7343d81a mscorwks!RaiseTheExceptionInternalOnly+0x2a8
*** WARNING: Unable to verify checksum for MMCEx.ni.dll
0691f140 6bfe0b5a mscorwks!JIT_Rethrow+0xbf
*** WARNING: Unable to verify checksum for mscorlib.ni.dll
0691f1e8 69926cf6 MMCEx_ni+0xd0b5a
0691f1f4 6993019f mscorlib_ni+0x216cf6
0691f208 69926c74 mscorlib_ni+0x22019f
0691f220 733d1b4c mscorlib_ni+0x216c74
0691f230 733e21b1 mscorwks!CallDescrWorker+0x33
0691f2b0 733f6501 mscorwks!CallDescrWorkerWithHandler+0xa3
0691f3e8 733f6534 mscorwks!MethodDesc::CallDescr+0x19c
0691f404 733f6552 mscorwks!MethodDesc::CallTargetWorker+0x1f
0691f41c 7349d803 mscorwks!MethodDescCallSite::CallWithValueTypes+0x1a
0691f604 733f845f mscorwks!ThreadNative::KickOffThread_Worker+0x192
0691f618 733f83fb mscorwks!Thread::DoADCallBack+0x32a
0691f6ac 733f8321 mscorwks!Thread::ShouldChangeAbortToUnload+0xe3
0691f6e8 733f84ad mscorwks!Thread::ShouldChangeAbortToUnload+0x30a
0691f710 7349d5d4 mscorwks!Thread::ShouldChangeAbortToUnload+0x33e
6. Out of curiosity, I also ran the Get Last Error command
LastErrorValue: (Win32) 0 (0) - The operation completed successfully.
LastStatusValue: (NTSTATUS) 0 - STATUS_WAIT_0
7. After this, I ran analyze -v to see what helpful information the debugger would provide. The debugger did output exception information but informed me that I needed to use the x86 debugger instead.
0:011> !analyze -v
EXCEPTION_RECORD: ffffffff -- (.exr 0xffffffffffffffff)
ExceptionAddress: 771a42eb (kernel32!RaiseException+0x00000058)
ExceptionCode: e0434f4d (CLR exception)
Managed code needs matching platform of sos.dll for proper analysis. Use 'x86' debugger.
8. I fired up the x86 debugger and loaded the appropriate version of the SOS .Net Framework debugger extension. This extension ships in the Operating System along with the .Net Framework. On most occasions, I would have initiated the loading of the extension through the use of the following syntax:
0:011> .load C:\Windows\Microsoft.NET\Framework\v2.0.50727\sos.dll
0:011> .load c:\Windows\Microsoft.NET\Framework64\v2.0.50727\sos.dll
However, once you realize that managed debugging will be necessary and that you need the services of the SOS extension, it’s best to use the .loadby command rather than .load. This is due to the fact that the version of SOS must match the version of the CLR loaded into that process. Here’s the recommended syntax:
0:011>.loadby sos mscorwks
I always verify that my extensions are loaded properly by using the .chain command.
... Extension DLL chain:
C:\Windows\Microsoft.NET\Framework\v2.0.50727\sos.dll: image 2.0.50727.1434, API 1.0.0, built Wed Dec 05 22:42:38 2007
9. Running !help printed out the following helpful information about the SOS extension since sos.dll was at the top of the .chain output:
SOS is a debugger extension DLL designed to aid in the debugging of managed
programs. Functions are listed by category, then roughly in order of
importance. Shortcut names for popular functions are listed in parenthesis.
Type "!help <functionname>" for detailed info on that function.
Object Inspection Examining code and stacks
DumpObj (do) Threads
DumpArray (da) CLRStack
DumpStackObjects (dso) IP2MD
PrintException (pe) COMState
10. Using the exception address, displayed by the debugger when opening the dump, and the !pe command listed above, I obtained more information about the exception:
0:011> !pe 771a42eb
There are nested exceptions on this thread. Run with -nested for details
0:011> !pe -nested 771a42eb
Nested exception -------------------------------------------------------------
Exception object: 040a676c
Exception type: System.Reflection.TargetInvocationException
Message: Exception has been thrown by the target of an invocation.
InnerException: System.Reflection.TargetInvocationException, use !PrintException 040a6a20 to see more
SP IP Function
0:011> !PrintException 040a6a20
Exception object: 040a6a20
InnerException: System.Configuration.ConfigurationErrorsException, use !PrintException 040a6cf8 to see more
0:011> !PrintException 040a6cf8
Exception object: 040a6cf8
Exception type: System.Configuration.ConfigurationErrorsException
Message: Configuration system failed to initialize
InnerException: System.Configuration.ConfigurationErrorsException, use !PrintException 040a7174 to see more
0:011> !PrintException 040a7174
Exception object: 040a7174
Message: Unrecognized configuration section system.web/myInvalidData
11. Based on the exception information listed above, it appeared that a .Net Framework configuration section, system.web, contained an invalid configuration section named myInvalidData inside of it. I then re-ran !analyze -v against the dump again (now that I had loaded the x86 debugger) and found that !analyze -v will load the sos.dll extension and even run the !pe extension automatically. It then automatically displayed the exception record information for me as well. Also, notice that the thread listed by !analyze -v matches the thread I examined earlier.
0:011> !analyze -v
EXCEPTION_MESSAGE: Unrecognized configuration section system.web/myInvalidData.
0 Id: ff8.c84 Suspend: 1 Teb: 7ffdf000 Unfrozen
1 Id: ff8.96c Suspend: 1 Teb: 7ffde000 Unfrozen
2 Id: ff8.d10 Suspend: 1 Teb: 7ffdd000 Unfrozen
3 Id: ff8.d94 Suspend: 1 Teb: 7ffdc000 Unfrozen
4 Id: ff8.a14 Suspend: 1 Teb: 7ffda000 Unfrozen
5 Id: ff8.fbc Suspend: 1 Teb: 7ffd9000 Unfrozen
6 Id: ff8.f88 Suspend: 1 Teb: 7ffd8000 Unfrozen
7 Id: ff8.a64 Suspend: 1 Teb: 7ffd6000 Unfrozen
8 Id: ff8.bf8 Suspend: 1 Teb: 7ffd5000 Unfrozen
9 Id: ff8.d24 Suspend: 1 Teb: 7ffd4000 Unfrozen
10 Id: ff8.ff0 Suspend: 1 Teb: 7ffd7000 Unfrozen
. 11 Id: ff8.a2c Suspend: 1 Teb: 7ffd3000 Unfrozen
12. At this point I was interested in identifying the source of this unrecognized configuration. Instead of engaging our .Net support team, I started with a quick search using www.live.com for
"unrecognized configuration section" system.web site:microsoft.com
This returned the following results http://search.live.com/results.aspx?q=%22unrecognized+configuration+section%22+system.web+site%3Amicrosoft.com&form=QBRE
By quickly reviewing some of the hits returned, I found that others had encountered this exception in their own applications. This is due to invalid entries in the various .config files used in .Net. Looking through the posts, different configuration file names and paths were observed.
So, I opened up the process monitor logfile to see which configuration files we were reading data from. I added filter criterion to match entries from the mmc.exe process, the TID from the FAULTING_THREAD listed in the exception data, path data containing .config, and a successful status result. It's best to be as specific as possible.
I found that we were reading in a large amount of settings over and over again from the .net Framework global configuration file:
(on x64 this would be C:\Windows\Microsoft.NET\Framework64\v2.0.50727\CONFIG\machine.config)
Final Step- Putting it all together, Reproducing the issue, & confirming resolution : Using notepad, a quick search of the suspect xml file (C:\Windows\Microsoft.NET\Framework64\v2.0.50727\CONFIG\machine.config) on my system revealed a <system.web> section. At this point, I suspected that this section contained an invalid section which may have been related to the problem. To verify this, and since I like to break things, I added an invalid configuration setting <myInvalidData/> to my global configuration file. Doing so, I successfully reproduced the issue on my system. I then contacted the customer and asked if they had by any chance added any settings under the <system.web> in the configuration file: c:\Windows\Microsoft.NET\Framework\v2.0.50727\CONFIG\machine.config.
The customer informed me that, per the request of their ASP.net developer, they had in fact added settings to that section of the file. By researching http://msdn.microsoft.com/en-us/library/system.web.aspx and the schema documentation at http://msdn.microsoft.com/en-us/library/dayb112d.aspx, we were able to determine that the settings that were present in this file should not have been present inside of <system.web> . The settings were moved to the proper location per the developer and the issue was resolved.
Here are the steps I used to reproduce the issue in case you are attempting to replicate this at home-
A. Using notepad, open the following configuration file on a non-production Windows Server 2008 SP1 system:
(please make a backup copy first in case you make a mistake)
OR (Open the version that matches the architecture of your platform )
B. Find the section <system.web> in this file (you can use the find function in notepad):
C. Add the following line directly after <system.web> as shown in the example below:
D. Save the file and then open eventvwr.msc and verify that the following error is displayed:
Hopefully this blog has demonstrated an example of how you can use the "create dump file" feature of Windows 2008, windbg, and other related tools in an attempt to gain more specific data when your error message is not revealing the source of the problem. Feel free to post any questions, comments, or concerns.
Hello NTDebuggers, from time to time we see the following problem. It’s another access violation, and the debug notes below are from a minidump.
Here is what we need to know…
· Generally speaking what happened to cause this AV?
· What method you would use to isolate root cause of the failure?
There are a lot of ways to do this. We look forward to hearing your approach.
We will post our methods and answer at the end of the week. If you need anything please let us know.
Microsoft (R) Windows Debugger Version 6.8.0001.0
Copyright (c) Microsoft Corporation. All rights reserved.
Loading Dump File [D:\test123.dmp]
User Mini Dump File: Only registers, stack and portions of memory are available
0:000> k 123
0017f93c 75e4edb5 ntdll!ZwWaitForMultipleObjects+0x15
0017f9d8 75e430c3 kernel32!WaitForMultipleObjectsEx+0x11d
0017f9f4 75ef2084 kernel32!WaitForMultipleObjects+0x18
0017fa60 75ef22b1 kernel32!WerpReportFaultInternal+0x16c
0017fa74 75ebbe60 kernel32!WerpReportFault+0x70
0017fb00 7732d15a kernel32!UnhandledExceptionFilter+0x1c1
0017fb08 773000c4 ntdll!_RtlUserThreadStart+0x6f
0017fb1c 77361d05 ntdll!_EH4_CallFilterFunc+0x12
0017fb44 772eb6d1 ntdll!_except_handler4+0x8e
0017fb68 772eb6a3 ntdll!ExecuteHandler2+0x26
0017fc10 772cee57 ntdll!ExecuteHandler+0x24
0017fc10 10011127 ntdll!KiUserExceptionDispatcher+0xf
*** ERROR: Module load completed but symbols could not be loaded for crash3.exe
WARNING: Frame IP not in any known module. Following frames may be wrong.
0017ff40 0040104a 0x10011127
0017ffa0 75eb19f1 crash3+0x104a
0017ffac 7732d109 kernel32!BaseThreadInitThunk+0xe
0017ffec 00000000 ntdll!_RtlUserThreadStart+0x23
start end module name
00400000 0040d000 crash3 (no symbols)
6c250000 6c288000 odbcint (deferred)
6c290000 6c2f5000 odbc32 (deferred)
72a00000 72a86000 comctl32 (deferred)
74820000 749b4000 comctl32_74820000 (deferred)
75240000 75251000 samlib (deferred)
75260000 75281000 ntmarta (deferred)
754b0000 75510000 secur32 (deferred)
75510000 75570000 imm32 (deferred)
75700000 75790000 gdi32 (deferred)
757a0000 75870000 user32 (deferred)
758a0000 758a6000 nsi (deferred)
758b0000 759f4000 ole32 (deferred)
75a00000 75aaa000 msvcrt (deferred)
75ab0000 75ba0000 rpcrt4 (deferred)
75ba0000 75c1d000 usp10 (deferred)
75c20000 75c75000 shlwapi (deferred)
75d60000 75e27000 msctf (deferred)
75e30000 75f40000 kernel32 (pdb symbols)
76140000 76189000 Wldap32 (deferred)
76190000 7624f000 advapi32 (deferred)
76250000 76d1e000 shell32 (deferred)
76d20000 76d94000 comdlg32 (deferred)
76da0000 76dcd000 ws2_32 (deferred)
77280000 77287000 psapi (deferred)
77290000 77299000 lpk (deferred)
772b0000 77400000 ntdll (pdb symbols)
Good luck and happy debugging.
[Update: our answer. Posted 5/13/2008]
We enjoyed seeing different people’s approaches on this week’s puzzler. This was a simple module unload. We loaded a lib, did a GetProcAddress, freed the lib, and called the function. The dump was a mini dump created via .dump /m C:\dump file. There are various ways this type of scenario may arise. Obviously someone could unload a lib, but why? In most cases I’ve seen, it was due to a ref count problem in a com object. Poor accounting leading to one too many decrements, and the dll will get unloaded causing a simple crash footprint.
There are quite a few ways to track this down. First of all, if you had the debugger attached and got a full dump or /ma dump you would have seen the loaded module list. This would have been a dead giveaway and part of why we did the .dump /m. There are other options you can enable that make tracking of module loads easy under the debugger. I personally like “loader snaps” if I’m trying to track down module load shenanigans. To enable this, just go into the image section of the gflags tool and enable loader snaps for the exe in question. Now attach a debugger and watch the mode load and GetProcAddress details scroll by.
Yet another popular approach is to use process monitor. This tool is not only easy to set up, but it also gives you great logs with call stacks and other details such as registry accesses.
This puzzler provided the bare minumum data required. We did not give you much to go on because sometimes in real debugging scenarios you have to work with a lack of data. I really liked how many people questioned the source of the dump file. It really shows how familiar you all are with the various dump types.
Hello NTDebuggers, in the spirit of Click and Clack (The Tappet brothers), a favorite troubleshooting show of mine, we thought it would be fun to offer up some Debug puzzlers for our readers.
That said, this week’s Debug Puzzler is in regard to Dr. Watson. I’m sure most of you have seen Dr. Watson errors. This typically means your application has crashed due to an unhandled exception. Sometimes however the process just seems to disappear. The Just-in-Time (JIT) debugging options configured via the AEDebug key does not catch the crash… Does anyone know why this may happen?
We will post reader’s comments as they respond during the week, and next Monday will post our answer and recognize some of the best answers we received from our readers.
Good luck and happy debugging!
- Jeff Dailey
[Update: our answer, posted 4/11/2008]
Hello NTDebuggers. Let me start of by saying that we were very impressed by our reader’s answers. Our two favorite answers were submitted by Skywing and molotov.
When a thread starts, the ntdll Run Time Library (RTL) for the process inserts an exception hander before it calls the BaseThreadInit code to hand control over to the executable or DLL running in the process (notepad in the example below). If anything goes wrong with the chain of exception handlers, the process can’t make it back to the RTL exception handler and the process will simply terminate. See http://www.microsoft.com/msj/0197/Exception/Exception.aspx for details.
000ef7ac 75fbf837 ntdll!KiFastSystemCallRet
000ef7b0 75fbf86a USER32!NtUserGetMessage+0xc
000ef7cc 00b21418 USER32!GetMessageW+0x33
000ef80c 00b2195d notepad!WinMain+0xec
000ef89c 76e24911 notepad!_initterm_e+0x1a1
000ef8a8 7704e4b6 kernel32!BaseThreadInitThunk+0xe
000ef8e8 7704e489 ntdll!__RtlUserThreadStart+0x23 << Exception Handler is inserted here.
000ef900 00000000 ntdll!_RtlUserThreadStart+0x1b
Secondly, the process that crashes is actually responsible for starting the debugger via the RTL exception handler. The debugger is registered under the AeDebug registry key. Even if you are able to unwind to the RTL exception handler you may still run into trouble. If the computer is low on system resources such as desktop heap, you may not be able to create a new process and thus will not be able to launch the debugger. As SkyWing stated, it’s a relatively heavyweight operation. Applications may also call TerminateProcess from within their own code based on an error condition. If we have a customer that sees this symptom on a regular basis we typically recommend having them attach a debugger to monitor the process. Simply run via ADPLUS -crash -p (PROCESSID).
Good work folks! We’ll have another puzzler ready next Monday.
Good Luck and happy debugging!
Greetings NTDEBUGGING community, Bob here again, and today I would like to let everyone know about a new feature implemented in the STORPORT.SYS binaries released in the following KB articles:
· KB 979764 (Windows Server 2008)
· KB 978000 (Windows Server 2008 R2)
Beginning with these new versions, it is now possible to measure timing statistics for requests made to a system’s disk unit. These measurements are taken at the lowest possible level of OS interaction with the storage adapter hardware, making it much easier to diagnose storage performance issues. The measurements are taken from the port driver, STORPORT.SYS, via ETW (Event Tracing for Windows). Windows drivers are implemented in a layered architecture, so in our case the STORPORT driver interfaces directly with the adapter or miniport driver. The STORPORT driver is responsible for managing requests and queues, and providing other upper level functions. Each miniport driver is specific to a particular storage adapter card, and knows how to deliver requests to the disk unit over the transport medium, like fibre channel for instance.
Configuring and Starting STORPORT ETW Logging
To measure storage performance, the user enters a threshold value via the GUI interface. (This can be scripted as well; see KB 978000 for details.) When an I/O time is equal to or greater than the entered threshold, an event 13 is written to the ETW log.
Here is a walkthrough to enable this feature via the GUI:
Click Start, type “perfmon”, and press ENTER to start Performance Monitor.
Expand “Data Collector Sets” and select “Event Trace Sessions”.
Right-Click on “Event Trace Sessions” and select “New -> Data Collector Set”.
The following dialog will appear:
Give the new data collector set a name in the dialog box. In this example I called it “Storport Performance Logging”. Choose the “Create manually (Advanced)” option then click Next to see the following dialog.
After clicking “Add...” on the dialog above the following list of providers will appear. Select “Microsoft-Windows-StorPort” and click OK.
When the dialog below opens, select “Keywords (Any)” then click Edit.
Check the box for IOPerfNotification, and then click OK. Note: For Windows Server 2008 SP1 and SP2, some properties may not be present (e. g. Reset, PortNotification and PauseResume).
Next, select “Filter” and click Edit…
This is where we enter our time threshold. Remember that timing values greater than or equal to this filter value will be logged to the trace file.
Select “Binary” and check the “Filter enabled” box
In the “Filter data” field, enter the threshold value in the number of milliseconds. This must be entered in little endian format. Please refer to this table for the most common entries.
01 00 00 00 00 00 00 00
05 00 00 00 00 00 00 00
0A 00 00 00 00 00 00 00
0F 00 00 00 00 00 00 00
Note: The “Filter type” value will always remain 0 as in the example above showing an entry for 1 ms.
WARNING - The whole data line needs to be filled in when entering the threshold. For demonstration purposes here is the WRONG WAY to enter 256 ms. This will not work because of the missing 00 entries on the line.
Below is the correct way to enter the value (256ms) Note: The Filter value may have to be re-entered for next run because it could get reset.
After clicking OK, you will see a new Event Trace session (screenshot below). Right-Click the new Event Trace session and select “Properties”
Next we need to set a location for the trace file. Ensure you have sufficient disk space in the target folder because logs can be many Megabytes in size depending on the number of hits.
The target folder may be changed in the “Directory” tab
I recommend choosing a directory that is more easily located, like C:\PerfLogs, in which case you can click “Browse”
Note: It is not recommended to store the trace file on the same storage you are troubleshooting.
The name for the file is selected on the “File” tab.
When you are done setting the file and directory information click OK.
Now right-click the log and select “Start” to begin tracing.
The “Status” should indicate that it is running.
How long to run the log depends on you issue. You should let the log run until the issue or slow down occur then check the log for delays.
Viewing the Log File
First right-click and select “Stop” to flush any cached data and stop the logging.
Then use Event Viewer to examine the log.
Next choose "Open the saved log…"
After clicking OK a popup dialog will be displayed to ask if you want to convert the log. Select “No”. Here is some sample event data that you will likely see in the log.
Example: Event ID 13
Let's examine the “Details” tab which shows information about the I/O that exceeded or equaled the threshold we had set. If you do not set a threshold, then all I/O events are logged, but for performance reasons we recommend setting a "reasonable" threshold like 15ms.
Breaking down the Event Data we have …
Miniport Extension: (Kernel address of the miniport extension, not of use to performance logging, only debugging)
PortNumber: This is the adapter port number (i.e. RaidPort1, etc.)
PathID: Path number if multi-pathing is in place
TargetID: Target LUN exposed to the Operating System
LUN: The Logical Unit Number of the physical storage where the particular I/O is directed
*RequestDuration: How long the request took in milliseconds (the number of note here for storage latency diagnostics)
Command: the SCSI command sent to the unit. (Usually read or write) This is listed in decimal. Convert this value to hexadecimal with Calc.exe for referencing SCSI commands http://en.wikipedia.org/wiki/SCSI_command
ScsiStatus: the status that came back from the drive
SRBStatus: the status as returned from adapter (see srb.h and scsi.h in the Microsoft WDK or http://en.wikipedia.org/wiki/SCSI_Status_Code)
Example: Command 42 equals 2A which is a Write of 10 bytes.
Here’s the respective SCSI command:
When viewing the log, the commands will be mostly reads and writes. If I/O is determined to be slow at this level, then the issue is most likely disk or SAN related.
Hope you enjoyed the walkthrough!
My name is Marion Cole, and I am a Sr. EE in Microsoft Platforms Serviceability group. You may be wondering why Microsoft support would need to know ARM assembly. Doesn’t Windows only run on x86 and x64 machines? No. Windows has ran on a variety of processors in the past. Those include i860, Alpha, MIPS, Fairchild Clipper, PowerPC, Itanium, SPARC, 286, 386, IA-32, x86, x64, and the newest one is ARM. Most of these processors are antiquated now. The common ones now are IA-32, x86, x64. However Windows has started supporting ARM processors in order to jump into the portable devices arena. You will find them in the Microsoft Surface RT, Windows Phones, and other things in the future I am sure. So you may be saying that these devices are locked, and cannot be debugged. That is true from a live debug perspective, but you can get memory dumps and application dumps from them and those can be debugged.
There are limitations on ARM processors that Windows supports. There are 3 System on Chip (SOC) vendors that are supported. nVidia, Texas-Instruments, and Qualcomm. Windows only supports the ARMv7 (Cortex, Scorpion) architecture in ARMv7-A in (Application Profile) mode. This implements a traditional ARM architecture with multiple modes and supporting a Virtual Memory System Architecture (VMSA) based on an MMU. It supports the ARM and Thumb-2 instruction sets which allows for a mixture of 16 (Thumb) and 32 (ARM) bit opcodes. So it will look strange in the assembly. Luckily the debuggers know this and handle it for you. This also helps to shrink the size of the assembly code in memory. The processor also has to have the Optional ISA extensions of VFP (Hardware Floating Point) and NEON (128-bit SIMD Architecture).
In order to understand the assembly that you will see you need to understand the processor internals.
ARM is a Reduced Instruction Set Computer (RISC) much like some of the previous processors that Windows ran on. It is a 32 bit load/store style processor. It has a “Weakly-ordered” memory model: similar to Alpha and IA64, and it requires specific memory barriers to enforce ordering. In ARM devices these as ISB, DSB, and DMB instructions.
The processor has 16 available registers r0 – r15.
0: kd> r
r0=00000001 r1=00000000 r2=00000000 r3=00000000 r4=e1820044 r5=e17d0580
r6=00000001 r7=e17f89b9 r8=00000002 r9=00000000 r10=1afc38ec r11=e1263b78
r12=e127813c sp=e1263b20 lr=e16c12c3 pc=e178b6d0 psr=00000173 ----- Thumb
r0, r1, r2, r3, and r12 are volatile registers. Volatile registers are scratch registers presumed by the caller to be destroyed across a call. Nonvolatile registers are required to retain their values across a function call and must be saved by the callee if used.
On Windows four of these registers have a designated purpose. Those are:
In Windbg all but r11 will be labeled appropriately for you. So you may be asking why r11 is not labeled “fp” in the debugger. That is because r11 is only used as a frame pointer when you are calling a non-leaf subroutine. The way it works is this: when a call to a non-leaf subroutine is made, the called subroutine pushes the value of the previous frame pointer (in r11) to the stack (right after the lr) and then r11 is set to point to this location in the stack, so eventually we end up with a linked list of frame pointers in the stack that easily enables the construction of the call stack. The frame pointer is not pushed to the stack in leaf functions. Will discuss leaf functions later.
CPSR (Current Program Status Register)
Now we need to understand some about the CPSR register. Here is the bit breakdown:
So why do I need to know about the CPSR (Current Program Status Register)? You will need to know where some of these bits are due to how some of the assembly instruction affect these flags. Example of this is:
ADD will add two registers together, or add an immediate value to a register. However it will not affect the flags.
ADDS will do the same as ADD, but it does affect the flags.
MOV will allow you to move a value into a register, and a value between registers. This is not like the x86/x64. MOV will not let you read or write to memory. This does not affect the flags.
MOVS does the same thing as MOV, but it does affect the flags.
I hope you are seeing a trend here. There are instructions that will look the same. However if they end in “S” then you need to know that this will affect the flags. I am not going to list all of those assembly instructions here. Those are already listed in the ARM Architecture Reference Manual ARMv7-A and ARMv7-R edition at http://infocenter.arm.com/help/topic/com.arm.doc.ddi0406b/index.html.
So now we have an idea of what can set the flags. Now we need to understand what the flags are used for. They are mainly used for branching instructions. Here is an example:
003a11d2 429a cmp r2,r3
003a11d4 d104 bne |MyApp!FirstFunc+0x28 (003a11e0)|
The first instruction in this code (cmp) compares the value stored in register r2 to the value stored in register r3. This comparison instruction sets or resets the Z flag in the CPSR register. The second instruction is a branch instruction (b) with the condition code ne which means that if the result of the previous comparison was that the values are not equal (the CPSR flag Z is zero) then branch to the address MyApp!FirstFunc+0x28 (003a11e0). Otherwise the execution continues.
There are a few compare instructions. “cmp” subtracts two register values, sets the flags, and discards the result. “cmn” adds two register values, sets the flags, and discards the results. “tst” does a bit wise AND of two register values, sets the flags, and discards the results. There is even an If Then (it) instruction. I am not going to discuss that one here as I have never seen it in any of the Windows code.
So is “bne” the only branch instruction? No. There is a lot of them. Here is a table of things that can be seen beside “b”, and what they check the CPSR register:
Condition Flags (in CPSR)
Positive or Zero (Plus)
C==1 and Z==0
Unsigned lower or same
C==0 or Z==1
Signed greater than or equal
Signed less than
Signed greater than
Z==0 and N==V
Signed less than or equal
Z==1 or N!=V
Floating Point Registers
As mentioned earlier the processor also has to have the ISA extensions of VFP (Hardware Floating Point) and NEON (128-bit SIMD Architecture). Here is what they are.
As you can see this is 16 – 64bit regiters (d0-d15) that is overlaid with 32 – 32bit registers (s0-s31). There are varieties of the ARM processor that has 32 – 64bit registers and 64 – 32bit registers. Windows 8 will support both 16 and 32 register variants. You have to be careful when using these, because if you access unaligned floats you may cause an exception.
As you can see here the SIMD (NEON) extension adds 16 – 128 bit registers (q0-q15) onto the floating point registers. So if you reference Q0 it is the same as referencing D0-D1 or S0-S1-S2-S3.
In part 2 we will discuss how Windows utilizes this processor.
Hello my name is Bob Golding and I would like to share information on a new error you may see in the system event log. It is Event ID 157 "Disk <n> has been surprise removed" with Source: disk. This error indicates that the CLASSPNP driver has received a “surprise removal” request from the plug and play manager (PNP) for a non-removable disk.
What does this error mean?
The PNP manager does what is called enumerations. An enumeration is a request sent to a driver that controls a bus, such as PCI, to take an inventory of devices on the bus and report back a list of the devices. The SCSI bus is enumerated in a similar manner, as are devices on the IDE bus.
These enumerations can happen for a number of reasons. For example, hardware can request an enumeration when it detects a change in configuration. Also a user can initiate an enumeration by selecting “scan for new devices” in device manager.
When an enumeration request is received, the bus driver will rescan the bus for all devices. It will issue commands to the existing devices as though it was looking for new ones. If these commands fail on an existing unit, the driver will mark the device as “missing”. When the device is marked “missing”, it will not be reported back to PNP in the inventory. When PNP determines that the device is not in the inventory it will send a surprise removal request to the bus driver so the bus driver can remove the device object.
Since the CLASSPNP driver sits in the device stack and receives requests that are destined for disks, it sees the surprise removal request and logs an event if the disk is supposed to be non-removable. An example of a non-removable disk is a hard drive on a SCSI or IDE bus. An example of a removable disk is a USB thumb drive.
Previously nothing was logged when a non-removable disk was removed, as a result disks would disappear from the system with no indication. The event id 157 error was implemented in Windows 8.1 and Windows Server 2012 R2 to log a record of a disk disappearing.
Why does this error happen?
These errors are most often caused when something disrupts the system’s communication with a disk, such as a SAN fabric error or a SCSI bus problem. The errors can also be caused by a disk that fails, or when a user unplugs a disk while the system is running. An administrator that sees these errors needs to verify the heath of the disk subsystem.
Event ID 157 Example:
We’ve been thinking about hosting a live group chat session to talk with the debugging community. If we had such a chat, the discussion would focus on debugging techniques and any questions you may have about anything we’ve previously blogged about. If you’d be interested in participating in a chat session, please answer the survey question below. Also, feel free to leave a comment on this page if there’s a particular debugging topic you would like for us to cover in the chat.
Thanks everyone for your feedback! This survey is now closed.
Hello, my name is Louis Shanks and my story starts off with a tricky little deadlock issue that I worked on recently. Let me set the stage by explaining the surface level symptom, and then we will get to the good stuff. Randomly when users running Vista would attempt to launch Internet Explorer nothing would happen. So they would again double-click the icon, and still nothing happened. So the users would reboot and go back to their web activities.
After some basic trouble shooting it was discovered that Iexplore.exe was running multiple times and every attempt to open the browser would result in another Iexplore.exe instance. Since no UI was ever presented I went ahead and collected a few Iexplore.exe process dumps along with a complete system dump.
The process dumps didn’t prove to be terribly useful so I decided to move over to the complete memory dump.
Since I knew we had multiple IExplore.exe processes running I wanted to investigate them first.
0: kd>!process 0 0
PROCESS 87289d90 SessionId: 5 Cid: 0fdc Peb: 7ffdc000 ParentCid: 0968
DirBase: 7d66c520 ObjectTable: a203a178 HandleCount: 205.
PROCESS 861a9d90 SessionId: 5 Cid: 0d34 Peb: 7ffd9000 ParentCid: 0fdc
DirBase: 7d66c7a0 ObjectTable: a5651a98 HandleCount: 170.
PROCESS 86c175b0 SessionId: 5 Cid: 1250 Peb: 7ffd9000 ParentCid: 0968
DirBase: 7d66c940 ObjectTable: a0853f10 HandleCount: 91.
PROCESS 861ac7a8 SessionId: 5 Cid: 1024 Peb: 7ffde000 ParentCid: 0968
DirBase: 7d66c620 ObjectTable: a83f7898 HandleCount: 91.
I decided to just start with the first one in the list (87289d90)
0: kd> !process 87289d90 Image: iexplore.exe
It appears this process has been around a while, so I wanted to look at the first thread in this process as it is responsible for the UI in Internet Explorer. I am looking at this thread because the surface level problem illustrates that when we launch Iexplore.exe the UI is never displayed. Here I used the !thread command to display the specific thread.
THREAD 870b6398 Cid 0fdc.16a4 Teb: 7ffdf000 Win32Thread: fd850b78 WAIT: (UserRequest) UserMode Non-Alertable
Attached Process 87289d90 Image: iexplore.exe
Wait Start TickCount 10817197 Ticks: 69571 (0:00:18:05.314)
Based on the tick count this thread has been waiting for 18 minutes.
0: kd> kvn
# ChildEBP RetAddr Args to Child
00 931208d8 81a940a1 870b6398 803d2120 870b6420 nt!KiSwapContext+0x26
01 9312091c 81a8dacc 870b6398 00000000 00000002 nt!KiSwapThread+0x44f
02 93120970 81bedb4f 00000002 93120aa8 00000001 nt!KeWaitForMultipleObjects+0x53d
03 93120bfc 81bee2e3 00000002 00000001 00000000 nt!ObpWaitForMultipleObjects+0x256
04 93120d48 81a30b3a 00000002 0027e9c4 00000001 nt!NtWaitForMultipleObjects+0xcc
05 93120d48 77568254 00000002 0027e9c4 00000001 nt!KiFastCallEntry+0x12a
06 0027e970 775679fc 7708c973 00000002 0027e9c4 ntdll!KiFastSystemCallRet
07 0027e974 7708c973 00000002 0027e9c4 00000001 ntdll!NtWaitForMultipleObjects+0xc
08 0027ea10 7608ff95 0027e9c4 0027ea38 00000000 kernel32!WaitForMultipleObjectsEx+0x11d
09 0027ea64 773d3a46 00000028 0027eaac ffffffff USER32!RealMsgWaitForMultipleObjectsEx+0x13c
0a 0027ea8c 773d3b7f 0027eaac ffffffff 0027eabc ole32!CCliModalLoop::BlockFn+0x97
0b 0027eab4 774e75be ffffffff 003ddf90 0027ebc0 ole32!ModalLoop+0x5b
0c 0027ead0 774e9294 00000000 0027ebd4 00000000 ole32!ThreadSendReceive+0x12c
0d 0027eaf8 774e9100 0027ebc0 003d3538 0027ec1c ole32!CRpcChannelBuffer::SwitchAptAndDispatchCall+0x194
0e 0027ebd8 773d3c25 003d3538 0027ecfc 0027ece4 ole32!CRpcChannelBuffer::SendReceive2+0xef
0f 0027ebf4 773d3bd7 0027ecfc 0027ece4 003d3538 ole32!CCliModalLoop::SendReceive+0x1e
10 0027ec6c 77403efb 003d3538 0027ecfc 0027ece4 ole32!CAptRpcChnl::SendReceive+0x73
11 0027ecc0 76052ff4 003d3538 0027ecfc 0027ece4 ole32!CCtxComChnl::SendReceive+0x1c5
12 0027ecd8 76053055 003e50a4 0027ede4 76052e0d RPCRT4!NdrProxySendReceive+0x43
13 0027ece4 76052e0d 09ea0d59 0027f1fc 070001f3 RPCRT4!NdrpProxySendReceive+0xc
Based on the call stack it appears this thread made an OLE call and is now waiting for some objects to signal.
What is IExplore doing in this thread? It is waiting on another process to service this OLE request. We can find the PID of the other process by digging into the ole32 SendReceive call.
I looked at the first argument:
0: kd> dd 003d3538
003d3538 77403a50 773f57b4 00000003 00000002
003d3548 00000000 00000000 003d2ef0 003d8840
003d3558 0038b1c0 003e05f8 77403980 00070005
This gives us a pointer to another data structure that contains the information we are attempting to locate.
0: kd> dd 003d2ef0
003d2ef0 774ee600 003d2e70 00000d34 00000000
003d2f00 862941fa c94cde09 55a88424 801a2601
003d2f10 55a88424 801a2601 0000c400 00000d34
Ah there it is! 0d34 is the process id servicing our OLE request.
I went back and reviewed the !process 0 0 output to locate the process containing PID 0D34. I’m dumping it here.
PROCESS 861a9d90 SessionId: 5 Cid: 0d34 Peb: 7ffd9000 ParentCid: 0fdc
Ok, so the process handing this request is IEUSER.exe, which is the broker process used by Internet Explorer in Vista when running in protect mode. This is new in Vista.
More on IEUser and Protected mode can be seen at the following location:http://msdn.microsoft.com/en-us/library/bb250462.aspx
Now I needed to dig into the IEUser.exe process to find out what it was working on causing Iexplore to wait.
0: kd> !process 861a9d90
I discovered the thread in the ieuser.exe process working on our call from Iexplore.exe. However this thread is also waiting on access to something before it can complete.
0: kd> kvn+200
# ChildEBP RetAddr Args to Child
00 a41b0c50 81a940a1 85ecbac0 85ecbb48 81af9920 nt!KiSwapContext+0x26
01 a41b0c94 81a2ddc8 85ecbac0 00000000 85e18510 nt!KiSwapThread+0x44f
02 a41b0ce8 81bf0d9f 85e18510 00000006 00000001 nt!KeWaitForSingleObject+0x492
03 a41b0d50 81a30b3a 00000224 00000000 00000000 nt!NtWaitForSingleObject+0xbe
04 a41b0d50 77568254 00000224 00000000 00000000 nt!KiFastCallEntry+0x12a
05 0163e8a4 77567a0c 7708c7c4 00000224 00000000 ntdll!KiFastSystemCallRet
06 0163e8a8 7708c7c4 00000224 00000000 00000000 ntdll!NtWaitForSingleObject+0xc
07 0163e918 7708c732 00000224 ffffffff 00000000 kernel32!WaitForSingleObjectEx+0xbe
08 0163e92c 747e60aa 00000224 ffffffff 00000000 kernel32!WaitForSingleObject+0x12
09 0163e948 747e5fe0 0163ec48 00497b38 00000000 RASAPI32!ReadPhonebookFileEx+0x1a4
0a 0163e968 747e4e9d 0163ec48 00000000 00000000 RASAPI32!ReadPhonebookFile+0x1b
0b 0163e9a8 747e52d3 0163ec48 0048f660 0163e9e0 RASAPI32!DwEnumEntriesFromPhonebook+0x35
0c 0163ee58 747e4fd4 0163ee9c 00000001 0048f660 RASAPI32!DwEnumEntriesInDir+0x1cc
0d 0163f0ac 747e677e 00000001 0048f660 0163f0f4 RASAPI32!DwEnumEntriesForPbkMode+0xab
0e 0163f0e0 76e534b9 00000000 00004180 0048f660 RASAPI32!RasEnumEntriesW+0xb0
0f 0163f10c 76e5342a 76ecc2dc 00000000 00000000 WININET!RasEnumHelp::RasEnumHelp+0x55
10 0163f120 76e533a3 00000001 00000000 00000001 WININET!DoConnectoidsExist+0x2b
11 0163f14c 76e534ff 76ecf02c 76ecc2dc 00000000 WININET!GetRasConnections+0x34
12 0163f168 76e4196e 00000000 0163f180 6ee3228c WININET!IsDialUpConnection+0xa9
13 0163f184 76e70564 00000001 00000000 00000000 WININET!FixProxySettingsForCurrentConnection+0x31
14 0163f1a8 6ec2db3b 00000000 0163f1bc 0163f3d4 WININET!InternetInitializeAutoProxyDll+0x6c
15 0163f1b8 75fe9142 004884a0 00000202 0047e900 ieframe!CShdocvwBroker::StartAutoProxyDetection+0x4d
16 0163f1d0 76054294 6ec2daf8 0163f3d8 00000001 RPCRT4!Invoke+0x2a
17 0163f5fc 76054a49 004858d0 0046e4e0 0047e900 RPCRT4!NdrStubCall2+0x27b
18 0163f64c 774e86d7 004858d0 0047e900 0046e4e0 RPCRT4!CStdStubBuffer_Invoke+0xa0
19 0163f694 774e877d 0047e900 0047f968 00480b38 ole32!SyncStubInvoke+0x3c
1a 0163f6e0 7740249e 0047e900 0047f860 004858d0 ole32!StubInvoke+0xb9
1b 0163f7bc 774023af 0046e4e0 00000000 004858d0 ole32!CCtxComChnl::ContextInvoke+0xfa
1c 0163f7d8 774e85df 0047e900 00000001 004858d0 ole32!MTAInvoke+0x1a
1d 0163f808 774e8a18 d0908070 0046e4e0 004858d0 ole32!AppInvoke+0xaa
1e 0163f8e4 774e85ef 0047e8a8 00474118 0047ebc0 ole32!ComInvokeWithLockAndIPID+0x32c
1f 0163f930 75fe94b0 0047ebc0 08ac70b7 0047ebc0 ole32!ThreadInvoke+0x2fd
I’m dumping the first parameter passed to KeWaitForSingleObject to determine the object type.
0: kd> !object 85e18510
Object: 85e18510 Type: (85484d40) Mutant
ObjectHeader: 85e184f8 (old version)
HandleCount: 3 PointerCount: 5
Directory Object: 881f3030 Name: RasPbFile
Ok, we are waiting on a Mutex named RasPbFile. The big question in your mind now is, “who owns this lock”, right?
To determine this I’ll dump the structure for the mutex object:
0: kd> dt _KMUTANT 85e18510
+0x000 Header : _DISPATCHER_HEADER
+0x010 MutantListEntry : _LIST_ENTRY [ 0x870b6564 - 0x870b6564 ]
+0x018 OwnerThread : 0x870b6398 _KTHREAD
+0x01c Abandoned : 0 ''
+0x01d ApcDisable : 0 ''
The owner of the mutex is 870b6398. The bad news is this is the same thread I started with in the Iexplore.exe process so we’re looking at a classic deadlock scenario. Thread 870b6398 running under Iexplore.exe has made an OLE call over to the IEuser.exe broker process. The IEuser process needs the mutex owned by the thread that called it in the first place.This turned out to be a bug in the RasApi code, not the browser or IEuser process. The RasAPI code was written in such a way that it would wait for an infinite period of time to gain the Mutex.
0163e92c 747e60aa 00000224 ffffffff 00000000 kernel32!WaitForSingleObject+0x12
The good news is that we have created hotfix 951738 to address this issue. The official KB article is still pending.
Hello all; my name is Scott Olson and I work as an Escalation Engineer for Microsoft Global Escalation Services team in Platforms support, and I wanted to share an interesting problem that came up recently. A co-worker was running Windows Vista Ultimate x64 on their home machine and ran into a problem where the system would get random bugchecks after upgrading the RAM from 2GB to 4GB. Any combination of the RAM with 2GB was fine; however with 4GB of RAM installed the system would bugcheck within 10 minutes of booting. Once I heard about this I wanted to look at the memory dump in kernel debugger.
Here's is what I found:
The system got the following bugcheck:
0: kd> .bugcheckBugcheck code 000000D1Arguments fffff800`03a192d0 00000000`00000002 00000000`00000000 fffff980`064aa8b6
Tip: The help file included with the Debugging Tools For Windows contains a Bug Check Code Reference that includes details on how to parse the Bug Check code and its arguments. See: Help > Debugging Techniques > Bug Checks (Blue Screens) > Bug Check Code Reference
!analyze -v provides the following information for this bugcheck:
DRIVER_IRQL_NOT_LESS_OR_EQUAL (d1)An attempt was made to access a pageable (or completely invalid) address at an interrupt request level (IRQL) that is too high. This is usually caused by drivers using improper addresses. If kernel debugger is available get stack backtrace.Arguments:Arg1: fffff80003a192d0, memory referencedArg2: 0000000000000002, IRQLArg3: 0000000000000000, value 0 = read operation, 1 = write operationArg4: fffff980064aa8b6, address which referenced memoryDebugging Details:------------------READ_ADDRESS: fffff80003a192d0CURRENT_IRQL: 2
So with this data I can say that the system took a page fault on a read operation trying to reference the memory at fffff80003a192d0 at DISPATCH_LEVEL. OK, so let's get the trap frame so we can get into context of the system when the crashed happened:
0: kd> kv 3Child-SP RetAddr : Args to Child : Call Sitefffff800`03218f28 fffff800`0204da33 : 00000000`0000000a fffff800`03a192d0 00000000`00000002 00000000`00000000 : nt!KeBugCheckExfffff800`03218f30 fffff800`0204c90b : 00000000`00000000 fffffa80`0a3c6cf0 00000000`00000000 00000000`00000000 : nt!KiBugCheckDispatch+0x73fffff800`03219070 fffff980`064aa8b6 : 00000000`00000002 00000000`00000000 00000000`000005e0 fffff800`03219220 : nt!KiPageFault+0x20b (TrapFrame @ fffff800`03219070)
Here is the trap frame and it looks like system crashed while trying to reference memory at an offset from the stack pointer, rsp+0xD0 (highlighted above)
0: kd> .trap fffff800`03219070NOTE: The trap frame does not contain all registers.Some register values may be zeroed or incorrect.rax=0000000000000000 rbx=0000000000000010 rcx=0000000000000011rdx=0000000000000002 rsi=0000000000000000 rdi=0000000000000001rip=fffff980064aa8b6 rsp=fffff80003219200 rbp=00000000000071d6r8=fffff80003219280 r9=00000000000071d6 r10=0000000000000000r11=0000000000000000 r12=0000000000000000 r13=0000000000000000r14=0000000000000000 r15=0000000000000000iopl=0 nv up ei pl zr na po nctcpip!InetInspectReceiveDatagram+0xf6:fffff980`064aa8b6 440fb78c24d0000000 movzx r9d,word ptr [rsp+0D0h] ss:0018:fffff800`032192d0=8c13
As you can see above fffff800`032192d0 looks like valid memory and shouldn't normally cause a page fault on a read operation. At this point, I want to make sure the system did what it was told. I want to know what happened when the system trapped. To verify the faulting address I dumped the CR2 register to see what address was referenced when the page fault happened; this is also the first parameter in the bugcheck code for a stop 0xd1.
0: kd> r cr2cr2=fffff80003a192d0
Looking at this address it is clear that the trap frame does not exactly match, so let's look at how these addresses are different. Here is the stack pointer from the trap frame and the page fault converted into varying formats (focusing on the binary)
0: kd> .formats fffff800`032192d0Evaluate expression:Hex: fffff800`032192d0Decimal: -8796040490288Octal: 1777777600000310311320Binary: 11111111 11111111 11111000 00000000 00000011 00100001 10010010 11010000Chars: .....!..Time: ***** Invalid FILETIMEFloat: low 4.74822e-037 high -1.#QNANDouble: -1.#QNAN0: kd> .formats fffff800`03a192d0Evaluate expression:Hex: fffff800`03a192d0Decimal: -8796032101680Octal: 1777777600000350311320Binary: 11111111 11111111 11111000 00000000 00000011 10100001 10010010 11010000Chars: ........Time: ***** Invalid FILETIMEFloat: low 9.49644e-037 high -1.#QNANDouble: -1.#QNAN
Notice that there is a one bit difference between these 2 addresses
11111111 11111111 11111000 00000000 00000011 00100001 10010010 11010000
11111111 11111111 11111000 00000000 00000011 10100001 10010010 11010000
Since the software asked the system to do one thing and it did something different this is clearly some type of hardware problem (most likely with the processor). I reported this back to the co-worker and they contacted their hardware vendor. This must have been a common problem with this vendor because I found out later that they replied back within 10 minutes of contacting them with a recommendation to change the memory voltage in the BIOS. The memory voltage was set to Auto, which is a default. They recommended it be changed from 1.85 volts to 2.1 volts. After making the change the system was stable with 4GB of RAM.
Hello - Matthew here again. Today I'll be discussing in detail hang scenario #1 that Tate first mentioned a few blogs posts ago. From a debugging perspective, in an ideal world an application would always provide some kind of feedback when a failure occurs. The reality is that sometimes an application just doesn't do anything in response to a user-initiated action (such as a mouse-click) that results in a failure of some kind. When this happens, how can you determine what is going wrong?
In general, Process Monitor from Sysinternals is a great tool for gathering clues about this kind of failure. If the failure is due to a file system or registry problem, Process Explorer will catch the problematic event. Or if you suspect that the problem is due to something failing over the network, a network trace could help.
Let's assume that the techniques described above do not move us any closer to understanding the failure. What next? If you don't know what the app does in the background, and you don't have source code for the app, how do you proceed? Let's answer those questions by debugging a sample application...
Download the sample application here.
Here's what we know:
1. Clicking the “Button 1” button is supposed to display a useful dialog box
2. For certain users, clicking “Button 1” results in no observable response
3. You don't have access to the source code or debug symbols for the app
4. No one seems to know what the “Button 1” button does behind the scenes
5. The developer of the application hasn't been heard from in 3 years
We need to understand what exactly happens when the user clicks the “Button 1” button. Every window has a WindowProc function that receives messages upon user input. Since a button is considered a "control", the WindowProc for the main application window will get a WM_COMMAND message indicating that the button was clicked. To differentiate which specific control sent the message, each WM_COMMAND message has a control identifier associated with it.
So we need to:
a) Find the control identifier for “Button 1”
b) Find the WindowProc for the main application window
c) Determine where WM_COMMAND messages for the “Button 1” button are handled
d) Figure out what is failing in that code
Let’s get started...
Find the control identifier for “Button 1”
Spy++ (available as part of Visual Studio) is the tool for this job. Click Spy -> Log Messages. Use the Finder Tool to select the main window of ntdbghang1.exe. On the messages tab, click “Clear All” then select “WM_COMMAND”. Click “OK” and SPY++ will start logging. Click “Button 1” in ntdbghang1.exe and you should see one line of text appear in the output window. For good measure, click “Button 2” also, and you should see a second line of text. At this point, your SPY++ window should look something like this....
We can tell from this output, that the control identifier for “Button 1” is 257 (0x101), and the control identifier for “Button 2” is 258 (0x102). We’ll need this information later.
Find the WindowProc for the main application window
We can find the WindowProc address by using SPY++ again. Click Spy -> Find Window and use the Finder Tool to select the main window of ntdbghang1.exe. Make sure “Show properties” is selected, and then click “OK”. You will get a dialog that looks like this....
Note the value of Window Proc. This is the address of the window procedure for the main window (the exact value may differ on your system). We know that this function is a WindowProc, which means it is of the form...
LRESULT CALLBACK WindowProc(
The meaning of wParam and lParam differ depending on what uMsg is being processed. When a WM_COMMAND message is passed, the low-word of wParam is the control identifier, which we determined to be 0x101. The high-word of wParam is the control notification code, which in the case of a button being clicked is BN_CLICKED (literally 0).
So, we specifically are interested in the case where:
uMsg = WM_COMMAND (literally 0x111)
wParam = 0x101
While looking at assembly language for the WindowProc, the stack frame will look like this...
ebp = “old ebp”
ebp+4 = “return address”
ebp+8 = hwnd
ebp+c = uMsg
ebp+10 = wParam
ebp+14 = lParam
Determine where WM_COMMAND messages for the “Button 1” button are handled
Once you have the address of the WindowProc, we can examine the assembly code for that function using Windbg. Launch windbg.exe and select “File -> Attach to a Process”. Select “ntdbghang1.exe” from the list and click “OK”. When the debugger breaks in, you can unassemble the beginning of the WindowProc by typing “u <address>”. Based on the address on my system, that command will be “u 01002830”. To unassemble more just type “u” again. I’m now going to unassemble the relevant code and explain what it means along the way...
0:001> u 01002830
The first 3 instructions are a function prologue, just getting everything set up.
01002830 8bff mov edi,edi
01002832 55 push ebp
01002833 8bec mov ebp,esp
Here is the code that moves the value of uMsg into ecx
01002835 8b4d0c mov ecx,dword ptr [ebp+0Ch]
Now the app enters the assembly equivalent of a switch statement.
We are interested in the check for uMsg = WM_COMMAND
if uMsg = WM_CREATE (1) goto 01002893
01002838 49 dec ecx
01002839 7458 je ntdbghang1+0x2893 (01002893)
if uMsg = WM_ DESTROY (2) goto 01002889
0100283b 49 dec ecx
0100283c 744b je ntdbghang1+0x2889 (01002889)
if uMsg = WM_CLOSE (0x10) goto 01002889
0100283e 83e90e sub ecx,0Eh
01002841 743b je ntdbghang1+0x287e (0100287e)
if uMsg = WM_COMMAND (0x111) goto 01002853
01002843 b801010000 mov eax,101h
01002848 2bc8 sub ecx,eax
0100284a 7407 je ntdbghang1+0x2853 (01002853)
From the above switch statement we see that when uMsg = WM_COMMAND, execution moves to 01002853. So let’s go there...
0:001> u 01002853
Move the value of wParam into edx
01002853 8b5510 mov edx,dword ptr [ebp+10h]
If LOWORD(wParam) == 0x101 (control ID for button 1) goto 0100286f
01002856 0fb7ca movzx ecx,dx
01002859 2bc8 sub ecx,eax
0100285b 7412 je ntdbghang1+0x286f (0100286f)
From the above assembly we can see that right away we are checking for the control ID for button 1. This is the code path we want to follow....
0:001> u 0100286f
If HIWORD(wparam) != BN_CLICKED (0) then goto 0100289b
Otherwise, call the function at 010027f6
0100286f c1ea10 shr edx,10h
01002872 6685d2 test dx,dx
01002875 7524 jne ntdbghang1+0x289b (0100289b)
01002877 e87affffff call ntdbghang1+0x27f6 (010027f6)
Now it is clear that when the BN_CLICKED occurs for control id 0x101, we are going to execute the function at address 010027f6.
Figure out what is failing
So now we know what code is going to run every time the button is clicked (the function at 010027f6). Let’s examine that code and figure out what may be failing. We’ll use “uf” to unassemble the entire function, as it is rather short. During this first pass we’ll make up names for our local variables and functions that are called, and then we’ll revisit and figure out what they are.
0:001> uf 010027f6
010027f6 8bff mov edi,edi
010027f8 55 push ebp
010027f9 8bec mov ebp,esp
010027fb 51 push ecx
010027fc 56 push esi
call func1(0x20) [func1 address at 01002dca]
010027fd 6a20 push 20h
010027ff c745fc10000000 mov dword ptr [ebp-4],10h
01002806 e8bf050000 call ntdbghang1+0x2dca (01002dca)
Store the result of func1 in esi (call it localvar2)
0100280b 8bf0 mov esi,eax
Call func2(localvar2, &localvar1) [func2 address at 0100b484]
0100280d 59 pop ecx
0100280e 8d45fc lea eax,[ebp-4]
01002811 50 push eax
01002812 56 push esi
01002813 e86c8c0000 call ntdbghang1+0xb484 (0100b484)
If func2 returns 0, then goto 01002821
01002818 85c0 test eax,eax
0100281a 7405 je ntdbghang1+0x2821 (01002821)
Call func3() [func3 address at 0100278d]
0100281c e86cffffff call ntdbghang1+0x278d (0100278d)
Call func4(localvar2) [func4 address at 01002ce3]
01002821 56 push esi
01002822 e8bc040000 call ntdbghang1+0x2ce3 (01002ce3)
Clean up and exit the function
01002827 59 pop ecx
01002828 5e pop esi
01002829 c9 leave
0100282a c3 ret
So let’s translate this to some pseudo-code...
localvar1 = 0x10;
localvar2 = func1(0x20);
So the behavior of the function will differ, depending on the results of func2. Let’s see if we can figure out what func2 does...
0:001> u 0100b484
0100b484 ff2568110001 jmp dword ptr [ntdbghang1+0x1168 (01001168)]
0100b48a cc int 3
0100b48b cc int 3
0100b48c cc int 3
0:001> dps 01001168 L1
01001168 70b88cb1 WINSPOOL!GetDefaultPrinterW
GetDefaultPrinterW is a public API function. Here’s the function prototype from MSDN...
LPTSTR pszBuffer, // printer name buffer
LPDWORD pcchBuffer // size of name buffer
This fits our pseudo-code, since the function takes two parameters and returns a BOOL. Let’s update our pseudo code to match what we know, replacing func2 with GetDefaultPrinter, etc...
DWORD cchBuffer = 0x10;
LPWSTR pszBuffer = func1(0x20);
Based on the above, we can make a couple of assumptions. Func1 is likely an allocator function of some kind (such as malloc) and func4 is likely a memory free function (such as free). Since the app is using the Unicode version of GetDefaultPrinter (it ends with a W) it make sense that func1 is an allocator asking for 0x20 bytes, and the size of the buffer, in characters, passed to GetDefaultPrinterW is 0x10. So that accounts for all of the functions, except for func3. We could unassemble func3 to see what it does, but that may not be needed. From the pseudo-code we’ve already created, we can see that a failure in GetDefaultPrinterW will prevent the button click function from doing anything except allocating and freeing some memory. So GetDefaultPrinterW is a likely potential point of failure.
Let’s test this theory out. We’ll set a breakpoint on GetDefaultPrinterW and see if it is failing upon the button click.
0:001> bp WINSPOOL!GetDefaultPrinterW
<Now click Button 1>
Breakpoint 0 hit
eax=0006fb84 ebx=00000000 ecx=00000020 edx=00dc0e98 esi=00dc0e70 edi=0006fc0c
eip=70b88cb1 esp=0006fb74 ebp=0006fb88 iopl=0 nv up ei pl nz na po nc
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000202
70b88cb1 8bff mov edi,edi
// Dump out the call stack...
ChildEBP RetAddr Args to Child
0006fb70 01002818 00dc0e70 0006fb84 00000111 WINSPOOL!GetDefaultPrinterW
WARNING: Stack unwind information not available. Following frames may be wrong.
0006fb88 0100287c 0006fbbc 75d41a10 00320f78 ntdbghang1+0x2818
0006fb90 75d41a10 00320f78 00000111 00000101 ntdbghang1+0x287c
0006fbbc 75d41ae8 01002830 00320f78 00000111 USER32!InternalCallWinProc+0x23
0006fc34 75d4286a 00000000 01002830 00320f78 USER32!UserCallWinProcCheckWow+0x14b
0006fc74 75d42bba 00a90b80 0095ee08 00000101 USER32!SendMessageWorker+0x4b7
// Go to the return address. Note that eax=0, meaning GetDefaultPrinterW returned FALSE.
eax=00000000 ebx=00000000 ecx=76f22033 edx=00e10178 esi=00dc0e70 edi=0006fc0c
eip=01002818 esp=0006fb80 ebp=0006fb88 iopl=0 nv up ei pl zr na pe nc
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000246
01002818 85c0 test eax,eax
// Check the last error...
LastErrorValue: (Win32) 0x7a (122) - The data area passed to a system call is too small.
// Resume execution
The error returned was 0x7a = ERROR_INSUFFICIENT_BUFFER. It looks like the buffer of 0x20 bytes being passed to GetDefaultPrinterW is too small to hold the name of the default printer for that user. This explains why the app works for some users and fails for others. A quick change of the default printer name to something less than 0x10 characters (16 decimal) will work around this issue.
To wrap things up, here is the C source for the Button1_OnClick function (at address 010027f6 in the assembly) so you can see how this was actually written...
DWORD cch = 16;
pPrinterName = (LPTSTR) malloc(16 * sizeof(TCHAR));
Look for an upcoming blog post that shows another way this problem could have been approached – using the “wt” command in the debugger. I hope you found this useful and can apply to real-world debugging scenarios. Please post any questions / comments!