Hello NTDebuggers, in the spirit of Click and Clack (the Tappet Brothers), a favorite troubleshooting show of mine, we thought it would be fun to offer up some Debug Puzzlers for our readers.
This week’s Debug Puzzler is in regard to Dr. Watson. I’m sure most of you have seen Dr. Watson errors; they typically mean your application has crashed due to an unhandled exception. Sometimes, however, the process just seems to disappear, and the Just-in-Time (JIT) debugging options configured via the AeDebug key do not catch the crash… Does anyone know why this might happen?
We will post readers’ comments as they come in during the week, and next Monday we will post our answer and recognize some of the best answers we received from our readers.
Good luck and happy debugging!
- Jeff Dailey
[Update: our answer, posted 4/11/2008]
Hello NTDebuggers. Let me start off by saying that we were very impressed by our readers’ answers. Our two favorite answers were submitted by Skywing and molotov.
When a thread starts, the ntdll Run Time Library (RTL) for the process inserts an exception handler before it calls the BaseThreadInit code to hand control over to the executable or DLL running in the process (notepad in the example below). If anything goes wrong with the chain of exception handlers, execution can’t unwind back to the RTL exception handler and the process will simply terminate. See http://www.microsoft.com/msj/0197/Exception/Exception.aspx for details.
000ef7ac 75fbf837 ntdll!KiFastSystemCallRet
000ef7b0 75fbf86a USER32!NtUserGetMessage+0xc
000ef7cc 00b21418 USER32!GetMessageW+0x33
000ef80c 00b2195d notepad!WinMain+0xec
000ef89c 76e24911 notepad!_initterm_e+0x1a1
000ef8a8 7704e4b6 kernel32!BaseThreadInitThunk+0xe
000ef8e8 7704e489 ntdll!__RtlUserThreadStart+0x23 << Exception Handler is inserted here.
000ef900 00000000 ntdll!_RtlUserThreadStart+0x1b
Secondly, the process that crashes is itself responsible for starting the debugger via the RTL exception handler. The debugger is registered under the AeDebug registry key. Even if you are able to unwind to the RTL exception handler, you may still run into trouble: if the computer is low on system resources such as desktop heap, it may not be able to create a new process, and thus will not be able to launch the debugger. As Skywing stated, it’s a relatively heavyweight operation. Applications may also call TerminateProcess from within their own code based on an error condition. If we have a customer that sees this symptom on a regular basis, we typically recommend having them attach a debugger to monitor the process. Simply run ADPLUS -crash -p (PROCESSID).
Good work folks! We’ll have another puzzler ready next Monday.
Good Luck and happy debugging!
I recently came across a very interesting profiling tool that is available in Vista SP1 and Windows Server 2008 called the Windows Performance Analyzer. You can use this tool to profile and diagnose different kinds of symptoms that the machine is experiencing. This tool is built on top of the Event Tracing for Windows (ETW) infrastructure. It uses the ETW providers to record kernel events and then display them in a graphical format.
Performance Analyzer provides many different graphical views of trace data.
Download the latest version of the Windows Performance Tools Kit and install it on your machine (http://www.microsoft.com/whdc/system/sysperf/perftools.mspx : Windows Performance Tools Kit, v.4.1.1 (QFE)). You will need the toolkit that corresponds to your processor architecture; currently there are three versions available: X86, X64, and IA64.
After installation you should see two new tools. The first is Xperf, a command-line tool used to capture the trace. The second is XperfView, which graphically interprets the trace collected by Xperf.
You will need to run Xperf and XperfView from an elevated command prompt for full functionality.
For many tasks all you need for effective analysis is a kernel trace. For this example, we'll use the -on DiagEasy parameter to enable several kernel events including: image loading; disk I/O; process and thread events; hard faults; deferred procedure calls; interrupts; context switches; and performance counters. From an elevated command prompt, launch xperf -on DiagEasy.
This starts the kernel logger in sequential mode to the default file "\kernel.etl"; uses a default buffer size of 64K, with a minimum of 64 and a maximum of 320 buffers.
To stop a trace, type xperf -d <filename>.etl at the command line. This will stop the trace and write the output file.
There are two ways to view the trace: from an elevated command prompt, launch xperf <filename>.etl, or launch the XperfView tool and open the file manually. When you open the trace file, you should see something like this.
NOTE - While you need to run xperf from an elevated command prompt in order to record a trace you do not need an elevated command prompt in order to *analyze* a trace.
Using the Chart Selector tab, you can select the graphs you want to look at. To drill down into a chart, select its Summary Table. For instance, in the CPU Sampling chart, the summary table lists the processes that were running, with information such as CPU time, CPU %, and stacks (if stacks were collected in the trace; see below). In the Summary Table for the Disk I/O chart, you can see which processes were writing files to disk (including the filenames!), as well as how much time each took.
You also have the ability to zoom in on a selected area. Another really cool feature is the ability to overlay multiple graphs on one frame. This way you can correlate different pieces of data together very easily.
You can also select which counter instances you want to see in each chart. In the top right corner of each chart is a drop-down box from which you can select the counter instances. For instance, in the Disk I/O chart, you can select Disk 0, Disk 1, or a combination.
You can also view detailed information about the system that the trace was taken on. Click on the Trace menu item, and select System Configuration.
The first sample Xperf command we ran was xperf -on DiagEasy. I am sure many of you are wondering what DiagEasy means. DiagEasy is a group of kernel events predefined by the Windows Performance Toolkit. This group includes Process, Thread, Kernel and User Image Load/Unload, Disk I/O, DPC and Context Switch events.
When we used the xperf -on DiagEasy command, we did not specify an individual provider, so we enabled these kernel events for all the ETW providers on the system. If you want to enable events for a specific provider, you can use the format xperf -on (GUID|KnownProviderName)[:Flags[:Level]]. For more information about ETW providers, kernel flags, and groups, run the xperf -help providers command.
One of the most powerful features in Performance Analyzer is the ability to visualize stacks. It's important to note that this requires no special instrumentation in the code – only that you have symbols for the binary components you are interested in analyzing.
When the trace is set up to collect stacks, Performance Analyzer will display call-stack summary information for the events that had stack walking enabled. Here is an example that takes a trace (with stack tracing enabled) of the entire system while running a "find string" utility. We can use the Stack Tracing feature of Xperf to record a stack when certain events happen, or to take samples at regular intervals over time. See the xperf -help stackwalk output for more info.
Below, we will use the Stack Tracing feature of Xperf to take stack samples at regular intervals. With this output, we will be able to determine where the CPU is spending most of its time within a process.
xperf -on latency -stackwalk Profile
Latency is a kernel group that enables certain events, including the profile event, which records the CPUs' activity every millisecond. The "-stackwalk Profile" flag tells Xperf to record a stack walk on every profile event, which makes the profile information much more useful. In other words, in order to get profile information with stack walks you need to turn on the profile event and turn on stack walking for that event.
Note that decoding stacks requires that symbol decoding be configured. However, stacks can be recorded without symbols, and can even be viewed without symbols, although they are much less useful that way. I only mention this in case you're trying to record a trace on a problematic machine with little time to mess around with _NT_SYMBOL_PATH.
To get a trace with the stack information, do the following:
Click on the selector tab to bring up the column chooser list. Then select "Process name", "Process", "Stack", "Weight" and "%Weight". These are the most useful columns when looking at stacks from the sample profile event. You should get a view similar to this.
At this point I need to mention a few restrictions on stack walking, and when and how it works.
· Xperf stack walking is not available on XP
· On Vista stack walking is available for x86, and is available for x64 as of Vista SP1.
· On Windows 7 stack walking is available.
· Stack walking on x64 is complicated. You have to set DisablePagingExecutive in the registry, as documented here:
REG ADD "HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management" -v DisablePagingExecutive -d 0x1 -t REG_DWORD -f
I recently came across a case where the customer was complaining that DPC processing was taking up too much CPU time. We ran Xperf on the machine and drilled down into the DPC activity on the machine.
From the Xperf graph, I was able to confirm that the customer was actually seeing high DPC usage. I selected the Summary for this chart, and got the list of drivers that were actually taking up CPU time.
Right off the bat, I could identify the driver with a lot of DPC activity. I also noted that the average duration of each DPC from that driver was 788 microseconds. This is way too high; each DPC should take a maximum of 100 microseconds.
Performance.Analyzer.QuickStart.xps – This is shipped with the performance toolkit.
From an elevated command prompt, launch xperf -help
Written By Jeff Dailey:
Not all our cases are crashes, leaks, or high CPU. Sometimes the problem we are faced with is purely a question of why a given application runs slowly on one version of Windows versus another. In other cases an application may just start running slowly for no reason. OK, not likely. There is ALWAYS SOME REASON. Something changed! In this case, the customer reported that an application started running slowly when booted into “Normal Mode”, but when the OS was booted into Safe Mode, the application would run fast. In this particular case the customer reported that a given operation went from taking just a few seconds (Safe Mode) to several minutes (Normal Mode). Further research found that the problem was related to accessing the registry and registry performance in general. At this point I’m already thinking, “Registry access?” and “Safe Mode?” What could affect registry access that does not run in Safe Mode? Well, lots of services DO NOT start in Safe Mode. What kind of service could affect registry calls? Antivirus? Maybe… Let’s look deeper.
One of the first things I typically do in such cases is ask for a kernrate log of the slow and fast scenarios. http://download.microsoft.com/download/8/e/c/8ec3a7d8-05b4-440a-a71e-ca3ee25fe057/rktools.exe Kernrate is a sampling profiler. It basically checks the location of the instruction pointer at regular intervals and stores the results in a hash table. We can then get a breakdown of the % time spent in each module that is executing. Even better, you can zoom in on each module. Zooming in shows utilization at a function level within the module, and requires symbols to be present in a flat symbol directory on the machine being profiled. I recommend downloading the symbol pack for this (http://www.microsoft.com/whdc/devtools/debugging/symbolpkg.mspx) or using symchk.exe (included in the debugging tools) to download the symbols. We’ll talk more about symbols and symchk.exe in an upcoming post.
In a lot of cases kernrate data is only a starting point. We find some code that runs a lot longer in one case versus another, and that in turn requires a follow-up code review and multiple debug sessions to further isolate the problem. This case, however, was different. The following is output from Beyond Compare showing a comparison of module execution time in the kernel. The slow test run is on the right, and the fast test run is on the left. Keeping in mind that I was looking for something different between Safe Mode and Normal Mode, I simply started by looking at the modules listed on the slow side (right) that were not on the fast side (left): what was loaded during the normal run that was not a factor during the Safe Mode run? Right away FILTERDRVXYZ, just above the HAL, jumped off the page. (Some names were changed to protect the innocent.) I did a http://www.live.com search to find out what this driver was. It was a file system filter driver for an antivirus program.
To understand why a filter driver can slow things down, you need to understand that a filter driver basically installs itself between the application and the underlying file system and intercepts nearly every call made to the underlying I/O subsystem. Most of the time filter drivers are just fine and run with no problems at all. But consider what they do: an antivirus filter driver may have several tasks to perform for each I/O, such as looking up information about a process or file when it is accessed. If the data source associated with this information changes, the amount of time these operations take can change too. The interesting thing is that the timing change may only be a hundred or so extra milliseconds per call to the registry or file system; in most cases a user would not even notice. If, however, some aspect of your application needs to make many thousands of calls and each takes 250+ milliseconds instead of 10 milliseconds, all of a sudden the problem compounds. The classic example is a virus profile getting uploaded or pushed to servers, and all of a sudden the customer starts seeing performance problems because the data set the filter driver compares against has changed. In this case I simply recommended that the customer remove the suspect filter driver to see if it was the root of the problem. IT WAS… They contacted the product vendor to get an update, and the case was resolved.
Good luck and happy debugging….