Using the xperf tools, ETW, and the kernel sample profile interrupt all together provides a very effective and easy to use sample profiler for the analysis of both application and system wide performance. At each sample interrupt, the ETW sub-system captures the instruction pointer and the stack. This data is lazily and efficiently logged to an ETL file. Once the data is saved, it can be analyzed with Performance Analyzer.
The next article in this series is So just what is in a trace? Using the xperf trace dumper
Note: the examples in this post only works on Vista or Server 2008 32-bit; Prior operating system's do not support taking stack traces. Taking stack traces on 64-bit platforms will be the topic of another post.
Here is an example of profiling FS.EXE, a grep-like utility I've written. I use this tool for experimenting with various topics such as efficient I/O, well performing string matching algorithms, and instrumenting applications with ETW.
For this test, I put the following commands in a CMD file:
Since the commands are a bit long, I've separated them above and added line breaks to make them readable. Each command above should be on one line in your command file.
The first command turns on the kernel logger and enables the following events:
The other flags are important as well.
The second command simply runs the experiment. It searches for the string 'farglenorgin' in all my .CPP and .H files. I'm using a string that doesn't exist so I execute the worst case code paths in the application. Replace this command with a command to run your experiment, or a pause instruction so you can dork around with a graphical program.
The third command simply stops the kernel logger, merges the data and saves it in profile.etl.
NOTE: These commands need to be run from an elevated command prompt. Controlling ETW tracing requires administrative privileges.
There is now one other thing to do before examining the data - setting the symbol path. Here is how I set the symbol path for this example:
set _NT_SYMBOL_PATH = c:\coding\fs\release\fs; SRV*c:\symbols*http://msdl.microsoft.com/download/symbols
This tells the symbol decoder to look for symbols in the release build directory for FS.EXE and in the Windows public symbol server, caching the served symbols in c:\symbols. The xperf tools uses the symbol decoding libraries from the debugging tools for Windows. You can find more information on using symbols here.
Once the trace is taken and the symbol path is set, then simply open the trace in Performance Analyzer with the command "xperf profile.etl".
The CPU Sampling by Process graph is the most interesting graph for this example. To select the visible graphs, click on the flyout control on the left of the window, then select the CPU Sampling by CPU, and by Process graphs.
For his experiment, the CPU sampling by Process graph looks like this:
By default, all processes running during the trace are shown except the idle task (as seen above). You can change which processes are displayed or hidden by by using the check boxes in the legend drop down as shown above.
This graph illustrates an important concept about the kernel event provider and the xperf tools in general - they are specifically designed to analyze system wide and application performance data and events. For example, in the legend above, there are many processes listed, but only a few of them actually used any CPU time during the experiment.
Using the legend, you can eliminate all processes except the interesting one. Here is what the graph of CPU utilization for only FS.EXE looks like
This is pretty cool as it provides a nice overview of the CPU utilization of FS.EXE, but it really doesn't tell us much about where time is being spent in the process it self.
The real power in Performance Analyzer is in its summary tables. These are tabular displays of data about a specific chart, or a region in a chart. For this experiment, I looked at the sample profile data for the entire trace. To do this, right mouse click on the CPU Sampling by Process chart, make sure that the load symbols option is set, then select the Summary table view.
Note, it will take 10 to 20 seconds for the summary table to show up. Performance Analyzer is loading symbols while this is happening. (putting symbol loading on a background thread is on our to do list...).
After the summary table pops up, click on the flyout and select the columns for display. In this case you will want the process, stack and % Weight columns (feel free to experiment with other columns).
Next, arrange the columns as follows. The columns to the left of the gold column are grouping columns. You can change the order of columns and put them to the left or right of the gold column by dragging.
Now, you can expand the stacks for FS.EXE and see where it is spending its time. Not that this isn't by function as in some profilers but by call stack. This is much more powerful than simply knowing the functions where time is spent as it also shows you how the time consuming functions were called.
Its no surprise that my find string utility spends most of its time in the following stack:
As with other sample profilers, you can look "up" and "down" the stacks from any particular point. This is commonly called a butterfly view. Right mouse click on any item in the stack column and experiment with the callers/callees and inntermost/outermost options, like this:
This stack trace has very simple call stacks so it isn't very useful for looking at butterfly views. But try one of your own programs and look at a butterfly stack view of a function that is called often from multiple places. Or, use the butterfly view too look at a intermediate function and see all the functions it calls, and their stacks.
The above screen shots and summary table views contain the data from the entire trace. This works ok for short traces. But for longer traces, or even short traces with a lot of detail, we often need to look at specific time spans.
For example, there are some time spans in my experiment where FS isn't using very little CPU time. I'd like to see what FS is up to in that time span. This is easily done by using the left mouse button to select a time span on the X axis and zooming the graph to that view, or look at the summary table for that span, as in this example.
Once the interesting region is selected, I simply use the right mouse button to pop up the context menu and select summary table.
Note that you can open up multiple summary tables, each from different regions of a graph, or even different graphs. This is great for making comparisons.
The new summary table window now only shows the data for selected time span in the trace. Is not surprising that FS is spending the little CPU time it is using in user mode asynchronous procedure calls.
This post illustrates some key concepts: