Using the xperf tools, ETW, and the kernel sample profile interrupt together provides a very effective and easy-to-use sample profiler for analyzing both application and system-wide performance.  At each sample interrupt, the ETW sub-system captures the instruction pointer and the stack.  This data is lazily and efficiently logged to an ETL file.  Once the data is saved, it can be analyzed with Performance Analyzer.

The next article in this series is "So just what is in a trace? Using the xperf trace dumper."

Note: the examples in this post only work on 32-bit Vista or Server 2008; prior operating systems do not support taking stack traces.  Taking stack traces on 64-bit platforms will be the topic of another post.

Here is an example of profiling FS.EXE, a grep-like utility I've written.  I use this tool for experimenting with various topics such as efficient I/O, well-performing string matching algorithms, and instrumenting applications with ETW.

For this test, I put the following commands in a CMD file:

    • xperf -on PROC_THREAD+LOADER+INTERRUPT+DPC+PROFILE
      -stackwalk profile
      -minbuffers 16 -maxbuffers 1024 -flushtimer 0
      -f e:\tmp.etl
    • fs.exe farglenorgin c:\coding\*.cpp *.h -s
    • xperf -d profile.etl

Since the commands are a bit long, I've separated them above and added line breaks to make them readable.  Each command above should be on one line in your command file.
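For reference, here is the same thing written out the way it actually runs, as a sketch of a complete CMD file with one command per line.  The paths (e:\tmp.etl) and the fs.exe command line are from my setup; swap in your own.

    @echo off
    rem Start the kernel logger: process/thread, image load, interrupt,
    rem DPC, and sample profile events, walking the stack on each sample.
    xperf -on PROC_THREAD+LOADER+INTERRUPT+DPC+PROFILE -stackwalk profile -minbuffers 16 -maxbuffers 1024 -flushtimer 0 -f e:\tmp.etl

    rem Run the experiment while the kernel logger is collecting.
    fs.exe farglenorgin c:\coding\*.cpp *.h -s

    rem Stop the kernel logger, merge the data, and save it as profile.etl.
    xperf -d profile.etl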

The first command turns on the kernel logger and enables the following events:

  • The PROC_THREAD flag enables the process and thread events.  These mark the beginning and end of each process and thread.  The kernel provider guarantees that there will be a begin/end pair for every process and thread in the trace; processes and threads that existed before the trace started, or that are still running when it stops, also get these events.
  • The LOADER flag enables the loader events that log when the kernel loads an image (an EXE or DLL)
  • The INTERRUPT and DPC flags enable the ETW interrupt and DPC events, which mark each interrupt and deferred procedure call; these are routines that run at or above DISPATCH_LEVEL.
  • The PROFILE flag does two things: it turns on the system's sample profile interrupt, and it enables the kernel's sample profile ETW event.

The other flags are important as well.

  • The -stackwalk profile parameter turns on ETW's stack-walking feature for the sample profile event.  Every time a sample profile event is triggered by the sample profile interrupt, ETW captures the stack and saves the data in the trace buffers.
  • The -minbuffers 16 parameter sets the minimum number of buffers that ETW will allocate for storing events.  Note, you need at least two for each processor in your system.
  • The -maxbuffers 1024 parameter sets the maximum number of buffers ETW will allocate to 1024, a total of 64MB at ETW's default 64KB buffer size.  (You can check a running session's actual configuration; see the sketch after this list.)
  • The -flushtimer 0 parameter tells ETW never to flush the buffers based on a timer; buffers are only written to disk when they are full.
  • The -f e:\tmp.etl parameter tells ETW to lazily write filled ETW buffers to e:\tmp.etl.  This puts the log file on a different physical drive from the one the experiment runs on, so the writes ETW issues to save the trace data do not land on the interesting drive.
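While a trace is collecting, you can sanity-check the session's buffer configuration from a second elevated prompt.  If I remember the option correctly, the query looks like the sketch below; check xperf -help if your version of the tools differs:

    rem List the active logging sessions, including the kernel logger's
    rem buffer size and counts, events lost, and log file name.
    xperf -Loggers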

The second command simply runs the experiment.  It searches for the string 'farglenorgin' in all my .CPP and .H files.  I'm using a string that doesn't exist so that I exercise the worst-case code paths in the application.  Replace this command with one that runs your own experiment, or with a pause instruction so you can dork around with a graphical program, as in the sketch below.
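Here is what that might look like for a graphical program; the program name is just a hypothetical stand-in for whatever you want to profile:

    rem Launch the program under test, then hold the trace open
    rem until you have finished exercising it and press a key.
    start your-gui-app.exe
    pause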

The third command simply stops the kernel logger, merges the data and saves it in profile.etl.

NOTE: These commands need to be run from an elevated command prompt.   Controlling ETW tracing requires administrative privileges.

There is now one other thing to do before examining the data: setting the symbol path.  Here is how I set it for this example:

set _NT_SYMBOL_PATH=c:\coding\fs\release\fs;SRV*c:\symbols*http://msdl.microsoft.com/download/symbols

This tells the symbol decoder to look for symbols in the release build directory for FS.EXE and in the Windows public symbol server, caching the served symbols in c:\symbols.  The xperf tools use the symbol decoding libraries from the Debugging Tools for Windows.  You can find more information on using symbols here.
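One gotcha worth noting: the set command must be on a single line with no spaces around the = sign, or the CMD shell will quietly include the stray spaces in the variable name or value.  You can ask the shell to echo back what was actually set:

    rem Prints every variable whose name starts with _NT_SYMBOL_PATH.
    set _NT_SYMBOL_PATH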

Once the trace is taken and the symbol path is set, then simply open the trace in Performance Analyzer with the command "xperf profile.etl". 

The CPU Sampling by Process graph is the most interesting graph for this example.  To select the visible graphs, click on the flyout control on the left of the window, then select the CPU Sampling by CPU and by Process graphs.

For this experiment, the CPU Sampling by Process graph looks like this:

[screenshot: CPU Sampling by Process graph showing all processes]

By default, all processes running during the trace are shown except the idle task (as seen above).  You can change which processes are displayed or hidden by using the check boxes in the legend drop-down as shown above.

This graph illustrates an important concept about the kernel event provider and the xperf tools in general: they are specifically designed to analyze both system-wide and application performance data and events.  For example, in the legend above, there are many processes listed, but only a few of them actually used any CPU time during the experiment.

Using the legend, you can eliminate all processes except the interesting one.  Here is what the graph of CPU utilization for only FS.EXE looks like:

[screenshot: CPU Sampling by Process graph showing only FS.EXE]

This is pretty cool as it provides a nice overview of the CPU utilization of FS.EXE, but it really doesn't tell us much about where time is being spent in the process itself.

The real power in Performance Analyzer is in its summary tables.  These are tabular displays of data about a specific chart, or a region in a chart.  For this experiment, I looked at the sample profile data for the entire trace.  To do this, right mouse click on the CPU Sampling by Process chart, make sure that the load symbols option is set, then select the Summary table view.

[screenshot: chart context menu with the Load Symbols and Summary Table options]

Note, it will take 10 to 20 seconds for the summary table to show up; Performance Analyzer is loading symbols while this is happening.  (Putting symbol loading on a background thread is on our to-do list...)

After the summary table pops up, click on the flyout and select the columns to display.  In this case you will want the Process, Stack, and % Weight columns (feel free to experiment with other columns).

Next, arrange the columns as follows.  The columns to the left of the gold column are grouping columns.   You can change the order of columns and put them to the left or right of the gold column by dragging.

[screenshot: summary table column arrangement, with grouping columns to the left of the gold column]

Now you can expand the stacks for FS.EXE and see where it is spending its time.  Note that this isn't by function, as in some profilers, but by call stack.  This is much more powerful than simply knowing the functions where time is spent, as it also shows you how the time-consuming functions were called.

It's no surprise that my find-string utility spends most of its time in the following stack:

[screenshot: expanded call stack for FS.EXE]

As with other sample profilers, you can look "up" and "down" the stacks from any particular point.  This is commonly called a butterfly view.  Right mouse click on any item in the Stack column and experiment with the callers/callees and innermost/outermost options, like this:

[screenshot: stack context menu with the callers/callees and innermost/outermost options]

This trace has very simple call stacks, so it isn't very useful for looking at butterfly views.  But try one of your own programs and look at a butterfly stack view of a function that is called often from multiple places.  Or, use the butterfly view to look at an intermediate function and see all the functions it calls, and their stacks.

The above screen shots and summary table views contain the data from the entire trace.  This works OK for short traces.  But for longer traces, or even short traces with a lot of detail, we often need to look at specific time spans.

For example, there are some time spans in my experiment where FS is using very little CPU time.  I'd like to see what FS is up to in those spans.  This is easily done by using the left mouse button to select a time span on the X axis, then either zooming the graph to that selection or looking at the summary table for that span, as in this example:

[screenshot: selecting a time span on the CPU sampling graph]

Once the interesting region is selected, I simply use the right mouse button to pop up the context menu and select Summary Table.

Note that you can open up multiple summary tables, each from different regions of a graph, or even different graphs.  This is great for making comparisons.

The new summary table window now shows only the data for the selected time span in the trace.  It's not surprising that FS is spending the little CPU time it is using in user-mode asynchronous procedure calls.

[screenshot: summary table for the selected time span]

This post illustrates some key concepts:

  • The xperf tools are designed for both system-wide and application-specific analysis.
  • Profiling with ETW is very, very lightweight.  While the experiment is running, the xperf tools are not even loaded; the kernel itself is collecting the data.  All analysis is done as a post-processing task.
  • OS based sample profiling collects both user and kernel mode stacks.  
  • So long as you have symbols, production code can be profiled - no special debug or instrumented builds are required.
  • In this example, I started and stopped FS.exe (the experiment) between the tracing start and stop.  But, since this is ETW based, sample profiling can be started and stopped at any time, without stopping or restarting even a single process.  You can profile anything at any time on any system (see the sketch after this list).
  • Stack views provide a very powerful method for analyzing where time is spent in a process.
  • The general technique with Performance Analyzer is to use the graphics to identify interesting time spans in the trace, then use the summary tables to look at the data in detail.
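As a sketch of that last point, profiling a live system needs nothing more than the start and stop commands; nothing on the box has to be restarted while the kernel collects samples.  The flag set and file names here are just one reasonable choice:

    rem Start sampling whatever the machine happens to be doing.
    xperf -on PROC_THREAD+LOADER+PROFILE -stackwalk profile -f e:\tmp.etl

    rem ...wait as long as you like, then stop and merge the trace.
    xperf -d live.etl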