Bruce Dawson is a performance analyst on the client performance team.  He has written this guest post on enabling stack walking using xperf for both 32-bit and 64-bit Windows systems (Vista and Win7).  For more posts on xperf see this page.

Introduction

When I first started working with xperf I was confused by the many gotchas and settings surrounding the recording of call stacks. It seemed like there were many bits of crucial information needed in order to successfully record call stacks, and these bits were never gathered in one place. In order to save future generations from this complexity (and to give me a convenient reference to look at) I decided to write up what I have learned, while the lessons are still fresh in my memory.

Warning: The syntax examples below make use of ‘-‘ characters and the single quote character. Both of these characters are often altered by word processing software to visually similar but actually different characters, which can lead to obscure syntax errors. If the example syntax doesn’t work then try typing in the command manually instead of using copy/paste to eliminate that as a factor.

Recording Stack Walks

In order to enable call stacks in xperf you need to choose what type of event you want call stacks for, make sure that event is being recorded, and then enable call stacks for that type of event. The sampling profile events from the kernel provider (interrupting the CPU every millisecond and recording what it is doing) are one of the main uses of call stacks. The profile events can be enabled with the “PROFILE” kernel flag, or by using some of the kernel groups such as “Base” or “Latency” which include the PROFILE flag. For a list of all the kernel flags and kernel groups use:

xperf -providers k

In addition to enabling the sampling profiler we need to enable stack walking on that event with the -stackwalk command. We also need to make sure that the PROC_THREAD and LOADER kernel flags are enabled – so that the xperf tools can identify what modules the code addresses are in.

Putting this together, the following example command will tell xperf to record a trace with sampling profiler call stacks and save it to mytrace.etl:

xperf -on PROC_THREAD+LOADER+PROFILE -stackwalk profile
rem Your scenario goes here…
xperf -d mytrace.etl

Alternately, the following command will also work, because the “Latency” kernel group includes all of the necessary kernel flags.

xperf -on Latency -stackwalk profile
rem Your scenario goes here...
xperf -d mytrace.etl

Many events can have stack walking enabled using this basic method, so that if you want call stacks for context switches, registry operations, etc., you can get them. You can see the list of events that can have call stacks enabled for them with:

xperf -help stackwalk

You can also record call stacks for manifest-based ETW (Event Tracing for Windows) events, but the syntax is quite different, and it only works on Windows 7 and above. When you specify your ETW provider to xperf after “-on” you can specify extra parameters after the provider name, separated by colons. These are flags, a level, and, for manifest-based providers, a list of extra data to record, which can include call stacks. You can leave the flags and level fields blank and just specify ‘stack’ (in single quotes) after three colons like this:

xperf -on Latency -stackwalk profile -start browse -on Microsoft-IE:::'stack'
rem Your scenario goes here...
xperf -stop browse -stop -d mytrace.etl

The syntax above starts the kernel session with the “Latency” set of kernel flags, enables stack walking on the profile events, starts a user session called “browse”, and enables the “Microsoft-IE” ETW provider for that session, with call stacks recorded for all of those events. Then, when your scenario is finished it stops the user session, stops the kernel session, and merges the kernel and user traces into “mytrace.etl”. It’s a bit of a mouthful, but it works.

Operating System Support for XPerf Stack Walks

Recording stack walks from xperf is not supported in all variations of Windows and does have some prerequisites. Note also that stack walking for classic ETW events is not supported. Owners of classic ETW providers must upgrade their providers to manifest-based to take advantage of stack walking on their custom events.

The following chart summarizes the situation for stack walking on kernel and manifest-based events:

 

32-bit

64-bit

XP and below

No

No

Vista RTM*

Yes

No

Vista SP1*

Yes

Yes**

Windows 7

Yes

Yes**

 

* Stack walking not supported for manifest-based ETW events.

** Stack walking on 64-bit requires that the DisablePagingExecutive registry key be set.

Disable Paging Executive

In order for tracing to work on 64-bit Windows you need to set the DisablePagingExecutive registry key. This tells the operating system not to page kernel mode drivers and system code to disk, which is a prerequisite for getting 64-bit call stacks using xperf, because 64-bit stack walking depends on metadata in the executable images, and in some situations the xperf stack walk code is not allowed to touch paged out pages. Running the following command from an elevated command prompt will set this registry key for you.

REG ADD "HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management" -v DisablePagingExecutive -d 0x1 -t REG_DWORD -f

After setting this registry key you will need to reboot your system before you can record call stacks. Having this flag set means that the Windows kernel locks more pages into RAM, so this will probably consume about 10 MB of additional physical memory.

Looking at Stack Walks

The call stacks that result from the stack walks are visible in xperfview. You can run xperfview and load your trace or you can invoke it from the command line with:

xperf mytrace.etl

If you enabled the PROFILE kernel flag then you should see one or more CPU Sampling graphs – if they aren’t enabled then use the selector on the left sidebar, or the Graphs menu, to enable one of them.

image

From one of the CPU sampling graphs you can select an area of interest and then right-click and select Summary Table. Use the selector on the left sidebar of the summary table, or the Columns menu, to make sure that the Stack column is enabled and visible – this is where your call stacks will be shown. Xperfview summarizes multiple stack walks together, collapsing them down where they are identical. The Count field shows how many stacks were collapsed together on each row (how many times that partial call stack was hit), the Weight column shows an estimate of how many milliseconds of CPU time were spent in that call stack, and the %Weight column shows that as a percentage of total CPU time available.

image

Sampling Implies Estimation

CPU sampling is a statistical profiling process. If the number of samples is small, or if the code execution is correlated with the sampling interval, then the results can be off by a large amount. It is important to be aware of the limits of sampling in order to avoid reading too much meaning into a small number of counts.

Symbols

If you haven’t loaded symbols then your call stacks will show DLL names followed by question marks, as shown below. You can tell that you’ve recorded a call stack, but the details of what code was executing are still hidden.

Stack Count Weight % Weight
[Root] 4973 4972.513 93.51
|- ntdll.dll!? 4950 4949.49 93.08
| kernel32.dll!? 4950 4949.49 93.08
|  |- MSVCR90.DLL!? 4904 4903.486 92.21
|  |    MSVCR90.DLL!? 4904 4903.486 92.21
|  |    MFC90.DLL!? 4904 4903.486 92.21
|  |    FractalX.EXE!? 4904 4903.486 92.21
|  |    FractalX.EXE!? 4904 4903.486 92.21
|  |    FractalX.EXE!? 4904 4903.486 92.21
|  |  |- FractalX.EXE!? 4900 4899.485 92.14
|  |  |- user32.dll!? 3 3.001464 0.06
|  |  |- MSVCR90.DLL!? 1 0.999918 0.02
|  |- FractalX.EXE!? 46 46.00391 0.87
|- ntkrnlmp.exe!? 23 23.0229 0.43

 

A single question mark in the stack column means that no call stack is available at all. If “?!?” is displayed then that means that xperf doesn’t know what executable image was at the address.

If the DLLs are from Microsoft then you should point your symbol path to Microsoft’s symbol servers (using the _NT_SYMBOL_PATH environment variable or Configure Symbol Paths from the Trace menu). You can have multiple symbol paths, separate by semicolons, to that you can point at Microsoft’s symbols and your own simultaneously. Then you need to select Load Symbols from the Trace menu or a graph context menu.

Exploring Call Stacks

The call stack display is a bit unusual in that the root of the call stack (conveniently labeled as [Root]) is displayed at the top and you expand downwards from there. The layout is also a bit subtle and non-obvious at first. When you see multiple functions above each other at the same indentation with no minus signs in front of them (for example endthreadex and _AfxThreadEntry below) then that means that in that section of the call stack all of the collapsed stacks took the same path – endthreadex called _AfxThreadEntry in all of the samples and these functions have a parent/child relationship. This layout helps to keep the call stack display as compact as possible.

Stack Count Weight % Weight
[Root] 4973 4972.513 93.51
|- ntdll.dll!? 4950 4949.49 93.08
|  kernel32.dll!? 4950 4949.49 93.08
|  |- MSVCR90.DLL!endthreadex 4904 4903.486 92.21
|  |     MSVCR90.DLL!endthreadex 4904 4903.486 92.21
|  |     MFC90.DLL!_AfxThreadEntry 4904 4903.486 92.21
|  |     FractalX.EXE!PrimaryCalculationProc 4904 4903.486 92.21
|  |     FractalX.EXE!HomeWork::CalculateNextBit 4904 4903.486 92.21
|  |  |- FractalX.EXE!Calculator::Calculate 4899 4898.484 92.12
|  |  |  |- FractalX.EXE!Calculator::Calculate<itself> 2743 2742.873 51.58
|  |  |  |- FractalX.EXE!Calculator::CalculateHelper 2156 2155.611 40.54
|  |  |- FractalX.EXE!IterDataImp<int>::Render 5 5.002155 0.09
|  |- FractalX.EXE!__tmainCRTStartup 46 46.00391 0.87

 

When you see multiple functions above each other at the same indentation with minus signs in front of them (for instance Calculator::Calculate and IterDataImp<int>::Render) it means that all of these functions are called by the function above them. In other words those functions are siblings, all called by HomeWork::CalculateNextBit, and the collapsed call stacks took multiple paths at this point.

When you see a line that ends with <itself> then than line represents samples that occurred in that function, as opposed to in its descendants. This is often described as the exclusive time, as opposed to inclusive time which includes descendants. The <itself> line for a function, if present, will show up as a child node of the function so that the child nodes’ counts sum up to the parent’s count. In the example above Calculator::Calculate was running for 2743 samples and its child function was running for 2156 samples, for a total of 4899.

The best way to get the feel for how this works is to play around with it. If you select a call stack and repeatedly press the right arrow key then the call stack will keep expanding down the hottest path and this can be a very effective way to drill down into what is probably the most important call stack.

To see call stacks for manifest-based ETW events (assuming you enabled them) you need to go to the Generic Events graph and then select Summary Table from its context menu. From there you should make sure that the Stack column is enabled and you can then explore the stacks. The Count column can be very useful to see how many times the various events were hit from various call stacks. When enabled the stacks of manifest-based events are also available in the tooltips of the corresponding event markers in the Generic graph itself.

Summary

Call stacks can be recorded with:

xperf -on Latency -stackwalk profile
rem Your scenario goes here...
xperf -d mytrace.etl

If you are tracing on 64-bit Windows don’t forget to set DisablePagingExecutive and then reboot before recording call stacks. And remember that Windows 7 has better ETW call stack recording support than Windows Vista, and Windows XP has no ETW call stack recording support.

To analyze call stacks load the trace into xperfview and look in the appropriate summary table in the Stack column.

Happy tracing!

Bruce Dawson, Windows Client Performance Team