Bruce Dawson is a performance analyst on the client performance team. He has written this guest post on enabling stack walking using xperf for both 32-bit and 64-bit Windows systems (Vista and Win7). For more posts on xperf see this page.
Introduction
When I first started working with xperf I was confused by the many gotchas and settings surrounding the recording of call stacks. It seemed like there were many bits of crucial information needed in order to successfully record call stacks, and these bits were never gathered in one place. In order to save future generations from this complexity (and to give me a convenient reference to look at) I decided to write up what I have learned, while the lessons are still fresh in my memory.
Warning: The syntax examples below make use of ‘-‘ characters and the single quote character. Both of these characters are often altered by word processing software to visually similar but actually different characters, which can lead to obscure syntax errors. If the example syntax doesn’t work then try typing in the command manually instead of using copy/paste to eliminate that as a factor.
Recording Stack Walks
In order to enable call stacks in xperf you need to choose what type of event you want call stacks for, make sure that event is being recorded, and then enable call stacks for that type of event. The sampling profile events from the kernel provider (interrupting the CPU every millisecond and recording what it is doing) are one of the main uses of call stacks. The profile events can be enabled with the “PROFILE” kernel flag, or by using some of the kernel groups such as “Base” or “Latency” which include the PROFILE flag. For a list of all the kernel flags and kernel groups use:
xperf -providers k
In addition to enabling the sampling profiler we need to enable stack walking on that event with the -stackwalk command. We also need to make sure that the PROC_THREAD and LOADER kernel flags are enabled – so that the xperf tools can identify what modules the code addresses are in.
Putting this together, the following example command will tell xperf to record a trace with sampling profiler call stacks and save it to mytrace.etl:
| xperf -on PROC_THREAD+LOADER+PROFILE -stackwalk profile |
| rem Your scenario goes here… |
| xperf -d mytrace.etl |
Alternately, the following command will also work, because the “Latency” kernel group includes all of the necessary kernel flags.
| xperf -on Latency -stackwalk profile |
| rem Your scenario goes here... |
| xperf -d mytrace.etl |
Many events can have stack walking enabled using this basic method, so that if you want call stacks for context switches, registry operations, etc., you can get them. You can see the list of events that can have call stacks enabled for them with:
xperf -help stackwalk
You can also record call stacks for manifest-based ETW (Event Tracing for Windows) events, but the syntax is quite different, and it only works on Windows 7 and above. When you specify your ETW provider to xperf after “-on” you can specify extra parameters after the provider name, separated by colons. These are flags, a level, and, for manifest-based providers, a list of extra data to record, which can include call stacks. You can leave the flags and level fields blank and just specify ‘stack’ (in single quotes) after three colons like this:
| xperf -on Latency -stackwalk profile -start browse -on Microsoft-IE:::'stack' |
| rem Your scenario goes here... |
| xperf -stop browse -stop -d mytrace.etl |
The syntax above starts the kernel session with the “Latency” set of kernel flags, enables stack walking on the profile events, starts a user session called “browse”, and enables the “Microsoft-IE” ETW provider for that session, with call stacks recorded for all of those events. Then, when your scenario is finished it stops the user session, stops the kernel session, and merges the kernel and user traces into “mytrace.etl”. It’s a bit of a mouthful, but it works.
Operating System Support for XPerf Stack Walks
Recording stack walks from xperf is not supported in all variations of Windows and does have some prerequisites. Note also that stack walking for classic ETW events is not supported. Owners of classic ETW providers must upgrade their providers to manifest-based to take advantage of stack walking on their custom events.
The following chart summarizes the situation for stack walking on kernel and manifest-based events:
| | 32-bit | 64-bit |
| XP and below | No | No |
| Vista RTM* | Yes | No |
| Vista SP1* | Yes | Yes** |
| Windows 7 | Yes | Yes** |
* Stack walking not supported for manifest-based ETW events.
** Stack walking on 64-bit requires that the DisablePagingExecutive registry key be set.
Disable Paging Executive
In order for tracing to work on 64-bit Windows you need to set the DisablePagingExecutive registry key. This tells the operating system not to page kernel mode drivers and system code to disk, which is a prerequisite for getting 64-bit call stacks using xperf, because 64-bit stack walking depends on metadata in the executable images, and in some situations the xperf stack walk code is not allowed to touch paged out pages. Running the following command from an elevated command prompt will set this registry key for you.
REG ADD "HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management" -v DisablePagingExecutive -d 0x1 -t REG_DWORD -f
After setting this registry key you will need to reboot your system before you can record call stacks. Having this flag set means that the Windows kernel locks more pages into RAM, so this will probably consume about 10 MB of additional physical memory.
Looking at Stack Walks
The call stacks that result from the stack walks are visible in xperfview. You can run xperfview and load your trace or you can invoke it from the command line with:
xperf mytrace.etl
If you enabled the PROFILE kernel flag then you should see one or more CPU Sampling graphs – if they aren’t enabled then use the selector on the left sidebar, or the Graphs menu, to enable one of them.
From one of the CPU sampling graphs you can select an area of interest and then right-click and select Summary Table. Use the selector on the left sidebar of the summary table, or the Columns menu, to make sure that the Stack column is enabled and visible – this is where your call stacks will be shown. Xperfview summarizes multiple stack walks together, collapsing them down where they are identical. The Count field shows how many stacks were collapsed together on each row (how many times that partial call stack was hit), the Weight column shows an estimate of how many milliseconds of CPU time were spent in that call stack, and the %Weight column shows that as a percentage of total CPU time available.
Sampling Implies Estimation
CPU sampling is a statistical profiling process. If the number of samples is small, or if the code execution is correlated with the sampling interval, then the results can be off by a large amount. It is important to be aware of the limits of sampling in order to avoid reading too much meaning into a small number of counts.
Symbols
If you haven’t loaded symbols then your call stacks will show DLL names followed by question marks, as shown below. You can tell that you’ve recorded a call stack, but the details of what code was executing are still hidden.
| Stack | Count | Weight | % Weight |
| [Root] | 4973 | 4972.513 | 93.51 |
| |- ntdll.dll!? | 4950 | 4949.49 | 93.08 |
| | kernel32.dll!? | 4950 | 4949.49 | 93.08 |
| | |- MSVCR90.DLL!? | 4904 | 4903.486 | 92.21 |
| | | MSVCR90.DLL!? | 4904 | 4903.486 | 92.21 |
| | | MFC90.DLL!? | 4904 | 4903.486 | 92.21 |
| | | FractalX.EXE!? | 4904 | 4903.486 | 92.21 |
| | | FractalX.EXE!? | 4904 | 4903.486 | 92.21 |
| | | FractalX.EXE!? | 4904 | 4903.486 | 92.21 |
| | | |- FractalX.EXE!? | 4900 | 4899.485 | 92.14 |
| | | |- user32.dll!? | 3 | 3.001464 | 0.06 |
| | | |- MSVCR90.DLL!? | 1 | 0.999918 | 0.02 |
| | |- FractalX.EXE!? | 46 | 46.00391 | 0.87 |
| |- ntkrnlmp.exe!? | 23 | 23.0229 | 0.43 |
A single question mark in the stack column means that no call stack is available at all. If “?!?” is displayed then that means that xperf doesn’t know what executable image was at the address.
If the DLLs are from Microsoft then you should point your symbol path to Microsoft’s symbol servers (using the _NT_SYMBOL_PATH environment variable or Configure Symbol Paths from the Trace menu). You can have multiple symbol paths, separate by semicolons, to that you can point at Microsoft’s symbols and your own simultaneously. Then you need to select Load Symbols from the Trace menu or a graph context menu.
Exploring Call Stacks
The call stack display is a bit unusual in that the root of the call stack (conveniently labeled as [Root]) is displayed at the top and you expand downwards from there. The layout is also a bit subtle and non-obvious at first. When you see multiple functions above each other at the same indentation with no minus signs in front of them (for example endthreadex and _AfxThreadEntry below) then that means that in that section of the call stack all of the collapsed stacks took the same path – endthreadex called _AfxThreadEntry in all of the samples and these functions have a parent/child relationship. This layout helps to keep the call stack display as compact as possible.
| Stack | Count | Weight | % Weight |
| [Root] | 4973 | 4972.513 | 93.51 |
| |- ntdll.dll!? | 4950 | 4949.49 | 93.08 |
| | kernel32.dll!? | 4950 | 4949.49 | 93.08 |
| | |- MSVCR90.DLL!endthreadex | 4904 | 4903.486 | 92.21 |
| | | MSVCR90.DLL!endthreadex | 4904 | 4903.486 | 92.21 |
| | | MFC90.DLL!_AfxThreadEntry | 4904 | 4903.486 | 92.21 |
| | | FractalX.EXE!PrimaryCalculationProc | 4904 | 4903.486 | 92.21 |
| | | FractalX.EXE!HomeWork::CalculateNextBit | 4904 | 4903.486 | 92.21 |
| | | |- FractalX.EXE!Calculator::Calculate | 4899 | 4898.484 | 92.12 |
| | | | |- FractalX.EXE!Calculator::Calculate<itself> | 2743 | 2742.873 | 51.58 |
| | | | |- FractalX.EXE!Calculator::CalculateHelper | 2156 | 2155.611 | 40.54 |
| | | |- FractalX.EXE!IterDataImp<int>::Render | 5 | 5.002155 | 0.09 |
| | |- FractalX.EXE!__tmainCRTStartup | 46 | 46.00391 | 0.87 |
When you see multiple functions above each other at the same indentation with minus signs in front of them (for instance Calculator::Calculate and IterDataImp<int>::Render) it means that all of these functions are called by the function above them. In other words those functions are siblings, all called by HomeWork::CalculateNextBit, and the collapsed call stacks took multiple paths at this point.
When you see a line that ends with <itself> then than line represents samples that occurred in that function, as opposed to in its descendants. This is often described as the exclusive time, as opposed to inclusive time which includes descendants. The <itself> line for a function, if present, will show up as a child node of the function so that the child nodes’ counts sum up to the parent’s count. In the example above Calculator::Calculate was running for 2743 samples and its child function was running for 2156 samples, for a total of 4899.
The best way to get the feel for how this works is to play around with it. If you select a call stack and repeatedly press the right arrow key then the call stack will keep expanding down the hottest path and this can be a very effective way to drill down into what is probably the most important call stack.
To see call stacks for manifest-based ETW events (assuming you enabled them) you need to go to the Generic Events graph and then select Summary Table from its context menu. From there you should make sure that the Stack column is enabled and you can then explore the stacks. The Count column can be very useful to see how many times the various events were hit from various call stacks. When enabled the stacks of manifest-based events are also available in the tooltips of the corresponding event markers in the Generic graph itself.
Summary
Call stacks can be recorded with:
| xperf -on Latency -stackwalk profile |
| rem Your scenario goes here... |
| xperf -d mytrace.etl |
If you are tracing on 64-bit Windows don’t forget to set DisablePagingExecutive and then reboot before recording call stacks. And remember that Windows 7 has better ETW call stack recording support than Windows Vista, and Windows XP has no ETW call stack recording support.
To analyze call stacks load the trace into xperfview and look in the appropriate summary table in the Stack column.
Happy tracing!
Bruce Dawson, Windows Client Performance Team
In my last post I talked about how multi-threading can be used to improve responsiveness while loading a file. I’ve put together a sample program (source included) that shows how to do this in C# 3.0, .NET 3.5, and WPF using Visual Studio 2008. You can find a zip file with a complete project attached.
The program has a very simple, but fully functional UI.
The “Open a file…” pops up the standard open file dialog and ask the user to find a file. The cancel button is disabled unless there is a loading operation to cancel. The text box is for messages from the loader thread and the slider is just there to show that the UI is responsive while the background load occurs.
The progress bar is hidden unless a file is actually being loaded. If the async check box is un-checked, then the file will be loaded synchronously. This illustrates how the UI will ‘freeze’ if a second thread isn’t used.
The key thing about this program is that the UI stays completely responsive while a file is loaded. Of course, this can be done without threads – many programs are written using a state machine approach. But using threads, and functional programming make this much more simple than previous approaches.
There are many ways to leverage threads in managed code, but the most straight forward way is using the BackgroundWorker class. IT can be used in two ways, by creating one and attaching functions to its delegates, or by creating a new class derived from BackgroundWorker – this is the approach I took here.
I created a class called AsyncFileLoader that is derived from BackgroundWorker. We’re using all the features of BackgroundWorker here
- Support for asynchronous progress notifications
- Support for cancelation
- Handling of errors during the file loading
Progress and cancelation is enabled by setting their properties in the constructor.
1: public AsyncFileLoader()
2: { 3: WorkerReportsProgress = true;
4: WorkerSupportsCancellation = true;
5: }
We also need an event handler that will be connected to the cancel button
1: public void PerformCancel( object sender, RoutedEventArgs e )
2: { 3: this.CancelAsync();
4: }
The final element is a function that overrides BackgroundWorker.OnDoWork(). This function does all the work of asynchronously loading a text file.
1: protected override void OnDoWork( DoWorkEventArgs e )
2: { 3: // note: do not catch exceptions here. The BackgroundWorker will catch exceptions
4: // and pass them back to the initiating thread in the RunWorkerCompleted() delegate
5: // using the error member of RunWorkerCompletedEventArgs parameter.
6:
7: string FileName = e.Argument as string;
8:
9: ReportProgress( int.MinValue, String.Format( "Opening {0}", FileName ) ); 10:
11: using ( StreamReader sr = new StreamReader( FileName, Encoding.UTF8 ) )
12: { 13: ReportProgress( int.MinValue, String.Format( "Loading {0}", FileName ) ); 14:
15: long FileLength = sr.BaseStream.Length;
16: long PercentRead = 0;
17: long LastPercentRead = 0;
18:
19: String line;
20:
21: while ( (line = sr.ReadLine()) != null ) { 22:
23: if ( CancellationPending ) { 24: e.Cancel = true;
25: break;
26: }
27:
28: PercentRead = (sr.BaseStream.Position * 100) / FileLength;
29:
30: if ( PercentRead >= LastPercentRead + 1 ) { 31: ReportProgress( (int)PercentRead );
32: LastPercentRead = PercentRead;
33: }
34: }
35: }
36: }
As you can see – this is really simple, and most important procedural – there isn’t a state machine, or any explicitly state kept at all. This single function loads the file, and sends progress messages and errors asynchronously to the calling thread.
The initiating thread passes the background thread parameters using the DoWorkEventArgs class. Here, I’ve simply passed the file name as a string, but any class or structure could be passed this way.
Line 9 is important – it asynchronously passes the first progress message back to the main thread. This includes the progress percentage and the name of the file. Here I’m using int.MinValue to differentiate the first progress message from subsequent ones. This sentinel value means that the main thread doesn’t have to maintain any state to handle progress messages.
I mentioned above that this function handles errors, but you don’t see any error handling code! Any errors that occur from opening the file, or reading the data are signaled as exceptions. The BackgroundWorker class will catch these and pass them back tot he main thread. I just love this – it keeps the code here very clean.
Note the using statement on line 11. It is important to do this when using file I/O objects so the underlying OS objects get cleaned up correctly. See this for more information.
The progress percentage value is computed very simply, the code simply keeps track of the file position as a ration to the file size. This is the obvious part.
Note line 30 – this is very important. It only sends progress messages to the main thread when the value changes – not for every line, which would be quite expensive (give it a try to see).
The entire AsyncFileLoader.cs file is not very big, only 29 lines of actual code and is very straightforward.
The code in the main thread to mange the UI is equally straight forward and consists of just a single event handler attached to the open button’s click delegate. This one function does all of the following:
- Gets the file name from the user using the open file dialog.
- Maintains the UI state, disable buttons, showing/hiding the progress bar, displaying messages, etc.
- Creates a the AsyncFileLoader object
- Starts the asynchronous file loading operation
- Handles progress messages
- handles the completion of the operation
- Handles errors.
1: private void OpenAFile_Clicked( object sender, RoutedEventArgs e )
2: { 3: try { 4: OpenFileDialog ofd = new OpenFileDialog();
5: ofd.CheckFileExists = true;
6: ofd.DefaultExt = "txt";
7: ofd.Title = "Open an text File";
8: ofd.Filter = "Text|*.txt";
9:
10: OpenAFile_Button.IsEnabled = false;
11:
12: ofd.ShowDialog();
13:
14: if ( ofd.FileName.IsBlank() ) { 15: OpenAFile_Button.IsEnabled = true;
16: return;
17: }
18:
19: AsyncFileLoader TheLoader = new AsyncFileLoader();
20:
21: //---- Do this when the progress bar is updated
22: TheLoader.ProgressChanged += ( Sender, ArgsX ) => { 23:
24: if ( ArgsX.ProgressPercentage == int.MinValue ) { 25: Cancel_Button.IsEnabled = true;
26: Cancel_Button.Click += TheLoader.PerformCancel;
27: OpenProgress_ProgressBar.IsEnabled = true;
28: OpenProgress_ProgressBar.Visibility = Visibility.Visible;
29: OpenProgress_ProgressBar.Value = 0;
30: Info_TextBox.Text = ArgsX.UserState as String;
31: }
32: else { 33: OpenProgress_ProgressBar.Value = ArgsX.ProgressPercentage;
34: }
35: };
36:
37: //---- Do this when the background worker is completed
38: TheLoader.RunWorkerCompleted += ( SenderX, ArgsX ) => { 39:
40: RunWorkerCompletedEventArgs args = ArgsX as RunWorkerCompletedEventArgs;
41:
42: if ( args.Error != null ) { 43: PopErrorDialog( args.Error, "Cannot process file!" );
44: }
45: else
46: if ( args.Cancelled ) { 47: Info_TextBox.Text = "Loading cancled";
48: }
49: else { 50: Info_TextBox.Text = "File Loaded";
51: }
52:
53: Cancel_Button.IsEnabled = false;
54: Cancel_Button.Click -= TheLoader.PerformCancel;
55: OpenProgress_ProgressBar.IsEnabled = false;
56: OpenProgress_ProgressBar.Visibility = Visibility.Hidden;
57: OpenAFile_Button.IsEnabled = true;
58: };
59:
60: //---- Start the ascynrnous file loading operation
61: TheLoader.RunWorkerAsync( ofd.FileName );
62: }
63: catch ( Exception Ex ) { 64: PopErrorDialog( Ex, "General Failure!" );
65: }
66: }
Functional programming is key to making this code clean and straight forward. Here we’re using lambda functions connected to the AsyncFileLoader’s progress and completion events.
You can see in the progress changed handler how no state needs to be kept – this lambda function gets all the data it needs from the event. You can see how the sentinel value (int.MinValue) is used to trigger the change in UI state.
Completion is handles in a similarly manner and resets the UI state. This lambda also handles any errors thrown by the loader by checking the args.Error member which will reference an exception if one was thrown.
This is a great example of using multi-threaded programming for something other than speeding up a computationally bound algorithm using parallelization. Some may suggest that this isn’t “real multi-core” programming, but it is. Indeed, the developer doesn’t need to know how to implement complex inter thread communication or synchronization, but just because this isn’t hard, doesn’t mean it isn’t real multi-threaded, and multi-core enabled code. It absolutely is and the developer does need to understand asynchronous programming techniques to use it.
Using the BackgroundWorker class is a very simple straight forward way to enable your programs to be more responsive, and to leverage multi-threaded and multi-core performance. You can use other managed code techniques such as threads and dispatching for more sophisticated work.
You can find the complete VS 2008 project in the attached ZIP file.
I must admit, I’m addicted to Stackoverflow. Its a great site, being both interesting and easy to use.
Recently, I ran across this question “Are you concerned about multicore”. HenryR, a PhD candidate at Cambridge is asking if the “developer on the street” needs to concern him/herself with multi-core development practices.
Henry’s question has a few answers, including the accepted one dmckee (a particle physicist)which I’ll focus on here.
There are many types of programs such as
- A simple script one of might write to do some work then throw away,
- high performance very parallel scientific applications,
- Main stream operating systems
- Line of business applications
- Web based apps
- Command line based applications
- Applications with graphical user interfaces.
This is simply the tip of the ice berg – I’m confident that with a little brainstorming, this list would be come large.
Dmckee’s answer is correct – but only for a small set of things. Indeed, for a very simple application, that is not CPU bound, then making it multi-threaded may be more work than needed. For example, I have utilities that are not multi-threaded.
However, even utilities can take advantage of multi-threading, and I argue that its de rigueur for a production program to do so.
The astute reader may say “But wait a minute! Henry ask about multi-core, not multi-threading…”.
Yup, but to leverage multi-core, you need to write multi-threaded programs. Using more than one thread has other advantages beyond parallelizing CPU bound operations so they have better throughput.
To pre-fetch the next comment “But if its not CPU bound, then why bother with multi-threading? That’s just a waste of time!”
In my mind, there are three reasons for this:
- It can make your program more responsive from a user perspective.
- Even for things that are not CPU bound, it makes sense to do them in parallel.
- It can break up complex single threaded state machines into simpler, more procedural code.
In Windows, the user interface is message based: Windows sends a UI program a message for ech UI event the program needs to handle. A program with any kind of graphical UI uses the GetNextMessage() family of functions to get them. I don’t know about Linux or OSX, but I suspect they are similar in this regard.
This is conventionally done in a message loop where a program loops on getting then processing messages. If there are no messages, then the program can wait (block) for a new message, or do some background, or idle processing.
Messages included all kinds of things, but most importantly user input – mouse moves and clicks, and keyboard events. These are often translated into windowing operations (moves, resizes, etc) and control actions (button clicks, scrolling, etc). In short, this means that coding handling the UI is a big state machine and that this state is not kept on the stack. It has to be kept explicitly somewhere else.
This has an important implication – for an application to main responsive, it must be able to quickly and consistently service the message loop. This is done on a single thread, indeed many applications only have a single thread that does everything.
Windows programs have been written this way for a long time. Some people might argue that this is a design deficiency. That’s a topic for a different discussion about application compatibility – do remember, this pattern was set back in the very early 90’s – Windows 3.11 was released in 1992.
In any case, delays in the message loop that the user can perceive are called hangs. There are two ways to keep the message loop from hanging.
- Break up all operations into pieces that execute quickly enough so that the message loop can always be serviced quickly. Indeed, lots of developers have tackled the problem just this way.
- Move longer running operations to another thread, or threads.
The short story is that only #2 works effectively. Why? Simply put – I/O. Specifically disk and network I/O. But other I/O, such as GPU operations can also be an issue.
The problem is that any disk or network I/O operation can potentially be long – hundreds or thousands of milliseconds long. Programs that do their file I/O and networking I/O on the UI thread will have hangs. It is unavoidable.
Some may point out that “Hey! That doesn’t matter, making it multi-threaded doesn’t speed things up!”.
Right…. and wrong…
Right in the sense that it doesn’t increase an applications throughput – the work still has to get done. But wrong in the sense that it absolutely positively benefits the user.
So for point #1 multi-threading can make your program more responsive from a user perspective because you can move potentially long running operations off the UI thread – keeping the UI “alive” and responsive. Trust me, your users will love you for this.
In my next post, I”ll discusses the source code for a C# program that shows how simple it is to move operations off the UI thread in WPF applications. This doesn’t require any complex knowledge of fine grained synchronization, or complex multi-threaded programming. The code is also easy to understand, not state machine driven and easy to debug. All things that conventional wisdom holds is true for multi-threaded programming.
In future posts, I’ll talk more about the example source above, points #2 and #3 , and some general multi-core and multi-threaded topics.
In summary – multi-core and multi-threaded programming is much more than simply speeding up (parallelizing) compute bound operations. Efficient threading can bring other benefits to your programs as well.
There is a lot of information in a typical kernel trace. While the Performance Analyzer tool is quite powerful and makes it easy to view a trace graphically, sometimes you just need to see what is in the trace directly. Xperf makes this easy.
First, its important to understand that a trace file (.ETL) is simply just the buffers produced by trace session written to a file. The data in an ETL file isn't pre-processed, summarized, or otherwise annotated with meta data as it comes out of the OS. Its is just the raw data that comes from a ETW session. This is because ETW is designed for log time efficiency - ETW does the absolutely minimal amount of work needed to get the trace data to a file, or other consumer.
This means that all the heavy lifting of post processing trace data happens later. With the xperf tools, there are two places where this occurs:
- In the merge step, xperf takes the kernel trace and trace files and merges them into a single trace file. Xperf will merges (adds) meta data to the trace (I've got another post that provides all the detailed on merging in the works...). The result of merging is a single trace file that can be analyzed by the tools directly on the target machine, or copied to another system for analysis. Note that the merge step must happen on the system where the trace was taken (the target system).
- When a trace is processed xperf using actions, or loaded into Performance Analyzer, the core trace processing components do a lot of work on the raw trace data. This includes things like mapping process IDs (PIDs) to file and process names, mapping addresses to filenames, loading symbols for address, unifying stacks, and handling 64-bit and 32-bit differences.
As you've seen in the other posts, once a trace is merged it can be viewed in the Performance Analyzer. But, xperf also allows you to see what is in the trace using the dumper action. This is easy to do:
xperf -i fs.etl -a dmper >fs.csv
The -i fs.etl specified that the input file is FS.ETL. The -a dumper parameter tells xperf to execute the dumper action. The output goes to the standard output.
There is a short cut for this as well: the dumper action is the default action so if you only specify an input file then xperf simply dumps it. For example, the following command does the same thing as the one above:
xperf -i fs.etl >fs.csv
The resulting file is an ANSI text file where each line is one record. Each record consists of a comma delimited set of fields. The first field of each line is the name (or type) of the record.
There are some special lines and sections at the front of the file. Each record type is described by a header line. The header lines are delimited by the 'BeginHeader' and 'EndHeader' lines. Note that the line immediately after the 'EndHeader' line is unique, it doesn't have a header line. This line describes some of the characteristics of the trace such as its duration, and the pointer size.
The first field of each header line is the name (or type) of the ETW record that the header line describes. The rest of the fields are the names of each of the fields for the record type. Here is an example of the process start event header (P-Start) and a P-Start event.
P-Start, TimeStamp, Process Name ( PID), ParentPID, SessionID, UniqueKey, UserSid, Command Line
P-Start, 1017280, fs.exe (3004), 3608, 1, 0x86661508,
S-1-5-21-626881126-397955417-188441333-3225678, c:\coding\fs\Release\fs\fs.exe blflargorg c:\coding\*.cpp *.h -s
This event describes the start of a process. There is also a corresponding P-End event. For processes that are already running when he trace is begun, the kernel logger includes a pseudo P-Start event.
This means that every PID seen in other events will have a corresponding P-Start event in the trace before it is seen in an event.
Also note that xperf will dump events that you add to your own applications so long as you include an event manifest in your app. So, you can add your own events and use xperf to dump them in the context of all the other events you include in the trace.
Using the xperf tools, ETW, and the kernel sample profile interrupt all together provides a very effective and easy to use sample profiler for the analysis of both application and system wide performance. At each sample interrupt, the ETW sub-system captures the instruction pointer and the stack. This data is lazily and efficiently logged to an ETL file. Once the data is saved, it can be analyzed with Performance Analyzer.
The next article in this series is So just what is in a trace? Using the xperf trace dumper
Note: the examples in this post only works on Vista or Server 2008 32-bit; Prior operating system's do not support taking stack traces. Taking stack traces on 64-bit platforms will be the topic of another post.
Here is an example of profiling FS.EXE, a grep-like utility I've written. I use this tool for experimenting with various topics such as efficient I/O, well performing string matching algorithms, and instrumenting applications with ETW.
For this test, I put the following commands in a CMD file:
- xperf -on PROC_THREAD+LOADER+INTERRUPT+DPC+PROFILE
-stackwalk profile
-minbuffers 16 -maxbuffers 1024 -flushtimer 0
-f e:\tmp.etl
- fs.exe farglenorgin c:\coding\*.cpp *.h -s
- xperf -d profile.etl
Since the commands are a bit long, I've separated them above and added line breaks to make them readable. Each command above should be on one line in your command file.
The first command turns on the kernel logger and enables the following events:
- PROC_THREAD flag enables the process and thread events. These mark the beginning and ending of each process and thread. The kernel provider guarantees that there will be a begin/end pair for every process and thread during the trace. Process and threads that exist before the trace was started or are still running when the trace is stopped also have these events.
- The LOADER flag enables the loader events that log when the kernel loads an image (an EXE or DLL)
- The INTERRUPT and DPC flags enable the ETW interrupt and DPC events which mark each interrupt and deferred procedure call which are routines that run at DISPATCH_LEVELt
- The PROFILE flag does two things; it turns on the systems sample profile interrupt and it enables the kernel's sample profile ETW event.
The other flags are important as well.
- The -stackwalk profile parameter turns on ETW's stack walking feature for the sample profile event. Every time a sample profile event is triggered by the sample profile interrupt, ETW will capture the stack and save the data in the trace buffers.
- The -minbuffers 16 parameter sets the minimum number of buffers that ETW will allocate for storing events. Note, you need at least two for each processor in you system.
- The -maxbuffers 1024 parameter sets the maximum number of buffers ETW will allocate to 1024 - a total of 64MB.
- The -flushtimer 0 parameter tells ETW to never flush the buffers based on timer, buffer's will only be written to disk when they are full.
- The -f e:\tmp.etl parameter tells ETW to lazily write the full ETW buffers to e:\tmp.etl. This puts the log file on a different physical drive than the drive on which the experiment is running. This means that the writes that ETW uses to save the trace data do not occur on the interesting drive.
The second command simply runs the experiment. It searches for the string 'farglenorgin' in all my .CPP and .H files. I'm using a string that doesn't exist so I execute the worst case code paths in the application. Replace this command with a command to run your experiment, or a pause instruction so you can dork around with a graphical program.
The third command simply stops the kernel logger, merges the data and saves it in profile.etl.
NOTE: These commands need to be run from an elevated command prompt. Controlling ETW tracing requires administrative privileges.
There is now one other thing to do before examining the data - setting the symbol path. Here is how I set the symbol path for this example:
set _NT_SYMBOL_PATH =
c:\coding\fs\release\fs;
SRV*c:\symbols*http://msdl.microsoft.com/download/symbols
This tells the symbol decoder to look for symbols in the release build directory for FS.EXE and in the Windows public symbol server, caching the served symbols in c:\symbols. The xperf tools uses the symbol decoding libraries from the debugging tools for Windows. You can find more information on using symbols here.
Once the trace is taken and the symbol path is set, then simply open the trace in Performance Analyzer with the command "xperf profile.etl".
The CPU Sampling by Process graph is the most interesting graph for this example. To select the visible graphs, click on the flyout control on the left of the window, then select the CPU Sampling by CPU, and by Process graphs.
For his experiment, the CPU sampling by Process graph looks like this:
By default, all processes running during the trace are shown except the idle task (as seen above). You can change which processes are displayed or hidden by by using the check boxes in the legend drop down as shown above.
This graph illustrates an important concept about the kernel event provider and the xperf tools in general - they are specifically designed to analyze system wide and application performance data and events. For example, in the legend above, there are many processes listed, but only a few of them actually used any CPU time during the experiment.
Using the legend, you can eliminate all processes except the interesting one. Here is what the graph of CPU utilization for only FS.EXE looks like
This is pretty cool as it provides a nice overview of the CPU utilization of FS.EXE, but it really doesn't tell us much about where time is being spent in the process it self.
The real power in Performance Analyzer is in its summary tables. These are tabular displays of data about a specific chart, or a region in a chart. For this experiment, I looked at the sample profile data for the entire trace. To do this, right mouse click on the CPU Sampling by Process chart, make sure that the load symbols option is set, then select the Summary table view.
Note, it will take 10 to 20 seconds for the summary table to show up. Performance Analyzer is loading symbols while this is happening. (putting symbol loading on a background thread is on our to do list...).
After the summary table pops up, click on the flyout and select the columns for display. In this case you will want the process, stack and % Weight columns (feel free to experiment with other columns).
Next, arrange the columns as follows. The columns to the left of the gold column are grouping columns. You can change the order of columns and put them to the left or right of the gold column by dragging.
Now, you can expand the stacks for FS.EXE and see where it is spending its time. Not that this isn't by function as in some profilers but by call stack. This is much more powerful than simply knowing the functions where time is spent as it also shows you how the time consuming functions were called.
Its no surprise that my find string utility spends most of its time in the following stack:
As with other sample profilers, you can look "up" and "down" the stacks from any particular point. This is commonly called a butterfly view. Right mouse click on any item in the stack column and experiment with the callers/callees and inntermost/outermost options, like this:

This stack trace has very simple call stacks so it isn't very useful for looking at butterfly views. But try one of your own programs and look at a butterfly stack view of a function that is called often from multiple places. Or, use the butterfly view too look at a intermediate function and see all the functions it calls, and their stacks.
The above screen shots and summary table views contain the data from the entire trace. This works ok for short traces. But for longer traces, or even short traces with a lot of detail, we often need to look at specific time spans.
For example, there are some time spans in my experiment where FS isn't using very little CPU time. I'd like to see what FS is up to in that time span. This is easily done by using the left mouse button to select a time span on the X axis and zooming the graph to that view, or look at the summary table for that span, as in this example.
Once the interesting region is selected, I simply use the right mouse button to pop up the context menu and select summary table.
Note that you can open up multiple summary tables, each from different regions of a graph, or even different graphs. This is great for making comparisons.
The new summary table window now only shows the data for selected time span in the trace. Is not surprising that FS is spending the little CPU time it is using in user mode asynchronous procedure calls.
This post illustrates some key concepts:
- The xperf tools are designed for both system wide and application specific analysis.
- Profiling with ETW is very, very light weight. While the experiment is running, the xperf tools are not even loaded - the kernel itself is collecting the data. All analysis is done as post processing tasks.
- OS based sample profiling collects both user and kernel mode stacks.
- So long as you have symbols, production code can be profiled - no special debug or instrumented builds are required.
- In this example, I started and stopped FS.exe (the experiment) between the tracing start and stop. But, since this is ETW based, sample profiling can be started and stopped at any time, without stopping or restating even a single process. You can profile anything at any time on any system.
- Stack views provide a very powerful method for analyzing where time is spent in a process.
- The general technique with Performance Analyzer is to use the graphics to identify interesting time spans in the trace, then use the summary tables to look at the data in detail.
"Do the xperf tools support XP or Windows Server 2003?" is a frequently ask question. The answer is no mostly, and yes for a few things.
The next article in this series is Using the Windows Sample Profiler with Xperf
xperf.exe can be used on Windows XP SP2, and Windows Server 2003 for turning tracing on and of, and merge kernel trace data with user mode traces into a single ETL file. These operations are simply called "trace control". NOte that the '-stackwalk' switch is not supported on XP because its kernel doesn't support capturing the stack on events, this is anew feature in the Vista kernel.
However, all operations that require trace decoding (and that's almost everything else), must be done on Vista or Windows Server 2008. This includes viewing traces in the Windows Performance Analyzer tool (xperfview.exe).
The next question is this "The xperf tool kit installer doesn't install the tools on XP or WS2003; how do I get the tools on those systems?"
The answer is simple: From a Vista or WS2008 installation copy xperf.exe and perfctrl.dll to the target system. This is all xperf needs to support trace control.
After you have generated an ETL file, you can then copy it to a Vista or WS2008 system for trace decoding.
For those of you interested in the long story....
<boooring>
Event Tracing for Windows was first introduced in Windows in 2000. Back then, the OS only supported a small number of events; very few other Window's components used ETW. In those days, event logging with ETW was in its infancy and the people that wrote event consumers generally also wrote the code that produced the events, or worked closely with those that did.
Back in the day, many event providers and consumer's simply used the same C/C++ data structures to produce and consume events. While simple, this sometimes broke because people wouldn't version the events correctly when the event structure changed. In short, if the producer and the consumer code wasn't kept in sync then things were busted. This got to be a real problem as ETW was used more broadly.
This problem was solved by using meta to describe events. This allowed event consumers to decode events without knowledge of the events binary format. This worked much better; it allowed the event provider author to change an event's binary format without breaking the consumer. In the XP time frame MOF files were used to describe events. For example, you can find he kernel's context switch event here.
Three things changed for Vista:
- The entire Windows build system was updated so that every component was described by an XML based manifest. This included describing ETW events. We deprecated the MOF format and all new events were authored with XML based descriptions in their manifests using the Event Manifest Schema.
- The use of ETW became very prevalent - many teams added event providers to their components and used them for Windows Event Logging (which is ETW based), performance work, diagnostics, and testing. For example, on my laptop, there are 985 registered ETW event providers. Use the "xperf -provider" command to see what is registered on your system.
- Our team decided to make a major investment in ETW based tools as did other teams around Windows. This meant that meta information for events was very important as it enabled event providers and the consumers to be more decoupled and cohesive.
But, this posed one problem for us: do we fully support trace decoding on both Vista and XP? Or just on Vista? It was technically possible to keep trace decoding working on XP, but this would require shipping some Vista components with the tools because the required trace decoding infrastructure is only present on Vista. Unfortunately, this isn't possible for all kinds of business, legal, and some technical reasons. It would have also doubled our test matrix.
After much discussion, we decided it was an easily workable compromise to support trace collection on XP, and require Vista or WS2008 for all trace decoding operations.
</boooring>
Lets get to it! Here is how to take a basic trace then look at CPU and disk utilization. Its really simple, just three commands to turn on tracing, turn it off, and then view the trace.
The next article in this series is Xperf support for XP
First, from an elevated command prompt window, enable a basic set of the kernel events using this command:
xperf -on PROC_THREAD+LOADER+DISK_IO+HARD_FAULTS+INTERRUPT+DPC+CSWITCH -maxbuffers 1024
This command enables a set of events in the kernel and sets the maximum number of buffers to 1024. The default size for each buffer is 64K. So for this session, ETW will use up to 64MB of memory for ETW buffers. As buffers are filled with events, they are written to the log file in the background and then made available again for accepting events. By default, xperf sets the minimum number of buffers to 64. ETW will start with this many buffers and only allocate more buffers if needed. Events will only be lost if ETW cannot allocate more buffers and/or keep up with the event rate by writing data to the disk. By default, the kernel events are written to \kernel.etl on the current drive.
Next, do something interesting - it can be anything from opening Internet explorer and a web page, or compiling a program with Visual studio, to something more complex like opening three or four Microsoft Office applications and doing some work.
Run the following command when your interesting thing is done:
xperf -d foo.etl
This simple command will take 10 to 30 seconds (or possibly longer) because it merging the raw kernel event data with meta data and doing some other post processing. We call this 'stop and merge'. Here is what this command does
-
Performs a 'run down', during which the kernel logs a set of events that describe the state of the system.
-
Turns off the kernel logger
-
Interlaces data from multiple trace files and the kernel trace.
-
Adds some meta info to the trace needed for processing the trace on other systems. This data is saved in the trace as a set of synthetic events.
-
Saves the trace data into the file foo.etl (or the file name of your choice).
Finally, load the trace in the Performance Analyzer with the following command
xperf foo.etl
For this example, I took a trace of using Visual Studio 2008 to compile a program. Here are screen shots of the CPU Usage by CPU and for disk I/O counts.
Those are pretty interesting, but lots of things are running in the system, and I'd like to see just the CPU usage for Visual Studio itself.
The CPU usage by process graph makes this easy, just click on the fly out control on the left of the window and select the CPU Usage by Process graph.
The fly out frame lists the graphs available for the events in the trace. If there trace doesn't contain events that are needed for a particular graph, then the graph is not shown.
Performance Analyzer will automatically save the graphs you have selected. You can change them at any time.
For my trace, the CPU usage for the DEVENV.EXE process and two CL.EXE processes looked like this.

DEVENV is the Visual Studio 2008 environment itself. The CL.EXE processes are the two compiler sessions it started, one for each CPU on my laptop.
This is a simple example that illustrates some key points
-
The kernel events can be enabled and disabled at any time. There is no need to re-boot the system, log-out/log-in, or restart processes to use the kernel events, or any ETW event provider. ETW events from any source can be dynamically controlled at run time.
-
The xperf tools are designed for a post processing model, one where a trace is captured, then later analyzed. This is in contrast to an observational model where you watch dynamic charts, graphics, or tabular data as something occurs. The reason for this model is that ETW and the tools are designed for log time efficiency.
-
This model is also specifically designed for taking traces on one machine, then analyzing them on another machine. This ability is critical for running performance tests in a lab setting.
-
The tools let you look at both system wide activity and process specific activity.
I just love this add for the Sony Ericsson XPERIA X1.... its just barely an add. Its really a pretty cool music video...
The WHDC folks now have web page setup for the Windows Performance Toolkit (aka the 'xperf tools'). The page includes downloads for updates to the versions that ship in the SDK. In the near future, this page will include pointers to updated documentation, and discussion forums.
The next article in this series is Using Xperf to take a Trace (updated)
This page included downloads for V4.1.1 of the Windows Performance Toolkit. This download is an updated for the SDK version and includes fixes for two small bugs in the version included in the Windows SDK.
- The COM plug in that handles the power management events is not named correctly, so the power state transition analysis feature doesn't work.
The SDK team just shipped the latest version of the Windows SDK which supports Windows Server 2008 and Vista SP1. The SDK now includes an important new tool; the Windows Performance Tool Kit from the Windows performance team (we call them the xperf tools for short...)
This is the first article in the xperf series, the next one is
Xperf Tools Landing Page and Update
The xperf tools have long been an internal tool used by our team, and widely throughout Windows, for system-wide performance analysis. Xperf got its start many years ago as a set of command-line tools that produce reports based off the ETW instrumentation in the kernel[1]. Many other components and applications in Windows are instrumented with ETW and xperf can enable these events, dump them, and analyze them.
Xperf is an important tool for anyone doing system performance work on Windows because it's specifically designed to give you a complete system-wide view of performance over long periods of time (10's of seconds, to minutes)[2]. It's also the only tool that knows how to fully process all the events from the kernel and correlate them into something that makes sense.
For example, here is a detail graph of all the disk I/O to the system drive on my laptop for opening this post, editing it a bit, and then closing Live Writer.
Here is an example of the CPU and disk utilization for Outlook 2007 launch:
Here is the same view, but with the data from all processes visible:
In addition to graphical displays, the tools can also display tabular data (what we call "summary data"). The screen capture to the right is a table of sample profile events during a 6.5 second period during a find string operation over a tree of source code. For that period, 73.93% of the total CPU time was in the idle thread, 6.78% was in the find string utility and the reset of the time was distributed around services, the system, xperf itself (at 3%) and other processes. As you start playing with the summary tables, try shifting around the columns to get different types of views on the data; for example, grouping IOs per process, IO type (read/write/...), IO size, IO service time, and so forth.
These simple examples barely scratch the surface of the data that the tools can gather and the richness of the information they can display. The tools have several other important features including:
- Full support for symbol decoding. This uses the same mechanism as the Debugging Tools for Windows. This includes full support for the public Windows symbols, and for your own symbols.
- The ability to dump all the events from a trace file to a CSV file. If the summary tables don't display what you want, then you can write your own trace processing tools on top of the text dump, or the (generally XML-based) output of the command-line actions.
- Windows Vista supports collecting stack traces on all the kernel events. One of the most useful things to do is collecting stack traces on the sample profile event. This is an extremely powerful tool for understanding where and why a program is spending time.
- The xperf command-line tool can be used to control all the ETW trace providers in a system, including all the kernel events.
- The xperf distribution also contains a quick start guide and basic reference manual. Just look for the document Performance.Analyzer.QuickStart.docx, its in XPS format as well.
In the coming weeks, I'll blog more about the tools, how to use them, and the kernel ETW events. We'll also soon have a web page up for the tools. This is where you will soon find updates, additional documentation, and a message forum.
Now! Here is how you can get the tools!
- Install the SDK by downloading the ISO image, or using the Web based installer.
- Find the xperf MSI in the SDK's "bin" directory. It will be named xperf_x86.msi, xperf_x64.msi, or xperf_ia64.msi, depending on the architecture for which you install the SDK.
- You can then install the xperf tools from the MSI directly, or copy the xperf MSI file to another location and install it from there. For example, you could keep the MSI files on a USB key.
We'll soon have a web page up for the tools on the MSDN site... stay tuned!
[1] You can see the events supported by the kernel in the docs for the EnableFlags field of the EVENT_TRACE_PROPERTIES structure. I'm going to blog more about these...
[2] The xperf tools from the Windows Performance Toolkit are very complimentarily to the SysInternals tools.
One of my favorite blogs is Worse Than Failure (WFT). Many of the articles are very interesting. But gee, you just can't believe everything you read on the web. Recently, Alex Papadimoulis posted an article titled Job Interview 2.0: Now With Riddles! I'm not sure where he got is information, but I know the Windows organization (its pretty big you know) Microsoft hasn't interviewed this way in many, many years.
I know because I'm a hiring manager in the core operating system division (COSD) and I've done about a hundred interviews for developers over the last three and a half years. I've also been through all the interview training.
Really, we don't interview this way... really...
Nit Pickers Note: yes of course some oddball might ask a brain teaser every now and then, and your very own personal and genuine experience might have include some thing thought was a brain teaser, but the norm is just a regular, normal, interview.
Walt Ritscher has a really handy tool on his blog called 'Style Snooper'. This utility will display the style of any WPF control from its assembly. It does this by parsing all the public, concrete, non-generic classes in the assembly that derive from FrameWorkElement. Even better, he's provided the source code.
CLCV V5 now has a fully working TreeMap control that zooms, supports mouse over events and looks pretty good. The regions are laid out with the Squarified TreeMap algorithm. Even better, the tree map itself scales to large numbers of nodes - easily 100's of thousands, and on my lap top, it will handle a couple of million nodes relatively well. But there are two major performance problems: 1) WPF rendering seems to be very expensive. 2) Tree Map nodes are relatively expensive in terms of memory size. This limits the number of nodes to a ~2 million. You can find a link to the source code at the end of this post.

Note - all performance numbers (number of objects, times and profile data) are gathered from my new Acer Ferrari 5000 laptop.
From a memory usage perspective the control performs acceptably for even a few hundred thousand nodes. Tens of thousands of nodes are no problem. But, it does begin to struggle with about a million, with two million being a empirical limit. A future post on this WPF version of CLCV will focus on memory optimizations and usage in detail.
For this post, I'll focus on WPF rendering performance. In summary, the primary performance issue that limits the scaling of my treemap control isn't memory or CPU time to compute or render the layout - it is actual physical rendering of the rectangles and text by WPF itself. WPF cannot keep up with more than about 750 rectangles and a small number of scaled text items.
My treemap control is pretty well optimized (there is still some work to do) - but the computational performance of the control itself is not the limiting factor. The most important aspect of my treemap implementation is that it is smart about how much work it does -limiting its computations and rendering to only the nods that are reasonably large.
When learning about treemaps, I looked at some existing code and read all the web material I could find. It looks like most implementations are pretty basic - they compute the size and location of each rectangle for the entire tree. I thought this was a bit odd. It seemed to me that it would be more efficient to only process the part of the tree that would be visible - e.g. computing and rendering tiny or invisible rectangles didn't seem like a good design decision to me.
My first implementation took the straight forward approach as well. As I suspected, for larger trees - even those with a only a few thousands nodes, computing and rendering tiny rectangles was a waste of computation and rendering time. As you can see in the image to the right, rectangles that are smaller than some limit are too small to effectively mouse over or to meaningfully discern their relative area with respect to their parent.
So, I modified both the layout code and the rendering code to stop when the area of the rectangles became less than some threshold. This worked very well both from a visual perspective and performance standpoint. It also allows the data tree to become very large as it means that neither the layout and rendering costs grow with the overall tree size as the number of nodes surpasses the number of rectangles that the algorithm decides to actually render.
Experiment: CLC V5 includes a unit test for the treemap control that allows you to control the size of the three and set the treemap parameters such as the Area Limit.

The area limit is the minimum size of the rectangles that will be rendered. Assuming there are enough nodes, the larger this number, the fewer rectangles will be visible.
From empirical testing I determined that a value of 64 works pretty well.
For trees over a one or two thousand nodes, there are usually more nodes than can be displayed. In these scenarios, the scaling limitations are inherent to WPF itself, not the layout computations, or rendering in the treemap control.
Its important to understand how the control works and when WPF actually does the visible rendering. The treemap control has two main steps: the first is the actual layout computations of the tree, and the second is the rendering of the tree. Once the control is rendered, then WPF will take the rendering data generated in the control's OnRender function and actually turn that into DirectX commands that become visible on the screen. Its this third step that is the limiting factor.
The layout computations are handled by a function called ComputeAllNodeBoundingRects. This function starts at the root node and recursively computes the size of each visible rectangle for the nodes in the tree. This function stops its recursion when it reaches nodes that have visible areas less than the Area Limit. You can experiment with area limit values using the unit test (see above). If you set the area limit to zero, then the layout computations will compute a rectangle for every node in the tree.
Computing the bounding rectangles only needs to happen when the treemap control receives a new data tree (by setting the TreeMapData property), or when the render size changes (handled by the OnRenderSizeChanged event).
When WPF determines that a control needs to be rendered, it calls the controls OnRender method. This control's OnRender method recursively paints each of the rectangles computed in ComputeAllNodeBoundingRects. It also renders the text for the root nodes and first level nodes.
Both of these functions are relatively fast - here are some statistics for the initial random tree. LQ means 'lower quartile', UQ means 'upper quartile', IQ range is the inter quartile range divided by the median value. Range is the total range (max-min) divided by the median. Times are in milliseconds.
| |
Count |
Min |
LQ |
Median |
Mean |
UQ |
Max |
IQ Range % |
Range % |
| Compute |
100 |
2.6 |
2.8 |
3.0 |
3.4 |
3.4 |
7.7 |
18.4% |
171.7% |
| Render |
100 |
5.8 |
6.2 |
6.7 |
7.3 |
7.7 |
19.6 |
21.8% |
204.9% |
These times are for computing and rendering from 1,316 to 1,539 visible nodes. As you can see from the times, neither compute nor rendering times in the control itself should inhibit the control from rendering smoothly. This is especially true when the control is rendering due to events other than size changes.
But, even with the times above, the control still struggles to render - you can see this by just rolling the mouse over the control and watching the lagging mouse-over node changes, or by resizing the control. In both cases, if the control is painting more than 700 rectangles or so, then it can't keep up.
While relatively anecdotal, the CPU utilization for the control also seems on the high side. Just continually moving the mouse over the control uses considerable CPU, 70 to 80%. This is high, especially considering that my lap top has a dual core AMD Turion processor with has a WinEI rating of 4.8.
While there does seem to be some material performance and scaling issues with WPF, there is a lot to like about the programming model. I'll talk about this more in subsequent posts.
Source Code for this post: CLCV-BLOG-5.ZIP contains the source code to this post. I didn't include the data files in this zip file as they haven't changed - you can simply use the ones from CLCV-BLOG-3.ZIP.
Rico's recent post is interesting as it hints at things some developers do that sound like a good idea, but really pose some big performance problems and can sometimes be hard to repair disasters of ship stopping proportions. Rico, here are some other possibilities for your list...
1) An architect decided that the file system would be a cheap database by storing large numbers of items as individual files where the name is the key.
3) The driver developer that decided polling for monitor presence in a DPC. Polling time was a couple of hundred milliseconds, but it only polls once every three seconds. No problem!
4) The hard drive that occasionally decides to spend 1.5 seconds flushing its cache (and nothing else...)
CLCV V4 now loads files about 27 times faster than V3 when running on my laptop. The tree view is also about 5 times faster. This comes from changing my initial naive implementation to a smarter one where I minimize the inter-thread communication and handle the tree view much more efficiently. More information about the source code is at the bottom of this post. Here is what I learned in working on V4.
- Updating WPF UI elements can be expensive for large numbers of items. An effective and simple strategy is to only load data into WPF UI elements that the user actually needs to see - don't use WPF objects as the store for large numbers of objects.
- Updating pixels on the screen is expensive - even with relatively high performance hardware. Changing pixels (e.g. updating UI controls) at a rate faster than the monitor refresh rate is a waste of system resources, CPU time, GPU time, and memory bandwidth. UI updates that are related to (or driven from) high throughput tasks should be paced to no more than the monitor refresh rate.
V3 was the first functional version of CLCV and while it worked, it loaded files very slowly - talking 90 seconds or so (on my laptop) to load the largest data file (about 8.75MB in size). Given that the ultimate goal is to handle CSV files that hold data for the entire Windows Vista source code tree, I really needed to improve CLCV's performance.

The cause of the performance problem was easy to find using the Visual Studio 2005 profiler. I simply enabled instrumented profiling and the offending code was obvious.
The most time consuming operation in V3 wasn't the actual file read time, it was the overhead of building the tree of tree view items. In V3, the file reading thread sent a message to the UI thread for every new file and directory from the CSV file. The UI thread then added that information to the tree of tree view control items as the file was read. This was very time consuming and took much longer than actualy reading the file.
It turns out that building a tree of tree view items is simply expensive. The most expensive part being setting the string that is displayed as the item's header. This is closely followed by actually making a new item the child of another (ItemCollection.Add). I don't yet know why this is expensive, only that it is. It may be because I'm building the tree bottom up instead of top down - Kiran Kumar discusses this as a potential issue here. I need to chat with my friend Tim Cahill the WPF performance guy about this a bit more.
For V4, I took a different approach: I re-factored the code so the file loading thread now does all the work of constructing a tree of objects that represent the original directory structure and its files. I also moved the logic that updates the progress bar from the UI thread to the file loader thread - only sending progress messages to the UI thread every 33 milliseconds (this is configurable).
The results were dramatically better. The file loader thread could read the file and re-construct the directory tree much faster than the previous version. On my laptop, the 8.75MB file could be loaded in about 1.47 seconds - not bad at all. In this version, the file loader thread sends a message to the UI thread to update the progress bar no more frequently than every 33 milliseconds (about 30 times per second). This is more than often enough so that the progress bar is visibily smooth, but not so much that updating the progress bar takes too long.
Experiment: Updating the progress bar is expensive - way to expensive to do this for every line read from the file. You can see how expensive this can be by using CLCV V4 and setting the "Progress Update Interval" in the options dialog box to 0. When this is zero, the file reading thread will send a message to the UI thread for every record read from the the CSV file. The messages get queued up in the UI threads dispatcher object and processed as fast as the UI can handle them. Processing all these messages and updating the progress bar takes a much longer time than processing the entire file! The lesson here is that the overall process of updating the UI (chaning pixels on the screen) is simply expensive. There is really no need to update the UI faster than the monitor refresh rate. Often about 1/2 to even 1/3 of the monitor refresh rate is just fine.
Note that for most things, a UI update rate of 30 or so updates per second will present a smooth animaition to the user. If needed, updating the UI faster -- at or close to the monitor refresh rate -- will work well.
Next, I needed a strategy for minimizing the costs of building the tree of tree view items that display the directory tree and files. Given that the file reader thread builds a tree structure, the solution was simple: CLCV V4 only populates the tree view control when it needs to, and only populates the items the user needs to see. After the file is loaded, the UI code updates the tree view with the top level elements. When the user expands an element, the UI thread then populates the data for that item's children.
In this way, CLCV only pays the costs for populating elements the user needs to see; only tree view items that will actually be displayed are created and added to the tree view item tree. This also amortizes the costs of the items displayed over the time the user actualy minipulates the control. So even if the user browses the entire three, he never experiences a sluggish tree view control.
Experiment: This works quite well, if you use CLCV version 3 to load the large test file (test-l.clc.csv) the first click on the bottom entry ("GSFD") will take a noticeably, and annoyingly, long time to expand ( about 2.5 seconds on my laptop). With V4, this happens quickly: it takes about 150ms to populate 922 child items (~6,000 items per second, or ~166us). It looks like it takes about anothe r300 milliseconds to draw the new items: its hard to measure this because WPF doesn't have a begin and ending events for item expansion (I just used a stop watch).

With V4, the profiler shows a very different picture: The bulk of the CPU time during file loading is spent actually parsing the text data from the file, primarily converting text fields to decimal values using UInt32.TryParse().
This is as expected. It probably isn't worth optimizing the conversion operations by writing a hand crafted text to Uint32 conversion routine as such a routine would need to handle localization and othe rissues. I expect it would only result modest performance improvements. The primary file loading bottleneck in V4 isn't CPU time, but I/O efficiency.
Loading a 8.75MB file in 1.47 seconds is pretty good - its about 6 MB/s; and its plenty fast enough for my intended scenario. But, it is still slow compared to other applications. I've developed other tools (such as CLC itself) that are completly I/O bound and use only one thread for the entire application. These tools easily drive the disk at its maximum sequential read rate of about 35 MB/s while handling much more complex string opeations than merely parsing some CSV data. For example, CLC is an order of magnitude faster than CLCV. The reason is the stream reader class - it doesn't do asynchronous or un-buffered I/O and its paying the cost of transcoding ANSI or UTF8 text into UTF16 for storing in .NET string objects.
In contrast, my natively implemented tools read files fully asynchronously in un-buffered mode. This makes I/O operations happen in paralell with data processing - not sequentially. They also parse the text data directly from the buffers populated by the operating system file read operation - no data copies are needed. Next, they handle ANSI and UNICODE directly - they don't transcode all input text to UTF-16.
Now this isn't to say that the StreamBuffer class is slow - it isn't. Using off the shelf I/O libraries in native C++ results in comparable performance to the .NET StreamBuffer class. The difference comes from using hand crafted I/O routines specificaly designed to for sequntial text line reading and maximizing parallism between disk I/O and computation.
One of my future tasks is writing a COM wrapper for my hand crafted C++ I/O libraries so they can be used efficiently from .NET languages. I expect this will allow CLCV to come very close to the file read performance of my native tools.
Source Code for this post: CLCV-BLOG-4.ZIP contains the source code to this post. I didn't include the data files in this zip file as they haven't changed - you can simpy use the ones from CLCV-BLOG-3.ZIP.