While speaking with some of our internal performance engineers the other day, I was reminded of the method I’ve been using to explain sampling profiling to customers.  Since I don’t think it has made it into any of the official documentation or blogs, I’ll take a moment or two to share it with a broader audience.

Back in the day before the Visual Studio Team System profiler (and even before other sampling profilers were available), there were already performance problems – go figure.  My personal experience with an extreme example of this was in the early nineties when I was working on CodeView.  CodeView was the standalone COW (Character Oriented Windows) debugger that we shipped up until most of its functionality was incorporated into Developer Studio. Some of our internal customers were breathing down our necks because it took half an hour to load their application into CodeView for debugging when they had symbols available for all components.  As a stop-gap they were creating hybrid builds that had debug information for only certain modules, but they were definitely not happy with that solution!

So I knew up front that there was some problem with symbol load time, but how do I figure out what?  I used what was a pretty common technique of the day: run the scenario under the debugger, break a number of times, and take a look at the call stack.  This doesn’t always work, but in this extreme case it led me right to the main culprit.  We were reading the entire symbol table, doing an individual allocation for each symbol, and transforming each file-formatted symbol into an internal format.  My random call stacks consistently showed that we were spending lots of time in the code doing this allocation and transformation.  Reworking the code so that the entire debug section could be loaded as a single block, followed by a quick pass for some fixups, gave me a 10x performance gain without too much work.
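To make the before-and-after concrete, here is a minimal sketch in Python of the two loading strategies. The record layout, field names, and helper functions are all invented for illustration and have nothing to do with the actual CodeView symbol format:

```python
import struct

# Hypothetical 16-byte on-disk symbol record: offset, kind, 8-byte name.
RECORD = struct.Struct("<II8s")

def load_symbols_per_entry(data):
    # The original approach: one allocation and one format
    # transformation per symbol, eagerly, for every record.
    symbols = []
    for pos in range(0, len(data), RECORD.size):
        offset, kind, raw = RECORD.unpack_from(data, pos)
        symbols.append({"offset": offset, "kind": kind,
                        "name": raw.rstrip(b"\0").decode("ascii")})
    return symbols

def load_symbols_block(data):
    # The reworked approach: keep the whole section as one block and
    # hand out fixed-size views into it, with no per-symbol copying.
    block = memoryview(data)
    return [block[pos:pos + RECORD.size]
            for pos in range(0, len(block), RECORD.size)]

def symbol_name(view):
    # The "fixup" work happens only when a symbol is actually needed.
    _, _, raw = RECORD.unpack(view)
    return raw.rstrip(b"\0").decode("ascii")
```

The win comes from replacing thousands of small allocations and eager transformations with one contiguous load, deferring the per-symbol work until a symbol is actually used.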

Now while this may be a somewhat interesting historical note, you may ask what it has to do with sampling-based profiling. Aside from two pretty important differences, the debugger technique above is a poor man’s version of a sampling profiler.  In a sampling profiler, every time a specific event occurs we walk the call stack of the current thread and write that information to our log file.  When we later analyze the file, we aggregate the results to give you a picture of what your program was doing most often when those events occurred.
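That walk-the-stack-and-aggregate loop can be sketched in a few lines of Python. Here the "event" is just a timer tick, `sys._current_frames` stands in for the profiler's stack walker, and the workload function names are hypothetical:

```python
import collections
import sys
import threading
import time
import traceback

def sampler(target_ident, counts, stop, interval=0.001):
    # On every tick, walk the target thread's call stack and tally
    # the whole stack -- the log-then-aggregate step, collapsed.
    while not stop.is_set():
        frame = sys._current_frames().get(target_ident)
        if frame is not None:
            stack = tuple(f.name for f in traceback.extract_stack(frame))
            counts[stack] += 1
        time.sleep(interval)

def hot_loop():
    # Hypothetical hot function where most of the time is spent.
    total = 0
    for i in range(200_000):
        total += i * i
    return total

def workload():
    for _ in range(25):
        hot_loop()

counts = collections.Counter()
stop = threading.Event()
worker = threading.Thread(target=workload)
worker.start()
prof = threading.Thread(target=sampler, args=(worker.ident, counts, stop))
prof.start()
worker.join()
stop.set()
prof.join()

# The stacks tallied most often are where the program spent its time;
# here they should overwhelmingly end in hot_loop.
```

This is exactly the debugger trick automated: instead of me breaking in by hand a dozen times, the sampler collects hundreds of stacks and the Counter does the aggregation.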

Voila – we’ve just re-invented an ancient technique for tracking down performance problems!  But now let’s take a closer look at how the two techniques are different.

The first major difference is that a sampling profiler can gather and aggregate far more data than you can by breaking your program randomly in the debugger, so it can detect issues that are less extreme than the example I cited above.  Along with that comes the ability to do more sophisticated analysis than the flat set of call stacks you would see in the debugger.  This is the justification for having a sampling-based profiler in the first place.  I recommend that you take a look at some of the other VSTS profiler team blogs for loads of additional info. 

The second major difference is something that was pointed out by the performance engineers the other day, and it puts a different spin on a recurring issue with sampling profiling.  When you break in the debugger, it stops all threads and you can look at all of their call stacks.  With our implementation of a sampling profiler, we only report the call stack of the thread that was active at the time our event occurred.  Why is this a significant difference, you may ask?  Because a thread that is sitting in WaitForSingleObject or any other system-level synchronization call (this includes managed locks as well) won’t ever be sampled.  Our sampling profiler does a good job of showing you where you use the most of a particular resource (processor time, in the default case), but it doesn’t do anything to show you when the problem is multiple threads contending over the same resource.  In the debugger you can look at the call stacks of the inactive threads and see that some particular thread is sitting in WaitForSingleObject much more often than you expect. 
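Here is a minimal sketch of why the all-threads snapshot matters, again in Python with `sys._current_frames` standing in for the debugger's break-all-threads view. The thread functions and timings are invented for illustration:

```python
import sys
import threading
import time
import traceback

lock = threading.Lock()

def holder():
    # Grabs the lock and holds it, simulating a long critical section.
    with lock:
        time.sleep(0.5)

def waiter():
    # Blocks until holder releases the lock -- burns no CPU while
    # waiting, so a CPU-event sampler would never report it.
    with lock:
        pass

h = threading.Thread(target=holder)
h.start()
time.sleep(0.1)          # let holder acquire the lock first
w = threading.Thread(target=waiter)
w.start()
time.sleep(0.1)          # waiter is now blocked on the lock

# Debugger-style snapshot: the stacks of *all* threads, blocked or not.
snapshot = {}
for ident, frame in sys._current_frames().items():
    snapshot[ident] = [f.name for f in traceback.extract_stack(frame)]

# The blocked thread's stack still shows it parked inside waiter(),
# which is the clue a CPU-only sampling profile would miss.
stuck = [names for names in snapshot.values() if "waiter" in names]

h.join()
w.join()
```

The snapshot catches the waiting thread red-handed, even though that thread would contribute zero samples to a processor-usage profile.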

Now before you run out and start using the debugger as your primary tool for identifying resource contention issues, let me tell you that we do have a completely different solution to help with that.  It’s one of the things the instrumentation-based flavor of our profiler does pretty well, and I’ll follow up with a blog post about it sometime in the near future.

--dwg