Once again I’ll begin by saying that I’m simplifying the below in the interest of not writing a novel, so please take this article as only approximately correct.  I try to make the principles more important than the details.

I’ve said before, that if you make no measurements, you’re pretty much doomed to miserable performance.  If you can get good performance without measuring then I think that you’ve missed your calling in life and probably you should change careers to “Professional Lottery Ticket Picker” and please call me up J

So the only thing worse than no data, is bad data, or badly interpreted data, or both.  And now without further ado, here is Rico’s free advice on bad data.

1) If you’re using an invasive profiler, don’t believe the times it gives you

When I say an invasive profiler, I mean one that instruments every call site in the program that’s being monitored.  And when I say don’t believe the times, I mean you might be better off just pretending it isn’t even giving you times at all.  There are several reasons why this is usually a good idea.

A typical invasive profiler might hook the prolog and epilog of every function.  It measures the time spent in each call and then, in digital defiance of the Uncertainty Principle, it attempts to subtract out its own influence.  Actually the problem here isn’t really quantum mechanical at all – the problem has to do with numerically unstable computations.

The function being measured is going to take some amount of time let’s call it x. And, performing the measurement is going to take some amount of time, in a fit of originality we’ll call that y.  When we hook the function and note its entry time and exit time, we’ll have actually measured x+2y.  We only want x so we’re going to have to subtract out the 2y.  The way this is usually done is that at the beginning of the run we carefully measure how long it takes to take a measurement, thereby getting a high precision measurement of y.  And armed with that number we bravely multiply it by two and prepare to subtract out all the interference.

Several things go wrong at this point.

First, y is very small, and if our function is a small hot thing being called very often then x is also very small.  The worst thing you can do to kill your significant digits is try to subtract two things that are close in magnitude in order to get a (much smaller) difference.  That’s exactly what we’d be doing here for small functions, because x+2y is actually quite close to 2y.  The next thing we’ll do is repeat this operation, perhaps tens of thousands of times for frequently called functions, multiplying the potential error.  I can feel my Numerical Analysis professor shuddering.

Second, y has significant variability from function to function.  This is what’s making the first problem truly deadly.  You see, on modern processors the cost of doing pretty much anything is deeply dependant on what happened previously, and in some cases on what is to come.  Even simple looking instructions like ones that record a time into some kind of buffer can and are deeply affected by the processors cache.  As a result, the average measurement of y at program startup could be significantly different than what is observed in any given function.  That spells trouble.  I remember a friend working on one of these started finding that in some cases he was getting negative times in functions because the overhead was much less in those cases and subtracting out the 2y measured at startup was more than the entire observed x+2y in the run.  Ouch.

But all is not lost, the bigger the function, or more generally, the bigger the time period being measured, the greater the accuracy.  But remember, if the function calls other functions all those nested overheads will need to be subtracted out as part of the process so that’s not helping you at all.   Only the big meaty worker functions are truly getting the best measurements.  Small functions, put little trust in the number, large function, put more trust.  Don’t compare the time spent calling a small function thousands of times to the time spent in fewer calls to a big function as the error margin is likely to be huge (I’ve seen times that were off by a factor of 5).

What these profilers excel at is giving you a picture of what is being called and how often.  In fact the data here is definitive.  So do trust the counts and count origins and use those to guide your analysis.  In my experience, these kinds of runs are the most costly and the least reproducible from run to run (time-wise).  That makes it hard to know if you’re actually making progress.  Take the times with a big grain of salt.  Milk the counts.

2) If you’re using a sampling profiler, make sure you have plenty of samples

Would you believe that 4 out of 5 dentists really prefer Brand X toothpaste if only 5 dentists were actually surveyed?  Of course not.  Yet perfectly sane engineers believe that 10% of their startup execution is in Function X because 1 of their 10 samples happened to land in that function.

Sampling profilers work just as the name implies.  They sample the processors IP (and sometimes a whole callstack) periodically when some event occurs, usually the passage of a fairly small amount of time, like say 5 milliseconds.   I like sampling profilers because they are the least invasive and so you can mostly trust the times.  I say mostly because you must have a statistically valid sample size to make any conclusions at all.

Where sampling profilers really fall down is when you’re trying to measure, in detail, a brief event that occurs infrequently.  Actually even getting detail on a long event that occurs infrequently is problematic because you’ll miss so much of the fine detail with the sampling.  You really need the event to be repeated so that you have plenty of chances to get a sample at many interesting points before things get good.

Even with these limitations, for my money, sampled data is the best.

3) Fear the interference

I sort of hinted at this earlier but it bears discussion on its own merits.

When measuring a large program it’s temping to isolate the part that’s interesting to you and measure just that without running the rest.  You might even be able to boil the relevant part down into a so-called micro-benchmark.  It’s very temping considering the time savings and productivity gains you can get by reducing the analysis.  Don’t do it.  Or at least, don’t do it without being very careful.

A big part of how your program runs and where the costs are has to do with how it interferes with itself.  A simple example is where two algorithms both tuned to have good caching behavior might play very poorly together because collectively they create a very bad caching pattern.  But even algorithms that are playing comparatively well together are placing somewhat of a burden on each other. If you were to run only part of the program and measure that, you might not be able to conclude very much about how changing the part you measured would affect the whole.

The very best measurements, and the only ones that really count, are the ones that time what the user will experience.  Anything that isn’t the user experience is something we’ve created for our own productivity, and those are useful only to the extent that they are probative. Do they help us to find real user problems more quickly?  Do they predict the user experience at all?  What correction factor must be applied (if any)?

I think about 4 times a year I see a new implementation of memcpy that is purportedly better than all previous known memcpy’s.  Now the thing about memcpy is people have been thinking about this problem for a long time, so already you have to treat a newcomer with some suspicion.  But the real kicker is that the proof that is usually offered comes in the form of micro-benchmarks of the standard memcpy and the super-memcpy side-by-side.  This is kind of interesting but it may not be very probative.  What really matters is: “Is the new thing faster in the context of actual programs making calls to memcpy?” Is it as frugal with cache lines?  Are there other side-effects?  Does the benchmark accurately reflect the typical alignment and length sizes seen or is it all “easy cases?”

I think it’s fair to say that memcpy is not changed four times a year, notwithstanding the availability of new implementations at that rate.  Thorough analysis on even so small a thing is after all quite difficult.