If you’ve not read previous posts in this series about auto-vectorization, you may want to begin at the beginning.
This post explains how to measure the benefits of auto-vectorization – how much it speeds up your code. (To find out whether any particular loop was successfully vectorized, please see the earlier post called Did-It-Work?)
For this post, let’s use the following, small example:
As you can see, on line 10, it steps through array a, setting each element to the sin*cos of the corresponding element from array b. The arrays a and b are dimensioned to hold a million elements each. We repeat the measurement REPS times and report the average elapsed time.
For details of the Timer object, see Simon’s post on High-resolution timer for C++.
In order to experiment, you can disable the auto-vectorizer by including the following pragma just before the sin*cos loop:
#pragma loop(no_vector)
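As a minimal sketch of where the pragma goes, assuming the sin*cos loop is wrapped in a hypothetical helper called fill_sincos (the name and signature are illustrative, not taken from the original listing). The pragma is specific to Visual C++, so it is guarded here to let other compilers build the same code:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper (not from the original listing): sets a[i] to
// sin(b[i]) * cos(b[i]) for each of the n elements.
void fill_sincos(float* a, const float* b, std::size_t n) {
    assert(a != nullptr && b != nullptr);
#ifdef _MSC_VER
    // Visual C++ only: ask the compiler not to auto-vectorize the loop that
    // follows. Remove this line to let the auto-vectorizer run.
    #pragma loop(no_vector)
#endif
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = std::sin(b[i]) * std::cos(b[i]);
    }
}
```

Building once with the pragma and once without it, in a release configuration, should give the two "Release" cases to compare.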
Here are the results, in milliseconds, measured on my office PC, for three cases:
Configuration                 Time (ms)
Debug                         38
Release, no vectorization     13.6
Release, auto-vectorized      6.3
Notice how performance improves by 2.8X in going from a debug (non-optimized) run to a release (optimized-for-speed) run, and by a further 2.2X once we allow auto-vectorization to kick in.
We should ask ourselves whether 500 repetitions are enough to get a reliable average. Here is a graph showing the time, in milliseconds, taken for the first 20 or so reps. As you can see, the very first run is slower (due to our arrays not being in the cache yet – we say the cache is “cold”). But it settles down quickly to a consistent result. (Not very scientific, I agree, but plausible.)
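One way to gather per-repetition numbers like these is to record each rep's time separately instead of only the average. The sketch below is an assumption about how that might be done: it uses std::chrono::steady_clock rather than the Timer object, and the function name time_each_rep, the input values, and the sink variable are all illustrative.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

// Returns the elapsed time, in milliseconds, of each repetition of the
// sin*cos loop over n elements. Names are hypothetical, chosen for this sketch.
std::vector<double> time_each_rep(std::size_t n, int reps) {
    assert(n > 0 && reps > 0);
    using Clock = std::chrono::steady_clock;

    std::vector<float> a(n), b(n);
    for (std::size_t i = 0; i < n; ++i) {
        b[i] = static_cast<float>(i) * 0.001f;  // arbitrary input values
    }

    std::vector<double> ms;
    ms.reserve(static_cast<std::size_t>(reps));
    volatile float sink = 0.0f;  // reads a result so the work is not optimized away

    for (int r = 0; r < reps; ++r) {
        const auto start = Clock::now();
        for (std::size_t i = 0; i < n; ++i) {
            a[i] = std::sin(b[i]) * std::cos(b[i]);
        }
        const auto stop = Clock::now();
        ms.push_back(std::chrono::duration<double, std::milli>(stop - start).count());
        sink = a[static_cast<std::size_t>(r) % n];  // outside the timed region
    }
    (void)sink;
    return ms;
}
```

Calling time_each_rep(1000000, 20) and printing the returned values after the function returns would show whether the first rep stands out from the rest on a given machine.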
That’s all we really need in order to experiment – just surround the loop of interest with a t.Start() and a t.Stop(). Then grab the elapsed time, for that loop, using t.Elapsed(). Include enough repetitions to get a meaningful average.
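The pattern can be sketched as follows. The Timer class here is a hypothetical stand-in built on std::chrono::steady_clock, not the class from the High-resolution timer post; it assumes Elapsed() returns milliseconds, and average_ms is an illustrative name.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the Timer object (NOT the original class).
// Assumption: Elapsed() reports the last Start()..Stop() interval in ms.
class Timer {
    using Clock = std::chrono::steady_clock;
    Clock::time_point start_{};
    Clock::time_point stop_{};

public:
    void Start() { start_ = Clock::now(); }
    void Stop() { stop_ = Clock::now(); }
    double Elapsed() const {
        return std::chrono::duration<double, std::milli>(stop_ - start_).count();
    }
};

// Average time, in milliseconds, of the sin*cos loop over `reps` repetitions.
double average_ms(std::vector<float>& a, const std::vector<float>& b, int reps) {
    assert(reps > 0 && a.size() == b.size());
    Timer t;
    double total = 0.0;
    for (int r = 0; r < reps; ++r) {
        t.Start();  // time only the loop of interest
        for (std::size_t i = 0; i < a.size(); ++i) {
            a[i] = std::sin(b[i]) * std::cos(b[i]);
        }
        t.Stop();
        total += t.Elapsed();
    }
    return total / reps;
}
```

Only the inner loop sits between Start() and Stop(); set-up of the arrays and any printing of the average happen in the caller, outside the timed region.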
You need to be very careful in measuring the performance of programs, especially if you are going to discuss the results with anyone other than your cat. There’s a good chance you will end up in a fist fight.
Anyhow, partly to forestall bitter arguments in the comments for this blog post about the effectiveness of auto-vectorization, and partly because it’s actually useful to know, the rest of this post discusses issues in measuring performance. None of it is essential, or even specific to auto-vectorization, so feel free to skip it.
First, note that we are interested only in the performance of compute-bound loops. The programs we measure in this blog do not include any IO to disk, console, screen, network, or anything else. Make sure that any IO used to report results (typically to the console) lies outside the timed loop.
Next, before changing things to improve speed, you must profile your code! This will help avoid effort wasted in optimizing code-paths whose contribution to overall performance is negligible.
[NitPick: this recalls the maxim: “premature optimization is the root of all evil”? Yes, and this article examines the amusing confusion over its origin – Knuth or Hoare]
There are lots of profilers available, ranging in price from $0 up to hundreds of dollars. Visual Studio provides profiling under the ANALYZE tab. You can experiment with this feature using the free Visual Studio Ultimate 2012 RC download.
Here then, are some factors to keep you awake at night:
Good luck!