Well it was so much fun the first time, it's time for an encore :)

Consider the functions Test1 and Test2 below.

Q1: On average which one will run faster?
Q2: Which offers the best case performance?
Q3: What assumptions did you make to answer the above?

namespace Test
{
class Test
{
static Random rand;
static int size = 65536;

struct P1
{
public double x;
public double y;
}

static double Test1()
{
int i,j;
double sum = 0;
P1[] p = new P1[size];

// initialize the array
for (i=0; i<size; i++)
{
p[i].x = rand.NextDouble();
p[i].y = rand.NextDouble();
}

// use the array
for (j=0; j<size*10; j++)
{
i = (int)(size*rand.NextDouble());
double z = Math.Sqrt(p[i].x + p[i].y);
sum += z;
}
return sum;
}

struct P2
{
public double x;
public double y;
public double z;
}

static double Test2()
{
int i,j;
double sum = 0;
P2[] p = new P2[size];

// initialize the array
for (i=0; i<size; i++)
{
p[i].x = rand.NextDouble();
p[i].y = rand.NextDouble();
p[i].z = Math.Sqrt(p[i].x + p[i].y);
}

// use the array
for (j=0; j<size*10; j++)
{
i = (int)(size*rand.NextDouble());
sum += p[i].z;
}
return sum;
}
}
}

Do not read past this line if you want to avoid spoilers.

Well this problem isn’t like the last one I gave you, it didn’t come up in the normal course of events, but rather I made it up specifically to see what kinds of answers my own team might give and to give them (and you) something to think about.  I posed the problem to my own team first because I thought their responses might be instructional for others.

Now in the analysis that follows you’ll see that the results are sensitive to a variety of factors, but I’m actually not terribly concerned about the particulars of the numbers but rather the contours of the behavior that you can expect to see and the behavior I’m going to talk about would be exhibited across a variety of processors and cache sizes at different points as we will see.

To help with this analysis I’ve produced the following chart in VML so that it can be fetched (hopefully) and viewed (also hopefully) by a variety of systems via RSS.  Sadly all I have to test it with is IE so I’m just crossing my fingers.  If you can’t see the chart, hopefully the discussion will be useful in any case.

Log2(scale) 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 Time 70 65 60 55 50 45 40 35 30 25 20 15 10 Test1 Test2

OK having seen the graph here’s the answers, but as you’ll see the specifics are less interesting than the analysis.

Q1: On average which one will run faster?

Note that I had to ask “on average” because of the random number generation in both algorithms.  But having done enough runs to smooth out the random number generation we can see that, on my system, with the specific numbers as given in the problem statement it’s a dead heat.  Simply look up at the numbers under column 16, which corresponds to size = 1<<16 = 65536 which is the value in question.

The responses I got were in one of two camps:  there were those that felt that the extra cache misses that would be caused by the larger amount of data in Test2 would more than compensate for the additional double-precision square roots, giving Test1 the victory due to its more frugal use of space.  And there were those that felt the reverse.

Q2: Which offers the best case performance?

There is no good answer to this problem because it depends on the size but it was interesting to see how people thought about “best.”  Perhaps the most interesting points of view were those that considered size zero to be the best case for both because they run the fastest on an empty dataset.  I suppose this is true but it’s somewhat meaningless to consider an isolated case like that, especially one in which both algorithms vacuously run in near-zero time.  Might we not say then that an algorithm that runs in O(n^2) time has the same best case as one that runs in O(n^3) time because of what they do at n=0?  While that may be true from some perspective, it isn’t helpful.

But read on for the analysis for a better treatment of how they algorithms compare and why.

Q3: What assumptions did you make to answer the above?

Most people’s assumptions had to do with relative performance of floating point math vs. the cost of cache misses and the size of the cache as well as the size of doubles.

Analysis

A dead heat?  I’m such a stinker.  And some wondered why I did such unusual math in the test body, with no squaring and so forth.  I admit it, I rigged the fight.  But why?  Let’s look at the numbers because they’re very cool.

First, when reading the chart you should know that the amount of “work” done in each of the samples is fixed.  This is to say that when I did the timing I varied the number of iterations inversely with the size of the problem so that the total amount of summing was fixed.  I did this deliberately so that times would be comparable across all runs.

OK, looking at the graph, we can see that for sizes up to 2^14 Test2 is the clear winner.  Why is this?  Well, for those sizes in both cases all the data was fitting in my processors L2 cache, which means there was no additional time overhead associated with using the memory and so all that happens when cache the z value is that we save time.

It’s important to note just how nearly-flat the time is here in both cases.  Even though the smaller size arrays are doing many more allocations there is only a barely measurable speedup going from allocations of size 16 bytes to 32, 64, and so forth.  By the flatness of both graphs you can see that the amortized cost of allocating memory and collecting memory varies with the number of bytes allocated, which is fixed, and not with the number of allocations, which is going down geometrically.

As the size approaches 2^14 you can see the time going up gradually in both cases, corresponding to the cache becoming fully loaded and the necessary cache misses to get the cache primed.

At 2^15 the interesting things start to happen.  This is if you like the shining moment for Test1 – both algorithms took a big hit going to 32k entries in the array but Test1 is still getting a substantial number of cache hits while Test2 is now paying a hefty penalty for more misses.

At 2^16, which is the problem as given, Test1 is still getting enough extra cache hits to break even with the math.

And now we come to the most interesting phenomenon – at size 2^17 Test2 is back in the lead, and it won’t look back for a long time.   Why?  Well at this point, on my machine anyway, the data is too big for the cache in BOTH algorithms.  Neither is getting a substantial number of cache hits and so the distance between them is reverting to the previous savings of some double-precision math.

The real failing of the answers I got to this problem was here, the reason I couldn’t give anyone full marks was that despite notions that memory would soon dominate, nobody predicted that for larger sizes the cache would stop being a factor.

Again moving up to sizes 2^23 you can see that the graph is still very flat.  Remember the x-axis is logarithmic. With each new data point doubling the size of the array if I were to plot the graph in non-log scale you’d see a nearly flat line for almost all the graph and a very sharp almost vertical line at the extreme left near zero.

But even as I say it’s nearly flat, we can see that the green line for Test2 is growing a lot faster than the red Test1 line.  What is going on here?

Well what’s happening now is that as the test arrays are increasing in size they are requiring greater and greater percentages of the machines overall memory.  That doesn’t come so easily, and in fact the next level of the memory hierarchy starts to play a factor – to find the necessary physical memory the operating system is having to page out some data pages that were serving other processes, incurring some significant overhead.  Ultimately, at size 2^23 Test1’s more frugal algorithm starts winning again.

Conclusions

What do we learn from all of this?

Very often I’m shown new and potentially better algorithms along with a micro-benchmark that indicates how vastly superior the new algorithm is.  These have to be taken with a good deal of salt, and you yourself should be very careful about how much weight you put in your own isolated measurements.

Had you been shown a benchmark for Test1 and Test2 with some number of iterations and data-size in which either was shown to be decidedly better, you would only have been seeing a part of the puzzle.  The truth is there’s a real horse-race here.

Looking at the big picture you can more readily appreciate the demand that the algorithms are making on system resources to achieve their throughput.  And in fact, if these algorithms were running as part of a larger system Test2 would have a much greater collateral impact on the overall system even though when run in isolation it is faster for many more data sizes.

It’s vital to measure code in the manner in which it will be used, and not (only) in isolation.

Analyzing two functions with random numbers at their core may seem strange but there are many algorithms which either randomize their inputs, or deal with input that is nearly random in the first place – though perhaps not as evenly distributed randomness as this example.

And having seen all this data which do I recommend?

The performance architects of the CLR are again united in their opinion: Test1 is the winner.  The benchmarks show that it is in the neighborhood of 13% slower in the worst of times and consistently uses 1/3 less data.  We felt that when used in the context of a larger program the additional throughput was not worth the resource consumption.  Overall performance was more likely to benefit from the frugalness of Test1 than the caching of Test2.

Raw Data

 Log2 Size Test1 Test2 4 18.09 16.13 5 18.21 16.03 6 18.16 15.96 7 17.84 15.91 8 17.83 15.94 9 17.92 16.18 10 18.4 16.34 11 18.66 16.55 12 18.79 17.84 13 19.22 17.71 14 19.89 19.09 15 28.31 31.54 16 44.36 44.05 17 55.29 50.45 18 58.2 54.19 19 60.3 56.43 20 60.8 57.61 21 62.04 58.78 22 62.79 61.46 23 66.24 67.23