Sign In
Rico Mariani's Performance Tidbits
Implying no warranties and conferring no rights: "AS IS" since 1988
Translate This Page
Translate this page
Powered by
Microsoft® Translator
Options
Blog Home
Email Blog Author
Share this
RSS for posts
Atom
RSS for comments
Search
Advanced search options...
Search In:
Everything
Blogs
Forums
People
Groups
Places
Pages
Date range:
All Time
Last Year
Last 6 Months
Last 3 Months
Last Month
Last Week
Last Two Days
Tags
databases
debuggers
design advice
History of Visual Studio
locking
Pages
performance
quiz
ramblings
recommendations
signatures
using tools
visual studio
Archive
Archives
March 2012
(3)
February 2012
(3)
January 2012
(3)
December 2011
(1)
November 2011
(1)
September 2011
(1)
September 2010
(1)
June 2010
(1)
April 2010
(5)
December 2009
(1)
October 2009
(13)
September 2009
(2)
August 2009
(1)
June 2009
(3)
May 2009
(3)
December 2008
(1)
November 2008
(4)
September 2008
(2)
August 2008
(6)
June 2008
(2)
May 2008
(2)
February 2008
(2)
January 2008
(2)
November 2007
(5)
October 2007
(2)
September 2007
(3)
August 2007
(3)
July 2007
(3)
June 2007
(6)
May 2007
(1)
April 2007
(2)
March 2007
(1)
February 2007
(5)
January 2007
(7)
December 2006
(2)
November 2006
(1)
September 2006
(4)
August 2006
(4)
July 2006
(12)
June 2006
(3)
May 2006
(5)
April 2006
(6)
March 2006
(6)
February 2006
(2)
January 2006
(2)
December 2005
(2)
November 2005
(8)
October 2005
(4)
September 2005
(4)
August 2005
(7)
July 2005
(2)
June 2005
(3)
May 2005
(12)
April 2005
(3)
March 2005
(5)
February 2005
(2)
January 2005
(3)
December 2004
(2)
November 2004
(2)
October 2004
(3)
September 2004
(4)
August 2004
(3)
July 2004
(4)
June 2004
(8)
May 2004
(6)
April 2004
(4)
March 2004
(8)
February 2004
(4)
January 2004
(3)
December 2003
(11)
Performance Quiz #2
MSDN Blogs
>
Rico Mariani's Performance Tidbits
>
Performance Quiz #2
Performance Quiz #2
ricom
29 Mar 2004 9:37 PM
Comments
6
Well it was so much fun the first time, it's time for an encore :)
Consider the functions Test1 and Test2 below.
Q1: On average which one will run faster?
Q2: Which offers the best case performance?
Q3: What assumptions did you make to answer the above?
namespace Test
{
class Test
{
static Random rand;
static int size = 65536;
struct P1
{
public double x;
public double y;
}
static double Test1()
{
int i,j;
double sum = 0;
P1[] p = new P1[size];
// initialize the array
for (i=0; i<size; i++)
{
p[i].x = rand.NextDouble();
p[i].y = rand.NextDouble();
}
// use the array
for (j=0; j<size*10; j++)
{
i = (int)(size*rand.NextDouble());
double z = Math.Sqrt(p[i].x + p[i].y);
sum += z;
}
return sum;
}
struct P2
{
public double x;
public double y;
public double z;
}
static double Test2()
{
int i,j;
double sum = 0;
P2[] p = new P2[size];
// initialize the array
for (i=0; i<size; i++)
{
p[i].x = rand.NextDouble();
p[i].y = rand.NextDouble();
p[i].z = Math.Sqrt(p[i].x + p[i].y);
}
// use the array
for (j=0; j<size*10; j++)
{
i = (int)(size*rand.NextDouble());
sum += p[i].z;
}
return sum;
}
}
}
Do not read past this line if you want to avoid spoilers.
Well this problem isn’t like the last one I gave you, it didn’t come up in the normal course of events, but rather I made it up specifically to see what kinds of answers my own team might give and to give them (and you) something to think about. I posed the problem to my own team first because I thought their responses might be instructional for others.
Now in the analysis that follows you’ll see that the results are sensitive to a variety of factors, but I’m actually not terribly concerned about the particulars of the numbers but rather the contours of the behavior that you can expect to see and the behavior I’m going to talk about would be exhibited across a variety of processors and cache sizes at different points as we will see.
To help with this analysis I’ve produced the following chart in VML so that it can be fetched (hopefully) and viewed (also hopefully) by a variety of systems via RSS. Sadly all I have to test it with is IE so I’m just crossing my fingers. If you can’t see the chart, hopefully the discussion will be useful in any case.
Log2(scale)
25
24
23
22
21
20
19
18
17
16
15
14
13
12
11
10
9
8
7
6
5
4
3
2
1
0
Time
70
65
60
55
50
45
40
35
30
25
20
15
10
Test1
Test2
OK having seen the graph here’s the answers, but as you’ll see the specifics are less interesting than the analysis.
Answers and Summary of Responses
Q1: On average which one will run faster?
Note that I had to ask “on average” because of the random number generation in both algorithms. But having done enough runs to smooth out the random number generation we can see that, on my system, with the specific numbers as given in the problem statement it’s a dead heat. Simply look up at the numbers under column 16, which corresponds to size = 1<<16 = 65536 which is the value in question.
The responses I got were in one of two camps: there were those that felt that the extra cache misses that would be caused by the larger amount of data in Test2 would more than compensate for the additional double-precision square roots, giving Test1 the victory due to its more frugal use of space. And there were those that felt the reverse.
Q2: Which offers the best case performance?
There is no good answer to this problem because it depends on the size but it was interesting to see how people thought about “best.” Perhaps the most interesting points of view were those that considered size zero to be the best case for both because they run the fastest on an empty dataset. I suppose this is true but it’s somewhat meaningless to consider an isolated case like that, especially one in which both algorithms vacuously run in near-zero time. Might we not say then that an algorithm that runs in O(n^2) time has the same best case as one that runs in O(n^3) time because of what they do at n=0? While that may be true from some perspective, it isn’t helpful.
But read on for the analysis for a better treatment of how they algorithms compare and why.
Q3: What assumptions did you make to answer the above?
Most people’s assumptions had to do with relative performance of floating point math vs. the cost of cache misses and the size of the cache as well as the size of doubles.
Analysis
A dead heat? I’m such a stinker. And some wondered why I did such unusual math in the test body, with no squaring and so forth. I admit it, I rigged the fight. But why? Let’s look at the numbers because they’re very cool.
First, when reading the chart you should know that the amount of “work” done in each of the samples is fixed. This is to say that when I did the timing I varied the number of iterations inversely with the size of the problem so that the total amount of summing was fixed. I did this deliberately so that times would be comparable across all runs.
OK, looking at the graph, we can see that for sizes up to 2^14 Test2 is the clear winner. Why is this? Well, for those sizes in both cases all the data was fitting in my processors L2 cache, which means there was no additional time overhead associated with using the memory and so all that happens when cache the z value is that we save time.
It’s important to note just how nearly-flat the time is here in both cases. Even though the smaller size arrays are doing many more allocations there is only a barely measurable speedup going from allocations of size 16 bytes to 32, 64, and so forth. By the flatness of both graphs you can see that the amortized cost of allocating memory and collecting memory varies with the number of bytes allocated, which is fixed, and not with the number of allocations, which is going down geometrically.
As the size approaches 2^14 you can see the time going up gradually in both cases, corresponding to the cache becoming fully loaded and the necessary cache misses to get the cache primed.
At 2^15 the interesting things start to happen. This is if you like the shining moment for Test1 – both algorithms took a big hit going to 32k entries in the array but Test1 is still getting a substantial number of cache hits while Test2 is now paying a hefty penalty for more misses.
At 2^16, which is the problem as given, Test1 is still getting enough extra cache hits to break even with the math.
And now we come to the most interesting phenomenon – at size 2^17 Test2 is back in the lead, and it won’t look back for a long time. Why? Well at this point, on my machine anyway, the data is too big for the cache in BOTH algorithms. Neither is getting a substantial number of cache hits and so the distance between them is reverting to the previous savings of some double-precision math.
The real failing of the answers I got to this problem was here, the reason I couldn’t give anyone full marks was that despite notions that memory would soon dominate, nobody predicted that for larger sizes the cache would stop being a factor.
Again moving up to sizes 2^23 you can see that the graph is still very flat. Remember the x-axis is logarithmic. With each new data point doubling the size of the array if I were to plot the graph in non-log scale you’d see a nearly flat line for almost all the graph and a very sharp almost vertical line at the extreme left near zero.
But even as I say it’s nearly flat, we can see that the green line for Test2 is growing a lot faster than the red Test1 line. What is going on here?
Well what’s happening now is that as the test arrays are increasing in size they are requiring greater and greater percentages of the machines overall memory. That doesn’t come so easily, and in fact the next level of the memory hierarchy starts to play a factor – to find the necessary physical memory the operating system is having to page out some data pages that were serving other processes, incurring some significant overhead. Ultimately, at size 2^23 Test1’s more frugal algorithm starts winning again.
Conclusions
What do we learn from all of this?
Very often I’m shown new and potentially better algorithms along with a micro-benchmark that indicates how vastly superior the new algorithm is. These have to be taken with a good deal of salt, and you yourself should be very careful about how much weight you put in your own isolated measurements.
Had you been shown a benchmark for Test1 and Test2 with some number of iterations and data-size in which either was shown to be decidedly better, you would only have been seeing a part of the puzzle. The truth is there’s a real horse-race here.
Looking at the big picture you can more readily appreciate the demand that the algorithms are making on system resources to achieve their throughput. And in fact, if these algorithms were running as part of a larger system Test2 would have a much greater collateral impact on the overall system even though when run in isolation it is faster for many more data sizes.
It’s vital to measure code in the manner in which it will be used, and not (only) in isolation.
Analyzing two functions with random numbers at their core may seem strange but there are many algorithms which either randomize their inputs, or deal with input that is nearly random in the first place – though perhaps not as evenly distributed randomness as this example.
And having seen all this data which do I recommend?
The performance architects of the CLR are again united in their opinion: Test1 is the winner. The benchmarks show that it is in the neighborhood of 13% slower in the worst of times and consistently uses 1/3 less data. We felt that when used in the context of a larger program the additional throughput was not worth the resource consumption. Overall performance was more likely to benefit from the frugalness of Test1 than the caching of Test2.
Raw Data
Log
2
Size
Test1
Test2
4
18.09
16.13
5
18.21
16.03
6
18.16
15.96
7
17.84
15.91
8
17.83
15.94
9
17.92
16.18
10
18.4
16.34
11
18.66
16.55
12
18.79
17.84
13
19.22
17.71
14
19.89
19.09
15
28.31
31.54
16
44.36
44.05
17
55.29
50.45
18
58.2
54.19
19
60.3
56.43
20
60.8
57.61
21
62.04
58.78
22
62.79
61.46
23
66.24
67.23
6 Comments
performance
,
quiz
Blog - Comment List MSDN TechNet
Comments
Loading...