How we test the compiler performance

Visual CPP Team

July 7th, 20100 0

The C++ back-end team is very conscious of the performance of our product. Today I will present to you an overview of how we define “performance of our product” and the way we measure it. Along the way I hope to introduce you to some new ideas that you can use to test your product’s performance as well. You can read Alex Thaman’s blog post on “How we test the compiler backend” for some background on compiler testing. Additionally, you can read Asmaa Taha’s blog post on VCBench, our performance test automation system.

Questions before answers

Some of the questions that you should answer before you start measuring the performance of your product are:

· What features of my product will customers need to be performant?

· What scenarios can I run to reliably measure performance?

· What metrics can I measure that will produce actionable results?

· How can I execute the tests to minimize variability?

Picking what features to measure

Understanding customer needs is the first step towards figuring out what feature areas they will need to be performant. Some of the avenues that we identify performance scenarios through are:

· Conversations with customers through conferences, meetings, surveys, blogs, etc.

· Connect bugs reported by you (a lot of attention given to connect bugs)

· Feedback from internal Microsoft teams

· Common historical performance bug reports

Based on this data we break down performance of the Visual C++ compiler into three areas:

· Compiler Throughput – seconds to compile a set of sources with a set of compiler parameters

o Especially important for customers who have to wait for builds to finish

· Generated Code Quality – seconds to run generated executable with a fixed workload

o Especially important for technical computing, graphics, gaming and other performance intensive applications

· Generated Code Size – number of bytes in the executable section(s) of the binary

o Especially important for applications that need to run on less memory

When deciding whether or not to add an optimization algorithm we look at the impact to these three areas on a number of benchmarks. In a perfect world the Visual C++ compiler could take infinite time to compile and generate the perfect binary. Realistically we make tradeoffs to generate the most optimized binary possible in a reasonable amount of time. We try to make sure that the optimization algorithms that we use have an appropriate compilation time cost to code quality & code size benefit ratio.

What scenarios do we run/what metrics do we measure?

Results for correctness tests are easy to report — pass or fail and in the case of fail give enough information about what failed to dig into the issue. Performance tests are trickier. Instead of a black or white pass/fail, you have “we’re fairly sure there is no appreciable change”, “we’re fairly sure there is a significant change”, or “there is too much variation to tell whether something changed”. To make matters worse, some areas of performance are changing while others are not; some are improving and some are regressing. On one extreme you can report thousands of different metrics for each tiny part of the performance that can change. This will drown your developers in numbers so that it takes them forever to interpret them. On the other extreme if you report too few numbers or the wrong numbers you will be hiding important changes and therefore missing regressions.

In general you want to report as few metrics as possible that give you the fullest coverage of your identified scenarios. Given that [large] caveat, here are the metrics that we have decided to monitor and report:

Throughput

We measure the compilation time of certain critical parts of Windows and SQL as well as smaller projects on a daily basis in order to track our throughput. On a less regular basis we time how long it takes to build all of Windows. The metrics that we gather are for the front-end, back-end and linker. This sums up to the entire compilation time, however it lets us be more granular in triaging where a performance regression came from.

Code Quality / Code Size

We build and run a set of industry standard integer and floating point benchmarks in order to monitor the code size and code quality of our optimized code generation. Each benchmark is real world code, but with the special constraints of being CPU bound (no waiting on user input or heavily reading/writing to the disk). Some of the benchmarks we run exercise: cryptographic algorithms, XML string processing, compression, mathematical modeling, and artificial intelligence. We also measure NT/SQL for code size changes.

For code size we use “dumpbin.exe /headers <generated_binary>” and then accumulate the virtual size of all the sections marked as executable. We do this instead of the simpler “how big is the entire binary” because if 98% of a binary is data (read: strings, images, etc) and we double code size then it only reads as a 2% code size regression. This is an example of where carefully choosing what metric you report allows us to more accurately see how our optimizations affect the size of the executable code sections of a binary. Dumpbin.exe is a tool that ships with Visual Studio.

For code quality we use whatever metric the benchmark deems as appropriate for measuring the quality of the generated code. In many instances this is execution time, but in some cases it may be something like “abstraction penalty” or “iterations per second”. It is important to note whether a larger number in the metric is better or worse so that developers know whether they are improving or regressing it!

Minimizing variability

One of the biggest problems in tracking performance is that, with the exception of code size, the results that are produced vary from one run to the next. Unlike correctness test cases which pass or fail, performance tests are susceptible to machine variances that can hide a performance change or show a performance change when there is none. To increase the fidelity of our results we run performance tests multiple times and merge the results into an aggregate value. This results in better accuracy of the results and removes outliers.

Use an appropriate aggregation method

We have found that performance results do not normally have a normal distribution; they are much closer to a skew distribution. Because of this using the median to aggregate results is better than using a mean of all of the results – it is less susceptible to large outliers. You can study the characteristic of the results you are getting out of your performance runs and determine the best method of aggregation. It may end up that using the minimum, maximum or a quartile result is best if you only have outliers in one direction. Our team primarily uses median.

Remove outliers

Here are some simple methods to remove outliers:

· Remove the top and bottom X% of results

· Remove all results outside of X standard deviations from the sample mean/median

In addition to throwing away general outliers, there can be a significant difference in the performance of a binary when it is not already loaded in the cache. Throwing out the first iteration as a warm up run can counteract this. Note: If your customer usage scenario is that they will primarily be executing it on a cold cache then you need to be concerned with this number!

Stabilize the machine before running the benchmark

The average windows machine has a lot of services running on it from SQL server to IIS to anti-virus. Shutting down as many applications and services as possible before executing your benchmark will help to reduce variability. Anti-virus in particular is important to disable because of all the places it can hook into the system and introduce additional overhead. Disabling your network adapters will also help to reduce noise.

Making results Actionable

At this point you have a set of benchmarks that run for a number of iterations on a stabilized machine, and you are aggregating the set of result iterations into two numbers that are your baseline results and your changed results. We now need to compare these results and determine whether they are:

1. The same (no performance change)

2. Different (a statistically significant change)

a. Significant improvement

b. Significant regression

3. Unactionable (we cannot tell if they are the same or different)

If the results are accurate enough, or the results are separated enough it can be easy to eyeball the numbers and tell what category the numbers fall in. However, if the numbers are fairly close or you are looking to fully automate this action you will need a more concrete algorithm for categorizing the results. This is where Confidence Intervals are useful – they allow you to say wi th certain confidence levels that two sets of results are either: the same, different, or that you need more iterations to tell one way or another.

How we test the compiler performance

Questions before answers

Picking what features to measure