How we test the compiler backend


My name is Alex Thaman and I am a Senior Test Lead on the Visual C++ compiler team at Microsoft.  This post focuses on testing the compiler backend, where I’ve spent a good portion of my time here.  For those not aware, this is the part of the compiler that takes an intermediate representation as input, performs optimizations, and generates code.

 

I will walk you through the compiler backend testing domain, the kinds of bugs we deal with in the backend, and how we go about testing it.

 

The Compiler testing domain

 

Compiler testing exists in a different domain than many other kinds of application or system testing.  Here are a few attributes of compilers that inform how we think about testing:

·        Compilers have short-lived execution times like most other command-line tools.  By “short” I mean that it runs, does some work, outputs some files, and exits.

·        There is no user interaction during execution. 

·        Compilers execute in phases: each phase applies a transformation to its input, and its output becomes the input of the next phase, which makes the phases interdependent.  This also means it can be difficult to construct test cases that reach specific code paths, especially in the later compiler phases.

·        Many compilers do analysis at the entire program level, so data about the entire input set may be sitting in memory and may all be operated upon at once

·        With some exceptions, compiler outputs cannot easily be verified by inspection or by another kind of test tool.  They need to actually be executed.  One reason is that there are many correct outputs in terms of machine instructions; another is that the output can change from day to day and still be correct; and lastly, the output is very large, which compounds the first two issues.

 

Bugs

 

Below are the categories of bugs that we deal with on a day-to-day basis in the compiler:

·        Compiler crashes (also known as an ICE or Internal Compiler Error) – This is simply some kind of failure during execution of the compiler

·        Compiler hangs – Some kind of infinite loop in the compiler.  Because the compiler back-end is single-threaded, there is no possibility of application-level deadlocks

·        Incorrect error/warning output – This can be either an error/warning that fired when it should not have, or an error/warning that should have fired but did not.  The latter case is interesting because it is often not so much a bug but a limitation in the feature, at least in the case of warnings.  Some warnings require some extensive code analysis to fire in all the cases that they should.  We do make efforts to ensure that we are giving customers the best possible information when they have done something incorrect

·        Bad code generation – This is a result of incorrect compiler output and is by far the most devastating bug of any kind.  There are two classes of bad code generation:

a)      Bad code generation that leads to a crash in the application – These are the less problematic of the two cases, both because the effects are milder and because the bug is easier to discover.  The effect is an application crash, which in many cases is resolved with a restart and does not corrupt data.  It is (typically) easy to pinpoint the bad code generation because the crash gives you a call stack and you can see in front of you what got corrupted.

b)      “Silent” bad code generation – These are the worst kind of bugs, not only because they can result in data loss but also because tracking them down can be extremely difficult: it’s not always easy to find what got corrupted and where.  You can imagine that this problem is worse for a multithreaded app – we have seen silent bad codegen bugs where a variable’s volatile attribute was not honored in a certain loop.  Sometimes silent bad codegen is a bad memory write that would have crashed had it hit invalid memory, but that happens to land on a location holding valid data.  An example would be if you set i = 100 but instead of writing to i, the generated code overwrites j.  This is just an example – it never manifests in a form this simple.

·        Compiler throughput issues – Issues that affect the amount of time the compiler takes to compile code

·        Code quality issues – Issues that affect the performance of the compiled application

·        Compiler feature correctness issues – This class of bugs involves the compiler generating correct code, but not doing what a particular feature specifies should be done.  An example here would be not adding a security cookie when the /GS switch is passed.  In this case, the code would execute just fine but would not have the same buffer overrun security protection that the user would expect

·        Other peripheral behavioral issues – There are things related to compiling that can be affected by compiler bugs.  The biggest example is debugging information, where the result is typically seen in the form of the debugger not doing what you would expect
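To make the “silent” bad code generation category concrete: because nothing crashes, a test can only catch it by checking its own results.  The following is an illustrative sketch of that shape, not one of our actual tests; the names sum_to and run_test are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical function under test: a simple loop that an optimizer
// is free to unroll, vectorize, or replace with a closed form.
std::uint32_t sum_to(std::uint32_t n) {
    assert(n <= 92681u);  // keep n * (n + 1) / 2 within 32 bits
    std::uint32_t total = 0;
    for (std::uint32_t i = 1; i <= n; ++i) {
        total += i;
    }
    return total;
}

// Returns 0 on success and 1 on a mismatch, so the value can serve as
// the process exit code that a batch script checks.
int run_test() {
    // Reading the bound through a volatile keeps the compiler from
    // folding the whole computation away at compile time.
    volatile std::uint32_t bound = 100;
    const std::uint32_t expected = 5050;  // 100 * 101 / 2
    return sum_to(bound) == expected ? 0 : 1;
}
```

A driver’s main would simply return run_test(), and the same source would be built and run once per optimization setting.  A miscompiled loop then shows up as a nonzero exit code rather than as silently wrong data.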

 

One last thing to note is that the most interesting testing space for the backend is specifically optimized code.  A non-optimizing compiler does not do all that much work.  Though we do test /Od compilation, we don’t spend a lot of time on it and it generates far fewer bugs; the same goes for monitoring /Od build times and /Od code generation.  Note that Debug build times in an end-to-end scenario are measured quite often, but the compiler backend generally contributes only a small portion to this.

 

How Do We Test?

 

Now that I’ve explained the compiler testing space and the kinds of bugs we deal with, I’ll explain how we actually test the backend.

 

Writing Tests

We create A LOT of tests, really A LOT of tests :).  We’re talking on the order of hundreds of thousands of small tests.  To understand why, try to think of how you might test exception handling (EH).  You might come up with simple cases involving a test throwing an exception and catching it, throwing and not catching, simple nested exceptions, etc.  These are pretty basic, and would constitute tests that a developer would run before every check-in just to make sure the product works at a fundamental level.  We also have a much larger suite of tests to verify that our compiler EH code generation is ready for production.  Without going into too many details, we have to ensure that throwing various kinds of objects (including ones with copy constructors and destructors that get called during stack unwinding), dealing with weird control flow around EH (what happens when you have a goto from a handler to outside the try?), etc. all work as the user would expect.  You can see that this matrix of cases can explode.  I have not actually counted, but I would guess that we have a few thousand tests that involve EH.
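As a rough illustration of one cell in that matrix, here is a small sketch of a test that checks the order in which destructors run during stack unwinding.  It is not taken from our suite; Tracker, g_log, and the function names are hypothetical.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical bookkeeping: records the order in which destructors ran.
static std::string g_log;

struct Tracker {
    char id;
    explicit Tracker(char c) : id(c) {}
    ~Tracker() { g_log += id; }  // runs during stack unwinding
};

void inner() {
    Tracker b('b');
    throw std::runtime_error("boom");
}

void outer() {
    Tracker a('a');
    inner();
}

// Returns the sequence observed when the exception propagates from
// inner() through outer() to the handler below.
std::string unwind_order() {
    g_log.clear();
    try {
        outer();
    } catch (const std::runtime_error&) {
        g_log += '!';  // the handler body runs after unwinding is done
    }
    return g_log;
}

// Test entry point: the innermost object must be destroyed first.
void test_unwind_order() {
    assert(unwind_order() == "ba!");
}
```

Variations on this one shape (different thrown types, more frames, a goto out of the handler) are what make the number of cases grow so quickly.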

 

Test Permutations

To add to this matrix, the compiler also has a few switches that have big effects on what is done with the code.  The most interesting “set” of switches is /Od, /O1, /O2, /O2 /GL, /O2 /GL /link /ltcg:pgu.  We run most of our tests once under each of these switch combinations.  We have many other switches, but those are tested in a more localized fashion since applying them broadly to most tests is not as interesting as the optimization controls.  The last big dimension of our matrix is /clr – almost all of the tests that we have that are supported under /clr are tested with this switch as well.
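To show how one test fans out across the switch sets listed above, here is a minimal sketch that builds one cl command line per set.  The SwitchSet type and command_lines function are hypothetical and only produce strings; a real run needs more than this (for example, profile data for the profile-guided configuration).

```cpp
#include <cassert>
#include <string>
#include <vector>

// One row of the matrix. Options after /link go to the linker, so they
// are kept separate and placed after the source file.
struct SwitchSet {
    std::string compile;  // options for the compiler itself
    std::string link;     // options forwarded after /link, may be empty
};

// The five switch sets named in the text.
const std::vector<SwitchSet> kSwitchSets = {
    {"/Od", ""},
    {"/O1", ""},
    {"/O2", ""},
    {"/O2 /GL", ""},
    {"/O2 /GL", "/ltcg:pgu"},
};

// Builds one command line per switch set for a single test source.
std::vector<std::string> command_lines(const std::string& source) {
    assert(!source.empty());
    std::vector<std::string> out;
    for (const SwitchSet& s : kSwitchSets) {
        std::string cmd = "cl " + s.compile + " " + source;
        if (!s.link.empty()) {
            cmd += " /link " + s.link;
        }
        out.push_back(cmd);
    }
    return out;
}
```

Multiplying hundreds of thousands of tests by five rows, and again by /clr where it applies, is why only the most valuable switches are applied across the board.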

 

Real-World Code

Given the infinite set of inputs, there is no systematic and efficient way to test everything in the compiler.  However, the C++ compiler has a big advantage in terms of testing – people have already written test cases for us!  Anyone who writes code has a test case.  As a result we rely heavily on what we call “real world code” (RWC), which is just real applications that are under development.  You can be assured that the C++ compiler we ship to you in Visual Studio 2010 has already successfully built Visual Studio itself, Windows, SQL, Office, and many other large software applications.  Another advantage is that this code is under active development, which means that every day brings a new test case as the developers churn on the code.  We frequently release our compiler to internal developers and fix the bugs that we get from these developers.  What is released in Visual Studio 2010 has been through an extensive grind within Microsoft.

 

Performance Testing

Performance testing is a critical part of what we do.  For the optimizing backend, this tells us how well our features are working.  Because the compiler’s output is itself code whose performance matters, performance takes three forms on the backend team.

1.      The first is what we call “code quality”.  This is typically the speed of the code that is generated.  We use a diverse set of performance tests to cover various kinds of code that we might be compiling in the real world.  In most cases we are hyper-sensitive to noise, and even 1% noise on these benchmarks makes it difficult for us to see how we are doing, because true performance regressions can often come in 1% increments.

2.      The second is “code size”.  This is actually closely tied to code quality.  Code size alone is only somewhat interesting in that we don’t want to generate very large images, but it often correlates to code quality.  Optimizations will often trade more code for faster code, the two easiest examples being the inliner (which will reduce call overhead and provide additional optimization opportunities) and the loop unroller.  One disadvantage we have in measuring code quality is that it requires execution of the code in question.  Code size can be measured by just building.  We will not always try to reduce code size for every change we make, but we will watch it as an indicator of how we are changing things.  We will typically measure code size on the benchmarks that we already use for code quality plus some of our real-world code.

3.      The third is “throughput”.  This is a measure of how fast the compiler runs, i.e. the build time.  This is most interesting to watch when optimizations are turned on, because optimization accounts for the bulk of the compiler’s execution time.  We realize that build time is important to you, and we keep an eye on this metric to make sure that you remain productive.
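The point about noise in code quality measurements can be illustrated with a small sketch: a 1% slowdown is only meaningful if the run-to-run spread is smaller than 1%.  The heuristic below (median timing plus a crude min-to-max spread) is an assumption chosen for illustration, not a description of how we actually process benchmark results, and all names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Median of repeated timings; less sensitive to outliers than the mean.
double median(std::vector<double> samples) {
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());
    const std::size_t mid = samples.size() / 2;
    return samples.size() % 2 == 1
        ? samples[mid]
        : (samples[mid - 1] + samples[mid]) / 2.0;
}

// (max - min) / median, used here as a crude estimate of noise.
double relative_spread(const std::vector<double>& samples) {
    assert(!samples.empty());
    const auto [lo, hi] = std::minmax_element(samples.begin(), samples.end());
    return (*hi - *lo) / median(samples);
}

// Flags a regression only when the slowdown is larger than the noise
// seen in either set of runs.
bool is_regression(const std::vector<double>& baseline,
                   const std::vector<double>& candidate) {
    const double slowdown = median(candidate) / median(baseline) - 1.0;
    const double noise =
        std::max(relative_spread(baseline), relative_spread(candidate));
    return slowdown > noise;
}
```

With timings that vary by a few tenths of a percent, a 1% slowdown is flagged; with timings that vary by 4%, the same 1% slowdown is indistinguishable from noise and goes unreported.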

Stress Modes

With all of that said, because there are just so many code patterns and possibilities, we still can’t catch everything with those efforts alone.  This requires us to start getting more creative with the testing.  One area that has shown the most promise is running compiler stress modes.  There are two classes of stress modes that we have:

1.      Ones that actually change the input in some way, such as adding a try/finally around the body of every function, marking every variable as volatile, etc.

2.      Ones that change heuristics for optimizations/analysis, such as inlining every function instead of only the ones that give a benefit.
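To illustrate the first class, here is a sketch of what a “mark every variable as volatile” mode effectively does to an existing test function.  The stressed version is written out by hand here and only covers the locals; the function names are hypothetical.

```cpp
#include <cassert>

// Original test function: the optimizer may keep acc and i in
// registers, unroll the loop, or vectorize it.
int dot_original(const int* a, const int* b, int n) {
    assert(n >= 0);
    int acc = 0;
    for (int i = 0; i < n; ++i) {
        acc += a[i] * b[i];
    }
    return acc;
}

// The same function as the stress mode would effectively see it: each
// access to a local must now be a real load or store, which pushes
// the compiler down different code paths for the same source.
int dot_stressed(const int* a, const int* b, int n) {
    assert(n >= 0);
    volatile int acc = 0;
    for (volatile int i = 0; i < n; i = i + 1) {
        acc = acc + a[i] * b[i];
    }
    return acc;
}

// The stress mode is legal here only if both versions agree.
bool stress_mode_preserves_result() {
    const int a[] = {1, 2, 3, 4};
    const int b[] = {5, 6, 7, 8};
    return dot_original(a, b, 4) == dot_stressed(a, b, 4);
}
```

The expected result of the test does not change, so an existing test and its existing verification can be reused unmodified under the stress mode.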

Stress modes are extremely effective in taking existing tests we do have (including real world code) and creating new and interesting cases out of them.  There are two main challenges with stress modes:

1.      Keeping the matrix from growing too much, which means deciding specifically which tests should be run with each stress mode.

2.      Making sure a stress mode does something that is legal.  That is, a test run under a stress mode might fail only because the stress mode did something that makes the code incorrect.  Depending on the case, it could be considered a bug in the stress mode or it could be an incompatibility between the test and the stress mode.

These aren’t major barriers, just things to keep in mind, since they require us to be selective about which tests we use with our stress modes.

 

Auto Test Generation

 

In certain cases, we can use test case generation tools to assist us in testing parts of the compiler.  One of our team members created a generator for exception handling tests.  Because the matrix is so large for exception handling, and because the cases are easy to construct by modeling the EH code as a tree, we were able to create many different cases on the fly by generating trees where the nodes involved various constructs that later turned into C++ exception handling code.
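The tree idea can be sketched in a few lines.  This is not the generator our team member wrote; it is a minimal illustration under assumed names (Node, emit, sample_case), with only three node kinds, and the emitted text refers to a Tracker type that the generated test would have to define.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical tree model: each node is one EH-related construct.
struct Node {
    enum Kind { Try, Throw, Local } kind;
    std::vector<Node> children;  // only meaningful for Try nodes
};

// Walks the tree and emits the matching C++ source text.
// next_id keeps the names of generated locals unique.
std::string emit(const Node& node, int depth, int& next_id) {
    assert(depth >= 0);
    const std::string pad(static_cast<std::size_t>(depth) * 4, ' ');
    switch (node.kind) {
    case Node::Throw:
        return pad + "throw 1;\n";
    case Node::Local:
        return pad + "Tracker t" + std::to_string(next_id++) + ";\n";
    case Node::Try: {
        std::string out = pad + "try {\n";
        for (const Node& child : node.children) {
            out += emit(child, depth + 1, next_id);
        }
        out += pad + "} catch (int) {\n" + pad + "}\n";
        return out;
    }
    }
    return std::string();
}

// One sample tree: a try block holding a local object and a throw.
std::string sample_case() {
    const Node tree{Node::Try,
                    {Node{Node::Local, {}}, Node{Node::Throw, {}}}};
    int next_id = 0;
    return emit(tree, 0, next_id);
}
```

Generating many trees (randomly or by enumeration) and emitting each one yields many distinct test bodies from a very small amount of generator code.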

 

There is a lot more to explore in this area that we would like to look at for the future.

 

Test Harnessing

 

One thing that should be clear from the above is that it is *extremely* easy to place our tests in a system that can execute them.  We don’t require complicated harnesses since our app is short-running and we don’t have to deal with the UI automation problem that many other testers in the world do because there is no user interaction.  Most of our tests are just batch or simple perl scripts with some .cpp and .h files.  Even for RWC, the applications we build are wrapped in either VS .sln files or some other build tool that the development team produced.  This allows us to focus on writing a large number of small tests very quickly.  The most difficult harnessing problem we face, which isn’t that bad, is the execution of real world code.  For instance, testing the CLR requires installing CLR on a machine, and though we have tools to do this, it is certainly much harder than our feature tests that typically compile to a simple .exe that can be run.

 

The downside of this is that our tests only get one entry into the application under test.  That is, we just provide some source files and some switches, and we have to make sure that this input tests exactly what we want even if it is in one of the later phases of the compiler.  This makes it very difficult to test something like the register allocator, which executes much later in the compiler, in a targeted fashion.  In theory, the register allocator could be targeted by testing that phase in isolation, assuming it had APIs to test with, but this is not something we currently have.  Even then, coming up with a full set of test cases for the register allocator is a fairly hard problem due to the complexity.

 

This provides you with an overview of how we approach testing in the compiler backend, and how we ensure that you are receiving high-quality compilers from us.

 

Thanks!

Alex Thaman

Senior Test Lead

Visual C++ Team

  • Very interesting post - thanks for giving us a peek into the world of compiler testing!

  • Excellent post. It would be very nice to see this kind of compiler-related posts more frequently.

  • Nice post. A couple of comments:

    Regarding new Phoenix backend. Here some parts of compiler's pipeline can be written in a managed code. Can you guess how it can affect codegeneration throughput? Seems like a whole model of compiler is nicely fit into request-response model and GC would not hinder performance.

    Another weird thing for me is that backend is completely sequential. That true that many projects can be compiled concurrently on a solution level... but I am not convinced. There must be a lot of places, including optimization phase mentioned by you, where concurrency is natural. Have you got some research about performance in/de-crease in case of entire backend or major pipeline stages would be concurrent.

    Compiler is written in C. Good. Can you describe your envision of evolution of tools/languages at backend level from testing perspective? And the same from historical view? (Its just plainly interesting blogs.msdn.com/.../my-history-of-visual-studio-part-1.aspx).

    I am not sure that Ive got your idea about compiling solution (RWC) and checking... what actually do you check when you get foo.exe? Any work at this level sounds like a big pain. I would like to read more (perhaps another post) about testing generated code on an abstract program level, checking tree structures, etc.

    Thanks Alex.

  • @tivadj

    Re:  Phoenix compiler - In general managed code in a CPU intensive application like the compiler will incur some hit.  Additionally the structure of the code in the Phoenix compiler is very different to enable better extensibility and there is a slowdown from that as well. GC itself does not hinder performance, but things like additional indirect calls (a side effect of the extensibility) do.

    The tooling we provide tends to follow which development stories are the most important.  The testing evolves based on what kinds of things become important.  While I can't reveal any plans for our next release, we'd like to get a cadence of these kinds of testing posts and that will help explain how we are evolving our testing to match our new directions.  So stay tuned :)

    We have looked at concurrency opportunities within the backend and will continue investigating this.

    In the case of building real world code, we generally have to execute actual tests against the application.  As an example, we boot each Windows build we do and sometimes run a stress test that they use to stress the kernel.  For SQL we run the SQL "smoke" tests.  Yes this adds a lot of process for us but it provides great testing so we pay the cost.

    Hope that helps!

  • Recently needed to port code from GCC.   VC was giving me  "internal error - stack overflow" on file with 500 lines of code.   Found bug report, MS marked it as "won't be fixed".  It is only 500LOC!

    Ending up with splitting the file into two smaller files.   And BTW, why C99's  stdint.h still not supported?

  • Very interesting insight into the 'underneath it all' world :)

    Thumbsup from me!

  • Hi Alex,

    I have a quick question about the new code generation for SSE int->float conversion in VS2010.

    A project specifies /arch:SSE and converts 32 bit integers from memory to an xmm register using this:

     00012 66 0f 6e 44 24 04 movd xmm0, DWORD PTR _a$[esp-4]

     00018 0f 5b c0 cvtdq2ps xmm0, xmm0

     0001b f3 0f 11 46 10 movss DWORD PTR [esi+16], xmm0

    Which is fine on my C2D.  But isn't cvtdq2ps an SSE2 instruction?  I haven't had a chance to pull my Athlon XP out of storage to try it.

    Thanks and best regards,

    David

  • @Leonid, do you have the Connect bug number?

    Thanks,

    Vikas

    VC++ Team

  • @Leonid: Thanks for your feedback regarding C99. Unfortunately, we have not had the resources to implement C99 features. As of now, we do not intend on implementing C99 support, but that may change depending upon customer feedback.

  • I wish you had gone into specifics on compiler testing strategies and automated test generation. That would have been great. Nice post anyways. Eager to hear more stuff on the same topic.

    As as aside, does any compiler team at MS use SpecExplorer and Spec# for model based testing?

  • @David

    Yep this is a known issue and it will be addressed in the next full release of VS, and possibly the service pack for VS 2010.

    @Elroy

    Thanks for the feedback!  My intent was this to be the first in a series of blogs about testing of the compiler.  We have a post coming shortly with details of how we do benchmarking.  Hopefully we can address some of your suggestions in upcoming blogs as well.

    I'm not currently aware of a team that does model based testing for any of our compiler technologies but there are a few such teams throughout the company so I can't say for sure.

    Alex Thaman

    Senior Test Lead

    Visual C++ Team

  • ugh.  and I wanted VC++ 2010 because I thought it had C99 and stdint.h.  I guess I was wrong and I will have to wait until VC++ 2015.  at least gcc has it.  I would still like to have a microsoft compiler so I can compile ffmpeg.
