Finally I have some time to continue my series on how we did stress testing on the languages team to uncover crashes, hangs and leaks in Visual Studio 2010.
As a reminder, Part 1 gave a general overview of what we do for stress (8-hour runs) and Part 2 listed the parameters that we’re measuring (VM, working set, private memory, committed memory, managed memory, GC handles, CCW, GDI handles, USER objects, OS handles and the number of threads).
In this part, I'll say more about what types of tests we run and how we look for bugs. As a reminder, I was responsible for stress-testing four parts of VS: the C# IDE, the VB IDE, the F# IDE and the Hostable Editor (the Workflow designer can host our compilers to provide IntelliSense and colorization for its expression editors).
On each team, we had one master test script that invoked each feature sequentially in an end-to-end fashion. For example, for C# I had a test called ‘AllCSharpFeatures’ that invokes 28 features (we say it has 28 ‘scenarios’). Here they are:
The test is written in such a way that running all 28 scenarios sequentially models a typical user working with the features (the scenarios are augmented with typing, inserting and removing text to keep the flow natural). Moreover, if you disable any of the scenarios, the remaining workflow can still execute (they’re independent of each other). Finally, each scenario is repeatable, so you can call it two or more times in a row (“idempotent”).
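To make the structure concrete, here is a minimal sketch (in Python, for brevity; the real tests were .NET test scripts, and the scenario names here are just a hypothetical subset of the 28) of a master test built from independent, repeatable scenarios:

```python
# Each scenario is a self-contained, repeatable function; the bodies are
# placeholders standing in for real feature invocations in the IDE.
def formatting():
    return "formatting"

def colorization():
    return "colorization"

def navigate_to():
    return "navigate_to"

# The master test is just an ordered list of scenarios.
ALL_SCENARIOS = [formatting, colorization, navigate_to]

def run_end_to_end(scenarios):
    """One end-to-end pass: every scenario exactly once, in order."""
    return [scenario() for scenario in scenarios]
```

Because the scenarios don't depend on one another, running any subset (say, `run_end_to_end(ALL_SCENARIOS[1:])`) is just as valid as running the full sequence, which is exactly what makes the per-scenario looping described below possible.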
While the test is running, we record the parameters that I mentioned in Part 2 to an XML-based log file, before and after every scenario.
Now, with all that set up, we run the test in one of several modes. The first mode is called end-to-end stress iterations: we repeat the entire sequence 30 or 100 times (or for 8 hours), and after it’s done, the XML log is converted to an Excel spreadsheet with charts like this:
This is an 8-hour run from November 19, 2009. The main test looped 52 times (a total of 1474 executed scenarios). We can clearly see that we’re leaking about 70 MB of working set and about the same amount of VM. However, from this chart alone it’s hard to tell which feature is leaking.
The next step is to run the test in ‘scenario iterations’ mode. First, we run the entire sequence once to “warm up” (load all modules, JIT all the code, fill the caches), and then we loop the first scenario (Formatting) 30 times, then the second scenario (Colorization) 30 times, and so on. The fact that the scenarios are composable and repeatable is what allows us to do this. Now we get a different picture (notice the warm-up growth at the very beginning):
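The scenario-iterations mode can be sketched as follows (again in Python for brevity; the handle counter and the leaky scenario are simulated stand-ins, since the real harness read actual OS resource counters):

```python
# Hypothetical process-wide counter standing in for a real OS handle count.
handle_count = 0

def navigate_to():
    """Simulated leaky scenario: 'leaks' 20 handles per invocation."""
    global handle_count
    handle_count += 20

def formatting():
    """Simulated leak-free scenario."""
    pass

def run_scenario_iterations(scenarios, iterations=30):
    # Warm-up pass: run everything once so one-time allocations
    # (module loads, JIT, caches) are excluded from the measurement.
    for scenario in scenarios:
        scenario()
    results = {}
    for scenario in scenarios:
        before = handle_count
        for _ in range(iterations):
            scenario()
        # Per-iteration growth, e.g. 600 handles / 30 iterations = 20.
        results[scenario.__name__] = (handle_count - before) / iterations
    return results
```

Dividing the total growth over a block of iterations by the iteration count is what turns a raw chart into an actionable number like “20 handles leaked per iteration.”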
A handle is typically a native resource, such as a file, registry key, thread or synchronization object. This chart clearly shows that during our “Navigate To” scenario (that emulates the user pressing Ctrl+, searching for a member and navigating there) we lose around 600 OS handles, which, divided by 30 iterations, gives us a leak of 20 handles per iteration. Now we have a bug with repro steps (using Navigate To leaks 20 handles per iteration) and our developer can use htrace to find the leak.
The GC Handles page shows a leak of GC handles in the += event handler generation code, as well as in our Edit-and-Continue scenario. We can then use !gchandles in sos.dll to find and isolate the bug. The huge spike in EnC means that when the debugger starts, we allocate a large amount of resources, and when it stops, we free almost all of them back. The difference (what the debugger didn’t release) is the actual leak. You can also tell that the debugger leaks these continuously rather than in one swoop, from the ragged upward slope while the debugger is running.
Finally, this sample shows a leak of COM-callable wrappers (CCWs) in the Generate From Usage scenario.
To reiterate, we can run our test in two modes: a sequential loop to stress end-to-end feature usage, and stressing each feature separately (with one warm-up iteration to separate resource initialization from real leaks). In the comments to Part 1, David Nelson asked whether we run our tests in combination with each other as well as in isolation; the answer is yes, we do. We can run each scenario separately for 8 hours to stress it on its own and determine the exact amount of leakage without external noise, and we can also run a more real-world stress loop, where the scenarios work together and are invoked one after another in combination.
This way we can also measure the duration of each scenario and compare it to earlier builds, to see whether we have improved performance or accidentally introduced a slowdown in any of the scenarios. An interesting phenomenon that I observed at the early stages was that all of our scenarios were mysteriously slowing down after hours of work. After a lot of head-scratching, I tracked it down to the function that was writing stress data to the XML log. After 5-6 hours of constant logging, our XML stress log grew to 15-20 MB, and we were using the DOM to read the entire XML file, add an entry and write it all back. No wonder our scenarios slowed down! I replaced this code with a simple File.AppendAllText, which appends each entry in O(1), and we were all good again (reminder: next time don’t forget to append </StressLog> at the end of the test!).
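The difference between the two logging approaches can be sketched like this (in Python for brevity; the original fix used .NET’s File.AppendAllText, and the element names here are illustrative):

```python
def append_entry_slow(path, entry):
    """Re-reads and rewrites the whole log for every entry: O(n) per
    append, so O(n^2) over the run, and slower as the log grows."""
    with open(path) as f:
        doc = f.read()
    # Insert the new entry just before the closing tag, then write it all back.
    doc = doc.replace("</StressLog>", entry + "\n</StressLog>")
    with open(path, "w") as f:
        f.write(doc)

def append_entry_fast(path, entry):
    """Appends only the new entry in O(1); the closing </StressLog> tag
    must be written once, when the run finishes."""
    with open(path, "a") as f:
        f.write(entry + "\n")
```

The trade-off is exactly the one the reminder above alludes to: the fast version leaves the file as a not-yet-well-formed XML fragment until the closing tag is appended at the end of the run.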
To conclude, I’d like to list some pointers to different tools we use to measure resources and track down leaks.
From your list I don't recognise number 14, Metadata As Source. Any more info on what it is?
Is there a full listing of all the VS2010 features and how to invoke them?
Thanks, a great series! Part 2, btw was one of the most comprehensive explanations of what are the different process memory types about that I've seen :)
Arjang - Metadata As Source is when you, for example, put the cursor on the word "string" in your C# code and press F12 (Go To Definition), and it displays the source code for the System.String class in the BCL (without method bodies though, unlike Reflector).
MSDN will likely have a full list of all C# IDE features: http://msdn.microsoft.com/en-us/library/ms173063(v=VS.100).aspx