I am a developer at Microsoft and work in the .NET Common Language Runtime (CLR) team. For the last 4 years I have been working on virtual machine technologies on a variety of form factors including desktops (Windows, Linux), tablets (Win8), gaming-consoles (Xbox 360), mobile devices (Windows Phone 7, Windows CE, Symbian).I have worked on various core pieces of the runtime including Garbage Collector, memory manager, platform abstraction layer, runtime-performance, etc.Before working on .NET I worked on Visual Studio Team Foundation Server, Visual Studio Team System, Adobe Framemaker, Adobe Acrobat, Texas Instrument's Code Composer Studio.
These days everyone is talking about being agile and test driven development (TDD). I wanted to share a success story of TDD that we employed for developing Generational Garbage Collector (GC) for Windows Phone Mango.
The .NET runtime on Windows Phone 7 shipped with a mark-sweep-compact; stop the world global non-generational GC. Once a GC was triggered, it stopped all managed execution and scanned the entire managed heap to look up all managed references and cleaned up objects that were not in use. Due to performance bottleneck we decided to enhance the GC by adding a generational GC (referred to as GenGC). However, post the General Availability or GA of WP7 we had a very short coding window. Replacing such a fundamental piece of the runtime in that short window was very risky. So we decided to build various kinds of stress infrastructure first, and then develop the GC. So essentially
Now building tests for a GC is not equivalent of traditional testing of features or APIs where you write tests to call into mocked up API, see it fail until you add the right functionality. Rather these tests where verifications modes and combination of runtime stresses that we wrote.
To appreciate the testing steps we took do read the Back To Basics: Generational Garbage Collection and WP7 Mango: Mark-Sweep collection and how does a Generational GC help posts
Essentially in a generational GC run all of the following references should be discovered by the GC
The first two were anyway heavily covered by our traditional GC tests. #3 being the new area being added.
To implement a correct generational GC we needed to ensure that at all places in the runtime where managed object references are updated they need to get reflected in the CardTable (#3 above). This is a daunting task and prone to bugs via omission as we need to ensure that
If a single instance is missed then it would result in valid/reachable Gen0 objects being collected (or deleted) and hence in the longer run result in memory corruption, crashes that will be hard if not impossible to debug. This was assessed to be the biggest risk to shipping generational GC.
The other problem is that these potential omissions can be only exposed by certain ordering of allocation and collection. E.g. only a missing tracked reference of A –> B can result in a GC issue only if a GC happened in between allocations of A and B (A is in higher generation than B). Also due to performance reasons (write atomicity for lock-less updates) for every assignment of A = B we do not update the card-table bit that covers the memory area of A. Rather we update the whole byte in the card-table. This means an update to A will cover other objects allocated adjacent to A. Hence if an update to an object just beside A in the memory is missed it will not be discovered until some other run where that object lands up being allocated farther away from A.
Our solution to all of these problems was to first create the GC verification mode. What this mode does is runs the traditional full mark-sweep GC. While running that GC it goes through all objects in the memory and as it traverses them for every reference A (Gen1) –> B(Gen0), it verifies that the card table bit for A is indeed set. This ensures that if a GenGC was to run, it would not miss that references
We used very high granular card-table resolution for test runs. For these special runs each bit of the card-table corresponded to almost one object (1 bit to 2 byte resolution). Even though the card-table size exploded it was fine because this wasn’t a shipping configuration. This spaced out objects covered by the card-table and exposed adjacent objects not being updated.
In addition we ran the GC stress mode, where we made the GC run extremely frequently (we could push it up to a GC in every allocation). The allocator was also updated to ensure that allocations were randomized so that objects moved around everywhere in the memory.
Hole finder moves all objects around in memory after a GC. This exposes stale pointer issues. If an object didn’t get updated properly due to the GC it would now point to invalid memory because all previous memory locations are now invalid memory. So a subsequent write will fail-fast with AV and we can easily detect that point of failure.
With all of these changes we ran the entire test suites. Also by throttling down the GC Stress mode we could still use the runtime to run real apps on the phone. Let me tell you playing NFS on a device with the verification mode, wasn’t fun :)
With this precaution we ensured that not a single GenGC bug has come in from the phone. It shipped rock solid and we were more confident with code churn because regressions would always be caught. I actually never blogged about this because I felt that if I do, it’ll jinx something :)
For some time now, my main box got a bit slow and was glitching all the time. After some investigation I found that some power profile imposed by our I T department enabled CPU parking on my machine. This effectively parks CPU on low load condition to save power. However,
Windows Task Manager (Ctrl + Shift + Esc and then Performance tab) clearly shows this parking feature. 3 parked cores show flatlines
You can also find out if your machine is behaving the same from Task Manager -> Performance Tab -> Resource Monitor -> CPU Tab. The CPU graphs on the right will show which cores if any are parked.
To disable this you need to
For detailed steps see the video http://www.youtube.com/watch?feature=player_detailpage&v=mL-w3X0PQnk#t=85s
Everything is back to normal once this is taken care off
I have worked on bunch of large scale software product development that spanned multiple teams, thousands of engineers, many iterations. Some of those products plug into even larger products. So obviously we deal with a lot of bugs. Bugs filed on us, bugs we file on other partner teams, integration bugs, critical bugs, good to have fixes and plan ol crazy bugs. However, I have noticed one approach to handling those bugs which kind of bugs me.
Some people just cannot make hard calls on the gray area bugs and keeps hanging them around. They bump the priority down and take those bugs from iteration to iteration. These bugs always hang around as Priority 2 or 3 (in a scale of Priority 0 through 3).
I personally believe that bugs should be dealt with the same way emails should be dealt with. I always follow the 4 Ds for real bugs.
If the bug meets the fix bar and is critical enough, just go fix it. This is the high priority critical bug which rarely gives rise to any confusion. Something like the “GC crashes on GC_Sweep for application X”
If you feel a bug shouldn’t be fixed then simply resolve the bug. Give it back to the person who created the bug with clear reasons why you feel this shouldn’t be fixed. I know it’s hard to say “won’t fix”, but the problem is if one was to fix every bug in the system then the product will never ship. So just don’t defer the thought of making this hard call. There is no benefit in just keeping the bug hanging around for 6 months only to make the same call 2 weeks prior to shipping, citing time constraint as a reason. Make the hard call upfront. You can always tag the bug appropriately so that for later releases you can relook.
Once you have made a call to fix the bug it should be correctly prioritized. With all bugs you know you won’t be fixing out of the way, its easy to make the right call of fixing it in the current development cycle or move it to the next iteration.
If you feel someone else needs to take a look at the bug because it’s not your area of expertise, QA needs to give better repro steps, someone else needs to make a call on it (see #1 and #2 above) then just assign it to them. No shame in accepting that you aren’t the best person to make a call.