I know the answer (it's 42)

A blog on coding, .NET, .NET Compact Framework and life in general....

October, 2012

Posts
  • I know the answer (it's 42)

    Test Driven Development of a Generational Garbage Collection

    • 2 Comments
    Crater Lake, OR panorama

    These days everyone is talking about being agile and test driven development (TDD). I wanted to share a success story of TDD that we employed for developing Generational Garbage Collector (GC) for Windows Phone Mango.

    The .NET runtime on Windows Phone 7 shipped with a mark-sweep-compact; stop the world global non-generational GC. Once a GC was triggered, it stopped all managed execution and scanned the entire managed heap to look up all managed references and cleaned up objects that were not in use. Due to performance bottleneck we decided to enhance the GC by adding a generational GC (referred to as GenGC). However, post the General Availability or GA of WP7 we had a very short coding window. Replacing such a fundamental piece of the runtime in that short window was very risky. So we decided to build various kinds of stress infrastructure first, and then develop the GC. So essentially

    1. Write the tests
    2. See those tests failing
    3. Write code for the generational GC
    4. Get tests to pass
    5. Use the tests for regression tracking as we refactor the code and make it run faster

     

    Now building tests for a GC is not equivalent of traditional testing of features or APIs where you write tests to call into mocked up API, see it fail until you add the right functionality. Rather these tests where verifications modes and combination of runtime stresses that we wrote.

    To appreciate the testing steps we took do read the Back To Basics: Generational Garbage Collection and WP7 Mango: Mark-Sweep collection and how does a Generational GC help posts

    Essentially in a generational GC run all of the following references should be discovered by the GC

    1. Gen0 Objects reachable via roots (root –> object OR recursively root –> object –> object )
    2. Objects accessible from runtime native code (e.g. pinned for PInvoke, COM interop, internal runtime references)
    3. Objects referenced via Gen1 –> Gen0 pointers

    The first two were anyway heavily covered by our traditional GC tests. #3 being the new area being added.

    To implement a correct generational GC we needed to ensure that at all places in the runtime where managed object references are updated they need to get reflected in the CardTable (#3 above). This is a daunting task and prone to bugs via omission as we need to ensure that

    1. All forms of assignments in MSIL that we JIT there are calls to update the CardTable.
    2. All places in the native runtime code where such references are directly and or indirectly updated the same thing is ensured. This includes all JIT worker-routines, COM, Marshallers.

     

    If a single instance is missed then it would result in valid/reachable Gen0 objects being collected (or deleted) and hence in the longer run result in memory corruption, crashes that will be hard if not impossible to debug. This was assessed to be the biggest risk to shipping generational GC.

    The other problem is that these potential omissions can be only exposed by certain ordering of allocation and collection. E.g. only a missing tracked reference of A –> B can result in a GC issue only if a GC happened in between allocations of A and B (A is in higher generation than B). Also due to performance reasons (write atomicity for lock-less updates) for every assignment of A = B we do not update the card-table bit that covers the memory area of A. Rather we update the whole byte in the card-table. This means an update to A will cover other objects allocated adjacent to A. Hence if an update to an object just beside A in the memory is missed it will not be discovered until some other run where that object lands up being allocated farther away from A.

    GC Verification mode

    Our solution to all of these problems was to first create the GC verification mode. What this mode does is runs the traditional full mark-sweep GC. While running that GC it goes through all objects in the memory and as it traverses them for every reference A (Gen1) –> B(Gen0), it verifies that the card table bit for A is indeed set. This ensures that if a GenGC was to run, it would not miss that references

    Granular CardTable

    We used very high granular card-table resolution for test runs. For these special runs each bit of the card-table corresponded to almost one object (1 bit to 2 byte resolution). Even though the card-table size exploded it was fine because this wasn’t a shipping configuration. This spaced out objects covered by the card-table and exposed adjacent objects not being updated.

    GC Stress

    In addition we ran the GC stress mode, where we made the GC run extremely frequently (we could push it up to a GC in every allocation). The allocator was also updated to ensure that allocations were randomized so that objects moved around everywhere in the memory.

    Hole Finder

    Hole finder moves all objects around in memory after a GC. This exposes stale pointer issues. If an object didn’t get updated properly due to the GC it would now point to invalid memory because all previous memory locations are now invalid memory. So a subsequent write will fail-fast with AV and we can easily detect that point of failure.

    With all of these changes we ran the entire test suites. Also by throttling down the GC Stress mode we could still use the runtime to run real apps on the phone. Let me tell you playing NFS on a device with the verification mode, wasn’t fun :)

    With this precaution we ensured that not a single GenGC bug has come in from the phone. It shipped rock solid and we were more confident with code churn because regressions would always be caught. I actually never blogged about this because I felt that if I do, it’ll jinx something :)

  • I know the answer (it's 42)

    Core Parking

    • 6 Comments
    Rainier

    For some time now, my main box got a bit slow and was glitching all the time. After some investigation I found that some power profile imposed by our I T department enabled CPU parking on my machine. This effectively parks CPU on low load condition to save power. However,

    1. This effects high load conditions as well. There is a perceptible delay between the load hitting the computer and the CPU’s being unparked. Also some CPU’s remain parked for spike loads (large but smaller duration usage spikes)
    2. This is a desktop and not a laptop and hence I really do not care about power consumption that much

    Windows Task Manager (Ctrl + Shift + Esc and then Performance tab) clearly shows this parking feature. 3 parked cores show flatlines

    clip_image002

    You can also find out if your machine is behaving the same from Task Manager -> Performance Tab -> Resource Monitor -> CPU Tab. The CPU graphs on the right will show which cores if any are parked.

    clip_image004

    To disable this you need to

    1. Find all occurrences of 0cc5b647-c1df-4637-891a-dec35c318583 in the Windows registry (there will be multiple of those)
    2. For all these occurrences set the ValueMin and ValueMax keys to 0
    3. Reboot

    For detailed steps see the video http://www.youtube.com/watch?feature=player_detailpage&v=mL-w3X0PQnk#t=85s

    Everything is back to normal once this is taken care off Smile

    clip_image006

  • I know the answer (it's 42)

    Managing your Bugs

    • 0 Comments
    Rainier

    I have worked on bunch of large scale software product development that spanned multiple teams, thousands of engineers, many iterations. Some of those products plug into even larger products. So obviously we deal with a lot of bugs. Bugs filed on us, bugs we file on other partner teams, integration bugs, critical bugs, good to have fixes and plan ol crazy bugs. However, I have noticed one approach to handling those bugs which kind of bugs me.

    Some people just cannot make hard calls on the gray area bugs and keeps hanging them around. They bump the priority down and take those bugs from iteration to iteration. These bugs always hang around as Priority 2 or 3 (in a scale of Priority 0 through 3).

    I personally believe that bugs should be dealt with the same way emails should be dealt with. I always follow the 4 Ds for real bugs.

    Do it

    If the bug meets the fix bar and is critical enough, just go fix it. This is the high priority critical bug which rarely gives rise to any confusion. Something like the “GC crashes on GC_Sweep for application X”

    Delete it (or rather resolve it)

    If you feel a bug shouldn’t be fixed then simply resolve the bug. Give it back to the person who created the bug with clear reasons why you feel this shouldn’t be fixed. I know it’s hard to say “won’t fix”, but the problem is if one was to fix every bug in the system then the product will never ship. So just don’t defer the thought of making this hard call. There is no benefit in just keeping the bug hanging around for 6 months only to make the same call 2 weeks prior to shipping, citing time constraint as a reason. Make the hard call upfront. You can always tag the bug appropriately so that for later releases you can relook.

    Defer it

    Once you have made a call to fix the bug it should be correctly prioritized. With all bugs you know you won’t be fixing out of the way, its easy to make the right call of fixing it in the current development cycle or move it to the next iteration.

    Delegate it

    If you feel someone else needs to take a look at the bug because it’s not your area of expertise, QA needs to give better repro steps, someone else needs to make a call on it (see #1 and #2 above) then just assign it to them. No shame in accepting that you aren’t the best person to make a call.

Page 1 of 1 (3 items)