June, 2009

Larry Osterman's WebLog

Confessions of an Old Fogey
  • Larry Osterman's WebLog

    Thinking about Last Checkin Chicken

    • 8 Comments

    Raymond Chen’s post today started me thinking about “Last Check-in Chicken” again.  Back in the says when we were close to shipping Windows Vista, I wrote about ”Last Check-in Chicken”.  What I didn’t mention was who ultimately won the game for Windows Vista.

    It turns out that the very last change to Windows Vista was actually made by one of the developers on the sound team.

     

    When you reach the last few days of a project, the bar for taking changes is insanely high – the teams which approve changes to the product get increasingly more conservative about taking changes – every change taken is an opportunity for regression and resets some amount of the testing which has gone before.  So the number of bugs that are accepted towards the end of a product gets smaller and smaller. You can think of the ability to take bugs as a series of ever increasingly high barriers – it starts fairly low – just about any bug fix will be accepted into the tree.  This is the normal state during most of product development.  As time goes on and the team gets closer to shipping, the bug bar gets raised and the bugs that are considered are only those that are going to affect customers directly (as opposed to those bugs found during testing won’t necessarily be encountered by customers).  Then the bar gets raised again (and again, and again) until eventually it gets to the point where the only bugs that are accepted are “recall class” bugs[1].

    The idea behind a “recall class bug” is that it’s is a bug that is so bad that we’d be willing to call the manufacturer and pull the product off the assembly line (at a cost of millions of dollars) to fix.  These are the worst-of-the-worst bugs, and typically involve major scenarios not working.   When the bug bar is at “recall class only”, there are typically only two or three bugs that are considered each day across all of Windows and even then most of the bugs brought up to the triage team aren’t accepted.

    At some point the bug bar gets beyond even “recall class only” – this is when you’re REALLY close to being done (typically the last two or three days of a product).  Normally builds of the product are done daily because there are one or two “recall class” bugs still being accepted.  But eventually all those bugs are fixed and the build team stops doing daily builds because there have been no changes since the previous build.  The test team is hard at work doing it’s final sign-off of the bits and everybody is on tenterhooks waiting for the final build to come out.  When you’re at this stage of the product, every once in a while a change comes in that would be really nice to have because it fixes a critical issue with an important scenario, but it’s just just not important enough to justify cracking open the bits to take the change.  Raymond calls these type of changes “Remora Check-ins”.   The idea is that if another bug was discovered during the final testing phase that forced us to rebuild the system, we would take these “Remora Check-ins” along for the ride.

    In our case, the change we made was a Remora check-in – it was an important bug, but it wasn’t important enough to justify resetting the final test pass.  But someone else’s component had a critical bug that HAD to be fixed and our change came along for the ride (and no, I don’t remember exactly what either of the changes were, I just know that our check-in was chronologically the last one made).

     

    Nitpickers corner: None of the information in this post should be particularly controversial – much of what I’ve described here is software engineering 101.  There’s always a bar for taking bug fixes in every product – if there weren’t, you’d never ship the product (for example, the Mozilla Foundation shipped Firefox version 3.5 today (congrats!) and they still have several dozens of critical bugs active in their database – I’m sure that these are all bugs that didn’t meet their bug bar).  Heck, there’s even a book that’s all about the process of shipping NT 3.1 that covers much of this information.

     

    ----

    [1] In the past these bugs would be called “Show Stoppers”.

  • Larry Osterman's WebLog

    What’s wrong with this code, part 26 – the answer

    • 6 Comments

    Yesterday I posted a code snippet from inside a real piece of code inside the client side of a client/server utility in a Microsoft product.

    static DWORD WINAPI _PlayBeep(__in void* pv)
    {
        UNREFERENCED_PARAMETER(pv);
        PlaySound(L".Default", NULL, SND_SYNC | SND_ALIAS);
        return 0;
    }
    
    LRESULT WndProc(...)
    {
        :
        :
        case WM_KEYDOWN:
            if (!_AcceptInputKeys(wParam, lParam))
            {
                QueueUserWorkItem(_PlayBeep, NULL, 0);
            }
            break;
    }

    The bug here was that the call to PlaySound is synchronous.  Like many Win32 APIs, PlaySound is made thread-safe by taking a critical section around the code that does the work inside the call to PlaySound (since the semantics of PlaySound are that only one PlaySound call plays at any time, this is a reasonable thing to do).  Because the call to PlaySound specifies SND_SYNC, that processing is done on the thread that calls PlaySound and thus blocks the _PlayBeep call until the sound completes.  The “.Default” sound is approximately 1 second long, so the call to _PlayBeep takes about a second to complete.

    Unfortunately keyboards repeat at a lot faster than 1 keystroke/second and each keystroke causes a call to QueueUserWorkItem.  Because there’s no thread available to process the request, the thread pool logic creates a new thread to process the next work item for each keystroke.  This quickly backs up and eventually you run into the 500 thread limit in the default process threadpool.  So you’re going to have the machine play “Ding’s” for hours before this cleans itself up.

    But in this case the effects were was even worse. 

    Remember when I told you that it was a client/server utility?  In this case, it meant that it used async RPC to communicate with the server.  And async RPC uses the default thread pool for on certain applications.  When the response to a server request came in, there were no RPC threads running so RPC tried to create a thread in the default threadpool.  Which failed because the default threadpool was full.

    So the client/server utility stopped working.  It took about 15 seconds of hammering on invalid keys to get the utility into this state, NOT pretty.

     

    The first fix was to change the WM_KEYDOWN to WM_KEYUP (so that you actually had to release the keys instead of letting the typematic feature repeat for you).  That didn’t work because it still could be reproduced, it just took longer.

    The final fix was:

    static DWORD WINAPI _PlayBeep()
    {
        PlaySound(L".Default", NULL, SND_ASYNC | SND_ALIAS);
        return 0;
    }
    
    LRESULT WndProc(...)
    {
        :
        :
        case WM_KEYDOWN:
            if (!_AcceptInputKeys(wParam, lParam))
            {
                _PlayBeep();
            }
            break;
    }

    So instead of queueing a work item, the team changed the code to simply call PlaySound with SND_ASYNC and they were done.  When you specify SND_ASYNC, the PlaySound call queues the request to an internal-to-playsound worker thread, and it cancels the old sound before it starts playing a new sound (thus the sounds don’t back up).

     

    The object lesson here is: Don’t queue long running items to the thread pool because it can lead to unexpected results.  And even a 1 second “Ding” can count as “long running” if it can be queued rapidly enough.

  • Larry Osterman's WebLog

    What’s wrong with this code, part 26 – a real-world example

    • 28 Comments

    This is an example of a real-world bug that was recently fixed in an unreleased Microsoft product.  I was told about the bug because it involved the PlaySound API (and thus they asked me to code review the fix), but it could happen with any application.

    static DWORD WINAPI _PlayBeep(__in void* pv)
    {
        UNREFERENCED_PARAMETER(pv);
        PlaySound(L".Default"NULL, SND_SYNC | SND_ALIAS);
        return 0;
    }
    
    LRESULT WndProc(...)
    {
        :
        :
        case WM_KEYDOWN:
            if (!_AcceptInputKeys(wParam, lParam))
            {
                QueueUserWorkItem(_PlayBeep, NULL, 0);
            }
            break;
    }

     

    This is actual code from inside the client side of a client/server component in Windows that was attempting to “beep” on invalid input (I’ve changed the code slightly to hide the actual origin and undoubtedly introduced issues).  And it has a whopper of a bug in it.

    Given the simplicity of the code above, to get the answer right, it’s not enough to say what’s wrong with the code (the problem should be blindingly obvious).  You also need to be able to explain why this is so bad (in other words, what breaks when you do this).

     

    Bonus points if you can identify the fix that was eventually applied.

  • Larry Osterman's WebLog

    Windows 7 fixes the PlaySound(XXX, SND_MEMORY|SND_ASYNC) anti-pattern

    • 0 Comments

    A number of times in the past, I’ve mentioned that the PlaySound(xxx, xxx, SND_MEMORY|SND_ASYNC) pattern is almost always a bad idea.  After the last wave of crash dumps were received for this problem, our team decided to do something about it.  Starting with Windows 7, if you call PlaySound with SND_MEMORY|SND_ASYNC, instead of relying on the memory passed in by the application, we allocate our own buffer for the sound file on the heap and copy the file into that buffer.  We’ll only do it for WAV files that are smaller than 2M in size, and if the allocation of the buffer fails, we fall back on the original code path, but it should dramatically reduce the number of apps that crash while using this pattern.

    It’s a little thing, but it should make life much easier for those applications.

  • Larry Osterman's WebLog

    Good News! strlen isn’t a banned API after all.

    • 6 Comments

    We were doing some code reviews on the new Win7 SDK samples the other day and one of the code reviewers noticed that the code used wcslen to compute the length of a string.

    He pointed out that the SDL Banned API page calls out strlen/wcslen as being banned APIs:

    For critical functions, such as those accepting anonymous Internet connections, strlen must also be replaced:

    Table 19. Banned string length functions and replacements

    Banned APIs StrSafe Replacement Safe CRT Replacement
    strlen, wcslen, _mbslen, _mbstrlen, StrLen, lstrlen String*Length strnlen_s

    I was quite surprised to see this, since I’m not aware of any issues where the use of strlen/wcslen could cause security bugs.

     

    I asked Michael Howard about this and his response was that Table 19 has a typo – the word “server” is missing in the text, it should be “For critical server functions, such as those accepting anonymous Internet connections, strlen must also be replaced”. 

    Adding that one word makes all the difference.  And it makes sense – if you’re a server and accepting anonymous data over the internet, an attacker could cause you to crash by issuing a non null terminated string that was long enough – banning the API forces the developer to think about the length of the string.

    Somewhat OT, but I also think that the table is poorly formatted – the “For critical…” text should be AFTER the table title – the way the text is written, it appears to be a part of the previous section instead of being attached as explanatory text on Table 19 (but that’s just the editor in me).

     

    Apparently in SDL v5.0 (which hasn’t yet shipped) the *len functions are removed from the banned API list entirely.

Page 1 of 1 (5 items)