September, 2012

  • The Old New Thing

    When you transfer control across stack frames, all the frames in between need to be in on the joke

    • 13 Comments

    Chris Hill suggests discussing the use of structured exception handling as it relates to the window manager, and specifically the implications for applications which raise exceptions from a callback.

    If you plan on raising an exception and handling it in a function higher up the stack, all the stack frames in between need to be in on your little scheme, because they need to be able to unwind. (And I don't mean "unwind" in the "have a beer and watch some football" sense of "unwind".)

    If you wrote all the code between the point the exception is raised and the point it is handled, then you're in good shape, because at least then you have a chance of making sure all those frames unwind properly. This means either using RAII techniques (possibly compiling with the /EHa flag so that asynchronous Win32 exceptions also trigger C++ unwinding and run your destructors, although that has its own problems, since the C++ exception model is synchronous and Win32 exceptions are asynchronous) or judiciously using try/finally (or whatever equivalent exists in your programming language of choice) to clean up resources in the event of an unwind.
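
    As a small illustration of the try/finally half of that advice, here is a sketch of my own (the function and its parameters are made up): the __finally block runs whether the guarded code completes normally or an exception unwinds through the frame, so the cleanup happens either way.

    #include <windows.h>
    
    // Hypothetical example; the names are invented for illustration.
    // The __finally block runs on normal exit *and* during exception unwind,
    // so the critical section is always released. Whether the data it was
    // protecting is still consistent is a separate question (see below).
    void UpdateSharedValue(CRITICAL_SECTION *pcs, int *pShared, const int *pSource)
    {
        EnterCriticalSection(pcs);
        __try
        {
            *pShared = *pSource; // may raise STATUS_ACCESS_VIOLATION
        }
        __finally
        {
            LeaveCriticalSection(pcs);
        }
    }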

    But if you don't control all the frames in between, then you can't really guarantee that they were written in the style you want.

    In Win32, exceptions are considered to be horrific situations that usually indicate some sort of fatal error. There may be some select cases where exceptions can be handled, but those are more the unusual cases than the rule. Most of the time, an exception means that something terrible has happened and you're out of luck. The best you can hope for at this point is a controlled crash landing.

    As a result of this overall mindset, Win32 code doesn't worry too much about recovering from exceptions. If an exception happens, then it means your process is already toast and there's no point trying to fix it, because that would be trying to reason about a total breakdown of normal functioning. As a general rule generic Win32 code is not exception-safe.

    Consider a function like this:

    struct BLORP
    {
        int Type;
        int Count;
        int Data;
    };
    
    CRITICAL_SECTION g_csGlobal; // assume somebody initialized this
    BLORP g_Blorp; // protected by g_csGlobal
    
    void SetCurrentBlorp(const BLORP *pBlorp)
    {
        EnterCriticalSection(&g_csGlobal);
        g_Blorp = *pBlorp;
        LeaveCriticalSection(&g_csGlobal);
    }
    
    void GetCurrentBlorp(BLORP *pBlorp)
    {
        EnterCriticalSection(&g_csGlobal);
        *pBlorp = g_Blorp;
        LeaveCriticalSection(&g_csGlobal);
    }
    

    These are perfectly fine-looking functions from a traditional Win32 standpoint. They take a critical section, copy some data, and leave the critical section. The only thing¹ that could go wrong is that the caller passed a bad pointer. If that happens, a STATUS_ACCESS_VIOLATION exception is raised, and the application dies.

    But what if your program decides to handle the access violation? Maybe pBlorp points into a memory-mapped file, and there is an I/O error paging the memory in, say because it's a file on the network and there was a network hiccup. Now you have two problems: The critical section is orphaned, and the data is only partially copied. (The partial-copy case happens if pBlorp points to a BLORP that straddles a page boundary, where the first page is valid but the second page isn't.) Just converting this code to RAII solves the first problem, but it doesn't solve the second, which is kind of bad because the second problem is what the critical section was trying to prevent from happening in the first place!

    void SetCurrentBlorp(const BLORP *pBlorp)
    {
        CriticalSectionLock lock(&g_csGlobal);
        g_Blorp = *pBlorp;
    }
    
    void GetCurrentBlorp(BLORP *pBlorp)
    {
        CriticalSectionLock lock(&g_csGlobal);
        *pBlorp = g_Blorp;
    }
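
    The CriticalSectionLock class used here isn't spelled out; a minimal sketch of such an RAII wrapper (the name and shape are my assumptions) might look like this:

    #include <windows.h>
    
    // Hypothetical RAII wrapper assumed by the example above. The destructor
    // releases the lock when the object goes out of scope, including during
    // exception unwind (provided the unwinder runs destructors for this frame).
    class CriticalSectionLock
    {
    public:
        explicit CriticalSectionLock(CRITICAL_SECTION *pcs) : m_pcs(pcs)
        {
            EnterCriticalSection(m_pcs);
        }
        ~CriticalSectionLock()
        {
            LeaveCriticalSection(m_pcs);
        }
    private:
        CriticalSectionLock(const CriticalSectionLock&);            // not copyable
        CriticalSectionLock& operator=(const CriticalSectionLock&); // not assignable
        CRITICAL_SECTION *m_pcs;
    };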
    

    Suppose somebody calls SetCurrentBlorp with a BLORP whose Type and Count are in readable memory, but whose Data is not. The code enters the critical section, copies the Type and Count, but crashes when it tries to copy the Data, resulting in a STATUS_ACCESS_VIOLATION exception. Now suppose that somebody unwisely decides to handle this exception. The RAII code releases the critical section (assuming that you compiled with /EHa), but there's no code to try to patch up the now-corrupted g_Blorp. Since the critical section was probably added to prevent g_Blorp from getting corrupted, the result is that the thing you tried to protect against ended up happening anyway.

    Okay, that was a bit of a digression. The point is that unless everybody between the point the exception is raised and the point the exception is handled is in on the joke, you are unlikely to escape fully unscathed. This is particularly true in the generalized Win32 case, since it is perfectly legal to write Win32 code in languages other than C++, as long as you adhere to the Win32 ABI. (I'm led to believe that Visual Basic is still a popular language.)

    There are a lot of ways of getting stack frames beyond your control between the point the exception is raised and the point it is handled. For example, you might call EnumWindows and raise an exception in the callback function and try to catch it in the caller. Or you might raise an exception in a window procedure and try to catch it in your message loop. Or you might try to longjmp out of a window procedure. All of these end up raising an exception and catching it in another frame. And since you don't control all the frames in between, you can't guarantee that they are all prepared to resume execution in the face of an exception.
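
    Here is a sketch of my own of the first variation (the names are made up): a C++ exception thrown from the callback has to unwind through EnumWindows' own frames inside USER32, which were never written to be unwound by C++ exceptions.

    #include <windows.h>
    
    // Hypothetical illustration of the anti-pattern, not a recommendation.
    struct FindResult { HWND hwnd; };
    
    // Callback that reports success by throwing a C++ exception. The throw
    // unwinds through EnumWindows' stack frames, which are not in on the joke.
    static BOOL CALLBACK FindItProc(HWND hwnd, LPARAM /* lParam */)
    {
        if (GetWindowTextLength(hwnd) > 0) {
            FindResult found = { hwnd };
            throw found;                 // don't do this
        }
        return TRUE;                     // keep enumerating
    }
    
    HWND FindFirstTitledWindow()
    {
        try {
            EnumWindows(FindItProc, 0);
        } catch (const FindResult& found) {
            return found.hwnd;           // "works", until it doesn't
        }
        return NULL;
    }

    The supported way to stop the enumeration early is to return FALSE from the callback and pass the result back through the lParam pointer instead.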

    Bonus reading: My colleague Paul Betts has written up a rather detailed study of one particular instance of this phenomenon.

    ¹Okay, another thing that could go wrong is that somebody calls TerminateThread on the thread, but whoever did that knew they were corrupting the process.

  • The Old New Thing

    The case of the asynchronous copy and delete

    • 29 Comments

    A customer reported some strange behavior in the CopyFile and DeleteFile functions. They were able to reduce the problem to a simple test program, which went like this:

    // assume "a" is a large file, say, 1MB.
    #include <windows.h>
    #include <stdio.h>
    
    int main(void)
    {
      for (;;)
      {
        // Try twice to copy the file
        if (!CopyFile(TEXT("a"), TEXT("b"), FALSE)) {
          Sleep(1000);
          if (!CopyFile(TEXT("a"), TEXT("b"), FALSE)) {
            fprintf(stderr, "copy failed, error %lu\n", GetLastError());
            return 1;
          }
        }
    
        // Try twice to delete the file
        if (!DeleteFile(TEXT("b"))) {
          Sleep(1000);
          if (!DeleteFile(TEXT("b"))) {
            fprintf(stderr, "delete failed, error %lu\n", GetLastError());
            return 1;
          }
        }
      }
    }
    

    When they ran the program, they found that sometimes the copy failed on the first try with error 5 (ERROR_ACCESS_DENIED) but if they waited a second and tried again, it succeeded. Similarly, sometimes the delete failed on the first try, but succeeded on the second try if you waited a bit.

    What's going on here? It looks like the CopyFile is returning before the file copy is complete, causing the DeleteFile to fail because the copy is still in progress. Conversely, it looks like the DeleteFile returns before the file is deleted, causing the CopyFile to fail because the destination exists.

    The operations CopyFile and DeleteFile are synchronous. However, the NT model for file deletion is that a file is deleted when the last open handle is closed.¹ If DeleteFile returns and the file still exists, then it means that somebody else still has an open handle to the file.
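
    To make that deletion model concrete, here is a small sketch of my own (the file name and flow are invented): DeleteFile can return success while another handle opened with FILE_SHARE_DELETE is still open, and until that handle closes, attempts to re-create the name fail, typically with ERROR_ACCESS_DENIED, which lines up with the error 5 the customer saw.

    #include <windows.h>
    #include <stdio.h>
    
    int main(void)
    {
        // Create "b" and hold on to the handle, the way a scanner might,
        // but allow other people to delete the file out from under us.
        HANDLE h = CreateFile(TEXT("b"), GENERIC_READ,
                              FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
                              NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
        if (h == INVALID_HANDLE_VALUE) return 1;
    
        // Succeeds, but the file only becomes delete-pending; it is not
        // actually removed until the last handle is closed.
        printf("DeleteFile returned %d\n", DeleteFile(TEXT("b")));
    
        // While the delete is pending, the name cannot be reused.
        HANDLE h2 = CreateFile(TEXT("b"), GENERIC_WRITE, 0, NULL,
                               CREATE_NEW, FILE_ATTRIBUTE_NORMAL, NULL);
        if (h2 == INVALID_HANDLE_VALUE) {
            printf("Re-create while old handle open: error %lu\n", GetLastError());
        } else {
            printf("Re-create while old handle open: succeeded\n");
            CloseHandle(h2);
        }
    
        CloseHandle(h); // last handle closed: now the file really goes away
    
        HANDLE h3 = CreateFile(TEXT("b"), GENERIC_WRITE, 0, NULL,
                               CREATE_NEW, FILE_ATTRIBUTE_NORMAL, NULL);
        printf("Re-create after close: %s\n",
               h3 == INVALID_HANDLE_VALUE ? "failed" : "succeeded");
        if (h3 != INVALID_HANDLE_VALUE) {
            CloseHandle(h3);
            DeleteFile(TEXT("b")); // clean up
        }
        return 0;
    }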

    So who has the open handle? The file was freshly created, so there can't be any pre-existing handles to the file, and we never open it between the copy and the delete.

    My psychic powers said, "The offending component is your anti-virus software."

    I can think of two types of software that go around snooping on recently created files. One is an indexing tool, but those tend not to be very aggressive about accessing files the moment they are created; they prefer to wait until the computer is idle to do their work. Anti-virus software, on the other hand, runs in real-time mode, checking every file as it is created. That makes it the more likely candidate to have snuck in and opened the file right after the copy completed so it could scan it, and that open is the extra handle preventing the deletion from completing.

    But wait, isn't anti-virus software supposed to be using oplocks so that it can close its handle and get out of the way if somebody wants to delete the file?

    Well, um, yes, but "what they should do" and "what they actually do" are often not the same.

    We never did hear back from the customer whether the guess was correct, which could mean one of various things:

    1. They confirmed the diagnosis and didn't feel the need to reply.
    2. They determined that the diagnosis was incorrect but didn't bother coming back for more help, because "those Windows guys don't know what they're talking about."
    3. They didn't test the theory at all, so had nothing to report.

    We may never know what the answer is.

    ¹Every so often, the NT file system folks dream of changing the deletion model to be more Unix-like, but then they wonder if that would end up breaking more things than it fixes.

  • The Old New Thing

    You can't rule out a total breakdown of normal functioning, because a total breakdown of normal functioning could manifest itself as anything

    • 16 Comments

    A customer was attempting to study a problem that their analysis traced back to the malloc function returning NULL.

    They asked, "Is it a valid conclusion that there is no heap corruption?"

    While heap corruption may not be the avenue of investigation you'd first pursue, you can't rule it out. In the presence of a total breakdown of normal functioning, anything can happen, including appearing to be some other type of failure entirely.

    For example, the heap corruption might have corrupted the bookkeeping data in such a way as to make the heap behave as if it were a fixed-sized heap, say by corrupting the location where the heap manager remembered the dwMaximumSize parameter and changing it from zero to nonzero. Now, the next time the heap manager wants to expand the heap, it sees that the heap is no longer expandable and returns NULL.
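
    To make the fixed-size-heap scenario concrete, here is a sketch of my own (the sizes are arbitrary) of the documented behavior a nonzero dwMaximumSize produces: the heap cannot grow, so once it fills up, allocations simply return NULL.

    #include <windows.h>
    #include <stdio.h>
    
    int main(void)
    {
        // A heap created with a nonzero dwMaximumSize (here 64KB) is not growable.
        HANDLE heap = HeapCreate(0, 0, 64 * 1024);
        if (!heap) return 1;
    
        // Keep allocating until the fixed-size heap runs out of room.
        int count = 0;
        while (HeapAlloc(heap, 0, 4096) != NULL) {
            count++;
        }
    
        // HeapAlloc has now returned NULL -- the same symptom a corrupted
        // "maximum size" field could produce in a normally growable heap.
        printf("Allocations before NULL: %d\n", count);
    
        HeapDestroy(heap);
        return 0;
    }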

    Or maybe the heap corruption tricked the heap manager into thinking that it was operating under low resource simulation, so it returned NULL even though there was plenty of memory available.

    Remember, once you've entered the realm of undefined behavior, anything is possible. Heck, one possible response to heap corruption is the installation of a rootkit.

    After all, that's how more advanced classes of malware work. They exploit a vulnerability to nudge a process into a subtle failure mode, and then push the failure mode over the edge into a breakdown, and then exploit the breakdown to get themselves installed onto your system, and then cover their tracks so you don't realize you've been pwned.

    Maybe the heap was corrupted in a way that caused a rootkit to become installed, and the rootkit patched the malloc function so it returned NULL.

    Like I said earlier, the possibility of heap corruption is probably not the avenue I would investigate first. But you can't rule it out either.

    Bonus chatter: Since heap corruption can in principle lead to anything, any bug that results in heap corruption automatically gets a default classification of Arbitrary Code Execution, and if the heap corruption can be triggered via the network, it gets an automatic default classification of Remote Code Execution (RCE). Even if the likelihood of transforming the heap corruption into remote code execution is exceedingly low, you still have to classify it as RCE until you can rule out all possibility of code execution. (And it is extremely rare that one can successfully prove that a heap overflow is not exploitable under any possible conditions.)

  • The Old New Thing

    How did the X-Mouse setting come to be?

    • 34 Comments

    Commenter HiTechHiTouch wants to know whether the "X-Mouse" feature went through the "every request starts at −100 points" filter, and if so, how it managed to gain 99 points.

    The X-Mouse feature is ancient and long predates the "−100 points" rule. It was added back in the days when a developer could add a random rogue feature because he liked it.

    But I'm getting ahead of myself.

    Rewind back to 1995. Windows 95 had just shipped, and some of the graphics people had shifted their focus to DirectX. The DirectX team maintained a very close relationship with the video game software community, and a programmer at one of the video game software companies mentioned in passing as part of some other conversation, "Y'know, one thing I miss from my X-Windows workstation is the ability to set focus to a window by just moving the mouse into it."

    As it happened, the programmer mentioned it to a DirectX team member who used to be on the shell team, so the person he mentioned it to actually knew a lot about all this GUI programming stuff. Don't forget, in the early days of DirectX, it was a struggle convincing game vendors to target this new Windows 95 operating system; they were all accustomed to writing their games to run under MS-DOS. Video game programmers didn't know much about programming for Windows because they had never done it before.

    That DirectX team member sat down and quickly pounded out the first version of what eventually became known to the outside world as the X-Mouse PowerToy. He gave a copy to that programmer whose request was made almost as an afterthought, and he was thrilled that he could move focus around with the mouse the way he was used to.

    "Hey, great little tool you got there. Could you tweak it so that when I move the mouse into a window, it gets focus but doesn't come to the top? Sorry I didn't mention that originally; I didn't realize you were going to interpret my idle musing as a call to action!"

    The DirectX team member added the feature and added a check-box to the X-Mouse PowerToy to control whether the window is brought to the top when it is activated by mouse motion.

    "This is really sweet. I hate to overstay my welcome, but could you tweak it so that it doesn't change focus until my mouse stays in the window for a while? Again, sorry I didn't mention that originally."

    Version three of X-Mouse added the ability to set a delay before it moved the focus. And that was the version of X-Mouse that went into the PowerToys.

    When the Windows NT folks saw the X-Mouse PowerToy, they said, "Aw shucks, we can do that too!" And they added the three SystemParametersInfo values I described in an earlier article so as to bring Windows NT up to feature parity with X-Mouse.

    It was a total rogue feature.
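
    For reference, here is a sketch of my own of what that parity looks like from the application side. The three values named below are my assumption about which settings that earlier article describes; I chose them because they map one-to-one onto the three X-Mouse behaviors above (focus follows the mouse, optional raise, activation delay).

    #include <windows.h>
    
    int main(void)
    {
        // Focus follows the mouse.
        SystemParametersInfo(SPI_SETACTIVEWINDOWTRACKING, 0,
                             (PVOID)(UINT_PTR)TRUE, SPIF_SENDCHANGE);
    
        // Don't bring the window to the top when it gains focus this way.
        SystemParametersInfo(SPI_SETACTIVEWNDTRKZORDER, 0,
                             (PVOID)(UINT_PTR)FALSE, SPIF_SENDCHANGE);
    
        // Wait 500ms before moving focus (the "version three" delay).
        SystemParametersInfo(SPI_SETACTIVEWNDTRKTIMEOUT, 0,
                             (PVOID)(UINT_PTR)500, SPIF_SENDCHANGE);
    
        return 0;
    }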

  • The Old New Thing

    Why don't the shortcuts I put in the CSIDL_COMMON_FAVORITES folder show up in the Favorites menu?

    • 25 Comments

    A customer created some shortcuts in the CSIDL_COMMON_FAVORITES folder, expecting them to appear in the Favorites menu for all users. Instead, they appeared in the Favorites menu for no users. Why isn't CSIDL_COMMON_FAVORITES working?

    The CSIDL_COMMON_FAVORITES value was added at the same time as the other CSIDL_COMMON_* values, and its name strongly suggests that its relationship to CSIDL_FAVORITES is the same as the relationship between CSIDL_COMMON_STARTMENU and CSIDL_STARTMENU, or between CSIDL_COMMON_PROGRAMS and CSIDL_PROGRAMS, or between CSIDL_COMMON_DESKTOPDIRECTORY and CSIDL_DESKTOPDIRECTORY.

    That suggestion is a false one.

    In fact, CSIDL_COMMON_FAVORITES is not hooked up to anything. It's another of those vestigial values that got created with the intent of actually doing something but that thing never actually happened. I don't think any version of Internet Explorer ever paid any attention to that folder. Maybe the designers decided that it was a bad idea and cut the feature. Maybe it was an oversight. Whatever the reason, it's just sitting there wasting space.

    Sorry for the fake-out.
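
    If you want to see the fake-out for yourself, here is a small sketch of my own: both constants can be handed to SHGetFolderPath, and whatever the common one resolves to on a given machine (if anything), nothing reads shortcuts out of it.

    #include <windows.h>
    #include <shlobj.h>
    #include <tchar.h>
    
    int main(void)
    {
        TCHAR path[MAX_PATH];
    
        // The per-user folder: this is what the Favorites menu actually shows.
        if (SUCCEEDED(SHGetFolderPath(NULL, CSIDL_FAVORITES, NULL,
                                      SHGFP_TYPE_CURRENT, path))) {
            _tprintf(TEXT("CSIDL_FAVORITES:        %s\n"), path);
        }
    
        // The vestigial value: it may resolve to a path, but nothing uses it.
        HRESULT hr = SHGetFolderPath(NULL, CSIDL_COMMON_FAVORITES, NULL,
                                     SHGFP_TYPE_CURRENT, path);
        if (SUCCEEDED(hr)) {
            _tprintf(TEXT("CSIDL_COMMON_FAVORITES: %s\n"), path);
        } else {
            _tprintf(TEXT("CSIDL_COMMON_FAVORITES: not available (0x%08lX)\n"), hr);
        }
        return 0;
    }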

    Exercise: Another customer wanted to know why creating a %ALLUSERSPROFILE%\Microsoft\Internet Explorer\Quick Launch directory and putting shortcuts into it did not result in those shortcuts appearing in every user's Quick Launch bar. Explain.

  • The Old New Thing

    Buzzword-filled subject line easily misinterpreted by unsuspecting manager

    • 11 Comments

    A colleague of mine submitted some paperwork regarding the end-date of his college intern. The automated response combined HR buzzwords in an unfortunate way:

    Subject: Intern Termination Report was executed

    Just to be sure, my colleague stopped by his intern's office. He's still there. And still alive.

    For now.
