September, 2012

  • The Old New Thing

    Does the CopyFile function verify that the data reached its final destination successfully?

    • 30 Comments

    A customer had a question about data integrity via file copying.

    I am using the File.Copy to copy files from one server to another. If the call succeeds, am I guaranteed that the data was copied successfully? Does the File.Copy method internally perform a file checksum or something like that to ensure that the data was written correctly?

    The File.Copy method uses the Win32 Copy­File function internally, so let's look at Copy­File.

    Copy­File just issues Read­File calls from the source file and Write­File calls to the destination file. (Note: Simplification for purposes of discussion.) It's not clear what you are hoping to checksum. If you want Copy­File to checksum the bytes as they return from Read­File, and checksum the bytes as they are passed to Write­File, and then compare the two checksums at the end of the operation, then that tells you nothing, since they are the same bytes in the same memory.

    while (...) {
     ReadFile(sourceFile, buffer, bufferSize);
     readChecksum.checksum(buffer, bufferSize);
    
     writeChecksum.checksum(buffer, bufferSize);
     WriteFile(destinationFile, buffer, bufferSize);
    }
    

    The read­Checksum and write­Checksum are identical because they operate on the same bytes. (In fact, the compiler might even optimize the code by merging the calculations together.) The only way something could go awry is if you have flaky memory chips that change memory values spontaneously.

    Maybe the question was whether Copy­File goes back and reads the file it just wrote out to calculate the checksum. But that's not possible in general, because you might not have read access on the destination file. I guess you could have it do a checksum if the destination were readable, and skip it if not, but then that results in a bunch of weird behavior:

    • It generates spurious security audits when it tries to read from the destination and gets ERROR_ACCESS_DENIED.
    • It means that Copy­File sometimes does a checksum and sometimes doesn't, which removes the value of any checksum work since you're never sure if it actually happened.
    • It doubles the network traffic for a file copy operation, leading to weird workarounds from network administrators like "Deny read access on files in order to speed up file copies."

    Even if you get past those issues, you have an even bigger problem: How do you know that reading the file back will really tell you whether the file was physically copied successfully? If you just read the data back, it may end up being read out of the disk cache, in which case you're not actually verifying physical media. You're just comparing cached data to cached data.

    But if you open the file with caching disabled, this has the side effect of purging the cache for that file, which means that the system has thrown away a bunch of data that could have been useful. (For example, if another process starts reading the file at the same time.) And, of course, you're forcing access to the physical media, which is slowing down I/O for everybody else.
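
    If you did want to go down that road anyway, a minimal sketch (mine, not from the original article; the function name is made up for illustration) of re-reading the destination with caching disabled might look like this. FILE_FLAG_NO_BUFFERING bypasses the system cache, but it imposes sector-alignment rules on the buffer and transfer size, and as the next paragraph notes, it still can't see past a caching disk controller.

    #include <windows.h>

    // Illustrative only: re-read a file with the system cache bypassed.
    BOOL ReadBackUncached(PCWSTR path)
    {
        HANDLE h = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                               OPEN_EXISTING, FILE_FLAG_NO_BUFFERING, NULL);
        if (h == INVALID_HANDLE_VALUE) return FALSE;

        // VirtualAlloc returns page-aligned memory, which satisfies the
        // sector-alignment requirement; 64KB is a multiple of any common
        // sector size.
        const DWORD chunk = 64 * 1024;
        void *buffer = VirtualAlloc(NULL, chunk, MEM_COMMIT, PAGE_READWRITE);
        BOOL ok = (buffer != NULL);

        DWORD cb;
        while (ok && ReadFile(h, buffer, chunk, &cb, NULL) && cb != 0) {
            // ... feed the bytes to whatever verification you are doing ...
        }

        if (buffer) VirtualFree(buffer, 0, MEM_RELEASE);
        CloseHandle(h);
        return ok;
    }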

    But wait, there's also the problem of caching controllers. Even when you tell the hard drive, "Now read this data from the physical media," it may decide to return the data from an onboard cache instead. You would have to issue a "No really, flush the data and read it back" command to the controller to ensure that it's really reading from physical media.

    And even if you verify that, there's no guarantee that the moment you declare "The file was copied successfully!" the drive platter won't spontaneously develop a bad sector and corrupt the data you just declared victory over.

    This is one of those "How far do you really want to go?" type of questions. You can re-read and re-validate as much as you want at copy time, and you still won't know that the file data is valid when you finally get around to using it.

    Sometimes, you're better off just trusting the system to have done what it says it did.

    If you really want to do some sort of copy verification, you'd be better off saving the checksum somewhere and having the ultimate consumer of the data validate the checksum and raise an integrity error if it discovers corruption.
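
    As a rough illustration of that approach (something you build yourself, not something Copy­File does for you), here is a minimal sketch: compute a checksum over the source while copying, store it wherever the eventual consumer can find it, and have the consumer recompute it before trusting the data. The function names and the choice of FNV-1a are just placeholders for whatever integrity scheme you actually use.

    #include <windows.h>

    // 64-bit FNV-1a over one buffer; chain the hash across buffers.
    ULONGLONG Fnv1aUpdate(ULONGLONG hash, const BYTE *p, DWORD cb)
    {
        for (DWORD i = 0; i < cb; i++) {
            hash = (hash ^ p[i]) * 1099511628211ULL;
        }
        return hash;
    }

    // Checksum an entire file: the producer runs this over the source and
    // records the value; the consumer runs it over the copy and compares.
    BOOL ChecksumFile(PCWSTR path, ULONGLONG *result)
    {
        HANDLE h = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                               OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
        if (h == INVALID_HANDLE_VALUE) return FALSE;

        ULONGLONG hash = 14695981039346656037ULL; // FNV-1a offset basis
        BYTE buffer[64 * 1024];
        DWORD cb;
        while (ReadFile(h, buffer, sizeof(buffer), &cb, NULL) && cb != 0) {
            hash = Fnv1aUpdate(hash, buffer, cb);
        }
        CloseHandle(h);
        *result = hash;
        return TRUE;
    }

    If the consumer finds a mismatch, it raises an integrity error at the point where the corruption actually matters, rather than pretending the copy operation could have guaranteed anything about the future.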

  • The Old New Thing

    Data in crash dumps are not a matter of opinion

    • 24 Comments

    A customer reported a problem with the System­Time­To­Tz­Specific­Local­Time function. (Gosh, why couldn't they have reported a problem with a function with a shorter name! Now I have to type that thing over and over again.)

    We're having a problem with the System­Time­To­Tz­Specific­Local­Time function. We call it like this:

    s_pTimeZones->SystemTimeToTzSpecificLocalTime((BYTE)timeZoneId,
                                     &sysTime, &localTime);
    

    On some but not all of our machines, our program crashes with the following call stack:

    ExceptionAddress: 77d4a0d0 (kernel32!SystemTimeToTzSpecificLocalTime+0x49)
       ExceptionCode: c0000005 (Access violation)
      ExceptionFlags: 00000000
    NumberParameters: 2
       Parameter[0]: 00000000
       Parameter[1]: 000000ac
    Attempt to read from address 000000ac
     
    kernel32!SystemTimeToTzSpecificLocalTime+0x49
    Contoso!CTimeZones::SystemTimeToTzSpecificLocalTime+0x26
    Contoso!CContoso::ResetTimeZone+0xc0
    Contoso!ResetTimeZoneThreadProc+0x32
    

    This problem appears to occur only with the release build; the debug build does not have this problem. Any ideas?

    Notice that in the line of code the customer provided, they are not calling System­Time­To­Tz­Specific­Local­Time; they are instead calling some application-defined method with the same name, which takes different parameters from the system function.

    The customer apologized and included the source file they were using, as well as a crash dump file.

    void CContoso::ResetTimeZone()
    {
     SYSTEMTIME sysTime, localTime;
     GetLastModifiedTime(&sysTime);
    
     for (int timeZoneId = 1;
          timeZoneId < MAX_TIME_ZONES;
          timeZoneId++) {
      if (!s_pTimeZones->SystemTimeToTzSpecificLocalTime((BYTE)timeZoneId,
                                      &sysTime, &localTime)) {
        LOG_ERROR("...");
        return;
      }
      ... do something with localTime ...
     }
    }
    
    BOOL CTimeZones::SystemTimeToTzSpecificLocalTime(
        BYTE bTimeZoneID,
        LPSYSTEMTIME lpUniversalTime,
        LPSYSTEMTIME lpLocalTime)
    {
        return ::SystemTimeToTzSpecificLocalTime(
            &m_pTimeZoneInfo[bTimeZoneID],
            lpUniversalTime, lpLocalTime);
    }
    

    According to the crash dump, the first parameter passed to CTime­Zones::System­Time­To­Tz­Specific­Local­Time was 1, and the m_pTimeZoneInfo member was nullptr. As a result, a bogus non-null pointer was passed as the first parameter to System­Time­To­Tz­Specific­Local­Time, which resulted in a crash when the function tried to dereference it.

    This didn't require any special secret kernel knowledge; all I did was look at the stack trace and the value of the member variable.

    So far, it was just a case of a lazy developer who didn't know how to debug their own code. But the reply from the customer was most strange:

    I don't think so, for two reasons.

    1. The exact same build on another machine does not crash, so it must be a machine-specific or OS-specific bug.
    2. The code in question has not changed in several months, so if the problem were in that code, we would have encountered it much earlier.

    I was momentarily left speechless by this response. It sounds like the customer simply refuses to believe the information that's right there in the crash dump. "La la la, I can't hear you."

    Memory values are not a matter of opinion. If you look in memory and find that the value 5 is on the stack, then the value 5 is on the stack. You can't say, "No it isn't; it's 6." You can have different opinions on how the value 5 ended up on the stack, but the fact that the value is 5 is beyond dispute.

    It's like a box of cereal that has been spilled on the floor. People may argue over who spilled the cereal, or who placed the box in such a precarious position in the first place, but to take the position "There is no cereal on the floor" is a pretty audacious move.

    Whether you like it or not, the value is not correct. You can't deny what's right there in the dump file. (Well, unless you think the dump file itself is incorrect.)

    A colleague studied the customer's code more closely and pointed out a race condition where the thread that calls CContoso::ResetTimeZone may do so before the CTimeZones object has allocated the m_pTimeZoneInfo array. And it wasn't anything particularly subtle either. It went like this, in pseudocode:

    CreateThread(ResetTimeZoneThreadProc);
    
    s_pTimeZones = new CTimeZones;
    s_pTimeZones->Initialize();
    
    // the CTimeZones::Initialize method allocates m_pTimeZoneInfo
    // among other things
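
    A sketch of the fix implied by that pseudocode (my illustration, not the customer's actual change): finish constructing and initializing the shared object before starting the thread that consumes it, so ResetTimeZone can never observe a null m_pTimeZoneInfo.

    s_pTimeZones = new CTimeZones;
    s_pTimeZones->Initialize();  // allocates m_pTimeZoneInfo, among other things

    // Only now is it safe to start the thread that uses s_pTimeZones.
    HANDLE hThread = CreateThread(nullptr, 0, ResetTimeZoneThreadProc,
                                  nullptr, 0, nullptr);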
    

    The customer never wrote back once the bug was identified. Perhaps the sheer number of impossible things all happening at once caused their head to explode.

    I discussed this incident later with another colleague, who remarked

    Frequently, some problem X will occur, and the people debugging it will say, "The only way that problem X can occur is if we are in situation Y, but we know that situation Y is impossible, so we didn't bother investigating that possibility. Can you suggest another idea?"

    Yeah, I can suggest another idea: "The computer is always right." You already saw that problem X occurred. If the only way that problem X can occur is if you are in situation Y, then the first thing you should do is assume that you are in situation Y and work from there.

    Teaching people to follow this simple axiom has avoided a lot of fruitless, misdirected, speculative debugging. People seem hard-wired to prefer speculation, though, and it's easy to slip back into forgetting this simple bit of logic.

    To put it another way:

    • If X, then Y.
    • X is true.
    • Y cannot possibly be true.

    In order for these three statements to hold simultaneously, you must have found a fundamental flaw in the underlying axioms of logic as they have been understood for thousands of years.

    This is unlikely to be the case.

    Given that you have X right in front of you, X is true by observation. That leaves the other two statements. Maybe there's a case where X does not guarantee Y. Maybe Y is true after all.

    As Sherlock Holmes is famous for saying, "When you have eliminated the impossible, whatever remains, however improbable, must be the truth." But before you rule out the impossible, make sure it's actually impossible.

    Bonus chatter: Now that I've told you that the debugger never lies, I get to confuse you in a future entry by debugging a crash where the debugger lied. (Or at least wasn't telling the whole truth.)

  • The Old New Thing

    WM_CTLCOLOR vs GetFileVersionInfoSize: Just because somebody else screwed up doesn't mean you're allowed to screw up too

    • 16 Comments

    In a discussion of the now-vestigial lpdwHandle parameter to the Get­File­Version­Info­Size function, Neil asks, "Weren't there sufficient API differences (e.g. WM_CTLCOLOR) between Win16 and Win32 to justify changing the definitions to eliminate the superfluous handle?"

    The goal of Win32 was to provide as much backward compatibility with existing 16-bit source code as can be practically achieved. Not all of the changes were successful in achieving this goal, but just because one person fails to meet that goal doesn't mean that everybody else should abandon the goal, too.

    The Win32 porting tool PORTTOOL.EXE scanned for things which had changed and inserted comments saying things like

    • "No Win32 API equivalent" -- these were for the 25 functions which were very tightly coupled to the 16-bit environment, like selector management functions.
    • "Replaced by OtherFunction" -- these were used for the 38 functions which no longer existed in Win32, but for which corresponding function did exist, but the parameters were different so a simple search-and-replace was not sufficient.
    • "Replaced by XYZ system" -- these were for functions that used an interface that was completely redesigned: the 16 old sound functions that buzzed your tinny PC speaker being replaced by the new multimedia system, and the 8 profiling functions.
    • "This function is now obsolete" -- these were for the 16 functions that no longer had any effect, like Global­LRU­Newest and Limit­EMS­Pages.
    • "wParam/lParam repacking" -- these were for the 21 messages that packed their parameters differently.
    • Special remarks for eight functions whose parameters changed meaning and therefore required special attention.
    • A special comment just for window procedures.

    If you add it up, you'll see that this makes for a total of 117 breaking changes. And a lot of these changes were in rarely-used parts of Windows like the selector-management stuff, the PC speaker stuff, the profiling stuff, and the serial port functions. The number of breaking changes that affected typical developers was more like a few dozen.

    Not bad for a total rewrite of an operating system.

    If somebody said, "Hey, you should port to this new operating system. Here's a list of 117 things you need to change," you're far more likely to respond, "Okay, I guess I can do that," than if somebody said, "Here's a list of 3,000 things you need to change." Especially if some of the changes were not absolutely necessary, but were added merely to annoy you. (I would argue that the handling of many GDI functions like Move­To fell into the added merely to annoy you category, but at least a simple macro smooths over most of the problems.)

    One of the messages that required special treatment was WM_COMMAND. In 16-bit Windows, the parameters were as follows:

    WPARAM  int id
    LPARAM  HWND hwndCtl (low word)
            int nCode (high word)

    Observe that this message violated the rule that handle-sized things go in the WPARAM. As a result, this parameter packing method could not be maintained in Win32. If it had been packed as

    WPARAM  HWND hwndCtl
    LPARAM  int id (low word)
            int nCode (high word)

    then the message would have ported cleanly to Win32. But Win32 handles are 32-bit values, so there's no room for both an HWND and an integer in a 32-bit LPARAM; as a result, the message had to be repacked in Win32.

    The WM_CTL­COLOR message was an extra special case of a message that required changes, because it was the only one that changed in a way that required more than just mechanical twiddling of the way the parameters were packaged. Instead, it got split out into several messages, one for each type of control.

    In 16-bit Windows, the parameters to the WM_CTL­COLOR message were as follows:

    WPARAM  HDC hdc
    LPARAM  HWND hwndCtl (low word)
            int type (high word)

    The problem with this message was that it had two handle-sized values. One of them went into the WPARAM, like all good handle-sized parameters, but the second one was forced to share a bunk bed with the type code in the LPARAM. This arrangement didn't survive in Win32 because handles expanded to 32-bit values, but unlike WM_COMMAND, there was nowhere to put the now-ousted type, since both the WPARAM and LPARAM were full with the two handles. Solution: Encode the type code in the message number. The WM_CTL­COLOR message became a collection of messages, all related by the formula

    WM_CTLCOLORtype = WM_CTLCOLORMSGBOX + CTLCOLOR_type
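
    To make that formula concrete, here is a minimal sketch (my illustration, not code from the article) of how the split looks from a Win32 window procedure: each control type gets its own message, the DC still arrives in the WPARAM, the control handle now fills the entire LPARAM, and the old type code can be recovered from the message number itself.

    #include <windows.h>

    LRESULT CALLBACK SampleWndProc(HWND hwnd, UINT uMsg, WPARAM wParam, LPARAM lParam)
    {
        switch (uMsg) {
        case WM_CTLCOLORSTATIC:   // one message per control type in Win32
        case WM_CTLCOLOREDIT:
        {
            HDC hdc = (HDC)wParam;        // the DC, just like the 16-bit WPARAM
            HWND hwndCtl = (HWND)lParam;  // the control handle now owns the LPARAM
            int type = uMsg - WM_CTLCOLORMSGBOX; // recovers the old CTLCOLOR_* code
            (void)hwndCtl; (void)type;    // unused in this sketch
            SetTextColor(hdc, GetSysColor(COLOR_WINDOWTEXT));
            SetBkColor(hdc, GetSysColor(COLOR_WINDOW));
            return (LRESULT)GetSysColorBrush(COLOR_WINDOW);
        }
        }
        return DefWindowProc(hwnd, uMsg, wParam, lParam);
    }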
    

    The WM_CTL­COLOR message was the bad boy in the compatibility contest, falling pretty badly on its face. (How many metaphors can I mix in one article?)

    But just because there's somebody who screwed up doesn't mean that you're allowed to screw up too. If there was a parameter that didn't do anything any more, just declare it a reserved parameter. That way, you didn't have to go onto the "wall of shame" of functions that didn't port cleanly. The Get­File­Version­Info­Size function kept its vestigial lpdwHandle parameter, Win­Main kept its vestigial hPrev­Instance parameter, and Co­Initialize kept its vestigial lpReserved parameter.

    This also explains why significant effort was made in the 32-bit to 64-bit transition not to make breaking changes just because you can. As much as practical, porting issues were designed in such a way that they could be detected at compile time. Introducing gratuitous changes in behavior makes the porting process harder than it needs to be.

  • The Old New Thing

    Rogue feature: Docking a folder at the edge of the screen

    • 31 Comments

    Starting in Windows 2000 and continuing through Windows Vista, you could drag a folder out of Explorer and slam it into the edge of the screen. When you let go, it docked itself to that edge of the screen like a toolbar. A customer noticed that this stopped working in Windows 7 and asked, "Was this feature dropped in Windows 7, and is there a way to turn it back on?"

    Yes, the feature was dropped in Windows 7, and there is no way to turn it back on because the code to implement it was deleted from the product. (Well, okay, you could "turn it back on" by working with your support representative to file a Design Change Request with the Windows Sustained Engineering team and asking them to restore the code. But they'll probably cackle with glee as they click REQUEST DENIED. They will also probably add a buzzing sound just for extra oomph.)

    The introduction of this feature took place further back in history than I have permission to access the Windows source code history database, so I can't explain how it was introduced, but I can guess, and then the person who removed the feature confirmed that my guess was correct.

    First of all, very few people were actually using the feature. And of the people who activated it, most of them did so by mistake and couldn't figure out how to undo it. (Sound familiar?) The feature was creating far more trouble than benefit, and by that calculation alone, it was a strong candidate for removal. Furthermore, the design team was interested in a new way to use the edges of the screen. Nobody could figure out how the docking feature actually got added. We strongly suspect that it was another rogue feature added by a specific developer who had a history of slipping in rogue features.

  • The Old New Thing

    Why can't I set "Size all columns to fit" as the default?

    • 30 Comments

    A customer wanted to know how to set Size all columns to fit as the default for all Explorer windows. (I found an MSDN forum thread on the same subject, and apparently, the inability to set Size all columns to fit as the default is "an enormous oversight and usability problem.")

    The confusion stems from the phrasing of the option; it's not clear whether it is a state or a verb. The option could mean

    • "Refresh the size of all the columns so that they fit the content" (verb)
    • "Maintain the size of all the columns so that they fit the content" (state)

    As it happens, the option is a verb, which means that it is not part of the state, and therefore can't be made the default. (The cue that it is a verb is that when you select it, you don't get a check-mark next to the menu option the next time you go to the menu.)

    Mind you, during the development cycle, we did try addressing the oversight part of the enormous oversight and usability problem, but we discovered that fixing the oversight caused an enormous usability problem.

    After changing Size all columns to fit from a verb to a state, the result was unusable: The constantly-changing column widths (which were often triggered spontaneously as the contents of the view were refreshed or updated) were unpredictable and consequently reduced user confidence since it's hard to have the confidence to click the mouse if there is an underlying threat that the thing you're trying to click will move around of its own volition.

    Based on this strong negative feedback, we changed it back to a verb. Now the columns shift around only when you tell them to.

    I find it interesting that even a decision that was made by actually implementing it and then performing actual usability research gets dismissed as something that was "an enormous oversight and usability problem."

    Sigh: Comments closed due to insults and name-calling.

  • The Old New Thing

    Why can't I use Magnifier in Full Screen or Lens mode?

    • 16 Comments

    A customer liaison asked why their customer's Windows 7 machines could run Magnifier only in Docked mode. Full Screen and Lens mode were disabled. The customer liaison was unable to reproduce the problem on a physical machine, but was able to reproduce it in a virtual machine.

    Full Screen and Lens mode require that desktop composition be enabled. Windows will enable desktop composition by default if it thinks your video card is capable of handling it. (Finding the minimum hardware requirements for desktop composition is left as an exercise.)

    This was visible in the screen shots provided by the customer liaison. In the screen shot where Full Screen and Lens modes were enabled, the Aero theme was being used, whereas in the screen shot where they were disabled, the theme was Windows 7 Basic. The Windows 7 Basic theme is used when desktop composition is disabled.

    A quick way to check whether desktop composition is enabled is to hit Alt+Tab and see whether windows get the Aero Peek effect when you select them. Aero Peek is a feature that is provided by the desktop compositor.
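
    If you prefer to check programmatically rather than eyeballing the theme, a minimal sketch (not from the original article) using the documented DwmIsCompositionEnabled function looks like this; link with dwmapi.lib.

    #include <windows.h>
    #include <dwmapi.h>
    #include <stdio.h>

    int main()
    {
        BOOL enabled = FALSE;
        if (SUCCEEDED(DwmIsCompositionEnabled(&enabled))) {
            // Full Screen and Lens mode require this to report TRUE.
            printf("Desktop composition is %s.\n", enabled ? "enabled" : "disabled");
        }
        return 0;
    }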

  • The Old New Thing

    The case of the asynchronous copy and delete

    • 29 Comments

    A customer reported some strange behavior in the Copy­File and Delete­File functions. They were able to reduce the problem to a simple test program, which went like this (pseudocode):

    // assume "a" is a large file, say, 1MB.
    
    while (true)
    {
      // Try twice to copy the file
      if (!CopyFile("a", "b", FALSE)) {
        Sleep(1000);
        if (!CopyFile("a", "b", FALSE)) {
          fatalerror
        }
      }
    
      // Try twice to delete the file
      if (!DeleteFile("b")) {
        Sleep(1000);
        if (!DeleteFile("b")) {
          fatalerror
        }
      }
    }
    

    When they ran the program, they found that sometimes the copy failed on the first try with error 5 (ERROR_ACCESS_DENIED) but if they waited a second and tried again, it succeeded. Similarly, sometimes the delete failed on the first try, but succeeded on the second try if you waited a bit.

    What's going on here? It looks like the Copy­File is returning before the file copy is complete, causing the Delete­File to fail because the copy is still in progress. Conversely, it looks like the Delete­File returns before the file is deleted, causing the Copy­File to fail because the destination exists.

    The operations Copy­File and Delete­File are synchronous. However, the NT model for file deletion is that a file is deleted when the last open handle is closed.¹ If Delete­File returns and the file still exists, then it means that somebody else still has an open handle to the file.

    So who has the open handle? The file was freshly created, so there can't be any pre-existing handles to the file, and we never open it between the copy and the delete.

    My psychic powers said, "The offending component is your anti-virus software."

    I can think of two types of software that go around snooping on recently-created files. One is an indexing tool, but those tend not to be very aggressive about accessing files the moment they are created; they tend to wait until the computer is idle to do their work. Anti-virus software, however, runs in real-time mode, checking every file as it is created. That makes it the more likely culprit: it sneaks in and opens the file after the copy completes so it can scan it, and that open is the extra handle that prevents the deletion from completing.

    But wait, isn't anti-virus software supposed to be using oplocks so that it can close its handle and get out of the way if somebody wants to delete the file?

    Well, um, yes, but "what they should do" and "what they actually do" are often not the same.
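
    You can watch the deletion model in action with a small sketch (my illustration, not the customer's code): a second handle plays the role of the scanner, Delete­File succeeds immediately, but recreating the file fails with ERROR_ACCESS_DENIED until that handle closes, which is exactly the error 5 the customer was seeing.

    #include <windows.h>
    #include <stdio.h>

    int main()
    {
        // This handle plays the role of the scanner holding the file open.
        HANDLE snoop = CreateFileW(L"b", GENERIC_READ,
                                   FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
                                   NULL, OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
        if (snoop == INVALID_HANDLE_VALUE) return 1;

        printf("DeleteFile: %d\n", DeleteFileW(L"b")); // succeeds: marks for deletion

        // Deletion is still pending, so recreating "b" fails with error 5.
        HANDLE recreate = CreateFileW(L"b", GENERIC_WRITE, 0, NULL,
                                      CREATE_NEW, FILE_ATTRIBUTE_NORMAL, NULL);
        printf("recreate while handle open: %s, error %lu\n",
               recreate == INVALID_HANDLE_VALUE ? "failed" : "succeeded",
               GetLastError());

        CloseHandle(snoop); // last handle closes; now the file is really gone

        recreate = CreateFileW(L"b", GENERIC_WRITE, 0, NULL,
                               CREATE_NEW, FILE_ATTRIBUTE_NORMAL, NULL);
        printf("recreate after close: %s\n",
               recreate == INVALID_HANDLE_VALUE ? "failed" : "succeeded");
        if (recreate != INVALID_HANDLE_VALUE) CloseHandle(recreate);
        return 0;
    }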

    We never did hear back from the customer whether the guess was correct, which could mean one of various things:

    1. They confirmed the diagnosis and didn't feel the need to reply.
    2. They determined that the diagnosis was incorrect but didn't bother coming back for more help, because "those Windows guys don't know what they're talking about."
    3. They didn't test the theory at all, so had nothing to report.

    We may never know what the answer is.

    Note

    ¹Every so often, the NT file system folks dream of changing the deletion model to be more Unix-like, but then they wonder if that would end up breaking more things than it fixes.

  • The Old New Thing

    How did the X-Mouse setting come to be?

    • 34 Comments

    Commenter HiTechHiTouch wants to know whether the "X-Mouse" feature went through the "every request starts at −100 points filter", and if so, how did it manage to gain 99 points?

    The X-Mouse feature is ancient and long predates the "−100 points" rule. It was added back in the days when a developer could add a random rogue feature because he liked it.

    But I'm getting ahead of myself.

    Rewind back to 1995. Windows 95 had just shipped, and some of the graphics people had shifted their focus to DirectX. The DirectX team maintained a very close relationship with the video game software community, and a programmer at one of the video game software companies mentioned in passing as part of some other conversation, "Y'know, one thing I miss from my X-Windows workstation is the ability to set focus to a window by just moving the mouse into it."

    As it happened, the programmer mentioned it to a DirectX team member who used to be on the shell team, so the person he mentioned it to actually knew a lot about all this GUI programming stuff. Don't forget, in the early days of DirectX, it was a struggle convincing game vendors to target this new Windows 95 operating system; they were all accustomed to writing their games to run under MS-DOS. Video game programmers didn't know much about programming for Windows because they had never done it before.

    That DirectX team member sat down and quickly pounded out the first version of what eventually became known to the outside world as the X-Mouse PowerToy. He gave a copy to that programmer whose request was made almost as an afterthought, and he was thrilled that he could move focus around with the mouse the way he was used to.

    "Hey, great little tool you got there. Could you tweak it so that when I move the mouse into a window, it gets focus but doesn't come to the top? Sorry I didn't mention that originally; I didn't realize you were going to interpret my idle musing as a call to action!"

    The DirectX team member added the feature and added a check-box to the X-Mouse PowerToy to control whether the window is brought to the top when it is activated by mouse motion.

    "This is really sweet. I hate to overstay my welcome, but could you tweak it so that it doesn't change focus until my mouse stays in the window for a while? Again, sorry I didn't mention that originally."

    Version three of X-Mouse added the ability to set a delay before it moved the focus. And that was the version of X-Mouse that went into the PowerToys.

    When the Windows NT folks saw the X-Mouse PowerToy, they said, "Aw shucks, we can do that too!" And they added the three System­Parameters­Info values I described in an earlier article so as to bring Windows NT up to feature parity with X-Mouse.
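
    For reference, here is a minimal sketch of those three settings (my summary, not code from the article); SPI_SETACTIVEWINDOWTRACKING, SPI_SETACTIVEWNDTRKZORDER, and SPI_SETACTIVEWNDTRKTIMEOUT are the documented values for active window tracking, but treat the exact calls as illustrative.

    #include <windows.h>

    void EnableXMouseBehavior(DWORD delayMs)
    {
        // Activate the window the mouse is over (the original X-Mouse request)...
        SystemParametersInfo(SPI_SETACTIVEWINDOWTRACKING, 0,
                             (PVOID)(ULONG_PTR)TRUE, SPIF_SENDCHANGE);

        // ...but don't bring it to the top (the second request)...
        SystemParametersInfo(SPI_SETACTIVEWNDTRKZORDER, 0,
                             (PVOID)(ULONG_PTR)FALSE, SPIF_SENDCHANGE);

        // ...and only after the mouse has lingered for a while (the third request).
        SystemParametersInfo(SPI_SETACTIVEWNDTRKTIMEOUT, 0,
                             (PVOID)(ULONG_PTR)delayMs, SPIF_SENDCHANGE);
    }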

    It was a total rogue feature.

  • The Old New Thing

    How do you deal with an input stream that may or may not contain Unicode data?

    • 27 Comments

    Dewi Morgan reinterpreted a question from a Suggestion Box of times past as "How do you deal with an input stream that may or may not contain Unicode data?" A related question from Dave wondered how applications that use CP_ACP to store data could ensure that the data is interpreted in the same code page by the recipient. "If I send a .txt file to a person in China, do they just go through code pages until it seems to display correctly?"

    These questions are additional manifestations of Keep your eye on the code page.

    When you store data, you need to have some sort of agreement (either explicit or implicit) with the code that reads the data as to how the data should be interpreted. Are they four-byte sign-magnitude integers stored in big-endian format? Are they two-byte ones-complement signed integers stored in little-endian format? Or maybe they are IEEE floating-point data stored in 80-bit format. If there is no agreement between the two parties, then confusion will ensue.

    That your data consists of text does not exempt you from this requirement. Is the text encoded in UTF-16LE? Or maybe it's UTF-8. Or perhaps it's in some other 8-bit character set. If the two sides don't agree, then there will be confusion.

    In the case of files encoded in CP_ACP, you have a problem if the source and destination have different values for CP_ACP. That text file you generate on a US-English system (where CP_ACP is 1252) may not make sense when decoded on a Chinese-Simplified system (where CP_ACP is 936). It so happens that all Windows 8-bit code pages agree on code points 0 through 127, so if you restrict yourself to that set, you are safe. The Windows shell team was not so careful, and they slipped some characters into a header file which are illegal when decoded in code page 932 (the CP_ACP used in Japan). The systems in Japan do not cycle through all the code pages looking for one that decodes without errors; they just use their local value of CP_ACP, and if the file makes no sense, then I guess it makes no sense.

    If you are in the unfortunate situation of having to consume data where the encoding is unspecified, you will find yourself forced to guess. And if you guess wrong, the result can be embarrassing.

    Bonus chatter: I remember one case where a customer asked, "We need to convert a string of chars into a string of wchars. What code page should we pass to the Multi­Byte­To­Wide­Char function?"

    I replied, "What code page is your char string in?"

    There was no response. I guess they realized that once they answered that question, they had their answer.
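
    To see why that question answers itself, here is a minimal sketch (my illustration, not from the article): the same bytes decode into different text depending on which code page you pass, so the only correct answer is "whichever code page the producer of the string actually used."

    #include <windows.h>
    #include <stdio.h>

    int main()
    {
        // 0x82 0xA0 is one double-byte character in code page 932 (Japanese),
        // but two unrelated single-byte characters in code page 1252.
        const char bytes[] = "\x82\xA0";
        wchar_t wide[8];

        int n1252 = MultiByteToWideChar(1252, 0, bytes, -1, wide, 8);
        int n932  = MultiByteToWideChar(932,  0, bytes, -1, wide, 8);

        // Counts include the terminating null: 3 wide chars vs. 2 wide chars.
        printf("as 1252: %d wide chars, as 932: %d wide chars\n", n1252, n932);
        return 0;
    }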

  • The Old New Thing

    When you transfer control across stack frames, all the frames in between need to be in on the joke

    • 13 Comments

    Chris Hill suggests discussing the use of structured exception handling as it relates to the window manager, and specifically the implications for applications which raise exceptions from a callback.

    If you plan on raising an exception and handling it in a function higher up the stack, all the stack frames in between need to be in on your little scheme, because they need to be able to unwind. (And I don't mean "unwind" in the "have a beer and watch some football" sense of "unwind".)

    If you wrote all the code in between the point the exception is raised and the point it is handled, then you're in good shape, because at least then you have a chance of making sure they all unwind properly. This means either using RAII techniques (and possibly compiling with the /EHa flag to convert asynchronous exceptions to synchronous ones, so that Win32 exceptions will also trigger unwind; although that has its own problems since the C++ exception model is synchronous, not asynchronous) or judiciously using try/finally (or whatever equivalent exists in your programming language of choice) to clean up resources in the event of an unwind.
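
    As a minimal sketch of the try/finally style (mine, not from the article), using the MSVC structured exception handling keywords: if the guarded body raises an exception that unwinds through this frame, the __finally block still runs, so the critical section is not orphaned. The callback parameter is a stand-in for whatever risky work you are wrapping.

    #include <windows.h>

    CRITICAL_SECTION g_cs; // assume somebody initialized this

    void UpdateSharedStateCarefully(void (*riskyCallback)(void))
    {
        EnterCriticalSection(&g_cs);
        __try {
            riskyCallback();              // may raise and unwind past this frame
        } __finally {
            LeaveCriticalSection(&g_cs);  // runs on normal exit and during unwind
        }
    }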

    But if you don't control all the frames in between, then you can't really guarantee that they were written in the style you want.

    In Win32, exceptions are considered to be horrific situations that usually indicate some sort of fatal error. There may be some select cases where exceptions can be handled, but those are more the unusual cases than the rule. Most of the time, an exception means that something terrible has happened and you're out of luck. The best you can hope for at this point is a controlled crash landing.

    As a result of this overall mindset, Win32 code doesn't worry too much about recovering from exceptions. If an exception happens, then it means your process is already toast and there's no point trying to fix it, because that would be trying to reason about a total breakdown of normal functioning. As a general rule generic Win32 code is not exception-safe.

    Consider a function like this:

    struct BLORP
    {
        int Type;
        int Count;
        int Data;
    };
    
    CRITICAL_SECTION g_csGlobal; // assume somebody initialized this
    BLORP g_Blorp; // protected by g_csGlobal
    
    void SetCurrentBlorp(const BLORP *pBlorp)
    {
        EnterCriticalSection(&g_csGlobal);
        g_Blorp = *pBlorp;
        LeaveCriticalSection(&g_csGlobal);
    }
    
    void GetCurrentBlorp(BLORP *pBlorp)
    {
        EnterCriticalSection(&g_csGlobal);
        *pBlorp = g_Blorp;
        LeaveCriticalSection(&g_csGlobal);
    }
    

    These are perfectly fine-looking functions from a traditional Win32 standpoint. They take a critical section, copy some data, and leave the critical section. The only thing¹ that could go wrong is that the caller passed a bad pointer. If that happens, a STATUS_ACCESS_VIOLATION exception is raised, and the application dies.

    But what if your program decides to handle the access violation? Maybe pBlorp points into a memory-mapped file, and there is an I/O error paging the memory in, say, because it's a file on the network and there was a network hiccup. Now you have two problems: The critical section is orphaned, and the data is only partially copied. (The partial-copy case happens if pBlorp points to a BLORP that straddles a page boundary, where the first page is valid but the second page isn't.) Just converting this code to RAII solves the first problem, but it doesn't solve the second, which is kind of bad because the second problem is what the critical section was trying to prevent from happening in the first place!

    void SetCurrentBlorp(const BLORP *pBlorp)
    {
        CriticalSectionLock lock(&g_csGlobal);
        g_Blorp = *pBlorp;
    }
    
    void GetCurrentBlorp(BLORP *pBlorp)
    {
        CriticalSectionLock lock(&g_csGlobal);
        *pBlorp = g_Blorp;
    }
    

    Suppose somebody calls Set­Current­Blorp with a BLORP whose Type and Count are in readable memory, but whose Data is not. The code enters the critical section, copies the Type and Count, but crashes when it tries to copy the Data, resulting in a STATUS_ACCESS_VIOLATION exception. Now suppose that somebody unwisely decides to handle this exception. The RAII code releases the critical section (assuming that you compiled with /EHa), but there's no code to try to patch up the now-corrupted g_Blorp. Since the critical section was probably added to prevent g_Blorp from getting corrupted, the result is that the thing you tried to protect against ended up happening anyway.

    Okay, that was a bit of a digression. The point is that unless everybody between the point the exception is raised and the point the exception is handled is in on the joke, you are unlikely to escape fully unscathed. This is particularly true in the generalized Win32 case, since it is perfectly legal to write Win32 code in languages other than C++, as long as you adhere to the Win32 ABI. (I'm led to believe that Visual Basic is still a popular language.)

    There are a lot of ways of getting stack frames beyond your control between the point the exception is raised and the point it is handled. For example, you might call Enum­Windows and raise an exception in the callback function and try to catch it in the caller. Or you might raise an exception in a window procedure and try to catch it in your message loop. Or you might try to longjmp out of a window procedure. All of these end up raising an exception and catching it in another frame. And since you don't control all the frames in between, you can't guarantee that they are all prepared to resume execution in the face of an exception.
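
    For comparison, here is a minimal sketch (my illustration, not from the article) of the conventional alternative for the Enum­Windows case: instead of raising an exception in the callback and catching it in the caller, stop the enumeration by returning FALSE and pass the result back through the context structure, so no exception ever has to travel across frames you don't control.

    #include <windows.h>

    struct FindWindowContext
    {
        DWORD processId; // what we are looking for
        HWND  found;     // result; NULL if not found
    };

    static BOOL CALLBACK FindWindowProc(HWND hwnd, LPARAM lParam)
    {
        FindWindowContext *context = reinterpret_cast<FindWindowContext *>(lParam);
        DWORD pid = 0;
        GetWindowThreadProcessId(hwnd, &pid);
        if (pid == context->processId) {
            context->found = hwnd;
            return FALSE; // stop enumerating; no exception crosses any frame
        }
        return TRUE;      // keep looking
    }

    HWND FindTopLevelWindowForProcess(DWORD processId)
    {
        FindWindowContext context = { processId, NULL };
        EnumWindows(FindWindowProc, reinterpret_cast<LPARAM>(&context));
        return context.found;
    }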

    Bonus reading: My colleague Paul Betts has written up a rather detailed study of one particular instance of this phenomenon.

    ¹Okay, another thing that could go wrong is that somebody calls Terminate­Thread on the thread, but whoever did that knew they were corrupting the process.
