January, 2005

  • The Old New Thing

    Cleaner, more elegant, and harder to recognize

    • 116 Comments

    It appears that some people interpreted the title of one of my rants from many months ago, "Cleaner, more elegant, and wrong", to be a reference to exceptions in general. (See bibliography reference [35]; observe that the citer even changed the title of my article for me!)

    The title of the article was a reference to a specific code snippet that I copied from a book, where the book's author claimed that the code he presented was "cleaner and more elegant". I was pointing out that the code fragment was not only cleaner and more elegant, it was also wrong.

    You can write correct exception-based programming.

    Mind you, it's hard.

    On the other hand, just because something is hard doesn't mean that it shouldn't be done.

    Here's a breakdown:

    Really easy:  Writing bad error-code-based code
                  Writing bad exception-based code
    Hard:         Writing good error-code-based code
    Really hard:  Writing good exception-based code

    It's easy to write bad code, regardless of the error model.

    It's hard to write good error-code-based code since you have to check every error code and think about what you should do when an error occurs.

    It's really hard to write good exception-based code since you have to check every single line of code (indeed, every sub-expression) and think about what exceptions it might raise and how your code will react to them. (In C++ it's not quite so bad because C++ exceptions are raised only at specific points during execution. In C#, exceptions can be raised at any time.)

    But that's okay. Like I said, just because something is hard doesn't mean it shouldn't be done. It's hard to write a device driver, but people do it, and that's a good thing.

    But here's another table:

    Really easy:  Recognizing that error-code-based code is badly-written
                  Recognizing the difference between bad error-code-based code
                  and not-bad error-code-based code
    Hard:         Recognizing that error-code-based code is not badly-written
    Really hard:  Recognizing that exception-based code is badly-written
                  Recognizing that exception-based code is not badly-written
                  Recognizing the difference between bad exception-based code
                  and not-bad exception-based code

    Here's some imaginary error-code-based code. See if you can classify it as "bad" or "not-bad":

    BOOL ComputeChecksum(LPCTSTR pszFile, DWORD* pdwResult)
    {
      HANDLE h = CreateFile(pszFile, GENERIC_READ, FILE_SHARE_READ,
           NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
      HANDLE hfm = CreateFileMapping(h, NULL, PAGE_READONLY, 0, 0, NULL);
      void *pv = MapViewOfFile(hfm, FILE_MAP_READ, 0, 0, 0);
      DWORD dwHeaderSum;
      CheckSumMappedFile(pv, GetFileSize(h, NULL),
               &dwHeaderSum, pdwResult);
      UnmapViewOfFile(pv);
      CloseHandle(hfm);
      CloseHandle(h);
      return TRUE;
    }
    

    This code is obviously bad. No error codes are checked. This is the sort of code you might write when in a hurry, meaning to come back to and improve later. And it's easy to spot that this code needs to be improved big time before it's ready for prime time.

    Here's another version:

    BOOL ComputeChecksum(LPCTSTR pszFile, DWORD* pdwResult)
    {
      BOOL fRc = FALSE;
      HANDLE h = CreateFile(pszFile, GENERIC_READ, FILE_SHARE_READ,
           NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
      if (h != INVALID_HANDLE_VALUE) {
        HANDLE hfm = CreateFileMapping(h, NULL, PAGE_READONLY, 0, 0, NULL);
        if (hfm) {
          void *pv = MapViewOfFile(hfm, FILE_MAP_READ, 0, 0, 0);
          if (pv) {
            DWORD dwHeaderSum;
            if (CheckSumMappedFile(pv, GetFileSize(h, NULL),
                                   &dwHeaderSum, pdwResult)) {
              fRc = TRUE;
            }
            UnmapViewOfFile(pv);
          }
          CloseHandle(hfm);
        }
        CloseHandle(h);
      }
      return fRc;
    }
    

    This code is still wrong, but it clearly looks like it's trying to be right. It is what I call "not-bad".

    Now here's some exception-based code you might write in a hurry:

    NotifyIcon CreateNotifyIcon()
    {
     NotifyIcon icon = new NotifyIcon();
     icon.Text = "Blah blah blah";
     icon.Visible = true;
     icon.Icon = new Icon(GetType(), "cool.ico");
     return icon;
    }
    

    (This is actual code from a real program in an article about taskbar notification icons, with minor changes in a futile attempt to disguise the source.)

    Here's what it might look like after you fix it to be correct in the face of exceptions:

    NotifyIcon CreateNotifyIcon()
    {
     NotifyIcon icon = new NotifyIcon();
     icon.Text = "Blah blah blah";
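     // Set Visible last: if the Icon constructor throws, the icon
     // never becomes visible, so nothing is left stranded on the screen.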
     icon.Icon = new Icon(GetType(), "cool.ico");
     icon.Visible = true;
     return icon;
    }
    

    Subtle, isn't it.

    It's easy to spot the difference between bad error-code-based code and not-bad error-code-based code: The not-bad error-code-based code checks error codes. The bad error-code-based code never does. Admittedly, it's hard to tell whether the errors were handled correctly, but at least you can tell the difference between bad code and code that isn't bad. (It might not be good, but at least it isn't bad.)

    On the other hand, it is extraordinarily difficult to see the difference between bad exception-based code and not-bad exception-based code.

    Consequently, when I write code that is exception-based, I do not have the luxury of writing bad code first and then making it not-bad later. If I did that, I wouldn't be able to find the bad code again, since it looks almost identical to not-bad code.

    My point isn't that exceptions are bad. My point is that exceptions are too hard and I'm not smart enough to handle them. (And neither, it seems, are book authors, even when they are trying to teach you how to program with exceptions!)

    (Yes, there are programming models like RAII and transactions, but rarely do you see sample code that uses either.)
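    For the curious, here is a sketch of what the earlier ComputeChecksum example might look like with RAII. The HandleGuard and ViewGuard classes are hypothetical illustrations invented for this sketch, not standard types:

    #include <windows.h>
    #include <imagehlp.h> // CheckSumMappedFile; link with imagehlp.lib

    // Hypothetical guard: closes a kernel handle at end of scope.
    struct HandleGuard {
      HANDLE h;
      explicit HandleGuard(HANDLE h) : h(h) {}
      ~HandleGuard() { if (h && h != INVALID_HANDLE_VALUE) CloseHandle(h); }
      operator HANDLE() const { return h; }
    };

    // Hypothetical guard: unmaps a file view at end of scope.
    struct ViewGuard {
      void *pv;
      explicit ViewGuard(void *pv) : pv(pv) {}
      ~ViewGuard() { if (pv) UnmapViewOfFile(pv); }
    };

    BOOL ComputeChecksum(LPCTSTR pszFile, DWORD* pdwResult)
    {
      HandleGuard h(CreateFile(pszFile, GENERIC_READ, FILE_SHARE_READ,
           NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL));
      if (h == INVALID_HANDLE_VALUE) return FALSE;
      HandleGuard hfm(CreateFileMapping(h, NULL, PAGE_READONLY, 0, 0, NULL));
      if (!hfm.h) return FALSE;
      ViewGuard view(MapViewOfFile(hfm, FILE_MAP_READ, 0, 0, 0));
      if (!view.pv) return FALSE;
      DWORD dwHeaderSum;
      // The guards release the view and both handles on every return path,
      // even if something in between throws.
      return CheckSumMappedFile(view.pv, GetFileSize(h, NULL),
               &dwHeaderSum, pdwResult) != NULL;
    }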

  • The Old New Thing

    Why did the Win64 team choose the LLP64 model?

    • 110 Comments

    Over on Channel 9, member Beer28 wrote, "I can't imagine there are too many problems with programs that have type widths changed." I got a good chuckle out of that and made a note to write up an entry on the Win64 data model.

    The Win64 team selected the LLP64 data model, in which all integral types remain 32-bit values and only pointers expand to 64-bit values. Why?

    In addition to the reasons given on that web page, another reason is that doing so avoids breaking persistence formats. For example, part of the header data for a bitmap file is defined by the following structure:

    typedef struct tagBITMAPINFOHEADER {
            DWORD      biSize;
            LONG       biWidth;
            LONG       biHeight;
            WORD       biPlanes;
            WORD       biBitCount;
            DWORD      biCompression;
            DWORD      biSizeImage;
            LONG       biXPelsPerMeter;
            LONG       biYPelsPerMeter;
            DWORD      biClrUsed;
            DWORD      biClrImportant;
    } BITMAPINFOHEADER, FAR *LPBITMAPINFOHEADER, *PBITMAPINFOHEADER;
    

    If a LONG expanded from a 32-bit value to a 64-bit value, it would not be possible for a 64-bit program to use this structure to parse a bitmap file.
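    To make the sizes concrete, here is a minimal sketch in C++11 terms, assuming a 64-bit Windows (LLP64) target:

    // Integral types keep their 32-bit sizes; only pointers widen. Under
    // LP64 (most 64-bit UNIX systems) the "long" assertion would fail.
    static_assert(sizeof(int)   == 4, "int is 32-bit in both models");
    static_assert(sizeof(long)  == 4, "LLP64 only; LP64 makes this 8");
    static_assert(sizeof(void*) == 8, "pointers are 64-bit in both models");
    // Consequence: LONG (a typedef for long) is still four bytes, so the
    // on-disk layout of BITMAPINFOHEADER above is unchanged.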

    There are persistence formats other than files. In addition to the obvious things like RPC and DCOM, registry binary blobs and shared memory blocks can also be used to transfer information between processes. If the source and destination processes are of different bitness, any change to the integer sizes would result in a mismatch.

    Notice that in these inter-process communication scenarios, we don't have to worry as much about the effect of a changed pointer size. Nobody in their right mind would transfer a pointer across processes: Separate address spaces mean that the pointer value is useless in any process other than the one that generated it, so why share it?

  • The Old New Thing

    The importance of error code backwards compatibility

    • 83 Comments

    I remember a bug report that came in on an old MS-DOS program (from a company that is still in business, so don't ask me to identify them) that attempted to open the file "". That's the file with no name.

    This returned error 2 (file not found). But the program didn't check the error code and thought that 2 was the file handle. It then began writing data to handle 2, which ended up going to the screen because handle 2 is the standard error handle, which by default goes to the screen.

    It so happened that this program wanted to print the message to the screen anyway.

    In other words, this program worked completely by accident.
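    In code, the pattern was something like this hypothetical reconstruction (not the vendor's actual code; dos_open and dos_write stand in for the INT 21h open and write calls):

    int dos_open(const char *path);  // returns a handle, or an error code
    int dos_write(int handle, const void *buf, int len);

    void ShowMessage(const char *msg, int len)
    {
      int handle = dos_open("");   // fails: returns error 2 (file not found)
      // Missing error check: the error code is now used as a file handle.
      dos_write(handle, msg, len); // handle 2 = stderr = the screen,
                                   // so the message appears -- by accident
    }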

    Due to various changes to the installable file system in Windows 95, the error code for attempting to open the null file changed from 2 (file not found) to 3 (path not found) as a side-effect.

    Watch what happens.

    The program tries to open the file "". Now it gets error 3 back. It mistakenly treats the 3 as a file handle and writes to it.

    What is handle 3?

    The standard MS-DOS file handles are as follows:

    handle   name     meaning
    0        stdin    standard input
    1        stdout   standard output
    2        stderr   standard error
    3        stdaux   standard auxiliary (serial port)
    4        stdprn   standard printer

    What happens when the program writes to handle 3?

    It tries to write to the serial port.

    Most computers don't have anything hooked up to the serial port. The write hangs.

    Result: Dead program.

    The file system folks had to tweak their parameter validation so they returned error 2 in this case.

  • The Old New Thing

    User interface design for vending machines

    • 75 Comments

    How hard can it be to design the user interface of a vending machine?

    You accept money, you have some buttons, the user pushes the button, they get their product and their change.

    At least in the United States, many vending machines arrange their product in rows and columns (close-up view). To select a product, you type the letter of the row and the number of the column. Could it be any simpler?

    Take a closer look at that vending machine design. Do you see the flaw?

    (Ignore the fact that the picture is a mock-up and repeats row C over and over again.)

    The columns are labelled 1 through 10. That means that if you want to buy product C10, you have to push the buttons "C" and "10". But in our modern keyboard-based world, there is no "10" key. Instead, people type "1" followed by "0".

    What happens if you type "C"+"1"+"0"? After you type the "1", product C1 drops. Then you realize that there is no "0" key. And you bought the wrong product.

    This is not a purely theoretical problem. I have seen this happen myself.

    How would you fix this?

    One solution is simply not to put so many items on a single row, considering that people have difficulty making decisions if given too many options. On the other hand, the vendor might not like that design, since their goal is to maximize the number of products.

    Another solution is to change the labels so that there are no items where the number of button presses needed does not match the number of characters in the label. In other words, no buttons with two characters on them (like the "10" button).

    For example, switch the rows and columns, so that the products are labelled "1A" through "1J" across the top row, and "9A" through "9J" across the bottom. This assumes you don't have more than nine rows. (This won't work for super size vending machines - look at the buttons on that thing; they go up to "U"!)

    You can see another solution in that most recent vending machine: Instead of calling the tenth column "10", call it "0". Notice that they also removed rows "I" and "O" to avoid possible confusion with "1" and "0".

    A colleague of mine pointed out that some vending machines use numeric codes for all items rather than a letter and a digit. For example, if the cookies are product number 23, you punch "2" "3". If you want the chewing gum (product code 71), you punch "7" "1". He poses the following question:

    What are some problems with having your products numbered from 1 to 99?

    Answers next time.

  • The Old New Thing

    User interface design for interior door locks

    • 75 Comments

    How hard can it be to design the user interface of an interior door lock?

    Locking or unlocking the door from the inside is typically done with a latch that you turn. Often, the latch handle is in the shape of a bar that turns.

    Now, there are two possible ways you can set up your lock. One is that a horizontal bar represents the locked position and a vertical bar represents the unlocked position. The other is to have a horizontal bar represent the unlocked position and a vertical bar represent the locked position.

    For some reason, it seems that most lock designers went for the latter interpretation. A horizontal bar means unlocked.

    This is wrong.

    Think about what the bar represents. When the deadbolt is locked, a horizontal bar extends from the door into the door jamb. Clearly, the horizontal bar position should recapitulate the horizontal position of the deadbolt. It also resonates with the old-fashioned way of locking a door by placing a wooden or metal bar horizontally across the face of the door. (Does no one say "bar the door" any more?)

    Car doors even followed this convention, back when car door locks were little knobs that popped up and down. The up position represented the removal of the imaginary deadbolt from the door/jamb interface. Pushing the button down was conceptually the same as sliding the deadbolt into the locked position.

    But now, many car door locks don't use knobs. Instead, they use rocker switches. (Forwards means lock. Or is it backwards? What is the intuition there?) The visual indicator of the door lock is a red dot. But what does it mean? Red clearly means "danger", so is it more dangerous to have a locked door or an unlocked door? I can never remember; I always have to tug on the door handle.

    (Horizontally-mounted power window switches have the same problem. Does pushing the switch forwards raise the window or lower it?)

  • The Old New Thing

    Taskbar notification balloon tips don't penalize you for being away from the keyboard

    • 70 Comments

    The Shell_NotifyIcon function is used to do various things, among them, displaying a balloon tip to the user. As discussed in the documentation for the NOTIFYICONDATA structure, the uTimeout member specifies how long the balloon should be displayed.

    But what if the user is not at the computer when you display your balloon? After 30 seconds, the balloon will time out, and the user will have missed your important message!

    Never fear. The taskbar keeps track of whether the user is using the computer (with the help of the GetLastInputInfo function) and doesn't "run the clock" if it appears that the user isn't there. You will get your 30 seconds of "face time" with the user.

    But what if you want to time out your message even if the user isn't there?

    You actually have the information available to you to solve this puzzle on the web pages I linked above. See if you can put the pieces together and come up with a better solution than simulating a click on the balloon. (Hint: Look carefully at what it means if you set your balloon text to an empty string.)
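    (If you give up, here is a sketch of the sort of thing the hint is pointing at. Call it from your own timer to take the balloon down on your own schedule; hwnd and uID are assumed to identify your existing notification icon.)

    #include <windows.h>
    #include <shellapi.h>

    void DismissBalloon(HWND hwnd, UINT uID)
    {
      NOTIFYICONDATA nid = { sizeof(nid) };
      nid.hWnd = hwnd;
      nid.uID = uID;
      nid.uFlags = NIF_INFO;
      nid.szInfo[0] = TEXT('\0'); // empty balloon text = no balloon
      Shell_NotifyIcon(NIM_MODIFY, &nid);
    }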

    And what if you want your message to stay on the screen longer than 30 seconds?

    You can't. The notification area enforces a 30-second limit for any single balloon. Because if the user hasn't done anything about it for 30 seconds, they probably aren't interested. If your message is so critical that the user shouldn't be allowed to ignore it, then don't use a notification balloon. Notification balloons are for non-critical, transient messages to the user.

  • The Old New Thing

    How can code that tries to prevent a buffer overflow end up causing one?

    • 65 Comments

    If you read your language specification, you'll find that the ...ncpy functions have extremely strange semantics.

    The strncpy function copies the initial count characters of strSource to strDest and returns strDest. If count is less than or equal to the length of strSource, a null character is not appended automatically to the copied string. If count is greater than the length of strSource, the destination string is padded with null characters up to length count.

    In pictures, here's what happens in various string copying scenarios.

    strncpy(strDest, strSrc, 5)

      strSource:  W e l c o m e \0
      strDest:    W e l c o              (observe no null terminator)

    strncpy(strDest, strSrc, 5)

      strSource:  H e l l o \0
      strDest:    H e l l o              (observe no null terminator)

    strncpy(strDest, strSrc, 5)

      strSource:  H i \0
      strDest:    H i \0 \0 \0           (observe null padding to end of strDest)
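    The same three scenarios in code, as a quick sketch:

    #include <string.h>

    void demo(void)
    {
      char dest[5];
      strncpy(dest, "Welcome", 5); // 'W','e','l','c','o' -- no terminator
      strncpy(dest, "Hello",   5); // 'H','e','l','l','o' -- no terminator
      strncpy(dest, "Hi",      5); // 'H','i',0,0,0       -- null-padded
    }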

    Why do these functions have such strange behavior?

    Go back to the early days of UNIX. Personally, I only go back as far as System V. In System V, file names could be up to 14 characters long. Anything longer was truncated to 14. And the field for storing the file name was exactly 14 characters. Not 15. The null terminator was implied. This saved one byte.

    Here are some file names and their corresponding directory entries:

    passwd                  p  a  s  s  w  d  \0 \0 \0 \0 \0 \0 \0 \0
    newsgroups.old          n  e  w  s  g  r  o  u  p  s  .  o  l  d
    newsgroups.old.backup   n  e  w  s  g  r  o  u  p  s  .  o  l  d

    Notice that newsgroups.old and newsgroups.old.backup are actually the same file name, due to truncation. The too-long name was silently truncated; no error was raised. This has historically been the source of unintended data loss bugs.

    The strncpy function was used by the file system to store the file name into the directory entry. This explains one part of the odd behavior of strncpy, namely why it does not null-terminate when the destination fills. The null terminator was implied by the end of the array. (It also explains the silent file name truncation behavior.)

    But why null-pad short file names?

    Because that makes scanning for file names faster. If you guarantee that all the "garbage bytes" are null, then you can use memcmp to compare them.
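    A sketch of the technique (hypothetical code, not actual System V source): with the null padding guaranteed, a directory lookup can be a single fixed-length memcmp instead of a character-by-character loop.

    #include <string.h>

    #define NAMELEN 14

    int NameMatches(const char entry[NAMELEN], const char *sought)
    {
      char key[NAMELEN];
      strncpy(key, sought, NAMELEN); // truncates long names, pads short ones
      return memcmp(entry, key, NAMELEN) == 0;
    }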

    For compatibility reasons, the C language committee decided to carry forward this quirky behavior of strncpy.

    So what about the title of this entry? How did code that tried to prevent a buffer overflow end up causing one?

    Here's one example. (Sadly I don't read Japanese, so I am operating only from the code.) Observe that it uses _tcsncpy to fill the lpstrFile and lpstrFileTitle buffers, being careful not to overflow them. That's great, but it also leaves off the null terminator if the string is too long. The caller may very well copy the result out of that buffer to a second buffer. But the lpstrFile buffer lacks a proper null terminator and therefore exceeds the length the caller specified. Result: Second buffer overflows.

    Here's another example. Observe that the function uses _tcsncpy to copy the result into the output buffer. This author was mindful of the quirky behavior of the strncpy family of functions and manually slapped a null terminator in at the end of the buffer.

    But what if cchTextMax = 0? Then the attempt to force a null terminator writes before the beginning of the array and corrupts a random character.
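    A sketch of a defensive pattern that avoids both failure modes: always null-terminate, and tolerate a zero-sized destination. (CopyString is a hypothetical helper written for this illustration, not a standard function.)

    #include <string.h>

    void CopyString(char *pszDest, size_t cchDest, const char *pszSrc)
    {
      if (cchDest == 0) return;         // nothing we can safely write
      strncpy(pszDest, pszSrc, cchDest - 1);
      pszDest[cchDest - 1] = '\0';      // terminate even on truncation
    }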

    What's the conclusion of all this? Personally, my conclusion is simply to avoid strncpy and all its friends if you are dealing with null-terminated strings. Despite the "str" in the name, these functions do not produce null-terminated strings. They convert a null-terminated string into a raw character buffer. Using them where a null-terminated string is expected is just plain wrong. Not only do you fail to get proper null termination if the source is too long, but if the source is short you get unnecessary null padding.

  • The Old New Thing

    Why doesn't \\ autocomplete to all the computers on the network?

    • 63 Comments

    Wes Haggard wishes that \\ would autocomplete to all the computers on the network. [Link fixed 10am.] An early beta of Windows 95 actually did something similar to this, showing all the computers on the network when you opened the Network Neighborhood folder. And the feature was quickly killed.

    Why?

    Corporations with large networks were having conniptions because needlessly enumerating all the machines on the network can bring a large network to its knees. Think about all the times you type "\\". Now imagine if every single time you did that, Explorer started enumerating all the machines on the network. And imagine how your network administrator would feel if their network were saturated with enumeration traffic every time you did that.

    Network administrators made it clear in no uncertain terms that having Windows casually enumerate all the machines on their LAN was totally unacceptable.

    The needs of the corporate environment are very different from those of the home network, and Windows needs to operate in both worlds.

  • The Old New Thing

    How did MS-DOS report error codes?

    • 62 Comments

    The old MS-DOS function calls (ah, int 21h) typically indicated errors by returning with carry set and putting the error code in the AX register. These error codes will look awfully familiar today: They are the same error codes that Windows uses. All the small-valued error codes like ERROR_FILE_NOT_FOUND go back to MS-DOS (and possibly even further back).

    Error code numbers are a major compatibility problem, because you cannot easily add new error code numbers without breaking existing programs. For example, it became well-known that "The only errors that can be returned from a failed call to OpenFile are 3 (path not found), 4 (too many open files), and 5 (access denied)." If MS-DOS ever returned an error code not on that list, programs would crash because they used the error number as an index into a function table without doing a range check first. Returning a new error like 32 (sharing violation) meant that the programs would jump to a random address and die.
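    A hypothetical reconstruction of that fragile pattern (not any actual program's code): the error number indexes a handler table with no range check.

    typedef void (*ERRORHANDLER)(void);

    void OnPathNotFound(void) { /* ... */ }
    void OnTooManyFiles(void) { /* ... */ }
    void OnAccessDenied(void) { /* ... */ }

    ERRORHANDLER g_handlers[6] = {
      0, 0, 0, OnPathNotFound, OnTooManyFiles, OnAccessDenied
    };

    void HandleOpenFileError(int err)
    {
      g_handlers[err](); // err = 32 (sharing violation) reads far past the
                         // end of the table and jumps to a random address
    }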

    More about error number compatibility next time.

    When it became necessary to add new error codes, compatibility demanded that the error codes returned by the functions not change. Therefore, if a new type of error occurred (for example, a sharing violation), one of the previous "well-known" error codes was selected that had the most similar meaning and that was returned as the error code. (For "sharing violation", the best match is probably "access denied".) Programs which were "in the know" could call a new function called "get extended error" which returned one of the newfangled error codes (in this case, 32 for sharing violation).

    The "get extended error" function returned other pieces of information. It gave you an "error class" which gave you a vague idea of what type of problem it is (out of resources? physical media failure? system configuration error?), an "error locus" which told you what type of device caused the problem (floppy? serial? memory?), and what I found to be the most interesting meta-information, the "suggested action". Suggested actions were things like "pause, then retry" (for temporary conditions), "ask user to re-enter input" (for example, file not found), or even "ask user for remedial action" (for example, check that the disk is properly inserted).

    The purpose of these meta-error values is to allow a program to recover when faced with an error code it doesn't understand. You could at least follow the meta-data to have an idea of what type of error it was (error class), where the error occurred (error locus), and what you probably should do in response to it (suggested action).

    Sadly, this type of rich error information was lost when 16-bit programming was abandoned. Now you get an error code or an exception and you'd better know what to do with it. For example, if you call some function and an error comes back, how do you know whether the error was a logic error in your program (using a handle after closing it, say) or was something that is externally-induced (for example, remote server timed out)? You don't.

    This is particularly gruesome for exception-based programming. When you catch an exception, you can't tell by looking at it whether it's something that genuinely should crash the program (due to an internal logic error - a null reference exception, for example) or something that does not betray any error in your program but was caused externally (connection failed, file not found, sharing violation).

  • The Old New Thing

    A rant against flow control macros

    • 53 Comments

    I try not to rant, but it happens sometimes. This time, I'm ranting on purpose: to complain about macro-izing flow control.

    No two people use the same macros, and when you see code that uses them you have to go dig through header files to figure out what they do.

    This is particularly gruesome when you're trying to debug a problem with some code that somebody else wrote. For example, say you see a critical section entered and you want to make sure that all code paths out of the function release the critical section. It would normally be as simple as searching for "return" and "goto" inside the function body, but if the author of the program hid those operations behind macros, you would miss them.

    HRESULT SomeFunction(Block *p)
    {
     HRESULT hr;
     EnterCriticalSection(&g_cs);
     VALIDATE_BLOCK(p);
     MUST_SUCCEED(p->DoSomething());
     if (andSomethingElse) {
      LeaveCriticalSection(&g_cs);
      TRAP_FAILURE(p->DoSomethingElse());
      EnterCriticalSection(&g_cs);
     }
     hr = p->DoSomethingAgain();
    Cleanup:
     LeaveCriticalSection(&g_cs);
     return hr;
    }
    

    [Update: Fixed missing parenthesis in code that was never meant to be compiled anyway. Some people are so picky. - 10:30am]

    Is the critical section leaked? What happens if the block fails to validate? If DoSomethingElse fails, does DoSomethingAgain get called? What's with that unused "Cleanup" label? Is there a code path that leaves the "hr" variable uninitialized?

    You won't know until you go dig up the header file that defined the VALIDATE_BLOCK, TRAP_FAILURE, and MUST_SUCCEED macros.

    (Yes, the critical section question could be avoided by using a lock object with destructor, but that's not my point. Note also that this function temporarily exits the critical section. Most lock objects don't support that sort of thing, though it isn't usually that hard to add, at the cost of a member variable.)
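    For reference, here is a sketch of such a lock object (a hypothetical class, not a standard one): the destructor releases the critical section on every code path, and Leave/Reenter support the temporary exit seen in the function above.

    #include <windows.h>

    class CritSecLock {
      CRITICAL_SECTION *m_pcs;
      bool m_held;
    public:
      explicit CritSecLock(CRITICAL_SECTION *pcs) : m_pcs(pcs), m_held(true)
        { EnterCriticalSection(m_pcs); }
      void Leave()   { if (m_held) { LeaveCriticalSection(m_pcs); m_held = false; } }
      void Reenter() { if (!m_held) { EnterCriticalSection(m_pcs); m_held = true; } }
      ~CritSecLock() { Leave(); } // every return path releases the lock
    };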

    When you create a flow-control macro, you're modifying the language. When I fire up an editor on a file whose name ends in ".cpp" I expect that what I see will be C++ and not some strange dialect that strongly resembles C++ except in the places where it doesn't. (For this reason, I'm pleased that C# doesn't support macros.)

    People who still prefer flow-control macros should be sentenced to maintaining the original Bourne shell. Here's a fragment:

    ADDRESS	alloc(nbytes)
        POS	    nbytes;
    {
        REG POS	rbytes = round(nbytes+BYTESPERWORD,BYTESPERWORD);
    
        LOOP    INT	    c=0;
    	REG BLKPTR  p = blokp;
    	REG BLKPTR  q;
    	REP IF !busy(p)
    	    THEN    WHILE !busy(q = p->word) DO p->word = q->word OD
    		IF ADR(q)-ADR(p) >= rbytes
    		THEN	blokp = BLK(ADR(p)+rbytes);
    		    IF q > blokp
    		    THEN    blokp->word = p->word;
    		    FI
    		    p->word=BLK(Rcheat(blokp)|BUSY);
    		    return(ADR(p+1));
    		FI
    	    FI
    	    q = p; p = BLK(Rcheat(p->word)&~BUSY);
    	PER p>q ORF (c++)==0 DONE
    	addblok(rbytes);
        POOL
    }
    

    Back in its day, this code was held up as an example of "death by macros", code that relied so heavily on macros that nobody could understand it. What's scary is that by today's standards, it's quite tame.

    (This rant is a variation on one of my earlier rants, if you think about it. Exceptions are a form of nonlocal control flow.)
