January, 2007

  • The Old New Thing

    What('s) a character!

    • 53 Comments

    Norman Diamond seems to have made a side career of harping on this topic on a fairly regular basis, although he never comes out and says that this is what he's complaining about. He just assumes everybody knows. (This usually leads to confusion, as you can see from the follow-ups.)

    Back in the ANSI days, terminology was simpler. Windows operated on CHARs, which are one byte in size. Buffer sizes were documented as specified in bytes, even for textual information. For example, here's a snippet from the 16-bit documentation for the GetWindowTextLength function:

    The return value specifies the text length, in bytes, not including any null terminating character, if the function is successful. Otherwise, it is zero.

    The use of the term byte throughout permitted the term character to be used for other purposes, and in 16-bit Windows, the term was repurposed to represent "one or more bytes which together represent one (what I will call) linguistic character." For single-byte character sets, a linguistic character was the same as a byte, but for multi-byte character sets, a linguistic character could be one or two bytes.

    Documentation for functions that operated on linguistic characters said characters, and documentation for functions that operated on CHARs said bytes, and everybody knew what the story was. (Mind you, even in this nostalgic era, documentation would occasionally mess up and say character when it really meant byte, but the convention was adhered to with some degree of consistency.)

    With the introduction of Unicode, things got ugly.

    All documentation that previously used byte to describe the size of textual data had to be changed to read "the size of the buffer in bytes if calling the ANSI version of the function or in WCHARs if calling the Unicode version of the function." A few years ago the Platform SDK team accepted my suggestion to adopt the less cumbersome "the size of the buffer in TCHARs." Newer documentation from the core topics of the Platform SDK tends to use this alternate formulation.

    Unfortunately, most documentation writers (and 99% of software developers, who provide the raw materials for the documentation writers) aren't familiar with the definition of character that was set down back in 1983, and they tend to use the term to mean storage character, which is a term I invented just now to mean "a unit of storage sufficient to hold a single TCHAR." (The Platform SDK uses what I consider to be the fantastically awkward term normal character widths.) For example, the lstrlen function returns the length of the string in storage characters, not linguistic characters. And any function that accepts a sized output buffer obviously specifies the size in storage characters because the alternative is nonsense: How could you pass a buffer and say "Please fill this buffer with data. Its size is five linguistic characters"? You don't know what is going into the buffer, and a linguistic character is variable-sized, so how can you say how many linguistic characters will fit?

    Michael Kaplan enjoys making rather outrageous strings which result in equally outrageous sort keys. I remember one entry a while ago where he piled over a dozen accent marks over a single "a". That "a" plus the combining diacritics all equal one giant linguistic character. (There is a less extreme example here, wherein he uses an "e" plus two combining diacritics to form one linguistic character.) If you wanted your buffer to really be able to hold five of these extreme linguistic characters, you certainly would need it to be bigger than WCHAR buffer[5].
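
    To make the two meanings concrete, here is a minimal sketch in portable C that counts the same string both ways. The names wchar16, storage_length, and linguistic_length are made up for illustration, and the notion of "linguistic character" is deliberately simplified: the sketch assumes that only code units in the range U+0300 through U+036F combine with the preceding character, which is far from the full set of rules.

```c
#include <stddef.h>

/* One UTF-16 code unit. This stands in for WCHAR so that the sketch
   compiles without <windows.h>. (Hypothetical name.) */
typedef unsigned short wchar16;

/* Length in storage characters: count code units up to the terminator.
   This is the kind of length that lstrlen reports. */
static size_t storage_length(const wchar16 *s)
{
    size_t n = 0;
    while (s[n] != 0) {
        n++;
    }
    return n;
}

/* Simplifying assumption: only U+0300..U+036F (the Combining Diacritical
   Marks block) attaches to the preceding character. */
static int is_combining_mark(wchar16 ch)
{
    return ch >= 0x0300 && ch <= 0x036F;
}

/* Approximate length in linguistic characters: count every code unit
   that is not a combining mark. */
static size_t linguistic_length(const wchar16 *s)
{
    size_t n = 0;
    for (; *s != 0; s++) {
        if (!is_combining_mark(*s)) {
            n++;
        }
    }
    return n;
}
```

    For the string consisting of "e", U+0301, U+0323, and "a", storage_length reports 4 while linguistic_length reports 2. A buffer sized for two storage characters could not hold it.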

    As a result, my recommendation to you, dear reader, is to enter every page of documentation with a bias towards storage character whenever you see the word character. Only if the function operates on the textual data linguistically should you even consider the possibility that the author actually meant linguistic character. The only functions I can think of off-hand that operate on linguistic characters are CharNext and CharPrev, and even then they don't quite get it right, although they at least try.

  • The Old New Thing

    How a bullet turns into a beep

    • 20 Comments

    Here's a minor mystery:

    echo •
    

    That last character is U+2022. Select that line with the mouse, right-click, and select Copy to copy it to the clipboard. Now go to a command prompt and paste it and hit Enter.

    You'd expect a • to be printed, but instead you get a beep. What happened?

    Here's another clue. Run this program.

    class Mystery {
     public static void Main() {
      System.Console.WriteLine("\x2022");
     }
    }
    

    Hm, there's that beep again. How about this program:

    #include <stdio.h>
    #include <windows.h>
    
    int __cdecl main(int argc, char **argv)
    {
     char ch;
     if (WideCharToMultiByte(CP_OEMCP, 0, L"\x2022", 1,
                             &ch,  1, NULL, NULL) == 1) {
      printf("%d\n", ch);
     }
     return 0;
    }
    

    Run this program and it prints "7".

    By now you should have figured out what's going on. In the OEM code page, the bullet character is being converted to a beep. But why is that?

    What you're seeing is MB_USEGLYPHCHARS in reverse. Michael Kaplan discussed MB_USEGLYPHCHARS a while ago. It determines whether certain characters should be treated as control characters or as printable characters when converting to Unicode. For example, it controls whether the ASCII bell character 0x07 should be converted to the Unicode bell character U+0007 or to the Unicode bullet U+2022. You need the MB_USEGLYPHCHARS flag to decide which way to go when converting to Unicode, but there is no corresponding ambiguity when converting from Unicode. When converting from Unicode, both U+0007 and U+2022 map to the ASCII bell character.
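
    The asymmetry can be modeled with a toy table. This is only a sketch: the functions to_oem and from_oem are hypothetical stand-ins rather than the real WideCharToMultiByte and MultiByteToWideChar, the table holds just two entries, and it assumes the OEM code page is code page 437.

```c
#include <stddef.h>

/* Two entries of an assumed code page 437 glyph table. */
struct glyph_entry {
    unsigned char  oem_byte;  /* byte value in the OEM code page  */
    unsigned short glyph;     /* Unicode code point of its glyph  */
};

static const struct glyph_entry glyph_table[] = {
    { 0x01, 0x263A },  /* white smiling face */
    { 0x07, 0x2022 },  /* bullet             */
};

#define GLYPH_COUNT (sizeof(glyph_table) / sizeof(glyph_table[0]))

/* Unicode to OEM: the control character and the glyph both land on the
   same byte, so there is nothing for a flag to decide.
   Returns -1 for characters this toy model does not cover. */
static int to_oem(unsigned short wc)
{
    size_t i;
    if (wc < 0x80) {
        return wc;  /* ASCII, including control characters, maps to itself */
    }
    for (i = 0; i < GLYPH_COUNT; i++) {
        if (glyph_table[i].glyph == wc) {
            return glyph_table[i].oem_byte;
        }
    }
    return -1;
}

/* OEM to Unicode: the byte alone is ambiguous, so the caller has to say
   which reading it wants. That is the role MB_USEGLYPHCHARS plays.
   Bytes outside the table pass through unchanged, which is only
   correct for ASCII. */
static unsigned short from_oem(unsigned char byte, int use_glyph_chars)
{
    size_t i;
    if (use_glyph_chars) {
        for (i = 0; i < GLYPH_COUNT; i++) {
            if (glyph_table[i].oem_byte == byte) {
                return glyph_table[i].glyph;
            }
        }
    }
    return byte;  /* control character interpretation */
}
```

    In this model, to_oem(0x2022) and to_oem(0x0007) both produce 0x07, whereas from_oem(0x07, 0) produces U+0007 and from_oem(0x07, 1) produces U+2022.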

    "But converting a bullet to 0x07 is clearly wrong. I mean, who expects a printable character to turn into a control character?"

    Well, you're assuming that the code that does the conversion is going to interpret it as a control character. The code might treat it as a glyph character, like this:

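    // starting with the scratch program
    
    void
    PaintContent(HWND hwnd, PAINTSTRUCT *pps)
    {
     // Select the OEM font so that byte 0x07 is drawn with its OEM glyph.
     HFONT hfPrev = SelectFont(pps->hdc, GetStockFont(OEM_FIXED_FONT));
     // The string is a byte string, so call the ANSI version explicitly.
     TextOutA(pps->hdc, 0, 0, "\x07", 1);
     SelectFont(pps->hdc, hfPrev);
    }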
    

    Run this program and you get a happy bullet in the corner of the window. The TextOut function does not interpret control characters as control characters; it interprets them as glyphs.

    The WideCharToMultiByte function doesn't know what you're going to do with the string it produces. It converts the character and leaves you to decide what to do next. There doesn't appear to be a WC_DONTUSEGLYPHCHARS flag, so you're going to get glyph characters whether you like it or not.

    (Postscript: You can see this happening in reverse from the command prompt. Then again, since this problem is itself a reversal, I guess you could say the behavior is happening in the forward direction now... Type echo ^A where you actually type Ctrl+A where I wrote ^A. The result: A smiling face, U+263A.)

  • The Old New Thing

    Wait, but why can I GetProcAddress for IsDialogMessage?

    • 21 Comments

    Okay, so I explained that a lot of so-called functions are really redirecting macros, function-like macros, intrinsic functions, and inline functions, and consequently, GetProcAddress won't actually get anything since the function doesn't exist in the form of an exported function. But why, then, can you GetProcAddress for IsDialogMessage?

    Let's take a closer look at the exports from user32.dll. Here's the relevant excerpt.

            417  1A0 0002C661 IsDialogMessage
            418  1A1 0002C661 IsDialogMessageA
            419  1A2 0001DFBC IsDialogMessageW
    

    Notice that this function is exported three ways. The last two are the ones you expect, IsDialogMessageA for ANSI callers and IsDialogMessageW for Unicode callers. That first one is the one you didn't expect: IsDialogMessage with no A or W suffix. But notice that its entry point address is identical to that of IsDialogMessageA. The IsDialogMessage entry point is just an alias for IsDialogMessageA.
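
    If it helps to see the aliasing spelled out, here is a small model of that slice of the export table in portable C. The table values come from the excerpt above; the find_export function is a hypothetical stand-in for a lookup by exact name and is not how GetProcAddress is actually implemented.

```c
#include <stddef.h>
#include <string.h>

/* One row of a simplified export table. */
struct export_entry {
    const char   *name;
    unsigned long entry_point;  /* address column from the excerpt */
};

static const struct export_entry exports[] = {
    { "IsDialogMessage",  0x0002C661UL },
    { "IsDialogMessageA", 0x0002C661UL },
    { "IsDialogMessageW", 0x0001DFBCUL },
};

/* Look up an export by exact name.
   Returns the entry point, or 0 if the name is not exported. */
static unsigned long find_export(const char *name)
{
    size_t i;
    for (i = 0; i < sizeof(exports) / sizeof(exports[0]); i++) {
        if (strcmp(exports[i].name, name) == 0) {
            return exports[i].entry_point;
        }
    }
    return 0;
}
```

    Looking up "IsDialogMessage" and "IsDialogMessageA" yields the same entry point, while "IsDialogMessageW" yields a different one. A name with no row in the table, such as "CreateWindow", yields 0.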

    This phantom third function is hidden from C and C++ programs because any attempt to call IsDialogMessage gets converted to IsDialogMessageA or IsDialogMessageW due to the redirection macro:

    #ifdef UNICODE
    #define IsDialogMessage  IsDialogMessageW
    #else
    #define IsDialogMessage  IsDialogMessageA
    #endif // !UNICODE
    

    (Of course, you can play fancy games to remove the redirection macros; I'm just talking about the non-fancy case.) If nobody can call the function, then why does it exist?

    Because of mistakes made long ago.

    If you hunt through user32.dll you'll find a few other functions that follow a similar pattern of having three versions, an A version, a W version, and a phantom undecorated version (which is an alias for the A version). At one point long ago, the function existed only in an undecorated version. This turned out to have been a mistake, since there was a character set dependency in the parameters (perhaps obvious, perhaps subtle). The mistake was corrected by splitting the function into the A and W versions you see today, but in order to maintain compatibility with older programs that were written before the mistake was recognized, the original undecorated function was left in the export table.

    When you don't have a time machine, you have to live with your mistakes.

    In a sense, these functions are vestigial organs of Win32.

    Postscript: Unfortunately, like your appendix, which can get infected, these vestigial organs can create a different sort of infection: If you are using p/invoke to call these functions and mistakenly override the default name declaration with ExactSpelling=true, like so:

    [DllImport("user32.dll", ExactSpelling=true)]
    public static extern
    bool IsDialogMessage(IntPtr hWndDlg,
                         [In] ref MSG msg);
    

    then you will in fact get the normally-inaccessible undecorated name, since you specified that you wanted the exact spelling. This highlights once again that you need to be alert when doing interop programming: You get what you ask for, which might not be what you actually wanted.

  • The Old New Thing

    Why can't I GetProcAddress for CreateWindow?

    • 18 Comments

    Occasionally, I'll see people having trouble trying to GetProcAddress for functions like CreateWindow or ExitWindows. Usually, it's coming from people who are trying to write p/invoke signatures, for p/invoke does a GetProcAddress under the covers. Why can't you GetProcAddress for these functions?

    Because they're not really functions. They're function-like macros:

    #define CreateWindowA(lpClassName, lpWindowName, dwStyle, x, y,\
    nWidth, nHeight, hWndParent, hMenu, hInstance, lpParam)\
    CreateWindowExA(0L, lpClassName, lpWindowName, dwStyle, x, y,\
    nWidth, nHeight, hWndParent, hMenu, hInstance, lpParam)
    #define CreateWindowW(lpClassName, lpWindowName, dwStyle, x, y,\
    nWidth, nHeight, hWndParent, hMenu, hInstance, lpParam)\
    CreateWindowExW(0L, lpClassName, lpWindowName, dwStyle, x, y,\
    nWidth, nHeight, hWndParent, hMenu, hInstance, lpParam)
    #ifdef UNICODE
    #define CreateWindow  CreateWindowW
    #else
    #define CreateWindow  CreateWindowA
    #endif // !UNICODE
    
    #define ExitWindows(dwReserved, Code) ExitWindowsEx(EWX_LOGOFF, 0xFFFFFFFF)
    

    In fact, as you can see above, CreateWindow is doubly a macro. First, it's a redirecting macro that expands to either CreateWindowA or CreateWindowW, depending on whether or not you are compiling UNICODE. Those are in turn function-like macros that call the real function CreateWindowExA or CreateWindowExW. All this is handled by the compiler if you include the winuser.h header file, but if for some reason you want to GetProcAddress for a function-like macro like CreateWindow, you'll have to manually expand the macro to see what the real function is and pass that function name to GetProcAddress.
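
    The two layers of macro can be reproduced in miniature without the Windows headers. In this sketch every name (make_window, make_window_a, make_window_ex_a, and the STRINGIZE helpers) is invented for illustration, and only the A flavor is provided, so it assumes you build without UNICODE defined.

```c
#include <stddef.h>

/* Records the extended style most recently passed to the "real"
   function, so that the effect of the macro can be observed. */
static long g_last_ex_style = -1;

/* Stand-in for a real exported function such as CreateWindowExA. */
static int make_window_ex_a(long ex_style, const char *class_name)
{
    g_last_ex_style = ex_style;
    return class_name != NULL;
}

/* Function-like macro: supplies the extra leading argument.
   No function named make_window_a exists anywhere. */
#define make_window_a(class_name) make_window_ex_a(0L, class_name)

/* Redirecting macro: chooses the A or W flavor purely by name.
   The W flavor is omitted from this sketch. */
#ifdef UNICODE
#define make_window make_window_w
#else
#define make_window make_window_a
#endif

/* Two-level stringizing expands the argument before quoting it, which
   reveals what a redirecting macro turns into. */
#define STRINGIZE_RAW(x) #x
#define STRINGIZE(x) STRINGIZE_RAW(x)

/* Returns the name that make_window is redirected to. */
static const char *redirected_name(void)
{
    return STRINGIZE(make_window);
}
```

    Calling make_window("BUTTON") ends up in make_window_ex_a with an extended style of 0, and redirected_name() returns "make_window_a" in a non-UNICODE build. The stringizing trick only peels off the redirecting layer; to find the real function behind the function-like layer you still have to read the macro definition.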

    Similar remarks apply to inline functions. These functions can't be obtained via GetProcAddress because they aren't exported at all; they are provided to you as source code in the header file.

    Note that whether something is a true function or a function-like macro (or an inline function) can depend on your target platform. For example, GetWindowLongPtrA is a true exported function on 64-bit Windows, but on 32-bit Windows, it's just a macro that resolves to GetWindowLongA. As another example, the Interlocked family of functions are exported functions on the x86 version of Windows but are inlined functions on all other Windows architectures.

    How can you figure all this out? Read the header files. That'll show you whether the function you want is a redirecting macro, a function-like macro, an inline function, an intrinsic function, or a proper exported function. If you can't figure it out from the header files, you can always just write a program that calls the function you're interested in and then look at the disassembly to see what actually got generated.

  • The Old New Thing

    The family technical support department: Everything is Outlook

    • 63 Comments

    We're all in the same position. Since we work with computers all day, everybody in the extended family considers us the technical support department. One thing you all need to take away from your role as family technical support department is that normal people view computers completely differently from the way you and I do.

    One of my relatives calls every program Outlook.

    "I'm on the Internet checking the weather report and then Outlook keeps displaying these windows with advertisements in them."

    "I'm having trouble listening to music on Outlook."

    "How do I get Outlook to play that card game you showed me last time?"

    "I tried to save my spreadsheet and Outlook gave me this weird error message."

    Why is every program called Outlook?

    At work, this particular relative received word that the computer systems were being upgraded. The old system was a dedicated CAD system, but the new computers were PCs running CAD software and Outlook.

    "Okay, I know what CAD software is. That's what I've been doing for the past five years on the old system. Therefore, by process of elimination, everything else on the computer must be Outlook."

    When they got a home computer a year later, it didn't come with any CAD software on it. It was all Outlook.

    (One of my colleagues is in a similar position: His relatives call everything on the computer Microsoft X. It could be Microsoft Norton Utilities or Microsoft Quicken. I wouldn't be surprised if they even said Microsoft Google.)

    My colleague KC Lemson is in the unfortunate position of being the "Outlook expert" in her family, despite having not worked on Outlook for six years. It turns out that there have been a lot of versions of Outlook released since then, so her specialized knowledge is pretty badly outdated. That doesn't stop them from trying, though.

    She told me that she attended a family wedding some time ago, and heard from three separate people, "Oh, Alice [not her real name] has an Outlook question for you." The effect of this was perhaps not what those people expected, because KC spent the entire wedding trying to avoid Alice. KC explained, "If she'd just come up and asked the question herself, I probably would have been fine with it, but having such an early warning just scared me. Plus, sniff sniff, you want to be wanted for who you are and not what you know."
