December, 2003

  • The Old New Thing

    When programs grovel into undocumented structures...

    • 58 Comments

    Three examples off the top of my head of the consequences of grovelling into and relying on undocumented structures.

    Defragmenting things that can't be defragmented
    In Windows 2000, there are several categories of things that cannot be defragmented. Directories, exclusively-opened files, the MFT, the pagefile... That didn't stop a certain software company from doing it anyway in their defragmenting software. They went into kernel mode, reverse-engineered NTFS's data structures, and modified them on the fly. Yee-haw cowboy! And then when the NTFS folks added support for defragmenting the MFT to Windows XP, these programs went in, modified NTFS's data structures (which had changed in the meantime), and corrupted your disk.

    Of course there was no mention of this illicit behavior in the documentation. So when the background defragmenter corrupted their disks, Microsoft got the blame.

    Parsing the Explorer view data structures
    A certain software company decided that they wanted to alter the behavior of the Explorer window from a shell extension. Since there is no way to do this (a shell extension is not supposed to mess with the view; the view belongs to the user), they decided to do it themselves anyway.

    From the shell extension, they used an undocumented window message to get a pointer to one of the internal Explorer structures. Then they walked the structure until they found something they recognized. Then they knew, "The thing immediately after the thing that I recognize is the thing that I want."

    Well, the thing that they recognized and the thing that they wanted happened to be base classes of a multiply-derived class. If you have a class with multiple base classes, the compiler makes no guarantee about the order in which the base classes are laid out. It so happened that they appeared in the order X,Y,Z in all the versions of Windows this software company tested against.

    Except Windows 2000.

    In Windows 2000, the compiler decided that the order should be X,Z,Y. So now they grovelled in, saw the "X" and said "Aha, the next thing must be a Y" but instead they got a Z. And then they crashed your system some time later.
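To make the layout point concrete, here is a minimal C++ sketch. The classes X, Y, Z and View are hypothetical stand-ins (the real Explorer classes are undocumented), and BaseOffset is a helper invented for illustration. Code that has the class definition lets the compiler apply the base-class offset; code that guesses "Y comes right after X" is relying on something the language does not promise.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for the undocumented internal classes.
struct X { virtual ~X() = default; int x_tag = 1; };
struct Y { virtual ~Y() = default; int y_tag = 2; };
struct Z { virtual ~Z() = default; int z_tag = 3; };

// A multiply-derived class. The relative placement of the X, Y and Z
// subobjects inside a View is up to the compiler.
struct View : X, Y, Z {};

// Supported way to reach a base: the compiler inserts whatever offset
// applies to this particular build.
inline Y* GetY(View* v) {
    assert(v != nullptr);
    return static_cast<Y*>(v);
}

// Byte offset of a base subobject inside a View, as this compiler laid it out.
// Useful for printing and comparing across compilers; never hard-code the result.
template <class Base>
std::ptrdiff_t BaseOffset(View* v) {
    return reinterpret_cast<char*>(static_cast<Base*>(v)) -
           reinterpret_cast<char*>(v);
}
```

Printing BaseOffset<X>, BaseOffset<Y> and BaseOffset<Z> on two different compilers is a quick way to see that the order is a property of the build, not of the source code.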

    So I had to create a "fake X,Y" so when the program went looking for X (so it could grab Y), it found the fake one first.

    This took the good part of a week to figure out.

    Reaching up the stack
    A certain software company decided that it was too hard to take the coordinates of the NM_DBLCLK notification and hit-test it against the treeview to see what was double-clicked. So instead, they take the address of the NMHDR structure passed to the notification, add 60 to it, and dereference a DWORD at that address. If it's zero, they do one thing, and if it's nonzero they do some other thing.

    It so happens that the NMHDR is allocated on the stack, so this program is reaching up into the stack and grabbing the value of some local variable (which happens to be two frames up the stack!) and using it to control their logic.

    For Windows 2000, we upgraded the compiler to a version which did a better job of reordering and re-using local variables, and now the program couldn't find the local variable it wanted and stopped working.

    I got tagged to investigate and fix this. I had to create a special NMHDR structure that "looked like" the stack the program wanted to see and pass that special "fake stack".

    I think this one took me two days to figure out.

    I hope you understand why I tend to go ballistic when people recommend relying on undocumented behavior. These weren't hobbyists in their garage seeing what they could do. These were major companies writing commercial software.

    When you upgrade to the next version of Windows and you experience (a) disk corruption, (b) sporadic Explorer crashes, or (c) sporadic loss of functionality in your favorite program, do you blame the program or do you blame Windows?

    If you say, "I blame the program," the first problem is of course figuring out which program. In cases (a) and (b), the offending program isn't obvious.


    Why not just block the apps that rely on undocumented behavior?

    • 47 Comments
    Because every app that gets blocked is another reason for people not to upgrade to the next version of Windows. Look at all these programs that would have stopped working when you upgraded from Windows 3.0 to Windows 3.1.
    HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Compatibility
    

    Actually, this list is only partial. Many times, the compatibility fix is made inside the core component for all programs rather than targeting a specific program, as this list does.

    (The Windows 2000-to-Windows XP list is stored in your C:\WINDOWS\AppPatch directory, in a binary format to permit rapid scanning. Sorry, you won't be able to browse it easily. I think the Application Compatibility Toolkit includes a viewer, but I may be mistaken.)

    Would you have bought Windows XP if you knew that all these programs were incompatible?

    It takes only one incompatible program to sour an upgrade.

    Suppose you're the IT manager of some company. Your company uses Program X for its word processor and you find that Program X is incompatible with Windows XP for whatever reason. Would you upgrade?

    Of course not! Your business would grind to a halt.

    "Why not call Company X and ask them for an upgrade?"

    Sure, you could do that, and the answer might be, "Oh, you're using Version 1.0 of Program X. You need to upgrade to Version 2.0 for $150 per copy." Congratulations, the cost of upgrading to Windows XP just tripled.

    And that's if you're lucky and Company X is still in business.

    I recall a survey taken a few years ago by our Setup/Upgrade team of corporations using Windows. Pretty much every single one has at least one "deal-breaker" program, a program which Windows absolutely must support or they won't upgrade. In a high percentage of the cases, the program in question was developed by their in-house programming staff, and it's written in Visual Basic (sometimes even 16-bit Visual Basic), and the person who wrote it doesn't work there any more. In some cases, they don't even have the source code any more.

    And it's not just corporate customers. This affects consumers too.

    For Windows 95, my application compatibility work focused on games. Games are the most important factor behind consumer technology. The video card that comes with a typical computer has gotten better over time because games demand it. (Outlook certainly doesn't care that your card can do 20 bajillion triangles a second.) And if your game doesn't run on the newest version of Windows, you aren't going to upgrade.

    Anyway, game vendors are very much like those major corporations. I made phone call after phone call to the game vendors trying to help them get their game to run under Windows 95. To a one, they didn't care. A game has a shelf life of a few months, and then it's gone. Why would they bother to issue a patch for their program to run under Windows 95? They already got their money. They're not going to make any more off that game; its three months are over. The vendors would slipstream patches and lose track of how many versions of their program were out there and how many of them had a particular problem. Sometimes they wouldn't even have the source code any more.

    They simply didn't care that their program didn't run on Windows 95. (My favorite was the one that tried to walk me through creating a DOS boot disk.)

    Oh, and that Application Compatibility Toolkit I mentioned above. It's a great tool for developers, too. One of the components is the Verifier: If you run your program under the verifier, it will monitor hundreds of API calls and break into the debugger when you do something wrong. (Like close a handle twice or allocate memory with GlobalAlloc but free it with LocalFree.)

    The new application compatibility architecture in Windows XP carries with it one major benefit (from an OS development perspective): See all those DLLs in your C:\WINDOWS\AppPatch directory? That's where many of the compatibility changes live now. The compatibility workarounds no longer sully the core OS files. (Not all classes of compatibility workarounds can be offloaded to a compatibility DLL, but it's a big help.)

    What's with those blank taskbar buttons that go away when I click on them?

    • 41 Comments

    Sometimes you'll find a blank taskbar button that goes away when you click on it. What's the deal with that?

    There are some basic rules on which windows go into the taskbar. In short:

    • If the WS_EX_APPWINDOW extended style is set, then it will show (when visible).
    • If the window is a top-level unowned window, then it will show (when visible).
    • Otherwise it doesn't show.

    (Though the ITaskbarList interface muddies this up a bit.)
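The three rules above can be written as a small predicate. This is a sketch of the "in short" version only: WindowInfo is a hypothetical snapshot of the relevant window properties, not a real Windows structure, and the code ignores ITaskbarList and any other details the summary leaves out.

```cpp
#include <cassert>

// Hypothetical snapshot of the properties the rules mention.
struct WindowInfo {
    bool visible;      // window is currently shown
    bool exAppWindow;  // WS_EX_APPWINDOW extended style is set
    bool topLevel;     // top-level (not a child) window
    bool owned;        // window has an owner
};

// Rule 1 or rule 2 makes the window taskbar-eligible; otherwise it is not.
inline bool IsTaskbarEligible(const WindowInfo& w) {
    return w.exAppWindow || (w.topLevel && !w.owned);
}

// An eligible window gets a button only while it is visible.
inline bool ShouldHaveTaskbarButton(const WindowInfo& w) {
    return w.visible && IsTaskbarEligible(w);
}

// Usage example: a few cases checked against the rules.
inline void CheckTaskbarRules() {
    WindowInfo mainWindow  = {true,  false, true, false};
    WindowInfo ownedDialog = {true,  false, true, true};
    WindowInfo forcedOwned = {true,  true,  true, true};
    WindowInfo hiddenMain  = {false, false, true, false};
    assert(ShouldHaveTaskbarButton(mainWindow));    // top-level and unowned
    assert(!ShouldHaveTaskbarButton(ownedDialog));  // owned, no WS_EX_APPWINDOW
    assert(ShouldHaveTaskbarButton(forcedOwned));   // WS_EX_APPWINDOW wins
    assert(!ShouldHaveTaskbarButton(hiddenMain));   // eligible but not visible
}
```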

    When a taskbar-eligible window becomes visible, the taskbar creates a button for it. When a taskbar-eligible window becomes hidden, the taskbar removes the button.

    The blank buttons appear when a window changes between taskbar-eligible and taskbar-ineligible while it is visible. Follow:

    • Window is taskbar-eligible.
    • Window becomes visible → taskbar button created.
    • Window goes taskbar-ineligible.
    • Window becomes hidden → since the window is not taskbar-eligible at this point, the taskbar ignores it.

    Result: A taskbar button that hangs around with no window attached to it.

    This is why the documentation also advises, "If you want to dynamically change a window's style to one that doesn't support visible taskbar buttons, you must hide the window first (by calling ShowWindow with SW_HIDE), change the window style, and then show the window."
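The sequence that produces the orphaned button, and the documented ordering that avoids it, can be replayed in a toy model. Everything here (ToyTaskbar, ToyWindow, the integer window ids) is hypothetical scaffolding for illustration, not the real USER32 or taskbar code. The one behavior it borrows from the text is that show/hide notifications reach the taskbar only for windows that are eligible at that moment.

```cpp
#include <cassert>
#include <set>

// Toy taskbar: remembers which window ids currently have a button.
struct ToyTaskbar {
    std::set<int> buttons;
    void OnShown(int id)  { buttons.insert(id); }
    void OnHidden(int id) { buttons.erase(id); }
};

// Toy window: notifications are filtered by eligibility at the time of
// the show/hide, and a style change sends no notification at all.
struct ToyWindow {
    int id;
    bool eligible;
    bool visible;
    ToyTaskbar* taskbar;

    void Show() {
        assert(taskbar != nullptr);
        visible = true;
        if (eligible) taskbar->OnShown(id);
    }
    void Hide() {
        assert(taskbar != nullptr);
        visible = false;
        if (eligible) taskbar->OnHidden(id);
    }
    void SetEligible(bool e) { eligible = e; }
};

// Style changed while visible: the later hide is filtered out.
inline bool OrphanAfterStyleChangeWhileVisible() {
    ToyTaskbar tb;
    ToyWindow w = {1, true, false, &tb};
    w.Show();
    w.SetEligible(false);
    w.Hide();
    return tb.buttons.count(1) != 0;
}

// Documented order: hide, change style, show again.
inline bool OrphanAfterHideChangeShow() {
    ToyTaskbar tb;
    ToyWindow w = {1, true, false, &tb};
    w.Show();
    w.Hide();              // button removed while the window is still eligible
    w.SetEligible(false);
    w.Show();              // ineligible now, so no button is created
    w.Hide();
    return tb.buttons.count(1) != 0;
}
```

In the first function the button outlives the window; in the second, the hide arrives while the window is still eligible, so the taskbar gets the chance to remove it.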

    Bonus question: Why doesn't the taskbar pay attention to all windows as they come and go?

    Answer: Because that would be expensive. The filtering out of windows that aren't taskbar-eligible happens inside USER32 and it then notifies the taskbar (or anybody else who has installed a WH_SHELL hook) via one of the HSHELL_* notifications only if a taskbar-eligible window has changed state. That way, the taskbar code doesn't get paged in when there's nothing for it to do.

    Why are structure sizes checked strictly?

    • 40 Comments

    You may have noticed that Windows as a general rule checks structure sizes strictly. For example, consider the MENUITEMINFO structure:

    typedef struct tagMENUITEMINFO {
      UINT    cbSize; 
      UINT    fMask; 
      UINT    fType; 
      UINT    fState; 
      UINT    wID; 
      HMENU   hSubMenu; 
      HBITMAP hbmpChecked; 
      HBITMAP hbmpUnchecked; 
      ULONG_PTR dwItemData; 
      LPTSTR  dwTypeData; 
      UINT    cch; 
    #if(WINVER >= 0x0500)
      HBITMAP hbmpItem; // available only on Windows 2000 and higher
    #endif
    } MENUITEMINFO, *LPMENUITEMINFO; 
    

    Notice that the size of this structure changes depending on whether WINVER >= 0x0500 (i.e., whether you are targeting Windows 2000 or higher). If you take the Windows 2000 version of this structure and pass it to Windows NT 4, the call will fail since the sizes don't match.

    "But the old version of the operating system should accept any size that is greater than or equal to the size it expects. A larger value means that the structure came from a newer version of the program, and it should just ignore the parts it doesn't understand."

    We tried that. It didn't work.

    Consider the following imaginary sized structure and a function that consumes it. This will be used as the guinea pig for the discussion to follow:

    typedef struct tagIMAGINARY {
      UINT cbSize;
      BOOL fDance;
      BOOL fSing;
    #if IMAGINARY_VERSION >= 2
      // v2 added new features
      IServiceProvider *psp; // where to get more info
    #endif
    } IMAGINARY;
    
    // perform the actions you specify
    STDAPI DoImaginaryThing(const IMAGINARY *pimg);
    
    // query what things are currently happening
    STDAPI GetImaginaryThing(IMAGINARY *pimg);
    

    First, we found lots of programs which simply forgot to initialize the cbSize member altogether.

    IMAGINARY img;
    img.fDance = TRUE;
    img.fSing = FALSE;
    DoImaginaryThing(&img);
    

    So they got stack garbage as their size. The stack garbage happened to be a large number, so it passed the "greater than or equal to the expected cbSize" test and the code worked. Then the next version of the header file expanded the structure, using the cbSize to detect whether the caller is using the old or new style. Now, the stack garbage is still greater than or equal to the new cbSize, so version 2 of DoImaginaryThing says, "Oh cool, this is somebody who wants to provide additional information via the IServiceProvider field." Except of course that it's stack garbage, so calling the IServiceProvider::QueryService method crashes.
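For contrast, here is what a well-behaved caller and a strict size check might look like. This is a portable sketch: ImaginaryV1, DoImaginaryThingStrict and MakeImaginary are hypothetical stand-ins for the imaginary structure and function in the text, and none of them is a real Windows API.

```cpp
#include <cassert>
#include <cstdint>

// Portable stand-in for version 1 of the imaginary structure.
struct ImaginaryV1 {
    std::uint32_t cbSize;
    int fDance;
    int fSing;
};

// Strict check: accept only the exact size this version knows about.
// A garbage cbSize is rejected instead of being treated as "newer".
inline bool DoImaginaryThingStrict(const ImaginaryV1* pimg) {
    if (pimg == nullptr || pimg->cbSize != sizeof(ImaginaryV1)) {
        return false;
    }
    return true;  // the dancing and singing would happen here
}

// What the caller is supposed to do: zero the structure, then set cbSize.
inline ImaginaryV1 MakeImaginary(bool dance, bool sing) {
    ImaginaryV1 img = {};      // no member is left holding stack garbage
    img.cbSize = sizeof(img);  // the step the buggy callers forgot
    img.fDance = dance ? 1 : 0;
    img.fSing = sing ? 1 : 0;
    return img;
}

// Usage example.
inline void DemoStrictCheck() {
    ImaginaryV1 good = MakeImaginary(true, false);
    assert(DoImaginaryThingStrict(&good));

    ImaginaryV1 garbage = {};
    garbage.cbSize = 0x7FFF0000u;  // stands in for a large random stack value
    assert(!DoImaginaryThingStrict(&garbage));
}
```

With a strict check, the caller that forgot cbSize fails on the first test run rather than appearing to work until the structure grows.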

    Now consider this related scenario:

    IMAGINARY img;
    GetImaginaryThing(&img);
    

    The next version of the header file expanded the structure, and the stack garbage happened to be a large number, so it passed the "greater than or equal to the expected cbSize" test, so it returned not just the fDance and fSing flags, but also returned a psp. Oops, but the caller was compiled with v1, so its structure doesn't have a psp member. The psp gets written past the end of the structure, corrupting whatever came after it in memory. Ah, so now we have one of those dreaded buffer overflow bugs.

    Even if you were lucky and the memory that came afterwards was safe to corrupt, you still have a bug: By the rules of COM reference counts, when a function returns an interface pointer, it is the caller's responsibility to release the pointer when no longer needed. But the v1 caller doesn't know about this psp member, so it certainly doesn't know that it needs to be psp->Release()d. So now, in addition to memory corruption (as if that wasn't bad enough), you also have a memory leak.

    Wait, I'm not done yet. Now let's see what happens when a program written in the future runs on an older system.

    Suppose somebody is writing their program intending it to be run on v2. They set the cbSize to the larger v2 structure size and set the psp member to a service provider that performs security checks before allowing any singing or dancing to take place. (E.g., makes sure everybody paid the entrance fee.) Now somebody takes this program and runs it on v1. The new v2 structure size passes the "greater than or equal to the v1 structure size" test, so v1 will accept the structure and Do the ImaginaryThing. Except that v1 didn't support the psp field, so your service provider never gets called and your security module is bypassed. Now everybody is coming into your club without paying.
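The difference between the lenient and strict policies in this scenario can be sketched in a few lines. All names below are hypothetical: V1SystemLenient and V1SystemStrict model a v1 implementation under each policy, and psp is reduced to a plain pointer so the example stays portable.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

struct ImaginaryV1 { std::uint32_t cbSize; int fDance; int fSing; };
struct ImaginaryV2 { std::uint32_t cbSize; int fDance; int fSing; void* psp; };

// Read the leading cbSize member without assuming which version was passed.
inline std::uint32_t ReadCbSize(const void* p) {
    std::uint32_t cb = 0;
    std::memcpy(&cb, p, sizeof(cb));
    return cb;
}

// v1 system, lenient policy: accepts anything at least as large as v1
// and silently ignores the fields it does not understand.
inline bool V1SystemLenient(const void* p) {
    return p != nullptr && ReadCbSize(p) >= sizeof(ImaginaryV1);
}

// v1 system, strict policy: accepts only the exact v1 size.
inline bool V1SystemStrict(const void* p) {
    return p != nullptr && ReadCbSize(p) == sizeof(ImaginaryV1);
}

// A v2 caller that depends on psp being honored. It returns true when the
// system reported success, which the caller takes to mean psp was consulted.
inline bool SingAndDanceWithSecurity(bool (*doThing)(const void*), void* provider) {
    ImaginaryV2 img = {};
    img.cbSize = sizeof(img);
    img.fDance = 1;
    img.fSing = 1;
    img.psp = provider;
    return doThing(&img);
}

// Usage example.
inline void DemoVersionMismatch() {
    int provider = 0;  // placeholder for a real service provider object
    // Lenient v1 reports success even though psp was never looked at.
    assert(SingAndDanceWithSecurity(V1SystemLenient, &provider));
    // Strict v1 fails the call, so the caller knows its checks did not run.
    assert(!SingAndDanceWithSecurity(V1SystemStrict, &provider));
}
```

Under the strict policy the failure is visible to the caller, who can then decide whether to fall back to the v1 structure or refuse to run without the security check.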

    Now, you might say, "Well those are just buggy programs. They deserve to lose." If you stand by that logic, then prepare to take the heat when you read magazine articles like "Microsoft intentionally designed <Product X> to be incompatible with <software from a major competitor>. Where is the Justice Department when you need them?"

    Scoble's rant on UI defaults

    • 33 Comments

    Robert Scoble posted an entry in his Longhorn blog on the subject of what the UI defaults should be. It sure has stirred up a lot of controversy. I may pick at the remarks over the upcoming days, but for now I posted responses to two of the comments he kicked up.

    We recently did a survey of users of all abilities. Beginners, intermediates, experts: The number one complaint all of them had about the user interface - 30% of all respondents mentioned this, evenly spread across all categories - was "Too many icons on the desktop." So it's not just beginners. Experts also don't like the clutter. (Yes, I was surprised by the results, too.)

    Tote Hose in Weilburg

    • 32 Comments

    Wenn Teenies shoppen gehen: Flugzeuge im Warenkorb. This article caught my eye because it opens with a German slang phrase I learned just a few weeks ago: "Tote Hose" which means roughly "Absolutely nothing doing". Here's an English version of the article for those whose German isn't quite up to snuff.

    A question for Germans: Since when did "shoppen" become a German word? What happened to "einkaufen"? Do people prefer the English words because they sound "cooler"?

    How to stop delivery of telephone books

    • 29 Comments

    Like many of you (I suspect), I don't use the paper telephone book. If I want to look something up, I go online. Yet every year I get a dozen different telephone books. I don't like them because a telephone book sitting on my front porch screams, "Rob this house! Nobody's home!" Besides, it's a waste of paper.

    So for the past few years I've been trying to stop delivery of all the telephone books. It's harder than you think.

    The first hurdle is figuring out what the "please take me off your mailing list" number is. Because they sure don't advertise it.

    I've discovered that the best way to get through to somebody who can take you off the list is to call the "How to order more copies of this wonderful telephone directory" number.

    Note: WorldPages added another wrinkle to the procedure. You see, they misprinted their own telephone number. Why anybody would voluntarily pay money to be listed by a telephone directory company that can't even get their own phone number right is beyond me.

    You have to be polite but firm when dealing with these people. Qwest is particularly tricky. I had called last year to stop delivery of all three of their phone books (they have three!), but in June another one showed up. I called them and they confirmed, "Yes, I see that we have a zero on your account, I don't know what happened. Let me try again."

    Aside: Why does Qwest want my telephone number to stop delivery of my telephone book? They deliver the book to a house, not to a telephone.

    Anyway, so that seems to keep the telephone book delivery gnomes at bay, until December, when yet another Qwest telephone book arrives at my doorstep. So I call again.

    "Yes, we have you marked as 'do not deliver'."

    "So why did I get one?"

    "This wasn't one of our standard phone books. This was a promotional phone book."

    Aha, so when you say "Do not deliver" it doesn't actually mean, "Do not deliver." It just means "Don't deliver the one that I am specifically complaining about." But there's this double-secret phone book delivery list that you have to specifically ask to be removed from, and we're not going to tell you about it until you ask.

    "Please remove me from the promotional phone book delivery list as well."

    "I'm sorry, I can't do that. There is no way for us to enter that in our computers." (Always blame the computers. One response to "Our computer can't do that" that I haven't yet resorted to is "Well, then I guess you'll have to do it by hand.")

    "Who delivers the promotional phone books?"

    "We contract that out to a local delivery company."

    "Can I talk to them?"

    "Hang on a second."

    <wait>

    "Okay, I can fill out a form so you don't receive promotional phone books either." (Aha, so she is going to do it by hand.)

    "Thank you. Good-bye."

    We'll see how long this lasts. I predict May.

    The unsafe device removal dialog

    • 27 Comments

    In a comment, somebody asked what the deal was with the unsafe device removal dialog in Windows 2000 and why it's gone in Windows XP.

    I wasn't involved with that dialog, but here's what I remember: The device was indeed removed unsafely. If it was a USB storage device, for example, there may have been unflushed I/O buffers. If it was a printer, there may have been an active print job. The USB stack doesn't know for sure (those are concepts at a higher layer that the stack doesn't know about) - all it knows is that it had an active channel with the device and now the device is gone, so it gets upset and yells at you.

    In Windows XP, it still gets upset but it now keeps its mouth shut. You're now on your honor not to rip out your USB drive before waiting two seconds for all I/O to flush, not to unplug your printer while a job is printing, etc. If you do, then your drive gets corrupted / print job is lost / etc. and you're on your own.

    What's with the catcow and dogoldfish?

    • 26 Comments

    Am I the only one who finds these icons bizarro?

    The catcow
    The dogoldfish

    The first time I saw them, I thought the first one was a cow and the second one was a goldfish. But apparently they're a cat and a dog.

    I never understood the need for these emoticons in the first place. If you need to add a smiley face to indicate that you're joking, then you need to work on your delivery.

    What order do programs in the startup group execute?

    • 22 Comments

    The programs in the startup group run in an unspecified order. Some people think they execute in the order they were created. Then you upgraded from Windows 98 to Windows 2000 and found that didn't work any more. Other people think they execute in alphabetical order. Then you installed a Windows XP multilingual user interface language pack and found that didn't work any more either.

    If you want to control the order that programs in the startup group are run, write a batch file that runs them in the order you want and put a shortcut to the batch file in your startup group.