March, 2004

  • The Old New Thing

    Defrauding the WHQL driver certification process

    • 81 Comments

    In a comment to one of my earlier entries, someone mentioned a driver that bluescreened under normal conditions, but once you enabled the Driver Verifier (to try to catch the driver doing whatever bad thing it was doing), the problem went away. Another commenter bemoaned that WHQL certification didn't seem to improve the quality of the drivers.

    Video drivers will do anything to outdo their competition. Everybody knows that they cheat benchmarks, for example. I remember one driver that ran the DirectX "3D Tunnel" demonstration program extremely fast, demonstrating how totally awesome their video card is. Except that if you renamed TUNNEL.EXE to FUNNEL.EXE, it ran slow again.

    There was another one that checked if you were printing a specific string used by a popular benchmark program. If so, then it only drew the string a quarter of the time and merely returned without doing anything the other three quarters of the time. Bingo! Their benchmark numbers just quadrupled.

    Anyway, similar shenanigans are not unheard of when submitting a driver to WHQL for certification. Some unscrupulous drivers will detect that they are being run by WHQL and disable various features so they pass certification. Of course, they also run dog slow in the WHQL lab, but that's okay, because WHQL is interested in whether the driver contains any bugs, not whether the driver has the fastest triangle fill rate in the industry.

    The most common cheat I've seen is drivers which check for a secret "Enable Dubious Optimizations" switch in the registry or some other place external to the driver itself. They take the driver and put it in an installer which does not turn the switch on and submit it to WHQL. When WHQL runs the driver through all its tests, the driver is running in "safe but slow" mode and passes certification with flying colors.

    The vendor then takes that driver (now with the WHQL stamp of approval) and puts it inside an installer that enables the secret "Enable Dubious Optimizations" switch. Now the driver sees the switch enabled and performs all sorts of dubious optimizations, none of which were tested by WHQL.

  • The Old New Thing

    The look of Luna

    • 76 Comments

    Luna was the code name for the Windows XP "look". The designers did a lot of research (and got off to a lot of false starts, as you might expect) before they came to the design they ultimately settled upon.

    During the Luna studies, people's reaction to Luna was often, "Wow, this would be a great UI for X," where X was "my dad" or "my employees" or "my daughter". People didn't look at it as the UI for themselves; rather, they thought it was a great UI for somebody else.

    It was sometimes quite amusing to read the feedback. One person would write, "I can see how this UI would work great in a business environment, but it wouldn't work on a home computer." and the very next person would write "I can see how this UI would work great on a home computer, but it wouldn't work in a business environment."

    (And interestingly, even though armchair usability experts claim that the "dumbed-down UI" is a hindrance, our studies showed that people were actually more productive with the so-called "dumb" UI. Armchair usability experts also claim that the Luna look is "too silly for serious business purposes", but in reality it tested very well on the "looks professional" scale.)

    Aero is the code name for the Longhorn "look". With Aero, the designers have turned an important corner. Now, when they show Aero to people, the reaction is, "Wow, this would be a great UI for me to use."

    People want Luna for others, but they want Aero for themselves.

    [Raymond is currently on vacation; this message was pre-recorded.]

  • The Old New Thing

    The car with no user-serviceable parts inside

    • 68 Comments

    For the first time, a team of women is challenged to develop a car, and the car they come up with requires an oil change only every 50,000 kilometers and doesn't even have a hood, so you can't poke around the engine.

    To me, a car has no user-serviceable parts inside. The only times I have opened the hood were when somebody else said, "Hey, let me take a look at the engine of your car." (I have a Toyota Prius.) On my previous car, the only time I opened the hood was to check the oil.

    Sometimes the open-source folks ask, "Would you buy a car whose hood can't be opened?" It looks like a lot of people (including me) would respond, "Yes."

  • The Old New Thing

    Why 16-bit DOS and Windows are still with us

    • 65 Comments

    Many people are calling for the abandonment of 16-bit DOS and 16-bit Windows compatibility subsystems. And trust me, when it comes time to pull the plug, I'll be fighting to be the one to throw the lever. (How's that for a mixed metaphor.)

    But that time is not yet here.

    You see, folks over in the Setup and Deployment group have gone and visited companies around the world, learned how they use Windows in their businesses, and one thing keeps showing up, as it relates to these compatibility subsystems:

    Companies still rely on them. Heavily.

    Every company has its own collection of Line of Business (LOB) applications. These are programs that the company uses for its day-to-day business, programs the company simply cannot live without. For example, at Microsoft two of our critical LOB applications are our defect tracking system and our source control system.

    And like Microsoft's defect tracking system and source control system, many of the LOB applications at major corporations are not commercially available software; they are internally-developed software, tailored to the way that company works, and treated as trade secrets. At a financial services company, the trend analysis and prediction software is what makes the company different from all its competitors.

    The LOB application is the deal-breaker. If a Windows upgrade breaks a LOB application, it's game over. No upgrade. No company is going to lose a program that is critical to their business.

    And it happens that a lot of these LOB applications are 16-bit programs. Some are DOS. Some are 16-bit Windows programs written in some ancient version of Visual Basic.

    "Well, tell them to port the programs to Win32."

    Easier said than done.

    • Why would a company go to all the effort of porting a program when the current version still works fine? If it ain't broke, don't fix it.
    • The port would have to be debugged and field-tested in parallel with the existing system. The existing system is probably ten years old. All its quirks are well-understood. It survived that time in 1998 when there was a supply chain breakdown and when production finally got back online, they had to run at triple capacity for a month to catch up. The new system hasn't been stress-tested. Who knows whether it will handle these emergencies as well as the last system?
    • Converting it from a DOS program to a Windows program would incur massive retraining costs for its employees ("I have always used F4 to submit a purchase order. Now I have this toolbar with a bunch of strange pictures, and I have to learn what they all mean." Imagine if somebody took away your current editor and gave you a new one with different keybindings. "But the new one is better.")
    • Often the companies don't have the source code to the programs any more, so they couldn't port it if they wanted to. It may use a third-party VB control from a company that has since gone out of business. It may use a custom piece of hardware that they have only 16-bit drivers for. And even if they did have the source code, the author of the program may no longer work at the company. In the case of a missing driver, there may be nobody at the company qualified to write a 32-bit Windows driver. (I know one company that used foot-pedals to control their software.)

    Perhaps with a big enough carrot, these companies could be convinced to undertake the effort (and risk!) of porting (or in the case of lost source code and/or expertise, rewriting from scratch) their LOB applications.

    But it'll have to be a really big carrot.

    Real example: Just this past weekend I was visiting a friend who lived in a very nice, professionally-managed apartment complex. We had occasion to go to the office, and I caught a glimpse of their computer screen. The operating system was Windows XP. And the program they were running to do their apartment management? It was running in a DOS box.

  • The Old New Thing

    Where do those customized web site icons come from?

    • 63 Comments

    In a comment to yesterday's entry, someone asked about the customized icon that appears in the address bar... sometimes.

    There's actually method to the madness. I was going to write about it later, but the comment (and misinformed answers) prompted me to move it up the schedule a bit. (The originally-scheduled topic for today - the history of Ctrl+Z - will have to wait.)

    Each web site can put a customized icon called favicon.ico into the root of the site, or the page can use a custom LINK tag in the HTML to specify a nondefault location for the favicon, which is handy if the page author does not have write permission into the root directory of the server.

    In order for the favicon.ico to show up in the address bar, (1) the site needs to offer a customized icon, (2) you have to have added the site to your favorites, and (3) the site icon must still be in your IE cache.

    IE does not go and hit every site you visit for a favicon.ico file; that would put too much strain on the server. (Heck, some people got hopping mad that IE was probing for favicon.ico files at all. Imagine the apoplectic fits people would have had if IE probed for the file at every hit!) Only when you add the site to your favorites does IE go looking for the favicon and stash it in the cache for future use.

  • The Old New Thing

    C++ scoped static initialization is not thread-safe, on purpose!

    • 49 Comments

    The rule for static variables at block scope (as opposed to static variables with global scope) is that they are initialized the first time execution reaches their declaration.

    Find the race condition:

    int ComputeSomething()
    {
      static int cachedResult = ComputeSomethingSlowly();
      return cachedResult;
    }
    

    The intent of this code is to compute something expensive the first time the function is called, and then cache the result to be returned by future calls to the function.

    A variation on this basic technique is advocated by this web site to avoid the "static initialization order fiasco". (Said fiasco is well-described on that page so I encourage you to read it and understand it.)

    The problem is that this code is not thread-safe. Statics with local scope are internally converted by the compiler into something like this:

    int ComputeSomething()
    {
      static bool cachedResult_computed = false;
      static int cachedResult;
      if (!cachedResult_computed) {
        cachedResult_computed = true;
        cachedResult = ComputeSomethingSlowly();
      }
      return cachedResult;
    }
    

    Now the race condition is easier to see.

    Suppose two threads both call this function for the first time. The first thread gets as far as setting cachedResult_computed = true, and then gets pre-empted. The second thread now sees that cachedResult_computed is true and skips over the body of the "if" branch and returns an uninitialized variable.

    What you see here is not a compiler bug. This behavior is required by the C++ standard.

    You can write variations on this theme to create even worse problems:

    class Something { ... };
    int ComputeSomething()
    {
      static Something s;
      return s.ComputeIt();
    }
    

    This gets rewritten internally as (this time, using pseudo-C++):

    class Something { ... };
    int ComputeSomething()
    {
      static bool s_constructed = false;
      static uninitialized Something s;
      if (!s_constructed) {
        s_constructed = true;
        new(&s) Something; // construct it
        atexit(DestructS);
      }
      return s.ComputeIt();
    }
    // Destruct s at process termination
    void DestructS()
    {
     ComputeSomething::s.~Something();
    }
    

    Notice that there are multiple race conditions here. As before, it's possible for one thread to run ahead of the other thread and use "s" before it has been constructed.

    Even worse, it's possible for the first thread to get pre-empted immediately after testing s_constructed but before setting it to "true". In this case, the object s gets double-constructed and double-destructed.

    That can't be good.

    But wait, that's not all. Now look at what happens if you have two runtime-initialized local statics:

    class Something { ... };
    int ComputeSomething()
    {
      static Something s(0);
      static Something t(1);
      return s.ComputeIt() + t.ComputeIt();
    }
    

    This is converted by the compiler into the following pseudo-C++:

    class Something { ... };
    int ComputeSomething()
    {
      static char constructed = 0;
      static uninitialized Something s;
      if (!(constructed & 1)) {
        constructed |= 1;
        new(&s) Something(0); // construct it
        atexit(DestructS);
      }
      static uninitialized Something t;
      if (!(constructed & 2)) {
        constructed |= 2;
        new(&t) Something(1); // construct it
        atexit(DestructT);
      }
      return s.ComputeIt() + t.ComputeIt();
    }
    

    To save space, the compiler placed the two "x_constructed" variables into a bitfield. Now there are multiple non-interlocked read-modify-store operations on the variable "constructed".

    Now consider what happens if one thread attempts to execute "constructed |= 1" at the same time another thread attempts to execute "constructed |= 2".

    On an x86, the statements likely assemble into

      or constructed, 1
    ...
      or constructed, 2
    
    without any "lock" prefixes. On multiprocessor machines, it is possible for the two instructions both to read the old value and then clobber each other by storing conflicting values.

    On ia64 and alpha, this clobbering is much more obvious since they do not have a single read-modify-store instruction; the three steps must be explicitly coded:

      ldl t1,0(a0)     ; load
      bis t1,1,t1      ; modify (OR in the bit)
      stl t1,0(a0)     ; store
    

    If the thread gets pre-empted between the load and the store, the value stored may no longer agree with the value being overwritten.
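    One way to avoid the lost update on the flag byte is an interlocked OR. The following is a minimal sketch, assuming a C++11 compiler (std::atomic postdates the compilers discussed here); "constructed" and ClaimBit are hypothetical names, not what any compiler actually emits. Because the OR also reports the previous value, exactly one caller learns that it was the one to set the bit. That closes the clobbering hole and the double-construction hole, but not the first race: another thread can still see the bit set and use the object before its constructor has finished.

```cpp
#include <atomic>

// Hypothetical stand-in for the compiler-generated "constructed" flags.
std::atomic<unsigned char> constructed{0};

// Atomically ORs `bit` into the flags. Returns true only for the single
// caller that found the bit clear, that is, the one that should construct.
bool ClaimBit(unsigned char bit)
{
  unsigned char previous = constructed.fetch_or(bit);
  return (previous & bit) == 0;
}
```

    A caller would write "if (ClaimBit(1)) { /* construct s */ }" in place of the separate test and "constructed |= 1".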

    So now consider the following insane sequence of execution:

    • Thread A tests "constructed" and finds it zero and prepares to set the value to 1, but it gets pre-empted.
    • Thread B enters the same function, sees "constructed" is zero and proceeds to construct both "s" and "t", leaving "constructed" equal to 3.
    • Thread A resumes execution and completes its load-modify-store sequence, setting "constructed" to 1, then constructs "s" (a second time).
    • Thread A then proceeds to construct "t" as well (a second time), setting "constructed" (finally) to 3.

    Now, you might think you can wrap the runtime initialization in a critical section:

    int ComputeSomething()
    {
     EnterCriticalSection(...);
     static int cachedResult = ComputeSomethingSlowly();
     LeaveCriticalSection(...);
     return cachedResult;
    }
    

    Because now you've placed the one-time initialization inside a critical section and made it thread-safe.

    But what if the second call comes from within the same thread? ("We've traced the call; it's coming from inside the thread!") This can happen if ComputeSomethingSlowly() itself calls ComputeSomething(), perhaps indirectly. Since that thread already owns the critical section, the code enters it just fine and you once again end up returning an uninitialized variable.
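    The re-entrant case can be reproduced on a single thread. This is a sketch under stated assumptions: std::recursive_mutex stands in for the critical section (both let the owning thread re-enter), the static is written in the hand-expanded form shown earlier, and g_lock and g_valueSeenByInnerCall are hypothetical names added only to observe what the inner call gets back.

```cpp
#include <mutex>

// Stand-in for the CRITICAL_SECTION; the owning thread may re-enter it.
std::recursive_mutex g_lock;
// Records what the nested call returned (hypothetical, for observation only).
int g_valueSeenByInnerCall = -1;

int ComputeSomething();

int ComputeSomethingSlowly()
{
  // Re-enters ComputeSomething() on the same thread.
  g_valueSeenByInnerCall = ComputeSomething();
  return 42;
}

int ComputeSomething()
{
  std::lock_guard<std::recursive_mutex> guard(g_lock);
  // Hand-expanded form of:
  //   static int cachedResult = ComputeSomethingSlowly();
  static bool cachedResult_computed = false;
  static int cachedResult;  // still zero until the assignment below
  if (!cachedResult_computed) {
    cachedResult_computed = true;
    cachedResult = ComputeSomethingSlowly();
  }
  return cachedResult;
}
```

    The outer call returns 42, but the nested call sails through the lock, finds the flag already set, and hands back the not-yet-computed value.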

    Conclusion: When you see runtime initialization of a local static variable, be very concerned.
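    If you do need lazy one-time initialization, one option is to make the bookkeeping explicit so that the re-entrant case fails loudly instead of quietly returning garbage. The class below is a minimal sketch, not a recommendation from this article; OnceCachedInt is a hypothetical name and the code assumes a C++11 compiler. It holds the lock for the whole computation, so other threads wait, and it tracks an "in progress" state that only the owning thread can ever observe.

```cpp
#include <functional>
#include <mutex>
#include <stdexcept>
#include <utility>

// Hypothetical helper: runs `compute` once and caches the int it returns.
class OnceCachedInt {
public:
  explicit OnceCachedInt(std::function<int()> compute)
    : compute_(std::move(compute)) {}

  int Get()
  {
    std::lock_guard<std::recursive_mutex> guard(lock_);
    if (state_ == State::InProgress) {
      // Only the initializing thread can get here; others block on lock_.
      throw std::logic_error("re-entrant one-time initialization");
    }
    if (state_ == State::NotStarted) {
      state_ = State::InProgress;
      try {
        value_ = compute_();
      } catch (...) {
        state_ = State::NotStarted;  // allow a later retry
        throw;
      }
      state_ = State::Done;
    }
    return value_;
  }

private:
  enum class State { NotStarted, InProgress, Done };
  std::recursive_mutex lock_;
  State state_ = State::NotStarted;
  int value_ = 0;
  std::function<int()> compute_;
};
```

    The trade-off in this sketch is that every call to Get() takes the lock, even after the value has been computed.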

  • The Old New Thing

    Blow the dust out of the connector

    • 42 Comments

    Okay, I'm about to reveal one of the tricks of Product Support.

    Sometimes you're on the phone with somebody and you suspect that the problem is something as simple as forgetting to plug it in, or that the cable was plugged into the wrong port. This is easy to do with those PS/2 connectors that fit both a keyboard and a mouse plug, or with network cables that can fit both into the upstream and downstream ports on a router.

    Here's the trick: Don't ask "Are you sure it's plugged in correctly?"

    If you do this, they will get all insulted and say indignantly, "Of course it is! Do I look like an idiot?" without actually checking.

    Instead, say "Okay, sometimes the connector gets a little dusty and the connection gets weak. Could you unplug the connector, blow into it to get the dust out, then plug it back in?"

    They will then crawl under the desk, find that they forgot to plug it in (or plugged it into the wrong port), blow out the dust, plug it in, and reply, "Um, yeah, that fixed it, thanks."

    (Or if the problem was that it was plugged into the wrong port, then the act of unplugging it and blowing into the connector takes their eyes off the port. Then when they go to plug it in, they will look carefully and get it right the second time because they're paying attention.)

    Customer saves face, you close a support case, everybody wins.

    Corollary: Instead of asking "Are you sure it's turned on?", ask them to turn it off and back on.

  • The Old New Thing

    On a server, paging = death

    • 40 Comments

    Chris Brumme's latest treatise contained the sentence "Servers must not page". That's because on a server, paging = death.

    I had occasion to meet somebody from another division who told me this little story: They had a server that went into thrashing death every 10 hours, like clockwork, and had to be rebooted. To mask the problem, the server was converted to a cluster, so what really happened was that the machines in the cluster took turns being rebooted. The clients never noticed anything, but the server administrators were really frustrated. ("Hey Clancy, looks like number 2 needs to be rebooted. She's sucking mud.")

    The reason for the server's death? Paging.

    There was a four-bytes-per-request memory leak in one of the programs running on the server. Eventually, all the leakage filled available RAM and the server was forced to page. Paging means slower response, but of course the requests for service kept coming in at the normal rate. So the longer you take to turn a request around, the more requests pile up, and then it takes even longer to turn around the new requests, so even more pile up, and so on. The problem snowballed until the machine just plain keeled over.
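    The pile-up can be seen in a toy model. This is only a sketch with made-up units, not a description of the actual server: each tick a fixed number of requests arrive, the server finishes at most "capacity" of them, and whatever is left over waits. SimulateBacklog and its parameters are hypothetical names.

```cpp
#include <algorithm>
#include <vector>

// Toy model: returns the number of waiting requests after each tick.
// `arrivals` come in per tick; at most `capacity` are completed per tick.
std::vector<long> SimulateBacklog(long arrivals, long capacity, int ticks)
{
  std::vector<long> backlog;
  long pending = 0;
  for (int i = 0; i < ticks; ++i) {
    pending = std::max(0L, pending + arrivals - capacity);
    backlog.push_back(pending);
  }
  return backlog;
}
```

    With capacity at or above the arrival rate, the backlog stays at zero. Once paging pushes capacity below the arrival rate, say 80 against 100, the backlog grows by 20 every tick and never recovers, and each new request waits behind everything already queued.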

    After much searching, the leak was identified and plugged. Now the servers chug along without a hitch.

    (And since the reason for the cluster was to cover for the constant crashes, I suspect they reduced the size of the cluster and saved a lot of money.)

  • The Old New Thing

    Why is the line terminator CR+LF?

    • 40 Comments

    This protocol dates back to the days of teletypewriters. CR stands for "carriage return" - the CR control character returned the print head ("carriage") to column 0 without advancing the paper. LF stands for "linefeed" - the LF control character advanced the paper one line without moving the print head. So if you wanted to return the print head to column 0 (ready to print the next line) and advance the paper (so it prints on fresh paper), you needed both CR and LF.

    If you go to the various internet protocol documents, such as RFC 0821 (SMTP), RFC 1939 (POP), RFC 2060 (IMAP), or RFC 2616 (HTTP), you'll see that they all specify CR+LF as the line termination sequence. So the real question is not "Why do CP/M, MS-DOS, and Win32 use CR+LF as the line terminator?" but rather "Why did other people choose to differ from these standards documents and use some other line terminator?"

    Unix adopted plain LF as the line termination sequence. If you look at the stty options, you'll see that the onlcr option specifies whether a LF should be changed into CR+LF. If you get this setting wrong, you get stairstep text, where

    each
        line
            begins
    
    where the previous line left off. So even unix, when left in raw mode, requires CR+LF to terminate lines. The implicit CR before LF is a unix invention, probably as an economy, since it saves one byte per line.
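    The output translation that onlcr asks for amounts to putting a CR in front of every LF. Here is a minimal sketch of that idea; ExpandLfToCrLf is a hypothetical helper written for illustration, not part of any terminal driver or library.

```cpp
#include <string>

// Inserts a CR before every LF, roughly what the terminal driver does to
// output when onlcr is enabled. (Hypothetical helper, not a real API.)
std::string ExpandLfToCrLf(const std::string& text)
{
  std::string out;
  out.reserve(text.size());
  for (char ch : text) {
    if (ch == '\n') {
      out += '\r';
    }
    out += ch;
  }
  return out;
}
```

    Note that this sketch does not look at what precedes the LF, so text that already contains CR+LF comes out as CR+CR+LF.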

    The unix ancestry of the C language carried this convention into the C language standard, which requires only "\n" (which encodes LF) to terminate lines, putting the burden on the runtime libraries to convert raw file data into logical lines.

    The C language also introduced the term "newline" to express the concept of "generic line terminator". I'm told that the ASCII committee changed the name of character 0x0A to "newline" around 1996, so the confusion level has been raised even higher.

    Here's another discussion of the subject, from a unix perspective.

  • The Old New Thing

    Char.IsDigit() matches more than just "0" through "9"

    • 38 Comments

    Warning: .NET content ahead!

    Yesterday, Brad Abrams noted that Char.IsLetter() matches more than just "A" through "Z".

    What people might not realize is that Char.IsDigit() matches more than just "0" through "9".

    Valid digits are members of the following category in UnicodeCategory: DecimalDigitNumber.

    But what exactly is a DecimalDigitNumber?

    DecimalDigitNumber
    Indicates that the character is a decimal digit; that is, in the range 0 through 9. Signified by the Unicode designation "Nd" (number, decimal digit). The value is 8.

    At this point you have to go to the Unicode Standard Committee to see exactly what qualifies as "Nd", and then you get lost in a twisty maze of specifications and documents, all different.

    So let's run an experiment.

    class Program {
      public static void Main(string[] args) {
        System.Console.WriteLine(
          System.Text.RegularExpressions.Regex.Match(
            "\x0661\x0662\x0663", // "١٢٣"
            "^\\d+$").Success);
        System.Console.WriteLine(
          System.Char.IsDigit('\x0661'));
      }
    }
    

    The characters in the string are Arabic digits, but they are still digits, as evidenced by the program output:

    True
    True
    

    Uh-oh. Do you have this bug in your parameter validation? (More examples.) If you use a pattern like @"^\d+$" to validate that you receive only digits, and then later use System.Int32.Parse() to parse it, then I can hand you some Arabic digits and sit back and watch the fireworks. The Arabic digits will pass your validation expression, but when you get around to using them, boom, you throw a System.FormatException and die.
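    The safer check is to name the exact range you are prepared to parse. The sketch below shows the idea in C++ rather than C#, and IsAsciiDigitString is a hypothetical helper: it accepts only U+0030 through U+0039 and rejects every other member of the "Nd" category.

```cpp
#include <string>

// Accepts only the ASCII digits U+0030 through U+0039; anything else,
// including other Unicode "Nd" digits, is rejected. An empty string is
// rejected too. (Hypothetical helper written for illustration.)
bool IsAsciiDigitString(const std::u16string& text)
{
  if (text.empty()) {
    return false;
  }
  for (char16_t ch : text) {
    if (ch < u'0' || ch > u'9') {
      return false;
    }
  }
  return true;
}
```

    With this check, u"123" passes and the Arabic digits u"\u0661\u0662\u0663" do not. In a .NET regular expression, the corresponding change is to write [0-9] instead of \d.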
