Larry Osterman's WebLog

Confessions of an Old Fogey
  • Larry Osterman's WebLog

    Larry and the "Ping of Death"

    • 34 Comments

    Also known as "Larry mounts a DDOS attack against every single machine running Windows NT"

    Or: No stupid mistake goes unremembered.

     

    I was recently in the office of a very senior person at Microsoft debugging a problem on his machine.  He introduced himself, and commented "We've never met, but I've heard of you.  Something about a ping of death?"

    Oh. My. Word.  People still remember the "ping of death"?  Wow.  I thought I was long past the ping of death (after all, it's been 15 years), but apparently not.  I'm not surprised when people who were involved in the PoD incident remember it (it was pretty spectacular), but to have a very senior person who wasn't even working at the company at the time remember it is not a good thing :).

    So, for the record, here's the story of Larry and the Ping of Death.

    First I need to describe my development environment at the time (actually, it's pretty much the same as my dev environment today).  I had my primary development machine running a version of NT; it was running a kernel debugger connected to my test machine over a serial cable.  When my test machine crashed, I would use the kernel debugger on my dev machine to debug it.  There was nothing debugging my dev machine, because NT was pretty darned reliable at that point and I didn't need a kernel debugger 99% of the time.  In addition, the corporate network wasn't a switched network - as a result, each machine received datagram traffic from every other machine on the network.

     

    Back in the day, I was working on the NT 3.1 browser (I've written about the browser here and here before).  As I was working on some diagnostic tools for the browser, I wrote a tool to manually generate some of the packets used by the browser service.

    One day, as I was adding some functionality to the tool, my dev machine crashed, and my test machine locked up.

    *CRUD*.  I can't debug the problem to see what happened because I lost my kernel debugger.  Ok, I'll reboot my machines, and hopefully whatever happened will hit again.

    The failure didn't hit, so I went back to working on the tool.

    And once again, my machine crashed.

    At this point, everyone in the offices around me started to get noisy - there was a great deal of cursing going on.  What I'd not realized was that every machine had crashed at the same time as my dev machine had crashed.  And I do mean EVERY machine.  Every single machine in the corporation running Windows NT had crashed.  Twice (after allowing just enough time between crashes to allow people to start getting back to work).

     

    I quickly realized that my test application was the cause of the crash, and I isolated my machines from the network and started digging in.  I quickly root caused the problem - the broadcast that was sent by my test application was malformed and it exposed a bug in the bowser.sys driver.  When the bowser received this packet, it crashed.

    I quickly fixed the problem on my machine and added the change to the checkin queue so that it would be in the next day's build.

     

    I then walked around the entire building and personally apologized to every single person on the NT team for causing them to lose hours of work.  And 15 years later, I'm still apologizing for that one moment of utter stupidity.

  • Larry Osterman's WebLog

    News Flash: Spaces are legal characters in both filenames and passwords!

    • 33 Comments

    I recently figured out a problem that I've been having with one of our internal tools.  The tool is used to automatically deploy our daily builds (extremely handy when you're doing that every other day to several test machines).  As a part of the tool, you need to include the password for a test account.

    We normally use the tool from an automatic test harness; essentially I enter the 4 or 5 parameters to the test and it automatically runs the tool (and other stuff if necessary).

    The problem I had was that I would enter my account and password but the tool kept failing after reporting invalid parameter errors.  It worked perfectly when I used a different account that is used by our testers, but when I tried using my preferred test account it kept on failing with some kind of command line parsing error.

    Eventually I tracked down the actual command line being passed by the harness into the tool and I was immediately able to see the problem.

     

    Being a security geek, my "password" is actually a passphrase - the theory is that passphrases are harder to crack than passwords because they are drawn from a larger dictionary.  So my passwords tend to be things like "The rain in Spain falls mainly on the plain".

    In this case, the test harness took my password and passed it to the tool as follows (assuming that the command line for the test tool is "testtool.exe -useuser <username> <password>"):

    testtool.exe -useuser testaccount The rain in Spain falls mainly on the plain

    Doh!  Either the test tool or the test harness wasn't handling the spaces correctly.  I tried an experiment and ran the test tool manually:

    testtool.exe -useuser testaccount "The rain in Spain falls mainly on the plain"

    and it worked!  So it appears that the problem was that the test harness wasn't correctly handling the spaces in my password.
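
    For what it's worth, the fix on the harness side is conceptually tiny.  Here's a minimal sketch - a hypothetical helper, not the actual harness code - of what "handling the spaces" means when you build a command line out of user-supplied parameters:

        #include <string>

        // Quote any argument that contains spaces before appending it to a
        // command line.  (Full Windows command-line quoting also has to escape
        // embedded quotes and trailing backslashes; this handles just the
        // simple case in this story.)
        std::wstring QuoteArgument(const std::wstring &arg)
        {
            if (arg.find(L' ') == std::wstring::npos)
            {
                return arg;                  // No spaces - safe to pass as-is.
            }
            return L"\"" + arg + L"\"";      // "The rain in Spain ..."
        }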

     

    So I went to the maintainer of the test harness and described the problem to him.

    His response?  "I never knew you could have spaces in a password!  Wow, I didn't even think of that."

     

    Sigh.

    On Microsoft operating systems, spaces have been legal in filenames since MS-DOS 2.0 (about 1982) and in passwords since MS-NET 1.0 (about 1984).  I'm astonished that 25 years later there are people who still don't know that.

  • Larry Osterman's WebLog

    Why does the NT redirector close file handles when the network connection breaks?

    • 33 Comments

    Yesterday, Raymond posted an article about power suspend and its behavior in XP and Vista.  It got almost as many comments as a post in the IE blog does :).

    I want to write about one of the comments made to the article (by "Not Amused Again"):

    James Schend: Nice one. Do you know why the damn programs have to ask before allowing things to go into suspend mode? It's because MS's freakin networking code is freakin *broken*. Open files are supposed to be reconnected after a suspend, and *they are not*, leading to losses in any open files. (Not that saving the files then allowing the suspend to continue works either, as the wonderful opportunistic file locking junk seems to predictably barf after suspends.)

     

    A long time ago, in a building far, far away, I worked on the first version of the NT network filesystem (a different version was released with Windows 2000).  So I know a fair amount about this particular issue.

    The answer to Not Amused Again's complaint is: "Because the alternative is worse".

    Unlike some other network architectures (think NFS), CIFS attempts to provide a reliable model for client/server networking.  On a CIFS network, the behavior of network files is as close to the behavior of local files as possible.

    That is a good thing, because it means that an application doesn't have to realize that files are opened over the network.  All the filesystem primitives that work locally also work over the network transparently.  That means that the local file sharing and locking rules are applied to files on the network.

    The problem is that networks are inherently unreliable.  When someone trips over the connector to the key router between your client and the server, the connection between the two is going to be lost.  The client can re-establish the connection to the network share, but what should be done about the files opened over the network?

    There are a couple of criteria that any solution to this problem must have:

    First off, the server is OBLIGATED to close the file when the connection with the client is disconnected.  It has no ability to keep the file open for the client.  So any strategy that involves the server keeping the client's state around is a non-starter (otherwise you have a DoS scenario associated with the client). Any recovery strategy has to be done entirely on the client. 

    Secondly, it is utterly unacceptable to introduce the possibility of data corruption.  If reopening the file can result in data corruption, then that scenario can't be allowed.

    So let's see if we can figure out the rules for re-opening the file:

    First off, what happens if you can't reopen the file?   Maybe you had the file opened in exclusive mode and once the connection was disconnected, someone else got in and opened it exclusively.  How are you going to tell the client that the file open failed?  What happens if someone deleted the file on the share once it was closed?  You can't return file not found, since the file was already opened.

    The thing is, it turns out that failing to re-open the file is actually the BEST option you have.  The others are actually even worse than that scenario.

     

    Let's say that you succeed in re-opening the file.  Let's consider some other scenarios:

    What happens if you had locks on the file?  Obviously you need to re-apply the locks, that's a no-brainer.  But what happens if they can't be applied?  The other thing to consider about locks is that a client that has a lock open on a region of the file assumes that no other client can write to that region of the file (remember: network files look just like local files).  So they assume that nobody else has changed that region.  But what happens if someone else does change that region?  Now you just introduced a data corruption error by re-opening the file.

    This scenario is NOT far-fetched.  It's actually the usage pattern used by most file based database applications (R:Base, dBase, Microsoft Access, etc).  Modern client/server databases just keep their files open all the time, but non client/server database apps let multiple clients open a single database file and use record locking to ensure that the database integrity is preserved (the applications lock a region of the file, alter it, then unlock it).  Since the server closed the file when the connection was lost, other applications could have come in, locked a region of the file, modified it, then unlocked it.  But YOUR client doesn't know this happened.  It thinks it still has the lock on the region of the file, so it owns the contents of that region.
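
    That lock/alter/unlock cycle maps directly onto the Win32 byte-range locking APIs.  Here's a minimal sketch (the record layout and function name are invented for illustration):

        // Update one record in a shared database file, using a byte-range
        // lock to hold off every other client while the region is altered.
        BOOL UpdateRecord(HANDLE hFile, DWORD recordOffset, const BYTE *data, DWORD cbRecord)
        {
            OVERLAPPED ov = {0};
            ov.Offset = recordOffset;

            // Take an exclusive lock on just this record.
            if (!LockFileEx(hFile, LOCKFILE_EXCLUSIVE_LOCK, 0, cbRecord, 0, &ov))
            {
                return FALSE;
            }

            // Alter the locked region.
            DWORD cbWritten;
            SetFilePointer(hFile, recordOffset, NULL, FILE_BEGIN);
            BOOL fOk = WriteFile(hFile, data, cbRecord, &cbWritten, NULL);

            // Release the lock so other clients can get at the record.
            UnlockFileEx(hFile, 0, cbRecord, 0, &ov);
            return fOk;
        }

    If the connection drops between the lock and the unlock, the server releases the lock when it closes the file - and that's exactly the window the re-opening client can't detect.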

    Ok, so you decide that if the client has a lock on the file, we won't allow them to re-open the file.  Not that huge a restriction, but it means we won't re-open database files over the network.  You just pissed off a bunch of customers who wanted to put their shared database on the server.

     

    Next, what happens if the client had the file opened exclusively?  That means that they know that nobody else in the world has the file open, so they can assume that the file's not been modified by anyone else.  That means that the client can't re-open the file if it's opened in exclusive mode.

    Next let's consider the case where the file's not opened exclusively: There are four cases of interest, involving two sharing modes and two access modes: FILE_SHARE_READ and FILE_SHARE_WRITE  (FILE_SHARE_DELETE isn't very interesting), and FILE_READ_DATA and FILE_WRITE_DATA.

    There are four interesting combinations (the cases with more than one sharing bit set collapse into the FILE_SHARE_WRITE case), laid out below:

    • FILE_READ_DATA, FILE_SHARE_READ: This is effectively the same as exclusive mode - nobody else can write to the file, and the client is only reading the file, thus it may cache the contents of the file.
    • FILE_READ_DATA, FILE_SHARE_WRITE: The client is only reading data, and it isn't caching the data being read (because others can write to the file).
    • FILE_WRITE_DATA, FILE_SHARE_READ: The client can write to the file and nobody else can write to it, thus it can cache the contents of the file.
    • FILE_WRITE_DATA, FILE_SHARE_WRITE: The client is only writing data, and it can't be caching (because others can write to the file).

    In short: with FILE_SHARE_READ, others can read the file but nobody else can write to it, so the client can and will cache the contents of the file.  With FILE_SHARE_WRITE, no assumptions can be made by the client, so the client can have no information cached about the file.

    So this means that the ONLY circumstance in which it's reliable to re-open the file is when a file has never had any locks taken on it and when it has been opened for FILE_SHARE_WRITE mode.
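
    In CreateFile terms, that "safe" case looks like this (a minimal sketch, assuming the usual <windows.h> declarations):

        // Opened for read, shared for both read and write, with no byte-range
        // locks taken: the client can't have anything cached, so this is the
        // one case where transparently re-opening the file would be safe.
        HANDLE fileHandle = CreateFileW(L"\\\\server\\share\\shared.dat",
                                        GENERIC_READ,                       // FILE_READ_DATA
                                        FILE_SHARE_READ | FILE_SHARE_WRITE, // others may read and write
                                        NULL,
                                        OPEN_EXISTING,
                                        FILE_ATTRIBUTE_NORMAL,
                                        NULL);
        if (fileHandle == INVALID_HANDLE_VALUE)
        {
            // GetLastError() has the Win32 error code.
        }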

     

    So the number of scenarios where it's safe to re-open the file is pretty slim.  We spent a long time discussing this back in the NT 3.1 days and eventually decided that it wasn't worth the effort to fix this.

    Since we can't re-open the files, the only option is to close the file.

    As a point of information, the Lan Manager 2.0 redirector for OS/2 did have such a feature, but we decided that we shouldn't implement it for NT 3.1.  The main reason for this was that the majority of files opened on OS/2 were opened for SHARE_WRITE access (it was the default), but on NT the default is to open files in exclusive mode, so the majority of files can't be reopened.

     

  • Larry Osterman's WebLog

    Annoying coding tricks

    • 33 Comments
    I'm sure I've linked to The Daily WTF before; on it, Alex Papadimoulis collects egregious programming mistakes and distributes them one per day.

    This one isn't really that hideous, but I ran into this construct the other day while working on some stuff, and it just flat-out annoys me :) (the code's been heavily sanitized to protect the innocent).

        static BOOL fWasntDoingSomething;
        BOOL fDontDoSomething, fOldValue;

        fDontDoSomething = DecideIfWeShouldntDoSomething();

        fOldValue = InterlockedExchange( &fWasntDoingSomething, fDontDoSomething);
        if ( fOldValue ^ fWasntDoingSomething)
        {
            :
            :
        }
     

    I almost don't even know where to begin on this one.  It's three lines of code (and 2 lines of variable declarations), chock full of badness.

    But the thing that really got my goat (and the thing that caused me to write this post) was the use of XOR when != would do just as well.  By using XOR, the author of the code guaranteed that whoever was looking at the code would have to sit and think about what the code was doing - for some reason, the logic table for XOR isn't sitting at the front of my short-term memory.

    And then there's the variable names.  I don't know about y'all, but I just HATE trying to wrap my head around negative Boolean variables, especially when they're used as double negatives (!fDontDoSomething).  They always make me need to think twice when I see them. 

    Wouldn't it have been SO much better if the code had been:

        static BOOL fWasDoingSomething;
        BOOL fDoSomething, fOldValue;

        fDoSomething = DecideIfWeShouldDoSomething();

        fOldValue = InterlockedExchange( &fWasDoingSomething, fDoSomething);
        if ( fOldValue != fWasDoingSomething)
        {
            :
            :
        }

    ?

    Ah, I feel MUCH better now :)  Venting always helps :)

     

  • Larry Osterman's WebLog

    More OOBE experiences - D&D Online

    • 33 Comments

    Back in January, I wrote about the OOBE of my iRiver H10 player, and I've got another horrid first run story today.

    Daniel's been pestering us to get D&D Online, and yesterday it arrived.  I figured I'd install it for him (to save him the trouble).

     

    Man, talk about hideous first run experiences.  First off, the CD installed SLOWLY.  Now, the machine I'm running this on isn't the fastest on the planet, but I can rip a CD in way less time than it took the game to install.  My guess is that they were decompressing data on the fly or something.  But slow installations aren't a huge issue; I know how hard it can be to copy tons of data onto a machine.

    My biggest complaints came when I launched the application.

    It popped up a pretty splash screen, and some status text flashed on the screen about checking for web sites, etc.  Then it hung.  I waited for about 5 minutes with no progress; it just hung.  What was worse is that the app didn't show up in the task manager's Applications list, so I had to find dndlauncher.exe in the process list and kill it manually.

    So I restarted.  This time it started and made it through the initial UI, and presented a new loader screen.  The loader started downloading two versions of the client executable.  That was weird - the game's only been online for 7 days and there are already 2 new versions of the client available?  No big deal.  One thing I noticed was that the download was SLOW - 10KB per second according to the progress meter.  Looking at the network traffic in taskmgr, it wasn't receiving much data; the client was just slow.

    And then it hung downloading the client executable.  This time it DID have an entry in the taskbar, but I couldn't right click on it to stop it; I had to go back to the task manager to kill it.

    Third try, this time it got through downloading the client programs, and it started patching game data.  There were 50(!) patches available for the game.  Again, this is a game that's been online for all of 7 days, and there were ALREADY 50 patches for game data?  And once again, the launcher hung downloading the patches.  And I'm still getting 10KB/second download speeds.

    Fourth try (I'm getting pretty annoyed at this point), and it starts downloading more patches.  This time, the patches came in quickly - 75KB/second.  My guess is that their load balancing solution on their patch servers doesn't work, and some of the patch machines were overloaded.

    And again the game hung after downloading all the patches.

    The game also installs a notification area icon; this time I clicked on it.  A menu flashed on the screen really quickly, and then disappeared.  So back to taskmgr to kill the launcher app.

    On the 5th time, I was finally allowed to log in and start the game, but still....  4 hangs of the client app that required taskmgr intervention to recover?  10KB/sec download speeds?

     

    And then there's the notification area icon.  By default, the game installs itself into the notification area, and it's set to download game patches every 4 hours.

    Every 4 hours?  They patch this game frequently enough that you need to check for patches EVERY FOUR HOURS?!!

    Mindboggling.

     

    I've not played the game beyond racing through the character creation mechanism - this is Daniel's game to play, so I have absolutely no opinions about the relative quality of the game (although it seemed to be very pretty for the 2 minutes I played it).

    I know this is a major new game in its first week or so of retail release, so it's expected that things may be overloaded - there were 10 new characters in the entry area when I logged in, so the game servers are clearly being hammered, but still...

     

  • Larry Osterman's WebLog

    Error Code Paradigms

    • 33 Comments

    At some point when I was reading the comments on the "Exceptions as repackaged error codes" post, I had an epiphany (it's reflected in the comments to that thread but I wanted to give it more visibility).

    I'm sure it's just an indication of just how slow my mind is working these days, but I just realized that in all the "error code" vs. "exception" discussions that seem to go on interminably, there are two UNRELATED issues being discussed.

    The first is about error semantics - what information do you hand to the caller about what failed.  The second is about error propagation - how do you report the failure to the caller.

    It's critical for any discussion about error handling to keep these two issues separate, because it's really easy to commingle them.  And when you commingle them, you get confusion.

    Consider the following example classes (cribbed in part from the previous post):

    class Win32WrapperException
    {
        // Returns a handle to the open file.  If an error occurs, it throws an object derived from
        // System.Exception that describes the failure.
        HANDLE OpenException(LPCWSTR FileName)
        {
            HANDLE fileHandle;
            fileHandle = CreateFile(FileName, xxxx);
            if (fileHandle == INVALID_HANDLE_VALUE)
            {
                throw (System.Exception(String.Format("Error opening {0}: {1}", FileName, GetLastError())));
            }
            return fileHandle;
        };
        // Returns a handle to the open file.  If an error occurs, it throws the Win32 error code that describes the failure.
        HANDLE OpenError(LPCWSTR FileName)
        {
            HANDLE fileHandle;
            fileHandle = CreateFile(FileName, xxxx);
            if (fileHandle == INVALID_HANDLE_VALUE)
            {
                throw (GetLastError());
            }
            return fileHandle;
        };
    };

    class Win32WrapperError
    {
        // Returns either NULL if the file was successfully opened or an object derived from System.Exception on failure.
        System.Exception OpenException(LPCWSTR FileName, OUT HANDLE *FileHandle)
        {
            *FileHandle = CreateFile(FileName, xxxx);
            if (*FileHandle == INVALID_HANDLE_VALUE)
            {
                return new System.Exception(String.Format("Error opening {0}: {1}", FileName, GetLastError()));
            }
            else
            {
                return NULL;
            }

        };
        // Returns either NO_ERROR if the file was successfully opened or a Win32 error code describing the failure.
        DWORD OpenError(LPCWSTR FileName, OUT HANDLE *FileHandle)
        {
            *FileHandle = CreateFile(FileName, xxxx);
            if (*FileHandle == INVALID_HANDLE_VALUE)
            {
                return GetLastError();
            }
            else
            {
                return NO_ERROR;
            }
        };
    };

    I fleshed out the example from yesterday and broke it into two classes to more clearly show what I'm talking about.  I have two classes that perform the same operation.  Win32WrapperException is an example of a class that solves the "How do I report a failure to the caller" problem by throwing exceptions.  Win32WrapperError is an example that solves the "How do I report a failure to the caller" problem by returning an error code.

    Within each class are two different methods, each of which solves the "What information do I return to the caller" problem - one returns a simple numeric error code, the other returns a structure that describes the error.  I used System.Exception as the error structure, but it could have just as easily been an IErrorInfo class, or any one of a bazillion other ways of reporting errors to callers.
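
    To see how the two choices land on the caller, here are hypothetical callers of each style, sketched in the same C++/.NET hybrid pseudo-code as the classes above (the wrapper object names are made up):

        // Propagation by exception: failure takes the catch path.
        try
        {
            HANDLE fileHandle = wrapperException.OpenException(fileName);
            // ... use fileHandle ...
        }
        catch (System.Exception e)
        {
            // e carries the rich "what failed" information.
        }

        // Propagation by return value: failure takes the if path, but the
        // caller gets exactly the same rich information.
        HANDLE fileHandle;
        System.Exception e = wrapperError.OpenException(fileName, &fileHandle);
        if (e != NULL)
        {
            // Same information, propagated as a return value instead.
        }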

    But looking at these examples, it's not clear which is better.  If you believe that reporting errors by exceptions is better than reporting by error codes, is Win32WrapperException::OpenError better than Win32WrapperError::OpenException?  Why? 

    If you believe that reporting errors by error codes is better, then is Win32WrapperError::OpenError better than Win32WrapperError::OpenException?  Why?

    When you look at the problem in this light (as two unrelated problems), it allows you to look at the "exceptions vs. error codes" debate in a rather different light.  Many (most?) of the arguments that I've read in favor of exceptions as an error propagation mechanism concentrate on the additional information that the exception carries along with it.  But those arguments ignore the fact that it's totally feasible (and in fact reasonable) to define an error code based system that provides the caller with exactly the same level of information that is provided by exceptions.

    These two problems are equally important when dealing with errors.  The mechanism for error propagation has critical ramifications for all aspects of engineering - choosing one form of error propagation over another can literally alter the fundamental design of a system.

    And the error semantic mechanism provides critical information for diagnosability - both for developers and for customers.  Everyone HATES seeing a message box with nothing but "Access Denied" and no additional context.

     

    And yes, before people complain, I recognize that none of the common error code returning APIs today provide the same quality of error semantics that System.Exception does as first class information - the error return information is normally hidden in a relatively unsophisticated scalar value.  I'm just saying that if you're going to enter into a discussion of error codes vs. exceptions, from a philosophical point of view, then you need to recognize that there are two related problems that are being discussed, and differentiate between these two. 

    In other words, are you advocating exceptions over error codes because you like how they solve the "what information do I return to the caller?" problem, or are you advocating them because you like how they solve the "how do I report errors?" problem?

    Similarly, are you denigrating exceptions because you don't like their solution to the "how do I report errors?" problem and ignoring the "what information do I return to the caller?" problem?

    Just some food for thought.

  • Larry Osterman's WebLog

    What does style look like, part 4

    • 33 Comments
    Continuing the discussion of "style"...

    Yesterday, I showed the code reformatted into "BSD" style format, which looks like:

    #include "list.h"
    main(C cArg, SZ rgszArg[])
    {
        I iNode;
        I cNodes = atoi(rgszArg[1]);
        I cNodesToSkip = atoi(rgszArg[2]);
        PNODE pnodeT;
        PNODE pnodeCur;
        InitNodes(cNodes);
        for (iNode = 2, pnodeCur = PnodeNew(1); iNode <= cNodes ; iNode++)
        {
            pnodeT = PnodeNew(iNode);
            InsertNext(pnodeCur, pnodeT);
            pnodeCur = pnodeT;
        }
        while (pnodeCur != PnodeNext(pnodeCur))
        {
            for (iNode = 1; iNode < cNodesToSkip ; iNode++)
            {
                pnodeCur = PnodeNext(pnodeCur);
            }
            FreeNode(PnodeDeleteNext(pnodeCur));
        }
        printf("%d\n", Item(nodeCur));
    }

    I chose one of the variants that I personally find most attractive; any one of them could work.  The thing about this example is that there are no comments.  That's reasonable given that it's an example from a textbook, and as such the words in the textbook that accompany the example suffice as the documentation.  But all code begs to be documented, so let's see what happens.

    When you look at documenting a function, there are a couple of things to keep in mind when considering style (there are other important things, like the relevance of the comments, etc, but this post's about style).

    The first question that comes to mind, almost immediately is "What commenting form should you use?"  For C/C++, there are a number of options.

    First off, C and C++ support two different styles of comment.  The first is the traditional /* */ style of comment.  And there's the C++-style // comment.  Each of the two comment styles has different flavors.

    When I was in college, I loved using some of the options that /* */ comments enabled.  My code was just full of things like:

             :
             :
        }

        /*************************************************************/
        /*                                                           */
        /*                                                           */
        /*   Loop through the nodes looking for the current node.    */
        /*                                                           */
        /*                                                           */
        /*************************************************************/
        while (pnodeCur != PnodeNext(pnodeCur))
        {
            for (iNode = 1; iNode < cNodesToSkip ; iNode++)
             :
             :
     

    I also had this thing about putting a comment on every line, especially for assignments.  And every comment lined up on column 40 if possible.

             :
             :
        }

        /*************************************************************/
        /*                                                           */
        /*                                                           */
        /*   Loop through the nodes looking for the current node.    */
        /*                                                           */
        /*                                                           */
        /*************************************************************/
        while (pnodeCur != PnodeNext(pnodeCur))    // Keep looping until PnodeNext == pnodeCur
        {
            for (iNode = 1; iNode < cNodesToSkip ; iNode++) // Skip through cNodesToSkip nodes
             :                             // A comment for this line.
             :                             // A comment for that line.
     

    When I look back at my old code, the code screams "1980's", just like the hairdos of "A Flock of Seagulls" do.  But there are times when big honking block comments like that are appropriate, like when you're pointing out something that's REALLY important - like when you're about to do something that would cause Raymond Chen to swear like a sailor when he runs into your application.

    Also, while I'm on the subject of comments on every line: there ARE situations where you want a comment on every line.  For example, if you're ever forced to write assembly language (for any processor), you really MUST comment every line.  It's far too easy to forget which registers are supposed to contain what contents, so adding comments on every line is essential to understanding the code.

    /* */ comments have the disadvantage that they take up a fair amount of space.  This means that they increase the effective line length.  // comments don't have this problem.  On the other hand, /* */ comments stand out more.

    Personally I feel that both comment styles should be used.  One fairly effective form that I've seen is:

             :
             :
        }
        /*
         * Loop through the nodes until you hit the current node.
         */
        while (pnodeCur != PnodeNext(pnodeCur))
        {
            for (iNode = 1; iNode < cNodesToSkip ; iNode++)
            {
                pnodeCur = PnodeNext(pnodeCur);  // Skip to the next node.
            }
             :
             :
     

    Of course that's not very different from:

             :
             :
        }

        // Loop through the nodes until you hit the current node.
        while (pnodeCur != PnodeNext(pnodeCur))
        {
            for (iNode = 1; iNode < cNodesToSkip ; iNode++)
            {
                pnodeCur = PnodeNext(pnodeCur);  // Skip to the next node.
            }
             :
             :
     

    On the other hand, the first form (/**/) stands out more, simply because it has more white space.  You can achieve the same whitespace effect with // comments:

             :
             :
        }
        //
        // Loop through the nodes until you hit the current node.
        //
        while (pnodeCur != PnodeNext(pnodeCur))
        {
            for (iNode = 1; iNode < cNodesToSkip ; iNode++)
            {
                pnodeCur = PnodeNext(pnodeCur);  // Skip to the next node.
            }
             :
             :
    Another aspect of commenting style is the effective line length.  When I learned to program, I learned on 80 character wide terminals, which imposed a fairly strict limit on the number of columns in the code - anything longer than 80 characters was not OK.  Nowadays, that isn't as important, but it's still important to limit source code lines to a reasonable number, somewhere around 100 works for me, but of course your experience might be different.

    The other major thing about comments is the content of the comments.  Comments can be the bane of a maintainer's existence or their savior, depending on the comment.

    I don't know how many times I've come across the following and cringed:

        //
        // Increment i.
        //
        i = i + 1;
     

    Yes, the line's commented, but how useful is the comment?  Especially since it's a block comment (block comments have more importance than in-line comments by virtue of the amount of real estate they occupy - the very size of the comment gives it weight).  In general, you want to use in-line comments for little things and block comments for the big stuff.  I almost never use in-line comments; I prefer to use block comments at the appropriate locations and let the code speak for itself.

    I've mentioned white space as an aspect of style before, but this is the time to bring it up.  Consider the example routine with some white space added to stretch the code out a bit:

    #include "list.h"
    main(C cArg, SZ rgszArg[])
    {
        I iNode;
        I cNodes = atoi(rgszArg[1]);
        I cNodesToSkip = atoi(rgszArg[2]);
        PNODE pnodeT;
        PNODE pnodeCur;

        InitNodes(cNodes);

        for (iNode = 2, pnodeCur = PnodeNew(1); iNode <= cNodes ; iNode++)
        {
            pnodeT = PnodeNew(iNode);
            InsertNext(pnodeCur, pnodeT);
            pnodeCur = pnodeT;
        }

        while (pnodeCur != PnodeNext(pnodeCur))
        {
            for (iNode = 1; iNode < cNodesToSkip ; iNode++)
            {
                pnodeCur = PnodeNext(pnodeCur);
            }
            FreeNode(PnodeDeleteNext(pnodeCur));
        }

        printf("%d\n", Item(nodeCur));
    }

    It's not a major change, but to me, by just inserting 4 empty lines, the code's been made clearer. 

    The other aspect of white space is intra-expression white space.  This is one of the aspects of style that tends to get religious.  I've seen people who do:

        for ( iNode = 2 , pnodeCur = PnodeNew( 1 ) ; iNode <= cNodes ; iNode ++ )
    And others who prefer:

        for (iNode=2,pnodeCur=PnodeNew(1);iNode<=cNodes;iNode++)
    Or:

        for (iNode=2, pnodeCur=PnodeNew(1); iNode<=cNodes; iNode++)
    Or:

        for ( iNode=2, pnodeCur=PnodeNew(1) ; iNode<=cNodes ; iNode++ )
    The reality is that it doesn't really matter which form you use, as long as you're consistent.

    Let's see what happens to the sample routine if you insert some block comments...

    #include "list.h"
    main(C cArg, SZ rgszArg[])
    {
        I iNode;
        I cNodes = atoi(rgszArg[1]);
        I cNodesToSkip = atoi(rgszArg[2]);
        PNODE pnodeT;
        PNODE pnodeCur;

        InitNodes(cNodes);

        //
        // Create a list of cNodes nodes. 
        // 
        for (iNode = 2, pnodeCur = PnodeNew(1); iNode <= cNodes ; iNode++)
        {
            pnodeT = PnodeNew(iNode);
            InsertNext(pnodeCur, pnodeT);
            pnodeCur = pnodeT;
        }

        //
        // Walk the list of nodes, freeing the node that occurs at every cNodesToSkip nodes in the list.
        //
        while (pnodeCur != PnodeNext(pnodeCur))
        {
            for (iNode = 1; iNode < cNodesToSkip ; iNode++)
            {
                pnodeCur = PnodeNext(pnodeCur);
            }
            FreeNode(PnodeDeleteNext(pnodeCur));
        }

        //
        // Print out the value of the current node.
        //
        printf("%d\n", Item(nodeCur));
    }

    To me, that's a huge improvement.  By stretching out the code and adding some comments, it's already starting to look better.

    Again, just for grins, here's how it would look in my "1980's style":

    #include "list.h"
    main(C cArg, SZ rgszArg[])
    {
        I iNode;                              // Node index.
        I cNodes = atoi(rgszArg[1]);          // Set the number of nodes
        I cNodesToSkip = atoi(rgszArg[2]);    // and the nodes to skip.
        PNODE pnodeT;                         // Declare a temporary node.
        PNODE pnodeCur;                       // And the current node.

        InitNodes(cNodes);                    // Initialize the nodes database
                                              // with cNodes elements

        /*************************************************************/
        /*                                                           */
        /*                                                           */
        /*               Create a list of cNodes nodes.              */
        /*                                                           */
        /*                                                           */
        /*************************************************************/
        for (iNode = 2, pnodeCur = PnodeNew(1); iNode <= cNodes ; iNode++)
        {
            pnodeT = PnodeNew(iNode);        // Allocate a new node.
            InsertNext(pnodeCur, pnodeT);    // Insert it after the current node
            pnodeCur = pnodeT;               // Current = New.
        }

        /*************************************************************/
        /*                                                           */
        /*                                                           */
        /*   Walk the list of nodes, freeing the node that occurs    */
        /*   at every cNodesToSkip nodes in the list.                */
        /*                                                           */
        /*                                                           */
        /*************************************************************/
        while (pnodeCur != PnodeNext(pnodeCur))    // While not at the current node
        {
            for (iNode = 1; iNode < cNodesToSkip ; iNode++)
            {
                pnodeCur = PnodeNext(pnodeCur); // Current = current->Next
            }
            FreeNode(PnodeDeleteNext(pnodeCur)); // Free current->Next
        }

        /*************************************************************/
        /*                                                           */
        /*                                                           */
        /*          Print out the value of the current node.         */
        /*                                                           */
        /*                                                           */
        /*************************************************************/
        printf("%d\n", Item(nodeCur));
    }

    Yech.

    Tomorrow: function and file headers.

  • Larry Osterman's WebLog

    Structured Exception Handling Considered Harmful

    • 33 Comments

    I could have sworn that I wrote this up before, but apparently I’ve never posted it, even though it’s been one of my favorite rants for years.

    In my “What’s wrong with this code, Part 6” post, several of the commenters indicated that I should be using structured exception handling to prevent the function from crashing.  I couldn’t disagree more.  In my opinion, SEH, if used for this purpose, takes simple, reproducible and easy to diagnose failures and turns them into hard-to-debug subtle corruptions.

    By the way, I’m far from being alone on this.  Joel Spolsky has a rather famous piece “Joel on Exceptions” where he describes his take on exceptions (C++ exceptions).  Raymond has also written about exception handling (on CLR exceptions).

    Structured exception handling is in many ways far worse than C++ exceptions.  There are multiple ways that structured exception handling can truly mess up an application.  I’ve already mentioned the guard page exception issue.  But the problem goes further than that.  Consider what happens if you’re using SEH to ensure that your application doesn’t crash.  What happens when you have a double free?  If you don’t wrap the function in SEH, then it’s highly likely that your application will crash in the heap manager.  If, on the other hand, you’ve wrapped your functions with try/except, then the crash will be handled.  But the problem is that the exception caused the heap code to blow past the release of the heap critical section – the thread that raised the exception still holds the heap critical section. The next attempt to allocate memory on another thread will deadlock your application, and you have no way of knowing what caused it.

    The example above is NOT hypothetical.  I once spent several days trying to track down a hang in Exchange that was caused by exactly this problem – because a component in the store didn’t want to crash the store, they installed a high level exception handler.  That handler caught the exception in the heap code, and swallowed it.  And the next time we came in to do an allocation, we hung.  In this case, the offending thread had exited, so the heap critical section was marked as being owned by a thread that no longer existed.

    Structured exception handling also has performance implications.  Structured exceptions are considered “asynchronous” by the compiler – any instruction might cause an exception.  As a result of this, the compiler can’t perform flow analysis in code protected by SEH.  So the compiler disables many of its optimizations in routines protected by try/except (or try/finally).  This does not happen with C++ exceptions, by the way, since C++ exceptions are “synchronous” – the compiler knows if a method can throw (or rather, the compiler can know if a method will not throw).

    One other issue with SEH was discussed by Dave LeBlanc in Writing Secure Code, and reposted in this article on the web.  SEH can be used as a vector for security bugs – don’t assume that because you wrapped your function in SEH that your code will not suffer from security holes.  Googling for “structured exception handling security hole” leads to some interesting hits.

    The bottom line is that once you’ve caught an exception, you can make NO assumptions about the state of your process.  Your exception handler really should just pop up a fatal error and terminate the process, because you have no idea what’s been corrupted during the execution of the code.

    At this point, people start screaming: “But wait!  My application runs 3rd party code whose quality I don’t control.  How can I ensure 5 9’s reliability if the 3rd party code can crash?”  Well, the simple answer is to run that untrusted code out-of-proc.  That way, if the 3rd party code does crash, it doesn’t kill YOUR process.  If the 3rd party code crashes while processing a request, then the individual request fails, but at least your service didn’t go down in the process.  Remember – if you catch the exception, you can’t guarantee ANYTHING about the state of your application – it might take days for your application to crash, thus giving you a false sense of robustness, but…

     

    PS: To make things clear: I’m not completely opposed to structured exception handling.  Structured exception handling has its uses, and it CAN be used effectively.  For example, all NT system calls (as opposed to Win32 APIs) capture their arguments in a try/except handler.  This is to guarantee that the version of the arguments to the system call that is referenced in the kernel is always valid – there’s no way for an application to free the memory on another thread, for example.

    RPC also uses exceptions to differentiate between RPC initiated errors and function return calls – the exception is essentially used as a back-channel to provide additional error information that could not be provided by the remoted function.

    Historically (I don’t know if they do this currently) the NT file-systems have also used structured exception handling extensively.  Every function in the file-systems is protected by a try/finally wrapper, and errors are propagated by throwing exceptions.  This way, if any code DOES throw an exception, every routine in the call stack has an opportunity to clean up its critical sections and release allocated resources.  And IMHO, this is the ONLY way to use SEH effectively – if you want to catch exceptions, you need to ensure that every function in your call stack also uses try/finally to guarantee that cleanup occurs.
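
    A minimal sketch of that discipline (MSVC-specific __try/__finally; DoSomethingThatMightFault is a made-up name for illustration):

        // Every function on the call stack wraps its cleanup in __finally.
        void UpdateSharedState(CRITICAL_SECTION *Lock)
        {
            EnterCriticalSection(Lock);
            __try
            {
                DoSomethingThatMightFault();   // Might raise a structured exception.
            }
            __finally
            {
                // Runs on normal exit AND during exception unwind, so the
                // critical section can't be left orphaned the way the heap
                // critical section was in the Exchange hang above.
                LeaveCriticalSection(Lock);
            }
        }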

    Also, to make it COMPLETELY clear.  This post is a criticism of using C/C++ structured exception handling as a way of adding robustness to applications.  It is NOT intended as a criticism of exception handling in general.  In particular, the exception handling primitives in the CLR are quite nice, and mitigate most (if not all) of the architectural criticisms that I’ve mentioned above – exceptions in the CLR are synchronous (so code wrapped in try/catch/finally can be optimized), the CLR synchronization primitives build exception unwinding into the semantics of the exception handler (so critical sections can’t dangle, and memory can’t be leaked), etc.  I do have the same issues with using exceptions as a mechanism for error propagation as Raymond and Joel do, but that’s unrelated to the affirmative harm that SEH can cause if misused.

  • Larry Osterman's WebLog

    XP SP2 Craters

    • 32 Comments

    So the newswires and forums are buzzing about this reported security flaw in XP SP2.   Essentially they are complaining that the security center in SP2 uses WMI to store its metadata and an administrator can modify the metadata to convince the user that they’re protected when they’re not.

    In the original eWeek article, Microsoft’s response is quoted as:

    "In SP2, we added functionality to reduce the likelihood of unknown/devious applications running on a user's system, including turning Windows Firewall on by default, data execution prevention, attachment execution services to name a few. To spoof the Windows Security Center WMI would require system-level access to a PC. If the user downloads and runs an application that would allow for spoofing of Windows Security Center, they have already opened the door for the hacker to do what they want. In addition, if malware is already on the system, it does not need to monitor WSC to determine a vulnerable point of attack, it can simply shut down any firewall or AV service then attack – no WSC is necessary."

    "Windows Security Center, found in the Windows XP Control panel, provides customers the ability and makes it easier to check the status of these essential security functionalities such as firewalls, automatic updates and antivirus. Windows Security Center will inform users whether key security capabilities are turned on and up to date and will notify users if it appears that updates need to be made or if additional action steps may need to be taken to help them get more secure."

    In other words, if you’re running as an administrator, you can run an application that can mess up your computer.  Yup, but if you’re running as an admin and you’re running untrusted code then IMHO, spoofing the security center is the LEAST of your problems – the application that spoofed the security center could also have installed a rootkit on your machine, and at that point, the bad guys own your computer.

    Mike Dimmick also has an excellent rebuttal to the original eWeek article.

     

  • Larry Osterman's WebLog

    What is localization anyway?

    • 32 Comments
    I may be stomping on Michael Kaplan's toes with this one, but...

    I was reading the February 2005 issue of Dr. Dobbs Journal this morning and I ran into the article "Automating Localization" by Hew Wolff (you may have to subscribe to get access to the article).

    When I was reading the article, I was struck by the following comment:

     I didn't think we could, because the localization process is pretty straightforward. By "localization", I mean the same thing as "globalization" (oddly) or "internationalization." You go through the files looking for English text strings, and pull them into a big "language table," assigning each one a unique key.

    The first thing I thought was what an utterly wrong statement.  The author of the article is conflating five different concepts and calling them the same thing.  The five concepts are: localizability, translation, localization, internationalization, and globalization.

    What Hew's talking about is "localizability" - the process of making the product localizable.

    Given that caveat, he's totally right in his definition of localizability - localizability is the process of extracting all the language-dependent strings in your binary and putting them in a separate location that can be later modified by a translator.

    But he totally missed the boat on the rest of the concepts.

    The first three (localizability, translation, and localization) are about resources:

    • Localizability is about enabling translation and localization.  It's about ensuring that a translator has the ability to modify your application to work in a new country without recompiling your binary.
    • Translation is about converting the words in his big "language table" from one language to another.  Researchers love this one because they think that they can automate this process (see Google's language tools as an example of this).
    • Localization is the next step past translation.  As Yoshihiko Sakurai mentioned to Michael in a related discussion this morning, "[localization] is a step past translation, taking the certain communication code associated with a certain culture.  There are so many aspects you have to think about such as their moral values, working styles, social structures, etc... in order to get desired (or non-desired) outputs."  This is one of the big reasons that automated translation tools leave so much to be desired - humans know about the cultural issues involved in a language, computers don't.

    Internationalization is about code.  It's about ensuring that the code in your application can handle strings in a language sensitive manner.  Michael's blog is FULL of examples of internationalization.  Michael's article about Tamil numbers, or deeptanshuv's article about the four versions of "I" in Turkish are great examples of this.  Another example is respecting the date and time format of the user - even though users in the US and the UK both speak English (I know that the Brits reading this take issue with the concept of Americans speaking English, but bear with me here), they use different date formats.  Today is 26/01/2005 in Great Britain, but it's 01/26/2005 here in the US.  If your application displays dates, it should automatically adjust them.
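
    To make the date example concrete, the Win32 way to "automatically adjust" is to ask the system for the user's format instead of hard-coding one.  A minimal sketch (assumes the usual <windows.h> declarations):

        // Format today's date using the current user's locale settings
        // instead of hard-coding MM/DD/YYYY.
        SYSTEMTIME st;
        wchar_t dateString[80];

        GetLocalTime(&st);
        // Produces "1/26/2005" for a US user and "26/01/2005" for a UK user.
        if (GetDateFormatW(LOCALE_USER_DEFAULT, DATE_SHORTDATE, &st, NULL,
                           dateString, 80) != 0)
        {
            // dateString now holds the date in the user's preferred format.
        }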

    Globalization is about politics.  It's about ensuring that your application doesn't step on the policies of a country - So you don't ever highlight national borders in your graphics, because you might upset your customers living on one side or another of a disputed border. I do want to be clear that this isn't the traditional use of globalization, maybe a better word would be geopoliticization, but that's too many letters to type, even for me, and since globalization was almost always used as a synonym for internationalization, I figured it wouldn't mind being coopted in this manner :)

    Having said that, his article is an interesting discussion about localization and the process of localization.  I think that the process he went through was a fascinating one, with some merit.  But that one phrase REALLY stuck in my craw.

    Edit: Fixed incorrect reference to UK dates - I should have checked first :)  Oh, and it's 2005, not 2004.

    Edit2: Added Sakurai-san's name.

    Edit3: Added comment about the term "globalization"

  • Larry Osterman's WebLog

    What's wrong with this code, part 14

    • 32 Comments
    Keeping up with the "theme" of COM related bad code examples, here's another real-world example.  To avoid any ATL confusion, it's 100% pure C++.

    Our test team rolled out a new set of tests last week and immediately hit this one.  The funny thing is that the code in question had worked  without flaw for two years before this.

    Obviously this has been abstracted to the point of ridiculousness, there's just enough here to demonstrate the bug...

    struct FooConfig
    {
        int _Value1;
        int _Value2;
    };

    [
        object,
        uuid("0A0DDEDC-C422-4BB3-9869-4FED020B66C5"),
        pointer_default(unique)
    ]
    __interface IFoo : IUnknown
    {
        HRESULT GetFooConfig([out] struct FooConfig **ReturnedFooConfig);
    };

    FooConfig _GlobalFooConfig = { 1, 2};

    class CFoo: public IFoo
    {
        LONG _refCount;
    public:
        CFoo() : _refCount(1) {};

        // IFoo
        HRESULT GetFooConfig(FooConfig **ReturnedFooConfig)
        {
            *ReturnedFooConfig = &_GlobalFooConfig;
            return S_OK;

        }

        // IUnknown
        virtual HRESULT STDMETHODCALLTYPE QueryInterface(const IID& iid, void** ppUnk)
        {
            HRESULT hr=S_OK;
            *ppUnk = NULL;
            if (iid == __uuidof(IFoo))
            {
                AddRef();
                *ppUnk = reinterpret_cast<void *>(static_cast<IFoo *>(this));
            }
            else if (iid == IID_IUnknown)
            {
                AddRef();
                *ppUnk = reinterpret_cast<void *>(static_cast<IUnknown *>(this));
            }
            else
            {
                hr = E_NOINTERFACE;
            }
            return hr;
        }
        virtual ULONG STDMETHODCALLTYPE AddRef(void)
        {
            return InterlockedIncrement(&_refCount);
        }
        virtual ULONG STDMETHODCALLTYPE Release(void)
        {
            LONG refCount;
            refCount = InterlockedDecrement(&_refCount);
            if (refCount == 0)
            {
                delete this;
            }
            return refCount;
        }
    };

    This one's pretty straightforward, and I expect that people will see the problem right away.  So I'm going to raise the bar a bit - to get full credit, you have to explain not only what the problem is, but also why we'd never seen a problem with this code before.

    And, as always, kudos and mea culpas on Monday.

    Edit: Fixed return value from GetFooConfig.

  • Larry Osterman's WebLog

    Life in a faraday cage

    • 32 Comments

    There was an internal discussion about an unrelated topic recently, and it reminded me of an early experience in my career at Microsoft.

    When I started, my 2nd computer was a pre-production PC/AT (the first was an XT). The AT had been announced by IBM about a week before I started, so our pre-production units were allowed to be given to other MS employees (since I had to write the disk drivers for that machine, it made sense for me to own one of them).

    Before I got the machine, however, it was kept in a room that we semi-affectionately called "the fishtank" (it was the room where we kept the Salmons (the code name for the PC/AT)).

    IBM insisted that we keep all the pre-production computers we received from them in this room - why?

    Two reasons.  The first was that there was a separate lock on the door that would limit access to the room.

    The other reason was that IBM had insisted that we build a faraday cage around the room.  They were concerned that some ne'er-do-well would use the RF emissions from the computer (and monitor) to read the contents of the screen and RAM.  I was told that they had technology that would allow them to read the contents of an individual screen from across the street, and they were worried about others being able to do the same thing.

    Someone at work passed along this link to a research paper by Wim van Eck that discusses the technical details behind the technology.

     

  • Larry Osterman's WebLog

    Little Lost APIs

    • 32 Comments
    When you have an API set as large as the Win32 API set, sometimes APIs get "lost".  Either by forgetfulness, or by the evolution of the hardware platform.

    We've got one such set of APIs here in multimedia-land: the "aux" APIs.

    The "aux" APIs (auxGetNumDevs, auxGetDevCaps, auxGetVolume, auxSetVolume, and auxOutMessage) are intended to control the volume of the "aux" port on your audio adapter.

    It's a measure of how little used these are that when I asked around my group what the aux APIs did, the general consensus was "I don't know" (this isn't exactly true, but it's close).  We certainly don't know of any applications that actually use these APIs.

    And that's not really surprising since the AUX APIs are used to control the volume of either the AUX input jack on your sound card or the output volume from a CDROM drive (if connected via the analog cable).

    What's that you say? Your sound card doesn't have an "AUX" jack?  That's not surprising, I'm not sure that ANY sound card has been manufactured in the past 10 years with an AUX input jack (they typically have a "LINE-IN" jack and a "MIC" jack).  And for at least the past 5 years, hardware manufacturers haven't been connecting the analog CD cable to the sound card (it enables them to save on manufacturing costs).

    Since almost every PC system shipped in the past many years (at least 5) has used digital audio extraction to retrieve the CD audio, the analog cable's simply not needed on most systems (there are some exceptions such as laptop machines, which use the analog connector to save battery life when playing back CD audio).  And even if a sound card were to add an AUX input, the "mixer" APIs provide a more flexible mechanism for managing that input anyway.

    So with the "aux" APIs, you have a set of APIs that were designed to support a series of technologies that are at this point essentially obsolete.  And even if your hardware used them, there's an alternate, more reliable set of APIs that provide the same functionality - the mixer APIs.  In fact, if you launch sndvol32.exe (the volume control applet), you can see a bunch of sliders to the right of the volume control - they're labeled things like "wave", "sw synth", "Line in", etc.  If your audio card has an "AUX" line, then you'll see an "Aux" volume control - that's the same control that the auxSetVolume and auxGetVolume API controls.  Similarly, there's likely to be a "CD Player" volume control - that's the volume for the CD-ROM control (and it works for both digital and analog CD audio).  So all the "aux" API functionality is available from the "mixer" APIs, but the mixer version works in more situations.

    But even so, the "aux" APIs still exist in the system in the event that someone might still be calling them - even if there's no hardware on the system which would be controlled by these APIs.

    These are one of the few examples of APIs that we might actually be able to end-of-life - they'll never be removed from the system, but a time might come in the future when they simply stop working (auxGetNumDevs will return 0 in that case, indicating that there are no AUX devices on the system).
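
    If you're curious whether your own system still reports any aux devices, here's a quick sketch of what a check might look like (my own illustration, not production code - link with winmm.lib):

    #include <stdio.h>
    #include <windows.h>
    #include <mmsystem.h>

    int main(void)
    {
        UINT numDevs = auxGetNumDevs();
        printf("aux devices: %u\n", numDevs);
        for (UINT i = 0; i < numDevs; i++)
        {
            AUXCAPSA caps;
            if (auxGetDevCapsA(i, &caps, sizeof(caps)) == MMSYSERR_NOERROR)
            {
                // wTechnology is AUXCAPS_CDAUDIO or AUXCAPS_AUXIN.
                printf("device %u: %s (technology %u)\n", i, caps.szPname, caps.wTechnology);
            }
            DWORD volume;
            if (auxGetVolume(i, &volume) == MMSYSERR_NOERROR)
            {
                // The volume is packed into a DWORD: low word = left channel, high word = right.
                printf("  volume: left=0x%04x right=0x%04x\n", LOWORD(volume), HIWORD(volume));
            }
        }
        return 0;
    }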

    Edit: Clarified mixer and aux API relationship a bit to explain how older systems would continue to work.

  • Larry Osterman's WebLog

    To default, or not to default, that is the question...

    • 32 Comments
    One of the "interesting" features of C++ is the ability to default the value of parameters to a method.

    It's one of those features that I'm sure Bjarne Stroustrup thought was just flat-out neat (there are a number of features in C++ that fall into the broad category of "convenience features", and I'm pretty sure that defaulted parameters is one of them).

    One of the developers (Nick) in my group was working on a fix for a bug, the fix required that he add a new parameter to a relatively common routine.

    It turned out that for all of the calls to that routine, except for one, the new parameter would have the same value.

    Nick's suggestion was to add the new parameter as a defaulted parameter because that way he wouldn't have to change that much code.

    Eventually I vetoed the idea, because of the following problem.

     

    Let's say you have a function foo:

    HRESULT Foo(DWORD param1, BOOL param2=FALSE)
    {
        :
        :
    }

    Everything's great - your code's running, and it's all great.

    What happens six months later when you need to change the signature for Foo to:

    HRESULT Foo(DWORD param1, void *newParam, BOOL param2=FALSE)
    {
        :
        :
    }

    Well, in this case, you're in luck, you can simply compile the code and the compiler will happily tell you every function that needs to change.

    On the other hand, what if the change was:

    HRESULT Foo(DWORD param1, DWORD newParam, BOOL param2=FALSE)
    {
        :
        :
    }

    Now that's a horse of a different color.  The problem is that the types for BOOL and DWORD are compatible.  It means that any code that specified a value for param2, like:

        hr = Foo(32, TRUE);

    is still going to compile without error.  The compiler will simply interpret it as:

        hr = Foo(param1=32, newParam=1, param2=FALSE);

    Now the language lawyers are going to shout up and down that this is a design problem in Windows - the BOOL and DWORD types shouldn't both have been defined as plain integer types, and param2 should instead have been defined as "bool".

    The problem is that you STILL have problems.  If param2 was defined as 'bool', what happens if you need to add a non-default parameter that's of type 'bool'?  You're back where you were before.

    Or you could have:

    HRESULT Foo(DWORD param1, int newParam, short param2=3)
    {
        :
        :
    }

    In this case, the automatic promotion rules will quite happily promote a short to an int without a warning.

    There have been dozens of times when I've discovered bugs that were introduced by essentially this pattern - someone added a parameter to a function that has defaulted parameters, there was an automatic conversion between the defaulted parameter and the newly added parameter, and what was a simple change all of a sudden became a bug.
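
    To make the failure mode concrete, here's a contrived sketch (hypothetical names, with standalone typedefs so it compiles outside of Windows) of the silent rebinding:

    #include <stdio.h>

    typedef int BOOL;
    typedef unsigned long DWORD;
    typedef long HRESULT;
    #define TRUE 1
    #define FALSE 0

    // The original signature was:
    //     HRESULT Foo(DWORD param1, BOOL param2 = FALSE);
    // and a caller wrote Foo(32, TRUE) to override the default.

    // After the signature change, that same call still compiles - TRUE
    // silently binds to newParam, and param2 silently reverts to FALSE.
    HRESULT Foo(DWORD param1, DWORD newParam, BOOL param2 = FALSE)
    {
        printf("param1=%lu newParam=%lu param2=%d\n", param1, newParam, param2);
        return 0;
    }

    int main()
    {
        Foo(32, TRUE);  // prints param1=32 newParam=1 param2=0 - no warning, no error
        return 0;
    }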

  • Larry Osterman's WebLog

    Riffing on Raymond - the purpose of an operating system

    • 32 Comments
     

    In Raymond's post today (Adding flags to APIs to work around driver bugs doesn't scale), Raymond wrote:

    Perhaps it's just me, but I don't believe that workarounds for driver issues should become contractual. I would think that one of the goals of an operating system would be to smooth out these bumps and present a uniform programming model to applications. Applications have enough trouble dealing with their own bugs; you don't want them to have to deal with driver bugs, too.

    My personal take on this is that the entire PURPOSE of an operating system is to smooth out the bumps and present a uniform programming model to applications.  Applications don't need to know about the weird behavior of the u765 floppy disk controller that only shows up when there are more than two floppy drives (I got a free 72 hour trip to France because of that particular issue).  They shouldn't need to know that certain models of network adapter will sometimes hand the OS a packet that is half filled with garbage. Applications shouldn't have to know the low-level commands needed to drive a printer.

    IMHO, the ONLY reason for an operating system to exist is to hide the details of the hardware from the application.  Everything else is sugar.

    People forget what the world was like when they would buy a word processor and it would come with a list of the two dozen or so printers that were supported by the application.  If you wanted to use the word processor but didn't own one of the listed printers, you were out-of-luck.

    For those of you that are now going to say "Aha!  Larry now thinks an OS shouldn't contain a web browser", that's not at all the case.  An HTML renderer is a very high level special case of hiding the details of the hardware from the application.  In the case of an HTML renderer, its job is to provide a uniform interface that can be used to render HTML content to the screen, just like a printer driver provides a uniform interface that can be used to render content to a printer.  Similarly an OS should contain a TCP/IP stack and network programming layer that allows application authors to avoid understanding the icky details of interoperability (it's HARD to write a high quality TCP/IP stack, and application authors shouldn't have to).

     

    But I still stand by my original assertion: The only reason for having an operating system is to hide the details of the hardware from the applications that run on the computer.

     

    Btw: This is MY opinion.  I want to repeat that.  MY OPINION.  Others undoubtedly feel otherwise.

     

  • Larry Osterman's WebLog

    Software Contracts, Part 3 - Sometimes implicit contracts are subtle

    • 32 Comments

    I was planning on discussing this later on in the series, but "Dave" asked a question that really should be answered in a complete post (I did say I was doing this ad hoc; it shows).

     

    Let's go back to the PlaySound API, and let's ask two different questions that can be answered by looking at the API's software contract (the first one is Dave's question):

    I am happy to fulfill my contractual obligations but I need to know what they are. If you don't tell them, how is the caller to know that you need their memory until the sound finishes playing?

    If I call PlaySound with the SND_ASYNC flag set, how can I know if the sound's been played?

    As I implied, both of these questions can be answered by carefully reading the API's contract (and by doing a bit of thinking about the implications of the contract).

    Let's take question 2 first.

    The explicit contract for the PlaySound API states that it returns TRUE if successful and FALSE otherwise.  If you specify the SND_ASYNC flag, what does that TRUE/FALSE return mean though?  Well, that's not a part of the explicit contract, so it must be a part of the implicit contract.

    Remember that the PlaySound API only has three parameters (the sound name, a module handle and a set of flags).  All of these parameters are INPUT parameters - there's no way to return the final status in the async case.  Since there's no way for the API to return whether or not the sound successfully played, the only way the return value could indicate the success or failure of playing the sound would be if the call blocked until the sound finished - which would mean that the SND_ASYNC flag didn't actually do anything.  And that violates the principle of least surprise - if the SND_ASYNC flag were a NOP, it would be a surprise.

    And in fact all the call to PlaySound does is to queue the request to a worker thread and return - the success/failure code refers to whether or not the request was successfully queued to the worker thread, not to whether or not the sound actually played.

     

    Now for Dave's question...

    First off: One critical part of interpreting software contracts is this: if you have a question about whether or not a function behaves in a specific manner, and that behavior isn't specified in the explicit contract, assume the answer is 'no'.

    Since the contract for PlaySound is currently silent about the use of memory in combination with the SND_ASYNC flag, you should always make the most conservative assumptions about the behavior of PlaySound.  Since the API documentation doesn't say explicitly that the memory can be freed while the sound is playing, you should assume that it shouldn't.  And that means that the memory handed to the PlaySound call must remain valid until the call to PlaySound has completed playing the sound.

     

    But even without that, with a bit of digging, you can come to the same answer.

    Here's how my logic works. Both of the givens below are either explicit or implicit in the contract.

    1. You own the memory handed to PlaySound - you are responsible for allocating and freeing it.  You know this because PlaySound says nothing about what it does with the memory, and thus takes no ownership of it (this is an implicit part of the contract).
    2. The default behavior for PlaySound is synchronous - you know this because the documentation states that the SND_SYNC flag is the default behavior (this is an explicit part of the contract).

     

    You can also assume that the SND_ASYNC flag is implemented by dispatching some parts of the PlaySound call to a background thread.  This is pretty obvious given the fact that something has to execute the code to open the file, load it into memory, and play it.  You can verify this trivially by using your favorite debugger and looking at the threads after calling PlaySound with the SND_ASYNC flag.  In addition, there are no asynchronous playback calls in Windows, so again, it's highly unlikely the playback is done using some kind of interrupt time processing (it's possible, but highly unlikely - remember that PlaySound was written for Windows 3.1).  I actually went back to the Windows 3.1 source code for PlaySound and checked how it did its work (there were no threads in Windows 3.1) - on Windows 3.1, if you specified the SND_ASYNC flag, it created a hidden window and played the sound from that window's wndproc.

    But even given this, we're not done.  After all, it's possible that the PlaySound code makes a private copy of the memory passed into PlaySound before returning from the original call.  So the decision about whether or not the memory passed into the PlaySound API can be freed when specifying SND_ASYNC really boils down to this: If PlaySound makes a private copy of the memory, then the memory can be freed immediately on return, if it doesn't, you can't.

    This is where you need to step back and make some assumptions.  Up until now, pretty much everything that's been discussed has been a direct consequence of how the API must work - SND_ASYNC MUST be implemented on a background thread, you DO own the memory for the API, etc.

    So let's consider the kind of data that appears in the memory that's handed to the PlaySound API.

    Remember that most WAV files shipped with Windows (before Vista) were authored as 22kHz, 16 bit sample, mono files (for Vista, the samples are all stereo).  That means that each second of audio takes up 44K of RAM, so all non-trivial WAV files are likely to be more than 64K in size (this is important).  Again, consider that the PlaySound API was written for Windows 3.1, where memory was at a premium, especially huge blocks of memory (any block larger than 64K of RAM had to be kept in "huge" memory, allowing the blocks to be contiguous).

    If Windows were to take a copy of the memory, it would require allocating another block the size of the original block.  And on a resource constrained OS like Windows 3.1 (or Windows 95) that would be a big deal.

    Also remember my 2nd point above - the default behavior for PlaySound is synchronous.  That means that the PlaySound code was written with synchronous playback as the primary scenario.

    Given the fact that PlaySound was originally written for Windows 3.1, given that the default for PlaySound is synchronous, and given the size of the WAV files involved, it thus makes sense that the PlaySound API would not allocate a new copy of the memory for the .WAV file and instead would use the samples that were already in memory - why take the time to allocate a new block and copy its contents over when the data was already available?

    Now this is a big assumption to make - it might not even be right.  But it's likely to be a reasonable assumption.

    So you should assume that PlaySound doesn't take a copy of the memory being rendered, and thus you need to ensure that the memory is valid across the life of the call.
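
    In code, the conservative interpretation looks something like this (a sketch of the pattern, not an official sample - link with winmm.lib; the point is simply that the buffer must outlive the asynchronous playback):

    #include <windows.h>
    #include <mmsystem.h>

    // waveImage is a caller-owned, in-memory .WAV file image (SND_MEMORY).
    void PlayBufferAsync(const BYTE *waveImage)
    {
        // Playback continues on a worker thread after this call returns,
        // so freeing waveImage right here would hand PlaySound freed memory.
        PlaySound((LPCTSTR)waveImage, NULL, SND_MEMORY | SND_ASYNC);

        // ... keep waveImage alive while the sound might still be playing ...

        // One safe pattern: cancel any pending playback before freeing the
        // buffer (passing NULL stops the currently playing waveform sound).
        PlaySound(NULL, NULL, 0);
        // Now it's safe to free waveImage.
    }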

     

    Btw, I was just told by the doc writers that they're planning on making this part of the contract explicit at some point in the future.

     

    Tomorrow: Let's look at some explicit contracts.

  • Larry Osterman's WebLog

    How do I change the master volume in Windows Vista

    • 32 Comments

    It's actually easier in Vista than it was in XP.  For Vista, we recognized that one of the key customer scenarios was going to be setting the master volume, and since we'd removed the old mechanism that was used to set the volume, we knew we had to provide an easier mechanism for Vista.

    Just for grins,  I threw together a tiny app that demonstrates it.  To save space, all error checking was removed.

    #include <stdio.h>
    #include <tchar.h>
    #include <windows.h>
    #include <mmdeviceapi.h>
    #include <endpointvolume.h>

    void Usage()
    {
      printf("Usage: \n");
      printf(" SetVolume [Reports the current volume]\n");
      printf(" SetVolume -d <new volume in decibels> [Sets the current default render device volume to the new volume]\n");
      printf(" SetVolume -f <new volume as an amplitude scalar> [Sets the current default render device volume to the new volume]\n");

    }
    int _tmain(int argc, _TCHAR* argv[])
    {
      HRESULT hr;
      bool decibels = false;
      bool scalar = false;
      double newVolume;
      if (argc != 3 && argc != 1)
      {
        Usage();
        return -1;
      }
      if (argc == 3)
      {
        if (argv[1][0] == '-')
        {
          if (argv[1][1] == 'f')
          {
            scalar = true;
          }
          else if (argv[1][1] == 'd')
          {
            decibels = true;
          }
        }
        else
        {
          Usage();
          return -1;
        }

        newVolume = _tstof(argv[2]);
      }

      // -------------------------
      CoInitialize(NULL);
      IMMDeviceEnumerator *deviceEnumerator = NULL;
      hr = CoCreateInstance(__uuidof(MMDeviceEnumerator), NULL, CLSCTX_INPROC_SERVER, __uuidof(IMMDeviceEnumerator), (LPVOID *)&deviceEnumerator);
      IMMDevice *defaultDevice = NULL;

      hr = deviceEnumerator->GetDefaultAudioEndpoint(eRender, eConsole, &defaultDevice);
      deviceEnumerator->Release();
      deviceEnumerator = NULL;

      IAudioEndpointVolume *endpointVolume = NULL;
      hr = defaultDevice->Activate(__uuidof(IAudioEndpointVolume), CLSCTX_INPROC_SERVER, NULL, (LPVOID *)&endpointVolume);
      defaultDevice->Release();
      defaultDevice = NULL; 

      // -------------------------
      float currentVolume = 0;
      endpointVolume->GetMasterVolumeLevel(&currentVolume);
      printf("Current volume in dB is: %f\n", currentVolume);

      hr = endpointVolume->GetMasterVolumeLevelScalar(&currentVolume);
      printf("Current volume as a scalar is: %f\n", currentVolume);
      if (decibels)
      {
        hr = endpointVolume->SetMasterVolumeLevel((float)newVolume, NULL);
      }
      else if (scalar)
      {
        hr = endpointVolume->SetMasterVolumeLevelScalar((float)newVolume, NULL);
      }
      endpointVolume->Release();

      CoUninitialize();
      return 0;
    }

    This program has essentially 3 parts.  The first parses the command line, the second retrieves an endpoint volume interface on the default endpoint, the third retrieves the current volume and sets the volume.

    I'm going to ignore the first part, it's the same junk you'll see in any CS 101 class. 

    The second part instantiates an MMDeviceEnumerator object which implements the IMMDeviceEnumerator interface.  The IMMDeviceEnumerator interface is the gateway object to the new audio subsystem - it can be used to enumerate audio endpoints and retrieve information about the various endpoints.  In this case, I'm only interested in the GetDefaultAudioEndpoint method, it returns an IMMDevice object that points to the default endpoint.

    Again, there are a bunch of things I can do with an IMMDevice object, but I'm only really interested in the "Activate" method.  The idea is that each MMDevice object supports lots of different interfaces, you "Activate" the interface to access the functionality associated with that object.  Again, in this case, I'm only interested in the IAudioEndpointVolume interface - there are other interfaces, like IDeviceTopology, and IAudioClient that can be activated from the endpoint.

    The IAudioEndpointVolume interface is where the good stuff lives, right now I'm only interested in four methods, which retrieve (and set) the current endpoint volume in either decibels or as a scalar value. 

    The decibels version of the IAudioEndpointVolume interface instructs the driver to set the desired master volume (input or output) to the decibel value specified; it's intended to be used for applications that want to have exact control over the output dB value of the audio solution.

    The scalar version is a bit more complicated.  It's intended for use in applications that have volume sliders, and provides a linear volume taper (represented as a floating point value between 0.0 and 1.0).  In other words, the perceived volume when you set the scalar version of the API to .5 is twice as loud as when set to .25 and is half as loud as when set to 1.0.
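
    If you want to try the sample, it should build from a Visual Studio command prompt with something along the lines of "cl /EHsc SetVolume.cpp ole32.lib" (assuming you have an SDK with the Vista audio headers on your include path, and SetVolume.cpp is whatever you named the file).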

  • Larry Osterman's WebLog

    No sound on a Toshiba M7 after a Vista install (aka: things that make you go "Huh?")

    • 31 Comments

    We recently had a bug reported to us internally.  The user of a Toshiba M7 had installed Vista on his machine (which was previously running XP) and discovered that he no longer got any sound from his machine after the upgrade.

    We tried everything we could to figure out his problem - the audio system was sending samples to the sound card, the sound card was updating its internal position register, everything looked great.

    Usually, at this point, we start asking the impolitic questions, like:

    "Sometimes some dirt collects between the plug and the internal connectors on the sound card - could you please unplug the speakers and plug them back in?" (this is the polite way of asking "Did you remember to plug your speakers in?").

    "Sometimes a set of speakers only turn on the speaker when they detect a signal being sent to them, could you try wiggling the volume knob to see if it fixes the problem?" (I actually have one of these in my office, it's excruciatingly annoying).

    "Is it possible there's an external volume control on your speakers?  What's it set to?" (this is the polite question that catches the people who accidentally hit the mute button on their speakers or turned the volume down - we get a surprising number of these).

    Unfortunately, in this case none of these worked.  So we had to dig deeper.  For some reason (I'm not sure why), someone asked the user to boot back to XP and see if he could get sound working on XP.  He booted back to XP and it worked.  He then booted back to Vista, and...

    The sounds worked!

    He mentioned to us that when he'd booted back to XP, the sound driver reported that the volume control was muted, so he un-muted it before booting to Vista.  Just for grins, we asked him to mute the volume control on XP and boot into Vista and yup, the problem had reappeared.  Somehow muting the sound card on XP caused it to be muted in Vista.

    We got on the horn with the manufacturer of the system and the manufacturer of the sound card and they informed us that for various and sundry reasons, the XP audio driver twiddled some hardware registers that were hidden from the OS to cause the sound card to mute.  The Vista driver for the sound card didn't know about those special hardware registers, so it didn't know that the sound card was muted, so Vista didn't know it was muted.

    Needless to say, this is quite annoying - the design of the XP driver for this machine made it really easy for the customer to have a horrible experience when running Vista, which is never good.  It's critical that the OS know what's going on in the hardware (in other words, back doors are bad).  When a customer has this experience, they don't blame their system vendor or their audio driver, they blame Vista.

     

    The good news is that there’s a relatively easy workaround for people with an M7 – make sure that your machine is un-muted before you upgrade.  The bad news is that this is a relatively popular computer (at least at Microsoft), and sufficient numbers of people have discovered the problem that it’s made one of our internal FAQs.

  • Larry Osterman's WebLog

    It's my computer, dagnabbit, not yours!

    • 31 Comments

    I've been wanting to write this one for a while, but continually got sidetracked, but there's no time like the present...

    Many others (I'm too lazy to chase down references) have commented on the phenomenon known as "bloatware" (also known as "craplets" or "shovelware"). 

    I'm not going to talk about them, too much has been written about them by others already.

    Instead I want to talk about applets in general.  These are the "little" helper processes that software seems to leave lying around after installation.  These are a particular pet peeve of mine, I'm well known inside MS (or at least within the Windows division) as being rather fanatical about them, and fighting tooth and nail (sometimes successfully) to get them removed.  I don't know how many times I've asked: "Why does your product (or feature) have all this crap running (where 'crap' is defined as 'stuff I don't want running on my machine')?"

    Applets come in lots of sizes and shapes - they can be services waiting on an app to use them; they can be processes that handle systray icons; they can be helper applications.  But they all share one thing in common: they all consume resources, sometimes LOTs of resources.  And I would rather that these applets NOT consume resources.

    Nowadays, machines come with a fair amount of resources - my current dev machine is a dual-core 2.4GHz Core 2 Duo 6600 with 2G of RAM and a reasonable amount of disk space (750G on 3 drives), but Vista runs on machines that are far less capable (before it died, my laptop was a P2 with 512M of RAM and it ran Vista Ultimate just fine (no glass, but other than that it worked well)).  On such a machine, every single unnecessary process can be painful.

    The Windows team has known that this has been an issue for years, and has built a ton of features into the operating system to help alleviate the pain and suffering associated with applets (some of which have been there since NT 3.1), but the reality is that nobody takes advantage of this functionality, and that's a real shame.

    In a potentially futile attempt at trying to inspire people to improve our customers' experiences, I'm going to dedicate this week to writing posts about applets and how developers can fix them.

    Btw: I want to be totally clear here: Microsoft is just as guilty as others in this arena.

    Tomorrow: Why do people write applets?

  • Larry Osterman's WebLog

    When you do UX work, sometimes you have to worry about the strangest things…

    • 31 Comments

    I recently got a bug reported to me about the visuals in the sound control panel applet not being aligned properly (this is from the UI for a new Windows 7 feature):

    [image: screenshot of the sound control panel applet]

    The problem as reported was that the microphone was aligned incorrectly w.r.t. the down arrow – the microphone was too far to the right.

     

    But if you look carefully, you’ll see that that isn’t the case – drawing a box around the controls makes it clearer:

    [image: the same screenshot with boxes drawn around the controls]

    Nitpickers Corner: For those of you that love to count pixels, it’s entirely possible that the arrow might be off by a couple of pixels but fixing it wouldn’t fix the problem, because then the arrow would be off-center with respect to the speakers.  The real problem is that the microphone icon is visually weighted to the right – the actual icon resource was lined up with the arrow, but because the visual weight was to the right, it displayed poorly.

     

    It turns out that there’s really no good way of fixing this – if we were to adjust the location of the icons, it wouldn’t help, because a different device would have a different visual center (as the speaker icon does)…

     

    Instead, we looked at the visuals and realized that there was an alternative solution: Adjust the layout for the dialog and the problem more-or-less goes away:

    [image: screenshot of the adjusted dialog layout]

    The problem still exists at some level because the arrow is centered with the icons but some icons (like the stalk microphone above) are bottom heavy.  But for whatever reason, the visuals aren’t as disconcerting when laid out horizontally.

    As I said in the title – sometimes you need to worry about the strangest things.

  • Larry Osterman's WebLog

    Thinking about Windows Build numbers

    • 31 Comments

    There’s been an ongoing thread internally speculating about the Windows build number that will be chosen for Windows 7 when it finally ships.  What’s interesting is that there’s even room for speculation about the build number being chosen.

    The Windows version is actually composed of a bunch of different fields, all packed into an OSVERSIONINFO structure.  The relevant parts of the OSVERSIONINFO are:

    • Major Version (dwMajorVersion)
    • Minor Version (dwMinorVersion)
    • Build # (dwBuildNumber)

    The major and minor version numbers are primarily marketing numbers – they’re broad brush fields that the marketing department decides are appropriate for the OS.  For Windows 7, the major and minor versions have been fixed at 6.1 for many months now, but the build numbers change more-or-less daily.
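
    If you want to look at these fields yourself, a minimal sketch using GetVersionEx (the standard way to read an OSVERSIONINFO at the time) might look like this:

    #include <stdio.h>
    #include <windows.h>

    int main(void)
    {
        OSVERSIONINFO versionInfo;
        ZeroMemory(&versionInfo, sizeof(versionInfo));
        versionInfo.dwOSVersionInfoSize = sizeof(versionInfo);

        // Fills in the major version, minor version and build number fields.
        if (GetVersionEx(&versionInfo))
        {
            printf("Windows %lu.%lu, build %lu\n",
                   versionInfo.dwMajorVersion,
                   versionInfo.dwMinorVersion,
                   versionInfo.dwBuildNumber);
        }
        return 0;
    }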

     

    Back to my story… Back in the dark ages when Windows NT was first developed, the rules for build numbers were relatively simple.  Today’s build number is yesterday’s build number + 1.  That’s why Windows NT 3.1 was build number 511, NT 3.5 was build 807, NT 3.51 was build 1057, and NT 4.0 was build 1381.

    But after NT 4.0, things changed.

    When Brian Valentine moved from the Exchange team to the Windows team, he brought with him a tradition that the Exchange team used – the Exchange build numbers were rounded up to round numbers for major milestones in the product.  So Exchange 4.0’s RTM version was 4.0.837, but Exchange 5.0 started at build 1000 (maybe it was 900, I honestly don’t remember).  For NT, Brian and his team adopted this scheme but used it to ensure that the OS build number was a round number – so Windows 2000 (the first version of Windows that shipped with Brian as the lead) had a (relatively) round version number of 5.0.2195.

    That tradition was continued with Windows XP (5.1.2600) and Vista (6.0.6000).  In the Vista case, it appears that there was some massaging of the numbers to make the build number work out so evenly – this list of build numbers shows that the build numbers jumped from 5825 to 5840 to 5920 to 6000 during the final push – the last few build numbers were incremented by 80 each build with sub-build numbers (QFE number) incrementing by 1 between the builds.

    For Windows 7, we’ve also seen a number of jumps in build numbers.  The PDC build was build 6801, the Beta build was 7000 and the RC build was 7100.  It’ll be interesting to see what the final build number will be (whenever that happens).  I honestly have no idea what the number’s going to be.

  • Larry Osterman's WebLog

    Paddington Thoughts

    • 31 Comments

    The other day, I mentioned I got pulled into a high priority project shortly after I returned from my vacation.  I was asked to help out with the EU protocol documentation effort, working on a couple of the documents (based on some stuff I did back when I worked in the Lan Manager group).

    I'm pretty darned proud of the work that the people on the documentation team have done - the specifications that were produced are really quite impressive.

     

    But what impressed me the most was the amount of work that the people on this project have put into it - I just got roped in to help with the final push, but some of those people have been working 16 hours a day, 7 days a week for months and months on this stuff.

    The level of dedication to this project that I saw was as much or more than I've seen on ANY other Microsoft project.  These guys figuratively worked their fingers to the bone typing up documents - countless late hours spread across the entire team.

    In particular, the work of Henry Sanders, Iain McDonald, and several others was absolutely heroic.  Henry personally reviewed and provided technical feedback on every single specification that was submitted to the EU.  The specification I wrote could never have been completed without Henry's insightful comments, the invaluable help of David Kruse (he's the guy who's currently responsible for the networking code I wrote back when I was on the NT networking group), and of course Veronica Skeels, who made all the words I wrote look professional.

    I'm not going to say one word about the politics of the EU documentation work, I just want to recognize the truly remarkable work and the heroic effort that was done on the project.

     

  • Larry Osterman's WebLog

    WFP is my new best friend

    • 31 Comments
    I've mentioned our computer setup a couple of times before - Valorie's got her laptop, Daniel, Sharron and I each have our own desktop computers, and there are a couple of other machines floating around the house.  Since the kids' machines don't have internet access, we've got a dedicated machine sitting in our kitchen whose sole purpose is to let the kids get their email and surf the net.  The theory is that if they're surfing in the kitchen, it's unlikely they'll go to bad places on the net.

    It also means we can easily allow them to run as non admins when they surf but be admins on their machines (which is necessary for some of the games they play).

    Ok, enough background.  Yesterday night, I was surfing the web from the kitchen machine, and I noticed that the menu bar on IE had disappeared.  Not only that, but I couldn't right click on any of the toolbars to enable or disable them.  All the IE settings looked reasonable, IE wasn't running in full screen mode, it was just weird.

    Other than this one small behavior (no menus in either IE or in other HTML applications, like the user manager and other control panel applets), the machine was working perfectly.  The behavior for HTAs was weird - there was a Windows logo in the middle of the window where the menu bar should be, but that was it.

    I ran an anti-spyware and virus scan and found nothing. I went to the KB to see if I could find any reason for this happening, but found nothing.

    I even tried starting a chat session with PSS but it never succeeded in connecting.

    I must have spent about 2 hours trying to figure out what was wrong.

    The first inkling of what was actually wrong was when Daniel asked me to get up so he could read his email - he got this weird message about "Outlook Express could not be started because MSOE.DLL could not be initialized".  That was somewhat helpful, and I went to the KB to look it up.  The KB had lots of examples of this for Win98, but not for XP SP2.  So still no luck.

    And then I had my "aha!" moment: I ran chkdsk /f to force a full chkdsk on the drive and rebooted.

    Within a second or so of rebooting, chkdsk started finding corruption on the hard disk.  One of the files that was corrupted was one of the OE DLLs, another was something related to browsing, and there were a couple of other corrupted files.

    I rebooted after running chkdsk, and now I got a message that msimn.exe was invalid or corrupt.  I looked at the file, and yup, MSIMN.EXE was 0 bytes long.  Obviously it was one of the files corrupted on the disk.

    So now I had a system that almost was working, but not quite.

    During my trolls through the KB, I'd run into the SFC command.  The SFC (System File Checker) is a utility in XP and Win 2K3 that will verify that all files protected by WFP (Windows File Protection) are valid.  If it finds invalid files, it restores them from the cache directory.  As per the KB article, I ran SFC /SCANNOW and waited for a while.  Darned if it didn't find all the files that had been corrupted and repaired them.

    So Daniel got his email back, IE got its menus back, and the machine seems to be back on its feet again!

    Man, I love it when stuff works the way it's supposed to.

     

    Btw, my guess is that the data corruptions have either been there for a while and we didn't notice them, or they were introduced during a rapid series of power failures we had on Saturday and Sunday (this machine isn't currently on a UPS so...).

  • Larry Osterman's WebLog

    Unintended consequences of adding APIs to the system

    • 31 Comments

    Yesterday, I wrote about a trick to reduce the number of bits in a number by one.

    It turns out that I've only ever had one opportunity to use this trick (although I ran into an instance of it when code reviewing some stuff the other day), back when I was writing the NT 3.1 network filesystem.

    One of the things that makes writing a network client so hard is that you need to map legacy versions of your network protocol to the semantics that your customers require.  In this case, I needed to map between the "how much disk space is free on the drive" SMB request and the NT filesystem query that retrieves the amount of disk space on the drive.

    It turns out that the legacy servers returned this information in terms of bytes per sector, sectors per cluster and clusters per disk (because that was how MS-DOS maintained that information).

    NT maintained the information in a different form, and the conversion between the two forms required that I divide the interim results by bytes per sector.

    Well, if the bytes per sector was a power of two (which it almost always was), I could do the division trivially (by shifting the interim result right by the base2 logarithm of the sector size). 
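
    In other words, something like this (a modern-C illustration with made-up names - the original code used the internal RTL helpers rather than native 64bit types):

    #include <stdio.h>

    int main(void)
    {
        unsigned long long interimResult = 123456789012ULL;
        unsigned int bytesPerSector = 512;  // a power of two

        // A power of two has exactly one bit set.
        if ((bytesPerSector & (bytesPerSector - 1)) == 0)
        {
            // Compute the base2 logarithm by counting the shift.
            unsigned int log2 = 0;
            while ((1u << log2) < bytesPerSector)
            {
                log2++;
            }
            // Dividing by a power of two is just a right shift.
            printf("%llu\n", interimResult >> log2);
        }
        return 0;
    }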

    Unfortunately, if the bytes per sector was NOT a power of two, I was stuck with a division.  Since the interim value was a 64bit number, that meant that I had to do a 64bit by 32bit division.  At the time I was writing the code, the compilers didn't have support for 64bit arithmetic; instead, we had a number of internal "RTL" functions that would do the math.  We had functions for addition, subtraction and multiplication, but nobody had ever written a version that did division.

    I'd never done numeric processing before, but I needed the function, so...  I went to Dave Cutler (since he knows everything) and asked him how to do the math; he told me, and I trotted off to implement his algorithm.

    Since the code was only used in a special case that was likely never to occur outside of the test labs, I didn't particularly spend a lot of time optimizing the function - I faithfully coded the algorithm in C - two versions, one which did a 64x64 division, another that did a 64x32 division.  I could have written an assembly language version of the 64x32 version (since the processor supported 64x32 division natively), but we were REALLY concerned about portability and since it wasn't going to be called that often, I went with the more portable solution (and I'd been told that if I ever wrote stuff in assembly language, I'd have my thumbs slowly roasted over a red hot hibachi).

    So I wrote the API, wrote enough unit tests to confirm to myself that it worked, checked it in as a new RTL API and went away satisfied that I'd done my first piece of numerical processing.  I was SO happy :)

    At this point, anyone reading this post who was on the NT team at the time is starting to see where this is going.

     

    About three years later (long after I'd left NT and moved onto Exchange), I heard a story about something that happened some point after I'd left the team.  You see, Dave spent a fair amount of time during the NT 3.5 to NT 4 period doing analysis of the bottlenecks in the system and looking for ways of making the system faster.

    At some point, his investigations brought him to (you guessed it) my little division functions.  It turns out that one of the core components in the system (I think it was GDI, but I'm not sure) decided that it needed to divide a 64bit number by a 32bit number.

    Because the compiler didn't have native support for 64bit arithmetic, they went looking through the same RTL functions I did, and they discovered my 64x32 division routine.  So they started using the routine.

    In their innermost loop.

    All of a sudden, the silly routine I'd written to solve a problem that would never be seen in real life became a part of the inner loop of a performance critical function inside the OS.

     

    When Cutler discovered this bottleneck, he looked at the code in my little routine and went through the roof.  I understand he started cursing my name repeatedly before he rewrote it in assembly language.

     

    So the moral of the story is: When you're writing code that's going into a system library, make darned sure that it's written to be as performant as humanly possible, because you never know if someone's going to find your one-off piece of code and put it in their innermost loop.

  • Larry Osterman's WebLog

    What's wrong with this code, part 9 - File Copy

    • 31 Comments

    Yes, it's time for another "What's wrong with this code"...

    Today's example of bad coding involves an attempt to speed up file copy operations by overlapping the read and write operations to the file.  The theory is that the system can be reading from one part of the source file while it's writing to another part of the destination.

    To do this, the author of the code (ok, it was me, constructing a contrived example - this isn't The Daily WTF) tried (quite badly it turns out) to perform the file I/O by queueing the work to worker threads.

    #include <stdio.h>
    #include <stdlib.h>
    #include <windows.h>

    struct FileCopyBlock
    {
        LARGE_INTEGER fileDataOffset;
        DWORD readLength;
        HANDLE sourceHandle;
        HANDLE destinationHandle;
        PLARGE_INTEGER dataLengthWritten;
    };
    const DWORD readChunkSize = 0x10000;

    DWORD WINAPI CopyFileDataChunk(LPVOID Context)
    {
        FileCopyBlock *dataBlock = (FileCopyBlock *)Context;
        BYTE *dataBuffer = new BYTE[dataBlock->readLength];
        if (dataBuffer == NULL)
        {
            printf("Unable to allocate data block for copy\n");
            exit(1);
        }
        if (!SetFilePointerEx(dataBlock->sourceHandle, dataBlock->fileDataOffset, NULL, FILE_BEGIN))
        {
            printf("Unable to set current file position in source to %i64d: %d", dataBlock->fileDataOffset, GetLastError());
            exit(1);
        }
        DWORD readLength;
        if (!ReadFile(dataBlock->sourceHandle, dataBuffer, dataBlock->readLength, &readLength, NULL))
        {
            printf("Unable to read data from file: %d\n", GetLastError());
            exit(1);
        }
        if (readLength != dataBlock->readLength)
        {
            printf("Unable to read all from file: %d\n", GetLastError());
            exit(1);
        }
        if (!SetFilePointerEx(dataBlock->destinationHandle, dataBlock->fileDataOffset, NULL, FILE_BEGIN))
        {
            printf("Unable to set current file position in destination to %i64d: %d\n", dataBlock->fileDataOffset, GetLastError());
            exit(1);
        }
        DWORD writeLength;
        if (!WriteFile(dataBlock->destinationHandle, dataBuffer, dataBlock->readLength, &writeLength, NULL))
        {
            printf("Unable to read data from file: %d\n", GetLastError());
            exit(1);
        }
        if (writeLength != dataBlock->readLength)
        {
            printf("Unable to write all from file - out of disk space?\n");
            exit(1);
        }
        //
        // Account for data read.
        //
        dataBlock->dataLengthWritten->QuadPart += dataBlock->readLength;
        //
        // Free our buffers
        //
        delete [] dataBuffer;
        delete dataBlock;
        return 0;
    }

    void MyCopyFile(LPCTSTR SourceFile, LPCTSTR DestinationFile)
    {
        HANDLE sourceHandle;
        HANDLE destinationHandle;

        sourceHandle = CreateFile(SourceFile, FILE_READ_DATA, 0, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
        if (sourceHandle == INVALID_HANDLE_VALUE)
        {
            printf("Error opening source: %d\n", GetLastError());
            return;
        }
        destinationHandle = CreateFile(DestinationFile, FILE_WRITE_DATA, 0, NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, sourceHandle);
        if (destinationHandle == INVALID_HANDLE_VALUE)
        {
            DWORD error = GetLastError();
            CloseHandle(sourceHandle);
            printf("Error opening destination: %d\n", error);
            return;
        }
        LARGE_INTEGER fileSize;
        if (!GetFileSizeEx(sourceHandle, &fileSize))
        {
            printf("Unable to get file size: %d\n", GetLastError());
            CloseHandle(sourceHandle);
            CloseHandle(destinationHandle);
            return;
        }
        //
        // Break the read up to 64K chunks, queuing it to a worker thread to do the copy.
        //
        LARGE_INTEGER currentOffset;
        LARGE_INTEGER fileSizeRemaining;
        LARGE_INTEGER fileSizeCopied;
        fileSizeRemaining.QuadPart = fileSize.QuadPart;
        fileSizeCopied.QuadPart = 0;
        currentOffset.QuadPart = 0;
        while (fileSizeRemaining.QuadPart != 0)
        {
            FileCopyBlock *copyBlock = new FileCopyBlock;
            if (copyBlock == NULL)
            {   
                printf("Unable to allocate copy block\n");
                CloseHandle(sourceHandle);
                CloseHandle(destinationHandle);
                return;
            }
            if (fileSizeRemaining.QuadPart < readChunkSize)
            {
                copyBlock->readLength = fileSizeRemaining.LowPart;
            }
            else
            {
                copyBlock->readLength = readChunkSize;
            }
            copyBlock->fileDataOffset = currentOffset;
            copyBlock->sourceHandle = sourceHandle;
            copyBlock->destinationHandle = destinationHandle;
            copyBlock->dataLengthWritten = &fileSizeCopied;
            currentOffset.QuadPart += copyBlock->readLength;
            fileSizeRemaining.QuadPart -= copyBlock->readLength;

            QueueUserWorkItem(CopyFileDataChunk, copyBlock, WT_EXECUTEINIOTHREAD);
        }

        while (fileSizeCopied.QuadPart != fileSize.QuadPart)
        {
            Sleep(100);
        }
        CloseHandle(sourceHandle);
        CloseHandle(destinationHandle);
    }

    Clearly this isn't a real world example.  For starters, there is essentially no error recovery - this is by design and is not a bug - adding in real code to handle the errors would make this example even longer than it currently is.  So errors are simply handled with exit(1).

    In addition, the code to handle the end of the copy (sleeping for 100ms in a loop) is indescribably cheesy - it's horribly inefficient and defeats the primary purpose of making the I/O overlapped - if you time this file copy routine, it will always take a multiple of 100ms because of the sleep, which wastes a lot of time on small files.

    The code also doesn't handle ACLs and alternate data streams (although it should handle file attributes and EAs).  This is also not a bug.  And the code always creates the destination file even if it already exists; again, that's by design.

    But even with all these caveats, there are still at least five major bugs in this code.  Three of them are serious design flaws, one is a complicated implementation issue, and the other is a simple implementation issue.

    The fascinating thing about this is that even with these flaws, the code works just fine in many cases - I was unable to provoke a failure using my test harness, even though the errors are quite fundamental.

    As always, tomorrow will bring the answers and kudos to those who found the problems (and to those who found problems I missed while testing this).
