December, 2004

  • The Old New Thing

    BOOL vs. VARIANT_BOOL vs. BOOLEAN vs. bool

    • 29 Comments

    Still more ways of saying the same thing. Why so many?

    Because each was invented by different people at different times to solve different problems.

    BOOL is the oldest one. Its definition is simply

    typedef int BOOL;
    

    The C programming language uses "int" as its boolean type, and Windows 1.0 was written back when C was the cool language for systems programming.

    Next came BOOLEAN.

    typedef BYTE  BOOLEAN;
    

    This type was introduced by the OS/2 NT team when they decided to write a new operating system from scratch. It lingers in Win32 in the places where the original NT design peeks through, like the security subsystem and interacting with drivers.

    Off to the side came VARIANT_BOOL.

    typedef short VARIANT_BOOL;
    #define VARIANT_TRUE ((VARIANT_BOOL)-1)
    #define VARIANT_FALSE ((VARIANT_BOOL)0)
    

    This was developed by the Visual Basic folks. Basic uses -1 to represent "true" and 0 to represent "false", and VARIANT_BOOL was designed to preserve this behavior.

    Common bug: When manipulating VARIANTs of type VT_BOOL, if you want to set a boolean value to "true", you must use VARIANT_TRUE. Many people mistakenly use TRUE or true, which are not the same thing as VARIANT_TRUE. You can cause problems with scripting languages if you get them confused. (For symmetry, you should also use VARIANT_FALSE instead of FALSE or false. All three happen to have the same numerical value, however, so a mistake when manipulating "false" values is not fatal.)
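
    To make the "true" case concrete, here's a minimal sketch (my own illustration, not from the original discussion) of setting a VT_BOOL VARIANT:

    #include <windows.h>
    #include <oleauto.h>
    
    // Set a VARIANT to a boolean value. Note VARIANT_TRUE (-1), not
    // TRUE (1) or C++ true: a script that compares the value against
    // its True may otherwise conclude that your "true" isn't true.
    void SetVariantBool(VARIANT *pvar, bool fValue)
    {
     VariantInit(pvar);
     V_VT(pvar) = VT_BOOL;
     V_BOOL(pvar) = fValue ? VARIANT_TRUE : VARIANT_FALSE;
    }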

    Newest on the scene is bool, which is a C++ data type that has the value true or false. You won't see this used much (if at all) in Win32 because Win32 tries to remain C-compatible.

    (Note that C-compatible isn't the same as C-friendly. Although you can do COM from C, it isn't fun.)

  • The Old New Thing

    How to open those plastic packages of electronics without injuring yourself

    • 57 Comments

    Small electronics nowadays come in those impossible-to-open plastic packages. A few weeks ago I tried to open one and managed not to slice my hand with the knife I was using. (Those who know me know that knives and I don't get along well.) Unfortunately, I failed to pay close attention to the sharp edges of the cut plastic and ended up cutting three of my fingers.

    The next day, I called the manufacturer's product support number and politely asked, "How do I open the package?"

    The support person recommended using a pair of very heavy scissors. (I tried scissors, but mine weren't heavy enough and couldn't cut through the thick plastic.) Cut across the top, then down the sides, being careful to avoid the sharp shards you're creating. (You might want to wear gloves.)

    If you bought someone a small electronics thingie, consider keeping a pair of heavy scissors on hand. That's my tip for the season.

  • The Old New Thing

    You can create an infinitely recursive directory tree

    • 40 Comments

    It is possible to create an infinitely recursive directory tree. This throws many recursive directory-traversal functions into disarray. Here's how you do it. (Note: Requires NTFS.)

    Create a directory in the root of your C: drive; call it C:\C, for lack of a more creative name. Right-click My Computer, select Manage, and click the Disk Management snap-in.

    From the Disk Management snap-in, right-click the C drive and select "Change Drive Letter and Paths...".

    From the "Change Drive Letter and Paths for C:" dialog, click "Add", then where it says "Mount in the following empty NTFS folder", enter "C:\C". Click OK.

    Congratulations, you just created an infinitely recursive directory.

    C:\> dir
    
     Volume in drive has no label
     Volume Serial Number is A035-E01D
    
     Directory of C:\
    
    08/19/2001  08:43 PM                 0 AUTOEXEC.BAT
    12/23/2004  09:43 PM    <JUNCTION>     C
    05/05/2001  04:09 PM                 0 CONFIG.SYS
    12/16/2001  04:34 PM    <DIR>          Documents and Settings
    08/10/2004  12:00 AM    <DIR>          Program Files
    08/28/2004  01:08 PM    <DIR>          WINDOWS
                   2 File(s)              0 bytes
                   4 Dir(s)   2,602,899,968 bytes free
    
    C:\> dir C:\C
    
     Volume in drive has no label
     Volume Serial Number is A035-E01D
    
     Directory of C:\C
    
    08/19/2001  08:43 PM                 0 AUTOEXEC.BAT
    12/23/2004  09:43 PM    <JUNCTION>     C
    05/05/2001  04:09 PM                 0 CONFIG.SYS
    12/16/2001  04:34 PM    <DIR>          Documents and Settings
    08/10/2004  12:00 AM    <DIR>          Program Files
    08/28/2004  01:08 PM    <DIR>          WINDOWS
                   2 File(s)              0 bytes
                   4 Dir(s)   2,602,899,968 bytes free
    
    
    C:\> dir C:\C\C\C\C\C\C
    
     Volume in drive has no label
     Volume Serial Number is A035-E01D
    
     Directory of C:\C\C\C\C\C\C
    
    08/19/2001  08:43 PM                 0 AUTOEXEC.BAT
    12/23/2004  09:43 PM    <JUNCTION>     C
    05/05/2001  04:09 PM                 0 CONFIG.SYS
    12/16/2001  04:34 PM    <DIR>          Documents and Settings
    08/10/2004  12:00 AM    <DIR>          Program Files
    08/28/2004  01:08 PM    <DIR>          WINDOWS
                   2 File(s)              0 bytes
                   4 Dir(s)   2,602,899,968 bytes free
    

    Go ahead and add as many "\C"s as you like. You'll just get your own C drive back again.

    Okay, now that you've had your fun, go back to the "Change Drive Letter and Paths for C:" dialog and Remove the "C:\C" entry. Do this before you create some real havoc.

    Now imagine what would happen if you tried a recursive treecopy from that mysterious C:\C directory. Or if you ran a program that performed some sort of recursive operation starting at C:\C, like, say, trying to add up the sizes of all the files in it.

    If you're writing such a program, you need to be aware of reparse points (that thing that shows up as <JUNCTION> in the directory listing). You can identify them because their file attributes include the FILE_ATTRIBUTE_REPARSE_POINT flag. Of course, what you do when you find one of these is up to you. I'm just warning you that these strange things exist and if you aren't careful, your program can go into an infinite loop.
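
    If you're writing such a traversal, here's a minimal sketch (my own, with the usual caveats; link with shlwapi.lib) of a recursive walk that checks the flag and declines to descend into reparse points:

    #include <windows.h>
    #include <shlwapi.h>
    #include <tchar.h>
    
    void WalkTree(LPCTSTR pszDir)
    {
     TCHAR szSearch[MAX_PATH];
     if (!PathCombine(szSearch, pszDir, TEXT("*.*"))) return;
    
     WIN32_FIND_DATA wfd;
     HANDLE hfind = FindFirstFile(szSearch, &wfd);
     if (hfind == INVALID_HANDLE_VALUE) return;
    
     do {
      if (lstrcmp(wfd.cFileName, TEXT(".")) == 0 ||
          lstrcmp(wfd.cFileName, TEXT("..")) == 0) continue;
    
      TCHAR szPath[MAX_PATH];
      if (!PathCombine(szPath, pszDir, wfd.cFileName)) continue;
      _tprintf(TEXT("%s\n"), szPath);
    
      // Recurse only into genuine directories; a junction like C:\C
      // would send us around the loop forever.
      if ((wfd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY) &&
         !(wfd.dwFileAttributes & FILE_ATTRIBUTE_REPARSE_POINT)) {
       WalkTree(szPath);
      }
     } while (FindNextFile(hfind, &wfd));
     FindClose(hfind);
    }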

  • The Old New Thing

    Why do I get E_NOINTERFACE when creating an object that supports that interface?

    • 5 Comments

    I've seen a few questions from people who call the CoCreateInstance function, asking for an interface that they know the object supports, yet receiving error E_NOINTERFACE. What's going on?

    You're seeing the same problem as the missing IMarshal, just from the other side.

    If your threading model is incompatible with the threading model of the object you're creating, then COM marshalling kicks in. And if the marshalling stuff isn't there, the error that comes out is E_NOINTERFACE, because the marshalling interface is missing.

    A common source of this is attempting to use COM objects provided by the shell from a multi-threaded apartment. Remember that shell COM objects are, for the most part, apartment-threaded, not free-threaded. If you want to use shell objects, you should do so from single-threaded apartments.
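
    A minimal sketch of the safe pattern (my own illustration; CLSID_ShellLink is just a stand-in for whatever shell object you're creating):

    #include <windows.h>
    #include <shlobj.h>
    
    int __cdecl main(int argc, char **argv)
    {
     // Shell objects are mostly apartment-threaded, so enter a
     // single-threaded apartment, not the MTA, before creating them.
     if (SUCCEEDED(CoInitializeEx(NULL, COINIT_APARTMENTTHREADED))) {
      IShellLink *psl;
      if (SUCCEEDED(CoCreateInstance(CLSID_ShellLink, NULL,
                          CLSCTX_INPROC_SERVER,
                          IID_IShellLink, (void**)&psl))) {
       // ... use the object ...
       psl->Release();
      }
      CoUninitialize();
     }
     return 0;
    }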

  • The Old New Thing

    Computing the size of a directory is more than just adding file sizes

    • 30 Comments

    One might think that computing the size of a directory would be a simple matter of adding up the sizes of all the files in it.

    Oh if it were only that simple.

    There are many things that make computing the size of a directory difficult, some of which even throw into doubt the very existence of the concept "size of a directory".

    Reparse points
    We mentioned this last time. Do you want to recurse into reparse points when you are computing the size of a directory? It depends on why you're computing the directory size.

    If you're computing the size in order to show the user how much disk space they will gain by deleting the directory, then you do or don't, depending on how you're going to delete the reparse point.

    If you're computing the size in preparation for copying, then you probably do. Or maybe you don't - should the copy merely copy the reparse point instead of tunneling through it? What do you do if the user doesn't have permission to create reparse points? Or if the destination doesn't support reparse points? Or if the user is creating the copy as a back-up?

    Hard links
    Hard links are multiple directory entries for the same file. If you're calculating the size of a directory and you find a hard link, do you count the file at its full size? Or do you say that each directory entry for a hard link carries a fraction of the "weight" of the file? (So if a file has two hard links, then each entry counts for half the file size.)

    Dividing the "weight" of the file among its hard links avoids double-counting (or higher), so that when all the hard links are found, the file's total size is correctly accounted for. And it represents the concept that all the hard links to a file "share the cost" of the resources the file consumes. But what if you don't find all the hard links? It it correct that the file was undercounted? [Minor typo fixed, 12pm]

    If you're copying a file and you discover that it has multiple hard links, what do you do? Do you break the links in the copy? Do you attempt to reconstruct them? What if the destination doesn't support hard links?
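
    For concreteness, here's one way (my own sketch) to notice multiply-linked files and avoid counting them twice, using the link count and file index that GetFileInformationByHandle reports:

    #include <windows.h>
    #include <set>
    #include <utility>
    
    typedef std::pair<DWORD, ULONGLONG> FILEID; // volume serial + file index
    
    // Returns true if this file was already counted via another hard
    // link. (Dividing the "weight" instead, as discussed above, is an
    // equally defensible policy.)
    bool AlreadySeen(LPCTSTR pszPath, std::set<FILEID>& seen)
    {
     HANDLE h = CreateFile(pszPath, 0, // metadata access only
                           FILE_SHARE_READ | FILE_SHARE_WRITE, NULL,
                           OPEN_EXISTING, 0, NULL);
     if (h == INVALID_HANDLE_VALUE) return false;
    
     BY_HANDLE_FILE_INFORMATION bhfi;
     bool fSeen = false;
     if (GetFileInformationByHandle(h, &bhfi) &&
         bhfi.nNumberOfLinks > 1) {
      ULONGLONG index = ((ULONGLONG)bhfi.nFileIndexHigh << 32)
                      | bhfi.nFileIndexLow;
      // insert().second is false if this id was seen before
      fSeen = !seen.insert(
          std::make_pair(bhfi.dwVolumeSerialNumber, index)).second;
     }
     CloseHandle(h);
     return fSeen;
    }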

    Compressed files
    By this I'm talking about filesystem compression rather than external compression algorithms like ZIP.

    When adding up the size of the files in a directory, do you add up the logical size or the physical size? If you're computing the size in preparation for copying, then you probably want the logical size, but if you're computing to see how much disk space would be freed up by deleting it, then you probably want physical size.

    But if you're computing for copying and the copy destination supports compression, do you want to use the physical size after all? Now you're assuming that the source and destination compression algorithms are comparable.

    Sparse files
    Sparse files have the same problems as compressed files. Do you want to add up the logical or physical size?

    Cluster rounding
    Even for uncompressed non-sparse files, you may want to take into account the size of the disk blocks. A directory with a lot of small files takes up more space on disk than just the sum of the file sizes. Do you want to reflect this in your computations? If you traversed across a reparse point, the cluster size may have changed as well.
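
    To make the compression, sparseness, and cluster discussion concrete, here's a sketch (my own illustration) that computes both numbers for a single file; cbCluster is the bytes-per-cluster value you can get from GetDiskFreeSpace for the volume:

    #include <windows.h>
    
    ULONGLONG LogicalSize(LPCTSTR pszFile)
    {
     WIN32_FILE_ATTRIBUTE_DATA fad;
     if (!GetFileAttributesEx(pszFile, GetFileExInfoStandard, &fad))
      return 0;
     return ((ULONGLONG)fad.nFileSizeHigh << 32) | fad.nFileSizeLow;
    }
    
    // GetCompressedFileSize reports the bytes of disk storage actually
    // used, which is what shrinks for compressed and sparse files.
    // Rounding up to the cluster size approximates the allocation cost
    // of ordinary files; it is only an approximation, as the list of
    // caveats above should make clear.
    ULONGLONG PhysicalSize(LPCTSTR pszFile, DWORD cbCluster)
    {
     DWORD cbHigh;
     DWORD cbLow = GetCompressedFileSize(pszFile, &cbHigh);
     if (cbLow == INVALID_FILE_SIZE && GetLastError() != NO_ERROR)
      return 0;
     ULONGLONG cb = ((ULONGLONG)cbHigh << 32) | cbLow;
     return (cb + cbCluster - 1) / cbCluster * cbCluster;
    }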

    Alternate data streams
    Alternate data streams are another place where a file can occupy disk space that is not reflected in its putative "size".

    Bookkeeping overhead
    There is always bookkeeping overhead associated with file storage. In addition to the directory entry (or entries), space also needs to be allocated for the security information, as well as the information that keeps track of where the file's contents can be found. For a highly-fragmented file, this information can be rather extensive. Do you want to count that towards the size of the directory? If so, how?

    There is no single answer to all of the above questions. You have to consider each one, apply it to your situation, and decide which way you want to go.

    (And copying a directory tree is even scarier. What do you do with the ACLs? Do you copy them too? Do you preserve the creation date? It all depends on why you're copying the tree.)

  • The Old New Thing

    Using fibers to simplify enumerators, part 3: Having it both ways

    • 28 Comments

    As we discovered in the previous two entries, the problem with enumeration is that somebody always loses.

    Now we will use fibers to fight back. Before you decide to use fibers in your programs, make sure to read the dire warnings at the end of this article. My goal here is to show one use of fibers, not to say that fibers are the answer to all your problems. Fibers can create more problems than they solve. We'll come back to all the dire warnings later.

    As with most clever ideas, it has a simple kernel: Use a fiber to run both the caller and the enumerator each on their own stack.

    #include <windows.h>
    #include <shlwapi.h>
    #include <stdio.h>
    #include <strsafe.h>
    
    enum FEFOUND {
     FEF_FILE,          // found a file
     FEF_DIR,           // found a directory
     FEF_LEAVEDIR,      // leaving a directory
     FEF_DONE,          // finished
    };
    
    enum FERESULT {
     FER_CONTINUE,      // continue enumerating
                        // (if directory: recurse into it)
     FER_SKIP,          // skip directory (do not recurse)
    };
    
    class __declspec(novtable) FiberEnumerator {
    public:
     FiberEnumerator();
     ~FiberEnumerator();
    
     FEFOUND Next();
     void SetResult(FERESULT fer) { m_fer = fer; }
     void Skip() { SetResult(FER_SKIP); }
    
     virtual LPCTSTR GetCurDir() = 0;
     virtual LPCTSTR GetCurPath() = 0;
     virtual const WIN32_FIND_DATA* GetCurFindData() = 0;
    
    protected:
     virtual void FiberProc() = 0;
    
     static void DECLSPEC_NORETURN WINAPI
        s_FiberProc(void* pvContext);
    
     FERESULT Produce(FEFOUND fef);
    
    protected:
     void* m_hfibCaller;
     void* m_hfibEnumerator;
     FEFOUND  m_fef;
     FERESULT m_fer;
    };
    
    FiberEnumerator::FiberEnumerator()
     : m_fer(FER_CONTINUE)
    {
     m_hfibEnumerator = CreateFiber(0, s_FiberProc, this);
    }
    
    FiberEnumerator::~FiberEnumerator()
    {
     DeleteFiber(m_hfibEnumerator);
    }
    
    void DECLSPEC_NORETURN FiberEnumerator::
        s_FiberProc(void *pvContext)
    {
     FiberEnumerator* self =
        reinterpret_cast<FiberEnumerator*>(pvContext);
     self->FiberProc();
    
     // Theoretically, we need only produce Done once,
     // but keep looping in case a consumer gets
     // confused and asks for the Next() item even
     // though we're Done.
     for (;;) self->Produce(FEF_DONE);
    }
    

    This helper class does the basic bookkeeping of fiber-based enumeration. At construction, it creates the fiber that will produce the enumeration; at destruction, it cleans that fiber up. (The fiber that consumes the enumeration is captured each time Next() is called; hold that thought for the exercise at the end.) The derived class is expected to implement the FiberProc method and call Produce() every so often.

    The real magic happens in the (somewhat anticlimactic) Produce() and Next() methods:

    FERESULT FiberEnumerator::Produce(FEFOUND fef)
    {
     m_fef = fef; // for Next() to retrieve
     m_fer = FER_CONTINUE; // default
     SwitchToFiber(m_hfibCaller);
     return m_fer;
    }
    
    FEFOUND FiberEnumerator::Next()
    {
     m_hfibCaller = GetCurrentFiber();
     SwitchToFiber(m_hfibEnumerator);
     return m_fef;
    }
    

    To Produce() something, we remember the production code, pre-set the enumeration result to its default of FER_CONTINUE, and switch to the consumer fiber. When the consumer fiber comes back with an answer, we return it from Produce().

    To get the next item, we remember the identity of the calling fiber, then switch to the enumerator fiber. This runs the enumerator until it decides to Produce() something, at which point we take the production code and return it.

    That's all there is to it. The m_fef and m_fer members are for passing the parameters and results back and forth across the fiber boundary.

    Okay, with that groundwork out of the way, writing the producer itself is rather anticlimactic.

    Since we want to make things easy for the consumer, we use the interface the consumer would have designed, with some assistance from the helper class.

    class DirectoryTreeEnumerator : public FiberEnumerator {
    public:
     DirectoryTreeEnumerator(LPCTSTR pszDir);
     ~DirectoryTreeEnumerator();
    
     LPCTSTR GetCurDir() { return m_pseCur->m_szDir; }
     LPCTSTR GetCurPath() { return m_szPath; }
     const WIN32_FIND_DATA* GetCurFindData()
        { return &m_pseCur->m_wfd; }
    
    private:
     void FiberProc();
     void Enum();
    
     struct StackEntry {
       StackEntry* m_pseNext;
       HANDLE m_hfind;
       WIN32_FIND_DATA m_wfd;
       TCHAR m_szDir[MAX_PATH];
     };
     bool Push(StackEntry* pse);
     void Pop();
    
    private:
     StackEntry *m_pseCur;
     TCHAR m_szPath[MAX_PATH];
    };
    
    DirectoryTreeEnumerator::
     DirectoryTreeEnumerator(LPCTSTR pszDir)
     : m_pseCur(NULL)
    {
     StringCchCopy(m_szPath, MAX_PATH, pszDir);
    }
    
    DirectoryTreeEnumerator::~DirectoryTreeEnumerator()
    {
     while (m_pseCur) {
       Pop();
     }
    }
    
    bool DirectoryTreeEnumerator::
          Push(StackEntry* pse)
    {
     pse->m_pseNext = m_pseCur;
     pse->m_hfind = INVALID_HANDLE_VALUE; // so Pop() is safe if we bail early
     m_pseCur = pse;
     return
      SUCCEEDED(StringCchCopy(pse->m_szDir,
                     MAX_PATH, m_szPath)) &&
      PathCombine(m_szPath, pse->m_szDir, TEXT("*.*")) &&
      (pse->m_hfind = FindFirstFile(m_szPath,
           &pse->m_wfd)) != INVALID_HANDLE_VALUE;
    }
    
    void DirectoryTreeEnumerator::Pop()
    {
     StackEntry* pse = m_pseCur;
     if (pse->m_hfind != INVALID_HANDLE_VALUE) {
      FindClose(pse->m_hfind);
     }
     m_pseCur = pse->m_pseNext;
    }
    
    void DirectoryTreeEnumerator::FiberProc()
    {
     Enum();
    }
    
    void DirectoryTreeEnumerator::Enum()
    {
     StackEntry se;
     if (Push(&se)) {
      do {
       if (lstrcmp(se.m_wfd.cFileName, TEXT(".")) != 0 &&
           lstrcmp(se.m_wfd.cFileName, TEXT("..")) != 0 &&
           PathCombine(m_szPath, se.m_szDir, se.m_wfd.cFileName)) {
        FEFOUND fef = (se.m_wfd.dwFileAttributes &
                        FILE_ATTRIBUTE_DIRECTORY) ?
                        FEF_DIR : FEF_FILE;
        if (Produce(fef) == FER_CONTINUE && fef == FEF_DIR) {
         Enum(); // recurse into the subdirectory we just produced
        }
       }
      } while (FindNextFile(se.m_hfind, &se.m_wfd));
     }
     Produce(FEF_LEAVEDIR);
     Pop();
    }
    

    As you can see, this class is a mix of the two previous classes. Like the consumer-based class, information about the item being enumerated is obtained by calling methods on the enumerator object. But like the callback-based version, the loop that generates the objects themselves is a very simple recursive function, with a call to Produce in place of a callback.

    In fact, it's even simpler than the callback-based version, since we don't have to worry about the FER_STOP code. If the consumer wants to stop enumeration, the consumer simply stops calling Next().

    Most of the complexity in the class is just bookkeeping to permit abandoning the enumeration prematurely.

    Okay, let's take this fiber out for a spin. You can use the same TestWalk function as last time, but for added generality, change the first parameter from DirectoryTreeEnumerator* to FiberEnumerator*. (The significance of this will become apparent next time.)
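
    In case you don't have the previous entry handy, a consumer along these lines will do (my reconstruction rather than the original TestWalk, with the parameter already widened to FiberEnumerator*; it belongs in the same source file as the classes above):

    #include <tchar.h> // for _tprintf, to match the TCHAR-based classes
    
    void TestWalk(FiberEnumerator* pfe)
    {
     FEFOUND fef;
     while ((fef = pfe->Next()) != FEF_DONE) {
      switch (fef) {
      case FEF_FILE:
       _tprintf(TEXT("%s\n"), pfe->GetCurPath());
       break;
      case FEF_DIR:
       _tprintf(TEXT("%s (dir)\n"), pfe->GetCurPath());
       // leaving the default FER_CONTINUE in place recurses into it;
       // call pfe->Skip() to skip this directory instead
       break;
      case FEF_LEAVEDIR:
       break;
      }
     }
    }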

    A little tweak needs to be made to the main function, though.

    int __cdecl main(int argc, char **argv)
    {
     ConvertThreadToFiber(NULL);
     DirectoryTreeEnumerator e(TEXT("."));
     TestWalk(&e);
     ConvertFiberToThread();
     return 0;
    }
    

    Since the enumerator is going to switch between fibers, we'd better convert the thread to a fiber so it'll have something to switch back to!

    Here's a schematic of what happens when you run this fiber-based enumerator:

    Main fiber                            Enumerator fiber
    ConvertThreadToFiber
    construct DirectoryTreeEnumerator
      CreateFiber                         (not running)
      initialize variables
    Next(CONTINUE)
      SwitchToFiber()                     starts running
                                          FindFirstFile etc
                                          Produce(FILE)
    Next(CONTINUE) "returns" FILE         SwitchToFiber()
    use the result
    Next(CONTINUE)
      SwitchToFiber()                     Produce(FILE) "returns" CONTINUE
                                          FindNextFile etc
                                          Produce(DIR)
    Next(CONTINUE) "returns" DIR          SwitchToFiber()
    use the result
    Next(CONTINUE)
      SwitchToFiber()                     Produce(DIR) "returns" CONTINUE
    ... and so on ... until ...
                                          Produce(DONE)
    Next(CONTINUE) "returns" DONE         SwitchToFiber()
    cleanup
    destruct DirectoryTreeEnumerator
      DeleteFiber
    ConvertFiberToThread

    Observe that from each fiber's point of view, the other fiber is just a subroutine!

    Coding subtlety: Why do we capture the caller's fiber each time the Next() method is called? Why not capture it when the FiberEnumerator is constructed?

    Next time, we'll see how this fiber-based enumerator easily admits higher-order operations such as filtering and composition.

    Dire warnings about fibers

    Fibers are like dynamite. Mishandle them and your process explodes.

    The first dire warning is that fibers are expensive in terms of address space, since each one gets its own stack (typically a megabyte).

    And since each fiber has its own stack, it also has its own exception chain. This means that if a fiber throws an exception, only that fiber can catch it. (Same as threads.) That's a strong argument against using an STL std::stack object to maintain our state: STL is based on an exception-throwing model, but you can't catch exceptions raised by another fiber. (You also can't throw exceptions past a COM boundary, which severely limits how much you can use STL in a COM object.)

    One of the big problems with fibers is that everybody has to be in cahoots. You need to decide on one person who will call the ConvertThreadToFiber function since fiber/thread conversion is not reference-counted. If two people call ConvertThreadToFiber on the same thread, the first will convert it, and so will the second! This results in two fibers for the same thread, and things can only get worse from there.

    You might think, "Well, wouldn't the GetCurrentFiber function return NULL if the thread hasn't been converted to a fiber?" Try it: It returns garbage. (It's amazing how many people ask questions without taking even the slightest steps towards figuring out the answer themselves. Try writing a test program.)

    But even if GetCurrentFiber told you whether or not the thread had been converted to a fiber, that still won't help. Suppose two people want to do fibrous activity on the thread. The first converts, the second notices that the thread is already a fiber (somehow) and skips the conversion. Now the first operation completes and calls the ConvertFiberToThread function. Oh great, now the second operation is stranded doing fibrous activity without a fiber!

    Therefore, you can use fibers safely only if you control the thread and can get all your code to agree on who controls the fiber/thread conversion.

    An important consequence of the "in cahoots" rule is that you have to make sure all the code you use on a fiber is "fiber-safe" - a level of safety even beyond thread-safety. The C runtime library keeps information in per-thread state: there's errno, plus all sorts of bonus bookkeeping that happens when you create a thread or call one of the various functions that maintain state in per-thread data (such as strerror, _fcvt, and strtok).

    In particular, C++ exception handling is managed by the runtime, and the runtime tracks this data in per-thread state (rather than per-fiber state). Therefore, if you throw a C++ exception from a fiber, strange things happen.

    (Note: Things may have changed in the C runtime lately; I'm operating from information that's a few years old.)

    Even if you carefully avoid the C runtime library, you still have to worry about any other libraries you use that use per-thread data. None of them will work with fibers. If you see a call to the TlsAlloc function, then there's a good chance that the library is not fiber-safe. (The fiber-safe version is the FlsAlloc function.)
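
    For illustration, here's a minimal sketch of the fiber-local pattern (my own example; FlsAlloc and friends require Windows Server 2003 or later):

    #include <windows.h>
    
    DWORD g_dwFlsIndex = FLS_OUT_OF_INDEXES;
    
    void InitFls()
    {
     g_dwFlsIndex = FlsAlloc(NULL); // NULL = no per-fiber cleanup callback
    }
    
    void UseFromAnyFiber()
    {
     // The stored value travels with the *fiber* across SwitchToFiber
     // calls, which is exactly what a TlsSetValue'd value gets wrong.
     FlsSetValue(g_dwFlsIndex, (PVOID)42);
     PVOID pv = FlsGetValue(g_dwFlsIndex);
     (void)pv; // use the value...
    }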

    Another category of things that are not fiber-safe are windows. Windows have thread affinity, not fiber affinity.

  • The Old New Thing

    Optimization is often counter-intuitive

    • 25 Comments

    Anybody who's done intensive optimization knows that optimization is often counter-intuitive. Things you think would be faster often aren't.

    Consider, for example, the exercise of obtaining the current instruction pointer. There's the naïve solution:

    #include <intrin.h>           // declares _ReturnAddress
    #pragma intrinsic(_ReturnAddress)
    
    __declspec(noinline)
    void *GetCurrentAddress()
    {
      return _ReturnAddress();
    }
    
    ...
    void *currentInstruction = GetCurrentAddress();
    

    If you look at the disassembly, you'll get something like this:

    GetCurrentAddress:
        mov eax, [esp]
        ret
    
    ...
        call GetCurrentAddress
        mov [currentInstruction], eax
    

    "Pah," you say to yourself, "look at how inefficient that is. I can reduce that to two instructions. Watch:

    void *currentInstruction;
    __asm {
    call L1
    L1: pop currentInstruction
    }
    

    That's half the instruction count of your bloated girly-code."

    But if you sit down and race the two code sequences, you'll find that the function-call version is faster by a factor of two! How can that be?
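
    If you want to race them yourself, here's a minimal harness (my own sketch: MSVC inline assembly, x86 only, and the exact numbers will vary with the processor):

    #include <windows.h>
    #include <stdio.h>
    #include <intrin.h>
    
    __declspec(noinline)
    void *GetCurrentAddress()
    {
      return _ReturnAddress();
    }
    
    volatile void *g_p; // keeps the optimizer from deleting the loops
    
    int __cdecl main(int argc, char **argv)
    {
      LARGE_INTEGER liFreq, liStart, liEnd;
      QueryPerformanceFrequency(&liFreq);
    
      QueryPerformanceCounter(&liStart);
      for (int i = 0; i < 10000000; i++) g_p = GetCurrentAddress();
      QueryPerformanceCounter(&liEnd);
      printf("call/ret pair:  %f s\n",
             (liEnd.QuadPart - liStart.QuadPart) / (double)liFreq.QuadPart);
    
      QueryPerformanceCounter(&liStart);
      for (int j = 0; j < 10000000; j++) {
        void *addr;
        __asm {
          call L1
      L1: pop addr
        }
        g_p = addr;
      }
      QueryPerformanceCounter(&liEnd);
      printf("call/pop trick: %f s\n",
             (liEnd.QuadPart - liStart.QuadPart) / (double)liFreq.QuadPart);
      return 0;
    }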

    The reason is the "hidden variables" inside the processor. All modern processors contain much more state than you can see from the instruction sequence. There are TLBs, L1 and L2 caches, all sorts of stuff that you can't see. The hidden variable that is important here is the return address predictor.

    The more recent Pentium (and I believe also Athlon) processors maintain an internal stack that is updated by each CALL and RET instruction. When a CALL is executed, the return address is pushed both onto the real stack (the one that the ESP register points to) as well as to the internal return address predictor stack; a RET instruction pops the top address of the return address predictor stack as well as the real stack.

    The return address predictor stack is used when the processor decodes a RET instruction. It looks at the top of the return address predictor stack and says, "I bet that RET instruction is going to return to that address." It then speculatively executes the instructions at that address. Since programs rarely fiddle with return addresses on the stack, these predictions tend to be highly accurate.

    That's why the "optimization" turns out to be slower. Let's say that at the point of the CALL L1 instruction, the return address predictor stack looks like this:

    Return address
    predictor stack:  caller1 -> caller2 -> caller3 -> ...
    Actual stack:     caller1 -> caller2 -> caller3 -> ...

    Here, caller1 is the function's caller, caller2 is the function's caller's caller, and so on. So far, the return address predictor stack is right on target. (I've drawn the actual stack below the return address predictor stack so you can see that they match.)

    Now you execute the CALL instruction. The return address predictor stack and the actual stack now look like this:

    Return address
    predictor stack:  L1 -> caller1 -> caller2 -> caller3 -> ...
    Actual stack:     L1 -> caller1 -> caller2 -> caller3 -> ...

    But instead of executing a RET instruction, you pop off the return address. This removes it from the actual stack, but doesn't remove it from the return address predictor stack.

    Return address
    predictor stack:  L1      -> caller1 -> caller2 -> caller3 -> ...
    Actual stack:     caller1 -> caller2 -> caller3 -> caller4 -> ...

    I think you can see where this is going.

    Eventually your function returns. The processor decodes your RET instruction and looks at the return address predictor stack and says, "My predictor stack says that this RET is going to return to L1. I will begin speculatively executing there."

    But oh no, the value on the top of the real stack isn't L1 at all. It's caller1. The processor's return address predictor predicted incorrectly, and it ended up wasting its time studying the wrong code!

    The effects of this bad guess don't end there. After the RET instruction, the return address predictor stack looks like this:

    Return address
    predictor stack:  caller1 -> caller2 -> caller3 -> ...
    Actual stack:     caller2 -> caller3 -> caller4 -> ...

    Eventually your caller returns. Again, the processor consults its return address predictor stack and speculatively executes at caller1. But that's not where you're returning to. You're really returning to caller2.

    And so on. By mismatching the CALL and RET instructions, you managed to make every single return address prediction on the stack wrong. Notice in the diagram that, barring further games of the sort that created the problem in the first place, not a single entry on the return address predictor stack will be correct: none of the predicted return addresses matches the actual return address.

    Your peephole optimization has proven to be shortsighted.

    Some processors expose this predictor more explicitly. The Alpha AXP, for example, has several types of control flow instructions, all of which have the same logical effect but which hint to the processor how it should maintain its internal predictor stack. For example, the BR instruction says, "Jump to this address, but do not push the old address onto the predictor stack." On the other hand, the JSR instruction says, "Jump to this address, and push the old address onto the predictor stack." There is also a RET instruction that says, "Jump to this address, and pop an address from the predictor stack." (There's also a fourth type that isn't used much.)

    Moral of the story: Just because something looks better doesn't mean that it necessarily is better.

  • The Old New Thing

    Dragging a shell object, part 1: Getting the IDataObject

    • 20 Comments

    The shell gives you the IDataObject; all you have to do is drag it around. (This is the first of a five-part series.)

    Start with the scratch program, and add the function GetUIObjectOfFile from an earlier article. Also, change the calls to CoInitialize and CoUninitialize to OleInitialize and OleUninitialize, respectively, since we're now going to be using full-on OLE and not just COM.
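
    If you don't have that article handy, the helper goes roughly like this (a sketch along the lines of the earlier series; error handling abbreviated):

    #include <shlobj.h>
    
    HRESULT GetUIObjectOfFile(HWND hwnd, LPCWSTR pszPath,
                              REFIID riid, void **ppv)
    {
      *ppv = NULL;
      HRESULT hr;
      LPITEMIDLIST pidl;
      SFGAOF sfgao;
      if (SUCCEEDED(hr = SHParseDisplayName(pszPath, NULL,
                                            &pidl, 0, &sfgao))) {
        IShellFolder *psf;
        LPCITEMIDLIST pidlChild;
        if (SUCCEEDED(hr = SHBindToParent(pidl, IID_IShellFolder,
                                          (void**)&psf, &pidlChild))) {
          // Ask the parent folder for the UI object of the child item.
          hr = psf->GetUIObjectOf(hwnd, 1, &pidlChild, riid, NULL, ppv);
          psf->Release();
        }
        CoTaskMemFree(pidl);
      }
      return hr;
    }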

    In order to initiate a drag/drop operation, we need a drop source:

    class CDropSource : public IDropSource
    {
    public:
      // *** IUnknown ***
      STDMETHODIMP QueryInterface(REFIID riid, void **ppv);
      STDMETHODIMP_(ULONG) AddRef();
      STDMETHODIMP_(ULONG) Release();
    
      // *** IDropSource ***
      STDMETHODIMP QueryContinueDrag(BOOL fEscapePressed, DWORD grfKeyState);
      STDMETHODIMP GiveFeedback(DWORD dwEffect);
    
      CDropSource() : m_cRef(1) { }
    private:
      ULONG m_cRef;
    };
    
    HRESULT CDropSource::QueryInterface(REFIID riid, void **ppv)
    {
      IUnknown *punk = NULL;
      if (riid == IID_IUnknown) {
        punk = static_cast<IUnknown*>(this);
      } else if (riid == IID_IDropSource) {
        punk = static_cast<IDropSource*>(this);
      }
    
      *ppv = punk;
      if (punk) {
        punk->AddRef();
        return S_OK;
      } else {
        return E_NOINTERFACE;
      }
    }
    
    ULONG CDropSource::AddRef()
    {
      return ++m_cRef;
    }
    
    ULONG CDropSource::Release()
    {
      ULONG cRef = --m_cRef;
      if (cRef == 0) delete this;
      return cRef;
    }
    
    HRESULT CDropSource::QueryContinueDrag(
              BOOL fEscapePressed, DWORD grfKeyState)
    {
      if (fEscapePressed) return DRAGDROP_S_CANCEL;
    
      if (!(grfKeyState & (MK_LBUTTON | MK_RBUTTON)))
        return DRAGDROP_S_DROP;
    
      return S_OK;
    }
    
    HRESULT CDropSource::GiveFeedback(DWORD dwEffect)
    {
      return DRAGDROP_S_USEDEFAULTCURSORS;
    }
    

    As you can see, this drop source is extraordinarily boring. Even the interesting methods are uninteresting.

    The IDropSource::QueryContinueDrag method is pretty much boilerplate. If the Escape key was pressed, then cancel the drag/drop operation. If the mouse buttons are released, then complete the operation. Otherwise, continue the operation.

    The IDropSource::GiveFeedback method is even less interesting. It merely returns DRAGDROP_S_USEDEFAULTCURSORS to indicate that it wants default drag feedback.

    Believe it or not, we now have everything we need to drag a file.

    void OnLButtonDown(HWND hwnd, BOOL fDoubleClick,
                       int x, int y, UINT keyFlags)
    {
      IDataObject *pdto;
      // In a real program, of course, you wouldn't use a hard-coded path.
      if (SUCCEEDED(GetUIObjectOfFile(hwnd,
                        L"C:\\Windows\\clock.avi",
                        IID_IDataObject, (void**)&pdto))) {
        IDropSource *pds = new CDropSource();
        if (pds) {
          DWORD dwEffect;
          DoDragDrop(pdto, pds, DROPEFFECT_COPY | DROPEFFECT_LINK,
                     &dwEffect);
          pds->Release();
        }
        pdto->Release();
      }
    }
    
        HANDLE_MSG(hwnd, WM_LBUTTONDOWN, OnLButtonDown);
    

    To drag an object, you need two things: a data object and a drop source. We created our drop source above, and the data object comes from the shell. All that's left to do is start the drag/drop operation by calling the DoDragDrop function.

    Notice that we specify that the permitted operations are DROPEFFECT_COPY and DROPEFFECT_LINK. We specifically disallow DROPEFFECT_MOVE because this program doesn't present a folder-like window; the user has no expectation that the drag/drop will result in a Move operation.

    Next time, adding Move support, just to see how it works.

  • The Old New Thing

    Don't save anything you can recalculate

    • 15 Comments

    Nowadays, a major barrier to performance for many classes of programs is paging. We saw earlier this year that paging can kill a server. Today, another example of how performance became tied to paging.

    The principle is "Don't save anything you can recalculate." This, of course, seems counterintuitive: Shouldn't you save the answer so you don't have to recalculate it?

    The answer is, "It depends."

    If recalculating the answer isn't very expensive and has good data locality, then you may be better off recalculating it than saving it, especially if saving it reduces locality. For example, if the result is stored in a separate object, you now have to touch a second object—risking a page fault—to get the saved answer.

    Last time, we saw how Windows 95 applied this principle so that rebasing a DLL didn't thrash your machine. I'm told that the Access team used this principle to reap significant performance gains. Instead of caching results, they just threw them away and recalculated them the next time they were needed.

    Whether this technique works for you is hard to predict. If your program is processor-bound, then caching computations is probably a good idea. But if your program is memory-bound, then you may be better off getting rid of the cache, since the cache is just creating more memory pressure.

  • The Old New Thing

    The hunt for a faster syscall trap

    • 14 Comments

    The performance of the syscall trap gets a lot of attention.

    I was reminded of a meeting that took place between Intel and Microsoft over fifteen years ago. (Sadly, I was not at the meeting myself, so the story is second-hand.)

    Since Microsoft is one of Intel's biggest customers, their representatives often visit Microsoft to show off what their latest processor can do, lobby the kernel development team to support a new processor feature, and solicit feedback on what sort of features would be most useful to add.

    At this meeting, the Intel representatives asked, "So if you could ask for only one thing to be made faster, what would it be?"

    Without hesitation, one of the lead kernel developers replied, "Speed up faulting on an invalid instruction."

    The Intel half of the room burst out laughing. "Oh, you Microsoft engineers are so funny!" And so the meeting ended with a cute little joke.

    After returning to their labs, the Intel engineers ran profiles against the Windows kernel and, lo and behold, discovered that Windows spent a lot of its time dispatching invalid instruction exceptions. How absurd! Was the Microsoft engineer not kidding around after all?

    No, he wasn't.

    It so happens that on the 80386 chip of that era, the fastest way to get from V86-mode into kernel mode was to execute an invalid instruction! Consequently, Windows/386 used an invalid instruction as its syscall trap.

    What's the moral of this story? I'm not sure. Perhaps it's that when you create something, you may find people using it in ways you had never considered.
