• The Old New Thing

    Computing the size of a directory is more than just adding file sizes

    • 30 Comments

    One might think that computing the size of a directory would be a simple matter of adding up the sizes of all the files in it.

    Oh, if it were only that simple.

    There are many things that make computing the size of a directory difficult, some of which even call into question the very existence of the concept of "the size of a directory".

    Reparse points
    We mentioned this last time. Do you want to recurse into reparse points when you are computing the size of a directory? It depends on why you're computing the directory size.

    If you're computing the size in order to show the user how much disk space they will gain by deleting the directory, then whether you recurse depends on how the deletion will treat the reparse point: if deleting merely removes the reparse point itself, the files it points to aren't actually freed.

    If you're computing the size in preparation for copying, then you probably do. Or maybe you don't: should the copy merely copy the reparse point instead of tunneling through it? What do you do if the user doesn't have permission to create reparse points? Or if the destination doesn't support reparse points? Or if the user is creating a copy because they are making a back-up?

    Hard links
    Hard links are multiple directory entries for the same file. If you're calculating the size of a directory and you find a hard link, do you count the file at its full size? Or do you say that each directory entry for a hard link carries a fraction of the "weight" of the file? (So if a file has two hard links, then each entry counts for half the file size.)

    Dividing the "weight" of the file among its hard links avoids double-counting (or higher), so that when all the hard links are found, the file's total size is correctly accounted for. And it represents the concept that all the hard links to a file "share the cost" of the resources the file consumes. But what if you don't find all the hard links? It it correct that the file was undercounted? [Minor typo fixed, 12pm]

    If you're copying a file and you discover that it has multiple hard links, what do you do? Do you break the links in the copy? Do you attempt to reconstruct them? What if the destination doesn't support hard links?

    Compressed files
    By this I'm talking about filesystem compression rather than external compression algorithms like ZIP.

    When adding up the size of the files in a directory, do you add up the logical size or the physical size? If you're computing the size in preparation for copying, then you probably want the logical size, but if you're computing to see how much disk space would be freed up by deleting it, then you probably want physical size.

    But if you're computing for copying and the copy destination supports compression, do you want to use the physical size after all? Now you're assuming that the source and destination compression algorithms are comparable.
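
    For what it's worth, Win32 will give you both answers for a single file. Here's a minimal sketch (mine, not from the article) that reads the logical size with GetFileAttributesEx and the on-disk size with GetCompressedFileSize, which also accounts for sparse files.

    #include <windows.h>
    
    // Logical size: what a copy would have to write.
    // Physical size: roughly what deleting the file would free.
    BOOL GetBothSizes(const wchar_t* path, ULONGLONG* logical, ULONGLONG* physical)
    {
        WIN32_FILE_ATTRIBUTE_DATA fad;
        if (!GetFileAttributesExW(path, GetFileExInfoStandard, &fad)) return FALSE;
        *logical = ((ULONGLONG)fad.nFileSizeHigh << 32) | fad.nFileSizeLow;
    
        DWORD high = 0;
        DWORD low = GetCompressedFileSizeW(path, &high);
        if (low == INVALID_FILE_SIZE && GetLastError() != NO_ERROR) return FALSE;
        *physical = ((ULONGLONG)high << 32) | low; // compressed/sparse size on disk
        return TRUE;
    }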

    Sparse files
    Sparse files have the same problems as compressed files. Do you want to add up the logical or physical size?

    Cluster rounding
    Even for uncompressed non-sparse files, you may want to take into account the size of the disk blocks. A directory with a lot of small files takes up more space on disk than just the sum of the file sizes. Do you want to reflect this in your computations? And if you traversed across a reparse point, the cluster size may have changed along the way.
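
    If you decide you do want to account for it, here's a back-of-the-envelope sketch (again mine, not the article's): ask the volume for its cluster size with GetDiskFreeSpace and round each file's size up. The "C:\\" root in the usage comment is just an example.

    #include <windows.h>
    
    // Round a size up to the volume's cluster size; falls back to the raw
    // size if the volume can't be queried.
    ULONGLONG RoundToCluster(const wchar_t* volumeRoot, ULONGLONG size)
    {
        DWORD sectorsPerCluster, bytesPerSector, freeClusters, totalClusters;
        if (!GetDiskFreeSpaceW(volumeRoot, &sectorsPerCluster, &bytesPerSector,
                               &freeClusters, &totalClusters)) {
            return size;
        }
        ULONGLONG cluster = (ULONGLONG)sectorsPerCluster * bytesPerSector;
        return (size + cluster - 1) / cluster * cluster;
    }
    
    // Example: on a volume with 4KB clusters, RoundToCluster(L"C:\\", 100) is 4096.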

    Alternate data streams
    Alternate data streams are another place where a file can occupy disk space that is not reflected in its putative "size".
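
    If you want to count them, a minimal sketch (assuming Windows Vista or later, where FindFirstStreamW is available) looks something like this; it sums every stream of a file, not just the default one.

    #include <windows.h>
    
    // Adds up the default "::$DATA" stream plus any alternate ":name:$DATA" streams.
    ULONGLONG TotalStreamSize(const wchar_t* path)
    {
        WIN32_FIND_STREAM_DATA data;
        HANDLE h = FindFirstStreamW(path, FindStreamInfoStandard, &data, 0);
        if (h == INVALID_HANDLE_VALUE) return 0;
    
        ULONGLONG total = 0;
        do {
            total += data.StreamSize.QuadPart;
        } while (FindNextStreamW(h, &data));
        FindClose(h);
        return total;
    }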

    Bookkeeping overhead
    There is always bookkeeping overhead associated with file storage. In addition to the directory entry (or entries), space also needs to be allocated for the security information, as well as the information that keeps track of where the file's contents can be found. For a highly-fragmented file, this information can be rather extensive. Do you want to count that towards the size of the directory? If so, how?

    There is no single answer to all of the above questions. You have to consider each one, apply it to your situation, and decide which way you want to go.

    (And copying a directory tree is even scarier. What do you do with the ACLs? Do you copy them too? Do you preserve the creation date? It all depends on why you're copying the tree.)

  • The Old New Thing

    You can create an infinitely recursive directory tree

    • 40 Comments

    It is possible to create an infinitely recursive directory tree. This throws many recursive directory-traversal functions into disarray. Here's how you do it. (Note: Requires NTFS.)

    Create a directory in the root of your C: drive; call it C:\C, for lack of a more creative name. Right-click My Computer and select Manage, then click on the Disk Management snap-in.

    From the Disk Management snap-in, right-click the C drive and select "Change Drive Letter and Paths...".

    From the "Change Drive Letter and Paths for C:" dialog, click "Add", then where it says "Mount in the following empty NTFS folder", enter "C:\C". Click OK.

    Congratulations, you just created an infinitely recursive directory.

    C:\> dir
    
     Volume in drive has no label
     Volume Serial Number is A035-E01D
    
     Directory of C:\
    
    08/19/2001  08:43 PM                 0 AUTOEXEC.BAT
    12/23/2004  09:43 PM    <JUNCTION>     C
    05/05/2001  04:09 PM                 0 CONFIG.SYS
    12/16/2001  04:34 PM    <DIR>          Documents and Settings
    08/10/2004  12:00 AM    <DIR>          Program Files
    08/28/2004  01:08 PM    <DIR>          WINDOWS
                   2 File(s)              0 bytes
                   4 Dir(s)   2,602,899,968 bytes free
    
    C:\> dir C:\C
    
     Volume in drive has no label
     Volume Serial Number is A035-E01D
    
     Directory of C:\C
    
    08/19/2001  08:43 PM                 0 AUTOEXEC.BAT
    12/23/2004  09:43 PM    <JUNCTION>     C
    05/05/2001  04:09 PM                 0 CONFIG.SYS
    12/16/2001  04:34 PM    <DIR>          Documents and Settings
    08/10/2004  12:00 AM    <DIR>          Program Files
    08/28/2004  01:08 PM    <DIR>          WINDOWS
                   2 File(s)              0 bytes
                   4 Dir(s)   2,602,899,968 bytes free
    
    
    C:\> dir C:\C\C\C\C\C\C
    
     Volume in drive has no label
     Volume Serial Number is A035-E01D
    
     Directory of C:\C\C\C\C\C\C
    
    08/19/2001  08:43 PM                 0 AUTOEXEC.BAT
    12/23/2004  09:43 PM    <JUNCTION>     C
    05/05/2001  04:09 PM                 0 CONFIG.SYS
    12/16/2001  04:34 PM    <DIR>          Documents and Settings
    08/10/2004  12:00 AM    <DIR>          Program Files
    08/28/2004  01:08 PM    <DIR>          WINDOWS
                   2 File(s)              0 bytes
                   4 Dir(s)   2,602,899,968 bytes free
    

    Go ahead and add as many "\C"s as you like. You'll just get your own C drive back again.

    Okay, now that you've had your fun, go back to the "Change Drive Letter and Paths for C:" dialog and Remove the "C:\C" entry. Do this before you create some real havoc.

    Now imagine what would happen if you tried a recursive tree copy from that mysterious C:\C directory. Or if you ran a program that did some sort of recursive operation starting from C:\C, like, say, trying to add up the sizes of all the files in it.

    If you're writing such a program, you need to be aware of reparse points (that thing that shows up as <JUNCTION> in the directory listing). You can identify them because their file attributes include the FILE_ATTRIBUTE_REPARSE_POINT flag. Of course, what you do when you find one of these is up to you. I'm just warning you that these strange things exist and if you aren't careful, your program can go into an infinite loop.
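
    To make the warning concrete, here's a minimal sketch of a traversal that checks the attribute before recursing. (This is my sketch, not code from the article; what you do with a junction once you've spotted it is still your decision.)

    #include <windows.h>
    #include <wchar.h>
    #include <string>
    
    void Walk(const std::wstring& dir)
    {
        WIN32_FIND_DATAW fd;
        HANDLE h = FindFirstFileW((dir + L"\\*").c_str(), &fd);
        if (h == INVALID_HANDLE_VALUE) return;
        do {
            if (wcscmp(fd.cFileName, L".") == 0 || wcscmp(fd.cFileName, L"..") == 0)
                continue;
            bool isDir = (fd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY) != 0;
            bool isReparse = (fd.dwFileAttributes & FILE_ATTRIBUTE_REPARSE_POINT) != 0;
            // Recurse only into real directories; a junction like C:\C would
            // otherwise send the traversal around in circles.
            if (isDir && !isReparse) {
                Walk(dir + L"\\" + fd.cFileName);
            }
            // ... otherwise process the entry ...
        } while (FindNextFileW(h, &fd));
        FindClose(h);
    }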

  • The Old New Thing

    Alton Brown book tour 2005: I'm Just Here for More Food

    • 11 Comments

    Alton Brown, geek cooking hero and Bon Appetit Magazine Cooking Teacher of the Year 2004, will be spending January 2005 promoting his latest book, Food × Mixing + Heat = Baking (I'm Just Here for More Food), the sequel to his award-winning debut cookbook Food + Heat = Cooking (I'm Just Here for the Food). Check the schedule to see when/whether he'll be in your area.

  • The Old New Thing

    Why does the system convert TEMP to a short file name?

    • 22 Comments

    When you set environment variables with the System control panel, the TEMP and TMP variables are silently converted to their short file name equivalents (if possible). Why is that?

    For compatibility, of course.

    It is very common for batch files to assume that the paths referred to by the %TEMP% and %TMP% environment variables do not contain any embedded spaces. (Other programs may also make this assumption, but batch files are the most common place where you run into this problem.)

    I say "if possible" because you can disable short name generation, in which case there is no short name equivalent, and the path remains in its original long form.

    If you are crazy enough to disable short name generation and point your TEMP/TMP variables at a directory whose name contains spaces and has no short name, then you get to see what sorts of things stop working properly. Don't say I didn't warn you.

  • The Old New Thing

    How to open those plastic packages of electronics without injuring yourself

    • 57 Comments

    Small electronics nowadays come in those impossible-to-open plastic packages. A few weeks ago I tried to open one and managed not to slice my hand with the knife I was using. (Those who know me know that knives and I don't get along well.) Unfortunately, I failed to pay close attention to the sharp edges of the cut plastic and ended up cutting three of my fingers.

    The next day, I called the manufacturer's product support number and politely asked, "How do I open the package?"

    The support person recommended using a pair of very heavy scissors. (I tried scissors, but mine weren't heavy enough and couldn't cut through the thick plastic.) Cut across the top, then down the sides, being careful to avoid the sharp shards you're creating. (You might want to wear gloves.)

    If you bought someone a small electronics thingie, consider keeping a pair of heavy scissors on hand. That's my tip for the season.

  • The Old New Thing

    Do you need to clean up one-shot timers?

    • 12 Comments

    The CreateTimerQueueTimer function allows you to create one-shot timers by passing the WT_EXECUTEONLYONCE flag. The documentation says that you need to call the DeleteTimerQueueTimer function when you no longer need the timer.

    Why do you need to clean up one-shot timers?

    To answer this, I would like to introduce you to one of my favorite rhetorical questions when trying to puzzle out API design: "What would the world be like if this were true?"

    Imagine what the world would be like if you didn't need to clean up one-shot timers.

    Well, for one thing, it would make the behavior of the function confusing. The caller of the CreateTimerQueueTimer function would have to keep track of whether the timer was one-shot or not, in order to know whether the handle needed to be deleted.

    But far, far worse is that if one-shot timers were self-deleting, it would be impossible to use them correctly.

    Suppose you have an object that creates a one-shot timer, and you want to clean it up in your destructor if it hasn't fired yet. If one-shot timers were self-deleting, then it would be impossible to write this object.

    class Sample {
     HANDLE m_hTimer;
     Sample() : m_hTimer(NULL) { CreateTimerQueueTimer(&m_hTimer, ...); }
     ~Sample() { ... what to write here? ... }
    };
    

    You might say, "Well, I'll have my callback null out the m_hTimer variable. That way, the destructor will know that the timer has fired."

    Except that's a race condition.

    void Sample::Callback(void *context)
    {
      /// RACE WINDOW HERE
      ((Sample*)context)->m_hTimer = NULL;
      ...
    }
    

    If the callback is pre-empted during the race window and the object is destructed, and one-shot timers were self-deleting, then the object would attempt to use an invalid handle.

    This race window is uncloseable since the race happens even before you get a chance to execute a single line of code.

    So be glad that you have to delete handles to one-shot timers.
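
    And because one-shot timers are not self-deleting, the destructor in the earlier example can actually be written. Here's a minimal sketch (the due time and callback body are placeholders of my own, not part of this post): passing INVALID_HANDLE_VALUE as the completion event makes DeleteTimerQueueTimer wait for an in-flight callback to finish before the object goes away.

    #include <windows.h>
    
    class Sample {
    public:
        Sample() : m_hTimer(NULL) {
            CreateTimerQueueTimer(&m_hTimer, NULL, Callback, this,
                                  1000, 0, WT_EXECUTEONLYONCE);
        }
        ~Sample() {
            if (m_hTimer) {
                // Safe whether or not the timer has already fired; waits for
                // any callback that is currently running.
                DeleteTimerQueueTimer(NULL, m_hTimer, INVALID_HANDLE_VALUE);
            }
        }
    private:
        static void CALLBACK Callback(void* context, BOOLEAN /*fired*/) {
            Sample* self = (Sample*)context;
            // ... do the one-shot work against self ...
        }
        HANDLE m_hTimer;
    };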

  • The Old New Thing

    BOOL vs. VARIANT_BOOL vs. BOOLEAN vs. bool

    • 29 Comments

    Still more ways of saying the same thing. Why so many?

    Because each was invented by different people at different times to solve different problems.

    BOOL is the oldest one. Its definition is simply

    typedef int BOOL;
    

    The C programming language uses "int" as its de facto boolean type (it had no dedicated boolean type until C99), and Windows 1.0 was written back when C was the cool language for systems programming.

    Next came BOOLEAN.

    typedef BYTE  BOOLEAN;
    

    This type was introduced by the OS/2 NT team when they decided to write a new operating system from scratch. It lingers in Win32 in the places where the original NT design peeks through, like the security subsystem and interacting with drivers.

    Off to the side came VARIANT_BOOL.

    typedef short VARIANT_BOOL;
    #define VARIANT_TRUE ((VARIANT_BOOL)-1)
    #define VARIANT_FALSE ((VARIANT_BOOL)0)
    

    This was developed by the Visual Basic folks. Basic uses -1 to represent "true" and 0 to represent "false", and VARIANT_BOOL was designed to preserve this behavior.

    Common bug: When manipulating VARIANTs of type VT_BOOL, and you want to set a boolean value to "true", you must use VARIANT_TRUE. Many people mistakenly use TRUE or true, which are not the same thing as VARIANT_TRUE. You can cause problems with scripting languages if you get them confused. (For symmetry, you should also use VARIANT_FALSE instead of FALSE or false. All three have the same numerical value, however; consequently, a mistake when manipulating "false" values is not fatal.)
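
    A tiny sketch of the correct way to do it (the helper name is mine):

    #include <windows.h>
    #include <oleauto.h>
    
    void SetVariantTrue(VARIANT* pv)
    {
        VariantInit(pv);
        pv->vt = VT_BOOL;
        pv->boolVal = VARIANT_TRUE;   // -1; this is what scripting languages expect
        // pv->boolVal = TRUE;        // wrong: TRUE is 1, which is not VARIANT_TRUE
    }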

    Newest on the scene is bool, which is a C++ data type that has the value true or false. You won't see this used much (if at all) in Win32 because Win32 tries to remain C-compatible.

    (Note that C-compatible isn't the same as C-friendly. Although you can do COM from C, it isn't fun.)

  • The Old New Thing

    Sometimes people don't like it when you enforce a standard

    • 50 Comments

    Your average computer user wouldn't recognize a standards document if they were hit in the face with it.

    I'm reminded of a beta bug report back in 1996 regarding how Outlook Express (then called "Microsoft Internet Mail and News") handled percent signs in email addresses (I think). The way Outlook Express did it was standards-conformant, and I sent the relevant portion of the RFC to the person who reported the bug. Here's what I got back:

    I have never read the RFC's (most people, I'm sure, haven't) but I know when something WORKS in one mail reader (Netscape) and DOESN'T WORK in another (MSIMN).

    The problem, restated to comply with your RFC:

    MS Internet Mail and News DO NOT HANDLE PERCENT SIGNS like the RFC says.

    That first sentence pretty much captures the reaction most of the world has to standards documents: They are meaningless. If Outlook Express doesn't behave the same way as Netscape, then it's a bug in Outlook Express, regardless of what the standards documents say.

    There are many "strangenesses" in the way Internet Explorer handles certain aspects of HTML when you don't run it in strict mode. For example, did you notice that the font you set via CSS for your BODY tag doesn't apply to tables? Or that invoking the submit method on a form does not fire the onsubmit event? That's because Netscape didn't do it either, and Internet Explorer had to be bug-for-bug compatible with Netscape because web sites relied on this behavior.

    The last paragraph in the response is particularly amusing. The person is using the word "RFC" as a magic word, not knowing what it means. Apparently if you want to say that something doesn't work as you expect, you say that it doesn't conform to the RFC. Whether your expectation agrees with the RFC is irrelevant. (By his own admission, the person who filed the bug didn't even read the RFC.)

  • The Old New Thing

    Don't save anything you can recalculate

    • 15 Comments

    Nowadays, a major barrier to performance for many classes of programs is paging. We saw earlier this year that paging can kill a server. Today, another example of how performance became tied to paging.

    The principle is "Don't save anything you can recalculate." This, of course, seems counterintuitive: shouldn't you save the answer so you don't have to recalculate it?

    The answer is, "It depends."

    If recalculating the answer isn't very expensive and has good data locality, then you may be better off recalculating it than saving it, especially if saving it reduces locality. For example, if the result is stored in a separate object, you now have to touch a second object—risking a page fault—to get the saved answer.

    Last time, we saw how Windows 95 applied this principle so that rebasing a DLL didn't thrash your machine. I'm told that the Access team used this principle to reap significant performance gains. Instead of caching results, they just threw them away and recalculated them the next time they were needed.

    Whether this technique works for you is hard to predict. If your program is processor-bound, then caching computations is probably a good idea. But if your program is memory-bound, then you may be better off getting rid of the cache, since the cache is just creating more memory pressure.

  • The Old New Thing

    How did Windows 95 rebase DLLs?

    • 23 Comments

    Windows 95 handled DLL-rebasing very differently from Windows NT.

    When Windows NT detects that a DLL needs to be loaded at an address different from its preferred load address, it maps the entire DLL as copy-on-write, fixes it up (causing all pages that contain fixups to be dumped into the page file), then restores the read-only/read-write state to the pages. (Larry Osterman went into greater detail on this subject earlier this year.)

    Windows 95, on the other hand, rebases the DLL incrementally. This is another concession to Windows 95's very tight memory requirements. Remember, it had to run on a 4MB machine. If it fixed up DLLs the way Windows NT did, then loading a 4MB DLL and fixing it up would consume all the memory on the machine, pushing out all the memory that was actually worth keeping!

    When a DLL needed to be rebased, Windows 95 would merely make a note of the DLL's new base address, but wouldn't do much else. The real work happened when the pages of the DLL ultimately got swapped in. The raw page was read in from the disk, then the fix-ups were applied to it on the fly, thereby relocating it. The fixed-up page was then mapped into the process's address space and the program was allowed to continue.

    This method has the advantage that the cost of fixing up a page is not paid until the page is actually needed, which can be a significant savings for large DLLs of mostly-dead code. Furthermore, when a fixed-up page needed to be swapped out, it was merely discarded, because the fix-ups could just be applied to the raw page again.

    And there you have it, demand-paging rebased DLLs instead of fixing up the entire DLL at load time. What could possibly go wrong?

    Hint: It's a problem that is peculiar to the x86.

    The problem is fix-ups that straddle page boundaries. This happens only on the x86 because the x86 architecture is the weirdo, with variable-length instructions that can start at any address. If a page contains a fix-up that extends partially off the start of the page, you cannot apply it accurately until you know whether or not the part of the fix-up you can't see generated a carry. If it did, then you have to add one to your partial fix-up.

    To record this information, the memory manager associates a flag with each page of a relocated DLL that indicates whether fixing up that page produced a carry off its end. This flag can have one of three states:

    • Yes, there is a carry off the end.
    • No, there is no carry off the end.
    • I don't know whether there is a carry off the end.

    To fix up a page that contains a fix-up that extends partially off the start of the page, you check the flag for the previous page. If the flag says "Yes", then add one to your fix-up. If the flag says "No", then do not add one.

    But what if the flag says "I don't know"?

    If you don't know, then you have to go find out. Fault in the previous page and fix it up. As part of the computations for that fix-up, the flag will get set to indicate whether there is a carry off the end. Once the previous page has been fixed up, you can check the flag (which will no longer say "I don't know"), and that will tell you whether or not to add one to the current page.
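
    To make the arithmetic concrete, here's a sketch of the carry bookkeeping. This is purely illustrative, not the Windows 95 code, and the helper names are made up. A 32-bit fix-up adds a delta to a little-endian value; if the low bytes live on the previous page, fixing them up tells you whether a carry propagates into the bytes on the current page.

    #include <stdint.h>
    
    // Fix up the low bytes of a straddling fix-up (they live at the end of the
    // previous page) and report whether a carry propagates into the next page.
    // lowCount is how many of the value's four bytes are on the previous page.
    bool FixUpLowPart(uint8_t* lowBytes, int lowCount, uint32_t delta)
    {
        uint32_t value = 0;
        for (int i = 0; i < lowCount; i++) value |= (uint32_t)lowBytes[i] << (8 * i);
        uint64_t sum = (uint64_t)value + (delta & ((1ull << (8 * lowCount)) - 1));
        for (int i = 0; i < lowCount; i++) lowBytes[i] = (uint8_t)(sum >> (8 * i));
        return (sum >> (8 * lowCount)) != 0;   // the "Yes"/"No" flag for this page
    }
    
    // Fix up the high bytes at the start of the current page once the previous
    // page's carry flag is known ("Yes" adds one, "No" doesn't).
    void FixUpHighPart(uint8_t* highBytes, int highCount, int lowCount,
                       uint32_t delta, bool carry)
    {
        uint32_t value = 0;
        for (int i = 0; i < highCount; i++) value |= (uint32_t)highBytes[i] << (8 * i);
        uint32_t sum = value + (delta >> (8 * lowCount)) + (carry ? 1 : 0);
        for (int i = 0; i < highCount; i++) highBytes[i] = (uint8_t)(sum >> (8 * i));
    }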

    And there you have it, demand-paging rebased DLLs instead of fixing up the entire DLL at load time, even in the presence of fix-ups that straddle page boundaries. What could possibly go wrong?

    Hint: What goes wrong with recursion?

    The problem is that the previous page might itself have a fix-up that straddled a page boundary at its start, and the flag for the page two pages back might be in the "I don't know" state. Now you have to fault in and fix up a third page.

    Fortunately, in practice this doesn't go very deep; three pages of chained fix-ups was the record.

    (Of course, another way to stop the recursion is to do only a partial fix-up of the previous page, applying only the straddling fix-up to see whether there is a carry out and not attempting to fix up the rest. But Windows 95 went ahead and fixed up the rest of the page because it figured, hey, I paid for this page, I may as well use it.)

    What was my point here? I don't think I have one. It was just a historical tidbit that I hoped somebody might find interesting.
