Computing the size of a directory is more than just adding file sizes
One might think that computing the size of a directory would be
a simple matter of adding up the sizes of all the files in it.
Oh, if only it were that simple.
There are many things that make computing the size of a directory
difficult, some of which throw into doubt the very existence
of the concept "size of a directory".
- Reparse points
-
We mentioned this last time.
Do you want to recurse into reparse points
when you are computing the size of a directory?
It depends why you're computing the
directory size.
If you're computing the size in order to show the user how much disk space
they will gain by deleting the directory, then you do or don't,
depending on how you're going to delete the reparse point.
If you're computing the size in preparation for copying,
then you probably do. Or maybe you don't - should the copy merely
copy the reparse point instead of tunneling through it?
What do you do if the user doesn't have permission to create
reparse points? Or if the destination doesn't support reparse points?
Or if the user is creating a copy because they are making a backup?
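To make the recurse-or-don't choice concrete, here is a minimal sketch in Python. It is not the one true way to do it: the name tree_size and its follow_links parameter are made up for illustration, and symbolic links stand in for reparse points in general (other kinds of reparse point, such as mount points, would need their own check).

```python
import os

def tree_size(root, follow_links=False):
    """Sum the logical sizes of the files under root.

    follow_links=False: anything behind a symbolic link is not counted.
    follow_links=True:  links are tunneled through and their targets counted.
    """
    total = 0
    for entry in os.scandir(root):
        if entry.is_symlink() and not follow_links:
            continue  # count nothing for the link itself or its target
        if entry.is_dir(follow_symlinks=follow_links):
            total += tree_size(entry.path, follow_links)
        elif entry.is_file(follow_symlinks=follow_links):
            total += entry.stat(follow_symlinks=follow_links).st_size
    return total
```

Note that following links can recurse forever if a link points back up the tree. The sketch does not guard against that, which is one more thing you would have to decide how to handle.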
- Hard links
-
Hard links are multiple directory entries for the same file.
If you're calculating the size of a directory
and you find a hard link, do you count the file at its full size?
Or do you say that each directory entry for a hard link carries
a fraction of the "weight" of the file? (So if a file has two hard
links, then each entry counts for half the file size.)
Dividing the "weight" of the file among its hard links avoids
double-counting (or higher), so that when all the hard links are
found, the file's total size is correctly accounted for.
And it represents the concept that all the hard links to a file
"share the cost" of the resources the file consumes.
But what if you don't find all the hard links?
Is it correct that the file was undercounted?
[Minor typo fixed, 12pm]
If you're copying a file and you discover that it has multiple
hard links, what do you do? Do you break the links in the copy?
Do you attempt to reconstruct them? What if the destination doesn't
support hard links?
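Here is a rough sketch of the two counting policies, assuming Python and assuming the platform reports a link count and a file identity through os.stat (st_nlink, st_dev, st_ino). The function names weighted_size and unique_size are invented for this example.

```python
import os
import stat

def _regular_files(root):
    """Yield stat results for regular files under root, not following links."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.stat(os.path.join(dirpath, name), follow_symlinks=False)
            if stat.S_ISREG(st.st_mode):
                yield st

def weighted_size(root):
    """Each directory entry carries size / link_count of its file."""
    return sum(st.st_size / st.st_nlink for st in _regular_files(root))

def unique_size(root):
    """Each file is counted once at full size, however many entries we find."""
    seen = set()
    total = 0
    for st in _regular_files(root):
        key = (st.st_dev, st.st_ino)  # identifies the file, not the entry
        if key not in seen:
            seen.add(key)
            total += st.st_size
    return total
```

For a 120-byte file with three hard links, two of them inside the directory being measured, weighted_size reports 80 and unique_size reports 120. Neither number is wrong; they answer different questions.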
- Compressed files
-
By this I mean filesystem compression rather than
external compression formats like ZIP.
When adding up the size of the files in a directory, do you
add up the logical size or the physical size? If you're
computing the size in preparation for copying, then you probably
want the logical size, but if you're computing to see how much
disk space would be freed up by deleting it, then you probably
want physical size.
But if you're computing for copying and the copy destination
supports compression, do you want to use the physical size
after all? Now you're assuming that the source and destination
compression algorithms are comparable.
- Sparse files
-
Sparse files have the same problems as compressed files.
Do you want to add up the logical or physical size?
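For both compressed and sparse files, the two numbers can be obtained separately. The sketch below assumes Python on a POSIX-style system, where os.stat reports st_blocks in 512-byte units. That field is not available on Windows, where you would instead ask something like the Win32 function GetCompressedFileSize for the physical size. The helper name logical_and_physical is made up for this example.

```python
import os

def logical_and_physical(path):
    """Return (logical_size, physical_size) in bytes.

    physical_size is None when the platform does not report allocated blocks.
    """
    st = os.stat(path)
    logical = st.st_size
    blocks = getattr(st, "st_blocks", None)  # absent on Windows
    physical = blocks * 512 if blocks is not None else None
    return logical, physical
```

On a filesystem that supports sparse files, a file created by seeking a megabyte in and writing a single byte will typically show a logical size of just over a megabyte and a physical size of only a few kilobytes.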
- Cluster rounding
-
Even for uncompressed non-sparse files, you may want to take
into account the size of the disk blocks. A directory with
a lot of small files takes up more space on disk than just
the sum of the file sizes. Do you want to reflect this in
your computations?
If you traversed across a reparse point, the cluster size may
have changed as well.
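The rounding itself is simple arithmetic. Here is a small sketch, assuming a 4096-byte cluster (a common size, but an assumption here; you would have to query the real value for each volume). The name size_on_disk is invented for the example, and the sketch ignores filesystems that store very small files inline in their own bookkeeping.

```python
def size_on_disk(file_size, cluster_size=4096):
    """Round a file size up to a whole number of clusters."""
    clusters = -(-file_size // cluster_size)  # ceiling division
    return clusters * cluster_size

# A thousand 100-byte files: 100,000 bytes of data...
logical_total = sum(100 for _ in range(1000))
# ...but each file occupies a full cluster.
physical_total = sum(size_on_disk(100) for _ in range(1000))
```

Under these assumptions, logical_total is 100,000 bytes while physical_total is 4,096,000 bytes, roughly forty times larger.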
- Alternate data streams
-
Alternate data streams are another place where a file can
occupy disk space that is not reflected in its putative "size".
- Bookkeeping overhead
-
There is always bookkeeping overhead associated with file storage.
In addition to the directory entry (or entries), space also
needs to be allocated for the security information, as well
as the information that keeps track of where the file's
contents can be found. For a highly fragmented file, this
information can be rather extensive.
Do you want to count that towards the size of the directory?
If so, how?
There is no single answer to all of the above questions.
You have to consider each one, apply it to your situation,
and decide which way you want to go.
(And copying a directory tree is even scarier.
What do you do with the ACLs? Do you copy them too?
Do you preserve the creation date?
It all depends on why you're copying the tree.)