Larry Osterman's WebLog

Confessions of an Old Fogey
Why does Windows still place so much importance on filenames?


Earlier today, Adrian Kingsley-Hughes posted a rant (his word, not mine) about the fact that Windows still relies on text filenames.

The title says it all really. Why is it that Windows still place so much importance on filenames.

Take the following example - sorting out digital snaps. These are usually automatically given daft filenames such as IMG00032.JPG at the time they are stored by the camera. In an ideal world you’d only ever have one IMG00032.JPG on your entire system, but the world is far from perfect. Your camera might decide to restart its numbering system, or you might have two cameras using the same naming format. What happens then?

I guess I’m confused.  I could see a *very* strong argument against Windows dependency on file extensions, but I’m totally mystified about why having filenames is such a problem.

At some level, Adrian’s absolutely right – it IS possible to have multiple files on the hard disk named “recipe.txt”.  And that’s bad.  But is it the fault of Windows for allowing multiple files to have colliding names? Or is it the fault of the user for choosing poor names?  Maybe it’s a bit of both.

What would a better system look like?  Well, Adrian gives an example of what he’d like to see:

Why? Why is the filename the deciding factor? Why not something more unique? Something like a checksum? This way the operating system could decide is two files really are identical or not, and replace the file if it’s a copy, or create a copy if they are different. This would save time, and dramatically reduce the likelihood of data loss through overwriting.

But how would that system work?  What if we did just that.  Then you wouldn’t have two files named recipe.txt (which is good).

Unfortunately that solution introduces a new problem: You still have two files.  One named “2B1015DB-30CA-409E-9B07-234A209622B6” and the other named “5F5431E8-FF7C-45D4-9A2B-B30A9D9A791B”. It’s certainly true that those two files are uniquely named and you can always tell them apart.  But you’ve also lost a critical piece of information: the fact that they both contain recipes.

That’s the information that the filename conveys.  It’s human specific data that describes the contents of the file.  If we were to go with unique monikers, we’d lose that critical information.

But I don’t actually think that the dependency on filenames is really what’s annoying him.  It’s just a symptom of a different problem. 

Adrian’s rant is a perfect example of jumping to a solution without first understanding the problem.  It also shows why it’s so hard for Windows UI designers to figure out how to solve customer problems: this example is a customer complaint asking that we remove filenames from Windows.  Obviously something happened to annoy Adrian that was related to filenames, but the question is: What?  He doesn’t describe the problem, but we can hazard a guess about what happened from his text:

Here’s an example. I might have two files in separate folders called recipe.txt, but one is a recipe for a pumpkin pie, and the other for apple pie. OK, it was dumb of me to give the files the same name, but it’s in situations like this that the OS should be helping me, not hindering me and making me pay for my stupidity. After all, Windows knows, without asking me, that the files, even if they are the same size and created at exactly the same time, are different. Why does Windows need to ask me what to do? Sure, it doesn’t solve all problems, but it’s a far better solution than clinging to the notion of filenames as being the best metric by which to judge whether files are identical or not.

The key information here is the question: “Why does Windows need to ask me what to do?”  My guess is that he had two “recipe.txt” files in different directories and copied a recipe.txt from one directory to the other.  When you do that, Windows presents you with the following dialog:

[Image: Windows Copy Dialog]

My suspicion is that he’s annoyed because Windows is forcing him to make a choice about what to do when there’s a conflict.  The problem is that there’s no one answer that works for all users and all scenarios.  Even in my day-to-day work I’ve had reason to choose all three options, depending on what’s going on.  From the rant, it appears that Adrian would like it to choose “Copy, but keep both files” by default.  But what happens if you really *do* want to replace the old recipe.txt with a new version?  Maybe you edited the file offline on your laptop and you’re bringing the new copy back to your desktop machine.  Or maybe you’re copying a bunch of files from one drive to another (I do this regularly when I sync my music collection between home and work).  In that case, you want to ignore the existing copy of the file (or maybe you want to copy the file over to ensure that the metadata is in sync).

Windows can’t figure out what the right answer is here – so it prompts the user for advice about what to do.
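A rough way to see where the prompt becomes unavoidable is the Python sketch below. It is illustrative only, not how Windows implements the copy dialog, and `file_digest` and `suggest_action` are hypothetical helpers. Comparing contents settles the case where the two files are byte-for-byte identical, but once they differ the code cannot tell "newer version of the same recipe" from "a different recipe", so it hands the decision back to the user.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 16) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def suggest_action(source: Path, target: Path) -> str:
    """Hypothetical helper: return 'copy', 'skip', or 'ask'."""
    if not target.exists():
        return "copy"  # no name collision, nothing to decide
    same_size = source.stat().st_size == target.stat().st_size
    if same_size and file_digest(source) == file_digest(target):
        return "skip"  # identical bytes: copying would change nothing
    # Contents differ. Replace, keep both, or skip? Only the user knows.
    return "ask"
```

The size check is only a cheap shortcut that avoids hashing files that cannot be identical.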

Btw, Adrian’s answer to his rhetorical question is “the reason is legacy”.  Actually that’s not quite it.  The reason is that filenames provide valuable information for the user that would be lost if we went away from them.

Next time I want to spend a bit of time brainstorming about ways to solve his problem (assuming that the problem I identified is the real problem – it might not be). 



PS: I’m also not sure why he picked on Windows here.  Every operating system I know of has similar dependencies on filenames.  I think that’s another indication that he’s jumping to a solution without first describing the problem.

  • @voo

    > And I don't see how it doesn't give me all the information I need to make the right decision.

    By this I meant that if I have two files called recipe.txt I might need to see the contents of both to make the right decision. Maybe they are entirely different recipes and I want both. Maybe they've both been edited since they were copied and the last modified date and size aren't enough to pick one over the other. And I know I have in the past started such an operation, chosen to replace or keep a few files and then gotten to a point where I realise there's no good answer for some file and I really want to roll-back the operation, but by then it's too late. (I don't *think* it's undo-able... or is it?)

  • Think of this scheme:

    1) a file is identified by a GUID based on its content, a file name merely makes it resident in the filesystem (like the UNIX inode system)

    2) an (optional) file name merely ties together the different GUIDs and keeps a reference to (some number of) old versions

    3) some use cases may not require file names at all, but instead rely completely on the metadata to identify the right file GUID.

    You would get version history for free, no more issues with "this file is in use" (the old version can be kept open for reading as long as an app wants), a moved or renamed file can always be found if the last GUID is known, etc.

    Note that this makes file moving and renames absolutely painless. No need to ask for permission with a modal dialog: all operations can be reverted if needed.

    Examples where file names are redundant may include the component DLLs of an application, which may not need to reside in the file system at all. Similarly, shared DLLs could be identified with their metadata only, etc.

    My guess is that present-day computers have more than enough resources to do the indexing required for these operations. If Microsoft is interested in implementing this then contact me :-)
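The scheme in the comment above can be sketched as a toy in-memory store. This is a minimal illustration under stated assumptions, not a filesystem design: a SHA-256 digest stands in for the "GUID based on its content", and the `ContentStore` class and its method names are hypothetical.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Optional


class ContentStore:
    """Toy store: blobs are keyed by content hash, names point at version lists."""

    def __init__(self) -> None:
        self.blobs: Dict[str, bytes] = {}
        self.versions: Dict[str, List[str]] = defaultdict(list)

    def put(self, data: bytes, name: Optional[str] = None) -> str:
        """Store data; optionally record it as the newest version of a name."""
        content_id = hashlib.sha256(data).hexdigest()
        self.blobs[content_id] = data  # identical content is stored only once
        if name is not None:
            history = self.versions[name]
            if not history or history[-1] != content_id:
                history.append(content_id)  # old versions stay reachable
        return content_id

    def get(self, content_id: str) -> bytes:
        """Fetch by identifier alone; no name is needed (point 3 of the scheme)."""
        return self.blobs[content_id]

    def latest(self, name: str) -> bytes:
        return self.blobs[self.versions[name][-1]]

    def history(self, name: str) -> List[str]:
        return list(self.versions[name])
```

Saving under an existing name appends to its history instead of overwriting, which is where the "version history for free" comes from. The sketch never deletes anything, so it sidesteps the question of when old versions may be reclaimed.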

  • This discussion makes me think of WinFS.

  • I suppose two niceties could be to detect when the file contents are identical, and maybe tie in the preview pane so one could examine both files.

  • The rant in question may have more to do with a pressing deadline and a need for copy than a real grievance. It sounds more like a polemic of convenience.

  • @Larry, I think the problem with the "copying an amended file from a laptop" scenario is that there really shouldn't be a checksum, but a GUID.  When I copy the file to my laptop, it uses the GUID, when I copy it back, the system notes that (a) it was copied to the laptop on date X, (b) the file on the master system hasn't been changed since date X and (c) these files have the same GUID and therefore does an overwrite.

    The problem here is that you need to have (a) a mechanism to do intentional forking of files (File | Save As ?) and (b) a real conflict resolution mechanism for merging back two files that have both been amended since they were separated.

    Of course, now you're headed towards having a DVCS instead of a filesystem.  But maybe that's what would be appropriate for document files if they weren't binary blobs that the DVCS can't see into and therefore can't do merging properly.

    Certainly, using a VCS or a DVCS for code and a DMS for documents at work has made me regard filesystems as being a bit primitive for user-facing documents.  You'll notice that Google Docs doesn't really emulate a file system.
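The overwrite rule described in this comment amounts to a three-way comparison. A minimal sketch follows, assuming content hashes are used in place of the "changed since date X" checks; `SyncAction` and `decide` are hypothetical names, and the `CONFLICT` case is exactly where a real merge mechanism would be needed.

```python
from enum import Enum


class SyncAction(Enum):
    NOTHING = "nothing to do"
    TAKE_INCOMING = "overwrite with incoming copy"
    KEEP_EXISTING = "keep existing copy"
    CONFLICT = "both changed, needs a merge or a user decision"


def decide(base: str, existing: str, incoming: str) -> SyncAction:
    """Compare three content hashes of the same logical file.

    base:     the content when the file was copied out (e.g. to the laptop)
    existing: the content on the master system now
    incoming: the content being copied back
    """
    if existing == incoming:
        return SyncAction.NOTHING
    if existing == base:
        return SyncAction.TAKE_INCOMING  # only the laptop copy changed
    if incoming == base:
        return SyncAction.KEEP_EXISTING  # only the master copy changed
    return SyncAction.CONFLICT
```

Only the last branch requires asking anyone, but reaching the other three silently depends on the system having recorded `base` at the moment of the original copy.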

  • @Evan, yes what I was thinking is that each file gets an ID when they're created and that is how they're manipulated, this ID being independent from the identity as perceived by the user.

    Another limitation of filenames is that you can only use "legal" characters in them. I can't name my file "My 1\2 work done" because it so happens that the OS decided eons ago that the '\' character is the "path separator", whatever that means in a GUI world. Try to explain that with a straight face to somebody who is not very computer savvy.

  • It does sound like a useful idea to have a filesystem that inherently knows whether two given files have the same content (eg. by having an internal hash of the file contents, which gets invalidated [but not necessarily recalculated, for performance reasons] on any change to the file); that would enable the "automatically skip files that have previously been copied" scenario, at least if they have the same name.

    Others have proposed having something other than the name being the core identifier, but I'm not confident that would work well.  Perhaps you're copying a document to somewhere else to make changes to it you *don't* want to propagate back to the original document; you'd need to have some way to say "this was cloned from the original but is independent" vs. "this is an updated version of the original", and you don't always know which you want in advance.

    Besides, in my day-to-day job I almost always use copy-and-replace (and I tend to give things sensible names in the first place, so collisions only happen when they really should be replaced).

  • > "In many cases, users don't care about the filename. When dealing with the photos on my camera, the camera automatically fills in the date"

    I actually really hate the random identifiers that cameras apply but understand at the moment there isn't a better solution.  I *do* care about the file name, but it is overly burdensome to apply it.  In my ideal world cameras would have much better built-in tagging capability (other than date, and occasionally GPS coordinates, which are utterly un-user-friendly) and allow me to specify a file name based on the metadata in the same way various utilities let me rename MP3s based on tag info.  At the same time, while Windows is getting better at using metadata it would be nice if it could go further - functionality like Windows Live Photo Gallery really should be built into the shell.

    > "When you copy a folder with a file named "recipe.txt" over a folder, which already contains a file named "recipe.txt" it would be better if Windows would know, if these files are identical."

    In a perfect world sure, but that comes with some real-world tradeoffs that are huge.  A 5K text file can be diffed in a matter of milliseconds (especially if you are only checksum diffing) but even that becomes very complicated if you want to start making intelligent decisions based on things like trailing whitespace, differing Unicode quote marks, etc. Expand that out to a 200 meg PowerPoint which can have all sorts of very subtle differences (Office properties metadata, minute differences in specifying formatting between Office versions, document change metadata, etc).  Comparing binary identity is easy, but comparing context identity is incredibly hard and not at all cheap to process.

  • I wonder if Adrian has children.

    I wonder if he gave them names.

  • I don't understand how you're supposed to use a system where filenames don't have to be unique -- how do you know which file you want to open when they have the same name? If I want to open a recipe, I need to be able to figure out which recipe to open. You could argue that a thumbnail of the document would show me which one I want, but I would imagine that most recipes would look similar in thumbnail form. Can you imagine a cookbook where the recipes weren't uniquely named? It would be very difficult to use. You'd have to use meaningless metadata like page number to find a specific recipe.

    Of course you could argue that the user should be allowed to create as many files of the same name as they want, but then most users would end up with hundreds of files all called "Untitled" because it's easier. Then people would be complaining about how hard it gets to use computers because filenames are so often the same and it's hard to know which "recipe" they're looking for. By forcing unique names at file creation time, the computer is saving users lots of pain later on.

  • An enterprising person could prototype such a thing today. Ignore the actual filenames, and use the file id (MFT record number). Stuff your non-unique filename in a named data stream.   Write yourself a file browser that showed the non-unique name.

    Mind you, I'm not so sure I'd want to use it.
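A very small version of that prototype can be sketched in Python. The assumptions here are significant: it uses the `(st_dev, st_ino)` pair from `os.stat` (the inode number on Unix, the file index on Windows) as a stand-in for the MFT record number, and it keeps the non-unique display names in an in-memory table, not in a named data stream. `file_id` and `Catalog` are hypothetical names.

```python
import os
from pathlib import Path
from typing import Dict, Tuple

FileId = Tuple[int, int]


def file_id(path: Path) -> FileId:
    """Identity that does not depend on the file's name: (device, file index)."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino)


class Catalog:
    """Hypothetical side table mapping file identities to display names."""

    def __init__(self) -> None:
        self.labels: Dict[FileId, str] = {}

    def label(self, path: Path, display_name: str) -> None:
        # Display names are free text and do not have to be unique.
        self.labels[file_id(path)] = display_name

    def display_name(self, path: Path) -> str:
        return self.labels[file_id(path)]
```

Two files in the same folder can both be labelled "recipe.txt", and a label follows its file across a rename on the same volume. It would not survive a copy to another volume, which is one reason to be cautious about the idea.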

  • Maybe there should be more options in the file copy dialog. I just looked at how Total Commander handles this, and there are at least two useful options: "Replace all older" and "Replace all shorter". These would probably solve many situations where people synchronize files. A visual feedback option might also be OK: there are preview handlers in new Windows, so why not display a part of the file in the copy dialog? XnView will present you an option like this if you try to overwrite files with the same name.

    "After all, Windows knows, without asking me, that the files, even if they are the same size and created at exactly the same time, are different." - he seems not to see the many problems here. When are they different? When every letter is different? Or one letter is different? What about binary files? It would result in totally unhelpful and unpredictable choices (for a user who hasn't read the, probably several pages long, documentation of this *feature*).
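Bulk policies like the two mentioned at the start of this comment are easy to state in code. The sketch below is a guess at their meaning and not Total Commander's actual implementation: it assumes "older" and "shorter" describe the existing target file, and `should_replace` is a hypothetical helper.

```python
from pathlib import Path


def should_replace(source: Path, target: Path, policy: str) -> bool:
    """Decide whether copying source over target is allowed under a bulk policy.

    policy 'older':   replace only if the target was modified before the source
    policy 'shorter': replace only if the target is smaller than the source
    """
    if not target.exists():
        return True  # nothing to replace, a plain copy
    src, dst = source.stat(), target.stat()
    if policy == "older":
        return dst.st_mtime < src.st_mtime
    if policy == "shorter":
        return dst.st_size < src.st_size
    raise ValueError(f"unknown policy: {policy!r}")
```

Both rules rely on metadata only, so they are fast and predictable, but neither can tell that a target file is older yet holds changes the user still wants.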

  • The thing that really annoys me when doing a synchronisation type exercise is that Windows won't let you copy the files across (even if you are going to be ignoring or replacing most of them) if there isn't enough space for the whole lot of data you are moving.

    Say I have 400G of data on a 500G volume. If I copy 410G of data, of which only 10G is new, and select skip when it asks about overwriting, the copy is still aborted due to insufficient disk space.
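The arithmetic behind this complaint can be made explicit. The sketch below is a simplified model, not how Windows computes its space check: `bytes_needed` is a hypothetical helper that counts only the bytes a copy would actually add once the conflict choice is known, and it assumes a replaced file reuses the space of the file it overwrites.

```python
from typing import Dict


def bytes_needed(source: Dict[str, int], target: Dict[str, int], on_conflict: str) -> int:
    """Extra bytes required on the target volume for a copy.

    source and target map relative file names to sizes in bytes.
    on_conflict is 'skip' or 'replace', applied to every name collision.
    """
    if on_conflict not in ("skip", "replace"):
        raise ValueError(f"unknown conflict choice: {on_conflict!r}")
    needed = 0
    for name, size in source.items():
        if name not in target:
            needed += size  # genuinely new file
        elif on_conflict == "replace":
            # Count only growth; space freed by shrinking files is ignored.
            needed += max(0, size - target[name])
    return needed
```

With the numbers from the comment (400G already on the target, 410G in the source, 10G of it new) and `on_conflict="skip"`, the function returns 10G, which fits in the 100G that is free. The catch is that the total is known only once the conflict choice is, and the dialog asks for that after the copy has started.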

  • I'd like to add a thought. Perhaps you should have a look at a situation where this issue regularly occurs: When I take pictures with my digital camera, the system starts with a filename like IMG000001.jpg. It continues to count till infinity as long (!!!) as I don't change the card on which pics are stored. If I do, it starts again at IMG000001.jpg. When I come back from holidays and I store the pics on my hard drive, I have to store the pics from the second card in a separate folder so as not to lose half of my pics. As far as I know (and I am no expert), other OSs will recognise that these are indeed different files and store them without regard to the file name...
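For the two-memory-card scenario, the "Copy, but keep both files" choice mentioned in the post boils down to picking a free name. A minimal sketch, similar in spirit to that option but not a reproduction of what Windows does; `unique_target` is a hypothetical helper and the " (2)" suffix format is an assumption.

```python
from pathlib import Path


def unique_target(target: Path) -> Path:
    """Return target if it is free, otherwise the first free 'name (n).ext'."""
    if not target.exists():
        return target
    n = 2
    while True:
        candidate = target.with_name(f"{target.stem} ({n}){target.suffix}")
        if not candidate.exists():
            return candidate
        n += 1
```

Applied to every file from the second card, this keeps both sets of pictures in one folder, at the cost of names like "IMG000001 (2).jpg" that say even less about the content than the originals did.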
