File URIs in Windows

IEBlog

Windows Internet Explorer Engineering Team Blog

Invalid file URIs are among the most common illegal URIs that we were forced to accommodate in IE7. As I mentioned in a previous blog post, there is much confusion over how to handle file URIs. The standard for the file scheme defines the syntax of file URIs, but it gives no specific instructions on how to convert a file system path for a specific operating system into a file URI; that conversion is left up to the implementers. In this post, I describe the conversion we use in IE and list best practices to follow when constructing or manipulating file URIs.

Proper Syntax

For the UNC Windows file path
     \\laptop\My Documents\FileSchemeURIs.doc

The corresponding valid file URI in Windows is the following:
     file://laptop/My%20Documents/FileSchemeURIs.doc

For the local Windows file path
     C:\Documents and Settings\davris\FileSchemeURIs.doc

The corresponding valid file URI in Windows is:
     file:///C:/Documents%20and%20Settings/davris/FileSchemeURIs.doc

The important factors here are the use of percent-encoding and the number of slashes following the ‘file:’ scheme name.

To avoid ambiguity, and for your Windows file paths to be interpreted correctly, characters that are important to URI parsing and that are also allowed in Windows file paths must be percent-encoded. This includes ‘#’ and ‘%’. Characters that aren’t allowed in URIs but are allowed in Windows file paths should also be percent-encoded. This includes the space character, ‘{’, ‘}’, ‘`’, ‘^’ and all control characters. Note, for instance, that the spaces in the example URIs above have been percent-encoded to ‘%20’. See the latest URI standard for the full list of characters that aren’t allowed in URIs.

The number of slashes following the ‘file:’ is dictated by the same rules as other well-known schemes like http and ftp. The text following two slashes is the hostname. In the case of the UNC Windows file path, the hostname appears immediately following the ‘//’. In the case of a local Windows file path, there is no hostname, and thus another slash and the path immediately follow.
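The two rules above (percent-encode the path, and let the presence of a host decide the number of slashes) can be sketched in a few lines of Python. This is a minimal illustration, not the conversion IE performs: the function name `windows_path_to_file_uri` and the set of characters left unencoded (the path characters permitted by RFC 3986) are assumptions made for this example, and it only handles absolute drive-letter and UNC paths.

```python
from urllib.parse import quote

# Characters that may appear literally in a URI path besides letters, digits
# and "-._~": the RFC 3986 sub-delims plus ":" and "@". "/" separates segments.
PATH_SAFE = "/!$&'()*+,;=:@"


def _encode_path(text: str) -> str:
    # Percent-encode ASCII characters that are unsafe in a URI path ("#", "%",
    # space, "{", "}", "`", "^", control characters and so on). Non-ASCII
    # characters are left as-is, which makes the result an IRI.
    return "".join(ch if ord(ch) > 127 else quote(ch, safe=PATH_SAFE) for ch in text)


def windows_path_to_file_uri(path: str) -> str:
    """Hypothetical helper: convert an absolute Windows path to a file URI."""
    forward = path.replace("\\", "/")
    if forward.startswith("//"):
        # UNC path: the server name becomes the URI host, right after "file://".
        host, _, rest = forward[2:].partition("/")
        return "file://" + host + "/" + _encode_path(rest)
    has_drive = (len(forward) >= 3 and forward[0].isascii()
                 and forward[0].isalpha() and forward[1:3] == ":/")
    if has_drive:
        # Local path: the host is empty, so a third slash and the path follow.
        return "file:///" + forward[:2] + _encode_path(forward[2:])
    raise ValueError("expected an absolute drive-letter or UNC path")


print(windows_path_to_file_uri(r"\\laptop\My Documents\FileSchemeURIs.doc"))
print(windows_path_to_file_uri(r"C:\Documents and Settings\davris\FileSchemeURIs.doc"))
```

For the two example paths in this section, the sketch prints the same file URIs shown above.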

The username, password, and port components of a file URI in Windows are not used. In IE, including any of these components means you won’t be able to navigate to the URI. In contrast, the query and fragment components may be used. The query component will not be used when locating the resource, but the application that displays the content from the file URI may use the query component. For example, if an html document contains script, the script may read the query component of its URI when accessed via the file scheme. Similarly, the fragment will be used like a fragment in any other URI scheme.
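To see which components a given file URI carries, a general-purpose URI parser is enough. The sketch below uses Python's `urllib.parse.urlsplit`. The helper `has_unsupported_components` is a name invented for this example, the `#top` fragment and the second URI are made-up inputs, and the check only mirrors the rule stated above rather than IE's actual navigation logic.

```python
from urllib.parse import urlsplit


def has_unsupported_components(uri: str) -> bool:
    """Hypothetical check: True if the URI carries a username, password or port."""
    parts = urlsplit(uri)
    return (parts.username is not None
            or parts.password is not None
            or parts.port is not None)


uri = "file:///C:/Program%20Files/Music/Web%20Sys/main.html?REQUEST=RADIO#top"
parts = urlsplit(uri)
print(parts.path)      # the part used to locate the file
print(parts.query)     # not used to locate the file, but visible to the document
print(parts.fragment)  # handled like a fragment in any other scheme

print(has_unsupported_components(uri))
print(has_unsupported_components("file://user:secret@laptop:8080/share/a.txt"))
```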

Improper Syntax Examples

The following are some examples of poorly formed file URIs with which we’ve dealt. (Paths have been modified to hide the identity of the culprits. :-) These “bad” URIs will continue to work in IE7; however, you should steer clear of them, both for the reasons stated and because there’s no guarantee of support in the future.

Incorrect: file://D:\Program Files\Viewer\startup.htm
Correct: file:///D:/Program%20Files/Viewer/startup.htm

A large set of invalid file URIs comes from the common but incorrect notion that it’s acceptable to place a Windows file path after the text ‘file://’ and call it a file URI. This is bad because Windows file paths, as mentioned earlier, may contain characters that aren’t allowed in URIs or that are important to the parsing of URIs. For instance, if a ‘#’ is in a Windows file path and that Windows file path is simply appended to the text ‘file://’, then we can’t know whether the ‘#’ is supposed to be part of the path or whether it’s supposed to delimit the fragment as it would in an actual URI. Similarly, if the path contains a ‘%’, then we can’t determine whether the ‘%’ identifies a percent-encoded octet or is just a plain percent character in the Windows file path. Zeke Odins-Lucas wrote an informative and entertaining blog post on this topic.
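The ambiguity is easy to reproduce with any parser that follows the URI rules. In the sketch below, Python's `urllib.parse` stands in for such a parser, the helper `recovered_path` is a hypothetical name, and the file names `report#1.txt` and `50%20off.txt` are invented for illustration.

```python
from urllib.parse import quote, unquote, urlsplit


def recovered_path(uri: str) -> str:
    """Path a URI parser recovers: split off query/fragment, then decode %XX."""
    return unquote(urlsplit(uri).path)


# Two legal Windows file names that contain URI-significant characters.
for name in ["report#1.txt", "50%20off.txt"]:
    naive = "file:///C:/docs/" + name            # file name pasted in unchanged
    encoded = "file:///C:/docs/" + quote(name)   # "#" and "%" percent-encoded
    print(name)
    print("  naive   ->", recovered_path(naive))
    print("  encoded ->", recovered_path(encoded))
```

With the naive form, the first name loses everything after the ‘#’ and the second has its ‘%20’ turned into a space; only the percent-encoded form round-trips to the original file name.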

Incorrect: C:\Program Files\Music\Web Sys\main.html?REQUEST=RADIO
Correct: file:///C:/Program%20Files/Music/Web%20Sys/main.html?REQUEST=RADIO

In many places inside IE, we allow a Windows file path as input when the input is actually specified as a URI. For example, the function CreateURLMonikerEx takes a string URI, but a Windows file path may be provided instead. Despite this, it is important to realize that a Windows file path is not a URI and a URI is not a Windows file path. You should not, as is done in this example, place a ‘?’ character after a Windows file path and provide a query component. The Windows file path has no such construct. If you wish to reference a file and provide a query then you must use a file URI.

Incorrect: file:////applib/products/a%2Db/ abc%5F9/4148.920a/media/start.swf
Correct: file://applib/products/a-b/abc_9/4148.920a/media/start.swf

The author of this URI was heading in the correct direction. They converted the backslashes in their Windows file path to forward slashes and they percent-encoded characters they thought should be encoded. Although they meant well, there are a couple of problems. First, ‘applib’ is meant to be the host, but is preceded by two extra slashes. If interpreted as an actual URI, then applib isn’t the host but rather part of the path. If interpreted as a legacy file URI (as described by Zeke in his previously mentioned blog post) then those percent-encoded octets will be interpreted literally. Additionally, the characters ‘-‘ and ‘_’ are percent-encoded in this example, but shouldn’t be, as stated by the URI standard.
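Running both forms through a generic URI parser shows where ‘applib’ ends up in each case. This sketch uses Python's `urllib.parse.urlsplit` as a stand-in for a strict parser; it does not reproduce IE's legacy handling, and the variable names are made up for the example.

```python
from urllib.parse import unquote, urlsplit

correct = "file://applib/products/a-b/abc_9/4148.920a/media/start.swf"
extra_slashes = correct.replace("file://", "file:////", 1)

for uri in (correct, extra_slashes):
    parts = urlsplit(uri)
    # With four slashes the host is empty and "applib" is pushed into the path.
    print("host:", repr(parts.netloc), "path:", parts.path)

# "-" and "_" are unreserved characters, so their percent-encoded forms
# decode to exactly the same text and the encoding buys nothing.
print(unquote("a%2Db/abc%5F9"))
```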

Non US-ASCII Characters

Characters outside of US-ASCII may appear in Windows file paths and accordingly they’re allowed in file IRIs. (URIs are defined as US-ASCII only and so when including non-US-ASCII characters in a string, what you've actually created is called an IRI: Internationalized Resource Identifier.) Don’t use percent-encoded octets to represent non-US-ASCII characters because, in file URIs, each percent-encoded octet is interpreted as a byte in the user’s current codepage. As a result, the meaning of a URI containing percent-encoded octets for bytes outside of US-ASCII changes depending on the locale in which the document is viewed. Instead, to represent a non-US-ASCII character you should use that character directly in the encoding of the document in which you are writing the IRI. For instance:

Incorrect: file:///C:/example%E3%84%93.txt
Correct: file:///C:/exampleㄓ.txt
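The locale dependence can be demonstrated by decoding the same octets under different encodings. In this sketch, Python's `unquote` plays the role of the consumer, and UTF-8, cp1252 and cp1251 are just example encodings chosen for illustration; which codepage applies on a real machine depends on the user's settings.

```python
from urllib.parse import unquote

name = "example%E3%84%93.txt"

# The same three percent-encoded octets decode to different text
# depending on which encoding the reader assumes.
for encoding in ("utf-8", "cp1252", "cp1251"):
    print(encoding, "->", unquote(name, encoding=encoding))
```

Only the reader that happens to assume UTF-8 recovers the intended character, which is why writing the character directly is the safer choice.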

IPv6

In the latest URI standard, IPv6 literals are a part of the URI host syntax. In Windows, file URIs are dereferenced by converting them to their corresponding Windows file path and then using Windows file APIs to access the Windows file path. Since there’s no way to include an IPv6 address in a Windows file path, there’s no corresponding file URI and so there’s no way to incorporate an IPv6 address in file URIs in Windows. You can still use a hostname that resolves to an IPv6 address in the file URI, just not the IPv6 literal itself.
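The conversion from a file URI back to a Windows file path, and the reason an IPv6 literal has nowhere to go, can be sketched as follows. This is a simplified illustration and not what IE or the Windows APIs do internally: `file_uri_to_windows_path` is a hypothetical name, percent-encoded octets are decoded as UTF-8 rather than in the user's codepage, and the query and fragment are simply dropped.

```python
from urllib.parse import unquote, urlsplit


def file_uri_to_windows_path(uri: str) -> str:
    """Hypothetical helper: map a well-formed file URI to a Windows path."""
    parts = urlsplit(uri)
    if parts.scheme.lower() != "file":
        raise ValueError("not a file URI")
    host = parts.netloc
    if "@" in host or ":" in host or "[" in host:
        # User info, ports and IPv6 literals such as [::1] have no
        # counterpart in a Windows file path, so they are rejected here.
        raise ValueError("host cannot be expressed in a Windows path: " + host)
    path = unquote(parts.path).replace("/", "\\")
    if host:
        return "\\\\" + host + path   # UNC form: \\host\share\...
    return path.lstrip("\\")          # local form: C:\...


print(file_uri_to_windows_path("file://laptop/My%20Documents/FileSchemeURIs.doc"))
print(file_uri_to_windows_path(
    "file:///C:/Documents%20and%20Settings/davris/FileSchemeURIs.doc"))
```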

In Conclusion…

To reiterate the points above, please construct and use well-formed file URIs. If you’re writing code that generates or interprets file URIs, use the functions PathCreateFromUrl and UrlCreateFromPath to convert between Windows file paths and file URIs. These functions will work correctly with well-formed file URIs and legacy file URIs. Even if your file URI syntax looks reasonable and works in one case, that doesn’t mean it will work correctly in corner cases like paths that contain the ‘#’ or ‘%’ characters.

If you know of other interesting misuses of file URIs or have other related comments please let us know!

Dave Risney
Software Design Engineer

edit: added incorrect/correct wording, link update
edit: Corrected URI/IRI language in the Non US-ASCII Characters section

 
