With the launch of Internet Explorer 7, we are now at the beginning of the internationalized domain names revolution.

What is an Internationalized Resource Identifier (IRI): In short, domain names have so far been restricted to ASCII characters, e.g. www.microsoft.com. An IRI extends the URI syntax to allow Unicode characters, so a domain name can now be written in any language, e.g. http://www.cafè.com or http://www.αβγ.com

For more information, please refer to RFC 3987.

Challenges for IRI: Just adding Unicode characters to a domain name is not sufficient. Additional issues now need to be dealt with:

  1. Handling DNS lookup: DNS is used to resolve a domain name to its IP address. The DNS protocol can only handle ASCII name resolution, so Unicode characters need to be converted into an ASCII format. This is where Internationalized Domain Names (IDN) come into the picture (refer to RFC 3490). IDN defines a reversible encoding (Punycode, not a hash) from a Unicode label to an ASCII label. The result is an ASCII Compatible Encoding (ACE) label of the form 'xn--<ascii characters>', e.g. Å becomes xn--5ca (the label is lowercased to å before it is encoded).
  2. Equivalency: The same character can be represented by more than one sequence of Unicode code points. Therefore, before Unicode labels are compared, they have to be reduced to a standard Unicode form. This process is called normalization. E.g. Å (U+00C5) and A followed by the combining ring above (U+030A) are equivalent after normalization.
  3. Representation in programming languages: Unicode characters are represented by the char type. In C#, a char is two bytes wide, so the maximum value it can hold is 65535 (0xFFFF). However, Unicode code points can have larger values. C# represents these code points with surrogate pairs, i.e. two chars per code point. A domain name containing surrogate pairs should work the same as one whose characters all fall within 65535.
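
The three challenges above can be seen in a few lines of code. The sketch below uses Python's standard library purely for illustration (it is not how the Uri class is implemented); the variable names and the sample code point U+10330 are choices made for this example.

```python
import unicodedata

# 1. DNS lookup: Unicode label -> ACE label (IDNA as defined in RFC 3490).
#    The label is lowercased before encoding, so "Å" is encoded as "å".
ace = "Å".encode("idna")
print(ace)                 # b'xn--5ca'
print(ace.decode("idna"))  # å  (the encoding is reversible)

# 2. Equivalency: precomposed Å versus A + COMBINING RING ABOVE (U+030A).
precomposed = "\u00C5"
decomposed = "A\u030A"
print(precomposed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True

# 3. Surrogates: a code point above U+FFFF needs two 16-bit code units
#    in UTF-16, which is what a two-byte char type stores.
sample = "\U00010330"  # an arbitrary code point beyond 65535
utf16 = sample.encode("utf-16-be")
units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]
print([hex(u) for u in units])  # ['0xd800', '0xdf30']
```

The first part shows why the ACE form is safe to hand to DNS, the second why raw string comparison is not enough for equivalency, and the third why a single char cannot hold every code point.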

Managed Uri changes:

For managed Uri parsing, the Uri class handles generic URI parsing (per RFC 2396), whereas custom scheme parsers can be derived from the GenericUriParser class.
The new feature involves returning IDN names from the DnsSafeHost property and normalizing the Uri string provided as input.

This feature will enable users to adopt international domain names while using the existing Uri class.

Generic URI parsing:

Currently the Uri class accepts only specific Unicode values in the domain name, namely those classified as a character by the Char.IsChar() function. Other parts of the Uri accept any Unicode characters, but no normalization is performed on them.

The Uri class is being updated with the ability to handle IDNs and normalization.

How it works:

When the user specifies Unicode characters in the domain, we:

  • Clean the string of any bidirectional characters (refer to RFC 3490 and the Unicode formats at www.unicode.org)
  • Convert Unicode labels to ACE labels. This step also normalizes the labels. The user can get the IDN equivalent of the domain name via the DnsSafeHost property
  • Convert ACE domain labels passed in by the user to their Unicode format. The user can get the Unicode equivalent of the domain name via the Host property
  • Normalize the specified Unicode labels according to the new IDN rules.
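
The host handling described above can be sketched as follows. This is a rough Python illustration under stated assumptions, not the Uri implementation: `strip_bidi`, `dns_safe_host` and `display_host` are hypothetical helpers that only mirror the roles of the DnsSafeHost and Host properties, and the set of bidirectional control characters is an assumption.

```python
# Assumed set of bidirectional control characters to remove:
# LRM, RLM, and the embedding/override controls U+202A..U+202E.
BIDI_CONTROLS = {0x200E, 0x200F, 0x202A, 0x202B, 0x202C, 0x202D, 0x202E}


def strip_bidi(text: str) -> str:
    """Drop bidirectional control characters from the input."""
    return "".join(ch for ch in text if ord(ch) not in BIDI_CONTROLS)


def dns_safe_host(host: str) -> str:
    """Hypothetical analogue of DnsSafeHost: every label in ACE form."""
    # The "idna" codec normalizes each label and then Punycode-encodes it.
    return strip_bidi(host).encode("idna").decode("ascii")


def display_host(host: str) -> str:
    """Hypothetical analogue of Host: every label in Unicode form."""
    # Round-tripping through ACE normalizes Unicode labels and converts
    # any ACE labels passed in by the caller back to Unicode.
    return strip_bidi(host).encode("idna").decode("idna")


print(dns_safe_host("www.Å.com"))       # www.xn--5ca.com
print(display_host("www.xn--5ca.com"))  # www.å.com
```

Note that the decomposed spelling `"www.A\u030a.com"` yields the same ACE host, because normalization happens before the label is encoded.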

When the user specifies Unicode characters in the other parts of the Uri, we:

  • Clean the string of any bidirectional characters
  • Normalize those Uri parts
  • Let the user retrieve Uri parts with the GetComponents() function, specifying the type of escaping desired with the UriFormat enum. Depending on the UriFormat specified with GetComponents(), we escape Unicode characters that fall outside the IDN format specification for that UriComponent. The query part of the URI is allowed more Unicode characters than the other components.
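
As a minimal sketch of the normalize-then-escape idea for the non-host parts, the Python below normalizes a path to NFC and percent-encodes the non-ASCII characters as UTF-8. The helper names and the `safe` parameter are hypothetical, and the real per-component rules applied by GetComponents() and UriFormat are not reproduced here.

```python
import unicodedata
from urllib.parse import quote


def normalize_part(part: str) -> str:
    """Reduce a Uri part to Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", part)


def escaped(part: str, safe: str = "/") -> str:
    """Rough analogue of asking for the escaped form of a Uri part."""
    # Characters listed in `safe` are left alone; everything else outside
    # the unreserved ASCII set is percent-encoded as UTF-8 bytes.
    return quote(normalize_part(part), safe=safe)


path = "/cafe\u0301/menu"    # "e" followed by COMBINING ACUTE ACCENT
print(normalize_part(path))  # /café/menu
print(escaped(path))         # /caf%C3%A9/menu
```

Normalizing first means that two spellings of the same path produce the same escaped string, which is what makes a later equality check meaningful.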

For the domain name as well as the other parts:

  • Equivalency via Equals() will be automatically checked on specified Uri strings after they have been normalized.
  • Surrogate pairs will be handled if specified.

Custom URI parsing:

Currently, the user can define custom URI parsers by deriving from the GenericUriParser class and specifying options with the GenericUriParserOptions enum. The user can opt into the enhanced functionality with two new values added to this enum:

a) Idn - Turns on the IDN functionality for this custom parser.
b) IriParsing - Turns on the IRI functionality for this custom parser.

Targeted Timeframe and Backward compatibility:

This feature is currently slated for the Orcas .NET release. Once it is released, current users of these classes will not see any change in functionality. Instead, they will have to opt into the new IRI/IDN functionality. Thus, by default the Uri parser will behave identically to the one in Whidbey.

There will be two switches that control backward compatibility of the Uri class. Both will be switched off by default.

  1. IDN - The user can specify whether the domain name is converted to ACE labels
    a. Off completely
    b. On completely
    c. On for Internet Uris and off for intranet Uris. This handles scenarios where the user has an intranet Active Directory mechanism for resolving Unicode domain names to IP addresses
  2. IRI Parsing - The user can specify whether normalization and character checking are done according to the IRI RFC. The current Uri class is based on RFC 2396. This flag moves us to the latest URI RFC, RFC 3986, which the IRI RFC is based on. Specifying only this switch will not enable IDN.
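
The three IDN settings can be pictured as a small decision function. The Python below is only a sketch: the `IdnScope` names are hypothetical labels for options a, b and c, and treating a host without a dot as an intranet name is an assumption made for this example, not the rule the Uri class uses.

```python
from enum import Enum


class IdnScope(Enum):
    """Hypothetical names for the three IDN settings."""
    NONE = "off completely"
    ALL = "on completely"
    ALL_EXCEPT_INTRANET = "on for Internet, off for intranet"


def looks_like_intranet(host: str) -> bool:
    # Assumption for illustration only: a single-label host is intranet.
    return "." not in host


def host_for_dns(host: str, scope: IdnScope) -> str:
    """Return the host as it would be handed to name resolution."""
    if scope is IdnScope.NONE:
        return host  # default: behave as before, no ACE conversion
    if scope is IdnScope.ALL_EXCEPT_INTRANET and looks_like_intranet(host):
        return host  # leave intranet names in Unicode
    return host.encode("idna").decode("ascii")


print(host_for_dns("www.Å.com", IdnScope.NONE))                 # www.Å.com
print(host_for_dns("www.Å.com", IdnScope.ALL_EXCEPT_INTRANET))  # www.xn--5ca.com
print(host_for_dns("Å", IdnScope.ALL_EXCEPT_INTRANET))          # Å
print(host_for_dns("Å", IdnScope.ALL))                          # xn--5ca
```

Keeping the default at the "off" setting is what preserves the existing behavior for callers who never opt in.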