I’ve divided this into a few parts:
My interest in IDN is that I’m the SDE for the System.Globalization.IdnMapping class in Whidbey. I also think it’s pretty nifty for users in countries that use more than the basic Latin letters.
For those of you who don’t know, IDN/IDNA tries to solve the problem of international (non-ASCII) characters in domain names. IDN stands for “Internationalized Domain Name”. RFC 3490, Internationalizing Domain Names in Applications (http://www.faqs.org/rfcs/rfc3490.html), has the details. IDN only addresses domain names; it doesn’t attack the email address user name issue or other internationalization issues related to URLs/URIs/IRIs.
Before IDN, domain names were basically restricted to the Latin character set: A-Z, 0-9 and sometimes -. This is useful if your company is Microsoft.com, but not so helpful if your company’s name is written in Chinese or Cyrillic characters. IDN provides a mechanism for encoding additional Unicode characters using the allowed a-z, 0-9 and - characters. So a name like きくどら.com (Kikuna Driving School) or www.mäkitorppa.com (Mäkitorppa mobile store) is represented as xn--w8je2f2f.com (きくどら.com) or www.xn--mkitorppa-v2a.com (www.mäkitorppa.com).
So IDN doesn’t require any changes to the DNS layers of the Internet, but it does require conversion from the Unicode form of a name to the ASCII “Punycode” form at some point. A Whidbey .NET application uses the System.Globalization.IdnMapping class to convert between the Unicode and “Punycode” forms.
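To make the round trip concrete, here is a rough sketch of the same conversion. It does not use IdnMapping; it uses Python’s built-in "idna" codec, which implements the RFC 3490 ToASCII and ToUnicode operations, as a stand-in. The helper names to_ascii and to_unicode are mine and exist only for this illustration.

```python
# Sketch only: Python's standard "idna" codec stands in for
# System.Globalization.IdnMapping. Helper names are hypothetical.

def to_ascii(name: str) -> str:
    """Convert a Unicode domain name to its ASCII (Punycode) form."""
    return name.encode("idna").decode("ascii")


def to_unicode(name: str) -> str:
    """Convert an ASCII (Punycode) domain name back to its Unicode form."""
    return name.encode("ascii").decode("idna")


if __name__ == "__main__":
    ascii_name = to_ascii("www.mäkitorppa.com")
    print(ascii_name)              # www.xn--mkitorppa-v2a.com
    print(to_unicode(ascii_name))  # www.mäkitorppa.com
```

Note that only the label containing non-ASCII characters gets the xn-- prefix; "www" and "com" pass through unchanged.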
In addition to the Punycode conversion, IDN does some normalization using NFKC and additional mappings, such as making the strings all the same case. Some Unicode characters are considered ambiguous or dangerous and are disallowed in IDN; others are “folded” into a more common form so that equivalent spellings end up as a single name.
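The folding step can be seen in isolation with a small sketch. This assumes Python’s standard library, where encodings.idna.nameprep applies the case mapping and NFKC normalization to a single label; prepare_label is just a wrapper name I made up for the example.

```python
import encodings.idna


def prepare_label(label: str) -> str:
    """Apply the IDN preparation step (case folding plus NFKC) to one label.

    Hypothetical helper: it only wraps the standard library's nameprep().
    """
    return encodings.idna.nameprep(label)


if __name__ == "__main__":
    # Upper and lower case fold to the same label.
    print(prepare_label("MÄKITORPPA"))    # mäkitorppa
    # Full-width compatibility letters fold to plain ASCII under NFKC.
    print(prepare_label("\uff2d\uff33"))  # ms
    # Because of the folding, both spellings produce the same Punycode.
    print("MÄKITORPPA.com".encode("idna") == "mäkitorppa.com".encode("idna"))  # True
```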
Even with these restrictions, it was quite obvious that many look-alike characters, or homographs, exist in Unicode. Examples exist even in ASCII: you can write MICROSOFT.com as MlCR0S0FT.com by using the lowercase L and zero characters. DNS would treat this as a different domain name and send a user to a different server, yet, depending on your font, it could be difficult to distinguish from the real domain name.
Unicode has tens of thousands of characters, so the homograph problem becomes even more complicated when Unicode characters are allowed. For example, Місrоѕоft.com can be written almost entirely in Cyrillic letters (this example has only the r, f & t in Latin; Я just doesn’t look quite the same ;-)).
Even worse, some scripts have characters that are difficult to distinguish. Many Chinese characters appear very similar in small fonts. Other scripts have minor diacritics that could be missing or slightly modified such that the user might not notice. Due to the complexity of the problem, the IDN RFCs leave the homograph problem to be resolved later, perhaps by the registrars or a future RFC.
Since IDN doesn’t directly address the homograph problem, users could be susceptible to spoofing, phishing and other social engineering attacks. This is exactly what happened with the recent paypal attack. The IDN name pаypal.com was registered with a Cyrillic a for the first A. xn--pypal-4ve.com is the punycode version of this name.
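Here is a small sketch of why the spoofed name is a different name to DNS even though it looks the same on screen. It assumes Python’s standard "idna" codec and unicodedata module. The scripts_in_label helper is hypothetical, and using the first word of each character’s Unicode name as its "script" is a crude heuristic of mine, not something the IDN RFCs define.

```python
import unicodedata


def scripts_in_label(label: str) -> set:
    """Return the first word of each letter's Unicode name, e.g. 'LATIN'.

    Rough heuristic for illustration only, not a real script property lookup.
    """
    return {unicodedata.name(ch).split()[0] for ch in label if ch.isalpha()}


if __name__ == "__main__":
    real = "paypal"
    spoof = "p\u0430ypal"  # U+0430 is CYRILLIC SMALL LETTER A

    print(spoof == real)                    # False
    print(spoof.encode("idna"))             # b'xn--pypal-4ve'
    print(sorted(scripts_in_label(real)))   # ['LATIN']
    print(sorted(scripts_in_label(spoof)))  # ['CYRILLIC', 'LATIN']
```

A check like this flags the mixed Latin/Cyrillic label, but it would not help with a name written entirely in one look-alike script.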
A user following such a link in some browsers would see what looked like paypal.com in their address bar, but would actually be at a different web site. An email or a link from another web site could be used to trick a user into providing their paypal information to an attacker. This type of attack is similar to the socially engineered emails that already try to get users to enter personal information by sending them to https://safe-com.com/ebay or some such URL instead of a real vendor site.
Some people were amused that Mozilla, Firefox and other browsers were susceptible to the pаypal.com homograph attack, but Internet Explorer is not (because it doesn’t do IDN conversion). Equally interesting is the browser reaction of removing IDN support and then choosing to display the Punycode name instead.
Personally, I don’t think this is just an IDN weakness. Rather, IDN merely makes an existing problem with trusting links more obvious. http://paypal-safe.com or http://secure.com/paypal would catch many users anyway.
My thinking is that basically the IDN pаypal.com attack where the first A is Cyrillic is a social attack. For this attack to succeed, a user must first follow an untrusted link. That link could be a web site (please buy my book at Amazon.com) or an email (click here to update your bank information). Some users are already wary of following unsolicited links from email, but don’t think twice about a web link. In either case, a look-alike name in the address bar of the browser would be reassuring.
Several solutions to the homograph problem have been suggested. I don’t have a magic bullet, but I think that the root problem is a user education/social engineering problem; remember, in some cases these attacks can even happen with non-IDN DNS names. The following are suggestions I’ve seen in various places and my thoughts about them. Other people & coworkers disagree, so these are merely my thoughts.
Several suggestions seem American-centric, and I’m disappointed that the developer community doesn’t have a broader global perspective.
· Disabling IDN – This is perhaps the most obvious suggestion, but it seems quite short-sighted. After all, IDN was created to solve a real problem for DNS names that are not ASCII. If your corporate name is きくどら, this “solution” doesn’t help at all.
· Displaying Punycode – This is also quite a non-global suggestion. While showing Punycode solves the particular pаypal.com problem, it’s even worse for the きくどら user. xn--w8je2f2f.com is the same as xn--gibberish.com to their users (xn--gibberish.com would be ٮ٨٧٩ٯٲٳ.com; cool, it even decodes!). Even in the www.mäkitorppa.com case, how is a user supposed to know whether it’s www.xn--mkitorppa-v2a.com or www.xn--mkitorppa-01a.com? For non-US users, displaying the Punycode doesn’t solve the problem; it makes it worse.
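To see how little the Punycode form tells a human reader, here is a quick sketch that decodes the two look-alike names from the Mäkitorppa example. It assumes Python’s standard "idna" codec; decode_name is a hypothetical helper name.

```python
def decode_name(ascii_name: str) -> str:
    """Decode an ASCII (Punycode) domain name to its Unicode form."""
    return ascii_name.encode("ascii").decode("idna")


if __name__ == "__main__":
    for name in ("www.xn--mkitorppa-v2a.com", "www.xn--mkitorppa-01a.com"):
        print(name, "->", decode_name(name))
    # www.xn--mkitorppa-v2a.com -> www.mäkitorppa.com
    # www.xn--mkitorppa-01a.com -> www.mákitorppa.com
```

The two encoded names differ only in "v2a" versus "01a", which is meaningless to a reader, while the decoded names differ in a single accent (ä versus á) that a native reader would at least have a chance of spotting.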
There are many other suggestions though. My concerns with these suggestions are that either a) they are too restrictive, preventing reasonable names, or b) they aren’t sufficient to catch all of the problems, or both.
Some suggestions seem more reasonable to me. (Of course, I’m just me, so other people probably have other ideas).
Various technical problems, including IDN, can be combined with well-designed social attacks to get a user to trust a web site that shouldn’t really be trusted. Vigilance by the registrars could prevent many homograph attacks; however, some will undoubtedly still be possible. Font choices and browser behavior might limit a few more mistakes, but those can be offset by poor eyesight, dyslexia, monitor differences, color choices (user or application), platform (mobile or PC) and other differences. User education can also help catch a percentage of the problem.
It seems to me that all of these taken together will reduce the available surface for attacks, but there will still be a window for the attackers to attempt their exploits. Many of those socially engineered attacks don’t even require IDN Homographs to trap some unwary or uneducated users.