So anyway, I was pointed to Chris Mullins' .NET Unicode Puzzle and was struck by the irony of the use of the ASCII code page rather than the CharUnicodeInfo class (which I used for my own solution to the problem in Stripping Diacritics).

I don't mean the irony of how he went on and on about discovering the use of normalization in the solution. I mean, that isn't ironic, that just means he didn't see the article. But even regular readers can miss a post, let alone folks who don't read the blog. So that isn't ironic.

The irony for me was the way Chris went on in the end:

Whenever I drop into doing Unicode-related tasks, I'm always amazed at the sheer breadth of the Unicode standard. There is so much information in there, and so many powerful features that it's easy to quickly become overwhelmed.

It's easy too to forget that everything we do these days on a computer is leveraging Unicode. Pretty much everything is encoded in either UTF-8 or UTF-16 - all web pages, all XML documents, all text files stored on your hard disk. Unicode is at the heart of Windows, Linux, .Net & Java. Despite this, very few developers have any real understanding of what Unicode is, or how it works. I've been asking "What does that UTF-8 or UTF-16 mean that you've typed in a zillion times?" during interviews now for years, and have yet to ever get back the right answer (although I've sure had some creative responses!).

Isn't it just a little bit ironic that he says so much about the power of Unicode and how no one understands it, while the solution to the problem pivots through the ASCII encoding which allows almost nothing in Unicode through?
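To make the difference concrete, here is a minimal sketch of the two approaches. It is written in Python rather than .NET, so `unicodedata.normalize` and `unicodedata.category` are only stand-ins for string normalization and the CharUnicodeInfo class, and the function names are made up for illustration. It also assumes the ASCII pivot silently drops whatever does not fit; neither function is a copy of Chris's code or of mine.

```python
import unicodedata


def strip_via_ascii(text: str) -> str:
    """Hypothetical ASCII-pivot approach: decompose, then keep only ASCII."""
    decomposed = unicodedata.normalize("NFD", text)
    # Assumption: characters outside ASCII are dropped rather than replaced.
    return decomposed.encode("ascii", errors="ignore").decode("ascii")


def strip_via_category(text: str) -> str:
    """Hypothetical category-based approach: decompose, drop nonspacing marks."""
    decomposed = unicodedata.normalize("NFD", text)
    # "Mn" is the Unicode general category for nonspacing marks.
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Recompose whatever is left into its canonical composed form.
    return unicodedata.normalize("NFC", kept)


if __name__ == "__main__":
    for sample in ("caf\u00e9", "\u04e2"):
        print(repr(strip_via_ascii(sample)), repr(strip_via_category(sample)))
```

Both functions agree on a Latin string like "café", but for U+04e2 the ASCII pivot returns an empty string while the category check leaves the base letter U+0418 intact, since only the combining macron is a nonspacing mark.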

For an example of the kind of character that his solution won't work for, see the rather irked sponsoring character, below! :-)

 

This post brought to you by Ӣ (U+04e2, a.k.a. CYRILLIC CAPITAL LETTER I WITH MACRON)