Getting intermediate forms

Sorting it all Out
Michael Kaplan's random stuff of dubious value


Unicode has a certain complexity to it that can at times be challenging.

Let's take for example U+1ec5, a.k.a. LATIN SMALL LETTER E WITH CIRCUMFLEX AND TILDE. Here is what it looks like (how well it renders will depend on your OS and browser support!):

ễ

Now obviously that is fully precomposed (in Unicode Normalization Form C). If it is fully decomposed (Normalization Form D), we get U+0065 U+0302 U+0303, a.k.a. LATIN SMALL LETTER E + COMBINING CIRCUMFLEX ACCENT + COMBINING TILDE. Here is what it looks like (again, how well it renders will depend on your OS and browser support!):

ễ

And here is where the problems come in. Because between these two extremes lies a third case: U+00ea U+0303, a.k.a. LATIN SMALL LETTER E WITH CIRCUMFLEX + COMBINING TILDE. Here is what it looks like (again, how well it renders will depend on your OS and browser support!):

ễ

Now if you convert that third case to NFC you will get the first case, and to NFD you will get the second. How does that happen?

Well, the rules for normalization are that you have to keep on performing the composition or decomposition until you can't anymore.
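To make that "keep going until you can't" rule concrete, here is a minimal sketch in Python. The choice of Python and its standard unicodedata module is mine, purely for illustration. The helper name full_decompose is hypothetical, and it skips the canonical reordering of combining marks that real NFD performs, so treat it as a sketch for a single character like this one and not as a normalization implementation.

```python
import unicodedata

def full_decompose(ch):
    """Apply canonical decomposition mappings repeatedly until none remain.

    Hypothetical helper: it ignores canonical reordering, so it is only
    a sketch, not a substitute for unicodedata.normalize("NFD", ...).
    """
    mapping = unicodedata.decomposition(ch)
    # No mapping, or a compatibility mapping (tagged "<...>"): stop here.
    if not mapping or mapping.startswith("<"):
        return ch
    return "".join(full_decompose(chr(int(cp, 16))) for cp in mapping.split())

# The three canonically equivalent spellings discussed in this post.
forms = ["\u1ec5", "\u00ea\u0303", "e\u0302\u0303"]

# A single decomposition step only gets us to the intermediate form.
print(unicodedata.decomposition("\u1ec5"))  # 00EA 0303

# Repeating the step until nothing decomposes gives the full NFD sequence.
print([f"U+{ord(c):04X}" for c in full_decompose("\u1ec5")])

# All three spellings normalize to the same NFC and the same NFD string.
print({unicodedata.normalize("NFC", s) for s in forms} == {"\u1ec5"})
print({unicodedata.normalize("NFD", s) for s in forms} == {"e\u0302\u0303"})
```

The single-step mapping for U+1ec5 lands on U+00ea U+0303, which is exactly the in-between form, and it takes a second step on U+00ea to reach the fully decomposed sequence.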

So, there are two ways to get the information of that last case:

  1. You can cart around the decomposition info from the Unicode Character Database so you can get it all yourself.
  2. You can take the NFD string and start converting to NFC with one additional character at a time, thus:

Step 1:   Convert the string to NFD; we now have: U+0065 U+0302 U+0303

Step 2:   U+0065 + U+0302 to NFC == U+00ea; we now also have U+00ea U+0303

Step 3:   U+00ea + U+0303 to NFC == U+1ec5; we now also have U+1ec5
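The steps above can be sketched roughly as follows. This is a quick illustration in Python using the standard unicodedata module, not anything the OS or the .NET Framework exposes; the function name intermediate_forms is made up for this example, and it inherits all of the quick and dirty limitations of the approach itself.

```python
import unicodedata

def intermediate_forms(text):
    """Collect canonically equivalent forms by composing one character at a time.

    Hypothetical helper following the three steps above: start from NFD,
    then grow a prefix one code point at a time, converting it to NFC.
    """
    # Step 1: fully decompose.
    nfd = unicodedata.normalize("NFD", text)
    forms = [nfd]
    prefix = nfd[:1]
    for i in range(1, len(nfd)):
        # Steps 2, 3, ...: compose the prefix with one additional character.
        prefix = unicodedata.normalize("NFC", prefix + nfd[i])
        candidate = prefix + nfd[i + 1:]
        # Only record a form when this step actually changed something.
        if candidate != forms[-1]:
            forms.append(candidate)
    return forms

for form in intermediate_forms("\u1ec5"):
    print(" ".join(f"U+{ord(c):04X}" for c in form))
# U+0065 U+0302 U+0303
# U+00EA U+0303
# U+1EC5
```

Each candidate is the composed prefix plus the untouched remainder of the NFD string, so every form it returns stays canonically equivalent to the input.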

Now this is not what I would call a perfect algorithm by any stretch of the imagination. But it is a quick and dirty way to get the information on a bunch of canonically equivalent forms.

But it certainly leaves open the question of whether the operating system and/or the .NET Framework should expose this information at some point....

 

This post brought to you by "ễ" (U+1ec5, a.k.a. LATIN SMALL LETTER E WITH CIRCUMFLEX AND TILDE)
