Sorting it all Out Michael Kaplan's random stuff of dubious value Be sure to read the disclaimer here first!
In response to the recent post Appreciation, embarrassment, and redirecting thanks, that MVP Omi I mentioned in relation to Bengali sent me an email:
Are you going to change the rendering engine somehow? I mean if I press ড় it will become ড + ় = ড় ? Please don't do it. ড় is an independent consonant and nothing should apply on it. Please read a controversy from: http://bugzilla.wikimedia.org/show_bug.cgi?id=5948
This is not what I said (or what I meant) when I said:
I also think I added a few of those canonical equivalences into the table (like ড় == ড়, a.k.a. U+09dc == U+09a1 U+09bc).
What I was talking about was making sure that comparisons involving sequences like the one above, which are canonically equivalent to (and happen to be visually indistinguishable from) another string, treat the two strings as equal.
But no part of that comparison operation done with the Bengali locale or culture, whether done with CompareStringEx, CompareString, CompareInfo.Compare, or any of the functions/methods that call them, will do anything to the string to modify what is put in the backing store.
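To make the distinction concrete, here is a minimal sketch in Python using the standard `unicodedata` module. It is not how CompareStringEx or CompareInfo.Compare work internally; it only illustrates the general idea of comparing by canonical equivalence while leaving the stored strings alone. The helper name `canonically_equal` is made up for this example.

```python
import unicodedata

def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings by canonical equivalence.

    Only temporary normalized copies are compared; the inputs are not changed.
    """
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

precomposed = "\u09dc"        # BENGALI LETTER RRA
decomposed = "\u09a1\u09bc"   # BENGALI LETTER DDA + BENGALI SIGN NUKTA

print(precomposed == decomposed)                   # False: the code points differ
print(canonically_equal(precomposed, decomposed))  # True: canonically equivalent
print(len(precomposed), len(decomposed))           # 1 2: what is stored is untouched
```

The comparison answers "are these the same text?" without ever rewriting either string, which is the point being made about the backing store.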
Now with that said, the simple fact is that U+09dc is canonically equivalent to U+09a1 U+09bc. And because of that, it is quite possible (throughout the entire world wide web and across all the many software products out there) to run across those who follow the principle of normalizing strings to some consistent form, and doing so consistently.
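As a rough illustration of what such a normalizing process does to this particular letter, the sketch below runs U+09dc through the four standard normalization forms with Python's `unicodedata` module (chosen here only for illustration; other products use their own normalization libraries, and the variable names are made up). U+09dc is listed among Unicode's composition exclusions, so even the composing forms leave it as the two-code-point sequence.

```python
import unicodedata

rra = "\u09dc"  # BENGALI LETTER RRA, stored as a single code point

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    normalized = unicodedata.normalize(form, rra)
    # Every form yields U+09A1 U+09BC for this letter.
    print(form, " ".join(f"U+{ord(ch):04X}" for ch in normalized))
```

That is the behavior a native speaker is likely to notice: text typed as one code point can come back from a normalizing system as two, even though it is still the same text.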
Hell, there are a few people who believe so strongly in normalizing early and often that they make religious fundamentalist extremists seem mild and dull.
Even Microsoft, in cases of working within standards like IDN and XML and others, may be normalizing a string or two.
And while this can make a native speaker of a language unhappy (as it clearly has in this case), it is important not to put too much stock in this as a problem, because it really is not one.
The issue is similar to one I often bring up related to canonical equivalence, with U+00e5 (å) being canonically equivalent to U+0061 U+030a (å). This too is an equivalence that is annoying for a native speaker of a language that considers U+00e5 to be a helluva lot more than the letter a with some shmutz on top of it -- U+00e5 is a unique letter on its own that deserves more than to be assaulted with a deadly function, right?
But it is not about all that.
In the end, å can often look like å, and ড় can often look like ড়. Normalization does not destroy language through the equivalence, and neither does collation. Both technologies are simply working to make sure that the two forms are treated as the same, no matter which normalization form the data was created in. Since they can look the same, this is neither insult nor injury to the language; in fact, it frees one from worrying about whether they will be treated as equal, even if a normalizing process happens to do something to the string....
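The å case can be sketched the same way, again with Python's `unicodedata` module as a stand-in and with made-up variable names. Whichever form the data was created in, normalizing both sides to one form makes them compare as equal:

```python
import unicodedata

precomposed = "\u00e5"   # LATIN SMALL LETTER A WITH RING ABOVE
decomposed = "a\u030a"   # LATIN SMALL LETTER A + COMBINING RING ABOVE

# Bring each string to the other's form.
composed_again = unicodedata.normalize("NFC", decomposed)
taken_apart = unicodedata.normalize("NFD", precomposed)

print(composed_again == precomposed)  # True
print(taken_apart == decomposed)      # True
```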
It is not a bug, and it is not something to worry about. In any language....
This post brought to you by ড় (U+09dc, a.k.a. BENGALI LETTER RRA)