The [Upper]Case of the Turkish İ (or: Casing, the 2nd)

Sorting it all Out
Michael Kaplan's random stuff of dubious value
Be sure to read the disclaimer here first!

I think the Turkish folks have it right.

After all, say that we had all of the following characters in English:

  1. I   U+0049   LATIN CAPITAL LETTER I
  2. i   U+0069   LATIN SMALL LETTER I
  3. İ   U+0130   LATIN CAPITAL LETTER I WITH DOT ABOVE
  4. ı   U+0131   LATIN SMALL LETTER DOTLESS I

Wouldn't we do the case mapping to put the dotted and dotless variants together (so that both #1/#4 and #3/#2 would be case pairs)? Be honest, doesn't that make more sense?

We even have a good reason, if you think about it. I mean, it's not like the "I" in "him" sounds like the one in "nice", and neither of them sounds like the one in "niece", and none of them sounds like the silent one in "friend". So with all of those different sounds, English would be a lot simpler if we had an extra pair of letters to work with. I have talked to a lot of native speakers of other languages about languages (occupational hazard), and many suggest that one of the hard things about learning English is the multiple sounds for the same letter. We could actually move towards simplifying things by adding the complication of a few variations on letters....

Ah well, that probably won't happen. But hopefully you can see the basis that languages might have for wanting an "Å" or an "Ö" or a "Č" or an "İ" in their midst. And then, like I pointed out at the beginning of this post, if all of the variants of "I" did exist, it would be crazy to case them in any other way....

Of course, as you may have imagined, this plan does not exactly co-exist well with case-insensitive registries or filesystems (like FAT and NTFS). Suddenly the idea that seems more sensible looks like an awful security risk (I do not even have to imagine; I have built versions of Windows on my own development machine that would not boot because they were unable to find the "HKLM\SOFTWARE\MICROSOFT\Windows" registry key, and I have heard tales of ones that were unable to find WIN.ini). And I have witnessed code reviews that had scores of developers scan through thousands of files in the .NET Framework to (among other things) make sure the code did not use "Turkic" casing when looking at the filesystem or the registry. It's amazing how difficult and expensive it can be to make a product behave intuitively....

See how I slipped the proper design into that last paragraph? If you said "yes" then I feel very clever, otherwise I don't. :-)
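To make that kind of failure a little more concrete, here is a minimal sketch. It is written in Java rather than .NET, purely because it is easy to run anywhere, and Java's `String.toLowerCase(Locale)` stands in for culture-sensitive casing. The class name, the `sameKey` helper, and the key strings are hypothetical illustrations; this is not how Windows or the .NET Framework actually compare registry keys or file names.

```java
import java.util.Locale;

// Hypothetical demo: a naive case-insensitive key comparison that folds
// both sides to lower case using whatever locale it is handed.
class TurkishLookupDemo {
    static final Locale TURKISH = Locale.forLanguageTag("tr-TR");

    // Lower-case both keys with the given locale's rules, then compare.
    static boolean sameKey(String a, String b, Locale locale) {
        return a.toLowerCase(locale).equals(b.toLowerCase(locale));
    }

    public static void main(String[] args) {
        String stored = "SOFTWARE\\MICROSOFT\\Windows";
        String wanted = "Software\\Microsoft\\Windows";

        // Turkish rules map 'I' to dotless U+0131 but leave 'i' alone,
        // so the two spellings no longer fold to the same string.
        System.out.println(sameKey(stored, wanted, TURKISH));     // false

        // Locale-independent rules map 'I' to 'i', so the keys match.
        System.out.println(sameKey(stored, wanted, Locale.ROOT)); // true
    }
}
```

The only thing that changes between the two calls is the locale, which is exactly why this kind of bug tends to show up only on a machine set to Turkish or Azeri.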

The right design is to use CultureInfo.CurrentCulture in your .NET code any time you want to get the (possibly different) casing behavior seen in Turkish and Azeri, like in strings that your end users would see. At the same time, you would use CultureInfo.InvariantCulture for those cases where you want the invariant, unchanging behavior. And in unmanaged code you want LCMapString with the LCMAP_UPPERCASE/LCMAP_LOWERCASE transformations, using or not using the LCMAP_LINGUISTIC_CASING flag depending on the same conditions.
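The split between "strings users see" and "strings the system sees" can be sketched as two small helpers. This is again Java as an analogy, not the .NET or Win32 API itself: an explicitly passed user locale plays the role of CultureInfo.CurrentCulture, and `Locale.ROOT` plays the role of CultureInfo.InvariantCulture. The class and method names are made up for illustration.

```java
import java.util.Locale;

// Hypothetical helpers that keep the two casing policies apart.
class CasingPolicy {
    // Text shown to end users: follow the user's own language rules
    // (the analog of casing with CultureInfo.CurrentCulture).
    static String upperForDisplay(String text, Locale userLocale) {
        return text.toUpperCase(userLocale);
    }

    // File names, registry-style keys, and other identifiers: use fixed,
    // locale-independent rules (the analog of CultureInfo.InvariantCulture).
    static String upperForIdentifier(String key) {
        return key.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr-TR");

        // Dotted capital I (U+0130) at the front, as a Turkish reader expects.
        System.out.println(upperForDisplay("istanbul", turkish));

        // Plain ASCII result regardless of the machine's locale: WIN.INI
        System.out.println(upperForIdentifier("win.ini"));
    }
}
```

Putting the choice behind two differently named functions means the caller has to say which kind of string they have, instead of inheriting whatever the default locale happens to be.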

It's easy to remember it and do it, if you learn it in the first place. :-)

Blog - Comment List
  • > many suggest that one of the hard things
    > about learning English is the multiple
    > sounds for the same letter

    The same happens a lot in other languages too, usually not as much as in English, but a lot more than they think it does.

    And it would not be solved by adding phonetic markers (like Vietnamese) or changing the rules for phonetic characters (like Japanese kana), because social trends will still result in changing some pronunciations and the new rules will become just as obsolete as the old rules were.
  • Yes, I suspected this was true, I just know that in talking to native speakers of other languages (there are a lot of those at Microsoft!) that no one ever seemed to think their language did it more....
  • Turkish folks sign on IRC a lot - or, at least, my IRC network (Nightstar) - and when they do they use the "windows-turkish" character set for 8-bit character sets (which the RFC demands and most clients use - though many modern clients use UTF-8). When they say I-with-dot or i-without-dot, they show up as Ý and ý, respectively. The cases are, of course, mixed in the way mentioned above, so when they miss (or are just lazy) it's rather obvious.

    Vorn
  • A few days ago I reminded everyone about how every Unicode character has a story, and I was talking about...
  • Scott figured it out, and it was not a Microsoft bug.:-)
    You can read about the details on Scott's blog...
  • Yet another 'New in Vista Beta 1' post!
    Now I answer a lot of questions in this blog, some that people...
  • A while back Patrick asked me: Hi Michael, Please forward if you’re not the right person to ask the question…

  • It is a commonly reported issue in Windows and many components that run upon it, a recent one can be

  • Over the last few years, quite a few of my blogs have mentioned the LCMAP_LINGUISTIC_CASING flag for

  • Regular readers might recall a long ago blog entitled New in Vista: What's your name? Who's your daddy?

  • Help Turkısh

  • Turkısh help

  • A couple of days ago, friend and colleague from Microsoft Damit Senanayake put a link up on my Facebook
