The last word on the FINAL SIGMA

Sorting it all Out
Michael Kaplan's random stuff of dubious value
Be sure to read the disclaimer here first!

Back in the beginning of April, I explained about the one scenario where casing does not need to roundtrip in .NET -- the Greek final sigma.

Anyway, the day before yesterday I got an email from someone who had been reading my blog and was looking at all of the one-way mappings that are in the linguistic tables (accessed with the LCMAP_LINGUISTIC_CASING flag, which I have discussed previously). He was wondering why that FINAL SIGMA could not be put into the linguistic tables since it is a one-way mapping.

A fair question, one I thought worthy of a post. :-)

If you are a native speaker of Greek, then you know that both ς (U+03c2, a.k.a. GREEK SMALL LETTER FINAL SIGMA) and σ (U+03c3, a.k.a. GREEK SMALL LETTER SIGMA) do indeed uppercase to Σ (U+03a3, a.k.a. GREEK CAPITAL LETTER SIGMA). But if we moved the ς mapping into the linguistic table, then suddenly ς would never be uppercased by the CharUpper/CharUpperBuff functions, and it would not be uppercased by a default call to the LCMapString function with the LCMAP_UPPERCASE flag.

Obviously that would not be a good thing.

Try to imagine how you would feel if attempting to uppercase the string hello came out as HELLo. Wouldn't you consider it a bug? Especially if it used to come out with the HELLO you were expecting? You might be thinking about telling the platform GooDBYE, if you know what I mean.
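For the Greek version of HELLo, here is a minimal Python sketch. It is not the Windows implementation: Python's str.upper stands in for the default casing table, and upper_without_final_sigma_mapping is a hypothetical helper that models a default table whose ς entry has been moved out to a linguistic-only table.

```python
FINAL_SIGMA = "\u03c2"  # ς GREEK SMALL LETTER FINAL SIGMA


def upper_without_final_sigma_mapping(text):
    """Uppercase character by character, leaving ς untouched.

    Hypothetical: models a default table that no longer maps ς to Σ.
    """
    return "".join(ch if ch == FINAL_SIGMA else ch.upper() for ch in text)


word = "\u03c0\u03c1\u03bf\u03c2"  # προς, a word ending in a final sigma
print(word.upper())                             # ΠΡΟΣ
print(upper_without_final_sigma_mapping(word))  # ΠΡΟς
```

The second result is the HELLo case: every letter is uppercased except the last one.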

Of course ideally the functions would notice whether the Σ was at the end of a word and then decide whether to use ς or σ, depending. But LCMapString does not really look beyond the character level here, so until it does that would not really be an option.
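To make the character-level limitation concrete, here is a small Python sketch (an illustration only, not how LCMapString is implemented). The hypothetical lower_per_character helper maps each character with no knowledge of its neighbors, so a Σ can only ever come back as σ.

```python
def lower_per_character(text):
    """Lowercase each character on its own, with no view of its neighbors."""
    return "".join(ch.lower() for ch in text)


upper_word = "\u03a0\u03a1\u039f\u03a3"  # ΠΡΟΣ
print(lower_per_character(upper_word))   # προσ, where a Greek reader expects προς
```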

Of course, a more sophisticated application might work to provide results beyond the character boundary. I do not envy such programs, though; the boundary for them becomes quite fuzzy if you have non-Greek characters after the ς. Does that count as a new word or doesn't it? That is the kind of question where an API can never win -- no matter which way it goes, there will be some people who do not like the answer.
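The fuzziness is easy to show with a sketch. The Python below is an assumption-laden illustration, not any platform's actual rule: lower_sigma and is_greek_letter are hypothetical helpers, and the "word boundary" test is deliberately simplistic. The same input gets two different answers depending on whether a non-Greek letter after the Σ is taken to continue the word.

```python
import unicodedata


def is_greek_letter(ch):
    """True for letters whose Unicode name starts with GREEK."""
    return ch.isalpha() and unicodedata.name(ch, "").startswith("GREEK")


def lower_sigma(text, continues_word):
    """Lowercase text, using ς for a Σ that follows a letter and ends a word.

    continues_word decides whether the character after Σ keeps the word going.
    """
    out = []
    for i, ch in enumerate(text):
        if ch == "\u03a3":  # Σ
            after_letter = i > 0 and text[i - 1].isalpha()
            word_goes_on = i + 1 < len(text) and continues_word(text[i + 1])
            out.append("\u03c2" if after_letter and not word_goes_on else "\u03c3")
        else:
            out.append(ch.lower())
    return "".join(out)


mixed = "\u03a0\u03a1\u039f\u03a3x"         # ΠΡΟΣ followed by a Latin x
print(lower_sigma(mixed, str.isalpha))      # προσx: any letter continues the word
print(lower_sigma(mixed, is_greek_letter))  # προςx: only Greek letters continue it
```

Neither answer is obviously wrong, which is exactly the problem.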

Anyway, that is why the uppercasing of ς does not live only in the linguistic table: there are too many cases where the results simply would not make sense, at least not as things are implemented currently....

 

This post brought to you by "ς" (U+03c2, a.k.a. GREEK SMALL LETTER FINAL SIGMA)
A character that wonders whether Unicode would have been simpler if it did not exist as an independent entity, and fonts could then decide whether to make it a "final" form or not....

Blog - Comment List
  • I suppose English has the same problem with medial s's?
  • Well, we don't sort or case those differently...
  • Sorry, I meant "ſ"
    LATIN SMALL LETTER LONG S
    http://www.fileformat.info/info/unicode/char/017f/index.htm

    AKA "medial s"

    Looks like Greek and German both have problems with multiple "s"-es - luckily English abandoned the long s prior to computers becoming popular.
  • Ah yes, *that* character.

    In Windows today, we do not case it at all. In Windows tomorrow there are interesting conversations about whether to put it into the linguistic table or not. Admittedly these conversations have more heat and less light, but we are working toward a conclusion....
  • From a mathematical point of view (if I may...)
    Define U to be the ToUpper operator.
    Define L to be the ToLower operator.

    The naive expectation of a typical user is that UL and LU will be the identity operator. As far as I can make out, this is unfixably broken (or at least made much more difficult) by such things as ligatures and medial forms. So I find it reasonable to expect that UL != LU for these problem characters.

    However, I would like to count on U == ULU for everything... and L == LUL for everything. In other words, though U and L are not strictly inverses, it would be nice if they were at least stable.

    In particular, I would like to see L(U("ſ")) = "s", and L(U("ß")) = "ss" - or perhaps "sz". I suppose I'm asking for a way to escape the problem cases by pushing things through a s.LowerCase().UpperCase() operator...

    Is this sensible?
  • Well, I won't say your expectations are not sensible. But they do not match the current behavior, which does only imple Unicode casing with no extra context rules.
  • Replace imple with simple. :-)
  • Oh, I do slightly disagree that a user who is sophisticated enough to call an API is one who we would class as "naive" :-)
  • I think we're at cross purposes over the term "user".

    I meant "user" as in someone who:
    * Installs an application I write
    * Puts text in a box
    * Chooses Format | Case | UPPERCASE
    * Thinks "Hmmm, no, I don't like that"
    * Chooses Format | Case | lowercase
    * Thinks "Hey, what happened to my capital D in Donald" or "Hey, what happened to my ß in straße" or ...
    * Immediately calls me to inform me that my application corrupted their data

    I probably shouldn't have used the word "user" - as an API writer, you probably hear "user" and think "application developer." I meant to say "end user". :)
  • Unfortunately, the answer is the same there -- the API does not handle either of those cases. Certainly we have no way of supporting proper casing anyway, but we do not currently handle strings that would increase the size of the buffer....

    If you need support like that, you have to build it yourself -- we are very low tech here. :-)
  • On this capitalization note, I'd like you to know that I've created an Evil Small-Caps Test for browsers:
    http://www.geocities.com/mvaneerde/small-caps.html
    (The "evil" is a nod to Ian Hickson's "evil" CSS test)
  • To me, the NLS API function LCMapString has a full-time job, one that is crucial to the fundamental fabric...
  • In internationalization contexts, one often hears about the notion of dangerous characters . This is

  • (Negative assessment word ( blows ) chosen via a magic eight ball and the info in this post ) Benski
