And what about the Japanese (Unicode) sort?

Sorting it all Out
Michael Kaplan's random stuff of dubious value


Although I got no public comments about it, seven different people contacted me privately (by email or via the "Contact" link) asking what the answer was to Andrea's question about the Japanese (Unicode) sort.

(I'm not sure why no one asked in the public comments. I must be very intimidating or something.)

It's not a very exciting answer, for what it's worth.

It's about the same as the answer about Korean (but the Yen sign U+00A5 is used instead of the Won sign, for obvious reasons). We also move the HORIZONTAL BAR (U+2015) to sort by the KATAKANA-HIRAGANA PROLONGED SOUND MARK (U+30FC), for similarly unremarkable historical reasons. Otherwise the sort is identical to the default sort, a fact that makes it quite fundamentally useless for Japanese data.

Blog - Comment List
  • I was not intimidated. I just did not want to try and make you answer the question if there was some political reason that you could not.
  • Michael Kaplan's random stuff of dubious value.
  • In case anyone looks here without looking at "it", please do go look at "it". I do understand the codepoint of the single-byte yen sign in Japanese (non-Unicode) character sets. Regarding sorting, the number of meaningful Japanese sort orderings does not magically go down to 1 if you use Unicode encoding instead of more common character sets (ANSI code page 932 or other), and Windows does not have a sort ordering that would match my local phone book.
  • Yes, that is hopefully true, since the so-called Japanese Unicode sort was removed due to its uselessness.

    The actual Japanese sort in Windows today is hopefully not in the phone book order, or that would be a messed up phone book. I will probably relay a conversation I had with a different person at a conference about THAT sort on another day. :-)
  • Ok, I'll bite.. does anyone have a url for a good japanese sort algorithm? I can imagine the pain of such an algorithm, having to look up readings first... but maybe someone has done said work already?
  • 12/21/2004 12:38 PM Nick Lange

    > does anyone have a url for a good japanese
    > sort algorithm?

    I doubt it very much. After posting that Windows doesn't have a sort ordering that would match my local phone book, I belatedly remembered that it isn't even possible to define a sort ordering that would match a phone book even without me being listed in it. And I belatedly remembered that I have even posted that fact in Raymond Chen's blog...

    Anyway, there are a few standard sort orderings, but all of them are unsuitable for use in human displays. They are only suitable for use in internal operations such as storing and retrieving keys in databases or hash tables or symbol tables and such things.

    I once knew someone named Kanbe. The Kanji of his name were the same as for the city Kobe. Consider sorting the names Kanbe, Kimura, and Kobe. The exact same Kanji for Kanbe and Kobe must be listed both before and after Kimura. The only way to sort them properly is to also have the pronunciations recorded, use the pronunciations as the primary sort key, and use additional secondary keys including the actual names that are going to be displayed or printed.

    The first name of a former colleague is Yukie but someone read her name and called her Sachie (of course he really read and called her by full name, properly starting with her family name). There are thousands like this.
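The Kanbe / Kimura / Kobe case above can be sketched in a few lines of Python. This is only an illustration of "reading as primary key, displayed name as secondary key", not how Windows or any phone book actually sorts. The `Entry` type, its field names, and the kana readings are assumptions made up for the example, and plain code point comparison of hiragana is used as a rough stand-in for a real kana collation.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    """Hypothetical record: a name as displayed plus its recorded reading."""
    reading: str  # pronunciation written in hiragana (the furigana)
    display: str  # the name as it will be printed, usually in kanji


def sort_key(entry: Entry) -> tuple:
    # Primary key: the reading. Secondary key: the displayed form,
    # used only to break ties between identical readings.
    # Assumption: comparing hiragana by code point is close enough to
    # kana order for this small example; a real collation is more subtle.
    return (entry.reading, entry.display)


entries = [
    Entry("こうべ", "神戸"),  # Kobe
    Entry("きむら", "木村"),  # Kimura
    Entry("かんべ", "神戸"),  # Kanbe, written with the same kanji as Kobe
]

ordered = sorted(entries, key=sort_key)

for entry in ordered:
    print(entry.reading, entry.display)
```

With the readings recorded, the two entries written 神戸 land on either side of 木村. Sorting on the `display` field alone could never produce that order, because the two kanji strings are identical.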
  • Agreed, so I guess the best thing to do is just butcher the actual readings and just sort from the first match in a lookup. (joke)
    Although my ketai has fields for both the reading and the kanji... probably how most systems here work.
  • Note that the solutions currently used for Korean and Chinese are very much based on the model of "take the most common pronunciation." This solution is routinely rejected for Japanese, a point that I will actually be exploring in a future posting to the blog (cf: http://blogs.msdn.com/michkap/articles/271003.aspx#329682 ).

    :-)
  • 12/22/2004 12:25 AM Nick Lange

    > my ketai has fields for both the reading and
    > the kanji

    Yes, so do databases and hand-written paper forms for all kinds of purposes etc. The printed phone book doesn't. (If you only have a keitai then I don't know if you're supposed to be entitled to a printed phone book.)

    Sorry for picking on phone books so much. There are other situations too where furigana aren't printed but would have helped some readers if they had been printed, but if those situations require lists to be sorted then they often turn out to be phone books.
  • nice... After my current contract is up, I'd like to get into more multilingual programming projects... can't wait.
