What's up with the Korean (Unicode) sort?

Sorting it all Out
Michael Kaplan's random stuff of dubious value
Be sure to read the disclaimer here first!


I had this conversation a little over two years ago in the Netherlands at the end of the last day of a conference. It may not be word for word, though I actually think it comes pretty close (it's not like I had a tape recorder). The cookies were Pepperidge Farm Mint Milanos, but I do not like mint (I love the non-mint varieties, I am not sure how I ended up with the ones I did - it might have been a mistake to mention I did not like them).

Oh, also the name of the woman I talked to is not really Andrea; I just like the name and do not mind the nod to Jubal Harshaw....


Me: Andrea, would you like a cookie?

Andrea: Actually, I would like to know what the "Korean Unicode sort" is.

Me: I'd actually rather give you one of these cookies. They are really good. Plus it's less embarrassing than the answer to your question.

Andrea: I know you hate mint, you said so yesterday at the luncheon. C'mon Michael!

(Short pause)

Andrea: Or is it Mike? Or maybe michka like your mails?

Me: Michael's best.

Andrea: Ok, no Russian bears. So tell me, why is the Korean Unicode sort embarrassing? I could not find it defined anywhere, except maybe I found a vague hint to the 'Unicode collation' setting that was used in SQL Server 7.0, which could be Korean. Is that it?

Me: No, that's not what it is. Though SQL Server does have a "Korean Unicode collation" of its own that matches the one that used to be on Windows.

Andrea: Grrr. You are infuriating, Michael. What is the Korean Unicode sort? The one that is in SQL Server, the one that used to be in Windows, the one that is still in the header files. What is it?

Me: Well, it's almost the same sort as the one we use for English.

Andrea: Almost? How close is almost? Sounds like almost hitting a home run, but what kind? Was it an almost home run that was a strike out, or an almost home run that was a triple?

Me: Ouch! Well, if you put it that way, I guess you could say it's a strike out.

(I have an embarrassed smile at this point)

Me: We move one character.

Andrea: One character?

Me: One character.

Andrea: What character is it? Something insulting to a government? Did Microsoft upset the Korean premier or something?

Me: No, nothing like that. It's U+005c, the "REVERSE SOLIDUS". Also known as the backslash. Not insulting at all.

Andrea: One of us has to be missing something, Michael. Maybe you had better give me a cookie.

(She eats a cookie, and tries to hand the package back. I shake my head)

Andrea: So please, explain to me why the backslash has to be moved for Korea.

Me: Well, because for Korean, it is also the Won sign (₩).

Andrea: You said in your talk today that there is room for over a million characters in Unicode. There is no room for a dedicated Won?

Me: Oh, there is a dedicated Won Sign at U+20a9. It's just that in most Korean fonts a character that looks like a Won is put in the slot for U+005c, and since the characters look the same we try to make sure that they are treated as if they were the same.

Andrea: Ok, I see that. But why is it called the Korean Unicode sort? If it's legacy then that would make it the Korean ANSI sort, right?

Me: Well, ANSI does not have Korean in it, and there is no Won.

Andrea: You know what I mean, Michael. Are you this exasperating when you talk with your girlfriend?

Me: Oh, I... I'm between girlfriends at the moment.

Andrea: I WONder why....

Me: Hey now!

(Andrea is wearing quite an impish grin at this point)

Andrea: Just kidding. But I was up too late last night and you already gave me the cookies. So I have no real need to flirt when I am teasing at this point.

Me: Hmmmm, no one ever used to have a need. Anyway, I know what you mean. It probably would have made more sense to tie it to the Korean standard, except that's encoding and not sorting. And they basically do put the won at 0x5c in their encoding standard, so MS is just trying to be consistent. It would have been really weird trying to tie to KSC-5601.

Andrea: I can definitely see that. So, what about the rest of the Hangul and Hanja and Jamo and whatnot that is used by Koreans?

Me: Well, now you understand why it was probably removed from Windows -- because it does not really do much for Korean.

Andrea: But it's still in SQL Server. They didn't get the memo?

Me: I know you think that I am a bigwig at Microsoft, but I'm not. I was offered a job there but I haven't even started yet. And I am definitely not "in the know" about what they do in SQL Server.

Andrea: No need to be shirty, dear. I understand. I apologize for thinking you were important.

(I grimace at this point)

Andrea: Ok, and I apologize for teasing you now. But back to the Korean thing.... do you have a guess?

Me: Oh, definitely. I just don't know if I am right.

Andrea: So what is the theory?

Me: My guess is that there is a serious worry about backward compatibility and sort orders in SQL Server, so they can't really get rid of something as easily, even if it is useless. I guess they could have hacked it since it's only different by one character, but they are a team that is astoundingly against hacks. That's something I can respect.

Andrea: So can I. Probably worth a KB article, at least.

Me: Maybe. If PSS gets customers wondering where good old 0x00010412 went, I'll suggest it.

(She eats another cookie)

Andrea: Ok. I'm sorry to monopolize your time like this.

Me: No worries, the group is gone, the conference is mostly over. Hell, I'd probably be flying out tonight if there were a flight. You can come out with us tonight if you want. Well, that is if we are going anywhere.

Andrea: Actually, you can come out with us. My friends are more socially adept than yours.

Me: Probably true. And more than me, too.

Andrea: One more question and we can head back to what's left of the group.

Me: Ok. What's the question?

Andrea: What's up with the Japanese (Unicode) sort?


Needless to say, the conversation devolved at that point. But Andrea did finish the cookies. I did go out with four of Andrea's friends that night and drank more than I should have. The flight home was harder with a hangover, and to be perfectly honest it was not until I sat down to try and remember the whole conversation earlier tonight that I remembered I was supposed to follow up with PSS.

Maybe the blog entry is good enough at this point? :-) 
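
For anyone wondering how the 0x00010412 from the conversation breaks down, here is a minimal sketch. It assumes the usual Windows locale identifier layout (a sort ID in the bits above a 16-bit language ID, which in turn is a 10-bit primary language plus a 6-bit sublanguage); the function name `split_lcid` is made up for illustration and is not a Windows API. It also prints the Unicode names of the two characters involved, to show that the backslash and the dedicated Won sign really are separate code points.

```python
import unicodedata


def split_lcid(lcid):
    """Split a locale identifier into (sort_id, primary_language, sublanguage).

    Hypothetical helper; the bit layout is an assumption based on the
    macros in the Windows headers (sort ID above a 16-bit language ID).
    """
    sort_id = (lcid >> 16) & 0xF   # 4-bit sort ID
    lang_id = lcid & 0xFFFF        # 16-bit language ID
    primary = lang_id & 0x3FF      # low 10 bits: primary language
    sublang = lang_id >> 10        # high 6 bits: sublanguage
    return sort_id, primary, sublang


sort_id, primary, sublang = split_lcid(0x00010412)
print(f"sort ID {sort_id}, primary language 0x{primary:02x}, sublanguage {sublang}")

# The two characters from the conversation are distinct code points.
for ch in "\u005c\u20a9":
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Under that assumed layout, the same Korean language ID (0x0412) with a sort ID of 0 would be the default Korean sort, and the "Unicode" sort is the one with sort ID 1.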

Blog - Comment List
  • Good day. When I have time I'll try to read more of your blog, but after being pointed this way from Raymond Chen's blog, here's a few comments.

    > Me: Well, because for Korean, it is also the
    > Won sign (₩).

    1. I doubt that. If it's anything like Japanese and the yen sign, then it isn't "also" the won sign, it _is_ the won sign. If it's anything like Japanese, there is no single-byte backslash, there might be a double-byte wide backslash but that's a different character and different codepoint, and of course there's at least two backslashes in Unicode but one of them has no counterpart in the ANSI code page.

    2. When I used the mouse to copy and paste from your posting, my submission here has a won sign in it. I wonder how that came about. I can't input it. My keyboard has a yen sign. (Actually my keyboard has two of them, it has a yen sign that looks like a yen sign and generates a yen sign, and it has a yen sign that looks like a backslash and generates a yen sign. Different graphics for historical reasons, different scan codes, but the same identical codepoint and character, a yen sign.)

    > Me: Well, ANSI does not have Korean in it,
    > and there is no Won.

    If it's anything like Japanese, that's completely wrong. The ANSI code page for Japanese is 932 and it has a single-byte yen sign, codepoint 0x5C. If I recall correctly the ANSI code page for Korean is 936 and it has a single-byte won sign, codepoint 0x5C.

    ASCII doesn't have Korean or a won sign or a yen sign. It also doesn't have any codepoint larger than 127.

    ANSI code pages for small character sets based originally on Italian alphabetic characters also don't have a won sign or a yen sign but do have codepoints going up to 255. I'd guess you meant to talk about these ANSI code pages, but these are kind of irrelevant in a conversation about ANSI code page 936, or a conversation about trying to create some form of compatibility between Unicode and ANSI code page 936.

    I don't know if the number of meaningful sort orderings for Korean is larger or smaller than for Japanese. In Japanese I can't imagine labelling just one of them as "the Japanese (Unicode) sort". I can guarantee that Windows doesn't have a sort ordering that would match my local phone book. Doesn't matter if it's Unicode or not. If you find what I did, you could create a sort ordering for it, but you don't have it now.
  • Hello Norman,

    #1 -- Actually, collation on Windows always uses Unicode, and every Korean sort that has ever existed on Windows has put that backslash character U+005c as looking like the Won.

    #2 -- It *is* the Won, I made it the real Won so that you did not have to have a Korean system locale to read the article and see it.

    I meant the Microsoft meaning of ANSI, where there is no dedicated Won character other than the thing that is the backslash in paths.

    What you see is exactly what I was describing, except you see it for the Yen. Except since Andrea had neither Japanese nor Korean settings, I was explaining it to someone who sees a backslash.

    None of the so called "Unicode" sorts for Japanese or for Korean are meaningful -- that was my point. They are not only meaningless, they are also useless....

  • 12/19/2004 10:24 PM Michael Kaplan

    > Actually, collation on Windows always uses
    > Unicode

    I'm still not quite sure of the relevance. A sort ordering based on codepoint comparisons could be useful for the same purposes as the invariant ordering, i.e. not for any purpose usable in communicating information to humans. All other sort orderings must be based on some characteristic other than the codepoint values, in which case it doesn't matter what encoding scheme you use, you must get the things sorted by the chosen characteristic.

    > and every Korean sort that has ever existed
    > on Windows has put that backslash character
    > U+005c as looking like the Won.

    Aha, that is useless. The character U+005c is a backslash, it isn't a single-byte character, and if Korean character sets are anything like Japanese then it doesn't even exist in Korean character sets (in other words it doesn't exist in Japanese character sets). It doesn't look like a won sign and it doesn't look like a yen sign, it just isn't displayable unless you change fonts. And it certainly isn't the single-byte character with codepoint 0x5c, because in ANSI code page 936 codepoint 0x5c is a won sign and in ANSI code page 932 codepoint 0x5c is a yen sign. Japanese character sets do include a wide character, double-byte backslash, which can be used in displaying a backslash if you don't mind the fact that it appears wider than the one you wanted.

    (By the way the ISO and ANSI committees on the C and C++ languages screwed up with it too.)

    > I meant the Microsoft meaning of ANSI

    Huh? I've read a few dozen MSDN pages which seem to be aware of the fact that there are a ton of ANSI code pages. One of those code pages includes a single-byte won character, one of those code pages includes a single-byte yen character, and others don't.
  • I guess what I am saying is that for Japanese and Korean, we MUST sort U+005c in the same way that we sort those other characters since they look the same. This happens even in the valid code pages.

    In addition, somebody thought it would be a good idea to add some sorts that *only* do this and nothing else. The point of my posts here is that this was eventually recognized as a bad idea and removed.

    Thus these articles that describe two awful sorts that we are happy to be rid of. :-)
  • 12/20/2004 4:54 PM Michael Kaplan

    > I guess what I am saying is that for
    > Japanese and Korean, we MUST sort U+005c
    > in the same way that we sort those other
    > characters since they look the same.

    But they don't look the same. U+005c is a backslash. The single-byte Korean codepoint 0x5c is a won sign and does not look like a backslash. The single-byte Japanese codepoint 0x5c is a yen sign and does not look like a backslash.

    Regarding looks, there is no way to display a U+005c without switching fonts. (Though at least in Japanese it's possible to display a wide character that looks very close to it because it's also a backslash.)

    Regarding sorting and other internal operations, U+005c exists even though it can't be displayed, and it should be sorted as the backslash that it is.
  • They look identical on a Japanese or a Korean setting. If you truly believe that two characters that look the same should sort differently, then you are in the minority in both Korea and Japan....
  • > They look identical on a Japanese or a
    > Korean setting.

    U+005c cannot be displayed in a Japanese setting. The character that looks closest, well sorry I don't want to take time to look it up now, but you know what a full-width double-byte backslash looks like. It does not look like a yen sign.

    In a NON-UNICODE sort, in a sort based on Japanese encoding, of course the codepoint's value 0x5c should sort as 0x5c. There it is not a backslash, it is a yen sign or it is a won sign or whatever. (And also of course this is still a sort for some purpose other than human interaction, since human-oriented sorts such as phone books still have the same issues that they have regardless of which binary encoding system is used for them.)

    > you are in the minority in both Korea and
    > Japan....

    Indeed I think so, and here's the reason: up to this point I was trying to give serious treatment to Unicode sorts as you were trying, instead of ANSI codepage sorts.
  • Actually, no. Even the Unicode sort does this, for valid reasons in the marketplace. Beyond that, note that there is no separate sort for non-Unicode. The "A" APIs convert and call the "W" APIs. There is only one set of tables for collation.

    Note that this happens on *all* Japanese and Korean sorts.

    THIS posting was a recapturing of an explanation for a sort that does this and nothing else. Since you think it's a bad idea anyway, perhaps we can just agree that a sort that does only this was a bad idea and then walk away....
  • > Actually, no. Even the Unicode sort does
    > this, for valid reasons in the marketplace.

    When you say "does this", I guess "this" means sort U+005c the same way as a code page's 0x5c was sorted? Which marketplace wants that?

    For a Unicode sort other than the invariant one, I have the impression that there was some effort to make the sort somewhat compatible with a non-Unicode sort, which would not have been a Microsoft "A" API calling a Microsoft "W" API, but would have been a computer-centric sort ordering based on code points in a national or linguistic code page. In Japanese, the yen sign comes between the left bracket and the right bracket, so you would want U+whatever the code point is for yen sign to come between U+005b and U+005d. In Korean you would want U+whatever the code point is for won sign to come between U+005b and U+005d. Then, even though old databases don't get their contents transcoded into Unicode, new databases that use Unicode could get sorted the same way as the old databases got sorted.

    Since a single-byte backslash didn't exist in the old code page, U+005c could be added to the new Unicode whichever-variant sort ordering, in places where other characters get added.

    Japanese government databases cannot store my wife's name. I recommended to them to misspell my wife's name to approximate the pronunciation rather than approximating the appearance, because other Japanese adaptations of foreign words usually try to approximate pronunciations. If a future government database uses Unicode, if it will become possible to store my wife's name, it doesn't necessarily mean that the new character should get squashed into the same sorting position that Latin-1 put it in.

    Of course all of the above are not phone book sorts, they are just ways to match the existing national-but-not-phonebook sorts.
  • Both characters (the Yen and the Won) sort in the same place as the backslash on Microsoft platforms when you specify you want Japanese or Korean as a default user locale.

    It's not *too* different from the old location (all symbols after all) but the difference between equal and not equal can be pretty staggering.

    I assume you run with a Japanese user locale. If you have never noticed a problem before then you likely do not object to the behavior.
  • 12/22/2004 1:58 PM Michael Kaplan

    > Both characters (the Yen and the Won) sort in
    > the same place as the backslash on Microsoft
    > platforms when you specify you want Japanese
    > or Korean as a default user locale.

    I'm afraid I don't understand this. When Japanese or Korean is the default user locale, there is no single-byte backslash, so how can anything else sort in the same place as that? If you mean that the character whose codepoint is 0x5c sorts in between characters whose codepoints are 0x5b and 0x5d, then I'd say it looks pretty reasonable.

    > I assume you run with a Japanese user locale.

    No kidding. If most computers sold in your country are sold with your country's locale set as the default, and most of the things that you use them for at both work and home work at least as well under that locale as under alternatives, then wouldn't you usually refrain from switching?

    > If you have never noticed a problem before
    > then you likely do not object to the
    > behavior.

    That much is true, for at least three reasons.

    1. I haven't usually needed to do that kind of sorting. When I need to remove duplicates from a list, it is convenient to sort and then weed out adjacent lines that are duplicates, but it doesn't really matter what order they're in.

    2. When Outlook Express doesn't even sort things in the same order as Outlook Express, it's good for laughs, but it isn't a problem (at least for me). I don't know which sort rules it's using and don't care. Again no objection, just a smirk.

    3. When Windows Explorer sometimes sorts things differently than the way previous versions of Windows Explorer used to sort them, sometimes it becomes a nuisance. I don't want to take the time to write details right at the moment. But this does not involve symbols, so again it is not an objection to the item that you're mentioning.
  • Try it this way, maybe it will help: :-)

    Pretend there is nothing but Unicode, since from my point of view, there isn't. The "A" version of the function just converts to Unicode anyway....

    If you are not using Unicode then the post is not relevant to you, but if you are not using Unicode then almost half of the characters that the government put into JIS X 0213 are unavailable to you, so I would recommend upgrading to Unicode at some point. :-)

    When you select the Korean LCID (0x0412), U+005c will sort equivalently to U+20a9 (WON SIGN). When you select the Japanese LCID (0x0411), U+005c will sort equivalently to U+00a5 (YEN SIGN).

    For the numbered sections (I will recommend you number from now on so the references are more obvious <grin>):

    #1 -- makes sense

    #2 -- no idea what you mean here, but I'd rather not go there, it stinks of an "ill wind" direction for conversation....

    #3 -- they use almost the same API except the Shell does the "sort ASCII digits as numbers" thing (cf: StrCmpLogicalW). I will be talking about that some other day, don't worry....
  • 12/23/2004 7:36 PM Michael Kaplan

    > if you are not using Unicode then almost
    > half of the characters that the government
    > put into JIS X 0213 are unavailable to you

    I'm not sure how many of the government's own machines have fonts capable of displaying those, and/or allow (politically allow) their usage. In business documents I've neither seen nor used them. In experimentation around 10 years ago, I saw them in EUC, didn't see them in Shift-JIS, and didn't see Unicode in use yet. Outside of experimentation I didn't see them used even in EUC.

    > When you select the Korean LCID (0x0412),
    > U+005c will sort equivalently to U+20a9
    > (WON SIGN). When you select the Japanese
    > LCID (0x0411), U+005c will sort equivalently
    > to U+00a5 (YEN SIGN).

    "Equivalently" sounds fine to me. Now, in each case does that location fall in between U+005b and U+005d? If yes, then it's compatible with sorting in each code page. If no, then I think it's pretty obvious why no one wanted to use it.

    > #2 -- no idea what you mean here

    It becomes visible depending on which newsgroups you subscribe to and if you watch carefully when it's downloading. (This doesn't mean you have to hunt it down, unless you wish. As mentioned I don't object but only smirk, and I don't think it's a problem.)

    > #3 [...] "sort ASCII digits as numbers"

    OK, I haven't read it enough. I thought I had vaguely read a summary that it sorted numerals as numbers and that it had been internationalized. If it's only intended to work with ASCII then it's meeting its intent, but the result is less consistent than the old style.
  • If you are not using the characters then no worries. I'd still recommend moving to Unicode as you will otherwise be more and more likely each year to start running into problems....

    It is not between U+005b and U+005d. The sort you refer to is not compatible with what Windows has been doing since NT 3.51 and Windows 95 JPN. No one has complained yet, though....

    For the "digits as numbers" stuff, it only does ASCII digits. Like I said, I'll talk about it more another day. :-)
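
The "sorts equivalently" behavior argued over in the comments above can be illustrated with a toy model. This is only a sketch: it is not how Windows collation tables work, the names `BACKSLASH_ALIAS`, `toy_sort_key` and `toy_equal` are invented for illustration, and it says nothing about where the symbol lands relative to U+005b and U+005d. All it shows is the "equal versus not equal" difference: under the Korean or Japanese LCID the backslash compares equal to the Won or Yen sign, and under any other LCID it does not.

```python
# Hypothetical table: the character U+005C is treated as under each LCID.
BACKSLASH_ALIAS = {
    0x0412: "\u20a9",  # Korean: WON SIGN
    0x0411: "\u00a5",  # Japanese: YEN SIGN
}


def toy_sort_key(text, lcid):
    """Return a comparison key in which U+005C is folded into its alias.

    Toy model only: real collation uses weight tables, not substitution.
    """
    alias = BACKSLASH_ALIAS.get(lcid)
    if alias is None:
        return text  # other locales: the backslash stays a backslash
    return text.replace("\\", alias)


def toy_equal(a, b, lcid):
    """Compare two strings for equality under the toy key."""
    return toy_sort_key(a, lcid) == toy_sort_key(b, lcid)


path_backslash = "C:\\temp"
path_won = "C:\u20a9temp"
path_yen = "C:\u00a5temp"

print(toy_equal(path_backslash, path_won, 0x0412))  # Korean
print(toy_equal(path_backslash, path_yen, 0x0411))  # Japanese
print(toy_equal(path_backslash, path_won, 0x0409))  # US English
```

In this model the first two comparisons come out equal and the third does not, which is the whole of what the one-character sort did.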
