Sorting it all Out Michael Kaplan's random stuff of dubious value Be sure to read the disclaimer here first!
I had this conversation a little over two years ago in the Netherlands at the end of the last day of a conference. It may not be word for word, though I actually think it comes pretty close (it's not like I had a tape recorder). The cookies were Pepperidge Farm Mint Milanos, but I do not like mint (I love the non-mint varieties, I am not sure how I ended up with the ones I did - it might have been a mistake to mention I did not like them).
Oh, also the name of the woman I talked to is not really Andrea; I just like the name and do not mind the nod to Jubal Harshaw....
Me: Andrea, would you like a cookie?
Andrea: Actually, I would like to know what the "Korean Unicode sort" is.
Me: I'd actually rather give you one of these cookies. They are really good. Plus it's less embarrassing than the answer to your question.
Andrea: I know you hate mint, you said so yesterday at the luncheon. C'mon Michael!
(Short pause)
Andrea: Or is it Mike? Or maybe michka like your mails?
Me: Michael's best.
Andrea: Ok, no Russian bears. So tell me, why is the Korean Unicode sort embarrassing? I could not find it defined anywhere, except maybe I found a vague hint to the 'Unicode collation' setting that was used in SQL Server 7.0, which could be Korean. Is that it?
Me: No, that's not what it is. Though SQL Server does have a "Korean Unicode collation" of its own that matches the one that used to be on Windows.
Andrea: Grrr. You are infuriating, Michael. What is the Korean Unicode sort? The one that is in SQL Server, the one that used to be in Windows, the one that is still in the header files. What is it?
Me: Well, it's almost the same sort as the one we use for English.
Andrea: Almost? How close is almost? Sounds like almost hitting a home run, but what kind? Was it an almost home run that was a strike out, or an almost home run that was a triple?
Me: Ouch! Well, if you put it that way, I guess you could say it's a strike out.
(I have an embarrassed smile at this point)
Me: We move one character.
Andrea: One character?
Me: One character.
Andrea: What character is it? Something insulting to a government? Did Microsoft upset the Korean premier or something?
Me: No, nothing like that. It's U+005c, the "REVERSE SOLIDUS". Also known as the backslash. Not insulting at all.
Andrea: One of us has to be missing something, Michael. Maybe you had better give me a cookie.
(She eats a cookie, and tries to hand the package back. I shake my head)
Andrea: So please, explain to me why the backslash has to be moved for Korea.
Me: Well, because for Korean, it is also the Won sign (₩).
Andrea: You said in your talk today that there is room for over a million characters in Unicode. There is no room for a dedicated Won?
Me: Oh, there is a dedicated Won Sign at U+20a9. It's just that in most Korean fonts a character that looks like a Won is put in the slot for U+005c, and since the characters look the same we try to make sure that they are treated as if they were the same.
Andrea: Ok, I see that. But why is it called the Korean Unicode sort? If it's legacy then that would make it the Korean ANSI sort, right?
Me: Well, ANSI does not have Korean in it, and there is no Won.
Andrea: You know what I mean, Michael. Are you this exasperating when you talk with your girlfriend?
Me: Oh, I... I'm between girlfriends at the moment.
Andrea: I WONder why....
Me: Hey now!
(Andrea is wearing quite an impish grin at this point)
Andrea: Just kidding. But I was up too late last night and you already gave me the cookies. So I have no real need to flirt when I am teasing at this point.
Me: Hmmmm, no one ever used to have a need. Anyway, I know what you mean. It probably would have made more sense to tie it to the Korean standard, except that's encoding and not sorting. And they basically do put the won at 0x5c in their encoding standard, so MS is just trying to be consistent. It would have been really weird trying to tie to KSC-5601.
Andrea: I can definitely see that. So, what about the rest of the Hangul and Hanja and Jamo and whatnot that is used by Koreans?
Me: Well, now you understand why it was probably removed from Windows -- because it does not really do much for Korean.
Andrea: But it's still in SQL Server. They didn't get the memo?
Me: I know you think that I am a bigwig at Microsoft, but I'm not. I was offered a job there but I haven't even started yet. And I am definitely not "in the know" about what they do in SQL Server.
Andrea: No need to be shirty, dear. I understand. I apologize for thinking you were important.
(I grimace at this point)
Andrea: Ok, and I apologize for teasing you now. But back to the Korean thing.... do you have a guess?
Me: Oh, definitely. I just don't know if I am right.
Andrea: So what is the theory?
Me: My guess is that since there is a serious worry about backward compatibility and sort orders in SQL Server, they can't really get rid of something as easily, even if it is useless. I guess they could have hacked it since it's only different by one character, but they are a team that is astoundingly against hacks. That's something I can respect.
Andrea: So can I. Probably worth a KB article, at least.
Me: Maybe. If PSS gets customers wondering where good old 0x00010412 went, I'll suggest it.
(She eats another cookie)
Andrea: Ok. I'm sorry to monopolize your time like this.
Me: No worries, the group is gone, the conference is mostly over. Hell, I'd probably be flying out tonight if there were a flight. You can come out with us tonight if you want. Well, that is if we are going anywhere.
Andrea: Actually, you can come out with us. My friends are more socially adept than yours.
Me: Probably true. And more than me, too.
Andrea: One more question and we can head back to what's left of the group.
Me: Ok. What's the question?
Andrea: What's up with the Japanese (Unicode) sort?
Needless to say, the conversation devolved at that point. But Andrea did finish the cookies. I did go out with four of Andrea's friends that night and drank more than I should have. The flight home was harder with a hangover, and to be perfectly honest it was not until I sat down to try and remember the whole conversation earlier tonight that I remembered I was supposed to follow up with PSS.
Maybe the blog entry is good enough at this point? :-)
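For anyone wondering what is actually packed into 0x00010412, here is a minimal sketch that pulls the value apart. It assumes the usual layout behind the MAKELCID and MAKELANGID macros in the Windows headers (language ID in the low 16 bits, sort ID in the next 4 bits; primary language in the low 10 bits of the language ID). The function name split_lcid is made up for illustration and is not part of any Windows or SQL Server API.

```python
# Hypothetical helper: decompose a Windows LCID into its fields.
# Assumed layout (per the MAKELCID / MAKELANGID macros):
#   bits  0-15  language ID  (bits 0-9 primary language, bits 10-15 sublanguage)
#   bits 16-19  sort ID

LANG_KOREAN = 0x12          # primary language ID for Korean (winnt.h)
SORT_KOREAN_UNICODE = 0x1   # the "Korean Unicode sort" sort ID (winnt.h)


def split_lcid(lcid: int) -> dict:
    """Return the sort ID and language ID fields of an LCID."""
    lang_id = lcid & 0xFFFF
    return {
        "sort_id": (lcid >> 16) & 0xF,
        "lang_id": lang_id,
        "primary_lang": lang_id & 0x3FF,
        "sub_lang": lang_id >> 10,
    }


parts = split_lcid(0x00010412)
print({name: hex(value) for name, value in parts.items()})
```

Read this way, the value is just Korean (language ID 0x0412) with the sort ID field set to the Korean Unicode sort rather than the default.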