Sorting it all Out Michael Kaplan's random stuff of dubious value Be sure to read the disclaimer here first!
Previous posts in this series:
Now that I have been talking about collation in Windows across ten separate blog posts, I thought it might make sense to talk about the characters in Unicode that take up more space in the standard than any others -- ideographs.
Whether you call them Han or Hanja or Kanji, they are all basically Chinese characters, as used in Chinese, Korean, or Japanese.
The story for collating these characters was not created in a vacuum, but the sources used to build it were neither simple nor uncomplicated.
The collation story is in fact kind of a messy one, due to many different factors:
But the goal is quite simple:
There are many different collations across the various locales, and I have talked about various issues in many different posts, from "Why is there no pronunciation-based sort for Japanese?" to "Supporting a pronunciation based sort for East Asian languages..." to "Is it Macau or is it Macao?" to "'Acceptable' Japanese sort order?" and more.
The simple fact is that trying to order over 70,000 items is going to be complicated, though hopefully as intuitive as it can be....
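To make the "many different collations" point a bit more concrete, here is a tiny sketch in Python. It is not the Windows collation data or anything close to it: the stroke and pronunciation tables below are hypothetical three-entry lookups I typed in by hand (with tone marks left off the pinyin), and the function names are made up for this illustration. All it shows is that the same three ideographs come out in three different orders depending on which key you decide to sort by.

```python
# Toy illustration only: these hand-built tables are NOT the Windows
# collation tables, just three characters picked to show the idea.
CHARS = ["一", "二", "三"]  # one, two, three

# Hypothetical lookup tables (assumptions made for this sketch).
STROKES = {"一": 1, "二": 2, "三": 3}            # stroke counts
PINYIN = {"一": "yi", "二": "er", "三": "san"}   # Mandarin readings, tones omitted


def by_code_point(chars):
    """Order by raw Unicode code point (what a naive binary sort does)."""
    return sorted(chars, key=ord)


def by_stroke_count(chars):
    """Order by stroke count, falling back to code point on ties."""
    return sorted(chars, key=lambda c: (STROKES[c], ord(c)))


def by_pronunciation(chars):
    """Order by romanized reading, falling back to code point on ties."""
    return sorted(chars, key=lambda c: (PINYIN[c], ord(c)))


if __name__ == "__main__":
    print("code point:   ", by_code_point(CHARS))
    print("stroke count: ", by_stroke_count(CHARS))
    print("pronunciation:", by_pronunciation(CHARS))
```

Even with just three characters, no two of the orderings agree, which is a small hint of what happens when the list is tens of thousands of items long and the expectations differ by locale.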
Now prior to Vista there were several specific problems in the tables:
Even after addressing all of these problems, there was a problem (in some people's minds) starting with Vista -- an issue I hypothetically discussed in "If you add enough characters to a sort, intuitive distinction can suffer" and then more directly in "On distinctions that are primarily with [and without] difference." That latter post even had a nice high-level narrative of several of the various East Asian sorts.
In the next post I'll dig in a bit and provide some examples with different sort keys across different locales....
This post brought to you by ⑾ (U+247e, a.k.a. PARENTHESIZED NUMBER ELEVEN)
Previous posts in this series: Part 0: The empty string sorts the same in every language Part 1: The