A&P of Sort Keys, part 11 (aka It's not like ideographic sorts were developed idiopathically)

Sorting it all Out
Michael Kaplan's random stuff of dubious value


Previous posts in this series:

Now that I have been talking about collation in Windows across ten separate blog posts, I thought it might make sense to talk about the characters in Unicode that take up more space in the standard than any others -- ideographs.

Whether you call them Han or Hanja or Kanji, they are all basically Chinese characters used in either Chinese, Korean, or Japanese.

The collation for these characters was not created in a vacuum, but neither were there simple, uncomplicated sources to base it on.

The collation story is in fact kind of a messy one, due to many different factors:

  • The tables were mostly not updated for multiple versions of Windows despite the fact that more and more characters were coming into general use;
  • Most of the characters that were not added to the tables had some weight, just not the one to put them in the correct order;
  • Some of the characters actually had no weight, with the predictable results thereof;
  • In the case of pronunciation-based sorts, the "most common" pronunciation of some characters actually changed over the course of the last 10+ years.

But the goal is quite simple:

  1. In the default table, put all of the ideographs after almost everything else in Unicode -- first regular CJK, then Extension A, then Extension B, in code point order for each section.
  2. For each specific East Asian language, put the relevant ideographs in the expected order for the expected sort in question.
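The first goal can be sketched in a few lines of Python. This is only an illustration of the ordering policy described above, not the actual Windows weight tables: the function names are hypothetical, the block ranges are the ones from the Unicode block charts, and non-ideographs are simply left in code point order as a placeholder (the sketch also puts ideographs after everything else, rather than "almost" everything).

```python
# Hypothetical sketch of the default-table policy: ideographs sort after
# other characters, in the section order CJK, Extension A, Extension B,
# and in code point order within each section.
# Ranges are the Unicode block ranges; they are not Windows sort weights.
IDEOGRAPH_SECTIONS = [
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs ("regular CJK")
    (0x3400, 0x4DBF),    # CJK Unified Ideographs Extension A
    (0x20000, 0x2A6DF),  # CJK Unified Ideographs Extension B
]


def default_rank(ch):
    """Return (section, code point); section 0 means 'not an ideograph'."""
    cp = ord(ch)
    for section, (low, high) in enumerate(IDEOGRAPH_SECTIONS, start=1):
        if low <= cp <= high:
            return (section, cp)
    # Placeholder assumption: everything else keeps code point order.
    return (0, cp)


def default_sort_key(text):
    """Build a comparable key for a string, one rank per character."""
    return [default_rank(ch) for ch in text]


# U+3400 (Extension A), U+4E00 and U+4E01 (regular CJK), U+20000 (Extension B)
sample = ["\u3400", "\u4e00", "a", "\U00020000", "\u4e01"]
ordered = sorted(sample, key=default_sort_key)
print([f"U+{ord(s):04X}" for s in ordered])
```

Note that Extension A sorts after regular CJK here even though its code points are lower, which is the point of ranking by section first. A language-specific table would then override these ranks for the ideographs that language cares about.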

There are many different collations across the various locales, and I have talked about various issues in many different posts, from Why is there no pronunciation-based sort for Japanese? to Supporting a pronunciation based sort for East Asian languages... to Is it Macau or is it Macao? to 'Acceptable' Japanese sort order? and more.

The simple fact is that trying to order over 70,000 items is going to be complicated, though hopefully as intuitive as it can be....

Now prior to Vista there were several specific problems in the tables:

  • Missing ideographs
  • Some overlap between the language-specific table and the extras meant to be put at the end
  • A few mistakes
  • A few changes in official source data (like for most common pronunciation)
  • Missing support of the expected repertoire in several national standards.

Even with all of these problems addressed, there was still a problem (in some people's minds) starting with Vista -- an issue I discussed hypothetically in If you add enough characters to a sort, intuitive distinction can suffer and then more directly in On distinctions that are primarily with [and without] difference. That latter post even had a nice high-level narrative of several of the various East Asian sorts.

In the next post I'll dig in a bit and provide some examples with different sort keys across different locales....

 

This post brought to you by ⑾ (U+247e, a.k.a. PARENTHESIZED NUMBER ELEVEN)
