It has been a good 29 months since I posted "Is it Macau or is it Macao?", which (among other things) pointed out that a primarily Traditional Chinese locale was basically using the Simplified Chinese sorting data provided by standards bodies in China.

And I pointed out how that was not perfect, but people seemed to think it was "good enough" or at least they were not complaining loudly enough to have anyone want to look into the matter....

Since that time I have been asked by several different people if I could quantify the difference.

The engineer in me agrees with the sentiment -- it feels unsatisfying to pick the wrong answer for the mere reason that it will be "as right as it can be without doing actual work" since that feels too much like intentionally making a bad landing.

And even the non-engineer reads about it and the question occurs to them -- how close is the answer?

How bad is it able to be and still fit within that "good enough" label?

So, looking at the total stroke data provided by standards bodies in China for all 70,195 ideographs in Unicode 5.0 and comparing it with the 54,195 ideographs for which stroke count data was provided by Taiwan standards bodies, how different are the stroke counts for the ideographs that both sources cover?

Well, 9,768 (or 18%) of them have different stroke counts between the two standards.

Here are the summary totals for the amount of difference:

Stroke Count Difference    Ideograph Count
1                          9,045
2                          675
3                          44
4                          2
5                          1
6                          1
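As a quick sanity check on the arithmetic, here is a minimal sketch that adds up the rows of the table and recomputes the percentage. The variable names are made up for illustration, and the denominator assumes the 54,195 figure quoted above is the number of ideographs compared.

```python
# Rows copied from the summary table: stroke-count difference -> ideograph count.
difference_counts = {1: 9045, 2: 675, 3: 44, 4: 2, 5: 1, 6: 1}

# Assumption: 54,195 (the Taiwan figure quoted in the post) is the number compared.
compared_ideographs = 54195

total_differing = sum(difference_counts.values())
share_differing = total_differing / compared_ideographs

# Share of the differing ideographs that are off by exactly one stroke.
share_off_by_one = difference_counts[1] / total_differing

print(f"{total_differing:,} ideographs differ ({share_differing:.1%} of those compared)")
print(f"{share_off_by_one:.1%} of the differing ideographs are off by a single stroke")
```

The rows do add up to the 9,768 quoted above, and the great majority of the mismatches are off by just one stroke.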

Scary numbers, huh?

Well, it scared me a bit. And made me really wonder how we decide what looks like it is close enough....

It might be interesting to run the numbers and see how much difference it would make in the actual order of characters if the Macao data were sorted using the Traditional Chinese numbers. Perhaps that would be an interesting topic for another day....
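To give a rough feel for what such a comparison might look like, here is a toy sketch that orders the four extreme code points listed at the end of this post by each set of stroke counts. The function name and the tie-break by code point are assumptions made purely for illustration; this is not how any shipping collation is necessarily built.

```python
# Code point -> total stroke count, for the four extreme cases listed in this post.
taiwan_strokes = {0x272F0: 19, 0x28F71: 24, 0x27055: 19, 0x25F22: 20}
china_strokes = {0x272F0: 13, 0x28F71: 29, 0x27055: 15, 0x25F22: 16}


def stroke_order(strokes):
    """Order code points by stroke count; ties broken by code point (an assumption)."""
    return sorted(strokes, key=lambda cp: (strokes[cp], cp))


taiwan_order = stroke_order(taiwan_strokes)
china_order = stroke_order(china_strokes)

# Code points that land in a different position under the other table.
moved = [t for t, c in zip(taiwan_order, china_order) if t != c]

print("Taiwan order:", [f"U+{cp:04X}" for cp in taiwan_order])
print("China order: ", [f"U+{cp:04X}" for cp in china_order])
print("Moved:       ", [f"U+{cp:04X}" for cp in moved])
```

Even in this tiny sample, two of the four characters trade places depending on whose numbers are used.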

For now, here are those numbers from the very bottom of the list, for your enjoyment:

Biggest unmatched stroke counts

Unicode code point    Taiwan stroke count    China stroke count
U+272F0               19                     13
U+28F71               24                     29
U+27055               19                     15
U+25F22               20                     16
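For anyone who wants to reproduce this kind of comparison, here is a minimal sketch using just the four rows above. The dictionaries and the `stroke_differences` helper are hypothetical; a real run would load the full stroke tables from each standards body rather than hard-coding four entries.

```python
from collections import Counter

# The four rows from the table above: code point -> total stroke count.
taiwan_strokes = {0x272F0: 19, 0x28F71: 24, 0x27055: 19, 0x25F22: 20}
china_strokes = {0x272F0: 13, 0x28F71: 29, 0x27055: 15, 0x25F22: 16}


def stroke_differences(first, second):
    """Return {code point: first - second} for shared code points that disagree."""
    shared = first.keys() & second.keys()
    return {cp: first[cp] - second[cp] for cp in shared if first[cp] != second[cp]}


differences = stroke_differences(taiwan_strokes, china_strokes)

# Tally by the absolute size of the gap, the same way the summary table is built.
histogram = Counter(abs(delta) for delta in differences.values())

# List the largest gaps first; the sign shows which source counts more strokes.
for cp, delta in sorted(differences.items(), key=lambda item: -abs(item[1])):
    print(f"U+{cp:04X}: Taiwan minus China = {delta:+d}")
```

Keeping the sign of the difference makes it easy to see that the extremes run in both directions: three of the four have more strokes in the Taiwan data, one has more in the China data.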

Makes you wonder how well the data represents the actual counts, let me tell you....

Well, here they are, with the PMingLiU-ExtB provided ideograph on the left and the SimSun-ExtB provided glyph on the right:

And everyone who is curious can now look at these very extreme cases in each direction and decide what they think is the source of the difference -- linguistic preference, orthographic choice, typographic tradition, creeping errors in standards, or whatever.

Anyone in-country want to take a crack at this one?

And maybe, using the 18% figure, people in Maca[o|u] can decide how bad things have to be before they are no longer good enough. :-)

 

This post brought to you by 𧋰, 𨽱, 𧁕, and 𥼢 (U+272F0, U+28F71, U+27055, and U+25F22, four Extension B CJK ideographs)