UCS-2 to UTF-16, Part 2: A&P of a 'linguistic character'

Sorting it all Out
Michael Kaplan's random stuff of dubious value
Be sure to read the disclaimer here first!


A&P in the title stands for Anatomy and Physiology, since in some alternate universe I went ahead and got a medical degree and made a good friend (a friend who, in that alternate universe, is still alive) proud of me. Ignore it, the deeper meaning of the title, even when it exists, isn't really important. :-)

Now that the previous blog has told everyone who was thinking "Let's update to support UTF-16 instead of UCS-2" that they need to just back the hell off a few steps, I thought it might be good to go a little deeper, so you can see that even though you may have been completely and totally wrong, there is a good basis for thinking the way you were, and you can use that knowledge to feel better about future steps. :-)

In theory, there is very little difference between the general case of linguistic character as I defined it last time and the specific case that got everyone freaking out about UCS-2 (surrogate pairs).

In practice, all linguistic characters fall into one of three categories:

  • A Surrogate Pair (two code units, a high and a low surrogate), neither of which is itself a character, linguistic or otherwise. The cheese may stand alone, but surrogate code units didn't teach it how, if you know what I mean;
  • A Grapheme Cluster (to use Unicode's term) aka Text Element (to use Microsoft's in the .NET Framework), made of two or more code units, at least some though not all of which can be independently thought of as being linguistic characters themselves;
  • A Sort Element (to use my term, via this blog) aka Compression (to use Microsoft's term) aka Contraction (to use Unicode's), made up of two or more code units, all of which can be independently thought of as being linguistic characters themselves.
To show an example of each:
  • 𐎀, aka UGARITIC LETTER ALPA, aka U+10380, aka U+d800 U+df80 -- this one is four bytes in UTF-8, two code units in UTF-16, and one code unit in UTF-32 -- interestingly, always four bytes!
  • Ṹ, aka the fully decomposed form of U+1e78 (LATIN CAPITAL LETTER U WITH TILDE AND ACUTE), aka U+0055 U+0303 U+0301 -- this one is five bytes in UTF-8, three code units or six bytes in UTF-16, and three code units or 12 bytes in UTF-32;
  • dzs, a sequence of letters that collates together in Hungarian, aka U+0064 U+007a U+0073 -- this one is three bytes in UTF-8, three code units or six bytes in UTF-16, and three code units or 12 bytes in UTF-32.
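
For anyone who would rather count than take my word for it, the numbers in the three examples above can be checked mechanically. What follows is only a minimal sketch in Python (not a language this blog otherwise uses); the helper names `sizes` and `surrogate_pair` are made up for illustration.

```python
def sizes(text):
    """Return (UTF-8 bytes, UTF-16 code units, UTF-32 code units) for text."""
    utf8_bytes = len(text.encode("utf-8"))
    utf16_units = len(text.encode("utf-16-le")) // 2  # 2 bytes per code unit
    utf32_units = len(text.encode("utf-32-le")) // 4  # 4 bytes per code unit
    return utf8_bytes, utf16_units, utf32_units


def surrogate_pair(code_point):
    """Split a supplementary code point (above U+FFFF) into (high, low)."""
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


samples = {
    "surrogate pair": "\U00010380",            # UGARITIC LETTER ALPA
    "grapheme cluster": "\u0055\u0303\u0301",  # U + combining tilde + combining acute
    "sort element": "dzs",                     # Hungarian dzs
}

for name, text in samples.items():
    print(name, sizes(text))

high, low = surrogate_pair(0x10380)
print(f"U+{high:04X} U+{low:04X}")  # U+D800 U+DF80
```

The `-le` encodings are used so that Python does not prepend a byte order mark, which would otherwise inflate the UTF-16 and UTF-32 counts.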

Now one can argue at length about the relative consequences of truncating any of these sequences of code units. You might even make an argument that truncation is most serious in the first case and then gets less and less serious as you go down the list.

Truncation in this case is a superset of any operation that splits apart the component pieces before a user's eyes, including cursor movement through the string, deletion of a single "character" via the delete key, cutting off the end to fit in a buffer, or whatever. Anything that would show a lack of respect for a linguistic character's boundaries. Everyone gets involved here -- fonts, keyboards, you name it...
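
To make the damage concrete, here is a hedged sketch of the naive approach: chopping a string at a fixed number of UTF-16 code units with no regard for what sits on either side of the cut. The names `utf16_units` and `naive_truncate` are hypothetical, and Python is used purely for illustration.

```python
import struct


def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return list(struct.unpack(f"<{len(data) // 2}H", data))


def naive_truncate(text, max_units):
    """Keep the first max_units UTF-16 code units, ignoring all boundaries."""
    return utf16_units(text)[:max_units]


# Category 1: the pair is split, leaving a lone high surrogate.
print([hex(u) for u in naive_truncate("\U00010380", 1)])          # ['0xd800']

# Category 2: the acute is lost, so the user now has U WITH TILDE.
print([hex(u) for u in naive_truncate("\u0055\u0303\u0301", 2)])  # ['0x55', '0x303']

# Category 3: the Hungarian dzs has quietly become dz.
print([hex(u) for u in naive_truncate("dzs", 2)])                 # ['0x64', '0x7a']
```

Only the first result is outright invalid; the other two are perfectly well-formed text that just happens to say something the user never typed.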

From one point of view you might be right if that is your argument.

But as long as we are choosing to call them linguistic characters I am going to channel that Spock-with-a-beard version of me that managed to avoid the scandal with the Dean's daughter and got a PhD in linguistics, and claim that they each have the potential to have unique meaning to a user who took the time and effort to put them into data.

In my opinion, you get no points for vicious truncation just because it doesn't look as bad.

In which case anyone with an eager willingness to truncate should consider themselves a bloodthirsty linguistic character murderer. Sentence suspended by me, since there really isn't a competent court with the authority to punish this crime. :-)

Because if you are working on or using a computer program that displays, stores, or in any way uses data, then you have a right to not have someone change the meaning of that data in the name of expediency.

And truncating a linguistic character has the potential to do just that.

Okay, now that I have been all crazy about this, I'll point out that only the first two of these three categories have any supported way for a program looking for safe truncation points to detect them.

Which means if I made you feel guilty, you can take some solace in the fact that just about everyone is going to be doing it some of the time....

But that fact is worth keeping in mind while one carefully does one's best to avoid problems with the categories that one can easily do something about.
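
As a rough illustration of what "doing one's best" might look like for the first two categories, here is a sketch in Python. It is an approximation and nothing more: it treats any code point whose general category starts with "M" (a mark) as continuing the previous cluster, which is far short of the full grapheme cluster rules in Unicode's text segmentation annex (decomposed Korean syllables, for one, are not handled). The names `utf16_len`, `is_extender`, and `safe_truncate` are invented for this example.

```python
import unicodedata


def utf16_len(ch):
    """Number of UTF-16 code units needed for one code point."""
    return 2 if ord(ch) > 0xFFFF else 1


def is_extender(ch):
    """Assumption: any mark (category Mn, Mc, Me) continues the cluster."""
    return unicodedata.category(ch).startswith("M")


def safe_truncate(text, max_units):
    """Longest prefix that fits in max_units UTF-16 code units without
    splitting a surrogate pair or cutting in front of a combining mark."""
    used = 0
    safe = 0  # index of the last boundary known to be safe
    for i, ch in enumerate(text):
        if not is_extender(ch):
            safe = i  # cutting just before ch splits nothing
        used += utf16_len(ch)
        if used > max_units:
            return text[:safe]
    return text


# The whole cluster is dropped rather than left as U or U-with-tilde.
print(safe_truncate("a\u0055\u0303\u0301", 3) == "a")  # True

# The supplementary character is dropped rather than split in half.
print(safe_truncate("a\U00010380", 2) == "a")          # True

# Category 3 sails right through: nothing here knows about Hungarian.
print(safe_truncate("dzs", 2))                         # dz
```

The last line is the point: catching a sort element would take collation data for the user's language, and no amount of code unit inspection will supply that.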

Okay, that's it for now, next time I'll talk about those various operations and how to go about them....


This blog brought to you by Ṹ (U+1e78, aka LATIN CAPITAL LETTER U WITH TILDE AND ACUTE)

Blog - Comment List
  • And let's not forget the IVS (Ideographic Variation Sequences) :-)

  • No worries there -- for our present purposes, they fall into Category #2. :-)

  • "[...] at least some though not all of which can be independently thought of as being linguistic characters themselves" isn't strictly true.  Decomposed Korean syllables are grapheme clusters, but each component is a linguistic character.

  • Ah yes, that is true. Though most users would consider the net effect of truncation to be just as destructive to meaning....

  • Well, the same is true of almost any sort of trunca

  • Exactly my point. :-)

  • Previous blogs in this series of blogs on this Blog: Part 0: The intro, sans content Part 1: Getting
