Previous blogs in this series of blogs on this Blog:
Okay, so far we have introduced the topic, pointed out that nine-tenths of what a person was going to run off and do is probably too much, and then jumped in to define the sequences of storage characters that meet the definition of what a user calls a character.
So what are the things you can do with them, if you are armed with this knowledge?
Well, since we are focusing on "user" characters, we'll start there -- with users moving through a text stream. You know, using the arrow keys to move either forward or backward through the text, and watching the cursor as they go.
The ideal behavior that the user expects without thinking about it is not too complicated: if they think of something as a single character, then they do not expect it to take multiple arrow keypresses to move through it.
In other words, they want the computer to understand the text in the same way that they do.
Now although that is simple conceptually, it is not always supportable by software today -- especially when one considers sort elements, where there is no easy function to call that finds those boundaries. The underlying data exists in collation algorithms (for example Microsoft's and the UCA's) and is used to define the sorting behavior of those elements, but when the elements are independent letters like the Traditional Spanish ch or the Hungarian dzs, there generally isn't an easy way to query for the information.
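To make the gap concrete, here is a minimal sketch of what a developer ends up doing by hand when no such query exists. Everything in it is an assumption for illustration: the `CONTRACTIONS` table is hand-written (real collation data is far larger), the locale keys are made-up labels, and greedy longest-match is a simplification that will get some words wrong (a Hungarian compound can put a "d" next to a "zs" that are not one letter).

```python
# Hand-written contraction tables; an assumption for illustration only.
# The locale keys are hypothetical labels, not real locale identifiers.
CONTRACTIONS = {
    "es-traditional": ["ch", "ll"],
    "hu": ["dzs", "cs", "dz", "gy", "ly", "ny", "sz", "ty", "zs"],
}


def sort_elements(text, locale):
    """Split text into sort elements using greedy longest match."""
    # Try longer contractions first so that "dzs" wins over "dz".
    table = sorted(CONTRACTIONS.get(locale, []), key=len, reverse=True)
    elements = []
    i = 0
    while i < len(text):
        for contraction in table:
            candidate = text[i:i + len(contraction)]
            if candidate.lower() == contraction:
                elements.append(candidate)
                i += len(contraction)
                break
        else:
            # No contraction starts here, so this is a one-character element.
            elements.append(text[i])
            i += 1
    return elements
```

With this table, `sort_elements("dzsem", "hu")` treats the leading "dzs" as one element, while the same call with an unknown locale falls back to one element per character.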
Now in this case there has not been such a method for as long as computers or even typewriters have been there, so it may be stretching the definition of expectation to assume that people would expect computers to understand the boundaries of a sort element when it comes to cursor movement. At best they would be pleasantly surprised if this happened, and at worst they would think of it as a bug.
Finding out whether this is learned behavior or an intuitive expectation would make for a fascinating study if the parameters for determining the truth could be defined. It makes me jealous of the ClearType folks when I compare the number of reading-related studies they do with how large the budget is to commission such a study in Windows International.
$0.
So realistically we can put that third type of linguistic character aside for now.
And possibly forever. :-)
Looking at the other two types of linguistic characters -- surrogate pairs (aka supplementary characters) and grapheme clusters (aka text elements) -- generally users don't want or expect to require multiple keypresses to get through what they think of as a single character.
This raises an interesting question for a developer performing an operation that is enumerating a string.
Should the developer ever care about the length of a string or substring, as counted in storage characters, when they are scrolling through a character?
I mean, take the word 𐎀𐎇𐎖 (this is not a word so much as a stream of Ugaritic letters).
𐎀𐎇𐎖
Now this is three "characters":
U+10380 U+10387 U+10396
Using a modern browser like Firefox I have no problem seeing the string treated as three characters, despite the fact that under the covers it is actually six UTF-16 code units:
U+d800 U+df80 U+d800 U+df87 U+d800 U+df96
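A quick way to see both views of that string is to encode it as UTF-16 and list the 16-bit units. This is just a sketch in Python (where a string is indexed by code point, unlike the UTF-16 strings of Windows and .NET); the helper name `utf16_code_units` is my own.

```python
def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-be")  # big-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


# The three Ugaritic letters from above.
word = "\U00010380\U00010387\U00010396"

code_points = [f"U+{ord(ch):04X}" for ch in word]
units = [f"{unit:04x}" for unit in utf16_code_units(word)]

print(len(word), code_points)  # three code points
print(len(units), units)       # six UTF-16 code units
```

Each supplementary character becomes a high surrogate (here always d800) followed by a low surrogate, which is why the storage length is twice the number of letters.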
And if you try to click in the middle of a letter you are never given the opportunity -- it always picks a side and puts you in one spot or the other.
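The behavior described here can be sketched over a list of UTF-16 code units. This is not how any particular browser does it; the function names are hypothetical, and snapping a mid-pair click to the left side is an arbitrary choice on my part.

```python
def is_high_surrogate(unit):
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit):
    return 0xDC00 <= unit <= 0xDFFF


def move_right(units, pos):
    """Return the cursor position after one right-arrow press."""
    if pos >= len(units):
        return len(units)
    # Step over both halves of a well-formed surrogate pair at once.
    if (is_high_surrogate(units[pos]) and pos + 1 < len(units)
            and is_low_surrogate(units[pos + 1])):
        return pos + 2
    return pos + 1


def move_left(units, pos):
    """Return the cursor position after one left-arrow press."""
    if pos <= 0:
        return 0
    if (pos >= 2 and is_low_surrogate(units[pos - 1])
            and is_high_surrogate(units[pos - 2])):
        return pos - 2
    return pos - 1


def snap(units, pos):
    """Move a position that lands inside a surrogate pair to the pair's left edge."""
    if (0 < pos < len(units) and is_low_surrogate(units[pos])
            and is_high_surrogate(units[pos - 1])):
        return pos - 1
    return pos


# The six code units of the Ugaritic string.
ugaritic = [0xD800, 0xDF80, 0xD800, 0xDF87, 0xD800, 0xDF96]
```

Here `snap(ugaritic, 3)` lands on 2 rather than in the middle of the second letter, and three calls to `move_right` starting from 0 reach the end of the string.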
So clearly there are times that a developer might need to care about this fact, and therefore there should be a good way to provide this.
Unfortunately, generally speaking, there isn't one inline with the text.
But there are things like .NET's StringInfo class, which will help map the storage characters to the linguistic characters -- something I have talked about before.
Though as I pointed out in "Sometimes you need more than StringInfo", there are cases in between the second and third category that actually do have data somewhere.
Thus in Assamese ম্পা is four Unicode code points (U+09ae U+09cd U+09aa U+09be), two text elements according to StringInfo, but we know from prior "Virama-esque" posts like this one that this is actually a conjunct. So as "Sometimes you need more than StringInfo" points out, there is a construct that the computer understands that is not being provided as easily to developers.
Now I am tempted to call this yet another category, and it really is a grapheme cluster that is not a text element.
I think in the long run it would be better if Microsoft treated this as a limitation/bug in StringInfo and its definition of text element and either fixed it or added a new construct to handle this additional understanding of "characterness" that Uniscribe clearly understands even if StringInfo does not.
In other words, Microsoft ought to provide the mechanisms that it actually does expose in easier ways here.
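The two-text-element answer is easy to reproduce with a rough approximation of the rule: attach each combining mark to the base character before it. To be clear, this is a sketch of that general rule in Python, not StringInfo's actual implementation, and `text_elements` is a name I made up.

```python
import unicodedata


def text_elements(text):
    """Group each base character with the combining marks that follow it."""
    elements = []
    for ch in text:
        # General categories Mn, Mc and Me are the combining marks.
        if elements and unicodedata.category(ch).startswith("M"):
            elements[-1] += ch
        else:
            elements.append(ch)
    return elements


# U+09AE MA, U+09CD VIRAMA, U+09AA PA, U+09BE VOWEL SIGN AA
conjunct = "\u09ae\u09cd\u09aa\u09be"

for element in text_elements(conjunct):
    print([f"U+{ord(ch):04X}" for ch in element])
```

The virama attaches to the MA and the vowel sign attaches to the PA, so the split falls right in the middle of what a reader sees as one unit.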
Because no method should break up a conjunct, or put a cursor in the middle of one. But how is a developer supposed to support all that without a way to get at the data?
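Without that data, about the best a developer can do is guess. The sketch below merges a text element that ends in a virama with the letter that follows it, and derives cursor stops from the result. It is an assumption-laden illustration, not what Uniscribe does: it only knows the Bengali/Assamese virama, it ignores ZWJ/ZWNJ, and it cannot know whether a given font actually forms the conjunct. All names are hypothetical.

```python
import unicodedata

VIRAMA = "\u09cd"  # BENGALI SIGN VIRAMA; other scripts have their own


def text_elements(text):
    """Group each base character with the combining marks that follow it."""
    elements = []
    for ch in text:
        if elements and unicodedata.category(ch).startswith("M"):
            elements[-1] += ch
        else:
            elements.append(ch)
    return elements


def conjunct_clusters(text):
    """Merge text elements that are joined by a virama into one cluster."""
    clusters = []
    for element in text_elements(text):
        # A cluster ending in a virama joins with a following letter.
        joins = (clusters and clusters[-1].endswith(VIRAMA)
                 and unicodedata.category(element[0]) == "Lo")
        if joins:
            clusters[-1] += element
        else:
            clusters.append(element)
    return clusters


def cursor_stops(text):
    """Return the code point offsets where a cursor may rest."""
    stops = [0]
    for cluster in conjunct_clusters(text):
        stops.append(stops[-1] + len(cluster))
    return stops


# TA followed by the MA + VIRAMA + PA + AA conjunct from above.
sample = "\u09a4\u09ae\u09cd\u09aa\u09be"
print(cursor_stops(sample))
```

For the sample, the only resting places are before the TA, between the TA and the conjunct, and at the end; nothing lands inside the conjunct.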
Now in "Sometimes you need more than StringInfo" I actually asked if samples for this kind of data would be desirable and nobody responded, but I'll ask again to see if I have inspired interest. Any takers? :-)
This post brought to you by ত (U+09a4, a.k.a. BENGALI LETTER TA)