In the world of Unicode, it is a small irony that the characters that usually cause the most emails to be exchanged and the most documents to be written are the ones that have no visible representation at all.

Whether it is U+feff (a.k.a. ZERO WIDTH NO-BREAK SPACE, a.k.a. the BOM), U+200e/U+200f (a.k.a. LEFT-TO-RIGHT MARK/RIGHT-TO-LEFT MARK), or many of the others, it is these characters that often have the most dangerous and troublesome influence on surrounding text.

I am going to talk about two more of them now....

They are:

U+200c (ZERO WIDTH NON-JOINER)
U+200d (ZERO WIDTH JOINER)

Now these characters each have a simple purpose -- the former (ZWNJ) is to suggest that the characters preceding it and following it should not try to join or ligate, and the latter (ZWJ) is to suggest that the characters preceding it and following it should try to join or ligate.

If you have neither, then it is basically up to the font and rendering engine to decide what to do, based on their impression of the desired behavior.

And if the font and/or shaping engine wish to ignore the suggestion, they are free to do so -- which they often will if they have no specific behavior defined for the two characters.

ZWJ and ZWNJ are only supposed to be used to suggest visual distinctions, not ones that would change the meaning or interpretation of the characters.

They are thus supposed to be ignored in things like the Unicode Collation Algorithm and outright stripped in things like StringPrep.
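To make the "outright stripped" part concrete, here is a minimal sketch in Python. It is not a real StringPrep implementation and not the Unicode Collation Algorithm; it only imitates the one step that matters here, mapping the two characters to nothing. The names `JOINERS` and `strip_joiners` are made up for illustration.

```python
# Map U+200C (ZWNJ) and U+200D (ZWJ) to nothing, the way a
# "strip the joiners" step would. Hypothetical helper, not a library API.
JOINERS = {0x200C: None, 0x200D: None}

def strip_joiners(text):
    """Return text with every ZWNJ and ZWJ removed."""
    return text.translate(JOINERS)

plain = "ab"
with_zwnj = "a\u200cb"   # a, ZWNJ, b
with_zwj = "a\u200db"    # a, ZWJ, b

# Before stripping, all three are different strings.
print(with_zwnj == plain, with_zwj == plain, with_zwnj == with_zwj)

# After stripping, they are indistinguishable.
print(strip_joiners(with_zwnj) == strip_joiners(with_zwj) == plain)
```

Which is exactly the intent when the characters are only rendering hints, and exactly the problem when they are not.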

The problem is that sometimes they do convey semantic meaning or content.

Like in the native Sinhalese word for Sri Lanka, which is the first of these two strings and requires a ZERO WIDTH JOINER:

Now tell me if you think that the first one -- which is the name and needs a U+200d -- looks different enough from the second to be considered a significant difference to native readers.
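For anyone who wants to look at the two strings at the code point level, here is a rough sketch in Python. The code point sequences below are my assumption of how the name is usually encoded in Sinhala script (with U+200d between the AL-LAKUNA and the following letter); they are not copied from anywhere authoritative, so treat them as illustrative.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER

# Assumed encodings of the first word, with and without the joiner.
SRI_WITH_ZWJ = "\u0dc1\u0dca" + ZWJ + "\u0dbb\u0dd3"
SRI_WITHOUT_ZWJ = "\u0dc1\u0dca\u0dbb\u0dd3"
# Assumed encoding of the second word (no joiner involved).
LANKA = "\u0dbd\u0d82\u0d9a\u0dcf"

with_zwj = SRI_WITH_ZWJ + " " + LANKA
without_zwj = SRI_WITHOUT_ZWJ + " " + LANKA

def describe(text):
    """List each character as 'U+XXXX NAME'."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

for line in describe(SRI_WITH_ZWJ):
    print(line)

# The two strings differ by exactly one invisible character.
print(with_zwj == without_zwj)
print(with_zwj.replace(ZWJ, "") == without_zwj)
```

Printing `with_zwj` and `without_zwj` in a terminal or browser with a decent Sinhala font is the quickest way to see whether the rendering actually changes.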

Definitely an example where a linguistic distinction controlling how a language is rendered can quickly become a political one!

Or a reported analogous situation with U+200c and Myanmar. Or perhaps the several reported cases where Farsi appears to show similar issues.

I suppose the conclusion of all this is simple enough: one person's ignorable suggestions are another person's crucial directions. :-)

Obviously this is an issue that needs to be figured out, especially with UTR #36 and UTR #39 being relied on so heavily to provide guidance to consumers of Unicode who want to avoid spoofing and other issues.

Preferably in a way that does not offend anyone, linguistically or politically.

But look out for those invisible characters. As we learned in the movie Small Soldiers, they are like the wind. Just because you can't see something doesn't mean it's not there....
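If you want to do the looking out in code, something like the following Python sketch can help. It flags every character whose general category is Cf (Format), which covers the characters mentioned in this post; that choice is my simplification, not an official definition of "invisible", and `find_format_chars` is a made-up name.

```python
import unicodedata

def find_format_chars(text):
    """Return (index, 'U+XXXX', name) for every character in category Cf."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]

# A sample with a BOM, a ZWNJ, a ZWJ, and an LRM hiding among plain letters.
sample = "\ufeffab\u200cc\u200dd\u200e"

for hit in find_format_chars(sample):
    print(hit)
```

On screen the sample looks like four ordinary letters, which is rather the point.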

This post brought to you by U+200c and U+200d (a.k.a. ZERO WIDTH NON-JOINER and ZERO WIDTH JOINER)