<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="http://blogs.msdn.com/utility/FeedStylesheets/rss.xsl" media="screen"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:wfw="http://wellformedweb.org/CommentAPI/"><channel><title>Sorting it all Out : Every Character Has a Story</title><link>http://blogs.msdn.com/michkap/archive/category/8760.aspx</link><description>Ken Whistler, Technical Director of Unicode, once said that in Unicode, every character has a story. This category will tell some of these stories that often make up the dark underbelly of Unicode.&lt;br&gt;&lt;br&gt;</description><dc:language>en-US</dc:language><generator>CommunityServer 2.1 SP1 (Build: 61025.2)</generator><item><title>Every character has a story #31: U+272f0 from CJK Extension B, an ideograph that proves that every rose has its thorn! (aka It wasn't my fault, but [from the Windows standpoint] it was because of me....)</title><link>http://blogs.msdn.com/michkap/archive/2007/12/03/6643180.aspx</link><pubDate>Mon, 03 Dec 2007 17:31:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:6643180</guid><dc:creator>Michael S. Kaplan</dc:creator><slash:comments>6</slash:comments><comments>http://blogs.msdn.com/michkap/comments/6643180.aspx</comments><wfw:commentRss>http://blogs.msdn.com/michkap/commentrss.aspx?PostID=6643180</wfw:commentRss><wfw:comment>http://blogs.msdn.com/michkap/rsscomments.aspx?PostID=6643180</wfw:comment><description>&lt;P&gt;&lt;EM&gt;&lt;FONT color=#ff0000&gt;Yes, the end of the title is an allusion&amp;nbsp;to a&amp;nbsp;late 80s &lt;STRONG&gt;Poison&lt;/STRONG&gt; power ballad based on a Bret Michaels love affair that did not work out (due to a philandering lady, in that case).&lt;/FONT&gt;&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;The other day,&amp;nbsp;I posted &lt;A class="" href="http://blogs.msdn.com/michkap/archive/2007/11/22/6462768.aspx" mce_href="http://blogs.msdn.com/michkap/archive/2007/11/22/6462768.aspx"&gt;&lt;STRONG&gt;How bad does it need to be in order to be not good enough, anyway?&lt;/STRONG&gt;&lt;/A&gt;&amp;nbsp;and I was focusing on the differences between Traditional and Simplified stroke data that was being used for stroke-based collation, and wondering about the net effect in Macao (a Traditional Chinese-using region for which Microsoft uses the Simplified Chinese collation, as I discussed in &lt;A class="" href="http://blogs.msdn.com/michkap/archive/2005/06/11/428351.aspx" mce_href="http://blogs.msdn.com/michkap/archive/2005/06/11/428351.aspx"&gt;&lt;STRONG&gt;Is it Macau or is it Macao?&lt;/STRONG&gt;&lt;/A&gt;).&lt;/P&gt;
&lt;P&gt;But Andrew West was looking at&amp;nbsp;some of these more extreme cases of stroke count differences (as you can tell from the comments), in particular 𧋰 (&lt;A class="" href="http://www.fileformat.info/info/unicode/char/272f0" mce_href="http://www.fileformat.info/info/unicode/char/272f0"&gt;U+272f0&lt;/A&gt;), and managed to prove in his own blog (his post &lt;A class="" href="http://babelstone.blogspot.com/2007/12/cjk-b-case-study-1-u272f0.html" mce_href="http://babelstone.blogspot.com/2007/12/cjk-b-case-study-1-u272f0.html"&gt;CJK-B Case Study #1 : U+272F0&lt;/A&gt;) that &lt;STRONG&gt;every ideograph also has a story&lt;/STRONG&gt;!&lt;/P&gt;
&lt;P&gt;He was &lt;A class="" href="http://blogs.msdn.com/michkap/archive/2007/11/22/6462768.aspx#6643649" mce_href="http://blogs.msdn.com/michkap/archive/2007/11/22/6462768.aspx#6643649"&gt;worried&lt;/A&gt; that he posted too much detail to be very interesting:&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;EM&gt;&lt;FONT face="times new roman,times"&gt;I'm sorry if I spoiled your follow-up, but I'm sure you have a different, probably more interesting take on the subject than me -- my &amp;nbsp;post is probably too detailed for anyone but the most dedicated CJK/Unicode geeks. I wasn't going to blog on the subject originally, but there was just too much information to put into the comments to someone else's blog.&lt;/FONT&gt;&lt;/EM&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;though speaking personally I disagree. The only thing that would keep me from doing such&amp;nbsp;a post myself here is that I lack the&amp;nbsp;knowledge/wherewithal to do so.... &lt;/P&gt;
&lt;P&gt;Luckily I can simply link to him, instead! :-)&lt;/P&gt;
&lt;P&gt;From Andrew's "case study" post:&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;EM&gt;I guess that once the Taiwan source glyph is corrected and the Taiwan stroke count data is amended it should be the end of the story, but the one thing that nags at me (as is the case with so many characters which only have a single Taiwan source reference) is what is the ultimate source of this character and which texts is it used in ? &lt;/EM&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;For Microsoft, it raises an interesting problem if/when the reference glyph is fixed....&lt;/P&gt;
&lt;P&gt;Okay, let's say they do fix the reference glyph, and subsequently, the stroke count.&lt;/P&gt;
&lt;P&gt;What does Microsoft do?&lt;/P&gt;
&lt;P&gt;Note that our Traditional Chinese font that includes &lt;A class="" href="http://www.fileformat.info/info/unicode/char/272f0" mce_href="http://www.fileformat.info/info/unicode/char/272f0"&gt;U+272f0&lt;/A&gt;&amp;nbsp;does not have this problem (we did not pick up the incorrect glyph, possibly the font foundry realizing the same thing Andrew did and not wanting to perpetuate the mistake, but then also not telling us, either -- not&amp;nbsp;to imply that&amp;nbsp;there is or isn't a definite mechanism for such? Or perhaps there is a separate quality issue in the font itself?).&lt;/P&gt;
&lt;P&gt;So either way, at this point it is just an anomaly in the sorting table, a known bug with no official communication on the change yet,&amp;nbsp;but we expect at some point there might be such communication.&lt;/P&gt;
&lt;P&gt;Since we are literally based on a standard in this case, no change could even be considered until it is known through official sources.&lt;/P&gt;
&lt;P&gt;In a total stroke based collation such as this one, the difference between 13 and 19 is pretty huge, so one assumes that eventually the change would have to be picked up.&lt;/P&gt;
&lt;P&gt;But even a change to one code point could cause index corruption in a database, which means a new major version would be required for the character.&lt;/P&gt;
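&lt;P&gt;To make the index problem concrete, here is a rough sketch in Python. Everything in it is hypothetical (the build_index and ensure_index helpers, the integer version numbers, the made-up keys and weights); it is not how Windows or any database actually tracks collation versions, just the general idea of stamping an index with the version of the sort data it was built from and rebuilding when that version changes:&lt;/P&gt;

```python
def build_index(keys, weights, version):
    """Sort keys by their weight and remember which table version was used."""
    return {"version": version, "order": sorted(keys, key=weights.get)}


def ensure_index(index, keys, weights, current_version):
    """Reuse the index only if it was built from the current table version."""
    if index is None or index["version"] != current_version:
        return build_index(keys, weights, current_version)
    return index


# Hypothetical keys and weights: only the weight of "b" changes between versions.
keys = ["a", "b", "c"]
weights_v1 = {"a": 1, "b": 19, "c": 15}
weights_v2 = {"a": 1, "b": 13, "c": 15}

index = build_index(keys, weights_v1, 1)
print(index["order"])  # ['a', 'c', 'b']

# Same keys, one changed weight: the old order is no longer valid.
index = ensure_index(index, keys, weights_v2, 2)
print(index["order"])  # ['a', 'b', 'c']
```

&lt;P&gt;An index that kept the first order while lookups used the second table would be searching in the wrong place for "b", which is the kind of corruption a version bump is meant to flag.&lt;/P&gt;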
&lt;P&gt;And before rushing in to fix &lt;A class="" href="http://www.fileformat.info/info/unicode/char/272f0" mce_href="http://www.fileformat.info/info/unicode/char/272f0"&gt;U+272f0&lt;/A&gt;&amp;nbsp;(which as Andrew mentioned it is not clear where it is needed), we have to consider the bigger problems in the other 9,767 differences and with Extension B in general. &lt;/P&gt;
&lt;P&gt;How reliable is the rest of the data? And how many additional problems are already fixed in the font that ships even in Vista but are not fixed in the collation tables since those tables are based on a standard working from what amounts to a completely different set of reference glyphs?&lt;/P&gt;
&lt;P&gt;I always tended to think of pronunciation-based sorts as being more worrisome technically, since an ideograph can have multiple pronunciations and by putting a stake in the ground for a version and saying that one pronunciation is the most common, we have to allow for the fact that over time things change, and in the future the most common pronunciation might be different. We had&amp;nbsp;several such changes for Hanja in the Korean collation in Vista, for example.&lt;/P&gt;
&lt;P&gt;But now it seems like we have to look at stroke count data with the same careful eye, never knowing when&amp;nbsp;future corrections would come in based on bugs....&lt;/P&gt;
&lt;P&gt;Maybe an automated program should be run over all of the characters in &lt;STRONG&gt;PMing-ExtB&lt;/STRONG&gt;, counting strokes and comparing against the stroke data in the standard, and then figuring out where other bugs might be.&lt;/P&gt;
&lt;P&gt;But I imagine getting resources for such a review would be a challenge, and the notion of assuming the font is always right here is also flawed -- the font and data&amp;nbsp;could both be wrong, after all.&lt;/P&gt;
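&lt;P&gt;For what it is worth, the comparison half of such a review is the easy part. A minimal sketch in Python, assuming someone has already produced two tables of total stroke counts keyed by code point (one from the standard's data, one counted from the font's glyphs). The function name and the sample data are made up for illustration, and which of the two counts for U+272F0 belongs to which source is an assumption here:&lt;/P&gt;

```python
def find_stroke_mismatches(standard_counts, font_counts):
    """Return (code point, standard count, font count) for every disagreement.

    Both arguments are hypothetical dicts mapping a code point to a total
    stroke count. Neither one is treated as the correct one; a mismatch only
    says that somebody needs to look at the character.
    """
    shared = set(standard_counts).intersection(font_counts)
    mismatches = []
    for code_point in sorted(shared):
        standard = standard_counts[code_point]
        font = font_counts[code_point]
        if standard != font:
            mismatches.append((code_point, standard, font))
    return mismatches


# Illustrative data only. 13 and 19 are the two counts discussed for U+272F0;
# assigning 19 to the standard and 13 to the font is an assumption.
standard_counts = {0x4E00: 1, 0x272F0: 19}
font_counts = {0x4E00: 1, 0x272F0: 13}

for code_point, standard, font in find_stroke_mismatches(standard_counts, font_counts):
    print("U+%04X: standard=%d font=%d" % (code_point, standard, font))
```

&lt;P&gt;The hard part, of course, is producing the font-side table in the first place, and then deciding which side is wrong for each hit.&lt;/P&gt;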
&lt;P&gt;The engineer in me has a hard time dealing with the fact that there are an unknown number of mistakes here, some of which could perhaps be ferreted out, and the linguist-wannabe does not feel much better about that (though he is less convinced of the overall usefulness of a total-stroke-based collation and is thus less troubled by anomalies).&lt;/P&gt;
&lt;P&gt;Plus, there is nothing to say that there are not also mistakes on the Simplified side too. More worries and more resources needed (and this one troubles that linguist-wannabe&amp;nbsp;a bit more since the stroke count/stroke order based sort has in theory a bit more utility, though the notion that there are millions of people who would know the correct order to draw these ideographs they have never seen from millennia ago is also suspect!).&lt;/P&gt;
&lt;P&gt;It is a mess, to be sure. Inevitably I am back to Andrew again, and his intro text from the post:&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;EM&gt;The &lt;/EM&gt;&lt;A href="http://www.unicode.org/charts/PDF/U20000.pdf"&gt;&lt;FONT color=#5588aa&gt;&lt;EM&gt;CJK Unified Ideographs Extension B&lt;/EM&gt;&lt;/FONT&gt;&lt;/A&gt;&lt;EM&gt; &lt;FONT class=gray&gt;[13MB]&lt;/FONT&gt; block that was added to Unicode/10646 in 2001 comprises 42,711 characters, and it is no secret that there are many problems with this huge collection of mostly quite rare characters, including hundreds of cases of unifiable characters that have been erroneously encoded separately and even a handful of completely duplicate characters. There is enough material to keep a dedicated CJK-B blogger busy for years to come, but I certainly don't want to go down that particular path.&lt;/EM&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;I worry more about the ones that cause implementation issues like &lt;A class="" href="http://www.fileformat.info/info/unicode/char/272f0" mce_href="http://www.fileformat.info/info/unicode/char/272f0"&gt;U+272f0&lt;/A&gt;&amp;nbsp;will, but even so I would be just as worried about having to go down that path as he, perhaps more. Technically I worry more for my successors who own the area, though I do feel partially responsible since the errors of the Taiwanese standard based on errors in Unicode/10646 were perpetuated into Windows on my watch.&lt;/P&gt;
&lt;P&gt;Should I feel worse that it was literally my request to the subsidiaries to provide the additional data I would need to extend the tables? &lt;/P&gt;
&lt;P&gt;&lt;EM&gt;(They had requested us to extend them and had been refused for a long time based on technological issues that I figured out workarounds for.)&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;Well, either way I do feel worse. It wasn't my fault, but from the Windows standpoint it was because of me....&lt;/P&gt;
&lt;P mce_keep="true"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;FONT color=#ff00ff&gt;&lt;EM&gt;This post brought to you by&lt;/EM&gt; &lt;FONT size=5&gt;𧋰&lt;/FONT&gt; &lt;/FONT&gt;&lt;EM&gt;&lt;FONT color=#ff00ff&gt;(&lt;/FONT&gt;&lt;A class="" href="http://www.fileformat.info/info/unicode/char/272f0" mce_href="http://www.fileformat.info/info/unicode/char/272f0"&gt;U+272f0&lt;/A&gt;&lt;FONT color=#ff00ff&gt;&lt;/FONT&gt;&lt;FONT color=#ff00ff&gt;,&amp;nbsp;an&amp;nbsp;Extension B CJK ideograph causing me to lose a bit of sleep!)&lt;/FONT&gt;&lt;/EM&gt;&lt;/P&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=6643180" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/michkap/archive/tags/Collation_2F00_Casing/default.aspx">Collation/Casing</category><category domain="http://blogs.msdn.com/michkap/archive/tags/Linguistic/default.aspx">Linguistic</category><category domain="http://blogs.msdn.com/michkap/archive/tags/Unicode_2F00_standards/default.aspx">Unicode/standards</category><category domain="http://blogs.msdn.com/michkap/archive/tags/Every+Character+Has+a+Story/default.aspx">Every Character Has a Story</category><category domain="http://blogs.msdn.com/michkap/archive/tags/Fonts_2F00_Typography/default.aspx">Fonts/Typography</category><category domain="http://blogs.msdn.com/michkap/archive/tags/Unicode+Lame+List/default.aspx">Unicode Lame List</category></item><item><title>Microsoft is a Form 'C' shop, Part 1</title><link>http://blogs.msdn.com/michkap/archive/2007/10/29/5756924.aspx</link><pubDate>Mon, 29 Oct 2007 17:01:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:5756924</guid><dc:creator>Michael S. 
Kaplan</dc:creator><slash:comments>2</slash:comments><comments>http://blogs.msdn.com/michkap/comments/5756924.aspx</comments><wfw:commentRss>http://blogs.msdn.com/michkap/commentrss.aspx?PostID=5756924</wfw:commentRss><wfw:comment>http://blogs.msdn.com/michkap/rsscomments.aspx?PostID=5756924</wfw:comment><description>&lt;P&gt;Microsoft has had Unicode as a part of its operating system offerings since the earliest days of its 32-bit platforms.&lt;/P&gt;
&lt;P&gt;And a lot of that support predates anything that Unicode later chose to provide, thus &lt;A class="" href="http://blogs.msdn.com/michkap/archive/2004/11/28/271121.aspx" mce_href="http://blogs.msdn.com/michkap/archive/2004/11/28/271121.aspx"&gt;&lt;STRONG&gt;we don't use the Unicode Collation Algorithm&lt;/STRONG&gt;&lt;/A&gt; for our sorting, for years &lt;A class="" href="http://blogs.msdn.com/michkap/archive/2005/01/31/363701.aspx" mce_href="http://blogs.msdn.com/michkap/archive/2005/01/31/363701.aspx"&gt;&lt;STRONG&gt;we did not use Unicode normalization&lt;/STRONG&gt;&lt;/A&gt; for our equivalences, and all kinds of random snafus like that &lt;A class="" href="http://blogs.msdn.com/michkap/archive/2007/08/28/4605786.aspx" mce_href="http://blogs.msdn.com/michkap/archive/2007/08/28/4605786.aspx"&gt;&lt;STRONG&gt;somewhat random Tibetan/Myanmar thing&lt;/STRONG&gt;&lt;/A&gt; with us not picking up Unicode changes when they happened still manage to pop up after all of these years.&lt;/P&gt;
&lt;P&gt;Now for the most part, data coming out of Microsoft's keyboards, data entry methods, functions, methods, and algorithms has always been in what we for years called the precomposed form, which Unicode calls Unicode Normalization Form "C" in their &lt;A class="" href="http://www.unicode.org/reports/tr15/" mce_href="http://www.unicode.org/reports/tr15/"&gt;UAX #15&lt;/A&gt;. Other than hiccups like code page 1258 (discussed &lt;A class="" href="http://blogs.msdn.com/michkap/archive/2005/04/19/409566.aspx" mce_href="http://blogs.msdn.com/michkap/archive/2005/04/19/409566.aspx"&gt;&lt;STRONG&gt;here&lt;/STRONG&gt;&lt;/A&gt; and &lt;A class="" href="http://blogs.msdn.com/michkap/archive/2005/11/11/491349.aspx" mce_href="http://blogs.msdn.com/michkap/archive/2005/11/11/491349.aspx"&gt;&lt;STRONG&gt;here&lt;/STRONG&gt;&lt;/A&gt;), data always tended to be in Form "C".&lt;/P&gt;
&lt;P&gt;In fact, if you convert data to Form "D" then there are a bunch of places like in collation that you won't get the most accurate results, even in Vista where most of the equivalent forms were added to the tables to try to make the impact of using Form "D" text less noticeable....&lt;/P&gt;
&lt;P&gt;Yet even today if you convert to Form "D" then all kinds of languages from Korean to Tibetan won't always sort as expected or as designed. And Vista features like LINGUISTIC_IGNORE* flags won't always return exactly equivalent results if you compare Form "C" text to Form "D" text. You are always better off converting text if you are getting it from other sources before using the NLS API for the text....&lt;/P&gt;
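&lt;P&gt;A minimal sketch of that convert-first advice, using Python's standard unicodedata module as a stand-in for whatever normalization call your platform provides. The helper names are made up for illustration, and this is not the NLS API itself:&lt;/P&gt;

```python
import unicodedata


def to_form_c(text):
    """Return text in Unicode Normalization Form C (precomposed where possible)."""
    return unicodedata.normalize("NFC", text)


def same_after_form_c(a, b):
    """Compare two strings after converting both of them to Form C."""
    return to_form_c(a) == to_form_c(b)


composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE, already Form C
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT, Form D

print(composed == decomposed)                   # False
print(same_after_form_c(composed, decomposed))  # True
```

&lt;P&gt;The raw comparison fails because the two strings are different code point sequences even though they are canonically equivalent; converting both sides to Form "C" first removes that difference before any sorting or comparison code sees the text.&lt;/P&gt;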
&lt;P&gt;Chalk it up to&amp;nbsp;gremlins in the computers&amp;nbsp;and such.... not converting what they do not seem to&amp;nbsp;handle on their own....&lt;/P&gt;
&lt;P&gt;Now note that products like Access and SQL Server, being based on similar technologies only updated less often, still have problems even today.&lt;/P&gt;
&lt;P&gt;Anyway, future posts in this series will be explaining&amp;nbsp;other consequences of our "Form 'C'-ness". This is just the intro.&lt;/P&gt;
&lt;P mce_keep="true"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;FONT color=#ff00ff&gt;&lt;EM&gt;This post brought to you by&lt;/EM&gt; &lt;FONT size=5&gt;ೀ&lt;/FONT&gt; &lt;EM&gt;(&lt;A class="" href="http://www.fileformat.info/info/unicode/char/0cc0" mce_href="http://www.fileformat.info/info/unicode/char/0cc0"&gt;U+0cc0&lt;/A&gt;, a.k.a. KANNADA VOWEL SIGN II)&lt;/EM&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=5756924" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/michkap/archive/tags/Collation_2F00_Casing/default.aspx">Collation/Casing</category><category domain="http://blogs.msdn.com/michkap/archive/tags/Linguistic/default.aspx">Linguistic</category><category domain="http://blogs.msdn.com/michkap/archive/tags/Unicode_2F00_standards/default.aspx">Unicode/standards</category><category domain="http://blogs.msdn.com/michkap/archive/tags/Every+Character+Has+a+Story/default.aspx">Every Character Has a Story</category><category domain="http://blogs.msdn.com/michkap/archive/tags/Int_2700_l+Programming/default.aspx">Int'l Programming</category></item><item><title>Tell them to GO FISH?, aka Carpe Carp [Sarcalogos] (Seize the [Christian] Fish), aka Unicode List (where characters discuss character encoding)</title><link>http://blogs.msdn.com/michkap/archive/2007/10/01/5216797.aspx</link><pubDate>Mon, 01 Oct 2007 10:01:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:5216797</guid><dc:creator>Michael S. 
Kaplan</dc:creator><slash:comments>3</slash:comments><comments>http://blogs.msdn.com/michkap/comments/5216797.aspx</comments><wfw:commentRss>http://blogs.msdn.com/michkap/commentrss.aspx?PostID=5216797</wfw:commentRss><wfw:comment>http://blogs.msdn.com/michkap/rsscomments.aspx?PostID=5216797</wfw:comment><description>&lt;P&gt;If you are a member of &lt;A class="" href="http://www.unicode.org/consortium/distlist.html" mce_href="http://www.unicode.org/consortium/distlist.html"&gt;the Unicode List&lt;/A&gt;, and you do not have any real sense of what you should or should not contribute, then it is worth realizing that you have no&amp;nbsp;disadvantage compared to the people who do contribute regularly -- they don't know much about it either. :-)&lt;/P&gt;
&lt;P&gt;But if there is one rule worth learning, it is simple.&lt;/P&gt;
&lt;P&gt;&lt;FONT size=5&gt;Avoid Hyperbole.&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;Making an outrageous or in some other way extreme claim about a character encoding proposal is a huge mistake, even if you suggest it in the context of an "April 1st" prank proposal.&lt;/P&gt;
&lt;P&gt;And it happened this weekend&lt;FONT size=1&gt;&lt;SUP&gt;1,2&lt;/SUP&gt;&lt;/FONT&gt; when perennial mischief maker Philippe Verdy responded to Jon Hanna's worry about a particular thread (that after reading the suggestions given he'd have to find a different post to use for April 1st) with this suggestion:&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;EM&gt;Why not proposing the historical Christian fish symbol at this date (April 1st)?&lt;BR&gt;&lt;BR&gt;The fish was used and displayed on many monuments and graves, is still seen on old artistic features (like ceramics) and probably on old scriptures too, by the first Christians, before the Cross during the Roman Empire before it converted officially to the Christian religion, and accepted to use the Cross as an easily recognizable symbol to commemorate those that died on it.&lt;/EM&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;An innocent start, with now retired Asmus Freytag suggesting to the Frenchman:&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;FONT face="times new roman,times"&gt;You haven't been in the US in a while, apparently, or you'd know that the fish symbol is widely used there. I'm not (just) referring to its use as a symbol on cars, but in print and other advertising to label commercial enterprises owned by or catering to the "true believers".&lt;BR&gt;&lt;BR&gt;While I'm surprised that it hasn't been proposed before, it's clear that it's so far out of the realm of character encoding that it can be safely considered an April Fool's joke.&lt;/FONT&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;
&lt;P mce_keep="true"&gt;And then William J. Poser's amusing contribution:&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P mce_keep="true"&gt;&lt;FONT face="times new roman,times"&gt;&lt;EM&gt;If the Christian fish is encoded, I demand equal space for the Darwin fish.&lt;/EM&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;
&lt;P mce_keep="true"&gt;And now we have a party, with at least another 60 posts in increasing distances from a sensible topic with each successive post.&lt;/P&gt;
&lt;P mce_keep="true"&gt;All I know is that I am sure the UTC will see a proposal at some point for the little bugger....&lt;/P&gt;
&lt;P mce_keep="true"&gt;Anyway, Philippe added some historical info on the Christian fish (ref: &lt;A class="" href="http://www.eureka4you.com/fish/fishsymbol.htm" mce_href="http://www.eureka4you.com/fish/fishsymbol.htm"&gt;here&lt;/A&gt;), which includes info about the "keyboard fish", which looks like this:&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P mce_keep="true"&gt;&lt;FONT size=7&gt;&amp;lt;&amp;gt;&amp;lt;&lt;/FONT&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;
&lt;P mce_keep="true"&gt;And then&amp;nbsp;the conversation&amp;nbsp;moved to bumper stickers and then the various religious/Darwinism arguments, a&amp;nbsp;threat from Her Divine Effulgence Sarasvati to cease and desist with the off-topicality&amp;nbsp;of the thread&lt;FONT size=1&gt;&lt;SUP&gt;3&lt;/SUP&gt;&lt;/FONT&gt; before&amp;nbsp;morphing into a discussion of&amp;nbsp;Egyptian hieroglyphs and the many small fishes that are set to be encoded later.&lt;/P&gt;
&lt;P mce_keep="true"&gt;Like I said, I'm willing to bet this is not the last we have heard about the Christian Fish proposal, and it will not be just an April 1st joke post. I have decided it is an expected risk of putting a bunch of characters in a list to discuss encoding characters....&lt;/P&gt;
&lt;P mce_keep="true"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;FONT size=1&gt;1 - For the record the only bright spot in this whole mess is that most of these people seem to have jobs because the threads that get tons of participation seem to get it on the weekend. &lt;BR&gt;2 - Of course the flip side of this is that some of these people need to try and get some lives. :-)&lt;BR&gt;3 - This request was, of course, mostly ignored in the subsequent 20+ posts.&lt;/FONT&gt;&lt;/P&gt;
&lt;P mce_keep="true"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;FONT color=#ff00ff&gt;&lt;EM&gt;This post brought to you by&lt;/EM&gt; &lt;FONT size=5&gt;◉&lt;/FONT&gt;, &lt;FONT size=5&gt;⥼&lt;/FONT&gt;, &lt;FONT size=5&gt;⥽&lt;/FONT&gt;, &lt;FONT size=5&gt;⥾&lt;/FONT&gt;, &lt;EM&gt;and&lt;/EM&gt; &lt;FONT size=5&gt;⥿&lt;/FONT&gt; &lt;EM&gt;(U+25c9, U+297c, U+297d, U+297e, and&amp;nbsp;U+297f, a.k.a. FISHEYE, LEFT FISH TAIL, RIGHT FISH TAIL, UP FISH TAIL, and&amp;nbsp;DOWN FISH TAIL - all packed in like a bunch of sardines on this one)&lt;/EM&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=5216797" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/michkap/archive/tags/Unicode_2F00_standards/default.aspx">Unicode/standards</category><category domain="http://blogs.msdn.com/michkap/archive/tags/Every+Character+Has+a+Story/default.aspx">Every Character Has a Story</category></item><item><title>Every character[ sequence] has a story #30: The SMILEY (a 25-year old story, in fact)</title><link>http://blogs.msdn.com/michkap/archive/2007/09/19/4998754.aspx</link><pubDate>Wed, 19 Sep 2007 21:44:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:4998754</guid><dc:creator>Michael S. Kaplan</dc:creator><slash:comments>2</slash:comments><comments>http://blogs.msdn.com/michkap/comments/4998754.aspx</comments><wfw:commentRss>http://blogs.msdn.com/michkap/commentrss.aspx?PostID=4998754</wfw:commentRss><wfw:comment>http://blogs.msdn.com/michkap/rsscomments.aspx?PostID=4998754</wfw:comment><description>&lt;P&gt;Given all of the blather about emoji and emoticons and symbols, the mail I got from Sergey earlier today puts it all in perspective.&lt;/P&gt;
&lt;P&gt;It had the following in it:&lt;/P&gt;
&lt;P&gt;&lt;IMG src="http://www.trigeminal.com/images/smiley.png"&gt;&lt;/P&gt;
&lt;P&gt;Note the date and time, and when this post goes live.&lt;/P&gt;
&lt;P&gt;For more on Scott and Smiley lore, see &lt;A class="" href="http://research.microsoft.com/~mbj/Smiley/Smiley.html" mce_href="http://research.microsoft.com/~mbj/Smiley/Smiley.html"&gt;here&lt;/A&gt; and &lt;A class="" href="http://www.cs.cmu.edu/~sef/sefSmiley.htm" mce_href="http://www.cs.cmu.edu/~sef/sefSmiley.htm"&gt;here&lt;/A&gt;.&lt;/P&gt;
&lt;P&gt;Considering how often I use this particular item, I feel like I owe him a serious thank you!&lt;/P&gt;
&lt;P mce_keep="true"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;FONT color=#ff00ff&gt;&lt;EM&gt;This post brought to you by&lt;/EM&gt; &lt;FONT size=5&gt;⌣&lt;/FONT&gt; &lt;EM&gt;(&lt;A class="" href="http://www.fileformat.info/info/unicode/char/2323" mce_href="http://www.fileformat.info/info/unicode/char/2323"&gt;U+2323&lt;/A&gt;, a.k.a.&amp;nbsp;SMILE)&lt;/EM&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=4998754" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/michkap/archive/tags/Potpourri/default.aspx">Potpourri</category><category domain="http://blogs.msdn.com/michkap/archive/tags/Every+Character+Has+a+Story/default.aspx">Every Character Has a Story</category></item><item><title>Every character has a story #29: U+1000^H^H^H^H0f40, (TIBETAN or MYANMAR LETTER KA, depending on when you ask)</title><link>http://blogs.msdn.com/michkap/archive/2007/08/28/4605786.aspx</link><pubDate>Tue, 28 Aug 2007 10:59:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:4605786</guid><dc:creator>Michael S. Kaplan</dc:creator><slash:comments>4</slash:comments><comments>http://blogs.msdn.com/michkap/comments/4605786.aspx</comments><wfw:commentRss>http://blogs.msdn.com/michkap/commentrss.aspx?PostID=4605786</wfw:commentRss><wfw:comment>http://blogs.msdn.com/michkap/rsscomments.aspx?PostID=4605786</wfw:comment><description>&lt;P&gt;So I was chatting with Goldie the other day and I think just after or maybe it was just before I made some ridiculous stretch of&amp;nbsp;a joke about Anatevka (forgetting momentarily that she did not go by Golde; her nom de plume was Goldie) she asked me if there was a test case I knew off the top of my head&amp;nbsp;where collation&amp;nbsp;results changed between XP and Server 2003.&lt;/P&gt;
&lt;P&gt;Interestingly, this is a question I have been waiting years for someone to ask, ever since I first pieced together the change that happened! :-)&lt;/P&gt;
&lt;P&gt;You see, prior to Server 2003, there was no version support. You know, those functions I mentioned in posts like &lt;STRONG&gt;&lt;A class="" href="http://blogs.msdn.com/michkap/archive/2005/05/04/414520.aspx" mce_href="http://blogs.msdn.com/michkap/archive/2005/05/04/414520.aspx"&gt;this one&lt;/A&gt;&lt;/STRONG&gt; (&lt;A href="http://msdn.microsoft.com/library/en-us/intl/nls_19ev.asp"&gt;IsNLSDefinedString&lt;/A&gt; and &lt;A title=GetNLSVersion href="http://msdn.microsoft.com/library/en-us/intl/nls_5e7i.asp"&gt;GetNLSVersion&lt;/A&gt;).&lt;/P&gt;
&lt;P&gt;As a part of the Server 2003 update, a bunch of code points got removed from the table. I'll list&amp;nbsp;a bunch of them&amp;nbsp;and you tell me if you see a pattern:&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;FONT face="Consolas,Lucida Console,Courier New,Courier,Fixed" size=1&gt;0x1000&amp;nbsp; 32&amp;nbsp;&amp;nbsp;&amp;nbsp; 2&amp;nbsp;&amp;nbsp; 2&amp;nbsp; 2&amp;nbsp; ;Tibetan Ka&lt;BR&gt;0x1001&amp;nbsp; 32&amp;nbsp;&amp;nbsp;&amp;nbsp; 3&amp;nbsp;&amp;nbsp; 2&amp;nbsp; 2&amp;nbsp; ;Tibetan Kha&lt;BR&gt;0x1002&amp;nbsp; 32&amp;nbsp;&amp;nbsp;&amp;nbsp; 4&amp;nbsp;&amp;nbsp; 2&amp;nbsp; 2&amp;nbsp; ;Tibetan Ga&lt;BR&gt;0x1003&amp;nbsp; 32&amp;nbsp;&amp;nbsp;&amp;nbsp; 5&amp;nbsp;&amp;nbsp; 2&amp;nbsp; 2&amp;nbsp; ;Tibetan Nga&lt;BR&gt;0x1004&amp;nbsp; 32&amp;nbsp;&amp;nbsp;&amp;nbsp; 6&amp;nbsp;&amp;nbsp; 2&amp;nbsp; 2&amp;nbsp; ;Tibetan Ca&lt;BR&gt;0x1005&amp;nbsp; 32&amp;nbsp;&amp;nbsp;&amp;nbsp; 7&amp;nbsp;&amp;nbsp; 2&amp;nbsp; 2&amp;nbsp; ;Tibetan Cha&lt;BR&gt;0x1006&amp;nbsp; 32&amp;nbsp;&amp;nbsp;&amp;nbsp; 8&amp;nbsp;&amp;nbsp; 2&amp;nbsp; 2&amp;nbsp; ;Tibetan Ja&lt;BR&gt;0x1007&amp;nbsp; 32&amp;nbsp;&amp;nbsp;&amp;nbsp; 9&amp;nbsp;&amp;nbsp; 2&amp;nbsp; 2&amp;nbsp; ;Tibetan Nya&lt;BR&gt;0x1008&amp;nbsp; 32&amp;nbsp;&amp;nbsp; 10&amp;nbsp;&amp;nbsp; 2&amp;nbsp; 2&amp;nbsp; ;Tibetan Reversed Ta&lt;BR&gt;0x1009&amp;nbsp; 32&amp;nbsp;&amp;nbsp; 11&amp;nbsp;&amp;nbsp; 2&amp;nbsp; 2&amp;nbsp; ;Tibetan Reversed Tha&lt;BR&gt;0x100a&amp;nbsp; 32&amp;nbsp;&amp;nbsp; 12&amp;nbsp;&amp;nbsp; 2&amp;nbsp; 2&amp;nbsp; ;Tibetan Reversed Da&lt;BR&gt;0x100b&amp;nbsp; 32&amp;nbsp;&amp;nbsp; 13&amp;nbsp;&amp;nbsp; 2&amp;nbsp; 2&amp;nbsp; ;Tibetan Reversed Na&lt;BR&gt;0x100c&amp;nbsp; 32&amp;nbsp;&amp;nbsp; 14&amp;nbsp;&amp;nbsp; 2&amp;nbsp; 2&amp;nbsp; ;Tibetan Ta&lt;BR&gt;0x100d&amp;nbsp; 32&amp;nbsp;&amp;nbsp; 15&amp;nbsp;&amp;nbsp; 2&amp;nbsp; 2&amp;nbsp; ;Tibetan Tha&lt;BR&gt;0x100e&amp;nbsp; 32&amp;nbsp;&amp;nbsp; 16&amp;nbsp;&amp;nbsp; 2&amp;nbsp; 2&amp;nbsp; ;Tibetan Da&lt;BR&gt;0x100f&amp;nbsp; 32&amp;nbsp;&amp;nbsp; 17&amp;nbsp;&amp;nbsp; 2&amp;nbsp; 2&amp;nbsp; ;Tibetan Na&lt;BR&gt;0x1010&amp;nbsp; 32&amp;nbsp;&amp;nbsp; 18&amp;nbsp;&amp;nbsp; 
2&amp;nbsp; 2&amp;nbsp; ;Tibetan Pa&lt;BR&gt;0x1011&amp;nbsp; 32&amp;nbsp;&amp;nbsp; 19&amp;nbsp;&amp;nbsp; 2&amp;nbsp; 2&amp;nbsp; ;Tibetan Pha&lt;BR&gt;0x1012&amp;nbsp; 32&amp;nbsp;&amp;nbsp; 20&amp;nbsp;&amp;nbsp; 2&amp;nbsp; 2&amp;nbsp; ;Tibetan Ba&lt;BR&gt;0x1013&amp;nbsp; 32&amp;nbsp;&amp;nbsp; 21&amp;nbsp;&amp;nbsp; 2&amp;nbsp; 2&amp;nbsp; ;Tibetan Ma&lt;BR&gt;0x1014&amp;nbsp; 32&amp;nbsp;&amp;nbsp; 22&amp;nbsp;&amp;nbsp; 2&amp;nbsp; 2&amp;nbsp; ;Tibetan Tsa&lt;BR&gt;0x1015&amp;nbsp; 32&amp;nbsp;&amp;nbsp; 23&amp;nbsp;&amp;nbsp; 2&amp;nbsp; 2&amp;nbsp; ;Tibetan Tsha&lt;BR&gt;0x1016&amp;nbsp; 32&amp;nbsp;&amp;nbsp; 24&amp;nbsp;&amp;nbsp; 2&amp;nbsp; 2&amp;nbsp; ;Tibetan Dza&lt;BR&gt;0x1017&amp;nbsp; 32&amp;nbsp;&amp;nbsp; 25&amp;nbsp;&amp;nbsp; 2&amp;nbsp; 2&amp;nbsp; ;Tibetan Wa&lt;BR&gt;0x1018&amp;nbsp; 32&amp;nbsp;&amp;nbsp; 26&amp;nbsp;&amp;nbsp; 2&amp;nbsp; 2&amp;nbsp; ;Tibetan Zha&lt;BR&gt;0x1019&amp;nbsp; 32&amp;nbsp;&amp;nbsp; 27&amp;nbsp;&amp;nbsp; 2&amp;nbsp; 2&amp;nbsp; ;Tibetan Za&lt;BR&gt;0x101a&amp;nbsp; 32&amp;nbsp;&amp;nbsp; 28&amp;nbsp;&amp;nbsp; 2&amp;nbsp; 2&amp;nbsp; ;Tibetan Aa&lt;BR&gt;0x101b&amp;nbsp; 32&amp;nbsp;&amp;nbsp; 29&amp;nbsp;&amp;nbsp; 2&amp;nbsp; 2&amp;nbsp; ;Tibetan Ya&lt;BR&gt;0x101c&amp;nbsp; 32&amp;nbsp;&amp;nbsp; 30&amp;nbsp;&amp;nbsp; 2&amp;nbsp; 2&amp;nbsp; ;Tibetan Ra&lt;BR&gt;0x101d&amp;nbsp; 32&amp;nbsp;&amp;nbsp; 31&amp;nbsp;&amp;nbsp; 2&amp;nbsp; 2&amp;nbsp; ;Tibetan La&lt;BR&gt;0x101e&amp;nbsp; 32&amp;nbsp;&amp;nbsp; 32&amp;nbsp;&amp;nbsp; 2&amp;nbsp; 2&amp;nbsp; ;Tibetan Sha&lt;BR&gt;0x101f&amp;nbsp; 32&amp;nbsp;&amp;nbsp; 33&amp;nbsp;&amp;nbsp; 2&amp;nbsp; 2&amp;nbsp; ;Tibetan Reversed Sha&lt;BR&gt;0x1020&amp;nbsp; 32&amp;nbsp;&amp;nbsp; 34&amp;nbsp;&amp;nbsp; 2&amp;nbsp; 2&amp;nbsp; ;Tibetan Sa&lt;BR&gt;0x1021&amp;nbsp; 32&amp;nbsp;&amp;nbsp; 35&amp;nbsp;&amp;nbsp; 2&amp;nbsp; 2&amp;nbsp; ;Tibetan Ha&lt;BR&gt;0x1022&amp;nbsp; 32&amp;nbsp;&amp;nbsp; 36&amp;nbsp;&amp;nbsp; 2&amp;nbsp; 2&amp;nbsp; ;Tibetan 
A&lt;BR&gt;0x1026&amp;nbsp;&amp;nbsp; 1&amp;nbsp;&amp;nbsp;&amp;nbsp; 0&amp;nbsp;&amp;nbsp; 3&amp;nbsp; 0&amp;nbsp; ;Tibetan Vowel Sign I&lt;BR&gt;0x1027&amp;nbsp;&amp;nbsp; 1&amp;nbsp;&amp;nbsp;&amp;nbsp; 0&amp;nbsp;&amp;nbsp; 4&amp;nbsp; 0&amp;nbsp; ;Tibetan Vowel Sign Short I&lt;BR&gt;0x1028&amp;nbsp;&amp;nbsp; 1&amp;nbsp;&amp;nbsp;&amp;nbsp; 0&amp;nbsp;&amp;nbsp; 5&amp;nbsp; 0&amp;nbsp; ;Tibetan Vowel Sign U&lt;BR&gt;0x1029&amp;nbsp;&amp;nbsp; 1&amp;nbsp;&amp;nbsp;&amp;nbsp; 0&amp;nbsp;&amp;nbsp; 6&amp;nbsp; 0&amp;nbsp; ;Tibetan Vowel Sign E&lt;BR&gt;0x102a&amp;nbsp;&amp;nbsp; 1&amp;nbsp;&amp;nbsp;&amp;nbsp; 0&amp;nbsp;&amp;nbsp; 7&amp;nbsp; 0&amp;nbsp; ;Tibetan Vowel Sign O&lt;BR&gt;0x102b&amp;nbsp; 32&amp;nbsp;&amp;nbsp; 37&amp;nbsp;&amp;nbsp; 2&amp;nbsp; 2&amp;nbsp; ;Tibetan Chuchenyige&lt;BR&gt;0x102c&amp;nbsp; 32&amp;nbsp;&amp;nbsp; 38&amp;nbsp;&amp;nbsp; 2&amp;nbsp; 2&amp;nbsp; ;Tibetan Visarga&lt;BR&gt;0x102e&amp;nbsp;&amp;nbsp; 1&amp;nbsp;&amp;nbsp;&amp;nbsp; 0&amp;nbsp;&amp;nbsp; 8&amp;nbsp; 0&amp;nbsp; ;Tibetan Anusvara&lt;BR&gt;0x102f&amp;nbsp; 32&amp;nbsp;&amp;nbsp; 39&amp;nbsp;&amp;nbsp; 2&amp;nbsp; 2&amp;nbsp; ;Tibetan Right Brace&lt;BR&gt;0x1030&amp;nbsp;&amp;nbsp; 1&amp;nbsp;&amp;nbsp;&amp;nbsp; 0&amp;nbsp;&amp;nbsp; 9&amp;nbsp; 0&amp;nbsp; ;Tibetan Under Ring&lt;BR&gt;0x1031&amp;nbsp; 32&amp;nbsp;&amp;nbsp; 40&amp;nbsp;&amp;nbsp; 2&amp;nbsp; 2&amp;nbsp; ;Tibetan Ditto&lt;BR&gt;0x1033&amp;nbsp; 32&amp;nbsp;&amp;nbsp; 41&amp;nbsp;&amp;nbsp; 2&amp;nbsp; 2&amp;nbsp; ;Tibetan Single Ornament&lt;BR&gt;0x1034&amp;nbsp; 32&amp;nbsp;&amp;nbsp; 42&amp;nbsp;&amp;nbsp; 2&amp;nbsp; 2&amp;nbsp; ;Tibetan Shad&lt;BR&gt;0x1035&amp;nbsp; 32&amp;nbsp;&amp;nbsp; 43&amp;nbsp;&amp;nbsp; 2&amp;nbsp; 2&amp;nbsp; ;Tibetan Tseg&lt;BR&gt;0x1036&amp;nbsp;&amp;nbsp; 1&amp;nbsp;&amp;nbsp;&amp;nbsp; 0&amp;nbsp; 10&amp;nbsp; 0&amp;nbsp; ;Tibetan Candrabindu&lt;BR&gt;0x1037&amp;nbsp;&amp;nbsp; 1&amp;nbsp;&amp;nbsp;&amp;nbsp; 0&amp;nbsp; 11&amp;nbsp; 0&amp;nbsp; ;Tibetan 
Candrabindu With Ornament&lt;BR&gt;0x1038&amp;nbsp; 32&amp;nbsp;&amp;nbsp; 44&amp;nbsp;&amp;nbsp; 2&amp;nbsp; 2&amp;nbsp; ;Tibetan Comma&lt;BR&gt;0x1039&amp;nbsp; 32&amp;nbsp;&amp;nbsp; 45&amp;nbsp;&amp;nbsp; 2&amp;nbsp; 2&amp;nbsp; ;Tibetan Rinchanphungshad&lt;BR&gt;0x103a&amp;nbsp; 32&amp;nbsp;&amp;nbsp; 46&amp;nbsp;&amp;nbsp; 2&amp;nbsp; 2&amp;nbsp; ;Tibetan Rgyanshad&lt;BR&gt;0x103b&amp;nbsp;&amp;nbsp; 1&amp;nbsp;&amp;nbsp;&amp;nbsp; 0&amp;nbsp; 12&amp;nbsp; 0&amp;nbsp; ;Tibetan Honorific Under Ring&lt;BR&gt;0x103c&amp;nbsp; 32&amp;nbsp;&amp;nbsp; 47&amp;nbsp;&amp;nbsp; 2&amp;nbsp; 2&amp;nbsp; ;Tibetan Left Brace&lt;BR&gt;0x103d&amp;nbsp;&amp;nbsp; 1&amp;nbsp;&amp;nbsp;&amp;nbsp; 0&amp;nbsp; 13&amp;nbsp; 2&amp;nbsp; ;Tibetan Vowel Sign Ai&lt;BR&gt;0x103e&amp;nbsp;&amp;nbsp; 1&amp;nbsp;&amp;nbsp;&amp;nbsp; 0&amp;nbsp; 14&amp;nbsp; 2&amp;nbsp; ;Tibetan Vowel Sign Au&lt;BR&gt;0x1040&amp;nbsp; 12&amp;nbsp;&amp;nbsp; 16&amp;nbsp; 70&amp;nbsp; 2&amp;nbsp; ;Tibetan Digit Zero&lt;BR&gt;0x1041&amp;nbsp; 12&amp;nbsp;&amp;nbsp; 47&amp;nbsp; 70&amp;nbsp; 2&amp;nbsp; ;Tibetan Digit One&lt;BR&gt;0x1042&amp;nbsp; 12&amp;nbsp;&amp;nbsp; 66&amp;nbsp; 70&amp;nbsp; 2&amp;nbsp; ;Tibetan Digit Two&lt;BR&gt;0x1043&amp;nbsp; 12&amp;nbsp;&amp;nbsp; 84&amp;nbsp; 70&amp;nbsp; 2&amp;nbsp; ;Tibetan Digit Three&lt;BR&gt;0x1044&amp;nbsp; 12&amp;nbsp; 102&amp;nbsp; 70&amp;nbsp; 2&amp;nbsp; ;Tibetan Digit Four&lt;BR&gt;0x1045&amp;nbsp; 12&amp;nbsp; 121&amp;nbsp; 70&amp;nbsp; 2&amp;nbsp; ;Tibetan Digit Five&lt;BR&gt;0x1046&amp;nbsp; 12&amp;nbsp; 140&amp;nbsp; 70&amp;nbsp; 2&amp;nbsp; ;Tibetan Digit Six&lt;BR&gt;0x1047&amp;nbsp; 12&amp;nbsp; 158&amp;nbsp; 70&amp;nbsp; 2&amp;nbsp; ;Tibetan Digit Seven&lt;BR&gt;0x1048&amp;nbsp; 12&amp;nbsp; 176&amp;nbsp; 70&amp;nbsp; 2&amp;nbsp; ;Tibetan Digit Eight&lt;BR&gt;0x1049&amp;nbsp; 12&amp;nbsp; 194&amp;nbsp; 70&amp;nbsp; 2&amp;nbsp; ;Tibetan Digit Nine&lt;BR&gt;0x104a&amp;nbsp; 32&amp;nbsp;&amp;nbsp; 48&amp;nbsp;&amp;nbsp; 2&amp;nbsp; 2&amp;nbsp; 
;Tibetan Double Shad&lt;BR&gt;0x104b&amp;nbsp;&amp;nbsp; 1&amp;nbsp;&amp;nbsp;&amp;nbsp; 0&amp;nbsp; 15&amp;nbsp; 0&amp;nbsp; ;Tibetan Virama&lt;BR&gt;0x104c&amp;nbsp;&amp;nbsp; 1&amp;nbsp;&amp;nbsp;&amp;nbsp; 0&amp;nbsp; 16&amp;nbsp; 0&amp;nbsp; ;Tibetan Lenition Mark&lt;/FONT&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;
&lt;P mce_keep="true"&gt;The problem here? The data is all wrong!&lt;/P&gt;
&lt;P mce_keep="true"&gt;This version of Tibetan, first described in &lt;A class="" href="http://unicode.org/reports/tr2.html" mce_href="http://unicode.org/reports/tr2.html"&gt;Unicode Technical Report #2&lt;/A&gt;, was removed in Unicode 1.1 when the ISO 10646 merger happened, and then Tibetan was added back in Unicode 2.0 in an entirely different place.&lt;/P&gt;
&lt;P mce_keep="true"&gt;If you look at &lt;A class="" href="http://www.unicode.org/Public/UNIDATA/DerivedAge.txt" mce_href="http://www.unicode.org/Public/UNIDATA/DerivedAge.txt"&gt;DerivedAge.txt&lt;/A&gt;, you will see that the new Tibetan was added in July 1996.&lt;/P&gt;
&lt;P mce_keep="true"&gt;But Windows had been carrying data around from Unicode 1.0 since the very beginning of its 32-bit life, possibly as far back as NT 3.5 or even NT 3.1 (I am almost curious enough to go try and find out which, actually!).&lt;/P&gt;
&lt;P mce_keep="true"&gt;In Server 2003, it was decided that this incredibly invalid data had to be removed. &lt;/P&gt;
&lt;P mce_keep="true"&gt;For one thing, it is just really bad to start a formal versioning functionality with crap like that in there.&lt;/P&gt;
&lt;P mce_keep="true"&gt;And for another, this space that was left empty after the 1.1 merge was actually filled as of Unicode 3.0 in 1999 -- with the Myanmar script. And even though Windows did not add weights for it yet (we did not do so until Vista), keeping known bad data seemed&amp;nbsp;like a pretty bad idea...&lt;/P&gt;
&lt;P mce_keep="true"&gt;So, all of the above code points had weight in Windows from the early 32-bit days until XP, and then again in Vista (and were essentially &lt;STRONG&gt;&lt;A class="" href="http://blogs.msdn.com/michkap/archive/2005/01/18/355210.aspx" mce_href="http://blogs.msdn.com/michkap/archive/2005/01/18/355210.aspx"&gt;weightless&lt;/A&gt;&lt;/STRONG&gt; in the years between).&lt;/P&gt;
&lt;P mce_keep="true"&gt;And of course the snapshots in Jet 4.0, ACE (the version of Jet that ships with Access &amp;gt;= 2007), SQL Server 7.0, 2000, and 2005 all have these somewhat bogus code points as well....&lt;/P&gt;
&lt;P mce_keep="true"&gt;&lt;EM&gt;Oops for them (plus we can be snotty and superior about it now that it is fixed in Windows!)&lt;/EM&gt;&lt;/P&gt;
&lt;P mce_keep="true"&gt;When you talk to old-timers about the 1.1 merge between Unicode and ISO 10646, you have trouble getting a straight answer -- it is like that bit from The Number of the Beast:&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P mce_keep="true"&gt;&lt;EM&gt;&lt;FONT face="times new roman,times"&gt;I've given up trying to find out what happened in 1965: "The Year They Hanged the Lawyers." When I asked a librarian for a book on that year and decade, he wanted to know why I needed access to records in locked vaults. I left without giving my name. There is free speech -- but some subjects are not discussed....&lt;/FONT&gt;&lt;/EM&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;
&lt;P mce_keep="true"&gt;So that is all I can say about the old &lt;A class="" href="http://www.fileformat.info/info/unicode/char/1000" mce_href="http://www.fileformat.info/info/unicode/char/1000"&gt;U+1000&lt;/A&gt; TIBETAN LETTER KA which died in Unicode in the early 1990s only to rise from&amp;nbsp;its ashes in 1996 at&amp;nbsp;&lt;A class="" href="http://www.fileformat.info/info/unicode/char/0f40" mce_href="http://www.fileformat.info/info/unicode/char/0f40"&gt;U+0f40&lt;/A&gt;&amp;nbsp;with &lt;A class="" href="http://www.fileformat.info/info/unicode/char/1000" mce_href="http://www.fileformat.info/info/unicode/char/1000"&gt;U+1000&lt;/A&gt; being assigned to MYANMAR LETTER KA in 1999. The same character lived on at Microsoft&amp;nbsp;until&amp;nbsp;2003, only to be reborn along with its Myanmar cousin in Vista....&lt;/P&gt;
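&lt;P&gt;For anyone who wants to check what currently lives at those two code points, here is a minimal sketch. It assumes Python 3 and its bundled unicodedata module (neither comes from the post), and the helper name describe is purely illustrative. It reports current Unicode character names only and says nothing about Windows sort weights.&lt;/P&gt;

```python
import unicodedata

# Both code points are taken from the post; the names come from the
# Unicode data bundled with whatever Python runs this.
tibetan_ka = "\u0f40"   # where Tibetan was re-encoded in Unicode 2.0
myanmar_ka = "\u1000"   # the slot the Unicode 1.0 Tibetan block once used


def describe(ch):
    """Return 'U+XXXX NAME' for a single character (hypothetical helper)."""
    return "U+%04X %s" % (ord(ch), unicodedata.name(ch))


print(describe(tibetan_ka))
print(describe(myanmar_ka))
```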
&lt;P mce_keep="true"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P mce_keep="true"&gt;&lt;FONT color=#ff00ff&gt;&lt;EM&gt;This post brought to you by&lt;/EM&gt; &lt;FONT size=5&gt;ཀ&lt;/FONT&gt; &lt;EM&gt;and&lt;/EM&gt; &lt;FONT size=5&gt;က&lt;/FONT&gt; &lt;EM&gt;(&lt;A class="" href="http://www.fileformat.info/info/unicode/char/0f40" mce_href="http://www.fileformat.info/info/unicode/char/0f40"&gt;U+0f40&lt;/A&gt; and &lt;A class="" href="http://www.fileformat.info/info/unicode/char/1000" mce_href="http://www.fileformat.info/info/unicode/char/1000"&gt;U+1000&lt;/A&gt;, a.k.a. TIBETAN LETTER KA and MYANMAR LETTER KA)&lt;/EM&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=4605786" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/michkap/archive/tags/Collation_2F00_Casing/default.aspx">Collation/Casing</category><category domain="http://blogs.msdn.com/michkap/archive/tags/Unicode_2F00_standards/default.aspx">Unicode/standards</category><category domain="http://blogs.msdn.com/michkap/archive/tags/Every+Character+Has+a+Story/default.aspx">Every Character Has a Story</category><category domain="http://blogs.msdn.com/michkap/archive/tags/Unicode+Lame+List/default.aspx">Unicode Lame List</category></item><item><title>Every character has a story #28: U+1e9e (CAPITAL SHARP S)</title><link>http://blogs.msdn.com/michkap/archive/2007/08/24/4536979.aspx</link><pubDate>Fri, 24 Aug 2007 10:01:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:4536979</guid><dc:creator>Michael S. Kaplan</dc:creator><slash:comments>6</slash:comments><comments>http://blogs.msdn.com/michkap/comments/4536979.aspx</comments><wfw:commentRss>http://blogs.msdn.com/michkap/commentrss.aspx?PostID=4536979</wfw:commentRss><wfw:comment>http://blogs.msdn.com/michkap/rsscomments.aspx?PostID=4536979</wfw:comment><description>&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;EM&gt;That&amp;nbsp;night I saw in the pipeline fair&lt;BR&gt;A character&amp;nbsp;that wasn't there&lt;BR&gt;Non-existence won't stop the encoding; it's true&lt;BR&gt;So it's coming soon to a Unicode near you!&lt;/EM&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;It all started with &lt;STRONG&gt;&lt;A class="" href="http://blogs.msdn.com/michkap/archive/2005/09/25/473632.aspx" mce_href="http://blogs.msdn.com/michkap/archive/2005/09/25/473632.aspx"&gt;Every character has a story #15: CAPITAL SHARP S (not encoded)&lt;/A&gt;&lt;/STRONG&gt;, and then continued in &lt;STRONG&gt;&lt;A class="" href="http://blogs.msdn.com/michkap/archive/2007/05/03/2398227.aspx" mce_href="http://blogs.msdn.com/michkap/archive/2007/05/03/2398227.aspx"&gt;Every character has a story #26: CAPITAL SHARP S (might be encoded?)&lt;/A&gt;&lt;/STRONG&gt;.&lt;/P&gt;
&lt;P&gt;And you can read the title of this latest blog post and know what is happening now without any hints from me....&lt;/P&gt;
&lt;P&gt;Though I must admit the trip has been both long and strange.&lt;/P&gt;
&lt;P&gt;It was decided within both&amp;nbsp;ISO&amp;nbsp;10646&amp;nbsp;and Unicode that this interesting character was indeed going to be encoded (as per the &lt;A class="" href="http://www.unicode.org/alloc/Pipeline.html" mce_href="http://www.unicode.org/alloc/Pipeline.html"&gt;pipeline&lt;/A&gt;, it was officially accepted on May 18th of this year and as of April 27th is in Stage 5 of the &lt;A class="" href="http://www.unicode.org/alloc/Caution.html#ISO" mce_href="http://www.unicode.org/alloc/Caution.html#ISO"&gt;ISO process&lt;/A&gt;).&lt;/P&gt;
&lt;P&gt;And&amp;nbsp;I have probably learned more about the nature of letters within typography through this than through any experience before or since!&lt;/P&gt;
&lt;P&gt;Immediately after this process started, there was a whole bunch of discussion on &lt;A class="" href="http://www.unicode.org/consortium/distlist.html#uni_list" mce_href="http://www.unicode.org/consortium/distlist.html#uni_list"&gt;the Unicode List&lt;/A&gt; about a &lt;STRONG&gt;very&lt;/STRONG&gt; important topic:&lt;/P&gt;
&lt;P&gt;&lt;FONT size=5&gt;&lt;STRONG&gt;WHAT DOES A CAPITAL SHARP S LOOK LIKE?!?&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;There were a whole bunch of proposals &lt;A class="" href="http://www.typeforum.de/modules.php?op=modload&amp;amp;name=XForum&amp;amp;file=viewthread&amp;amp;tid=353" mce_href="http://www.typeforum.de/modules.php?op=modload&amp;amp;name=XForum&amp;amp;file=viewthread&amp;amp;tid=353"&gt;here&lt;/A&gt;, and much of the conversation then took a southward turn.&lt;/P&gt;
&lt;P&gt;Like people suggesting that DIN should be dissolved by law for supporting the proposal.&lt;/P&gt;
&lt;P&gt;And others pointing out that the &lt;A class="" href="http://std.dkuug.dk/jtc1/sc2/wg2/docs/N3227.pdf" mce_href="http://std.dkuug.dk/jtc1/sc2/wg2/docs/N3227.pdf"&gt;proposal&lt;/A&gt; specified an enlarged version of ß, nothing more and nothing less.&lt;/P&gt;
&lt;P&gt;But I have told you about the Unicode List: the next 100 messages oscillated between discussions of typographic innovations that would make sense if the letter did indeed exist (based on different theories of its etymology) and objections from people who remained unconvinced by the proposal even after it had been accepted, since in their view it isn't a freaking letter in the first place.&lt;/P&gt;
&lt;P&gt;Plus lots of SZ vs. SS arguments.&lt;/P&gt;
&lt;P&gt;The Germans I knew, when informally surveyed, all seemed to fall squarely in the camp that considers DIN insane, though many of them considered the opinion to be redundant....&lt;/P&gt;
&lt;P&gt;And then with a few people talking about the consequences for Unicode properties, just to add the vague scent of relevance to the discussion. :-)&lt;/P&gt;
&lt;P&gt;John Hudson had in my opinion the most amusing observation:&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;EM&gt;The irony of the recent exchanges is not lost on me:&lt;BR&gt;&lt;BR&gt;On the one hand, we have Marnen Laibow-Koser, who thinks that this character should &lt;STRONG&gt;not&lt;/STRONG&gt; exist, but that it does, and therefore needs to be encoded.&lt;BR&gt;&lt;BR&gt;On the other hand, we have me, who thinks that this character &lt;STRONG&gt;should&lt;/STRONG&gt; exist, but that it does not, and therefore does not need to be encoded.&lt;/EM&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;Just so.&lt;/P&gt;
&lt;P&gt;For Microsoft, it raises some interesting questions for both collation and case for the next version of Windows.&lt;/P&gt;
&lt;P&gt;I mean, think about the issues I have already talked about in posts like &lt;A class="" href="http://blogs.msdn.com/michkap/archive/2005/04/10/406880.aspx" mce_href="http://blogs.msdn.com/michkap/archive/2005/04/10/406880.aspx"&gt;&lt;STRONG&gt;What the %#$* is wrong with German sorting?&lt;/STRONG&gt;&lt;/A&gt;&amp;nbsp;where we make &lt;STRONG&gt;ss&lt;/STRONG&gt; equal to &lt;STRONG&gt;ß&lt;/STRONG&gt; so that the uppercase version "SS" will sort near the ß in a sort ignoring case -- where we do things that make less linguistic sense in order to give regular results that are intuitive.&lt;/P&gt;
&lt;P&gt;So who would expect that if &lt;A class="" href="http://www.fileformat.info/info/unicode/char/00df" mce_href="http://www.fileformat.info/info/unicode/char/00df"&gt;U+00df&lt;/A&gt;&amp;nbsp;is equal to ss that &lt;A class="" href="http://www.fileformat.info/info/unicode/char/1e9e" mce_href="http://www.fileformat.info/info/unicode/char/1e9e"&gt;U+1e9e&lt;/A&gt;&amp;nbsp;wouldn't be made equal to SS? Meaning that in the collation tables, &lt;A class="" href="http://www.fileformat.info/info/unicode/char/00df" mce_href="http://www.fileformat.info/info/unicode/char/00df"&gt;U+00df&lt;/A&gt;&amp;nbsp;and &lt;A class="" href="http://www.fileformat.info/info/unicode/char/1e9e" mce_href="http://www.fileformat.info/info/unicode/char/1e9e"&gt;U+1e9e&lt;/A&gt;&amp;nbsp;would simply be case variants, with no real choice in the matter.&lt;/P&gt;
&lt;P&gt;And as to casing....&lt;/P&gt;
&lt;P&gt;Now just because we make the relationship in casing does not mean we make it in collation. After all, as I have pointed out &lt;STRONG&gt;&lt;A class="" href="http://blogs.msdn.com/michkap/archive/2006/08/08/692390.aspx" mce_href="http://blogs.msdn.com/michkap/archive/2006/08/08/692390.aspx"&gt;several&lt;/A&gt;&lt;/STRONG&gt; &lt;A class="" href="http://blogs.msdn.com/michkap/archive/2006/01/11/511557.aspx" mce_href="http://blogs.msdn.com/michkap/archive/2006/01/11/511557.aspx"&gt;&lt;STRONG&gt;times&lt;/STRONG&gt;&lt;/A&gt; before, collation != case.&lt;/P&gt;
&lt;P&gt;But on the other hand, the case table is used in order to enforce the case insensitivity in the NT object namespace and the file system. And one clear issue is that there is no good reason to allow one to put filenames differing only by the presence of &lt;A class="" href="http://www.fileformat.info/info/unicode/char/00df" mce_href="http://www.fileformat.info/info/unicode/char/00df"&gt;U+00df&lt;/A&gt;&amp;nbsp;and &lt;A class="" href="http://www.fileformat.info/info/unicode/char/1e9e" mce_href="http://www.fileformat.info/info/unicode/char/1e9e"&gt;U+1e9e&lt;/A&gt;&amp;nbsp;in the same directory. Users would either never try it or they would never expect it to work. So it is quite possible that in the next version of Windows (which only does simple casing) it may make the most sense to make the two characters case variants of each other -- to enforce reasonable use of&amp;nbsp;both letters!&lt;/P&gt;
&lt;P&gt;There is still lots of time to decide, though at present I am leaning this way since it will give the most intuitive behavior for end users (even at the expense of giving slightly unintuitive results for developers).&lt;/P&gt;
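&lt;P&gt;To make the casing question concrete, here is a minimal sketch of how the two characters behave under the case mappings Unicode ended up publishing, as exposed by Python 3 string methods. Python is an assumption of this example, not something the post uses, and its behavior says nothing about what the Windows case table does (Windows, as noted above, only does simple casing).&lt;/P&gt;

```python
import unicodedata

small_sharp_s = "\u00df"
capital_sharp_s = "\u1e9e"

# Full case mapping: uppercasing the small letter expands to two letters,
# so it does not round-trip to U+1E9E.
upper_of_small = small_sharp_s.upper()

# The capital letter has a one-to-one lowercase mapping to the small one.
lower_of_capital = capital_sharp_s.lower()

# Case folding maps both letters to "ss", so a caseless comparison
# treats them (and a plain "SS") as equal.
same_when_folded = small_sharp_s.casefold() == capital_sharp_s.casefold()

print(unicodedata.name(capital_sharp_s))
print(upper_of_small, lower_of_capital == small_sharp_s, same_when_folded)
```

&lt;P&gt;The asymmetry is the interesting part: lowercasing U+1e9e gives U+00df, but uppercasing U+00df gives "SS" rather than U+1e9e, which is why a simple one-to-one case table has to make an explicit choice.&lt;/P&gt;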
&lt;P mce_keep="true"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;FONT color=#ff00ff&gt;&lt;EM&gt;This post brought to you by&lt;/EM&gt; &lt;FONT size=4&gt;ß&lt;/FONT&gt; &lt;EM&gt;and&lt;/EM&gt; &lt;FONT size=6&gt;ß&lt;/FONT&gt; &lt;EM&gt;(&lt;A class="" href="http://www.fileformat.info/info/unicode/char/00df" mce_href="http://www.fileformat.info/info/unicode/char/00df"&gt;U+00df&lt;/A&gt; and &lt;A class="" href="http://www.fileformat.info/info/unicode/char/1e9e" mce_href="http://www.fileformat.info/info/unicode/char/1e9e"&gt;U+1e9e&lt;/A&gt;, LATIN SMALL LETTER SHARP S and&amp;nbsp;LATIN CAPITAL LETTER SHARP S)&lt;/EM&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=4536979" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/michkap/archive/tags/Unicode_2F00_standards/default.aspx">Unicode/standards</category><category domain="http://blogs.msdn.com/michkap/archive/tags/Every+Character+Has+a+Story/default.aspx">Every Character Has a Story</category><category domain="http://blogs.msdn.com/michkap/archive/tags/Fonts_2F00_Typography/default.aspx">Fonts/Typography</category></item><item><title>Fractions may be your friends, but they won't pick you up at the airport!</title><link>http://blogs.msdn.com/michkap/archive/2007/07/14/3869052.aspx</link><pubDate>Sat, 14 Jul 2007 22:01:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:3869052</guid><dc:creator>Michael S. Kaplan</dc:creator><slash:comments>3</slash:comments><comments>http://blogs.msdn.com/michkap/comments/3869052.aspx</comments><wfw:commentRss>http://blogs.msdn.com/michkap/commentrss.aspx?PostID=3869052</wfw:commentRss><wfw:comment>http://blogs.msdn.com/michkap/rsscomments.aspx?PostID=3869052</wfw:comment><description>&lt;P&gt;To me, fractions will always have a special place.&amp;nbsp;The teacher pointed out we all knew what ½ was, and we all knew what 0.5 was, and we all knew about division. 
Then he blew my mind when he pointed out that they were connected not because they were multiple things to memorize, but because they really were the same thing.&lt;/P&gt;
&lt;P&gt;Fractions are&amp;nbsp;the first time I saw math operations as all being connected. And it was mostly because that one teacher was the first one to teach it that way (and the book didn't, it was him). I hope the books are better now, since I think my old teacher is retired....&lt;/P&gt;
&lt;P&gt;He also was the first person I remember telling me that fractions were our friends. I think I almost believed it at the time, as it got me realizing that math had the potential to be cool.&lt;/P&gt;
&lt;P&gt;Thanks, Mr. Snodgrass!&lt;/P&gt;
&lt;P&gt;Anyway, back to fractions.&lt;/P&gt;
&lt;P&gt;In Unicode, there are a whole bunch of fractions encoded. Here they are in numeric order (something not even &lt;A class="" href="http://blogs.msdn.com/michkap/archive/2005/01/05/346933.aspx" mce_href="http://blogs.msdn.com/michkap/archive/2005/01/05/346933.aspx"&gt;&lt;STRONG&gt;Shell number sorting&lt;/STRONG&gt;&lt;/A&gt; will do for you!):&lt;/P&gt;
&lt;TABLE class="" border=1&gt;
&lt;TBODY&gt;
&lt;TR&gt;
&lt;TD class=""&gt;&lt;STRONG&gt;&amp;nbsp;Code Point&amp;nbsp; &lt;/STRONG&gt;&lt;/TD&gt;
&lt;TD class=""&gt;&lt;STRONG&gt;&amp;nbsp;Character &amp;nbsp;&lt;/STRONG&gt;&lt;/TD&gt;
&lt;TD class=""&gt;&lt;STRONG&gt;&amp;nbsp;Character Name&lt;/STRONG&gt;&lt;/TD&gt;
&lt;TD class=""&gt;&lt;STRONG&gt;&amp;nbsp;1.0 Character Name&lt;/STRONG&gt;&lt;/TD&gt;&lt;/TR&gt;
&lt;TR&gt;
&lt;TD class=""&gt;&amp;nbsp;&lt;A class="" href="http://www.fileformat.info/info/unicode/char/215b" mce_href="http://www.fileformat.info/info/unicode/char/215b"&gt;U+215b&lt;/A&gt;&lt;/TD&gt;
&lt;TD class=""&gt;&lt;FONT face=tahoma,arial,helvetica,sans-serif size=4&gt;&amp;nbsp;⅛&lt;/FONT&gt;&lt;/TD&gt;
&lt;TD class=""&gt;&amp;nbsp;VULGAR FRACTION ONE EIGHTH&lt;/TD&gt;
&lt;TD class=""&gt;&amp;nbsp;FRACTION ONE EIGHTH&lt;/TD&gt;&lt;/TR&gt;
&lt;TR&gt;
&lt;TD class=""&gt;&amp;nbsp;&lt;A class="" href="http://www.fileformat.info/info/unicode/char/2159" mce_href="http://www.fileformat.info/info/unicode/char/2159"&gt;U+2159&lt;/A&gt;&lt;/TD&gt;
&lt;TD class=""&gt;&lt;FONT face=tahoma,arial,helvetica,sans-serif size=4&gt;&amp;nbsp;⅙&lt;/FONT&gt;&lt;/TD&gt;
&lt;TD class=""&gt;&amp;nbsp;VULGAR FRACTION ONE SIXTH&lt;/TD&gt;
&lt;TD class=""&gt;&amp;nbsp;FRACTION ONE SIXTH&lt;/TD&gt;&lt;/TR&gt;
&lt;TR&gt;
&lt;TD class=""&gt;&amp;nbsp;&lt;A class="" href="http://www.fileformat.info/info/unicode/char/2155" mce_href="http://www.fileformat.info/info/unicode/char/2155"&gt;U+2155&lt;/A&gt;&lt;/TD&gt;
&lt;TD class=""&gt;&lt;FONT face=tahoma,arial,helvetica,sans-serif size=4&gt;&amp;nbsp;⅕&lt;/FONT&gt;&lt;/TD&gt;
&lt;TD class=""&gt;&amp;nbsp;VULGAR FRACTION ONE FIFTH&lt;/TD&gt;
&lt;TD class=""&gt;&amp;nbsp;FRACTION ONE FIFTH&lt;/TD&gt;&lt;/TR&gt;
&lt;TR&gt;
&lt;TD class=""&gt;&amp;nbsp;&lt;A class="" href="http://www.fileformat.info/info/unicode/char/00bc" mce_href="http://www.fileformat.info/info/unicode/char/00bc"&gt;U+00bc&lt;/A&gt;&lt;/TD&gt;
&lt;TD class=""&gt;&lt;FONT face=tahoma,arial,helvetica,sans-serif size=4&gt;&amp;nbsp;¼&lt;/FONT&gt;&lt;/TD&gt;
&lt;TD class=""&gt;&amp;nbsp;VULGAR FRACTION ONE QUARTER&lt;/TD&gt;
&lt;TD class=""&gt;&amp;nbsp;FRACTION ONE QUARTER&lt;/TD&gt;&lt;/TR&gt;
&lt;TR&gt;
&lt;TD class=""&gt;&amp;nbsp;&lt;A class="" href="http://www.fileformat.info/info/unicode/char/2153" mce_href="http://www.fileformat.info/info/unicode/char/2153"&gt;U+2153&lt;/A&gt;&lt;/TD&gt;
&lt;TD class=""&gt;&lt;FONT face=tahoma,arial,helvetica,sans-serif size=4&gt;&amp;nbsp;⅓&lt;/FONT&gt;&lt;/TD&gt;
&lt;TD class=""&gt;&amp;nbsp;VULGAR FRACTION ONE THIRD&lt;/TD&gt;
&lt;TD class=""&gt;&amp;nbsp;FRACTION ONE THIRD&lt;/TD&gt;&lt;/TR&gt;
&lt;TR&gt;
&lt;TD class=""&gt;&amp;nbsp;&lt;A class="" href="http://www.fileformat.info/info/unicode/char/215c" mce_href="http://www.fileformat.info/info/unicode/char/215c"&gt;U+215c&lt;/A&gt;&lt;/TD&gt;
&lt;TD class=""&gt;&lt;FONT face=tahoma,arial,helvetica,sans-serif size=4&gt;&amp;nbsp;⅜&lt;/FONT&gt;&lt;/TD&gt;
&lt;TD class=""&gt;&amp;nbsp;VULGAR FRACTION THREE EIGHTHS&lt;/TD&gt;
&lt;TD class=""&gt;&amp;nbsp;FRACTION THREE EIGHTHS&lt;/TD&gt;&lt;/TR&gt;
&lt;TR&gt;
&lt;TD class=""&gt;&amp;nbsp;&lt;A class="" href="http://www.fileformat.info/info/unicode/char/2156" mce_href="http://www.fileformat.info/info/unicode/char/2156"&gt;U+2156&lt;/A&gt;&lt;/TD&gt;
&lt;TD class=""&gt;&lt;FONT face=tahoma,arial,helvetica,sans-serif size=4&gt;&amp;nbsp;⅖&lt;/FONT&gt;&lt;/TD&gt;
&lt;TD class=""&gt;&amp;nbsp;VULGAR FRACTION TWO FIFTHS&lt;/TD&gt;
&lt;TD class=""&gt;&amp;nbsp;FRACTION TWO FIFTHS&lt;/TD&gt;&lt;/TR&gt;
&lt;TR&gt;
&lt;TD class=""&gt;&amp;nbsp;&lt;A class="" href="http://www.fileformat.info/info/unicode/char/00bd" mce_href="http://www.fileformat.info/info/unicode/char/00bd"&gt;U+00bd&lt;/A&gt;&lt;/TD&gt;
&lt;TD class=""&gt;&lt;FONT face=tahoma,arial,helvetica,sans-serif size=4&gt;&amp;nbsp;½&lt;/FONT&gt;&lt;/TD&gt;
&lt;TD class=""&gt;&amp;nbsp;VULGAR FRACTION ONE HALF&lt;/TD&gt;
&lt;TD class=""&gt;&amp;nbsp;FRACTION ONE HALF&lt;/TD&gt;&lt;/TR&gt;
&lt;TR&gt;
&lt;TD class=""&gt;&amp;nbsp;&lt;A class="" href="http://www.fileformat.info/info/unicode/char/2157" mce_href="http://www.fileformat.info/info/unicode/char/2157"&gt;U+2157&lt;/A&gt;&lt;/TD&gt;
&lt;TD class=""&gt;&lt;FONT face=tahoma,arial,helvetica,sans-serif size=4&gt;&amp;nbsp;⅗&lt;/FONT&gt;&lt;/TD&gt;
&lt;TD class=""&gt;&amp;nbsp;VULGAR FRACTION THREE FIFTHS&lt;/TD&gt;
&lt;TD class=""&gt;&amp;nbsp;FRACTION THREE FIFTHS&lt;/TD&gt;&lt;/TR&gt;
&lt;TR&gt;
&lt;TD class=""&gt;&amp;nbsp;&lt;A class="" href="http://www.fileformat.info/info/unicode/char/215d" mce_href="http://www.fileformat.info/info/unicode/char/215d"&gt;U+215d&lt;/A&gt;&lt;/TD&gt;
&lt;TD class=""&gt;&lt;FONT face=tahoma,arial,helvetica,sans-serif size=4&gt;&amp;nbsp;⅝&lt;/FONT&gt;&lt;/TD&gt;
&lt;TD class=""&gt;&amp;nbsp;VULGAR FRACTION FIVE EIGHTHS&lt;/TD&gt;
&lt;TD class=""&gt;&amp;nbsp;FRACTION FIVE EIGHTHS&lt;/TD&gt;&lt;/TR&gt;
&lt;TR&gt;
&lt;TD class=""&gt;&amp;nbsp;&lt;A class="" href="http://www.fileformat.info/info/unicode/char/2154" mce_href="http://www.fileformat.info/info/unicode/char/2154"&gt;U+2154&lt;/A&gt;&lt;/TD&gt;
&lt;TD class=""&gt;&lt;FONT face=tahoma,arial,helvetica,sans-serif size=4&gt;&amp;nbsp;⅔&lt;/FONT&gt;&lt;/TD&gt;
&lt;TD class=""&gt;&amp;nbsp;VULGAR FRACTION TWO THIRDS&lt;/TD&gt;
&lt;TD class=""&gt;&amp;nbsp;FRACTION TWO THIRDS&lt;/TD&gt;&lt;/TR&gt;
&lt;TR&gt;
&lt;TD class=""&gt;&amp;nbsp;&lt;A class="" href="http://www.fileformat.info/info/unicode/char/00be" mce_href="http://www.fileformat.info/info/unicode/char/00be"&gt;U+00be&lt;/A&gt;&lt;/TD&gt;
&lt;TD class=""&gt;&lt;FONT face=tahoma,arial,helvetica,sans-serif size=4&gt;&amp;nbsp;¾&lt;/FONT&gt;&lt;/TD&gt;
&lt;TD class=""&gt;&amp;nbsp;VULGAR FRACTION THREE QUARTERS&amp;nbsp; &lt;/TD&gt;
&lt;TD class=""&gt;&amp;nbsp;FRACTION THREE QUARTERS&amp;nbsp; &lt;/TD&gt;&lt;/TR&gt;
&lt;TR&gt;
&lt;TD class=""&gt;&amp;nbsp;&lt;A class="" href="http://www.fileformat.info/info/unicode/char/2158" mce_href="http://www.fileformat.info/info/unicode/char/2158"&gt;U+2158&lt;/A&gt;&lt;/TD&gt;
&lt;TD class=""&gt;&lt;FONT face=tahoma,arial,helvetica,sans-serif size=4&gt;&amp;nbsp;⅘&lt;/FONT&gt;&lt;/TD&gt;
&lt;TD class=""&gt;&amp;nbsp;VULGAR FRACTION FOUR FIFTHS&lt;/TD&gt;
&lt;TD class=""&gt;&amp;nbsp;FRACTION FOUR FIFTHS&lt;/TD&gt;&lt;/TR&gt;
&lt;TR&gt;
&lt;TD class=""&gt;&amp;nbsp;&lt;A class="" href="http://www.fileformat.info/info/unicode/char/215a" mce_href="http://www.fileformat.info/info/unicode/char/215a"&gt;U+215a&lt;/A&gt;&lt;/TD&gt;
&lt;TD class=""&gt;&lt;FONT face=tahoma,arial,helvetica,sans-serif size=4&gt;&amp;nbsp;⅚&lt;/FONT&gt;&lt;/TD&gt;
&lt;TD class=""&gt;&amp;nbsp;VULGAR FRACTION FIVE SIXTHS&lt;/TD&gt;
&lt;TD class=""&gt;&amp;nbsp;FRACTION FIVE SIXTHS&lt;/TD&gt;&lt;/TR&gt;
&lt;TR&gt;
&lt;TD class=""&gt;&amp;nbsp;&lt;A class="" href="http://www.fileformat.info/info/unicode/char/215e" mce_href="http://www.fileformat.info/info/unicode/char/215e"&gt;U+215e&lt;/A&gt;&lt;/TD&gt;
&lt;TD class=""&gt;&lt;FONT face=tahoma,arial,helvetica,sans-serif size=4&gt;&amp;nbsp;⅞&lt;/FONT&gt;&lt;/TD&gt;
&lt;TD class=""&gt;&amp;nbsp;VULGAR FRACTION SEVEN EIGHTHS&lt;/TD&gt;
&lt;TD class=""&gt;&amp;nbsp;FRACTION SEVEN EIGHTHS&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;
&lt;P mce_keep="true"&gt;That last column is really interesting -- it is the name of the character in Unicode 1.0, prior to the merger with ISO 10646. Name changes were done to go along with particular preferences in 10646, which in most cases related to compatibility with other ISO standards that involved names and in other cases involved particular conventions.&lt;/P&gt;
&lt;P mce_keep="true"&gt;Though I have to wonder how we are going to get the next generation of kids interested in considering fractions to be their friends if we spend all of our time telling them that fractions are vulgar!&lt;/P&gt;
&lt;P mce_keep="true"&gt;Now all of these characters are really thought to be compatibility characters -- the "right" way to do fractions is to use regular numbers and U+2044 (FRACTION SLASH) between them...&lt;/P&gt;
&lt;P&gt;Which would perhaps explain why they are vulgar? :-)&lt;/P&gt;
&lt;P&gt;Actually, the reason they are called vulgar is apparently that the reference glyphs use diagonal slashes rather than horizontal bars. Though the two different ways to write fractions are considered glyph variants of each other (as they should be, since they are). Which means that a font developer can use either way to show them.&lt;/P&gt;
&lt;P&gt;There are other standards that actually do try to distinguish between them and encode both -- for example the ones in the DPRK (which caused some interesting discussions in WG2 and UTC when the DPRK additions to Unicode were being discussed back in 2001, with interesting conversations about variation selectors, if memory serves). I do remember that no one from UTC wanted to encode&amp;nbsp;that extra set of fractions.&lt;/P&gt;
&lt;P&gt;In any case,&amp;nbsp;I guess the moral of the story using the principles of logic is that &lt;STRONG&gt;your friends are vulgar&lt;/STRONG&gt; (which actually was an alternate title I considered for this post)....&lt;/P&gt;
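&lt;P&gt;Two of the technical points above can be checked with a short sketch: the numeric ordering used in the table, and the claim that the precomposed fractions are compatibility characters built around U+2044. It assumes Python 3 and its bundled unicodedata module, which the post itself does not mention; the variable names are illustrative.&lt;/P&gt;

```python
import unicodedata

# The fifteen precomposed fractions from the table, listed in code point order.
fractions = (
    "\u00bc\u00bd\u00be"
    "\u2153\u2154\u2155\u2156\u2157\u2158\u2159\u215a"
    "\u215b\u215c\u215d\u215e"
)

# Sort by the Unicode Numeric_Value property rather than by code point.
in_numeric_order = sorted(fractions, key=unicodedata.numeric)

# Compatibility decomposition of ONE HALF: a digit, U+2044 FRACTION SLASH,
# and another digit.
half_decomposed = unicodedata.normalize("NFKD", "\u00bd")

for ch in in_numeric_order:
    print("U+%04X" % ord(ch), unicodedata.name(ch))
print(["U+%04X" % ord(c) for c in half_decomposed])
```

&lt;P&gt;Sorting on the numeric value reproduces the order of the table, from one eighth up to seven eighths, and the decomposition shows the "regular numbers plus FRACTION SLASH" form directly.&lt;/P&gt;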
&lt;P mce_keep="true"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P mce_keep="true"&gt;&lt;FONT color=#ff00ff&gt;&lt;EM&gt;This post brought to you by &lt;/EM&gt;&amp;nbsp;&lt;FONT size=5&gt;⁄&lt;/FONT&gt;&amp;nbsp; &lt;EM&gt;(&lt;A class="" href="http://www.fileformat.info/info/unicode/char/2044" mce_href="http://www.fileformat.info/info/unicode/char/2044"&gt;U+2044&lt;/A&gt;, a.k.a. FRACTION SLASH)&lt;/EM&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=3869052" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/michkap/archive/tags/Unicode_2F00_standards/default.aspx">Unicode/standards</category><category domain="http://blogs.msdn.com/michkap/archive/tags/Every+Character+Has+a+Story/default.aspx">Every Character Has a Story</category><category domain="http://blogs.msdn.com/michkap/archive/tags/Fonts_2F00_Typography/default.aspx">Fonts/Typography</category></item><item><title>Every character has a story #26: CAPITAL SHARP S (might be encoded?) </title><link>http://blogs.msdn.com/michkap/archive/2007/05/03/2398227.aspx</link><pubDate>Thu, 03 May 2007 22:15:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:2398227</guid><dc:creator>Michael S. Kaplan</dc:creator><slash:comments>6</slash:comments><comments>http://blogs.msdn.com/michkap/comments/2398227.aspx</comments><wfw:commentRss>http://blogs.msdn.com/michkap/commentrss.aspx?PostID=2398227</wfw:commentRss><wfw:comment>http://blogs.msdn.com/michkap/rsscomments.aspx?PostID=2398227</wfw:comment><description>&lt;P&gt;&lt;EM&gt;(ref: &lt;STRONG&gt;&lt;A class="" href="http://blogs.msdn.com/michkap/archive/2005/09/25/473632.aspx" mce_href="http://blogs.msdn.com/michkap/archive/2005/09/25/473632.aspx"&gt;Every character has a story #15: CAPITAL SHARP S (not encoded)&lt;/A&gt;&lt;/STRONG&gt;)&amp;nbsp;&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;When Michel Suignard came back from the WG2 meeting, I scooted over to get the quick word on what interesting things happened....&lt;/P&gt;
&lt;P&gt;I was surprised to hear that the Capital Sharp S was accepted, which means it will probably be discussed at the upcoming UTC meeting.&lt;/P&gt;
&lt;P&gt;Kind of ironic given John Hudson's conversation with Andreas Stötzner I mentioned before.&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;(&lt;/EM&gt;&lt;A class="" href="http://std.dkuug.dk/jtc1/sc2/wg2/docs/N3227.pdf" mce_href="http://std.dkuug.dk/jtc1/sc2/wg2/docs/N3227.pdf"&gt;&lt;EM&gt;Here&lt;/EM&gt;&lt;/A&gt;&lt;EM&gt; is the revised proposal for those who are interested)&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;I have talked about the Capital Sharp S proposal &lt;STRONG&gt;&lt;A class="" href="http://blogs.msdn.com/michkap/archive/2005/09/25/473632.aspx" mce_href="http://blogs.msdn.com/michkap/archive/2005/09/25/473632.aspx"&gt;previously&lt;/A&gt;&lt;/STRONG&gt;. I won't comment much now, though I probably will after the UTC meeting; the linguistic and technical issues are both likely to be the subject of future posts!&lt;/P&gt;
&lt;P&gt;In the meantime, Adam Twardoch pointed folks to a blog entry from Ivo Gabrowitsch on the subject, for those of you who know German (you can read it &lt;STRONG&gt;&lt;A class="" href="http://www.fontwerk.com/451/kommt-jetzt-das-versal-eszett/" mce_href="http://www.fontwerk.com/451/kommt-jetzt-das-versal-eszett/"&gt;here&lt;/A&gt;&lt;/STRONG&gt;).&lt;/P&gt;
&lt;P mce_keep="true"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;FONT color=#ff00ff&gt;&lt;EM&gt;This post brought to you by&lt;/EM&gt; &lt;FONT size=6&gt;ß&lt;/FONT&gt; &lt;EM&gt;(&lt;A class="" href="http://www.fileformat.info/info/unicode/char/00df" mce_href="http://www.fileformat.info/info/unicode/char/00df"&gt;U+00df&lt;/A&gt;, a.k.a. LATIN SMALL LETTER SHARP S)&lt;/EM&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=2398227" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/michkap/archive/tags/Collation_2F00_Casing/default.aspx">Collation/Casing</category><category domain="http://blogs.msdn.com/michkap/archive/tags/Linguistic/default.aspx">Linguistic</category><category domain="http://blogs.msdn.com/michkap/archive/tags/Unicode_2F00_standards/default.aspx">Unicode/standards</category><category domain="http://blogs.msdn.com/michkap/archive/tags/Every+Character+Has+a+Story/default.aspx">Every Character Has a Story</category><category domain="http://blogs.msdn.com/michkap/archive/tags/Int_2700_l+Programming/default.aspx">Int'l Programming</category></item><item><title>Every character has a story #25: U+00a4 (CURRENCY SYMBOL)</title><link>http://blogs.msdn.com/michkap/archive/2007/03/15/1885864.aspx</link><pubDate>Thu, 15 Mar 2007 11:39:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:1885864</guid><dc:creator>Michael S. 
Kaplan</dc:creator><slash:comments>11</slash:comments><comments>http://blogs.msdn.com/michkap/comments/1885864.aspx</comments><wfw:commentRss>http://blogs.msdn.com/michkap/commentrss.aspx?PostID=1885864</wfw:commentRss><wfw:comment>http://blogs.msdn.com/michkap/rsscomments.aspx?PostID=1885864</wfw:comment><description>&lt;P&gt;&lt;EM&gt;(Don't these 'Every character has a story' posts remind you of a Colbert-esque &lt;/EM&gt;&lt;A class="" href="http://en.wikipedia.org/wiki/Better_Know_A_District" mce_href="http://en.wikipedia.org/wiki/Better_Know_A_District"&gt;&lt;EM&gt;Better Know a District&lt;/EM&gt;&lt;/A&gt;&lt;EM&gt;&amp;nbsp;series? Maybe I should rename the series to &lt;STRONG&gt;Better Know a Character&lt;/STRONG&gt;. What does everyone think?)&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;Just recently, John asked:&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;EM&gt;&lt;FONT face="times new roman,times"&gt;Just a quick question I can't seem to find an answer for: what is the purpose and actual utility of the International Monetary Symbol? (ALT-0164)&lt;BR&gt;&lt;BR&gt;I have never seen it in use (except as a non printable character used as an end-of-cell marker) but I have been asked if it could/should be used in some of our applications.&amp;nbsp;&lt;/FONT&gt;&lt;/EM&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;Although it is true that in &lt;A class="" href="http://blogs.msdn.com/michkap/archive/tags/Every+Character+Has+a+Story/default.aspx" mce_href="http://blogs.msdn.com/michkap/archive/tags/Every+Character+Has+a+Story/default.aspx"&gt;&lt;STRONG&gt;Unicode Every Character has a Story&lt;/STRONG&gt;&lt;/A&gt;, there are some characters encoded in Unicode whose story eludes us.&lt;/P&gt;
&lt;P&gt;It is easy to look at ¤ (&lt;A class="" href="http://www.fileformat.info/info/unicode/char/00a4" mce_href="http://www.fileformat.info/info/unicode/char/00a4"&gt;U+00a4&lt;/A&gt;, a.k.a. CURRENCY SIGN) and assume it must have a hell of a story.&lt;/P&gt;
&lt;P&gt;People in the wide wide world on the whole don't recognize it, and it isn't actually a currency sign in any country in the world, nor has it ever been.&lt;/P&gt;
&lt;P&gt;Unicode does give it a general category of &lt;A class="" href="http://www.fileformat.info/info/unicode/category/Sc/index.htm" mce_href="http://www.fileformat.info/info/unicode/category/Sc/index.htm"&gt;Sc&lt;/A&gt; (Symbol, Currency), and it does have a place not only in Unicode but 0xA4 in &lt;A class="" href="http://www.microsoft.com/globaldev/reference/sbcs/1252.mspx" mce_href="http://www.microsoft.com/globaldev/reference/sbcs/1252.mspx"&gt;Windows Code Page 1252&lt;/A&gt; and 0xCF in&amp;nbsp;&lt;A class="" href="http://www.microsoft.com/globaldev/reference/oem/850.mspx" mce_href="http://www.microsoft.com/globaldev/reference/oem/850.mspx"&gt;OEM Code page 850&lt;/A&gt; and 0xA4 in &lt;A class="" href="http://www.microsoft.com/globaldev/reference/iso/28591.mspx" mce_href="http://www.microsoft.com/globaldev/reference/iso/28591.mspx"&gt;ISO-8859-1&lt;/A&gt;.&lt;/P&gt;
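&lt;P&gt;If you would rather check those code page positions than take my word for it, here is a minimal Python sketch. It assumes the codec tables that ship with Python (&lt;EM&gt;cp1252&lt;/EM&gt;, &lt;EM&gt;cp850&lt;/EM&gt; and &lt;EM&gt;latin-1&lt;/EM&gt;) match the Microsoft tables linked above, and the variable names are just for illustration:&lt;/P&gt;
&lt;PRE&gt;
```python
import unicodedata

CURRENCY_SIGN = "\u00a4"

# General category and formal name from the Unicode Character Database.
category = unicodedata.category(CURRENCY_SIGN)
name = unicodedata.name(CURRENCY_SIGN)

# Byte value of the character in each of the three legacy encodings
# (assumes Python's bundled codec tables agree with the linked charts).
encoded = {
    codec: CURRENCY_SIGN.encode(codec)[0]
    for codec in ("cp1252", "cp850", "latin-1")
}

print(category, name)
for codec, byte in encoded.items():
    print(f"{codec}: 0x{byte:02X}")
```
&lt;/PRE&gt;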
&lt;P&gt;Everywhere you look, there's&amp;nbsp;this currency sign, one that is used for no currency.&lt;/P&gt;
&lt;P&gt;I even tried asking Ken Whistler, who had no conclusive thoughts to add:&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;EM&gt;Michael,&lt;BR&gt;&lt;BR&gt;Nothing more than my guesses. It predates my involvement, since it is an ISO 8859-1 thing, whence it got into Unicode.&lt;BR&gt;&lt;BR&gt;And 8859-1 got it from IBM, perhaps. It is:&lt;BR&gt;&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; SC010000 International Currency Symbol&lt;BR&gt;&lt;BR&gt;in the old IBM Graphic Character Identification System.&lt;BR&gt;&lt;BR&gt;Maybe some old hand at IBM would know what it was originally intended for.&lt;BR&gt;&lt;BR&gt;I'm gonna guess that it was intended as a placeholder character for EBCDIC code pages, to enable formatting of currency that had symbols not otherwise representable on the code page.&lt;BR&gt;In other words, as a *replacement* currency symbol for a missing currency symbol.&lt;BR&gt;&lt;BR&gt;But that's just a guess, absent information from the horse's mouth.&lt;/EM&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;A few more discreet inquiries along the lines Ken suggested also&amp;nbsp;yielded nothing, though I did verify the &lt;STRONG&gt;IBM Graphic Character Identification System&lt;/STRONG&gt; connection. &lt;/P&gt;
&lt;P&gt;It is funny that when space was at such a premium in ISO 8859-1, they would use up a slot for a character that isn't really used for anything. Though it does seem to be in every font and it is not used for anything else (maybe that is why it is used in Word as an end-of-cell marker in tables):&lt;/P&gt;
&lt;P&gt;&lt;IMG height=443 src="http://www.trigeminal.com/images/CurrencySign.png" width=685&gt;&lt;/P&gt;
&lt;P&gt;Weird, huh?&lt;/P&gt;
&lt;P&gt;To answer John's question, in my opinion, why &lt;EM&gt;not&lt;/EM&gt; use it? It has nothing better to contribute, so have fun! :-)&lt;/P&gt;
&lt;P&gt;Maybe somebody out there knows for sure what IBM's original plans were for it. Plans so important that &lt;A class="" href="http://www.microsoft.com/globaldev/reference/iso/28591.mspx" mce_href="http://www.microsoft.com/globaldev/reference/iso/28591.mspx"&gt;ISO 8859-1&lt;/A&gt;, which managed to miss out on an uppercase version of a letter (a character story for another day, one that I actually &lt;EM&gt;do&lt;/EM&gt; know!), had to include it.&lt;/P&gt;
&lt;P&gt;Because every character has a story, even if&amp;nbsp;we don't know what it really is....&lt;/P&gt;
&lt;P mce_keep="true"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;FONT color=#ff00ff&gt;&lt;EM&gt;This post brought to you by&lt;/EM&gt; &lt;FONT size=5&gt;¤&lt;/FONT&gt; &lt;EM&gt;(&lt;A class="" href="http://www.fileformat.info/info/unicode/char/00a4" mce_href="http://www.fileformat.info/info/unicode/char/00a4"&gt;U+00a4&lt;/A&gt;, a.k.a. CURRENCY SIGN)&lt;/EM&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=1885864" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/michkap/archive/tags/Unicode_2F00_standards/default.aspx">Unicode/standards</category><category domain="http://blogs.msdn.com/michkap/archive/tags/Encoding_2F00_Codepages/default.aspx">Encoding/Codepages</category><category domain="http://blogs.msdn.com/michkap/archive/tags/Every+Character+Has+a+Story/default.aspx">Every Character Has a Story</category></item><item><title>Every character has a story #24: U+0308 (COMBINING DIAERESIS)</title><link>http://blogs.msdn.com/michkap/archive/2006/09/04/738263.aspx</link><pubDate>Mon, 04 Sep 2006 10:01:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:738263</guid><dc:creator>Michael S. Kaplan</dc:creator><slash:comments>24</slash:comments><comments>http://blogs.msdn.com/michkap/comments/738263.aspx</comments><wfw:commentRss>http://blogs.msdn.com/michkap/commentrss.aspx?PostID=738263</wfw:commentRss><wfw:comment>http://blogs.msdn.com/michkap/rsscomments.aspx?PostID=738263</wfw:comment><description>&lt;FONT face=Tahoma&gt;
&lt;P&gt;I am reminded of a scene from the 1991 film The Doctor starring William Hurt, modified here to be a bit more linguistic than medical: &lt;/P&gt;
&lt;BLOCKQUOTE dir=ltr style="MARGIN-RIGHT: 0px"&gt;&lt;FONT face="Times New Roman" size=2&gt;
&lt;P&gt;&lt;STRONG&gt;Linguist&lt;/STRONG&gt;: Nancy, are my repeated vowels pronounced differently?&lt;BR&gt;&lt;STRONG&gt;Nancy:&lt;/STRONG&gt; No, doctor. &lt;BR&gt;&lt;STRONG&gt;Linguist:&lt;/STRONG&gt; That's funny, I always trema when you're near. &lt;/P&gt;&lt;/FONT&gt;&lt;/BLOCKQUOTE&gt;
&lt;P dir=ltr&gt;There are essentially two&lt;FONT size=1&gt;&lt;SUP&gt;1&lt;/SUP&gt;&lt;/FONT&gt; different traditions for the meaning of two dots on top of a vowel: &lt;/P&gt;
&lt;P dir=ltr&gt;&lt;STRONG&gt;Umlaut&lt;/STRONG&gt; - Described &lt;A href="http://en.wikipedia.org/wiki/Germanic_umlaut"&gt;in Wikipedia&lt;/A&gt; as a "...modification of a vowel which causes it to be pronounced more similarly to a vowel or semivowel in a following syllable."&lt;/P&gt;
&lt;P dir=ltr&gt;&lt;STRONG&gt;Trema&lt;/STRONG&gt; or &lt;STRONG&gt;Diaeresis&lt;/STRONG&gt; - Described &lt;A href="http://en.wikipedia.org/wiki/Diaeresis"&gt;in Wikipedia&lt;/A&gt;&amp;nbsp;as the "...division of two adjacent vowels as two syllables rather than as a diphthong."&lt;/P&gt;
&lt;P dir=ltr&gt;Now in Unicode and ISO 10646,&amp;nbsp;these two very different diacritical purposes are unified under a single character -- the diaeresis. Which is kind of ironic given that the meaning of 'diaeresis' tends to suggest a division rather than any sort of unification....&lt;/P&gt;
&lt;P dir=ltr&gt;Ignoring that bit of irony in the naming decision, a unification does make sense since they really do look pretty much the same, and a disunification would be a huge target for spoofing (something we really do not need any more of, frankly!). Though to tell the truth, in quality typography the umlaut dots are usually a bit closer to the letter than the trema dots.&lt;/P&gt;
&lt;P dir=ltr&gt;Back in 2003, Deutsches Institut für Normung (DIN) sent a proposal to WG2 that stated: &lt;/P&gt;
&lt;BLOCKQUOTE dir=ltr style="MARGIN-RIGHT: 0px"&gt;&lt;FONT face="Times New Roman" size=2&gt;
&lt;P align=left&gt;Currently, a substantial amount of existing German data distinguishes between Umlaut and Trema. Both diacritics have a similar, but not necessarily identical representation, both have quite different properties e. g. with regards to sorting (cf. DIN 5007).&lt;BR&gt;&lt;BR&gt;In particular, German library data is currently stored according to ISO 5426 "Extension of the Latin alphabet coded character set for bibliographic information interchange" which distinguishes between the two diacritics Umlaut (4/9) and Trema (4/8). However, in ISO/IEC JTC1/SC2 N3125 "Finalized Mapping between Characters of ISO 5426 and ISO/IEC 10646-1 (UCS)" both are mapped to the same UCS character, U0308. There is thus no standardized way to ensure roundtrip compatibility between the two standards. &lt;BR&gt;&lt;BR&gt;For Germany and in particular for its national library (Deutsche Bibliothek) it is imperative for the integrity of German data that it be possible to maintain the distinction between Umlaut and Trema also in the UCS in a standardized way. Lack of ability to do so affects millions of bibliographic data records in the Deutsche Bibliothek alone (to be exact, 14 956 289 records as of October 2002) and about 110 million&amp;nbsp;bibliographic data records in German and Austrian regional library networks.&lt;/P&gt;&lt;/FONT&gt;&lt;/BLOCKQUOTE&gt;
&lt;P dir=ltr&gt;In other words, they had a need to distinguish these two diacritics, which are actually not unified in a different ISO standard. Their initial proposal from document N2593:&lt;/P&gt;
&lt;BLOCKQUOTE dir=ltr style="MARGIN-RIGHT: 0px"&gt;
&lt;P align=left&gt;&lt;FONT face="Times New Roman" size=2&gt;We therefore request&lt;BR&gt;&lt;BR&gt;a) the encoding of two new characters, LATIN VARIATION SELECTOR UMLAUT in position U0241 and LATIN VARIATION SELECTOR TREMA in position U0240 (the positions are suggestions only).&lt;BR&gt;&lt;BR&gt;&lt;/FONT&gt;&lt;FONT face=TimesNewRoman size=2&gt;b) the insertion of the following text into informative Annex F "Alternate format characters" as F.2.6 "Latin selectors" &lt;BR&gt;&lt;BR&gt;"LATIN VARIATION SELECTOR UMLAUT (U0241): Uniquely identifies the preceding character as using /being the Umlaut diacritic (cf. ISO 5426, code position 4/9) &lt;BR&gt;&lt;BR&gt;LATIN VARIATION SELECTOR TREMA (U0240): Uniquely identifies the preceding character as using / being the Trema diacritic (cf. ISO 5426, code position 4/8) &lt;BR&gt;&lt;BR&gt;In the absence of any variation selector, neither the character COMBINING DIAERESIS U0308 nor any of the Latin letters with diaeresis can be interpreted as representing uniquely the Umlaut or uniquely the Trema.&lt;BR&gt;&lt;BR&gt;The LATIN VARIATION SELECTOR UMLAUT or the LATIN VARIATION SELECTOR TREMA should only be used directly following the Latin characters shown below:&lt;BR&gt;&lt;BR&gt;00C4 LATIN CAPITAL LETTER A WITH DIAERESIS&lt;BR&gt;00D6 LATIN CAPITAL LETTER O WITH DIAERESIS&lt;BR&gt;00DC LATIN CAPITAL LETTER U WITH DIAERESIS&lt;BR&gt;00E4 LATIN SMALL LETTER A WITH DIAERESIS&lt;BR&gt;00F6 LATIN SMALL LETTER O WITH DIAERESIS&lt;BR&gt;00FC LATIN SMALL LETTER U WITH DIAERESIS&lt;BR&gt;&lt;BR&gt;U0308 COMBINING DIAERESIS&lt;BR&gt;&lt;BR&gt;Neither the LATIN VARIATION SELECTOR UMLAUT nor the LATIN VARIATION SELECTOR TREMA carry a defined meaning when they follow any other character ."&lt;BR&gt;&lt;BR&gt;c) Change in ISO/IEC JTC1 SC2 N3125 (= ISO/TC46/SC4 WG1), section 3 "Mapping of Characters" the table to: &lt;BR&gt;&lt;BR&gt;4/8 Trema, Diaeresis 0308 0240&lt;BR&gt;4/9 Umlaut 0308 0241Z&lt;/FONT&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;
&lt;P dir=ltr&gt;Unfortunately, Variation Selectors can only be used on base characters, not on combining characters. So while the scenario is valid, the solution DIN suggested is not. The UTC discussed possible solutions at length before producing the following recommendation instead:&lt;/P&gt;
&lt;BLOCKQUOTE dir=ltr style="MARGIN-RIGHT: 0px"&gt;&lt;FONT face=Times-Roman&gt;
&lt;P align=left&gt;&lt;FONT size=2&gt;While recognizing the drawbacks to all of the alternatives to encoding a new &lt;/FONT&gt;&lt;/FONT&gt;&lt;FONT size=2&gt;&lt;FONT face=Times-Roman&gt;COMBINING UMLAUT &lt;/FONT&gt;&lt;FONT face=Times-Roman&gt;character outlined in WG2 N2766, we believe that there is a workable alternative solution which has, to date, been overlooked. The solution consists, essentially, of using U+034F &lt;/FONT&gt;&lt;FONT face=Times-Roman&gt;COMBINING GRAPHEME JOINER &lt;/FONT&gt;&lt;FONT face=Times-Roman&gt;(&lt;/FONT&gt;&lt;FONT face=Times-Roman&gt;CGJ&lt;/FONT&gt;&lt;FONT face=Times-Roman&gt;), in its intended semantics in 10646/Unicode, to make the relevant sorting, searching,&amp;nbsp;and data mapping distinctions required for umlaut &lt;/FONT&gt;&lt;I&gt;&lt;FONT face=Times-Italic&gt;versus &lt;/I&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;FONT size=2&gt;&lt;FONT face=Times-Roman&gt;tréma. In particular, the distinction we propose is:&lt;BR&gt;&lt;BR&gt;U+0308 &lt;/FONT&gt;&lt;FONT face=EversonMonoArrows&gt;→ &lt;/FONT&gt;&lt;/FONT&gt;&lt;FONT size=2&gt;&lt;FONT face=Times-Roman&gt;umlaut&lt;BR&gt;&amp;lt;&lt;/FONT&gt;&lt;FONT face=Times-Roman&gt;CGJ&lt;/FONT&gt;&lt;FONT face=Times-Roman&gt; U+0308&lt;/FONT&gt;&lt;FONT face=Times-Roman&gt;&amp;gt; &lt;/FONT&gt;&lt;FONT face=EversonMonoArrows&gt;→ &lt;/FONT&gt;&lt;/FONT&gt;&lt;FONT size=2&gt;&lt;FONT face=Times-Roman&gt;tréma&lt;BR&gt;&amp;lt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT face=Times-Bold&gt;a&lt;/B&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;FONT size=2&gt;&lt;FONT face=Times-Roman&gt; U+0308&lt;/FONT&gt;&lt;FONT face=Times-Roman&gt;&amp;gt; &lt;/FONT&gt;&lt;FONT face=EversonMonoArrows&gt;→ &lt;/FONT&gt;&lt;/FONT&gt;&lt;FONT size=2&gt;&lt;FONT face=Times-Roman&gt;a umlaut&lt;BR&gt;&amp;lt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT face=Times-Bold&gt;a &lt;/B&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;FONT size=2&gt;&lt;FONT face=Times-Roman&gt;CGJ U+0308&lt;/FONT&gt;&lt;FONT face=Times-Roman&gt;&amp;gt; &lt;/FONT&gt;&lt;FONT 
face=EversonMonoArrows&gt;→ &lt;/FONT&gt;&lt;/FONT&gt;&lt;FONT size=2&gt;&lt;FONT face=Times-Roman&gt;a tréma&lt;BR&gt;&lt;BR&gt;The sequences &amp;lt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT face=Times-Bold&gt;a&lt;/B&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;FONT size=2&gt;&lt;FONT face=Times-Roman&gt; U+0308&lt;/FONT&gt;&lt;FONT face=Times-Roman&gt;&amp;gt; and &amp;lt;&lt;/FONT&gt;&lt;FONT face=Times-Bold&gt;&lt;STRONG&gt;a&lt;/STRONG&gt; CGJ U+0308&lt;/FONT&gt;&lt;FONT face=Times-Roman&gt;&amp;gt; are not canonically equivalent. this means that the distinction will not be normalized away on conversion in and out of bibliographic systems. This eases the interoperability problem. Both sequences will &lt;/FONT&gt;&lt;I&gt;&lt;FONT face=Times-Italic&gt;display &lt;/I&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;FONT size=2&gt;&lt;FONT face=Times-Roman&gt;as &lt;/FONT&gt;&lt;B&gt;&lt;FONT face=Times-Bold&gt;ä&lt;/B&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;FONT size=2&gt;&lt;FONT face=Times-Roman&gt;, as they should. Furthermore, the semantics of &lt;/FONT&gt;&lt;FONT face=Times-Roman&gt;CGJ &lt;/FONT&gt;&lt;/FONT&gt;&lt;FONT size=2&gt;&lt;FONT face=Times-Roman&gt;are such that it should impact only searching and sorting, for systems which have been tailored to distinguish it, while being ignored in other respects in interpretation.&lt;BR&gt;&lt;BR&gt;The reason for treating the existing sequence &amp;lt;&lt;/FONT&gt;&lt;FONT face=Times-Bold&gt;&lt;STRONG&gt;a&lt;/STRONG&gt; U+0308&lt;/FONT&gt;&lt;FONT face=Times-Roman&gt;&amp;gt; as representing the &lt;/FONT&gt;&lt;I&gt;&lt;FONT face=Times-Italic&gt;umlaut &lt;/I&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;FONT size=2&gt;&lt;FONT face=Times-Roman&gt;in German bibliographic systems, despite the name of U+0308 &lt;/FONT&gt;&lt;FONT face=Times-Roman&gt;COMBINING DIAERESIS&lt;/FONT&gt;&lt;FONT face=Times-Roman&gt;, is that this is the unmarked case, representing the vast majority of extant data. 
The marked form &amp;lt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT face=Times-Bold&gt;a&lt;/B&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;FONT size=2&gt;&lt;FONT face=Times-Roman&gt; &lt;/FONT&gt;&lt;FONT face=Times-Roman&gt;CGJ&lt;/FONT&gt;&lt;FONT face=Times-Roman&gt; &lt;/FONT&gt;&lt;FONT face=EversonMonoBlocks&gt;U+0308&lt;/FONT&gt;&lt;FONT face=Times-Roman&gt;&amp;gt;&amp;nbsp;should be utilized for the marked case in the data, namely the &lt;/FONT&gt;&lt;I&gt;&lt;FONT face=Times-Italic&gt;tréma&lt;/I&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;FONT size=2&gt;&lt;FONT face=Times-Roman&gt;, which is far, far less frequent in German bibliographic data. This minimizes the conversion and data recti&lt;/FONT&gt;&lt;FONT face=Times-Roman&gt;fi&lt;/FONT&gt;&lt;FONT face=Times-Roman&gt;cation issues, and also guarantees that representations including &lt;/FONT&gt;&lt;FONT face=Times-Roman&gt;CGJ &lt;/FONT&gt;&lt;/FONT&gt;&lt;FONT size=2&gt;&lt;FONT face=Times-Roman&gt;will be uncommon in data converted out of the German bibliographic records.&lt;BR&gt;&lt;BR&gt;The existence of separate representations for umlaut and for tréma, which are not canonically equivalent (and thus not neutralized by normalization processes in the data) enables German implementations which need to distinguish the two for searching and sorting, to systematically maintain weighting distinctions to do the right thing. 
&amp;lt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT face=Times-Bold&gt;a &lt;/B&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;FONT size=2&gt;&lt;FONT face=Times-Roman&gt;U+0308&lt;/FONT&gt;&lt;FONT face=Times-Roman&gt;&amp;gt; = &amp;lt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT face=Times-Bold&gt;ä&lt;/B&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;FONT size=2&gt;&lt;FONT face=Times-Roman&gt;&amp;gt; can be treated as equivalent to &amp;lt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT face=Times-Bold&gt;a&lt;/B&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;FONT size=2&gt;&lt;FONT face=Times-Roman&gt;, &lt;/FONT&gt;&lt;B&gt;&lt;FONT face=Times-Bold&gt;e&lt;/B&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;FONT size=2&gt;&lt;FONT face=Times-Roman&gt;&amp;gt; for sorting purposes, while the tréma &amp;lt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT face=Times-Bold&gt;a&lt;/B&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;FONT size=2&gt;&lt;FONT face=Times-Roman&gt; &lt;/FONT&gt;&lt;FONT face=Times-Roman&gt;CGJ&lt;/FONT&gt;&lt;FONT face=Times-Roman&gt; &lt;/FONT&gt;&lt;FONT face=EversonMonoBlocks&gt;U+0308&lt;/FONT&gt;&lt;FONT face=Times-Roman&gt;&amp;gt; can be weighted as a secondary variant of &amp;lt;&lt;/FONT&gt;&lt;B&gt;&lt;FONT face=Times-Bold&gt;a&lt;/B&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;FONT size=2&gt;&lt;FONT face=Times-Roman&gt;&amp;gt; thus resulting in the desired behavior for such systems. 
&lt;/FONT&gt;&lt;I&gt;&lt;FONT face=Times-Italic&gt;Existing &lt;/I&gt;&lt;/FONT&gt;&lt;/FONT&gt;&lt;FONT size=2&gt;&lt;FONT face=Times-Roman&gt;collations which do not distinguish tréma and umlaut in German data will continue to work exactly as they&amp;nbsp; currently do, since in default collation tables &lt;/FONT&gt;&lt;FONT face=Times-Roman&gt;CGJ &lt;/FONT&gt;&lt;/FONT&gt;&lt;FONT face=Times-Roman&gt;&lt;FONT size=2&gt;is ignored in weighting.&lt;BR&gt;&lt;BR&gt;We believe that this proposed solution has the correct mix of technical attributes to enable the German library networks to make the required distinction, to correctly convert existing ISO 5426 bibliographic records, and to implement the desired sorting and searching behavior for German data represented directly in 10646/Unicode.&lt;BR&gt;&lt;BR&gt;At the same time, this solution does not introduce incompatibilities or non-interoperability issues for other existing implementations of 10646/Unicode which handle German data.&lt;/FONT&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;/BLOCKQUOTE&gt;
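&lt;P dir=ltr&gt;The "not canonically equivalent" claim is easy to try out. What follows is only a sketch, assuming Python's &lt;EM&gt;unicodedata&lt;/EM&gt; module as the normalizer; the &lt;EM&gt;strip_cgj&lt;/EM&gt; helper is a hypothetical stand-in for a collation that ignores CGJ, not anything the UTC document defines:&lt;/P&gt;
&lt;PRE&gt;
```python
import unicodedata

CGJ = "\u034f"        # COMBINING GRAPHEME JOINER
DIAERESIS = "\u0308"  # COMBINING DIAERESIS

umlaut = "a" + DIAERESIS       # unmarked case in the proposal: a umlaut
trema = "a" + CGJ + DIAERESIS  # marked case in the proposal: a trema

# NFC composes the unmarked sequence into the precomposed letter U+00E4.
umlaut_nfc = unicodedata.normalize("NFC", umlaut)

# CGJ has canonical combining class 0 and sits between the base letter
# and the diaeresis, so normalization leaves the marked sequence alone.
trema_nfc = unicodedata.normalize("NFC", trema)
trema_nfd = unicodedata.normalize("NFD", trema)


def strip_cgj(text):
    """Hypothetical helper: treat CGJ as ignorable, then normalize."""
    return unicodedata.normalize("NFC", text.replace(CGJ, ""))


print(umlaut_nfc == "\u00e4", trema_nfc == trema, strip_cgj(trema) == umlaut_nfc)
```
&lt;/PRE&gt;
&lt;P dir=ltr&gt;A system that cares about the distinction keeps the CGJ in the data; one that does not can drop it and gets the ordinary precomposed letter back.&lt;/P&gt;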
&lt;P dir=ltr&gt;It is&amp;nbsp;again ironic&amp;nbsp;that in the (rare)&amp;nbsp;situation where an attempt to distinguish them is required, the default case is suggested as being the umlaut while the exceptional case is the diaeresis. :-)&lt;/P&gt;
&lt;P dir=ltr&gt;The use of a combining diacritic is still (to this day) controversial in Unicode among people unfamiliar with the standard who are native speakers of languages like Swedish or Finnish and who&amp;nbsp;are asked to think of these standalone letters as equivalent to a different letter plus a diacritic. The many people who would prefer all of the Indic languages to separately encode all the instances of base letter plus virama have an analogous complaint.&lt;/P&gt;
&lt;P dir=ltr&gt;&amp;nbsp;&lt;/P&gt;
&lt;P dir=ltr&gt;&lt;FONT size=1&gt;1 - At this point I will take judicial notice of the phenomenon known as the &lt;STRONG&gt;Heavy Metal Umlaut&lt;/STRONG&gt;, described ad nauseam &lt;/FONT&gt;&lt;A href="http://en.wikipedia.org/wiki/Heavy_metal_umlaut"&gt;&lt;FONT size=1&gt;here&lt;/FONT&gt;&lt;/A&gt;&lt;FONT size=1&gt;. It is in fact the Heavy Metal Umlaut that inspired Cathy's desire for a bumper sticker that would say &lt;STRONG&gt;Stop Indiscriminate Umlauting!&lt;/STRONG&gt;, although I find that approach to be a tad reactionary. The importance to our culture of Spin̈al Tap and Blue Öyster Cult is undeniable, as is the need to avoid fear of the reaper and to turn the volume up to 11.....&lt;/FONT&gt;&lt;/P&gt;
&lt;P dir=ltr&gt;&amp;nbsp;&lt;/P&gt;
&lt;P dir=ltr&gt;&lt;FONT color=#ff1493&gt;&lt;EM&gt;This post brought to you by&lt;/EM&gt;&amp;nbsp;&lt;FONT size=6&gt;&amp;nbsp;̈&lt;/FONT&gt; &lt;EM&gt;(&lt;A href="http://www.fileformat.info/info/unicode/char/0308"&gt;U+0308&lt;/A&gt;, a.k.a. COMBINING DIAERESIS)&lt;/EM&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=738263" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/michkap/archive/tags/Linguistic/default.aspx">Linguistic</category><category domain="http://blogs.msdn.com/michkap/archive/tags/Every+Character+Has+a+Story/default.aspx">Every Character Has a Story</category></item><item><title>Every character has a story #23: U+00ad (SOFT HYPHEN)</title><link>http://blogs.msdn.com/michkap/archive/2006/09/02/736881.aspx</link><pubDate>Sat, 02 Sep 2006 20:34:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:736881</guid><dc:creator>Michael S. Kaplan</dc:creator><slash:comments>6</slash:comments><comments>http://blogs.msdn.com/michkap/comments/736881.aspx</comments><wfw:commentRss>http://blogs.msdn.com/michkap/commentrss.aspx?PostID=736881</wfw:commentRss><wfw:comment>http://blogs.msdn.com/michkap/rsscomments.aspx?PostID=736881</wfw:comment><description>&lt;P&gt;&lt;EM&gt;Last night I was upon the stair&lt;BR&gt;A little hyphen that wasn't there.&lt;BR&gt;It wasn't there again today;&lt;BR&gt;Oh how I wish he'd go away!&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;The SOFT HYPHEN has a long if not entirely distinguished history.&lt;/P&gt;
&lt;P&gt;It starts back in &lt;A href="http://www.microsoft.com/globaldev/reference/iso/28591.mspx"&gt;ISO 8859-1&lt;/A&gt;, which puts it at 0xAD, and in a rare exception to the usual practice of not explaining semantics of the encoded characters, it spends a bit of time talking about the soft hyphen, saying:&lt;/P&gt;
&lt;BLOCKQUOTE dir=ltr style="MARGIN-RIGHT: 0px"&gt;
&lt;P&gt;&lt;FONT face="Times New Roman" size=2&gt;A graphic character that is imaged by a graphic symbol identical with, or similar to, that representing hyphen, for use when a line break has been established within a word. &lt;/FONT&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;As you can see, we are already in trouble here. It is a graphic character with a defined visible glyph that is usually invisible and which affects line breaking, a formatting operation.&lt;/P&gt;
&lt;P&gt;And of course beyond the sloppiness in the definition there is the fact that it is usually unreasonable to assume that a person would type in this character explicitly. Clearly it is a better answer to have per-language dictionaries that contain hyphenation rules, as the SOFT HYPHEN "do it yourself" principles are simply not going to work in practice.&lt;/P&gt;
&lt;P&gt;The &lt;A href="http://www.w3.org/TR/html4/"&gt;HTML 4.0 spec&lt;/A&gt; has its own content on the soft hyphen. In section &lt;A href="http://www.w3.org/TR/html4/struct/text.html"&gt;9.3.3. (Hyphenation)&lt;/A&gt; the following text is provided:&lt;/P&gt;
&lt;BLOCKQUOTE dir=ltr style="MARGIN-RIGHT: 0px"&gt;
&lt;P&gt;&lt;FONT face="Times New Roman" size=2&gt;In HTML, there are two types of hyphens: the plain hyphen and the soft hyphen. The plain hyphen should be interpreted by a user agent as just another character. The &lt;SPAN class=index-def title="soft hyphen"&gt;&lt;A name=didx-soft_hyphen&gt;soft hyphen&lt;/A&gt;&lt;/SPAN&gt; tells the user agent where a line break can occur.&lt;BR&gt;&lt;BR&gt;Those browsers that interpret soft hyphens must observe the following semantics: If a line is broken at a soft hyphen, a hyphen character must be displayed at the end of the first line. If a line is not broken at a soft hyphen, the user agent must not display a hyphen character. For operations such as searching and sorting, the soft hyphen should always be ignored.&lt;BR&gt;&lt;BR&gt;In HTML, the plain hyphen is represented by the "-" character (&amp;amp;#45; or &amp;amp;#x2D;). The soft hyphen is represented by the character entity reference &amp;amp;shy; (&amp;amp;#173; or &amp;amp;#xAD;)&lt;/FONT&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;
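&lt;P&gt;To make those semantics concrete, here is a toy Python sketch of what the quoted text asks for. It is nothing like a real browser line breaker: both function names are made up for illustration, and the "line" is just a character count:&lt;/P&gt;
&lt;PRE&gt;
```python
SOFT_HYPHEN = "\u00ad"


def visible_text(text):
    """Text as displayed (or searched) when no line breaks at a soft hyphen."""
    return text.replace(SOFT_HYPHEN, "")


def break_at_soft_hyphen(text, width):
    """Return display lines for text, breaking at most once at a soft hyphen.

    A hyphen is shown only at the end of the first line of an actual break.
    """
    if width >= len(visible_text(text)):
        return [visible_text(text)]
    # Scan break points right to left so the first line is as full as possible.
    for index in range(len(text) - 1, -1, -1):
        if text[index] != SOFT_HYPHEN:
            continue
        first = visible_text(text[:index]) + "-"
        if width >= len(first):
            return [first, visible_text(text[index + 1:])]
    # No usable break point: show the word unbroken, with no hyphen.
    return [visible_text(text)]


word = "hyphen" + SOFT_HYPHEN + "ation"
print(break_at_soft_hyphen(word, 20))
print(break_at_soft_hyphen(word, 8))
```
&lt;/PRE&gt;
&lt;P&gt;The same &lt;EM&gt;visible_text&lt;/EM&gt; folding is what you would apply before searching or sorting, since the spec says the soft hyphen should always be ignored there.&lt;/P&gt;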
&lt;P&gt;Ok, so once again we have a graphic character that is actually a formatting character tied up with line breaking. And the text seems pretty ambivalent about how a browser might be expected to interpret the soft hyphen -- clearly it is not some terrible sin if it does not do special line breaking behavior.&lt;/P&gt;
&lt;P&gt;Doesn't &lt;STRONG&gt;&amp;amp;shy;&lt;/STRONG&gt; sound like the perfect character entity reference for a character that may or may not be visible and which, even if visible, should be ignored for searching and sorting? The little bugger even sounds shy!&lt;/P&gt;
&lt;P&gt;Now if you look at the ECMA 94 standard, available online for free &lt;A href="http://www.ecma-international.org/publications/standards/Ecma-094.htm"&gt;here&lt;/A&gt;, it does have a wording that is almost the same as 8859-1's, and the difference may be striking to some but it did not impact me as much. Perhaps it is edging more towards the formatting role of the character....&lt;/P&gt;
&lt;P&gt;At this point, before I jump into Unicode, I'll mention that the other day &lt;A href="http://tihiy.ahanix.org/"&gt;Tihiy&lt;/A&gt; asked in the Suggestion Box: &lt;/P&gt;
&lt;BLOCKQUOTE dir=ltr style="MARGIN-RIGHT: 0px"&gt;
&lt;P&gt;&lt;FONT face="Times New Roman" size=2&gt;Can you explain why Charmap refuses to display characters with 0xAD code? &lt;/FONT&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;Well, given the rules surrounding SOFT HYPHEN and the fact that it is impossible for any character to break a line when it is displayed in the single-line text control in the&amp;nbsp;Windows Character Map, it is obvious why the simple program that builds the grid can display it even though it will not appear in the textbox below:&lt;IMG height=545 src="http://trigeminal.com/images/shy.png" width=485&gt;&lt;/P&gt;
&lt;P&gt;Of course this does not mean that it isn't there -- the SOFT HYPHEN, if included in the text stream, will be there even if it is usually invisible and ignored.&lt;/P&gt;
&lt;P&gt;Now I say &lt;STRONG&gt;usually&lt;/STRONG&gt; because in operations like collation on Windows, the SOFT HYPHEN will be ignored (it is given no weight) but it will also break compressions. In practice this should not matter since one should never break a word in the middle of a compression, and in fact this trick could be used to force collation&amp;nbsp;to work right in cases like &lt;STRONG&gt;&lt;A href="http://blogs.msdn.com/michkap/archive/2005/11/26/495072.aspx"&gt;this one in Hungarian&lt;/A&gt;&lt;/STRONG&gt;, though I'd recommend against it since you would probably not want to break a line in the middle of a word even in that case....&lt;/P&gt;
&lt;P&gt;So, what does Unicode say?&lt;/P&gt;
&lt;P&gt;In 2.0, it said:&lt;/P&gt;
&lt;BLOCKQUOTE dir=ltr style="MARGIN-RIGHT: 0px"&gt;
&lt;P&gt;&lt;FONT face="Times New Roman" size=2&gt;U+00AD &lt;SPAN class=charname&gt;soft hyphen&lt;/SPAN&gt; indicates a hyphenation point, where a line-break is preferred when a word is to be hyphenated. Depending on the script, the visible rendering of this character when a line break occurs may differ (for example, in some scripts it is rendered as a hyphen -, while in others it may be invisible).&lt;/FONT&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;In other places, the soft hyphen is described as a "discretionary hyphen", which clearly suggests the formatting&amp;nbsp;role as well. It is becoming less and less of a graphic character all the time!&lt;/P&gt;
&lt;P&gt;Unicode 4.0, after extensive discussion and review, made the switch for good, as called out in the following two bullet points from the "changes for 4.0" &lt;A href="http://www.unicode.org/versions/Unicode4.0.0/"&gt;text&lt;/A&gt;:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;FONT face="Times New Roman" size=2&gt;&lt;B&gt;Default Ignorables. &lt;/B&gt;Added Hangul Filler characters, &lt;FONT color=#ff0000&gt;U+00AD &lt;SPAN style="FONT-VARIANT: small-caps"&gt;soft hyphen&lt;/SPAN&gt;&lt;/FONT&gt;, CGJ,&amp;nbsp; and ZWS &lt;/FONT&gt;
&lt;LI&gt;&lt;FONT face="Times New Roman" size=2&gt;&lt;FONT color=#ff0000&gt;&lt;B&gt;Soft Hyphen. &lt;/B&gt;U+00AD &lt;SPAN style="FONT-VARIANT: small-caps"&gt;soft hyphen&lt;/SPAN&gt; was also changed to General Category Cf. Its semantics were clarified: it marks a position for hyphenation, rather than being itself a hyphen character. (The Hyphen property itself was stabilized, and thus not changed to reflect this.) &lt;/FONT&gt;&lt;/FONT&gt;&lt;/LI&gt;&lt;/UL&gt;
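&lt;P&gt;If you want to see the post-4.0 categorization for yourself, here is a minimal sketch using Python's standard unicodedata module. The helper name describe is just something made up for this illustration, and the values reported are whatever Unicode version your Python build ships with.&lt;/P&gt;

```python
import unicodedata

SOFT_HYPHEN = "\u00ad"

def describe(ch):
    """Return (code point label, character name, General Category) for one character."""
    return ("U+%04X" % ord(ch), unicodedata.name(ch), unicodedata.category(ch))

# SOFT HYPHEN is reported as a format character (Cf) ...
print(describe(SOFT_HYPHEN))
# ... while the ordinary HYPHEN-MINUS is dash punctuation (Pd).
print(describe("-"))
```

&lt;P&gt;Note that unicodedata does not expose Default_Ignorable_Code_Point, so the first bullet cannot be checked this way.&lt;/P&gt;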
&lt;P&gt;And the text in &lt;A href="http://www.unicode.org/unicode/reports/tr14/"&gt;UAX#14: Line Breaking Properties&lt;/A&gt; points out yet another issue that people may not have considered, buried in section &lt;A href="http://www.unicode.org/unicode/reports/tr14/#SoftHyphen"&gt;5.3 Use of Soft Hyphen&lt;/A&gt;:&lt;/P&gt;
&lt;BLOCKQUOTE dir=ltr style="MARGIN-RIGHT: 0px"&gt;
&lt;P&gt;&lt;FONT face="Times New Roman" size=2&gt;The action of a hyphenation algorithm is equivalent to the insertion of a SHY. However, when a word contains an explicit SHY it is customarily treated as overriding the action of the hyphenator for that word.&lt;/FONT&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;Every time I think about the issue, I am reminded of &lt;STRONG&gt;&lt;A href="http://blogs.msdn.com/michkap/archive/2005/05/11/416624.aspx"&gt;this case&lt;/A&gt;&lt;/STRONG&gt;, where an attempt to optimize actually&amp;nbsp;inhibited other optimization attempts. Yet another reason to avoid the soft hyphen? :-)&lt;/P&gt;
&lt;P&gt;There is other trivia, like it is removed by nameprep for IDN, Apple's ATSUI does not support it in version 1.1 and later, and Microsoft Typography &lt;A href="http://www.microsoft.com/typography/developers/fdsspec/punc2.htm"&gt;talks about it a bit&lt;/A&gt; as well.&lt;/P&gt;
&lt;P&gt;Now after the 4.0 change, &lt;FONT color=#800080&gt;Markus Kuhn&lt;/FONT&gt; wrote a strongly worded dissenting opinion on the change, which &lt;FONT color=#006400&gt;Ken Whistler&lt;/FONT&gt; responded to:&lt;/P&gt;
&lt;BLOCKQUOTE dir=ltr style="MARGIN-RIGHT: 0px"&gt;
&lt;P&gt;&lt;FONT color=#800080 size=2&gt;&lt;STRONG&gt;I believe the recent "clarification" of the semantics of the SOFT HYPHEN&amp;nbsp;(U+00AD) character in Unicode 4.0 had an unfortunate outcome. In&amp;nbsp;particular, changing its class from Pd to Cf in UnicodeData.txt breaks&amp;nbsp;backwards compatibility with how this character was widely used in ISO&amp;nbsp;8859-1 terminals for the past 15 years and causes now headaches with the&amp;nbsp;designers of VT100-style terminal emulators with ISO 8859-1 and UTF-8&amp;nbsp;support.&lt;/STRONG&gt;&lt;/FONT&gt;&amp;nbsp;&lt;BR&gt;&lt;BR&gt;&lt;FONT color=#006400 size=2&gt;This may well be the case. I don't have any particular iron in this fire, since I was neither in the camp advocating for this change nor was I particularly set up to argue against making the change.&lt;BR&gt;&lt;BR&gt;But the fact that this issue came up, was argued at length, was put up as a public issue for an extended period of time, and then argued some more before it was decided, indicates to me that the status of U+00AD SOFT HYPHEN as a gc=Pd character was causing other people headaches as it stood.&lt;/FONT&gt;&lt;BR&gt;&lt;BR&gt;&lt;STRONG&gt;&lt;FONT color=#800080 size=2&gt;As Unicode claims for U+0000 to U+00FF to be compatible with ISO 8859-1,&amp;nbsp;it should also respect the intended and de-facto use of ISO 8859-1 characters and should not change their semantics over a decade later.&lt;/FONT&gt;&lt;BR&gt;&lt;/STRONG&gt;&lt;BR&gt;&lt;FONT color=#006400 size=2&gt;The establishment of Unicode character properties for Unicode characters does not, ex post facto, change the semantics *of* ISO 8859-1. 
If that were the case, then any number of character property assignments (including compatibility and canonical decomposition mappings), and character property assignment *changes*, such as those for U+00B7 MIDDLE DOT, could be equally attacked as ex post facto changes to ISO 8859-1.&lt;BR&gt;&lt;BR&gt;But the additional character behavior specified by the Unicode Standard does not impose constraints back onto standards that those characters map to -- including ASCII and ISO 8859-1. Nothing that the Unicode Standard says about *Unicode* characters can suddenly make a conformant ISO 8859-1 implementation nonconformant in the way it handles characters.&lt;BR&gt;&lt;BR&gt;The issue, instead, is interoperability for implementations of Unicode that map back and forth to implementations of 8-bit character encodings (or others), including ISO 8859-1. And I suspect, in the case of SOFT HYPHEN, that the problem we are facing is really that SOFT HYPHEN has had a long history of legacy implementations in two (or more) incompatible ways.&lt;BR&gt;&lt;BR&gt;Certainly the terminal display protocols that insert line-ending SOFT HYPHENS as graphic characters which can be stripped back out when presentation text is restored to content text has a long history. But the other model also way predates the examples you cited, going at least as far back as WordStar's internal use of nondisplaying soft hyphen characters as line break opportunities that only displayed visibly (with a hyphen) at actual line breaks. For WordStar it was 0x1E for 'inactive soft hyphen', which was an inserted line break opportunity for word-wrap, and 0x1F for 'active soft hyphen', which was an actually broken word for word-wrap, displayed (and printed) visibly. 
(WordStar *predates* ISO 8859-1, by the way, since it was first released in 1979.)&lt;/FONT&gt;&lt;BR&gt;&lt;BR&gt;&lt;FONT color=#800080 size=2&gt;&lt;STRONG&gt;As discussed in detail for example on&lt;BR&gt;&lt;BR&gt;&amp;nbsp; &lt;/STRONG&gt;&lt;/FONT&gt;&lt;A href="http://www.cs.tut.fi/~jkorpela/shy.html"&gt;&lt;FONT color=#800080 size=2&gt;&lt;STRONG&gt;http://www.cs.tut.fi/~jkorpela/shy.html&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/A&gt;&lt;BR&gt;&lt;BR&gt;&lt;STRONG&gt;&lt;FONT color=#800080 size=2&gt;the ISO 8859-1 standard defines, in section 6.3.3 the SOFT HYPHEN as&amp;nbsp;"[a] graphic character that is imaged by a graphic symbol identical &amp;gt; with, or similar to, that representing hyphen".&lt;BR&gt;&lt;BR&gt;The ISO 8859-1 standard uses unfortunately only the rather unclear words&amp;nbsp;"for use when a line break has been established within a word" as the&amp;nbsp;complete definition of the intended usage of this character. This&amp;nbsp;clearly falls short completely of setting up a document processing model&amp;nbsp;and defining unambiguously what role SOFT HYPHEN plays it its various&amp;nbsp;phases and functions. &lt;/FONT&gt;&lt;BR&gt;&lt;/STRONG&gt;&lt;BR&gt;&lt;FONT color=#006400 size=2&gt;Yep. And that has contributed to the confusion for years. It didn't help that 8859-1 didn't image 0xAD SOFT HYPHEN with a hyphen glyph in the chart, but instead with a "SHY" acronym, implying that it was, in fact, a "funny" character that might not always display visibly. That, plus the less than clear wording in the note on SOFT HYPHEN (now in Clause 5.3.3 in 8859-1) was symptomatic of the aversion of SC2 standards to define "character processing" behavior, but also reflected, I suspect, a deliberate willingness to allow for inconsistent processing models. 
It isn't much of a stretch to interpret the wording in Clause 5.3.3 as:&lt;BR&gt;&lt;BR&gt;&amp;nbsp;&amp;nbsp; "A graphic character that [when imaged] is imaged by a&amp;nbsp;graphic symbol identical with, or similar to, that&amp;nbsp;representing HYPHEN, ..."&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;BR&gt;which opens the door to the Word/WordPerfect etc. style interpretation.&lt;/FONT&gt;&lt;BR&gt;&lt;BR&gt;&lt;STRONG&gt;&lt;FONT color=#800080 size=2&gt;The definition "graphic character that is imaged by a graphic symbol&amp;nbsp;identical with, or similar to, that representing hyphen" made it clear to users familiar with the above mentioned problem that the SOFT HYPHEN is just an alternative of the normal graphical character HYPHEN, for use when a hyphen is inserted by a line formatting routine.&lt;/FONT&gt;&lt;BR&gt;&lt;/STRONG&gt;&lt;BR&gt;&lt;FONT color=#006400 size=2&gt;I don't think it was quite so clear as that.&lt;BR&gt;&lt;BR&gt;[ snip HTML discussion ]&lt;BR&gt;&lt;/FONT&gt;&lt;BR&gt;&lt;STRONG&gt;&lt;FONT color=#800080 size=2&gt;This HTML 4 reinterpretation is essentially the semantics that Unicode then adopted as well.&lt;BR&gt;&amp;nbsp;&lt;BR&gt;Nevertheless, there is a vast number of VT100 terminal emulators, printers, and similar 8-bit output devices out there that treat the SOFT HYPHEN as a full graphical character, as had been suggested by ISO 8859-1 &lt;/FONT&gt;&lt;BR&gt;&lt;/STRONG&gt;&lt;BR&gt;&lt;FONT color=#006400 size=2&gt;The problem with this is that is assumes that "graphical character" is well-defined and never involves ambiguities of display for SC2 standards. 
&lt;BR&gt;&lt;BR&gt;If you look at ISO 8859-8 (Hebrew), when it was revised to make the use of LEFT-TO-RIGHT MARK and RIGHT-TO-LEFT MARK part of the standard, so that implicit order bidi with 8859-8 was well-defined, those characters were *also* described as "graphic characters", cloning the wording right out of the longstanding and traditional, if somewhat bizarre wording used to describe the SPACE character. For LEFT-TO-RIGHT MARK:&lt;BR&gt;&lt;BR&gt;"A graphic character the visual representation of which consists of the absence of a graphic symbol, which acts like a left-to-right character in a bidirectional context..."&lt;BR&gt;&lt;BR&gt;If, for an 8859 standard, a character which *never* has a visible display glyph (except for charts or "Show Hidden" contexts) can be considered to be a "graphic character", you can see why the situation for SOFT HYPHEN can be considered less than&lt;BR&gt;clear.&lt;/FONT&gt;&lt;BR&gt;&lt;BR&gt;&lt;STRONG&gt;&lt;FONT color=#800080 size=2&gt;and by the old application need to distinguish between content and hyphenation hyphens in formatted presentation data streams.&lt;BR&gt;&lt;BR&gt;It is used today by a number of UTF-8 terminal applications to decide, by how many character cell positions the cursor will advance if the Unicode character provided as an argument is sent to the terminal. The rules for generating its semantics from Unicode tables are very simple and include the rule&lt;BR&gt;&lt;BR&gt;&amp;nbsp; - Other format characters (general category code Cf in the Unicode database) and ZERO WIDTH SPACE (U+200B) have a column width of 0.&lt;BR&gt;&lt;BR&gt;With the change of SOFT HYPHEN from general category code Pd to Cf in the Unicode 4.0 database, this causes now terminal behaviour to change from wcwidth(0x00ad) = 1 to wcwidth(0x00ad) = 0. 
In other words, what used to be a spacing graphical character in accordance with ISO 8859-1 that always advances the cursor by one cell after printing the glyph of a hyphen is not an ignoreable and usually invisible format character.&lt;/FONT&gt;&lt;BR&gt;&lt;/STRONG&gt;&lt;BR&gt;&lt;FONT color=#006400 size=2&gt;It seems like the obvious fix here is to exempt U+00AD from the generic class treatment of Cf characters, both in the stated documentation and in the implementation. In other words, for the purposes of those UTF-8 terminal applications, SOFT HYPHEN is not an "other format character", but is an exception that should go on behaving exactly as you have it currently defined.&lt;BR&gt;&lt;BR&gt;By the way, implementers cannot, now, assume that gc=Cf characters (format controls) should *always* be invisibly displayed. The addition of the various Arabic prepositive numeric accumulators (U+0600 ARABIC NUMBER SIGN and the like) have added a subclass of format controls which *do* have visible display glyphs. And there is now a separate Unicode character property, Default_Ignorable_Code_Point, which should also be taken into account when deciding whether a&amp;nbsp;particular character, by default, should be displayed with a zero glyph or a black box glyph, for example, if uninterpreted.&lt;/FONT&gt;&lt;BR&gt;&lt;BR&gt;&lt;STRONG&gt;&lt;FONT color=#800080 size=2&gt;In this sense, Unicode 4.0 breaks with the well-established tradition of interpreting the SOFT HYPHEN as a graphical character in output devices.&lt;BR&gt;&lt;BR&gt;It would have been nice, if Unicode hadn't done that. Unicode could instead have chosen to add a new ignorable formatting character for&amp;nbsp;marking possible hyphenation points in documents, which could be called for instance HYPHENATION POINT. 
A formatting function can then either discard a HYPHENATION POINT (if it ended up inside a formatted line), or convert it into the graphical SOFT HYPHEN character, where the hyphenation point ended up at the end of a line in the presentation data stream. &lt;/FONT&gt;&lt;BR&gt;&lt;/STRONG&gt;&lt;BR&gt;&lt;FONT color=#006400 size=2&gt;This possible approach was also debated, but was rejected. The opponents of that approach can speak for themselves, but if I recall, this approach would, itself, have had at least as many legacy compatibility issues.&lt;/FONT&gt;&lt;BR&gt;&lt;BR&gt;&lt;STRONG&gt;&lt;FONT color=#800080 size=2&gt;This would have preserved backwards compatibility with the zillions of ISO 8859-1 output devices out there that treat SOFT HYPHEN as a graphical character.&lt;BR&gt;&lt;BR&gt;What shall I now do as the implementor of an ISO 8859-1 terminal emulator when I receive a SOFT HYPHEN?&lt;/FONT&gt;&lt;BR&gt;&lt;/STRONG&gt;&lt;BR&gt;&lt;FONT color=#006400 size=2&gt;Exactly what you are currently doing.&lt;BR&gt;&lt;/FONT&gt;&lt;BR&gt;&lt;STRONG&gt;&lt;FONT color=#800080 size=2&gt;Will the next edition of ISO 8859 be changed, to remove the definition of the SOFT HYPHEN as a graphical character?&lt;/FONT&gt;&lt;BR&gt;&lt;/STRONG&gt;&lt;BR&gt;&lt;FONT color=#006400 size=2&gt;Of course not. It will stay exactly as it is.&lt;BR&gt;&lt;BR&gt;The ambiguity in "graphic character" will say unchanged in the SC2 standards. Note that 10646 itself keeps the traditional SC2 definition:&lt;BR&gt;&lt;BR&gt;&amp;nbsp; A character, other than a control function, that has a visual&amp;nbsp;representation normally handwritten, printed, or displayed.&lt;BR&gt;&amp;nbsp; &lt;BR&gt;but then proceeds to encode a whole host of space characters and format control characters which normally *don't* have a visual representation. 
These are then swept under the rug with the same logical nicety used for SPACE:&lt;BR&gt;&lt;BR&gt;&amp;nbsp; "A graphic character the visual representation of which consists&amp;nbsp;of the absence of a graphic symbol."&lt;BR&gt;&amp;nbsp;&amp;nbsp; &lt;BR&gt;Uh, huh. O.k., well, then... ;-)&lt;/FONT&gt;&lt;BR&gt;&amp;nbsp;&amp;nbsp;&lt;BR&gt;&lt;FONT size=2&gt;&lt;STRONG&gt;&lt;FONT color=#800080&gt;Or, my preferred outcome, do you agree that all this SOFT HYPHEN = Cf revision was probably a mistake and we should undo everything quickly in the next revision?&lt;/FONT&gt;&lt;BR&gt;&lt;/STRONG&gt;&lt;/FONT&gt;&lt;BR&gt;&lt;FONT color=#006400 size=2&gt;I think that is most unlikely at this point. The issue for SOFT HYPHEN was up for public review for rather a long time. The decision was not hurried for it, but extended through a number of UTC meetings, precisely because people were worried about compatibility and legacy issues. But I don't think the issue should be reopened and redecided differently. The only thing worse than a poor decision by a standards committee is waffling about decisions by a standards committee.&lt;BR&gt;&lt;BR&gt;And in this case, I don't really see why you cannot keep on doing what you are currently doing for the UTF-8 terminal emulations. If you document how U+00AD behaves in those emulations, and you should be fine.&lt;/FONT&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;
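&lt;P&gt;The exception Ken suggests is easy to picture in code. What follows is only a rough sketch in Python, not the actual wcwidth implementation Markus is describing: the function name column_width is hypothetical, and anything not covered by the two rules quoted above is simply assumed to take one column (so combining marks, wide East Asian characters and control codes are ignored here).&lt;/P&gt;

```python
import unicodedata

SOFT_HYPHEN = "\u00ad"
ZERO_WIDTH_SPACE = "\u200b"

def column_width(ch):
    """Simplified terminal column width for a single character."""
    # Exception suggested in the reply above: U+00AD keeps behaving
    # as a spacing character even though its category is now Cf.
    if ch == SOFT_HYPHEN:
        return 1
    # Generic rule quoted above: format characters (Cf) and
    # ZERO WIDTH SPACE do not advance the cursor.
    if ch == ZERO_WIDTH_SPACE or unicodedata.category(ch) == "Cf":
        return 0
    # Assumption for this sketch only: everything else is one column.
    return 1

for ch in (SOFT_HYPHEN, "\u200e", ZERO_WIDTH_SPACE, "a"):
    print("U+%04X" % ord(ch), column_width(ch))
```

&lt;P&gt;The point of putting the SOFT HYPHEN test first is that the exception is documented right in the code, which is more or less what Ken was asking for.&lt;/P&gt;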
&lt;P&gt;And after that, Michel Suignard pointed out an issue that had been overlooked by some, which was 10646 stepping up! &lt;/P&gt;
&lt;BLOCKQUOTE dir=ltr style="MARGIN-RIGHT: 0px"&gt;
&lt;P&gt;&lt;FONT size=2&gt;Note that unusally, the latest text from ISO 10646 both in the&lt;BR&gt;10646-1:2000 2nd amendment and the consolidated version capture&lt;BR&gt;verbosely the latest view on this as follows:&lt;/FONT&gt;&lt;/P&gt;
&lt;BLOCKQUOTE dir=ltr style="MARGIN-RIGHT: 0px"&gt;
&lt;P&gt;&lt;FONT size=2&gt;SOFT HYPHEN (00AD): SOFT HYPHEN (SHY) is a format character that indicates a preferred intra-word linebreak opportunity. If the line is broken at that point, then whatever mechanism is appropriate for intra-word line-breaks should be invoked, just as if the line break had been triggered by another mechanism, such as a dictionary lookup. Depending on the language and the word, that may produce different visible results, such as: &lt;/FONT&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;FONT size=2&gt;inserting a graphic symbol indicating the hyphenation and breaking the line after it,&lt;/FONT&gt; 
&lt;LI&gt;&lt;FONT size=2&gt;inserting a graphic symbol indicating the hyphenation, breaking the line after the symbol and changing spelling in the divided word parts,&lt;/FONT&gt; 
&lt;LI&gt;&lt;FONT size=2&gt;not showing any visible change and simply breaking the line at that point.&lt;/FONT&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;FONT size=2&gt;
&lt;P&gt;The inserted graphic symbol, if any, can take a wide variety of shapes, such as HYPHEN (2010), ARMENIAN HYPHEN (058A), MONGOLIAN TODO SOFT HYPHEN (1806), as appropriate for the situation. When encoding text that includes explicit line breaking opportunities, including actual hyphenations, characters such as HYPHEN, ARMENIAN&lt;BR&gt;HYPHEN, and MONGOLIAN TODO SOFT HYPHEN may be used, depending on the language. &lt;BR&gt;&lt;BR&gt;When a SOFT HYPHEN is used to represent a possible hyphenation point, the character representation is that of the text sequence without hyphenation (for example: "tug&amp;lt;00AD&amp;gt;gumi"). When encoding text that includes hard line breaks, including actual hyphenations, the character representation of the text sequence must reflect the changes due to hyphenation (for example: "tugg&amp;lt;2010&amp;gt;" / "gumi").&lt;/FONT&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;&lt;FONT size=2&gt;This was discussed at length during the UTC and WG2.&lt;/FONT&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;
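&lt;P&gt;To illustrate the model in that text, here is a small sketch in Python that treats SOFT HYPHEN purely as a break opportunity. It only covers the first case in the list (insert a hyphen symbol and break after it): the spelling change in the tug/gumi example and the script-specific hyphen shapes are deliberately left out, and the function name wrap_word is hypothetical.&lt;/P&gt;

```python
SHY = "\u00ad"  # SOFT HYPHEN

def wrap_word(word, width):
    """Greedily break one word at SOFT HYPHEN positions to fit a column width."""
    pieces = word.split(SHY)
    lines = []
    current = pieces[0]
    for index, piece in enumerate(pieces[1:], start=1):
        is_last = index == len(pieces) - 1
        # Reserve one column for a visible hyphen unless this piece ends the word.
        needed = len(current) + len(piece) + (0 if is_last else 1)
        if needed > width:
            # Break taken here: image a HYPHEN (U+2010) at the end of the line.
            lines.append(current + "\u2010")
            current = piece
        else:
            # No break taken: the soft hyphen leaves no visible trace.
            current += piece
    lines.append(current)
    return lines

print(wrap_word("tug" + SHY + "gumi", 5))   # narrow column, break is taken
print(wrap_word("tug" + SHY + "gumi", 10))  # wide column, no break needed
```

&lt;P&gt;A piece that is longer than the column by itself is not split any further in this sketch; it simply overflows.&lt;/P&gt;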
&lt;P&gt;And Kent Karlsson also pointed out some facts that had been ignored by Markus and others who were stating their opinions:&lt;/P&gt;
&lt;BLOCKQUOTE dir=ltr style="MARGIN-RIGHT: 0px"&gt;
&lt;P&gt;&lt;FONT size=2&gt;That text is unfortunately too easy to misread (and overinterpret!). Having talked to one of the authors, he was very surprised at the Kuhn/Korpela interpretation. I think it is a case of being too close to a text to have seen how it could be misread by others.&amp;nbsp; Kuhn's interpretation was definitely not intended (and very few interpret it that way).&lt;BR&gt;&lt;BR&gt;The intent of that text, that you partially quoted, is that SOFT HYPHEN is graphic (and imaged) WHEN an (automatic) line break has been made (while it is otherwise invisible, which was not clearly stated). Unicode, SC2/WG2, SC2/WG3, IBM, MS, Adobe, and many others agree on that. Whether it is visible just before an explicit line break (e.g. an LF), is still not clearly stated (though in practice it is, by the already deployed software that does suppport SHY).&lt;BR&gt;&lt;BR&gt;What IS new with Unicode 4.0 is that **when imaged** the SOFT HYPHEN may take any suitable hyphen shape (to be nitpicking, it best not to see the SOFT HYPHEN as ever being imaged (except in a "show invisibles" mode, it is just a hyphenation point indication, and the hyphen being imaged when there is a line break is not the actual SHY character).&amp;nbsp; E.g., in Mongolian texts, it should be imaged as a MONGOLIAN TODO SOFT HYPHEN (the "soft" in that name has been decided to be a mistake). In an Armenian text, SOFT HYPHEN *when imaged* takes on the shape of an ARMENIAN HYPHEN.&lt;/FONT&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;Some may also remember my &lt;STRONG&gt;&lt;A href="http://blogs.msdn.com/michkap/archive/2006/06/26/648040.aspx"&gt;Not all GetUnicodeCategory methods are created equal&lt;/A&gt;&lt;/STRONG&gt; post, which clearly notes the fact that there was some managed code that was depending on the old categorization of SOFT HYPHEN....&lt;/P&gt;
&lt;P&gt;All in all, it makes for a fascinating story. :-)&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;&lt;FONT color=#ff1493&gt;This post brought to you by &lt;/FONT&gt;&lt;/EM&gt;&lt;/FONT&gt;&lt;EM&gt;&lt;FONT color=#ff1493&gt;&lt;A href="http://www.fileformat.info/info/unicode/char/00ad"&gt;U+00ad&lt;/A&gt;&lt;/FONT&gt;&lt;/EM&gt;&lt;EM&gt;&lt;FONT color=#ff1493&gt;, a.k.a. SOFT HYPHEN&lt;/FONT&gt;&lt;/EM&gt;&lt;/P&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=736881" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/michkap/archive/tags/Collation_2F00_Casing/default.aspx">Collation/Casing</category><category domain="http://blogs.msdn.com/michkap/archive/tags/Encoding_2F00_Codepages/default.aspx">Encoding/Codepages</category><category domain="http://blogs.msdn.com/michkap/archive/tags/Every+Character+Has+a+Story/default.aspx">Every Character Has a Story</category><category domain="http://blogs.msdn.com/michkap/archive/tags/Int_2700_l+Programming/default.aspx">Int'l Programming</category><category domain="http://blogs.msdn.com/michkap/archive/tags/Fonts_2F00_Typography/default.aspx">Fonts/Typography</category></item><item><title>Every character has a story #22: U+02c7 (CARON)</title><link>http://blogs.msdn.com/michkap/archive/2006/08/14/698983.aspx</link><pubDate>Mon, 14 Aug 2006 10:21:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:698983</guid><dc:creator>Michael S. Kaplan</dc:creator><slash:comments>1</slash:comments><comments>http://blogs.msdn.com/michkap/comments/698983.aspx</comments><wfw:commentRss>http://blogs.msdn.com/michkap/commentrss.aspx?PostID=698983</wfw:commentRss><wfw:comment>http://blogs.msdn.com/michkap/rsscomments.aspx?PostID=698983</wfw:comment><description>&lt;FONT face=Tahoma&gt;
&lt;P&gt;The CARON has a long and unhappy history, one that is tied up with that whole &lt;A href="http://www.fileformat.info/info/unicode/category/Sk/index.htm"&gt;&lt;STRONG&gt;Sk&lt;/STRONG&gt;&lt;/A&gt;/&lt;A href="http://www.fileformat.info/info/unicode/category/Lm"&gt;&lt;STRONG&gt;Lm&lt;/STRONG&gt;&lt;/A&gt; general category thing I talked about in &lt;STRONG&gt;&lt;A href="http://blogs.msdn.com/michkap/archive/2006/08/10/694593.aspx"&gt;this post&lt;/A&gt;&lt;/STRONG&gt;.&lt;/P&gt;
&lt;P&gt;Ken Whistler laid it out for the CARON just recently, starting with the meandering path through UnicodeData.txt:&lt;/P&gt;
&lt;BLOCKQUOTE dir=ltr style="MARGIN-RIGHT: 0px"&gt;
&lt;P&gt;&lt;FONT face="Times New Roman" size=2&gt;UnicodeData-1.1.5.txt:&lt;BR&gt;&lt;BR&gt;02C7;CARON;Lm;0;L;0020 030C;;;;N;MODIFIER LETTER HACEK;Mandarin Chinese third &lt;BR&gt;tone;;;&lt;BR&gt;&lt;BR&gt;UnicodeData-2.0.14.txt:&lt;BR&gt;&lt;BR&gt;02C7;CARON;Sk;0;L;;;;;N;MODIFIER LETTER HACEK;Mandarin Chinese third tone;;;&lt;BR&gt;&lt;BR&gt;UnicodeData-4.0.0d1.txt:&lt;BR&gt;&lt;BR&gt;02C7;CARON;Sk;0;ON;;;;;N;MODIFIER LETTER HACEK;Mandarin Chinese third tone;;;&lt;BR&gt;&lt;BR&gt;UnicodeData-4.0.0d2.txt:&lt;BR&gt;&lt;BR&gt;02C7;CARON;Lm;0;ON;;;;;N;MODIFIER LETTER HACEK;Mandarin Chinese third tone;;;&lt;BR&gt;&lt;BR&gt;So: gc=Lm --&amp;gt; Sk --&amp;gt; Lm&lt;BR&gt;&lt;BR&gt;The difference for U+02C7 being that Unicode 1.1 mistakenly indicated that it was a spacing clone of a diacritic (by that 0020 030C decomposition), which was corrected in Unicode 2.0.&lt;BR&gt;&lt;BR&gt;The name anomaly comes from the fact that U+02C7 was mapped to the 8859-2 0xB7 CARON, and the 10646 merger and SC2 rules made us change the name from MODIFIER LETTER HACEK as a result.&lt;BR&gt;&lt;BR&gt;But the intent of U+02C6 and U+02C7 as encoded characters in Unicode, since the days of Unicode 1.0, has always been completely parallel -- which is why their General Category history is equally sorry.&lt;/FONT&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;
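&lt;P&gt;For anyone who wants to follow that history mechanically, the records Ken quotes are just semicolon-separated fields, with the General Category in the third field, the bidi class in the fifth and the decomposition in the sixth. Here is a minimal sketch in Python; the function name parse_unicode_data_line and the dictionary keys are made up for this illustration.&lt;/P&gt;

```python
def parse_unicode_data_line(line):
    """Pick a few fields out of one UnicodeData.txt record."""
    fields = line.split(";")
    return {
        "code_point": int(fields[0], 16),
        "name": fields[1],
        "general_category": fields[2],
        "bidi_class": fields[4],
        "decomposition": fields[5],
    }

# The U+02C7 records quoted above, keyed by UnicodeData version.
HISTORY = {
    "1.1.5": "02C7;CARON;Lm;0;L;0020 030C;;;;N;MODIFIER LETTER HACEK;Mandarin Chinese third tone;;;",
    "2.0.14": "02C7;CARON;Sk;0;L;;;;;N;MODIFIER LETTER HACEK;Mandarin Chinese third tone;;;",
    "4.0.0d1": "02C7;CARON;Sk;0;ON;;;;;N;MODIFIER LETTER HACEK;Mandarin Chinese third tone;;;",
    "4.0.0d2": "02C7;CARON;Lm;0;ON;;;;;N;MODIFIER LETTER HACEK;Mandarin Chinese third tone;;;",
}

for version, line in HISTORY.items():
    record = parse_unicode_data_line(line)
    print(version, record["general_category"], record["bidi_class"])
```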
&lt;P&gt;It becomes a great example of making things look less stable than they [usually] are. But as I said in &lt;A href="http://blogs.msdn.com/michkap/archive/2006/08/10/694593.aspx"&gt;&lt;STRONG&gt;this post&lt;/STRONG&gt;&lt;/A&gt;, the difference between the two &lt;A href="http://www.fileformat.info/info/unicode/category/index.htm"&gt;general categories&lt;/A&gt; is that one is meant to be usable in identifiers, and the other is not. As Ken pointed out: &lt;/P&gt;
&lt;BLOCKQUOTE dir=ltr style="MARGIN-RIGHT: 0px"&gt;
&lt;P&gt;&lt;FONT face="Times New Roman" size=2&gt;This whole problem with modifier letters and gc=Sk versus gc=Lm is like that proverbial pebble in the shoe, I'm afraid. Every few years it becomes a "problem" to sort out again, and ends up with a few more characters jiggered one way or another across that boundary.&lt;/FONT&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;
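&lt;P&gt;The identifier consequence of being on one side of that boundary or the other can be seen in any language that builds its identifier rules on Unicode properties. As a rough sketch, and assuming a Python 3 interpreter (whose str.isidentifier follows the Unicode identifier properties of whatever Unicode version it was built with), compare CARON with a nearby gc=Sk character. The helper name usable_in_identifier is hypothetical.&lt;/P&gt;

```python
import unicodedata

CARON = "\u02c7"           # CARON, gc=Lm
LEFT_ARROWHEAD = "\u02c2"  # MODIFIER LETTER LEFT ARROWHEAD, gc=Sk

def usable_in_identifier(ch):
    """Check whether ch may follow a letter in a Python identifier."""
    return ("x" + ch).isidentifier()

for ch in (CARON, LEFT_ARROWHEAD):
    print(unicodedata.name(ch), unicodedata.category(ch), usable_in_identifier(ch))
```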
&lt;P&gt;He then talked about the history of Sk:&lt;/P&gt;
&lt;BLOCKQUOTE dir=ltr style="MARGIN-RIGHT: 0px"&gt;
&lt;P&gt;&lt;FONT face="Times New Roman" size=2&gt;For those who care about the history here, gc=Sk didn't exist at all in the original set of General Category values invented by Mark. All the characters *named* MODIFIER LETTER WHATEVER&lt;BR&gt;in UnicodeData-1.1.5.txt got the gc=Lm value.&lt;BR&gt;&lt;BR&gt;Mark introduced gc=Sk in Unicode 2.0 to solve a different problem, which was the UTC then groping towards an identifier syntax that would do the right thing for Unicode strings based on Unicode character properties. See Section 5.14 Identifiers, pp. 5-25 to 5-27 in Unicode 2.0 if you can find a copy. In Unicode 2.0, identifiers were constructed on the [alphabetic] property, plus a number of additions and exceptions. But [alphabetic] itself, whose values were printed in the book, by the way, at pp. 4-14 to 4-15, claimed to include "modifier letters". That was problematical, and some of the modifier letters that clearly didn't look like they belonged in identifiers, were&lt;BR&gt;drained from [alphabetic] by inventing the new General Category Sk (symbol modifier), so they got classed with the other symbols, rather than getting lumped with the letters and such under [alphabetic].&lt;BR&gt;&lt;BR&gt;Incidentally, the only place in the Unicode 2.0 standard, other than &lt;BR&gt;&lt;BR&gt;&lt;A href="http://www.unicode.org/Public/2.0-Update/ReadMe-2.0.14.txt"&gt;http://www.unicode.org/Public/2.0-Update/ReadMe-2.0.14.txt&lt;/A&gt;&lt;BR&gt;&lt;BR&gt;where General Category values are explicitly enumerated&amp;nbsp;was in the discussion of locating text element boundaries (Section 5.13), where the addition of gc=Sk got overlooked and was not yet properly accounted for. 
The boundary specification assumed that "MODIFIER LETTERS" were, well, modifier letters, and the description even explicitly says:&lt;BR&gt;&lt;BR&gt;&amp;nbsp; Lm = Modifier Letter (includes spacing versions of non-spacing marks)&lt;BR&gt;&amp;nbsp; &lt;BR&gt;So that part of the 2.0 standard was inconsistent with the changes that had been made to deal with identifier syntax.&lt;BR&gt;&lt;BR&gt;It was Unicode 3.0 that revised the identifier syntax to make the classes specifically be based on General Category values, rather than [alphabetic] with exceptions. And there were a significant number of General Category changes which were driven by this. See Appendix D of Unicode 3.0, which notes the then-significant issue of trying to establish convergence between the Unicode definition of identifiers and the ISO TR 10176 definition of identifiers, which was being bandied about at that point as essential for formal programming languages. Page 979 of TUS 4.0:&lt;BR&gt;&lt;BR&gt;&amp;nbsp;&amp;nbsp; General Category. A series of General Category changes were made to assist the convergence of the Unicode definition of identifier with ISO TR 10176.&lt;BR&gt;&amp;nbsp;&amp;nbsp; &lt;BR&gt;Post Unicode 3.0 was when Mark staked out more territory in character properties, took over PropList.txt and started producing derived properties, using sets of tools for checking consistency, and introducing more properties of the Other_XYZ type to enable more robust derivation rules. The period between Unicode 3.0 and Unicode 4.0 saw all kinds of jiggering of General Category values that resulted from this, including the long list of proposed changes in L2/02-267.&lt;BR&gt;&lt;BR&gt;Most of the revisions that resulted were undoubtedly improvements, but in the area of "MODIFIER LETTERS" things have just gotten more confused, in my opinion. 
In part this has resulted from an essential disconnect between the people proposing new characters for new scripts and additions of miscellaneous abstruse and oddball stuff for Latin, and the people maintaining and extending the Unicode Character Database.&lt;/FONT&gt;&lt;BR&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;And then at the end of this description came the most amusing summary:&lt;/P&gt;
&lt;BLOCKQUOTE dir=ltr style="MARGIN-RIGHT: 0px"&gt;
&lt;P&gt;&lt;FONT face="Times New Roman" size=2&gt;To make this perhaps too pointedly ad hominem, but nevertheless fairly accurate, Michael Everson does not fully understand character properties or their interactions as demonstrated by Mark Davis' manifold property tools, and Mark Davis does not fully understand the functioning of modifier letters in newly encoded scripts and the numerous technical extensions for the Latin script. This tends to leave both of them, and the UTC as well, scratching their heads over the "Is it Lm? Or is it Sk?" decision that inevitably has to be made for all of these additions.&lt;/FONT&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;I guess you could say that not only does every character have a story; the truth is that some of them inspire monologues! :-)&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;FONT color=#ff1493&gt;&lt;EM&gt;This post brought to you by&lt;/EM&gt; &lt;FONT size=6&gt;&lt;STRONG&gt;ˇ&lt;/STRONG&gt;&lt;/FONT&gt; &lt;EM&gt;(U+02c7, a.k.a. CARON, f.k.a. MODIFIER LETTER HACEK)&lt;/EM&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/FONT&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=698983" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/michkap/archive/tags/Unicode_2F00_standards/default.aspx">Unicode/standards</category><category domain="http://blogs.msdn.com/michkap/archive/tags/Every+Character+Has+a+Story/default.aspx">Every Character Has a Story</category></item><item><title>Every character has a story #21: U+0f77 U+0f79 (TIBETAN VOWEL SIGN VOCALIC [RR|LL])</title><link>http://blogs.msdn.com/michkap/archive/2006/06/28/648940.aspx</link><pubDate>Wed, 28 Jun 2006 10:01:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:648940</guid><dc:creator>Michael S. Kaplan</dc:creator><slash:comments>0</slash:comments><comments>http://blogs.msdn.com/michkap/comments/648940.aspx</comments><wfw:commentRss>http://blogs.msdn.com/michkap/commentrss.aspx?PostID=648940</wfw:commentRss><wfw:comment>http://blogs.msdn.com/michkap/rsscomments.aspx?PostID=648940</wfw:comment><description>&lt;P&gt;&lt;FONT face=Tahoma&gt;Peter Constable asked some Unicode folks: &lt;/FONT&gt;&lt;/P&gt;&lt;FONT face=Tahoma&gt;
&lt;BLOCKQUOTE dir=ltr style="MARGIN-RIGHT: 0px"&gt;
&lt;DIV class=Section1&gt;
&lt;P&gt;&lt;FONT size=2&gt;&lt;FONT face="Times New Roman"&gt;I’m just curious to know why 0f77 and 0f79 were given compatibility decompositions rather than canonical decompositions? (I don’t see any obvious reason why canonical decompositions would not have been feasible.)&lt;BR&gt;&lt;BR&gt;(Yes, I know this can’t be changed – that’s not my objective.)&lt;BR&gt;&lt;BR&gt;Peter Constable&lt;/FONT&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/DIV&gt;&lt;/BLOCKQUOTE&gt;&lt;/FONT&gt;
&lt;P&gt;&lt;FONT face=Tahoma&gt;And Ken Whistler stepped up with a good historical look at these two characters (which in my humble opinion deserves a more permanent location for others to see!):&lt;/FONT&gt;&lt;/P&gt;
&lt;BLOCKQUOTE dir=ltr style="MARGIN-RIGHT: 0px"&gt;
&lt;P&gt;&lt;FONT face="Times New Roman" size=2&gt;&lt;FONT face="Consolas,Courier New"&gt;&lt;STRONG&gt;0F71;TIBETAN VOWEL SIGN AA;Mn;129;NSM;;;;;N;;;;;&lt;BR&gt;&lt;BR&gt;0F76;TIBETAN VOWEL SIGN VOCALIC R;Mn;0;NSM;0FB2 0F80;;;;N;;;;;&lt;BR&gt;0F77;TIBETAN VOWEL SIGN VOCALIC RR;Mn;0;NSM;&amp;lt;compat&amp;gt; 0FB2 0F81;;;;N;;;;;&lt;BR&gt;0F78;TIBETAN VOWEL SIGN VOCALIC L;Mn;0;NSM;0FB3 0F80;;;;N;;;;;&lt;BR&gt;0F79;TIBETAN VOWEL SIGN VOCALIC LL;Mn;0;NSM;&amp;lt;compat&amp;gt; 0FB3 0F81;;;;N;;;;;&lt;BR&gt;0F80;TIBETAN VOWEL SIGN REVERSED I;Mn;130;NSM;;;;;N;;;;;&lt;BR&gt;0F81;TIBETAN VOWEL SIGN REVERSED II;Mn;0;NSM;0F71 0F80;;;;N;;;;;&lt;BR&gt;&lt;BR&gt;0FB2;TIBETAN SUBJOINED LETTER RA;Mn;0;NSM;;;;;N;;*;;;&lt;BR&gt;0FB3;TIBETAN SUBJOINED LETTER LA;Mn;0;NSM;;;;;N;;;;;&lt;BR&gt;&lt;/STRONG&gt;&lt;/FONT&gt;&lt;BR&gt;&amp;nbsp; &lt;BR&gt;&lt;FONT face="Consolas,Courier New"&gt;&lt;STRONG&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; NFD&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; NFC&lt;BR&gt;0F76&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0FB2 0F80&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0FB2 0F80&lt;BR&gt;0F77&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0F77&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0F77&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;lt;-- discouraged (strongly)&lt;BR&gt;0FB2 0F71 0F80&amp;nbsp; 0FB2 0F71 0F80&amp;nbsp; 0FB2 0F71 0F80&amp;nbsp; &amp;lt;-- preferred&lt;BR&gt;0F78&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 
0FB3 0F80&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0FB3 0F80&lt;BR&gt;0F79&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0F79&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0F79&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;lt;-- discouraged (strongly)&lt;BR&gt;0FB3 0F71 0F80&amp;nbsp; 0FB3 0F71 0F80&amp;nbsp; 0FB3 0F71 0F80&amp;nbsp; &amp;lt;-- preferred&lt;BR&gt;0F80&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0F80&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0F80&lt;BR&gt;0F81&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0F71 0F80&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0F71 0F80&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;lt;-- discouraged&lt;BR&gt;0F71 0F80&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0F71 0F80&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0F71 0F80&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;lt;-- preferred&lt;BR&gt;&lt;/STRONG&gt;&lt;/FONT&gt;&lt;BR&gt;Note that the preferred forms appear in both NFD and NFC, with the decomposed form for 0F81 resulting from the non-starter exclusion and the decomposed forms for 0F76 and 0F78 resulting from explicit addition to the script-specific composition exclusions.&lt;BR&gt;&lt;BR&gt;If you gave 0F77 and 0F79 *canonical* decompositions, then:&lt;BR&gt;&lt;BR&gt;&lt;FONT face="Consolas,Courier New"&gt;&lt;STRONG&gt;0F77 --&amp;gt; &amp;lt;0FB2, 0F81&amp;gt; --&amp;gt; &amp;lt;0FB2, 0F71, 0F80&amp;gt;&lt;BR&gt;&amp;nbsp; 0&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 
0&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0&amp;nbsp;&amp;nbsp;&amp;nbsp; 129&amp;nbsp;&amp;nbsp; 130&lt;BR&gt;&amp;nbsp; &lt;BR&gt;0F79 --&amp;gt; &amp;lt;0FB3, 0F81&amp;gt; --&amp;gt; &amp;lt;0FB3, 0F71, 0F80&amp;gt;&lt;BR&gt;&amp;nbsp; 0&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0&amp;nbsp;&amp;nbsp;&amp;nbsp; 129&amp;nbsp;&amp;nbsp; 130&lt;BR&gt;&lt;/STRONG&gt;&lt;/FONT&gt;&lt;BR&gt;&lt;FONT face="Consolas,Courier New"&gt;&lt;STRONG&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; NFD&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; NFC&lt;BR&gt;0F76&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0FB2 0F80&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0FB2 0F80&lt;BR&gt;0F77&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0FB2 0F71 0F80&amp;nbsp; ????&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;lt;-- discouraged (strongly)&lt;BR&gt;0FB2 0F71 0F80&amp;nbsp; 0FB2 0F71 0F80&amp;nbsp; 0FB2 0F71 0F80&amp;nbsp; &amp;lt;-- preferred&lt;BR&gt;0F78&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0FB3 0F80&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0FB3 0F80&lt;BR&gt;0F79&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0FB3 0F71 0F80&amp;nbsp; 
????&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;lt;-- discouraged (strongly)&lt;BR&gt;0FB3 0F71 0F80&amp;nbsp; 0FB3 0F71 0F80&amp;nbsp; 0FB3 0F71 0F80&amp;nbsp; &amp;lt;-- preferred&lt;BR&gt;0F80&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0F80&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0F80&lt;BR&gt;0F81&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0F71 0F80&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0F71 0F80&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;lt;-- discouraged&lt;BR&gt;0F71 0F80&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0F71 0F80&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 0F71 0F80&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;lt;-- preferred&lt;BR&gt;&lt;/STRONG&gt;&lt;/FONT&gt;&lt;BR&gt;Now you've made your life more difficult and normalization implementations maybe more complex. The decompositions &amp;lt;0FB2, 0F71, 0F80&amp;gt; have to be prevented from recomposing. They won't&amp;nbsp; decompose partwise, because &amp;lt;0F71, 0F80&amp;gt; is blocked from recomposing, and &amp;lt;0FB2, 0F80&amp;gt; is also blocked from recomposing, but the sequence of 3 has, at least in principle, a target it should recompose to, unless blocked. 
Depending on how you set up your tables, you might or might not get this right, and in any case, you end up introducing the strongly discouraged characters as a source of valid sequences that you have to contend with in NFC and NFD, whereas under the current scheme you don't.&lt;BR&gt;&lt;BR&gt;Also, this was all part of a very head-breaking set of problems for Tibetan when decompositions and canonical combining classes were being reviewed for the introduction of normalization in the first place.&lt;BR&gt;&lt;BR&gt;In Unicode 2.0, 0F77 and 0F79 *were* given canonical decompositions, but they were *different* decompositions, to wit:&lt;BR&gt;&lt;BR&gt;&lt;STRONG&gt;&lt;FONT face="Consolas,Courier New"&gt;0F77 = 0F76 + 0F71 = 0FB2 + 0F80 + 0F71&lt;BR&gt;0F79 = 0F78 + 0F71 = 0FB3 + 0F80 + 0F71&lt;BR&gt;&lt;/FONT&gt;&lt;/STRONG&gt;&lt;BR&gt;*and* they had funky fixed position class assignments, as well:&lt;BR&gt;&lt;BR&gt;&lt;STRONG&gt;&lt;FONT face="Consolas,Courier New"&gt;0F77 = 0F76 + 0F71 = 0FB2 + 0F80 + 0F71&amp;nbsp; (not in canonical order)&lt;BR&gt;&amp;nbsp;135&amp;nbsp;&amp;nbsp;&amp;nbsp; 134&amp;nbsp;&amp;nbsp;&amp;nbsp; 129&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 6&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 143&amp;nbsp;&amp;nbsp;&amp;nbsp; 129&lt;BR&gt;&amp;nbsp;&lt;BR&gt;0F79 = 0F78 + 0F71 = 0FB3 + 0F80 + 0F71&amp;nbsp; (not in canonical order)&lt;BR&gt;&amp;nbsp;137&amp;nbsp;&amp;nbsp;&amp;nbsp; 136&amp;nbsp;&amp;nbsp;&amp;nbsp; 129&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 6&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 143&amp;nbsp;&amp;nbsp;&amp;nbsp; 129&lt;BR&gt;&lt;/FONT&gt;&lt;/STRONG&gt;&amp;nbsp;&lt;BR&gt;That was clearly hosed, as it broke all kinds of rules that we were trying to establish for normalization, including ensuring that all decomposition mappings produced sequences in canonical order and ensuring, as much as was possible, given the constraints in place, that the resulting sequences would follow the logic of the script *and* that NFC forms 
would decompose if that was what the users of the script preferred (hence the introduction of script-specific composition exclusions for several scripts, including Tibetan).&lt;BR&gt;&lt;BR&gt;During that conversion from Unicode 2.0 to Unicode 3.0 with normalization, the UTC did the best it could with the mess for Tibetan. It was clear after the analysis that 0F77 and 0F79 should never have been encoded at all -- which was why they got those "strongly discouraged" labels -- but there was nothing to do about that mistake at that point. The compatibility decompositions were the best compromise to keep them from contaminating the normalization processing of the rest of the Tibetan vowels.&lt;BR&gt;&lt;BR&gt;--Ken&lt;/FONT&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;
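&lt;P&gt;&lt;FONT face=Tahoma&gt;Ken's first table can be checked against a current normalization implementation. What follows is only a minimal sketch in Python, assuming the Unicode tables bundled with the standard unicodedata module; the helper names hexes, forms and combining_classes are made up for illustration and are not part of his mail:&lt;/FONT&gt;&lt;/P&gt;

```python
import unicodedata


def hexes(text):
    """Render a string as space-separated four-digit code points."""
    return " ".join(format(ord(ch), "04X") for ch in text)


def forms(text):
    """Return the NFD, NFC and NFKD forms of text as hex strings."""
    return {form: hexes(unicodedata.normalize(form, text))
            for form in ("NFD", "NFC", "NFKD")}


def combining_classes(text):
    """Return the canonical combining class of each code point in text."""
    return [unicodedata.combining(ch) for ch in text]


if __name__ == "__main__":
    # The vowel signs from the table, one row per code point.
    for cp in (0x0F76, 0x0F77, 0x0F78, 0x0F79, 0x0F81):
        print(format(cp, "04X"), forms(chr(cp)))
    # The preferred three-character sequence for VOCALIC RR.
    print(combining_classes("\u0fb2\u0f71\u0f80"))
```

&lt;P&gt;&lt;FONT face=Tahoma&gt;If the output agrees with the table, U+0f77 and U+0f79 come through both NFD and NFC untouched, and only a compatibility form (NFKD in this sketch) exposes the preferred three-character sequence.&lt;/FONT&gt;&lt;/P&gt;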
&lt;P&gt;&lt;FONT face=Tahoma&gt;Anyway, like I said, this seemed to me like good historical information to put out there, and certainly to help show that &lt;STRONG&gt;Every character has a story&lt;/STRONG&gt;!&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face=Tahoma&gt;&lt;/FONT&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;FONT face=Tahoma color=#ff1493&gt;&lt;EM&gt;This post brought to you by&lt;/EM&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;FONT face="Iskoola Pota" size=8&gt;ཷ&lt;/FONT&gt;&amp;nbsp;&lt;EM&gt;and&lt;/EM&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;FONT face="Iskoola Pota" size=8&gt;ཹ&lt;/FONT&gt; &lt;EM&gt;(&lt;A href="http://www.fileformat.info/info/unicode/char/0f77"&gt;U+0f77&lt;/A&gt; and &lt;A href="http://www.fileformat.info/info/unicode/char/0f79"&gt;U+0f79&lt;/A&gt;, a.k.a. TIBETAN VOWEL SIGN VOCALIC RR and TIBETAN VOWEL SIGN VOCALIC LL)&lt;/EM&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=648940" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/michkap/archive/tags/Every+Character+Has+a+Story/default.aspx">Every Character Has a Story</category></item><item><title>Every character has a story #20: U+210e (PLANCK CONSTANT)</title><link>http://blogs.msdn.com/michkap/archive/2006/04/21/580328.aspx</link><pubDate>Fri, 21 Apr 2006 10:00:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:580328</guid><dc:creator>Michael S. Kaplan</dc:creator><slash:comments>3</slash:comments><comments>http://blogs.msdn.com/michkap/comments/580328.aspx</comments><wfw:commentRss>http://blogs.msdn.com/michkap/commentrss.aspx?PostID=580328</wfw:commentRss><wfw:comment>http://blogs.msdn.com/michkap/rsscomments.aspx?PostID=580328</wfw:comment><description>&lt;P&gt;&lt;FONT face=Tahoma&gt;Unicode is a standard that fits all kinds of different needs.&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face=Tahoma&gt;Obviously, the need to represent text that may contain symbols used in science has always been important.&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face=Tahoma&gt;At some fundamental level, the identity of the symbol for Planck's constant goes beyond an italicized small letter 'h', and therefore&lt;/FONT&gt;&lt;FONT face=Tahoma&gt; the need to represent it even in plain text&amp;nbsp;was a real one.&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face=Tahoma&gt;Thus, it was encoded at U+210e, and it has been there since at least Unicode 1.1 (and I believe 1.0).&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face=Tahoma&gt;There is a nice description of the constant itself up on &lt;A href="http://en.wikipedia.org/wiki/Planck_constant"&gt;Wikipedia&lt;/A&gt;. It looks something like this:&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face=Tahoma size=6&gt;ℎ&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face=Tahoma&gt;For fun, you can even italicize it! :-)&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face=Tahoma size=6&gt;&lt;EM&gt;ℎ&lt;/EM&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face=Tahoma&gt;Recently, Andreas Prilop asked on the Unicode List:&lt;/FONT&gt;&lt;/P&gt;
&lt;BLOCKQUOTE dir=ltr style="MARGIN-RIGHT: 0px"&gt;
&lt;P&gt;&lt;FONT face="Times New Roman" size=2&gt;The symbol for Planck's constant (6.626E-34 J·s) is an italic "h". Why is there a special Unicode character U+210E for it?&lt;BR&gt;&lt;BR&gt;The symbol for the elementary charge is an italic "e". But of course there is no special Unicode character for it.&lt;BR&gt;&lt;BR&gt;The symbol for the speed of light is an italic "c". But of course there is no special Unicode character for it.&lt;BR&gt;&lt;BR&gt;The symbol for the fine structure constant is an italic "alpha". But of course there is no special Unicode character for it.&lt;BR&gt;&lt;BR&gt;etc. ad inf.&lt;BR&gt;&lt;BR&gt;So what's this U+210E for? IMHO, this character should be listed as deprecated.&lt;/FONT&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;&lt;FONT face=Tahoma&gt;Deborah Goldsmith of Apple was first to point out the answers to these questions:&lt;/FONT&gt;&lt;/P&gt;
&lt;BLOCKQUOTE dir=ltr style="MARGIN-RIGHT: 0px"&gt;
&lt;P&gt;&lt;FONT face="Times New Roman" size=2&gt;To differentiate it for purposes of representing mathematics in plain&amp;nbsp;text....&lt;BR&gt;&lt;BR&gt;....Note that there is not a MATHEMATICAL ITALIC SMALL H&amp;nbsp;precisely because U+210E exists.&lt;/FONT&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;&lt;FONT face=Tahoma&gt;And she is right -- if you look at the &lt;A href="http://www.unicode.org/charts/"&gt;Unicode charts&lt;/A&gt; at the &lt;A href="http://www.unicode.org/charts/PDF/U1D400.pdf"&gt;Mathematical Alphanumeric Symbols&lt;/A&gt; block, there is a reserved space at U+1d455. The reason for these spaces is described in &lt;A href="http://www.unicode.org/reports/tr25/"&gt;UTR#25: Unicode in Mathematics&lt;/A&gt;&amp;nbsp;right after Table 2.1:&lt;/FONT&gt;&lt;/P&gt;
&lt;BLOCKQUOTE dir=ltr style="MARGIN-RIGHT: 0px"&gt;
&lt;P&gt;&lt;FONT face="Times New Roman" size=2&gt;The plain letters have been unified with the existing characters in the Basic Latin and Greek blocks. There are 24 double-struck, italic, Fraktur and script characters that already exist in the Letterlike Symbols block (U+2100—U+214F). These are explicitly unified with the characters in this block and corresponding holes have been left in the mathematical alphabets.&lt;/FONT&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;&lt;FONT face=Tahoma&gt;Kind of says it all, doesn't it? :-)&lt;/FONT&gt;&lt;/P&gt;
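&lt;P&gt;&lt;FONT face=Tahoma&gt;The hole in the mathematical italic alphabet can be seen from code as well. Here is a small sketch in Python, assuming the tables bundled with the standard unicodedata module and that the italic small letters run in alphabetical order from MATHEMATICAL ITALIC SMALL A at U+1d44e; describe and italic_small are hypothetical helpers, not anything defined by Unicode or UTR#25:&lt;/FONT&gt;&lt;/P&gt;

```python
import unicodedata

PLANCK = "\u210e"      # PLANCK CONSTANT in Letterlike Symbols
HOLE = "\U0001d455"    # reserved slot where an italic small h would sit
ITALIC_SMALL_A = 0x1D44E


def describe(ch):
    """Return (name, general category), with a placeholder for unnamed code points."""
    return (unicodedata.name(ch, "(no name)"), unicodedata.category(ch))


def italic_small(letter):
    """Map a-z to its mathematical italic form, filling the hole with U+210E."""
    candidate = chr(ITALIC_SMALL_A + ord(letter) - ord("a"))
    return PLANCK if candidate == HOLE else candidate


if __name__ == "__main__":
    print(describe(PLANCK))
    print(describe(HOLE))
    for letter in "ghi":
        print(letter, describe(italic_small(letter)))
```

&lt;P&gt;&lt;FONT face=Tahoma&gt;The one special case in italic_small is the whole story in miniature: every other letter is a simple offset, and 'h' has to detour through the Letterlike Symbols block.&lt;/FONT&gt;&lt;/P&gt;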
&lt;P&gt;&lt;FONT face=Tahoma&gt;&lt;/FONT&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;FONT face=Tahoma color=#ff1493&gt;&lt;EM&gt;This post brought to you by&lt;/EM&gt; "ℎ" &lt;EM&gt;(&lt;A href="http://www.fileformat.info/info/unicode/char/210e"&gt;U+210e&lt;/A&gt;, a.k.a. PLANCK CONSTANT)&lt;/EM&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=580328" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/michkap/archive/tags/Every+Character+Has+a+Story/default.aspx">Every Character Has a Story</category></item><item><title>Every character has a story; some of them have cartoons!</title><link>http://blogs.msdn.com/michkap/archive/2006/04/14/576530.aspx</link><pubDate>Fri, 14 Apr 2006 20:20:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:576530</guid><dc:creator>Michael S. Kaplan</dc:creator><slash:comments>1</slash:comments><comments>http://blogs.msdn.com/michkap/comments/576530.aspx</comments><wfw:commentRss>http://blogs.msdn.com/michkap/commentrss.aspx?PostID=576530</wfw:commentRss><wfw:comment>http://blogs.msdn.com/michkap/rsscomments.aspx?PostID=576530</wfw:comment><description>&lt;P&gt;&lt;FONT face=Tahoma&gt;Adam Hill contacted me via that &lt;A HREF="/michkap/articles/280092.aspx"&gt;&lt;STRONG&gt;contact link&lt;/STRONG&gt;&lt;/A&gt; to point out &lt;A href="http://blogamundo.net/dev/"&gt;Hacklog (Blogamundo)&lt;/A&gt;. With a subtitle like "Poking holes in the language barrier since approximately one month from now" I guess it is saying something interesting!&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face=Tahoma&gt;A quick perusal shows lots of interesting topics --&amp;nbsp;it even &lt;A href="http://blogamundo.net/dev/2005/12/30/interview-with-microsoft-internationalization-guy/"&gt;linked to me&lt;/A&gt; after the Channel 9 interview....&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face=Tahoma&gt;Anyway, &lt;/FONT&gt;&lt;FONT face=Tahoma&gt;Adam&amp;nbsp;specifically gave me a heads up about a potential regular feature:&lt;/FONT&gt;&lt;/P&gt;&lt;FONT face=Tahoma&gt;&lt;FONT size=2&gt;
&lt;BLOCKQUOTE dir=ltr style="MARGIN-RIGHT: 0px"&gt;
&lt;P&gt;&lt;A href="http://blogamundo.net/dev/2006/04/11/loonicode0001/"&gt;Unicode comic strips&lt;/A&gt;, the End Days are nigh.&lt;/FONT&gt;&lt;FONT size=2&gt;&lt;BR&gt;&lt;BR&gt;Pat promises/threatens to do more :)&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;/FONT&gt;&lt;/FONT&gt;
&lt;P&gt;&lt;FONT face=Tahoma&gt;Well, since they all have stories, and some may now have cartoons, I have to wonder how long before we get the first graphic novel?&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face=Tahoma&gt;Blogamundo seems to me like a quite&amp;nbsp;worthy addition to the &lt;STRONG&gt;&lt;A HREF="/michkap/archive/2004/12/18/325156.aspx"&gt;blogs I read&lt;/A&gt;&lt;/STRONG&gt;, in any case. :-)&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face=Tahoma&gt;&lt;/FONT&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;FONT face=Tahoma color=#ff1493&gt;&lt;EM&gt;This post brought to you by&lt;/EM&gt; "⚇" &lt;EM&gt;(&lt;A href="http://www.fileformat.info/info/unicode/char/2687"&gt;U+2687&lt;/A&gt;, a.k.a. WHITE CIRCLE WITH TWO DOTS)&lt;/EM&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=576530" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/michkap/archive/tags/Every+Character+Has+a+Story/default.aspx">Every Character Has a Story</category></item></channel></rss>