The recent post about Are ligatures supposed to be thought of as 'single characters'? had a comment from RubenP that I thought could use some further conversation:
It must be said, but all the ClearType fonts with automatic fi ligatures look exceptionally bad for the sequence 'fij'; if you remember, the ij is quite frequent in Dutch, so that's a little troublesome. (To me at least ;-) But then again, the few fonts that contain a combining acute accent, hardly ever actually combine it with the j, and if they do, the accent is markably different from the accent on the (pre composed) i. Adding acutes to ij is actually something you'd want in Dutch (the acute is an emphasis mark and ij is a vowel; well a diphtong actually). But because of the very poor support for this kind of thing, even the official rule has become i acute + j, rather than i acute + j acute. Oh, and how does one stop these ligatures from happening? For example, in Turkish? IIRC the fi ligature is a big no-no in Turkish typography, because you cannot distinguish it from f + dotless i. With such silly things, I guess non-American digital typography still has a long way to go...
It is a fair point. What is often hinted at (like in Bill Hill's first post on fontblog) is that the two languages that got the most research and attention when it comes to ClearType and the many ClearType fonts are English and Japanese. And there is no shortcut to skip that research step....
It becomes obvious, when one considers the needs of languages like Dutch and Turkish such as those that RubenP pointed out, that not all of the Western Latin script languages were truly having their individual needs considered when the development of some of the so-called "C* fonts" took place.
The needs here are indeed sometimes script-specific but more often language-specific. And it is way too easy (when adding features that might be thought to look good for one language) to unintentionally screw over another language. Not to screw it over too much, mind you. Just to screw it over about the usual amount, if you know what I mean.
It's not like you can change these defaults later -- imagine what it would do to page flow and formatting in documents if such a global change were made -- a backcompat nightmare, to say the least!
Perhaps, in retrospect, a more generic approach to issues like the fi ligature could have been taken in the C* fonts. After all, this is a lesson we already learned in Microsoft Sans Serif and Tahoma. But typeface design at its best is a much more organic process than trying to imitate another font. So in the end, if a particular feature is on by default in a font and that feature is not so good for your language, then perhaps the logical conclusion is that this is not the best font for the language in question? :-)
So while it is true that many people are excited about the optional language features in OpenType and the exciting readability of ClearType, I find myself much more excited about the next ten years -- when the work that has happened here can be further tuned to cater to the needs of even more languages than the ones for which ClearType is optimized now. And when the ability to work with optional OpenType features is available in products like Microsoft Word and Publisher. When the promises delivered upon in technologies in Vista and Office 2007 are extended to cover so much more of the world....
In the meantime, my Visual Studio font is either Consolas or Courier New, depending on how much "Terminal Services to XP" work I have to do (since "ClearType over TS to an XP box" is not really quite there just yet!).
Makes for an exciting future, in any case. :-)
This post brought to you by ﬁ and ĳ (U+fb01 and U+0133, a.k.a. LATIN SMALL LIGATURE FI and LATIN SMALL LIGATURE IJ)
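For anyone who wants to poke at the two characters named in the sign-off, here is a minimal sketch using Python's standard unicodedata module. The helper name describe is made up for this illustration. It shows that both code points are single characters that only expand to two letters under compatibility normalization (NFKC), which is a different thing from a font quietly forming a ligature out of the letters f and i.

```python
import unicodedata

# The two characters from the sign-off, written as escapes so that no tool
# along the way can silently replace them with plain letters.
LIGATURES = ["\ufb01", "\u0133"]


def describe(ch: str) -> tuple[str, str, str]:
    """Return (code point, Unicode name, NFKC expansion) for one character."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.normalize("NFKC", ch),
    )


for ch in LIGATURES:
    print(describe(ch))
```

Canonical normalization (NFC or NFD) leaves both characters alone; only the compatibility forms split them.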
I spent way too much time vacillating between which song title to use for the post's title -- whether the Morrissey cred (and my more than moderately transparent desire to appeal to people who would recognize it) really did outweigh the more obviously wide recognizability of the song from the Coke commercials. I finally decided while talking to a colleague that I could go with two titles, for reasons I'll explain in a moment. Other runners-up for titles were rejected, such as one from The Beatles (too sacred) and one from The Carpenters (probably even fewer people would recognize it!).
Ssang is a Korean word, the Hangul of which is 쌍 (U+c30d, a.k.a. HANGUL SYLLABLE SSANGSIOS A IEUNG). And don't think I didn't notice the fact that SSANG is inside the name for SSANG (HANGUL SYLLABLE SSANGSIOS A IEUNG), either -- it's like a weird form of onomatopoeia or maybe like the old joke "in order to understand recursion, you must first understand recursion." :-)
The meaning of the word is pair as in two. So if we take that Hangul syllable and split it into its constituent Jamo:
ᄊ U+110a, a.k.a. HANGUL CHOSEONG SSANGSIOS
ᅡ U+1161, a.k.a. HANGUL JUNGSEONG A
ᆼ U+11bc, a.k.a. HANGUL JONGSEONG IEUNG
The word SSANG is used in the Jamo names to handle those doubled letters -- like to say that it isn't just ᄉ (SIOS); it's two of them or ᄊ (SSANGSIOS).
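The split into Jamo can be reproduced with canonical decomposition. This is just a sketch with Python's standard unicodedata module, and the function name jamo_of is mine, not anything from Unicode or Windows.

```python
import unicodedata


def jamo_of(syllable: str) -> list[tuple[str, str]]:
    """Canonically decompose a Hangul syllable and name each resulting Jamo."""
    decomposed = unicodedata.normalize("NFD", syllable)
    return [(f"U+{ord(j):04x}", unicodedata.name(j)) for j in decomposed]


# U+C30D is the syllable discussed above.
for code_point, name in jamo_of("\uc30d"):
    print(code_point, name)
```

The three lines printed are the same leading, vowel, and trailing Jamo listed above.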
SSANG is used for all of the doubled Jamo currently encoded in Unicode:
ᄁ U+1101, a.k.a. HANGUL CHOSEONG SSANGKIYEOK
ᄄ U+1104, a.k.a. HANGUL CHOSEONG SSANGTIKEUT
ᄈ U+1108, a.k.a. HANGUL CHOSEONG SSANGPIEUP
ᄊ U+110a, a.k.a. HANGUL CHOSEONG SSANGSIOS
ᄍ U+110d, a.k.a. HANGUL CHOSEONG SSANGCIEUC
ᄔ U+1114, a.k.a. HANGUL CHOSEONG SSANGNIEUN
ᄙ U+1119, a.k.a. HANGUL CHOSEONG SSANGRIEUL
ᅇ U+1147, a.k.a. HANGUL CHOSEONG SSANGIEUNG
ᅘ U+1158, a.k.a. HANGUL CHOSEONG SSANGHIEUH
ᆻ U+11bb, a.k.a. HANGUL JONGSEONG SSANGSIOS
ᇐ U+11d0, a.k.a. HANGUL JONGSEONG SSANGRIEUL
ᇮ U+11ee, a.k.a. HANGUL JONGSEONG SSANGIEUNG
(You may recall that I talked in Traditional vs. Modern Sorts about how North Korea and South Korea have two entirely different linguistic philosophies about how Hangul should collate, which center on what one would do with Jamo like these.)
I was running into an interesting issue with old Hangul the other day. I used the following Old Hangul Syllable that I contrived from valid Jamo sequences. If you have a font that shapes them it looks something like this:
The name of this syllable would be HANGUL SYLLABLE RIEUL-SSANGKIYEOK A-EU KIYEOK-NIEUN.
Now in trying to construct this syllable one sees that there are no Jamo for the Leading (RIEUL-SSANGKIYEOK), Vowel (A-EU) or Trailing (KIYEOK-NIEUN) Jamo, but there is one for part of the lead Jamo -- that SSANGKIYEOK part. So couldn't one use the sequence:
ᄅ까ᅳᆨᆫ
here?
It turns out, unfortunately, that this will render with that first Jamo left out:
In order to get it to render correctly, you have to not use the SSANG version of the Jamo; instead you have to specify the two Jamo separately:
ᄅᄀ가ᅳᆨᆫ
I was unable to find a source for the reason behind this specific issue; perhaps it is something specific to Microsoft's implementation. And since Unicode does not specify any sort of equivalence between a SSANG Jamo and the Jamo just put in twice, in its own way I guess it is a good thing that they don't look the same. I am curious whether there was specific logic behind the decision or not.
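The lack of equivalence is easy to check. Here is a small sketch with Python's standard unicodedata module (the helper name equivalent_under is made up) that compares the doubled Jamo against two single ones under all four normalization forms.

```python
import unicodedata

DOUBLED = "\u1101"            # HANGUL CHOSEONG SSANGKIYEOK
TWO_SINGLES = "\u1100\u1100"  # HANGUL CHOSEONG KIYEOK, twice


def equivalent_under(form: str, a: str, b: str) -> bool:
    """True if the two strings are identical after the given normalization."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form, equivalent_under(form, DOUBLED, TWO_SINGLES))
```

None of the four forms treats the two spellings as the same text, so any relationship between them has to come from the shaping engine or the font rather than from normalization.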
Though of course with names like HANGUL SYLLABLE RIEUL-SSANGKIYEOK A-EU KIYEOK-NIEUN to consider, perhaps deciding it the other way might have at least made name construction easier.
On the other hand I don't know of all that many utilities that do Old Hangul name construction; perhaps the algorithm behind that would make an interesting interview question?
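As a first stab at what such an algorithm might look like, here is a rough sketch. It is an assumption built only from the one example name given above, not a documented rule: Jamo names are grouped by their CHOSEONG, JUNGSEONG, or JONGSEONG prefix, joined with hyphens inside a group, and the groups are separated by spaces. The function name old_hangul_name is hypothetical, and the sketch does not validate that the Jamo arrive in leading, vowel, trailing order.

```python
import unicodedata

# Name prefixes of the three kinds of conjoining Jamo, in syllable order.
PREFIXES = ("HANGUL CHOSEONG ", "HANGUL JUNGSEONG ", "HANGUL JONGSEONG ")


def old_hangul_name(sequence: str) -> str:
    """Build a descriptive name from a Jamo sequence (hypothetical scheme)."""
    groups = {prefix: [] for prefix in PREFIXES}
    # NFD first, so a precomposed modern syllable is split into its Jamo.
    for ch in unicodedata.normalize("NFD", sequence):
        name = unicodedata.name(ch)
        for prefix in PREFIXES:
            if name.startswith(prefix):
                groups[prefix].append(name[len(prefix):])
                break
        else:
            raise ValueError(f"not a conjoining Jamo: U+{ord(ch):04X}")
    parts = ["-".join(groups[p]) for p in PREFIXES if groups[p]]
    return "HANGUL SYLLABLE " + " ".join(parts)


# The sequence that uses the SSANG Jamo, spelled out as escapes.
print(old_hangul_name("\u1105\u1101\u1161\u1173\u11a8\u11ab"))
```

Fed the sequence with SSANGKIYEOK, this produces the name given above. Fed the sequence that actually renders (KIYEOK twice), it produces RIEUL-KIYEOK-KIYEOK for the leading part instead, which is exactly the name construction wrinkle mentioned above: a real algorithm would have to decide whether to fold adjacent identical Jamo back into their SSANG name.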
At the moment there are not too many fonts out there that take advantage of the Microsoft OpenType implementation's features here, and not a whole lot of the other implementations seem to be out there, either. The former might be why there haven't been too many people complaining about the song Microsoft is ssanging, and the latter might be why there is little in the way of suggested implementation standards out there right now. None of the fonts listed here for example, which do support the range, do any shaping with the Jamo in the range.
The Microsoft implementation is based on the idea of specific known Old Hangul sequences (defined in the appendix here), so presumably if other sequences were determined to be valid as more Old Hangul syllables were identified (and there is a new proposal from South Korea to WG2 that even ignoring all its flaws does seem to suggest there are at least 8-10 such sequences that are not in this appendix), then they can always be added....
I'll talk about collation and the impact on it another day.
So, does anyone want to take a stab at the name building algorithm? :-)
This post brought to you by 쌍 (U+c30d, a.k.a. HANGUL SYLLABLE SSANGSIOS A IEUNG)
I heard from Cristian Secară from Romania again not too long ago:
Hi Michael,
First: thank you for your feedback regarding the Windows Keyboard Layouts page (your blog on 05 November).
Second: Tudor pointed me out, and then read on your blog (your blog on 19 November) about the European Union Expansion Font Update for Romanian and Bulgarian. One thing I want to add here: although the covered fonts are ok for physical written documents (and thanks to MS for that !), there is at least one more important font required here: Verdana. This is required for virtual written documents (i.e. internet).
Just from the early days of RC 1 beta testing I encountered discussions on forums, where some are saying something like "hey, look, Vista has the new keyboard layout included, I am using it right now !", while the answer was something like "hey, stop using it, we have difficulties in reading your post". Why ? Because the originator used Vista to write his post and the reader was on XP, so the ș and ț appeared as blank squares to the reader.
There is nothing unexpected about that user behaviour giving the existing situaton, but if a font update is on the way already, why then not complete the picture starting from right now ?
Some points here (my own speculations):
it may be important that existing XP users will not be disapointed about the new Vista, just because one single important font update is missing; I mean if the average XP user without much knowledge will see blank squares in messages written by Vista enthusiasts, they may have a bad influence in convincing others that Vista has "problems" with the Romanian language; again, in short term this may be an expected behaviour, but why not complete the fix if the fix has been already started in some way ?
it would be fine if Verdana is updated, because this can lead to write and read pages in correct Romanian as quick as possible; I know this will be a long process (months/years), but why not start from right now ?
Verdana is the most used font for web pages, at least, today in XP. I consider an update to this font to be equal in importance like the Arial and Times New Roman requested by EU. I am speaking for Romanian, I don't know what font the Bulgarian sites uses for most ...
Cristi
PS: I told to Tudor that the minimum font update should be Arial, Courier New, Times New Roman and Verdana, perhaps Georgia too; but I can live for now just with Arial, Times New Roman and Verdana update :)
-- Cristian Secară
Well, I suppose I could claim that as soon as I connected with Judy of the Microsoft Typography PM team that my extensive authority and influence led to everyone dropping what they were working on and providing an update to the update to include Verdana.
In truth, when I mentioned this mail to her, she let me know that they were already working to update the update to include Verdana.
And it is now available!
You can check out the updated European Union Expansion Font Update (which still has neither Romanian nor Bulgarian translations of the download page available but I continue to hope for the future!), a download that not only includes
Times New Roman (regular ȘșȚțЍѝ, bold ȘșȚțЍѝ, italic ȘșȚțЍѝ, and bold italic ȘșȚțЍѝ)
and
Arial (regular ȘșȚțЍѝ, bold ȘșȚțЍѝ, italic ȘșȚțЍѝ, and bold italic ȘșȚțЍѝ)
but now also includes
Verdana (regular ȘșȚțЍѝ, bold ȘșȚțЍѝ, italic ȘșȚțЍѝ, and bold italic ȘșȚțЍѝ)
Enjoy!
(I still owe several posts here about the Romanian letters and keyboards, and will be getting to them soon!)
This post brought to you by Ș (U+0218, a.k.a. LATIN CAPITAL LETTER S WITH COMMA BELOW)
Yes, C (not Claire, or Christine; the other C, I mean) is probably grimacing after just seeing the title of this blog.
I can almost hear the two of us acting out a bit from a Louis XIV song:
She said oh come on boy aren't you tired of talking Persia yet?
I said little girl what do you really expect?
I guess you have to be there. In my head, I mean....
So anyway, Afshar's question via the Contact Link was:
Hi Michael,
There is a good standard for Persian (Farsi) keyboard layout called ISIRI 9147 (former version was ISIRI 2901) but none versions of windows comply with it including Vista, XP, 98, 95. I always wonder why Microsoft doesn't like to use this standard and even more each version's keyboard layout is changing by new versions? ISIR stands for Institute Of Standards & Industrial Research Of Iran (http://www.isiri.org/). ISIRI in Iran is something like ANSI in U.S.
I am a Persian (Farsi) user of MS Windows and an i18n interested C# developer. I'm very pleased to help on this issue if I can. Please let me know if I can do anything.
afshar
And there was another, similar note from Nasser as well:
Dear Kaplan, Michael
As I Know you are a real good connector between microsoft and developers, I want to ask you to do something for me.
I Live in Iran and I always like to work with standard application. Microsoft is going to finalize Windows 7 and we want microsoft to implement a standard Keyboard Layout for Iran. Now in all windows versions (2008,vista,xp,etc) microsoft provides a keyboard layout for Farsi that is not a standard one.
The Institute of Standards & Industrial Research of Iran had announced two keyboard layouts: one ISIRI 2901 that realeased on about 13 years ago and the other one ISIRI 9147 that was the Persian Keyboard Layout version2 and released about 2 years ago.
We want to ask from microsoft to implement this keyboard Layout (ISIRI 9147) into windows seven and other following Operating System like Midori or viseversa Please show us the best way to connect to microsoft to ask this from them, or connect us to them, or better than everything ask them to implement this keyboard layout into windows.
FarsiWeb had implemented this layout and this layout is publicly available from here you can download it and use it but please do something for us, we use from a wide variety of unstandard application becouse of this bad kayboard layout that microsoft provided on windows.
there is a good group on google formerly named Persian Computing there are some good guys that can be helpfull for you like Behdad Esfahbod and Rouzbeh PourNader that were in ISIRI 9147 implementation team.
I am wonder if you help me, thank you so much,
Nasser
Well, let me explain.
Microsoft does not have a subsidiary in Iran, and we don't sell software there.
We aren't allowed to, actually.
And while recent developments like US Lifts Iran, Sudan, Cuba Internet Services Export Ban are interesting (note that this had not yet happened when those two messages were sent to me), the meaning of these initial steps in terms of how companies would engage in reviewing or implementing standards in these countries is not entirely clear.
Note from the announcement that article links to:
U.S. companies can now export instant messaging, e-mail and social-networking tools, blogging software, Web browsers and photo and movie sharing software, as long as the software is publicly available at no cost to the user, the Department of Treasury said in a press release.
It is unclear whether keyboards and such that only go into products that the new guidelines wouldn't cover (i.e. Windows) would in fact be included - I am not a lawyer, but it seems unlikely.
I know I wouldn't just randomly start doing the work before people a whole bunch of levels over me told me it was okay.
So I guess the direct answer to afshar and Nasser as to why Microsoft isn't looking at these standards would be that Microsoft (or at least the small part of it in which I sit!) hasn't been given the okay to be looking into supporting those things.
If and when that changes and Microsoft can more directly engage in-country, some aspects of our support (like LIPs and keyboards and locales and so on) can be targeted more to the new customers in these newly available markets, as opposed to the expatriate market that is the current primary focus....
Previous blogs from this series:
So after going way off-topic in the last part, I thought I'd come back to the reservation.
Ready?
Every version of Windows -- plus XP SP2 and XP SP3 -- has added new locales.
We've never taken any out, once they've shipped.
And we've only ever removed four sorts (described in Four exceptions to prove the rule).
Anyway, so it is hardly breaking news that we added locales in Windows 8.
The full list we've added (already mentioned at the IUC35) is:
This is a very interesting list, for several reasons, one of which is the wide differences in reasons each one was added.
In particular, there are several here that I am quite pleased about, and for all of them I recused myself from most of the justification conversations because I was personally rooting for them and didn't want an inappropriate bias to take away from my advocacy...
Now these aren't UI languages mind you -- these are just locales -- though as I pointed out earlier, locales are huge. It takes your breath away the first time you see your locale on the Standards and Formats list -- I know because I've witnessed it several times over the years (more on this another time).
So they are cool on their own, and I'll give part of the list of why I think a few of them are especially cool:
There is Central Kurdish (Arabic) - Iraq, which allowed me to be proven wrong about my earlier statements almost 2.5 years ago in The Whey doesn't get a locale, either.
And there is Tamil - Sri Lanka, which allows me to be a part of keeping ~1/3 of my promises about Tamil locales in Windows.
And there is Hawaiian (United States), which I despondently assumed would never be added, for no reason I could fathom, until the decision was made to not refuse to add it....
For the record, I'll say that Shawn Steele's advocacy of Hawaiian was a definite factor in its approval -- at the point where we asked about it, we were past the point of expecting to succeed; we were like the Chicago Cubs (assuming we already lost), which made the approval all the more sweet!
There is more here that is interesting and fun, which I'll be covering in future parts of the series....
I got mail yesterday from Frank Grießhammer of Adobe:
Michael,
Maybe you remember – I asked you some questions once before – most of which I could actually deal with myself.
I have been creating a bunch of keyboard layouts for Windows and Mac, to be shipped with the Adobe Pi fonts. The motivation behind this project is providing the user with a method to ‘key’ their symbol glyphs. The layouts were created for the Mac first, the XML was converted to a .klc file using a Python script, the final layout being compiled with MS Keyboard Layout Creator.
In the process, I made the following observation, which might be interesting material for your blog:
As of Unicode 5, Unicode values with 5 digits exist. Both Mac and Windows have (at least) four possible shift states for keyboards, I kept them unified across the platforms: ‘alt’ on the Mac would become ‘altGr’ on Windows (e.g. the right ‘alt’-key).
The observation I made has to do with the altGr/altGr+Shift states: If there is a 5-digit unicode value mapped to a key in either of those states, the respective key just won’t return anything on Windows.
I filed a bug with your colleagues, and in a long email conversation we came down to the point that unicode support in the OS is – to say the least – leaving a lot to desire. It’s broken. We could rule out MSKLC being the culprit, it really came down to the OS. By the way: I tested my example layout on a developer version of Windows 8 as well, and the same problem exists.
Now here’s my question: Do you maybe have an explanation for that? I see your latest post is about keyboards, so that might fit in just nicely.
Best greetings,
Frank Grießhammer
Type Design & Font Production
Adobe Systems
It's funny how definitions shift and change over time.
Although the architecture of keyboards on Windows has been stable and unchanged since at least NT 3.1, and keyboards have been create-able via the DDK for almost as long, the creation of keyboard layouts was largely a rare operation by a few key players who provided their own runtime layers that ran atop Windows (such as Keyman), and the "competition" of those companies that did this work? Mostly hacks.
The number of DDK downloads did not increase much during this time -- people were mostly working atop the system.
And you could fit all of the people who really understood the innards of the input stack in one minivan. And you'd have room to hold the cooler containing the beer that the group would inevitably be drinking at some point. :-)
After the Microsoft Keyboard Layout Creator was released, all of that changed.
Nearly three quarters of a million downloads of two versions of MSKLC over almost the last decade and an essentially uncountable number of layouts later, the principal means of going beyond the new bar of what is possible in keyboards has been based on work talked about in this Blog.
There are people writing Python scripts to generate .KLC files, and Apple's Boot Camp ships layouts using the format too.
There is a small part of me that is sad that the ownership of MSKLC is elsewhere, because there's a part of me that would love to be creating new versions that fix bugs and add features. This Blog, for all of its virtues, is a rather unwieldy tool to act as the path to upgrading the collective understanding of what can be done with input.
But I have other work now that keeps me pretty busy that I also enjoy, and there are only so many hours in a day, so....
Anyway, on to the email from Frank.
It's a bug.
And it is a bug in MSKLC, or to be more precise in the kbdutool.exe command line tool that MSKLC delegates its actual layout creation to.
More to the point, it's a bug I introduced....
Suddenly the number of people I helped with this widely downloaded tool and widely read Blog seems a lot less impressive to me. Even though the usage scenario is uncommon....
For my penance, I will provide the workaround for Frank and the others running into this problem of supplementary characters in the AltGr and Shift+AltGr keyboard states.
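For readers wondering what makes these "5-digit" characters different: any code point above U+FFFF is a supplementary character, and in UTF-16 (which is what Windows uses for text) it takes two code units, a surrogate pair, rather than one. The sketch below only illustrates that arithmetic. The function name utf16_units is made up, and nothing here is meant as a description of what kbdutool.exe actually does or where exactly the bug lies.

```python
def utf16_units(code_point: int) -> list[int]:
    """Return the UTF-16 code unit(s) that encode one Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {code_point:#x}")
    if code_point <= 0xFFFF:
        return [code_point]          # fits in a single 16-bit unit
    offset = code_point - 0x10000    # 20 bits, split 10 and 10
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]


# U+1D11E (MUSICAL SYMBOL G CLEF) is an arbitrary supplementary example.
print([f"{unit:04X}" for unit in utf16_units(0x1D11E)])
```

So a key that is supposed to produce one such character really has to produce two 16-bit values.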
I'll warn you that it is kind of a pain in the butt and will require a bunch of the kind of work Cathy and John and I used to have to do creating keyboard text files years and years ago.
So it is a workaround, not a fix. As anyone who has ever had to author those text files by hand will readily attest.
Sorry about that....
Anyway, I'll provide the workaround tomorrow, so pop by. It will be worth your time, I promise....
I could not resist the Britney Spears reference, sorry!
But there is yet another Language Interface Pack, this one for Nepali!
A tiny little bit of info about the Nepali language:
Nepali (sometimes also referred to as "Nepalese") is the official language of Nepal where it is spoken by roughly half the population as a mother tongue and by about 2 million people as a second language. It is related to Hindi but has borrowed less words from Persian and English (instead using more Sanskrit derivations) and it has been influenced by the neighboring Tibeto-Burman languages.
Very cool!
This post brought to you by "न" (U+0928, a.k.a. DEVANAGARI LETTER NA)
(computerized apologies to Ray Charles for the title of the post!)
Will anyone forget when I asked the question What do you get when you combine a base character with a buttload of diacritics?
I was of course talking about fonts there. This time I am going to take a slightly different approach, and talk about collation.
I will give the string, the code points, and the sort key. We'll start simply, with one letter:
e    U+0065    0e 21 01 01 01 01 00
Now we will go with something a little more complicated (the difference from above marked in RED):
ẽ    U+1ebd    0e 21 01 19 01 01 01 00
or its alter ego in normalization form D:
ẽ
Hmmm... let's look at another diacritic:
ê
ê    U+0065 U+0302    0e 21 01 12 01 01 01 00
Ok, and now for the kicker:
ễ    U+1ec5    0e 21 01 29 01 01 01 00
ễ    U+0065 U+0302 U+0303    0e 21 01 29 01 01 01 00
But wait -- where did the 29 come from? I mean the first one had no DW (diacritic weight), and the next two had 19 and 12, respectively.
I had talked in previous posts about sort keys about how the minimal weight is 2, but that this weight would only be seen when it was needed as a placeholder, e.g. in the following string:
eễ    U+0065 U+1ec5    0e 21 0e 21 01 02 29 01 01 01 00
So, if you take that (sometimes invisible) 2 that is always there on the 'e' and combine it with the 17 on the tilde and the 10 on the circumflex (all of these values are hexadecimal, as in the sort keys above), you get 29.
Easy.
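To make the sort keys above a little easier to read, here is a small sketch that splits one into its fields. It assumes, based only on the examples in this post, that 01 acts as a separator between weight levels, that 00 terminates the key, and that the second field holds the diacritic weights. The function name split_sort_key is hypothetical.

```python
def split_sort_key(hex_bytes: str) -> list[list[int]]:
    """Split a sort key (hex text such as '0e 21 01 19 01 01 01 00') into fields."""
    data = bytes.fromhex(hex_bytes)
    if not data or data[-1] != 0x00:
        raise ValueError("expected the sort key to end with 00")
    fields, current = [], []
    for value in data[:-1]:
        if value == 0x01:            # assumed separator between weight levels
            fields.append(current)
            current = []
        else:
            current.append(value)
    fields.append(current)
    return fields


# Second field of the two-character example: the placeholder 02, then 29.
print([f"{v:02x}" for v in split_sort_key("0e 21 0e 21 01 02 29 01 01 01 00")[1]])

# The arithmetic from the text, in hex: 02 + 17 + 10 = 29.
print(hex(0x02 + 0x17 + 0x10))
```

Seen this way, the two-character string has two entries in the diacritic field: the placeholder 02 for the bare e, and 29 for the e that carries both marks.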
Now what happens when you get that buttload of diacritics? Let's add them one at a time:
U+00650e 21 01 01 01 01 00
U+0065 U+03000e 21 01 0f 01 01 01 00
U+0065 U+0300 U+03010e 21 01 1b 01 01 01 00
U+0065 U+0300 U+0301 U+03020e 21 01 2b 01 01 01 00
U+0065 U+0300 U+0301 U+0302 U+03030e 21 01 42 01 01 01 00
U+0065 U+0300 U+0301 U+0302 U+0303 U+03040e 21 01 57 01 01 01 00
U+0065 U+0300 U+0301 U+0302 U+0303 U+0304 U+03050e 21 01 95 01 01 01 00
U+0065 U+0300 U+0301 U+0302 U+0303 U+0304 U+0305 U+03060e 21 01 a8 01 01 01 00
U+0065 U+0300 U+0301 U+0302 U+0303 U+0304 U+0305 U+0306 U+03070e 21 01 b6 01 01 01 00
U+0065 U+0300 U+0301 U+0302 U+0303 U+0304 U+0305 U+0306 U+0307 U+03080e 21 01 c7 01 01 01 00
U+0065 U+0300 U+0301 U+0302 U+0303 U+0304 U+0305 U+0306 U+0307 U+0308 U+03090e 21 01 06 01 01 01 00
U+0065 U+0300 U+0301 U+0302 U+0303 U+0304 U+0305 U+0306 U+0307 U+0308 U+0309 U+030a0e 21 01 1e 01 01 01 00
U+0065 U+0300 U+0301 U+0302 U+0303 U+0304 U+0305 U+0306 U+0307 U+0308 U+0309 U+030a U+030b0e 21 01 39 01 01 01 00
U+0065 U+0300 U+0301 U+0302 U+0303 U+0304 U+0305 U+0306 U+0307 U+0308 U+0309 U+030a U+030b U+030c0e 21 01 4b 01 01 01 00
Uh oh! Eventually we wrap....
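The wrap can be reproduced with a little arithmetic. What follows is only a sketch in Python, not anything NLS actually runs: the per-diacritic weights are not from an official table but are inferred from the differences between successive sort keys in the list above, and the helper name `combined_weight` is made up for illustration.

```python
# Per-diacritic weights inferred from the differences between successive
# sort keys listed above (an assumption, not an official NLS weight table).
INFERRED_WEIGHTS = {
    0x0300: 0x0D,  # grave
    0x0301: 0x0C,  # acute
    0x0302: 0x10,  # circumflex
    0x0303: 0x17,  # tilde
    0x0304: 0x15,  # macron
    0x0305: 0x3E,  # overline
    0x0306: 0x13,  # breve
    0x0307: 0x0E,  # dot above
    0x0308: 0x11,  # diaeresis
    0x0309: 0x3F,  # hook above
    0x030A: 0x18,  # ring above
    0x030B: 0x1B,  # double acute
    0x030C: 0x12,  # caron
}

BASE_WEIGHT = 0x02  # the (sometimes invisible) minimal weight on the 'e'


def combined_weight(diacritics):
    """Add the weights into a single byte, wrapping at 256 (hypothetical helper)."""
    total = BASE_WEIGHT
    for cp in diacritics:
        total = (total + INFERRED_WEIGHTS[cp]) & 0xFF  # only one byte of room
    return total


marks = list(range(0x0300, 0x030D))
for n in range(1, len(marks) + 1):
    print(f"{n:2d} diacritics -> {combined_weight(marks[:n]):02x}")
```

With nine diacritics the running total is c7; adding the tenth pushes it to 0x106, which no longer fits in a byte, so the key shows 06 instead.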
We only have one byte of space to store that diacritic weight (any more than a byte would run into the next character's byte), and when we ran out of room there were really only three choices:
The problem with #2 is that it pretty sharply limits what one could do in a potentially unpredictable way, and the problem with #1 is that all such strings would be equal. Now with option #3 there is a good chance that there will be a difference between strings being compared, though it will sometimes unfortunately make a string that is clearly greater than another appear to be less than it -- a cure that may be worse than the disease....
Well, I won't argue whether one of the other choices might have been better; we are kind of stuck with it now (there are technically a few cases that wrap that are less theoretical than the case above, lest you try to dismiss the example as being a bit too unrealistic!).
But at least that answers the question about what happens when you try to collate a buttload of diacritics....
This post brought to you by "e" (U+0065, a.k.a. LATIN SMALL LETTER E)
In Fall 2004, Cathy Wissink and I were in San Jose at the Unicode Technical Committee meeting (being held at Apple) along with 20+ of our colleagues from various companies involved with internationalization. We spoke at the IMUG (International Mac User's Group) meeting one evening, giving a much longer version of the talk that has been done before at both prior Internationalization and Unicode Conferences and at the Microsoft Global Development & Deployment Conference. Things were a little bit closer to shipping so more could be said, and since we were given more time we were definitely allowed to say more.
The title of the talk? Windows for the Rest of the World -- Customizing Windows for Emerging Markets. This post will contain a few slides of the content from that talk. :-)
One thing we talked about quite a bit was locales and how long it took to get them added. Some stats:
These numbers are only impressive when one ignores how many languages and cultures around the world are not being covered. We then pointed out the problem with the traditional methods we have been using to add NLS data:
The presentation got into detail about a lot of the things that we are doing to try to help here, some of which I have talked about before (like MSKLC), and others that I will likely cover in future posts. But for now I will talk about one of the many things GIFT is doing to help with the issues above: ELKs!
ELK stands for Enabling Language Kit. These useful beasts will (on a per locale basis) install as needed any or all of the following:
Obviously some (like locale information) always had to be done, but others (like fonts or shaping engines) were only required for a few.
Lest you are afraid at this point that ELKs are typical vaporware that is never actually shipped, Microsoft Windows XP Service Pack 2 ships with 25 new ELK locales! Those locales are:
(I swear that this list was even more impressive when it was done with PowerPoint animations, showing up one item at a time!)
Definitely not vaporware -- you can install XP SP2 and see support for all of these locales today. And things will continue on in the future!
And like I said, there are a lot of other items discussed in the presentation, which will be covered in future posts. It's all about getting out of the way...
This post sponsored by "ᕣ" (U+1563, CANADIAN SYLLABICS N-CREE THII)
The Unicode List is up to its old fun and games again (well, actually it's the participants, not the list itself), and this time it is not about the Unicode BOM.
I talked a little about this problem when I was saying International Domain Names? The sign on the door says 'Gone Phishing'....
Then some people started really getting into it because a bunch of hackers "found" a homograph spoofing issue. They even registered an evil URL (www.pаypal.com -- the first "a" is U+0430, a CYRILLIC SMALL LETTER A) which in browsers that support the new IDN/punycode stuff becomes www.xn--pypal-4ve.com.
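For anyone who wants to see the conversion for themselves, here is a minimal sketch in Python. It assumes Python's built-in "idna" codec (which follows the original 2003-era IDNA rules) is a fair stand-in for what those browsers do; the helper name `to_ascii` is made up for illustration.

```python
import unicodedata

spoof = "www.p\u0430ypal.com"  # the first "a" is U+0430, not U+0061
real = "www.paypal.com"


def to_ascii(host):
    """Convert a host name to its ASCII (punycode) form, label by label."""
    return host.encode("idna").decode("ascii")


print(to_ascii(spoof))
print(to_ascii(real))

# Naming each letter of the suspicious label makes the odd one out obvious.
for ch in "p\u0430ypal":
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

The all-ASCII name passes through untouched, while the spoofed one picks up the xn-- prefix on the label that contains the Cyrillic letter.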
Then those folks at the Unicode List weighed in (in a thread with 116 posts the last time I looked)....
The "solution" that many people have touted involves a list of common cross-script items that might be expected (like Kana and Kanji), and then showing the actual punycode names, since that way people could tell they were being spoofed.
Anyone else see the flaw here?
The feature is for international domain names. If it were just ASCII then a confusing string would indeed warn users that bad things were going to happen. But if we were all using ASCII we wouldn't need IDN in the first place, now would we?
Doesn't it make the whole feature suck just a little bit for its target users if they are left seeing weird crap every time they go to a site that uses their native language for the URL?
I almost weighed into the thread to point out the obvious problems in approach but I did not want to add to the noise (and most likely be drowned out by the people who point out that there is no way to make it secure and how IDN will bring down the internet). So I did not become post #117.
Oops, a few more while I was typing this, mine would have been #120. Sometimes in this post-Kitty Genovese era in which we all live, it is better to not get involved....
This post brought to you by "а" (U+0430, a.k.a. CYRILLIC SMALL LETTER A)
A letter that is feeling quite popular these days and which would like to point out that this site is not ВӀоgs.Мsdn.соm/miсhкар no matter what the URL looks like...
Jonathan Payne asked if I had an international thought about the palindrome pseudo interview question at this site:
http://channel9.msdn.com/ShowPost.aspx?PostID=19171
I did. :-)
Using the new StringInfo stuff in Whidbey Beta 2:
using System.Globalization;

bool IsPalindrome(string st) {
    StringInfo si = new StringInfo(st);
    int count = si.LengthInTextElements;

    if (count == 0) return false;

    // Compare text elements pairwise, working inward from both ends.
    for (int i = 0; i < (count / 2); i++) {
        string st1 = si.SubstringByTextElements(i, 1);
        string st2 = si.SubstringByTextElements(count - i - 1, 1);

        if (CultureInfo.CurrentCulture.CompareInfo.Compare(st1, st2) != 0) {
            return false;
        }
    }

    return true;
}
Quickest way to handle all those cool issues like cultural sensitivity and combining characters and supplementary characters and such!
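For comparison, here is a rough sketch of the same idea in Python, which has no StringInfo. It makes two simplifying assumptions that should be read as such: a "text element" is approximated as a base character plus any following characters with a non-zero combining class (not the full grapheme cluster rules), and NFC normalization stands in for the culture-sensitive comparison. The function names are made up for illustration.

```python
import unicodedata


def text_elements(s):
    """Split s into base characters plus any following combining marks.

    A rough stand-in for StringInfo text elements; it does not implement
    the full grapheme cluster rules.
    """
    elements = []
    for ch in s:
        if elements and unicodedata.combining(ch) != 0:
            elements[-1] += ch  # attach the mark to the preceding base
        else:
            elements.append(ch)
    return elements


def is_palindrome(s):
    # Normalize so precomposed and decomposed forms compare equal.
    elements = text_elements(unicodedata.normalize("NFC", s))
    if not elements:
        return False  # mirrors the empty-string choice in the C# version
    return elements == elements[::-1]


print(is_palindrome("abba"))
print(is_palindrome("q\u0308aq\u0308"))           # q + combining diaeresis
print(is_palindrome("\U00010400a\U00010400"))     # supplementary characters
```

Python strings are sequences of code points, so supplementary characters need no special handling here; the combining marks are the part that a naive reversal gets wrong.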
We do get our fair share of silly questions here in NLS.
I should perhaps explain what I mean by silly. :-)
I don't think I'd ever consider a question where somebody is asking about language and how it might work in a certain situation and call that silly. I mean, that's how people learn. It's the kinds of questions that I ask of native speakers and of linguists, and even if they smile or laugh I never get the sense that they are thinking me silly for the question.
But today, somebody who is thinking about 64-bit Windows and who assumed that one day strings that are greater than 2 GB would be common looked at our signature for CompareString:
int CompareString(
    LCID Locale,
    DWORD dwCmpFlags,
    LPCTSTR lpString1,
    int cchCount1,
    LPCTSTR lpString2,
    int cchCount2
);
and suggested that perhaps those int parameters containing the string lengths ought to be size_t instead.
Now I would like to forget about the argument that this is a public API that has been around since NT 3.1. It's obviously important here, and makes the suggestion a little bit silly, but not everyone really pays attention to what's in the NLS API or how long it's been there.
I'd also like to forget about the argument that 2 GB strings are uncommon, because one day they may not be. Especially in the 64-bit world. There may be a perfectly valid reason to have huge strings.
The real problem I have here, and what makes the question silly to me, is the notion that you need to do linguistic comparisons on strings that are greater than 2 GB in size.
There is simply no way to justify this as a reasonable use of the collation functionality in the NLS API.
Perhaps some of you may disagree with this notion, and I'll be curious how people respond to this post. If you are somebody who disagrees, please be sure to include information about your "reasonable example" so that people have a chance to appropriately judge the judgment being used. :-)
This post brought to you by "§" (U+00A7, a.k.a. SECTION SIGN)
The chcp.com utility is a simple little program sitting in the \WINDOWS\SYSTEM32 subdirectory. Running it with /? will give some helpful information about its purpose:
C:\WINDOWS\system32>chcp /?
Displays or sets the active code page number.

CHCP [nnn]

  nnn   Specifies a code page number.

Type CHCP without a parameter to display the active code page number.
There is also more information in the Windows XP documentation, which does hint at a problem in its small list of "supported" code pages:
Code page   Country/region or language
437         United States
850         Multilingual (Latin I)
852         Slavic (Latin II)
855         Cyrillic (Russian)
857         Turkish
860         Portuguese
861         Icelandic
863         Canadian-French
865         Nordic
866         Russian
869         Modern Greek
None of the ACP values are there, though this is I think a bit of social engineering -- to keep people thinking of it as the OEM code page. The 125x series code pages also work well here.
However, another set that is missing from the list is the ideographic code pages. You cannot use chcp to change to one of the ideographic code pages unless it is also the default system OEM code page.
Thus on a system whose default system locale is 0x0409 (English, United States):
C:\WINDOWS\system32>chcp 932
Invalid code page

C:\WINDOWS\system32>chcp 936
Invalid code page

C:\WINDOWS\system32>chcp 949
Invalid code page

C:\WINDOWS\system32>chcp 950
Invalid code page
This is a known and expected limitation for which there is no workaround....
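As a footnote to the OEM versus ACP distinction above, here is a small Python sketch of why it matters which code page is active. The codec names (cp437, cp850, cp1252) are Python's names for these code pages, used here as an assumed stand-in for what the console would do with the same bytes; nothing below calls chcp itself.

```python
text = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

# The same letter lands on different byte values in OEM and ANSI code pages.
for codec in ("cp437", "cp850", "cp1252"):
    print(codec, text.encode(codec).hex())

# And the same byte reads differently depending on which code page is active.
raw = bytes([0x82])
for codec in ("cp437", "cp850", "cp1252"):
    print(codec, repr(raw.decode(codec)))
```

In the two OEM code pages the letter is byte 82, while in 1252 it is e9 and byte 82 is a low quotation mark instead.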
This post brought to you by "Ā" (U+0100, a.k.a. LATIN CAPITAL LETTER A WITH MACRON)
I bought a Zune the other day.
This is something of a departure for me, as I have not bought a portable music player smaller than a Tablet or laptop since the Walkman I picked up over two decades ago. And generally speaking I don't go for the latest entertainment devices, something that has been true since I bought an Atari 5200 around the time I bought that Walkman.
But I figured what the hell, I'd go to the MS Company Store and pick one up. :-)
I bought a white one, which is (according to Carolyn) the least cool of the three colors. But I never claimed to be cool, so that was no problem.
So I set up my new cool Zune.
I found that I was quite happy with the sound quality and especially happy that I was able to sync the over 25 GB of music from my Latitude D820 and still have enough room for all eight episodes of Love Monkey in WMV format.
(that would be Paul Bryan and Aimee Mann from the episode entitled The One That Got Away)
But I admit I was a little bit less happy when I scrolled through those many gigabytes of music.
As you can imagine, I have lots of music on my machine that is imported from other countries, and lots of that music is not in English.
It was very cool to see it in the Zune app on my laptop:
It even included my humorous playlists!
On the other hand, it was decidedly doubleplusuncool to see what it looked like on the device:
Hmmmm. Well, I guess I have to install some fonts. I better go find out how....
But I found article 928210 in the MS Knowledge Base (Boxes (□) or other characters appear instead of letters in the name of a song or an album when you browse for music on your Zune device) which confirms my diagnosis with some decidedly un-international text:
CAUSE
This behavior occurs when the name of the media contains characters that are not part of the United States English (en-us) font that is installed on the Zune device.
followed by a suggested seven step workaround, step #6 of which is:
Modify the album name or the song name so that it contains U.S. characters to help you identify it.
Hmmm.
I asked around and nobody had any advice on installing some more fonts. :-(
What are we, republicans, that we assume that US text == English text? Grrr.
Well, I am not going to return my Zune. But the behavior as well as the text of the KB article (which does not even put in the text that this is a bug) are bad enough that the Zune makes it to the Unicode Lame List of What's [Internationally] Weak this week....
Microsoft Typography to the bridge! Chekov, fire an MS UI Gothic torpedo at the Zune to port!
Update 3:14 PM: In the final insult, this somewhat obnoxious KB article has translations into Japanese, Simplified Chinese, Traditional Chinese, French, German, Italian, and Spanish, all of which screw up the NULL glyph.
This post brought to you by バ (U+30d0, a.k.a. KATAKANA LETTER BA)
I am reminded of a scene from the 1991 film The Doctor starring William Hurt, modified here to be a bit more linguistic than medical:
Linguist: Nancy, are my repeated vowels pronounced differently?
Nancy: No, doctor.
Linguist: That's funny, I always trema when you're near.
There are essentially two¹ different traditions for the meaning of two dots on top of a vowel:
Umlaut - Described in Wikipedia as a "...modification of a vowel which causes it to be pronounced more similarly to a vowel or semivowel in a following syllable."
Trema or Diaeresis - Described in Wikipedia as the "...division of two adjacent vowels as two syllables rather than as a diphthong."
Now in Unicode and ISO 10646, these two very different diacritical purposes are unified under a single character -- the diaeresis. Which is kind of ironic given that the meaning of 'diaeresis' tends to suggest a division rather than any sort of unification....
Ignoring that bit of irony in the naming decision, a unification does make sense since they really do look pretty much the same, and a disunification would be a huge target for spoofing (something we really do not need any more of, frankly!). Though to tell the truth, in quality typography the umlaut dots are usually a bit closer to the letter than the trema dots.
Back in 1993, Deutsches Institut für Normung (DIN) sent a proposal to WG2 that stated:
Currently, a substantial amount of existing German data distinguishes between Umlaut and Trema. Both diacritics have a similar, but not necessarily identical representation, both have quite different properties e. g. with regards to sorting (cf. DIN 5007).

In particular, German library data is currently stored according to ISO 5426 "Extension of the Latin alphabet coded character set for bibliographic information interchange" which distinguishes between the two diacritics Umlaut (4/9) and Trema (4/8). However, in ISO/IEC JTC1/SC2 N3125 "Finalized Mapping between Characters of ISO 5426 and ISO/IEC 10646-1 (UCS)" both are mapped to the same UCS character, U0308. There is thus no standardized way to ensure roundtrip compatibility between the two standards. For Germany and in particular for its national library (Deutsche Bibliothek) it is imperative for the integrity of German data that it be possible to maintain the distinction between Umlaut and Trema also in the UCS in a standardized way. Lack of ability to do so affects millions of bibliographic data records in the Deutsche Bibliothek alone (to be exact, 14 956 289 records as of October 2002) and about 110 million bibliographic data records in German and Austrian regional library networks.
In other words, they had a need to distinguish these two diacritics, which are actually not unified in a different ISO standard. Their initial proposal from document N2593:
We therefore request

a) the encoding of two new characters, LATIN VARIATION SELECTOR UMLAUT in position U0241 and LATIN VARIATION SELECTOR TREMA in position U0240 (the positions are suggestions only).

b) the insertion of the following text into informative Annex F "Alternate format characters" as F.2.6 "Latin selectors"

"LATIN VARIATION SELECTOR UMLAUT (U0241): Uniquely identifies the preceding character as using / being the Umlaut diacritic (cf. ISO 5426, code position 4/9)

LATIN VARIATION SELECTOR TREMA (U0240): Uniquely identifies the preceding character as using / being the Trema diacritic (cf. ISO 5426, code position 4/8)

In the absence of any variation selector, neither the character COMBINING DIAERESIS U0308 nor any of the Latin letters with diaeresis can be interpreted as representing uniquely the Umlaut or uniquely the Trema.

The LATIN VARIATION SELECTOR UMLAUT or the LATIN VARIATION SELECTOR TREMA should only be used directly following the Latin characters shown below:

00C4 LATIN CAPITAL LETTER A WITH DIAERESIS
00D6 LATIN CAPITAL LETTER O WITH DIAERESIS
00DC LATIN CAPITAL LETTER U WITH DIAERESIS
00E4 LATIN SMALL LETTER A WITH DIAERESIS
00F6 LATIN SMALL LETTER O WITH DIAERESIS
00FC LATIN SMALL LETTER U WITH DIAERESIS
U0308 COMBINING DIAERESIS

Neither the LATIN VARIATION SELECTOR UMLAUT nor the LATIN VARIATION SELECTOR TREMA carry a defined meaning when they follow any other character."

c) Change in ISO/IEC JTC1 SC2 N3125 (= ISO/TC46/SC4 WG1), section 3 "Mapping of Characters" the table to:

4/8   Trema, Diaeresis   0308 0240
4/9   Umlaut             0308 0241
Unfortunately, Variation Selectors can only be used on base characters, not on combining characters. So while the scenario is valid, the DIN suggested solution is not. The UTC discussed possible solutions at length before producing the following recommendation instead:
While recognizing the drawbacks to all of the alternatives to encoding a new COMBINING UMLAUT character outlined in WG2 N2766, we believe that there is a workable alternative solution which has, to date, been overlooked. The solution consists, essentially, of using U+034F COMBINING GRAPHEME JOINER (CGJ), in its intended semantics in 10646/Unicode, to make the relevant sorting, searching, and data mapping distinctions required for umlaut versus
It is again ironic that in the (rare) situation where the two need to be distinguished, the default case is suggested as being the umlaut while the exceptional case is the diaeresis. :-)
The use of a combining diacritic is still (to this day) controversial among people unfamiliar with the standard who are native speakers of languages like Swedish or Finnish and who are asked to think of these standalone letters as equivalent to a different letter plus a diacritic. The many people who would prefer all of the Indic languages to separately encode all the instances of base letter plus virama have an analogous complaint.
1 - At this point I will take judicial notice of the phenomenon known as the Heavy Metal Umlaut, described ad nauseam here. It is in fact the Heavy Metal Umlaut that inspired Cathy's desire for a bumper sticker that would say Stop Indiscriminate Umlauting!, although I find that approach to be a tad reactionary. The importance to our culture of Spin̈al Tap and Blue Öyster Cult is undeniable, as is the need to avoid fear of the reaper and to turn the volume up to 11.....
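To see why CGJ can carry that distinction, here is a small Python sketch. The recommendation text quoted above is cut off, so the exact sequence is an assumption on my part: plain U+0308 is taken as the default (umlaut) and base + CGJ + U+0308 as the marked, exceptional case. What the sketch does show for certain is that CGJ has combining class 0 and therefore keeps normalization from folding the marked sequence into the precomposed letter.

```python
import unicodedata

CGJ = "\u034f"        # COMBINING GRAPHEME JOINER
DIAERESIS = "\u0308"  # COMBINING DIAERESIS

umlaut = "a" + DIAERESIS       # default reading
trema = "a" + CGJ + DIAERESIS  # assumed marking for the exceptional case

nfc_umlaut = unicodedata.normalize("NFC", umlaut)
nfc_trema = unicodedata.normalize("NFC", trema)

# The unmarked sequence composes to the precomposed letter...
print([f"U+{ord(c):04X}" for c in nfc_umlaut])
# ...while the CGJ blocks composition, so the distinction survives.
print([f"U+{ord(c):04X}" for c in nfc_trema])
print(unicodedata.combining(CGJ))
```

Because the marked sequence survives normalization intact, a sorting or mapping table could give it different treatment from the plain letter, which is the kind of distinction DIN was asking for.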
This post brought to you by ̈ (U+0308, a.k.a. COMBINING DIAERESIS)