Postings are provided as is with no warranties, and confer no rights. Opinions expressed here are my own delusions; my employers at best shake their heads and sigh, at worst repudiate the content with extreme prejudice, whenever it manages to appear on their radar.
This blog is unsuitable for overly sensitive persons with low self-esteem and/or no sense of humour. Proceed at your own risk. Use as directed. Do not spray directly into eyes. Caution: filling may be hot. Do not give to children under 60 years of age. Not labeled for individual sale. Do not read 'natas teews ym' backwards. Objects in mirror are closer than they appear. Chew before swallowing. Do not bend, fold, spindle or mutilate. Do not take orally unless directed by a physician. Remove baby before folding stroller. Not for use on unexplained calf pain.
A nice FLAIR (FLuid Attenuated Inversion Recovery) view from the not-too-distant past. Every abnormality you can see on this scan (and there is more than one!) is asymptomatic at present. Alongside is a picture of me walking the walls at Fremont Studios, a sign of a damaged brain.
Previous posts in this series:
This time, I will just be talking quickly about the changes in Vista. Quick, because not very much has changed....
One thing that has not changed is that dialog for adding fonts that I talked about back in Part 2 of the series. Sorry folks, I know people have been wanting this one to go away. It won't be going away for Vista, though.
Another thing that has not changed much is the typical way people install and remove fonts -- dragging them in and out of the Fonts folder. However, since Administrative permissions are still required to install fonts into the Fonts folder, the addition of the UAC feature to Vista will change the experience for some people. I mean, even an Admin is not really an Admin anymore unless they okay the elevation.
Which gets us to something that has changed -- copying files to the Fonts folder and then opening the folder in an Explorer window, one of the weirdest ways to install a font programmatically that I could ever imagine, will no longer work in Vista. As a feature, it never worked all that well anyway. Hopefully people won't miss it too much; if you do, I'd love to know what you were doing with it....
Perhaps one of the biggest changes for fonts in Vista is that you no longer need to specially install other language fonts via checkboxes in Regional and Language Options. All languages are installed automatically, which is a wonderful thing for almost everybody (though there is a small group of people who are unhappy with the huge font list). I look forward to an update to the ChooseFont dialog in a future version that manages the huge font list a little bit better.
Otherwise, it is business as usual for the Fonts folder, in Vista.
I'll be talking about Unicode version support of fonts in a future post....
This post brought to you by F (U+0046, a.k.a. LATIN CAPITAL LETTER F)
The recent post about Are ligatures supposed to be thought of as 'single characters'? had a comment from RubenP that I thought could use some further conversation:
It must be said, but all the ClearType fonts with automatic fi ligatures look exceptionally bad for the sequence 'fij'; if you remember, the ij is quite frequent in Dutch, so that's a little troublesome. (To me at least ;-) But then again, the few fonts that contain a combining acute accent, hardly ever actually combine it with the j, and if they do, the accent is markably different from the accent on the (pre composed) i. Adding acutes to ij is actually something you'd want in Dutch (the acute is an emphasis mark and ij is a vowel; well a diphtong actually). But because of the very poor support for this kind of thing, even the official rule has become i acute + j, rather than i acute + j acute. Oh, and how does one stop these ligatures from happening? For example, in Turkish? IIRC the fi ligature is a big no-no in Turkish typography, because you cannot distinguish it from f + dotless i. With such silly things, I guess non-American digital typography still has a long way to go...
It is a fair point. What is often hinted at (like in Bill Hill's first post on fontblog) is that the two languages that got the most research and attention when it comes to ClearType and the many ClearType fonts are English and Japanese. And there is no shortcut to skip that research step....
It becomes obvious, when one considers the needs of languages like Dutch and Turkish such as those that RubenP pointed out, that not all of the Western Latin script languages were truly having their individual needs considered when the development of some of the so-called "C* fonts" took place.
The needs here are indeed sometimes script-specific but more often language-specific. And it is way too easy (when adding features that might be thought to look good for one language) to unintentionally screw over another language. Not to screw it over too much, mind you. Just to screw it over about the usual amount, if you know what I mean.
It's not like you can change these defaults later -- imagine what it would do to page flow and formatting in documents if such a global change were made -- a backcompat nightmare, to say the least!
Perhaps, in retrospect, a more generic approach to these kinds of issues like the fi ligature could have been done in the C* fonts. After all, this is a lesson we already learned in Microsoft Sans Serif and Tahoma. But typeface design at its best is a much more organic process than trying to imitate another font. So in the end if a particular feature is on by default in a font and that feature is not so good for your language, then perhaps using logic to come to the conclusion that this is not the best font for the language in question is in order? :-)
So while it is true that many people are excited about the optional language features in OpenType and the exciting readability of ClearType, I find myself much more excited about the next ten years -- when the work that has happened here can be further tuned to cater to the needs of even more languages than the ones for which ClearType is optimized now. And when the ability to work with optional OpenType features is available in products like Microsoft Word and Publisher. When the promises delivered upon in technologies in Vista and Office 2007 are extended to cover so much more of the world....
In the meantime, my Visual Studio font is either Consolas or Courier New, depending on how much "Terminal Services to XP" work I have to do (since "ClearType over TS to an XP box" is not really quite there just yet!).
Makes for an exciting future, in any case. :-)
This post brought to you by fi and ij (U+fb01 and U+0133, a.k.a. LATIN SMALL LIGATURE FI and LATIN SMALL LIGATURE IJ)
I talked about some of this once before in CompareString ignores case by lowercasing...., as you may recall.
There are some people who think of ignoring case as required to keep things from sorting in the 'ASCII order' -- by which I mean
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
But this really is not true -- and (for example) file names will never sort that way in Explorer.
The difference is between
AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz
and
AabBcCDdeEfFgGHhIijJkKlLMmNnOopPqQRrSstTuUvVWwXxyYzZ
!
In other words, when it comes to comparisons and ignoring case, we are talking about IGNORING case and giving no deterministic order between items that differ only in case.
Other people (for example the Shell) just sort of implicitly assume that the way that they order files should use NORM_IGNORECASE since NTFS, FAT, and FAT32 all ignore case.
Though I am forced to point again to this article to hopefully help people to realize that the argument is flawed. Or perhaps a moment to think about the difference between 'Which comes first?' vs. 'Are they equal?' will make it clearer.
The bottom line is that if one chooses to make the answer to an IDENTITY question a case-insensitive one, that in no way means that the question of ORDERING of items also needs to be made case insensitive. In fact, if they are being done as two separate operations then it is likely better if they are not done the same way at all.
So why not do two separate operations in two separate ways? :-)
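To make the split concrete, here is a minimal sketch in Python rather than against CompareString itself. The names `same_name` and `ordering_key` are made up for the illustration, and `casefold()` is only a rough stand-in for a case-insensitive comparison:

```python
def same_name(a: str, b: str) -> bool:
    """IDENTITY question: are the two names equal, ignoring case?"""
    return a.casefold() == b.casefold()


def ordering_key(name: str) -> tuple:
    """ORDERING question: letters first, exact spelling as the tie-breaker."""
    return (name.casefold(), name)


names = ["readme.TXT", "Readme.txt", "README.txt", "notes.txt"]

# Identity stays case-insensitive.
print(same_name("readme.TXT", "README.txt"))

# Ordering with a tie-breaker: the same result whatever the input order was.
print(sorted(names, key=ordering_key))
print(sorted(reversed(names), key=ordering_key))

# Ordering with case ignored entirely: items that differ only in case keep
# whatever order they arrived in, so the two results differ.
print(sorted(names, key=str.casefold))
print(sorted(reversed(names), key=str.casefold))
```

The last two lines are the non-determinism in action: same items, same sort, two different answers.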
Others argue that the performance is improved if there are fewer things to look for, though in practice this does not tend to be an issue. The reason for this is that case is never an issue unless two elements have identical primary weights (which is unusual in most situations), and if that happens then the "case winner" is stored, and subsequent case differences are ignored. Just as if you had passed the flag anyway.
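A rough sketch of that idea, in Python. This is not the actual CompareString algorithm: `casefold()` stands in for primary weights, the character-by-character walk only makes sense for simple strings, and putting uppercase first on a tie is an arbitrary choice made for this example only.

```python
from functools import cmp_to_key


def compare(a: str, b: str, ignore_case: bool = False) -> int:
    """Return -1, 0 or 1. Letters decide first; case only breaks ties."""
    case_winner = 0  # first case difference seen, remembered for the end
    for x, y in zip(a, b):
        px, py = x.casefold(), y.casefold()  # stand-in for primary weights
        if px != py:
            return -1 if px < py else 1  # a primary difference decides at once
        if case_winner == 0 and x != y:
            # Assumption for this sketch: uppercase wins the tie.
            case_winner = -1 if x.isupper() else 1
    if len(a) != len(b):
        return -1 if len(a) < len(b) else 1
    # Only now, with all primary weights equal, does case get a say.
    return 0 if ignore_case else case_winner


print(sorted(["b", "A", "a", "B"], key=cmp_to_key(compare)))
```

The work per character is nearly the same with or without the flag; the flag only changes what happens to the stored winner at the very end.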
The bottom line is that people actually do kind of like determinism, a certain way that patterns will behave in a consistent manner. And if you don't ignore case, things can be ordered a bit more deterministically!
Some of the wittier readers might think of the other ignore flags like NORM_IGNOREKANA and NORM_IGNOREWIDTH and realize that the same issues apply to them as well. But NORM_IGNORECASE is kind of special due to the confusion about filesystems and their case insensitivity and such.
Now all of the above are referring to CompareString and its managed analogues. I'll talk about where sort keys fit into this another time....
This post brought to you by ﾘ (U+ff98, a.k.a. HALFWIDTH KATAKANA LETTER RI)
Okay, I admit it. When I pronounce the word italics, I say EYE-talics, not IH-talics. But I do say IH-talian, not EYE-talian when I see the word Italian.
I point this out because although she had never corrected me on this particular point even once before, or even ever hinted that the pronunciation was wrong, soon after she had some typography program managers reporting to her, Cathy pointed this out to me one day.
But in the end I think she was just enjoying correcting me; after all, both forms are acceptable in dictionaries for italics but not for Italian! I mean, the point of language is communication, and as long as people get the message neither pronunciation is really going to confuse anyone....
This post has little or nothing to do with that, but it is about italics. :-)
Well, actually it is about U+0453, a.k.a. CYRILLIC SMALL LETTER GJE.
It seems that depending on the font you choose, the letter italicizes differently. For example, in Tahoma it looks like this:
and with the new Segoe UI font it looks like this:
Boy, that Segoe UI one looks like it has a bug, doesn't it? I mean, who on earth would expect a character that looks more like:
(a lowercase r or small gamma than anything else) in any font, including Segoe UI, to look more like a reversed s just because it was italicized?
Turns out it is not a bug!
Simon Daniels talked with Steve Matteson of Ascender who had this to say:
the 'backwards s' is the preferred italic form for Russian lowercase Ghe but for Macedonian lowercase Gje it needs to stay the 'small gamma' shape. Sorry I don't know the specifics on why.
Looking up in Wikipedia's article about the Cyrillic script, it does say a bit about this:
In the absence of Roman and Italic traditions, Cyrillic type fonts are properly classified as upright (Russian: pryamoi shrift) and cursive (kursivnyi). Cursive or hand-written shapes of many letters, especially the lowercase letters, are entirely different from the upright shapes. As in Latin typography, a sans-serif face may have a mechanically-sloped oblique font (naklonnyi).In Bulgarian, Macedonian, and Serbian, some cursive letters are different from those used in other languages. These cursive letter shapes are often used in upright fonts as well, especially for road signs, inscriptions, posters and the like, less so in newspapers or books.
The article also links to another page that has a much fuller explanation, entitled Serbian Cyrillic Letters BE, GHE, DE, PE, TE. The page also talks a bit about the tradition of italics in typography, and the expectations here. A worthwhile read if you are interested in solutions here.
Now I will not go so far as to say that Tahoma and fonts that don't use this form are tailored for Bulgarian, Macedonian, or Serbian; in fact, I'll note that although Microsoft ships a 'Tahoma' and a 'Tahoma Bold' that we don't ship a 'Tahoma Italics'. Which kind of removes the easiest way to have an alternate form for the small Ghe, doesn't it? :-)
Between that and the fact that there is currently a Russian localization of Windows but not a Macedonian one, it sort of makes sense that the default form in the UI font of much of Vista and Office 2007 would follow the Russian glyph preference....
It does mean that the issue Chris Pirillo has pointed out here about the inconsistency of application of the new UI font (discussed previously here) might be a bit more worrying for Vista and Office 2007, since between MS Sans Serif, Tahoma, Segoe UI, and Microsoft Sans Serif, only Segoe UI is getting it right. This might make the Russian localization of Windows a bit more challenging with this inconsistency of font being used, huh?
Luckily the uppercase form (U+0403) does not have this difference, so at worst it will just look like we capitalize like morons in Russian in those places where the UI font is inconsistent? :-)
This post brought to you by ѓ (U+0453, a.k.a. Italicized CYRILLIC SMALL LETTER GJE)
(Nothing technical in this post, sorry!)
I swear that none of what I am about to talk about has been intentional. I am merely a victim of circumstance.
I have been taking Lipitor for a borderline cholesterol level which, when combined with my lack of discipline about diet, made folks in the medical establishment feel like I should perhaps try and be safe rather than sorry.
And I have been taking Copaxone daily for my MS for the last few years, mainly because although I preferred the once-a-week Avonex, I was one of the small number of people who suffered flu-like symptoms, and I was tired of being sick once a week. I used to hate the notion of 'shooting up' daily, but I decided to get over it and just pretend it was like I was actually shooting up something illicit -- so I could have all the fun of being a druggie without any of the downsides of a life of crime and poverty....
And since August 23rd I have been taking Novantrone, as I have mentioned in this blog before. And so far the echocardiogram is still looking good. So, Bob willing, I'll be on it for a couple more years.
Now I did not stop taking Copaxone during the time I have been taking Novantrone. I talked about it with my neurologist and at first she pointed out that if I was not tolerating the Novantrone that I'd just be back on the Copaxone anyway. And later I just never got around to stopping it, so I didn't.
I also have lots of friends who send me new articles every time they see something on the web about Multiple Sclerosis. It is almost always sensationalistic, mainly because the people reporting on these things don't understand them, and even if they did, the truth is never as sexy as they need it to be to get people interested. So I usually take what they send with a grain of salt.
But two news items in particular were interesting to me:
Lipitor-Copaxone Combo May Fight MS -- despite its upbeat nature and the fact that the positive results are with the animal model for MS, Experimental Autoimmune Encephalomyelitis (EAE) -- since MS cannot itself occur in mice -- and many EAE cures do not actually help with MS, it may well be good news.
Drug combo fuels hope for multiple sclerosis -- the positive results in this three-year open label Copaxone/Novantrone combination therapy are fairly exciting (and I look forward to the article that should be in the upcoming issue of Neurology), though once again one has to be careful not to look too positively at popular news reports.
It seems that I have unintentionally been involved with two interesting combination therapies? :-)
I'll probably talk more about the second one after I read the article in Neurology. It will be years before anybody comes up with anything on the first one, but I'll just suggest no changes in my drug regimen for now....
(Note: the title of this blog post has incorrect Greek text in it, to help highlight a bug that will be explained later in the post!)
I had no idea when I posted Sometimes, uppercasing sucks that I'd find so many people who were unaware of how much of this sort of 'natural language processing' wasn't happening in Windows or the .NET Framework.
There is actually a more generalized problem here though.
It is the same problem that happens when a developer makes sure a font size is a hard-coded 8pt and it has to show Chinese Han.
Or when a developer italicizes text and makes Arabic look like crap and Japanese look really ugly.
Or when a developer hard codes the location of tokens in a string so that the localizer trying to translate it to German cannot change the word order and is forced to write German text that looks ridiculous.
Or when a developer bolds all text and makes a Tibetan string that was on the thin edge of readability look like smudges on the monitor.
Or when a developer does not allow a dialog to be mirrored and messes up the Hebrew UI version.
And yes, it happens when a developer decides to uppercase a string because they believe it is required for some type of emphasis.
The bug here is NOT in Windows or the .NET Framework in these cases. It is not a platform problem at all. It is an application problem.
By taking important decisions like font size/weight/style, like case, like position and token order of strings out of the hands of the people who are most qualified to understand the requirements for acceptance in the market, the localizers, they are actually doing a crappy job on the localizability front -- preparing an application so that localization can go smoothly.
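The token-order case is the easiest one to sketch. Below is a minimal Python illustration; the `MESSAGES` table, the `render` helper and the sample strings are all hypothetical, and a real application would load its strings from resources that the localizers own:

```python
# Hypothetical message catalog: one template per market, with named tokens.
MESSAGES = {
    "en": "{user} deleted {count} files.",
    "de": "{count} Dateien wurden von {user} gelöscht.",
}


def render(locale: str, **tokens: object) -> str:
    """Fill the named tokens into whatever template the localizer wrote."""
    return MESSAGES[locale].format(**tokens)


def render_hardcoded(user: str, count: int) -> str:
    """The bug: token positions are baked into the code, not the string."""
    return user + " deleted " + str(count) + " files."


print(render("en", user="Anna", count=3))
print(render("de", user="Anna", count=3))
```

With named tokens the German string is free to put the count first; with `render_hardcoded` nobody outside the development team can change the order at all.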
Now I am sure one day having integrated linguistic services that can automatically apply all of the specific language rules for operations like case folding without manual per-language intervention would be a wonderful internationalization feature, one that will enrich the platform tremendously.
Just as I think it would be cool if all the rules about how best to handle text rendering in all languages in regard to font attributes/styles could be captured in some internationalized "give me the font for _____" type function that would look up all the rules and apply them per market. Which once again would enrich the platform markedly.
And just as I think it would be truly awesome if localization could happen automatically on the fly, with no need for anyone to do the actual translation work, and have the results look correct to native speakers of the language.
But frankly, we are nowhere near any of these goals, and until we are, localizers are the key ambassadors who present the application that people working in/with another language market can use and enjoy.
So if you are not writing localizable applications (and any of these problems including Greek or French or Dutch casing are included here), then for now it is your bug.
So please fix it.
(Note that fixing may be as simple as not making a string all caps or bolding it or italicizing it -- in other words, don't potentially destroy properly crafted text for a market in an effort to emphasize it)
As a side note, I'll point out that the Community Server skin that my blog uses, which forces titles to be ALL CAPS (and which I was only able to modify enough to make SMALL CAPS happen) is a great example of such a localizability bug, this one on Telligent Systems and Community Server, which would impact any Greek title names with diacritics in them.
The title in the editor is Ρύθμιση σήματος, and as you can see it capitalizes incorrectly according to recommended/preferred practice in Greek, coming out as ΡΎΘΜΙΣΗ ΣΉΜΑΤΟΣ rather than ΡΥΘΜΙΣΗ ΣΗΜΑΤΟΣ (with the diacritics gone).
This bug requires titles that are to be completely capitalized and is just the type of application bug that this post is talking about (though in the end it would be a browser bug, of course, but since the particular skin I am using gives no way to turn off the ALL CAPS behavior, it becomes a Community Server bug, too, given the limitations in CSS!)
Otherwise, to be honest, you should not bother localizing your application. It is way too difficult and expensive of a process to get it wrong.
In short, do it right, or don't freaking bother. :-)
This post brought to you by ಬ (U+0cac, KANNADA LETTER BA)
Warning to readers: this post is completely and totally my own opinions based on my efforts to assist with Tamil's representation in Unicode, and truly have nothing to do with Microsoft's opinions on the matter (whatever they are). If you quote anything from my words here as being 'According to Microsoft' then be aware that you are a complete moron whose only saving grace is that being a moron is a venial and not a mortal sin. You have been warned!
As I write this post, the lyrics of Roger Waters wash over me:
And if the cloud bursts, thunder in your ear You shout and no one seems to hear. And if the band you're in starts playing different tunes I'll see you on the dark side of the moon.
It is quite ironic that these words, (from the song Brain Damage, on Pink Floyd's Dark Side of the Moon), seem to so easily link to the insanity that I have seen from afar related to Tamil Unicode - New Encoding (TUNE).
You can see the introduction page for Tamil Virtual University's Request For Comments here.
What this standard amounts to is an attempted re-encoding of the Tamil script using Unicode's PUA (Private Use Area) in an attempt to make Tamil into a simple script (rather than a complex one), to build collation support directly into the order of the code points in the encoding, and to encourage ISVs like Adobe to support Tamil.
The fact that this ignores the rules in Unicode related to the re-encoding of scripts that already exist, the fact that collation is never designed as a part of the order of code points in any language (even English!), the fact that INFITT (the INternational Forum for Information Technology in Tamil) and its 'WG02' Unicode working group (of which I am a member along with several native Tamil speakers from around the world) is on record as disagreeing with the bulk of the claims and assertions made by TUNE supporters, the fact that the Unicode Technical Committee is on record as considering many of the fundamental aspects of TUNE to be entirely unsupportable -- all of these things are ignored.
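The "even English!" part is easy to check for yourself. A small Python sketch (the word list is just an example, and `casefold` is only a crude stand-in for real collation):

```python
words = ["apple", "Zebra", "banana"]

# Sorting by raw code point: every uppercase ASCII letter comes before
# every lowercase one, so "Zebra" lands first.
by_code_point = sorted(words)

# Comparing the letters first gives the order a reader would expect.
by_letters = sorted(words, key=str.casefold)

# The Tamil letter K as it is encoded in Unicode today:
# TAMIL LETTER KA followed by TAMIL SIGN VIRAMA, two code points.
tamil_k = "\u0b95\u0bcd"

print(by_code_point)
print(by_letters)
print(len(tamil_k))
```

If code point order is not good enough for sorting three English words, baking a sort order into an encoding is not going to settle it for Tamil either.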
After WG02 made its feelings clear on the matter, the TUNE supporters had their own working group created (WG08), and although I am officially the liaison between INFITT and Unicode, I have never been given any communication related to TUNE to present to the Unicode Consortium (I have been told this is due to my obvious bias against TUNE, though no one from WG08 has communicated to Unicode through other means, either).
So yes it is a request for comments, but one in which if the comments are negative, the commenter can expect little more than to be ignored, or dismissed due to bias.
So there are two kinds of people here -- those who agree with TUNE, and those who are wrong.... :-(
Tamil Nadu has had a similar approach to 8-bit standards, where they rejected the TSCII standard that was widely used outside of Tamil Nadu and instead formulated their own TAB/TAM standards. Historically their recent efforts in areas such as encodings and keyboards have not been as well received by members of the Tamil Diaspora as other orthographic changes in the language and script in the last 30-35 years.
Anyway, back to being out of TUNE. :-)
For those who are in Tamil Nadu:
"...TVU is organizing a one day conference for obtaining the public opinion and to deliberate on the comments received. The proposed one day conference will have an inaugural session, a session for open discussion in the forenoon. The conference will be held in the Clive Hall at Taj Coromondal Hotel., Nungambakkam during 9.30 a.m. on 2nd September 2006."
If any of my readers are in Tamil Nadu and would like to attend this one day conference, please let me know what happens (and if you contribute anything be sure not to mention you agree with anything I say, given my bias and all!). Given the step backwards that I truly believe this whole effort represents, I am truly hoping that those in Tamil Nadu and TVU who are championing the new standard can be finally convinced that they are out of TUNE....
The lunatic is in their head
The lunatic is in their head
they raise the blade, they make the change
They rearrange it; it's insane
they lock the door
and throw away the key
there's someone in their head but it's not me
And if bad standards thunder in their ear
We shout and TVU doesn't seem to hear.
And if the standard they're in starts implementing TUNE
We'll see them on the dark side of the moon.
This post brought to you by க் (U+0b95 U+0bcd, a.k.a. TAMIL LETTER KA + TAMIL SIGN VIRAMA, a.k.a. TAMIL KA puLLi, a.k.a. TAMIL LETTER K) (A letter that is separately encoded in TUNE, along with several hundred others)
What does the татарча (Tatar) language have in common with اردو, മലയാളം, Qhichwa Simi, فارسی, isiZulu, ಕನ್ನಡ, नेपाली, Afrikaans, कोंकणी, Setswana, বাংলা, తెలుగు, ਪੰਜਾਬੀ, and Lëtzebuergisch ?
That's easy.... it too has a Language Interface Pack, available for download right here!
Some background info on Tatar (courtesy of Soren):
Number of speakers: 6-7 million
Name in the language itself: татарча
The Tatar language is one of the two official languages of the republic of Tatarstan in the Russian Federation (Russian being the other one). Tatar is spoken there by around 5.7 million speakers; smaller communities of Tatar speakers can be found in neighboring regions like Bashqortostan, in southwestern Siberia and in central Asia and eastern Europe.
During the Soviet era, Tatar lost ground to Russian; it is estimated that in the last 30 years of the Soviet Union more than 8 percent of the population of Tatarstan switched from Tatar to Russian as their preferred language. The language of higher education as well as the mass media is still predominantly Russian, and in urban areas more Russian is heard. But the Tatar language is being promoted by an active language policy in the republic, and since the end of the 20th century there has been a renaissance of the language.
Tatar has a large number of dialects, which can be classified into three major groups: Central, Western/Misharian and Eastern/Siberian. Modern standard Tatar shows features mostly of both the Central dialects (especially in lexicon, phonology) and the Western/Misharian dialects (more in morphology).
Tatar is an agglutinative language.
Fun facts:
In Turkish, the Tatar language is called Turkish Tatar (Tatar Türkçesi) to stress its membership in the Turkic language family.
Tatar literature flourished in the empire of the Golden Horde, founded by Genghis Khan's grandson, Batu Khan (the empire existed from the early 13th to the middle of the 15th century).
Classification: Tatar belongs to the Northern Kypchak branch of the Turkic languages, which might belong to the (disputed) Altaic language family. The classification of Tatar itself is not undisputed either (as for most Turkic languages). The closest relative of Tatar is Bashkir; other relatives include Crimean Tatar and Kazakh.
Script: Until the late 1920s Tatar was written in a modified Arabic script (which did not suit Tatar very well and imposed very complex spelling rules). The Latin alphabet introduced then was replaced by a Cyrillic one already in 1939. The second introduction of a Latin alphabet, which was made official in September 2001, was reverted by the Russian Supreme Court. Therefore today Tatar is written in a Cyrillic script with 6 special characters unknown in Russian.
Enjoy!
This post brought to you by т (U+0442, a.k.a. CYRILLIC SMALL LETTER TE)
Case differences in casing scripts (Latin, Cyrillic, Greek, Armenian, Ecclesiastical Georgian, Coptic, Glagolitic, etc.) ought to be easy.
But it's not. And not just for the reasons I have talked about in the past.
All the technical folks want is a simple set of mappings that have a 100% roundtripping capability and no change in size of the string. It is needed for the filesystem, for the NT object namespace, and so on.
But their hopes must unfortunately be dashed if those technical folks wanted their simple needs to match the needs of customers, since individual languages have their own specific preferences and expectations here.
Only some of which are supported by Windows or the .NET Framework. And dare I say it, most of them are not supported.
A great example of this can be seen in Greek, which has so many different traditions across its history from ancient to modern times that we are lucky to have sites like this one to try and wade through the issues, which go way beyond the Greek final sigma issue I have talked about previously.
Starting with ancient Greek, there are three different preferences that call for three entirely different conventions for case mapping, as described here:
And then, moving into modern times, there is the debate about the (currently out of favor but still taught and used) polytonic vs. (currently in favor and highly recommended) monotonic systems. And case is where it gets interesting for us, as described here:
Greek differs from Latin in that it capitalises letters with diacritics differently, depending on whether the entire word is in capitals (whereupon diacritics are eliminated), or the initial is capitalised only, as in the first word in a sentence or in a title (whereupon the diacritics are retained, although they appear to the left of the letter rather than above it.) Thus, polytonic ἄνθρωπος capitalises to ΑΝΘΡΩΠΟΣ, but in titlecase to Ἄνθρωπος; monotonic άνθρωπος capitalises to ΑΝΘΡΩΠΟΣ and Άνθρωπος.
Even without the roundtripping requirement, it is clearly hard to decide what the default behavior should be.
And how do you balance the legitimate and illegitimate needs of roundtrip-ability with the needs of a script that wants a convention to drop the accents upon capitalization (thus losing them forever since you can't exactly get them back)?
The answer, just like it was in the post "Michael, why does ToTitleCase suck so much?", is not very well. Of course the practices for ancient texts are by and large completely ignored, but the default case mappings in modern practice don't really match the Greek expectation of dropping the accent, either.
Perhaps a simple example would help. :-)
Take the word Ρύθμιση (Regulation). The code points are:
03a1 03cd 03b8 03bc 03b9 03c3 03b7
If you run this through Windows or .NET, it will uppercase to the entirely reversible ΡΎΘΜΙΣΗ, which is:
03a1 038e 0398 039c 0399 03a3 0397
But the expectation of people in Greece is more likely to be ΡΥΘΜΙΣΗ, which is
03a1 03a5 0398 039c 0399 03a3 0397
That second character would be expected to lose its TONOS, so that if you lowercased the uppercased string, you would get back ρυθμιση, not ρύθμιση.
Unless you created a font that would literally display U+038e without displaying the Tonos, which would give one the best of both worlds, with the only bad part being the confusability of such a solution.
Note that there are no title case mappings to help mitigate this, so ToTitleCase is once again not useful....
And of course this example ignores the even thornier problem with what to do when it is on the first letter, but you get the idea.
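For anyone who wants to see the two behaviors side by side, here is a minimal sketch in Python rather than Windows or .NET. Python's str.upper() follows the default Unicode case mappings, so it produces the same reversible result described above. The helper name greek_upper_drop_tonos is hypothetical, and stripping the combining acute after decomposition is only an assumption about how one might approximate the Greek expectation; it ignores the first-letter problem and everything about ancient texts.

```python
import unicodedata

COMBINING_ACUTE = "\u0301"  # what the tonos becomes after NFD decomposition

def greek_upper_drop_tonos(text: str) -> str:
    """Hypothetical helper: uppercase, then drop the tonos from every letter."""
    decomposed = unicodedata.normalize("NFD", text.upper())
    stripped = "".join(ch for ch in decomposed if ch != COMBINING_ACUTE)
    return unicodedata.normalize("NFC", stripped)

word = "\u03a1\u03cd\u03b8\u03bc\u03b9\u03c3\u03b7"  # Ρύθμιση

default_upper = word.upper()                # keeps U+038E, so it can round-trip
greek_upper = greek_upper_drop_tonos(word)  # U+03A5 instead, accent is gone

for label, value in (("default", default_upper), ("no tonos", greek_upper)):
    print(label, value, " ".join(f"{ord(ch):04x}" for ch in value))

# Lowercasing the accent-free form cannot bring the tonos back.
print(greek_upper.lower())
```

The lossy step is exactly the one the roundtripping folks object to: once the accent is stripped, nothing in the string says where it used to be.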
The solution for ancient texts is even more elusive, especially given the many differences in user expectations.
This post really just scratches the surface; if you are interested in the area then I highly recommend the links I pointed to, which go into even greater detail on the difficulties involved with Greek.
Now this is an area where potential improvements can be considered in the future, but there are no immediate built-in solutions available. All I can say for now is that it is in one's best interests to avoid converting Greek strings to uppercase if one wants to avoid having a bad situation in a localized application....
This post brought to you by ύ (U+03cd, a.k.a. GREEK SMALL LETTER UPSILON WITH TONOS)
Not too long ago, Tharaka asked in the Suggestion Box:
Hi, I’m working on a Transliterating Input Method for the Sinhala language. One that would allow Sinhala to be entered phonetically. I.e., you would enter ‘ka’ to get KAYANNA (“\u0D9A”), ‘kaa’ to get (“\u0D9A\u0DCF”), ‘kae’ to get (“\u0D9A\u0DD0”) and ‘k’ to get (“\u0D9A\u0DCA”), and so on. And it should work with any (or at least most) existing applications. The need for this is that the existing layout for Sinhala (Wijesekara) is very hard to use with a non-Sinhala keyboard. I.e., it would require an actual Sinhala keyboard with Sinhala letters printed on the keys. It is very hard to enter Sinhala with, say, a US keyboard. For the relative lack of Sinhala keyboards on the market and to avoid the hassle of having to buy a Sinhala keyboard just to type a few sentences in Sinhala, it is useful to have such a phonetic mechanism. Since this is how we type Sinhala informally (e.g. while chatting), most Sri Lankans are used to such ‘phonetic’ typing. After some poking around I came to the conclusion that IMM is old hat and the new way is to use the Text Services Framework (TSF) to build input methods. Then, I started looking for a .NET binding for TSF (since it’ll be much easier) but found there was none. Therefore, I started with VC++ 8 (with CLR support) to build my input method, hoping to use .NET facilities for common tasks such as reading/writing XML configs files and the composition window and some GUI elements. However, working with TSF, I came across many problems. First of all, there seems to be very little documentation about TSF, even on the Internet. The TSF reference cannot even be reached from the Visual Studio 2005 MSDN index. The API seems to be so complex, so obfuscated that it led me to suspect that TSF is a phased-out API. TSF and .NET does not seem to mix properly as well. I got access violations while trying to load a mixed-code input method DLL in some applications (Notepad.exe) while working fine in others. 
The questions I have are these: *) Is it possible to build a transliterating input method (as I plan to do) with TSF? *) Is TSF “alive”? Has it been phased-out/deprecated in favor of something else? *) Is mixing .NET with TSF bad? Do I have to work in pure C++ (*pain!) I would be very glad if you could shed some light on these questions, so that I can be sure I’m not on a wild goose chase with TSF. Thanks! Tharaka
The Text Services Framework is definitely alive and well -- in fact, in Vista virtually all of the Input Method Editors (IMEs) have been converted to use it, and the input methods for Yi and Amharic both use it as well.
Unfortunately, I do not know of any specific way to allow for a managed (.NET) TSF Text Input Processor. I will inquire further but I suspect that this is not possible given how it has to be integrated into essentially any thread using it for input, whether managed or not.
But the good news is that such a transliterating input method is quite possible with the Text Services Framework. And companies like Murasu have actually created such input methods for Tamil and other languages already, when simple keyboards are simply inadequate. This is the model for the input methods used for Amharic in Vista, for example.
It is even quite easy in Vista using the same techniques I used to create the Cantonese and Unicode IME samples I have been working on. If you wanted to send me the table containing all of the equivalences you are using, i.e.
"ka" = "\u0D9A"
"kaa" = "\u0D9A\u0DCF"
"kae" = "\u0D9A\u0DD0"
"k" = "\u0D9A\u0DCA"
and so on, I'll see if I can add another sample to the list....
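To give a feel for what such a table does once it is filled in, here is a minimal sketch in Python of a longest-match lookup over just the four entries quoted above. To be clear, this is not TSF code and not how the Vista samples are built; the names SINHALA_TABLE and transliterate are hypothetical, and a real input method would work keystroke by keystroke with a composition string rather than on a finished string.

```python
# The four equivalences from the question; a real table would be much larger.
SINHALA_TABLE = {
    "ka": "\u0D9A",
    "kaa": "\u0D9A\u0DCF",
    "kae": "\u0D9A\u0DD0",
    "k": "\u0D9A\u0DCA",
}

def transliterate(text: str, table: dict = SINHALA_TABLE) -> str:
    """Hypothetical helper: replace the longest matching key at each position."""
    longest = max(len(key) for key in table)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so "kaa" wins over "ka" and "k".
        for size in range(min(longest, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in table:
                out.append(table[chunk])
                i += size
                break
        else:
            # No entry starts here; pass the character through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)

for typed in ("ka", "kaa", "kae", "k", "kak"):
    converted = transliterate(typed)
    print(typed, " ".join(f"U+{ord(ch):04X}" for ch in converted))
```

Trying the longest key first is what keeps "kaa" from being read as "ka" followed by a stray "a".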
This method does not currently work in versions of Windows prior to Vista, although to be honest the font and shaping support for Sinhala is also not widely available (other than the earlier version that was released as described here, and significant enhancements to the font and shaping engine have happened since then).
Some form of an input method like this, if it gains wider acceptance in the community and by language experts, could eventually find itself considered for inclusion in a future version of Windows!
So Tharaka, if you can just send me your email contact info via that Contacting Michael... link, we can talk further about how to get the info transferred and get the sample put together!
This post brought to you by ඐ (U+0d90, a.k.a. SINHALA LETTER ILUUYANNA)
The other day when I was talking about italics in "You say ĭtalics, I say ītalics. It is much more complicated in Cyrillic", the difference between the way italic/oblique font styles are thought of in different languages/locales was one of the interesting issues. It is something that Simon Daniels mentioned in response to RubenP's point about "And ditto on the synthetic obliques. I mean, why does Verdana have a real oblique, but Tahoma doesn't. It's the same bloody typeface!":
Tahoma was created for UI, and traditionally our UI doesn't use italics (some languages we localize into don’t really have an Italic concept). However, the point is well taken, as we can't control where a font gets used we decided to include Italics (true ones) in Segoe UI, based on the amount of fake Tahoma Italic we’ve seen over the years on the web and elsewhere. As for Frutiger Next Italics. Linotype obviously lifted that idea straight from Myriad as a way of getting back at Adobe. ;-) Michael, if you’re interested in Italics a post on Meiryo Italics would be a good one.
I actually use fake Tahoma Italic in this blog, so obviously I agree with Simon's point about lack of control over the font usage....
But the point he raised about Meiryo (the new Japanese font in Vista) is quite interesting (even if it was not as funny as the Linotype one!). It gets down to the core issue of who is in control when it comes to typography decisions -- the user or the font.
You see, in Meiryo only the Latins have a slanted form in the Italic font, not all glyphs. So if I take a string like:
Very interesting. 非常に興味深い。
In most fonts, marking it Italic will slant the Japanese text too, which really violates Japanese traditions.
But in Meiryo, it is a little different. Like in this screenshot in Vista's WordPad:
The fact that the text is marked Italic is really not terribly relevant to Meiryo, it would seem!
Now while this really is in keeping with Japanese typographic traditions, it has been reported as a bug by several different people since Meiryo was first added to Vista, primarily from users who are used to slanted characters.
But it does kind of underscore that font settings, whether they are size, weight, or obliqueness, are actually a preference, one that the font itself might be designed to ignore.
This is not something that everyone is comfortable with (just as people may not like that the letters are such different sizes in different fonts), but it is actually how they are designed....
This post brought to you by い (U+3044, a.k.a. HIRAGANA LETTER I)
I was over on Language Log the other day and I read Eric Baković's Between Good and Evil. Somewhere between the points in the post and the reference to Carl Sagan's thought experiment about the dragon I was reminded about a bit in Sagan's Contact (not the movie but the book, which was free of both Hollywood simplifications and the type of things one imagines Jodie Foster wants out of parts she takes).
I think it puts the question of atheism vs. agnosticism in a slightly more reasonable light than a purely scientific, a purely linguistic, or even a purely religious standpoint might:
"You don't want to believe in God." Joss said it as a simple statement. "You figure you can be a Christian and not believe in God. Let me ask you straight out: Do you believe in God?"
"The question has a peculiar structure. If I say no, do I mean I am convinced God doesn't exist, or do I mean I'm not convinced that he does exist? Those are two very different statements."
"Let's see if they are so very different, Dr. Arroway. May I call you 'Doctor'? You believe in Occam's Razor, isn't that right? If you have two different, equally good explanations of the same experience, you pick the simplest. The whole history of science supports it, you say. Now, if you have serious doubts about whether there is a God -- enough doubts so you're unwilling to commit yourself to the faith -- then you must be able to imagine a world without God: a world that comes into being without God, a world that goes about its everyday life without God, a world where people die without God. No punishment. No reward. All the saints and prophets, all the faithful who have ever lived -- why, you'd have to believe they were foolish. Deceived themselves, you'd probably say. That would be a world in which we weren't here on Earth for any good reason -- I mean for any purpose. It would all just be complicated collisions of atoms -- is that right? Including the atoms that are inside of human beings.
"To me, that would be a hateful and inhuman world. I wouldn't want to live in it. But if you can imagine that world, why straddle? Why occupy some middle ground? If you believe all that already, isn't it much simpler to say there's no God? You're not being true to Occam's Razor. I think you're waffling. How can a thoroughgoing conscientious scientist be an agnostic if you can even imagine a world without God? Wouldn't you just have to be an atheist?"
"I thought you were going to argue that God is the simpler hypothesis," Ellie said, "but this is a much better point. If it were only a matter of scientific discussion, I'd agree with you Reverend Joss. Science is essentially concerned with examining and correcting hypotheses. If the laws of nature explain all the available facts without supernatural intervention, or even do only as well as the God hypothesis, then for the time being I'd call myself an atheist. Then if a single piece of evidence was discovered that doesn't fit, I'd back off from atheism. We're fully able to detect some breakdown in the laws of nature. The reason I don't call myself an atheist is because this isn't mainly a scientific issue. It's a religious issue and a political one. The tentative nature of scientific hypothesis doesn't extend into those fields. You don't talk about God as a hypothesis. You think you've cornered the truth, so I point out that you may have missed a thing or two. But if you ask, I'm happy to tell you: I can't be sure I'm right."
Anyway, it just occurred to me while I was reading and thought I'd share. :-)
This post brought to you by ܞ (U+071e, a.k.a. SYRIAC LETTER YUDH HE)
It does not always pay to be clear and unambiguous. Sometimes, the lack of clarity can be helpful....
Here is an example of this.
If you have not installed Vista, you can probably see many of the screen shots of the installation process in the various betas. One of the early dialogs looks something like this (you can find this and various permutations on the internet):
Basically you get a choice of installation language, formats, and keyboard layout.
It causes an interesting problem, truth be told -- because previously the actual keyboard layout name was hidden from everyone other than the few people who opened up the Language Bar settings dialog (shown here on XP):
So previously, most people (starting in XP) would only ever see the language and would never see the layout. Because the act of setting the user locale would add a keyboard layout. And the Language Bar would usually only show the language.
For example if you changed your system locale to Dutch, you would have a keyboard added that looked like Dutch according to the Language Bar. But secretly, it was installing something very different (which you can see if you look at that settings dialog):
US International? Huh?
It's true. The fact is that few people like the "Dutch" keyboard. The differences get pretty substantial in short order if you look at them side by side:
Anyway, if you look at various sites on the web like this one, you'll see what I mean. Of course as far as I can tell, the "United States (International)" keyboard is not that well thought of either, but it is in most cases considered better than the Dutch one.
But think back to the XP situation -- most people don't realize it.
So what happens in this new setup UI in Vista? Suddenly they see "United States (International)" for the keyboard, and assume that this is some kind of US Imperialism feature added to Vista, and a clear regression since the keyboard always used to claim to be Dutch.
You can see it here, with a sort of pseudo locale sort of thing going on as well:
And the obvious question that the person is asking -- why is this keyboard my default all of a sudden? Even if they simply never realized it was their default all along....
So the new and arguably clearer UI in setup is hoist by its own petard -- the very attempt to provide clarity has revealed an issue that was previously well-served by the obfuscation of the platform!
Ah well, it will be knowledge, which is power. And people throughout the Netherlands (and other places) will learn this lesson, within zero to one calls to product support.
And this is (in my humble opinion) a bug, or at least a small design flaw in the new, clearer UI.
I'd argue that we should tell people about this to avoid paying for the support call, but of course if we tell them then they don't need to call.
Maybe someone in PSS could put in a Vista KB article that calls it a bug? :-)
This post brought to you by € (U+20ac, a.k.a. EURO SIGN)
Regular reader KJK:Hyperion asked in the Suggestion Box:
...when will Transliteration Utility support Romaji and Hiragana transliteration for Japanese? That's basically the only one I need. At the moment I use http://www.j-talk.com/nihongo/ but I'd prefer an off-line tool.
The tool that he is referring to is the Microsoft Transliteration Utility v1.0, which Thierry Fontenelle talks about in English here (and in French here).
I happened to be in an email thread with one of the authors of the tool (Nick Cipollone) and Thierry, and I figured I'd ask them this question. :-)
And Nick gave me the scoop:
Our basic strategy with Transliteration Utility was just to get the thing out the door with a few representative types of modules that people could use as models to create their own. The only modules that were specifically requested by anyone were the Inuktitut Syllabary <-> Romanization modules (requested by the Canadian sub), the rest were basically things we had lying around. We had intended to put out “module expansion packs” every now and then, once we had enough new modules to justify it. We haven’t developed any new ones for public consumption since Transliteration Utility shipped in January, though. We also hoped as a stretch goal that individuals or companies other than Microsoft might eventually provide module expansion packs, although this hasn’t happened to our knowledge yet either.
Well, that sounds like a call to arms for me, what do all of you think? :-)
The tool itself is a pretty cool thing, and it may be worth looking into building a new transliteration model in its Module Development Console:
The text in the Module Development Console lays out what is involved, and it looks pretty straightforward (all you would need is good knowledge of the languages and the transliteration in question to fill it in!):
[Input]
// Insert a several-word description of the module's input.
// For example:
// Romanization

[Output]
// Insert a several-word description of the module's output.
// For example:
// Cyrillic

[Description]
// Give a several-sentence description of the module.

[Preprocess]
// If you need to preprocess your input before applying
// rules specify the procedure here.
// For example:
// ToLower
// ToUpper(tr-TR)

[States]
// If you need any states other than the two predefined ones
// (START and DEFAULT) then declare their names here.
// For example:
// CONSONANT
// VOWEL

[FollowingContextMacros]
// Insert any following context macro definitions here.
// For example:
// Cons b c d f g h j k l m n p q r s t v w x y z
// ConsOrEnd <END> :Cons:
// Vowel a e i o u
// VowelAtEnd a<END> e<END> i<END> o<END> u<END>

[EscapeSpanDelimiters]
// If you need to be able to prevent spans of the input
// from being processed you can specify one pair of strings
// to indicate the beginning and end of such escaped spans.
// For example:
// { }
// /* */

[Rules]
// List your rules here. For example:
// a --> x
// a(<END>) --> y
// [START] fa --> z [VOWEL]
Anyone want to give it a shot? :-)
This post brought to you by ぱ (U+3071, a.k.a. HIRAGANA LETTER PA)
So during the season finale of Entourage, one of the sage pieces of advice given to Vince about his plan to fire Ari was that he really ought to get a feel for what is out there in terms of agents before he actually fires Ari. It is all about having a good fallback in case the resource you have been relying on turns out to not be able to come up with the goods....
One of the things that has been built into the Language Interface Pack design, required due to the fact that it is only a partial localization, is the notion of a fallback language to use in case the resources do not exist in the expected language.
Hopefully you would not fire the LIP since it is not screwing up like Ari did, it is just doing its best. But clearly English is not always the best choice for language fallback.
In Vista, this notion has been further extended so that it does not have to only be embedded in the locale itself but can also be a user overridable choice. The Regional and Language Options tab will look as follows when a partial localization is chosen:
Notice how there is even a third language choice you can make if the second one is thought of as a partial localization. Importantly, note that neither the second nor the third choice will be visible if a partial localization is not chosen.
Which is not to say Dutch is a partial localization in Vista, but it temporarily does not have all of its strings localized, and that allows the testing of this feature to happen more widely.
Now all of this is very cool, but I think the design is a bit flawed.
Because we actually encourage people to build their own applications with MUI-like features and we encourage them to use both the MUI resource loading and UI language functions, and clearly our business cases that are used to decide which localizations are full ones and which ones are partials will not always match those of an ISV writing an application with MUI functionality.
Therefore, the option to say "if you can't find this language, then use that one" should always exist. Because even if a full localization is involved, what we call full and what a user does are two different things....
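The "if you can't find this language, then use that one" idea can be sketched in a few lines of Python. This is only an illustration of an ordered fallback chain and has nothing to do with the actual MUI resource loader; the RESOURCES tables, the strings in them, and the load_string helper are all hypothetical.

```python
# Hypothetical string tables: the Dutch one is deliberately incomplete.
RESOURCES = {
    "nl-NL": {"ok": "OK", "cancel": "Annuleren"},
    "en-US": {"ok": "OK", "cancel": "Cancel", "apply": "Apply"},
}

def load_string(key, preferred_languages, resources=RESOURCES):
    """Hypothetical helper: return (language, text) from the first language that has the key."""
    for language in preferred_languages:
        table = resources.get(language, {})
        if key in table:
            return language, table[key]
    raise KeyError(key)

# The user, not the application author, chooses the order of the chain.
chain = ["nl-NL", "en-US"]
print(load_string("cancel", chain))  # found in the first choice
print(load_string("apply", chain))   # missing there, so the fallback is used
```

The point of keeping the chain as plain user data is that it works the same way whether or not anyone has labeled the first language a "full" localization.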
This post brought to you by ຫ (U+0eab, a.k.a. LAO LETTER HO SUNG)