Postings are provided as is with no warranties, and confer no rights. Opinions expressed here are my own delusions; my employers at best shake their heads and sigh, at worst repudiate the content with extreme prejudice, whenever it manages to appear on their radar.
This blog is unsuitable for overly sensitive persons with low self-esteem and/or no sense of humour. Proceed at your own risk. Use as directed. Do not spray directly into eyes. Caution: filling may be hot. Do not give to children under 60 years of age. Not labeled for individual sale. Do not read 'natas teews ym' backwards. Objects in mirror are closer than they appear. Chew before swallowing. Do not bend, fold, spindle or mutilate. Do not take orally unless directed by a physician. Remove baby before folding stroller. Not for use on unexplained calf pain.
A nice FLAIR (FLuid Attenuated Inversion Recovery) view from the not-too-distant past. Every abnormality you can see on this scan (and there is more than one!) is asymptomatic at present. Alongside is a picture of me walking the walls at Fremont Studios, a sign of a damaged brain.
According to Bill Vaughn in his post Petitions and other Silliness:
When I visited the speaker’s lounge I felt like a caged bear with kids poking me with sharp sticks. It seems that the Microsoft folks in attendance took exception to the Visual Basic 6.0 petition that I signed along with a number of other MVPs. A couple implied that I would be lucky to keep my MVP status because I chose to speak up.
If you were one of those folks who implied such a thing, then shame on you.
An MVP is not an unpaid shill. He or she is a Most Valuable Professional. This is a program which (as this site says) "...is a worldwide award and recognition program that strives to identify amazing individuals in technical communities around the world. Microsoft MVPs are recognized for both their demonstrated practical expertise and willingness to share their experience with peers in Microsoft technical communities."
I used to be an MVP. And this is simply not done. I know of one incident (a few years ago) where a Product Manager implied retribution for public statements an MVP made (ironically it was also about Visual Basic). I also know that, after feedback from several internal and external people, the Product Manager was himself punished, and required to apologize for doing it.
Whoever did this was entirely out of line.
Bill is actually a former employee of Microsoft who has forgotten more about several MS products than many of the people who are employees here will ever know.
Even the thought of someone using their position as an employee to try to unfairly influence someone in this way pisses me off to no end. This is not why we are encouraged to go out and speak at conferences. As I said before, I did not even sign the freaking petition but I swear I am tempted to do it now, just to see if someone has the nerve to claim I do not have the right to do so. If I did not have philosophical qualms about what the petition was trying to do, I would probably do so right now, and dare these people (whoever they are) to try to show me the door.
In the newsgroups the other day, Ernst K. Locker asked:
Are there any major drawbacks to using Big5 other than I would be restricted to Chinese?
It is an interesting question, one that people often ask for different languages when they start to understand the cost of going to Unicode. That cost can include the extra performance hit of strings that are twice as big, and the hit of conversion any time you need to move into or out of Unicode. It is entirely reasonable to ask the question Ernst did, to determine if there is a real benefit.
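To put a rough number on the "twice as big" cost, here is a minimal sketch in Python that compares encoded sizes. It assumes Python's built-in "big5" codec as a stand-in for the Windows code page and UTF-16LE as the in-memory Unicode form; the function name encoded_sizes is made up for the illustration.

```python
def encoded_sizes(text):
    """Return the byte count of `text` in a legacy code page and in UTF-16.

    Assumption: Python's "big5" codec approximates the Windows Big5 code
    page, and "utf-16-le" stands in for the Unicode form Windows uses.
    """
    return {
        "big5": len(text.encode("big5")),
        "utf-16-le": len(text.encode("utf-16-le")),
    }


# ASCII doubles in size; the two ideographs are two bytes each either way.
sizes = encoded_sizes("Hello, 世界")
print(sizes)
```

The doubling only hits the single-byte portion of the text; the ideographs cost two bytes in both encodings, so the penalty depends heavily on what the data actually looks like.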
Starting with the question about Big5, it is an industrial standard out of Taiwan that roughly maps to the portion of the CNS-11643 standard that was picked by industry.
Looking at Unicode today (as of the just-released Unicode 4.1), there are 71,226 Han ideographs. The latest updates to CNS-11643 seem to suggest that over 56,000 of them are in one sense or another attested in Taiwan.
Looking at the Big5 code page on Windows, there are 20,321 characters in it (and not all of them are ideographs; that count includes ASCII, some other single-byte characters, and some of the Kana characters).
This seems to give the best possible answer to whether there are drawbacks other than being restricted to Chinese -- how about being restricted in how much Chinese you can use here, too?
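One way to see that restriction for your own data is to ask which characters in a string cannot be represented at all. The sketch below assumes Python's "big5" codec, which is close to but not identical to the Windows code page; the helper name big5_unencodable is hypothetical.

```python
def big5_unencodable(text):
    """List the characters in `text` that Python's "big5" codec cannot encode."""
    missing = []
    for ch in text:
        try:
            ch.encode("big5")
        except UnicodeEncodeError:
            missing.append(ch)
    return missing


# U+20000 is a Han ideograph from CJK Extension B, far outside Big5.
print(big5_unencodable("中文"))
print(big5_unencodable("中\U00020000文"))
```

Running something like this over a sample of real names and addresses is a cheap way to find out whether the code page is already losing data.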
You can see similar problems with over 20,000 ideographs in Korean National standards, over 50,000 missing from code page 936 for China, and thousands missing from code page 932 for Japanese.
And to a somewhat lesser extent, the same is true of almost every code page. Vietnamese uses all of the following combining marks, only some of which actually have representation in the code page: grave, hook above, tilde, acute, dot below, breve, circumflex, and horn. This was probably less an oversight than a consequence of the fact that there is not enough room.
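The split between marks that made it into the code page and marks that did not can be checked directly. This is a sketch that assumes Python's "cp1258" codec matches the Windows Vietnamese code page; the table and function names are made up for the example.

```python
# The eight combining marks named above, keyed by code point.
VIETNAMESE_MARKS = {
    "\u0300": "grave",
    "\u0309": "hook above",
    "\u0303": "tilde",
    "\u0301": "acute",
    "\u0323": "dot below",
    "\u0306": "breve",
    "\u0302": "circumflex",
    "\u031b": "horn",
}


def marks_in_code_page(code_page="cp1258"):
    """Return the names of the marks that encode as standalone characters."""
    present = []
    for mark, name in VIETNAMESE_MARKS.items():
        try:
            mark.encode(code_page)
            present.append(name)
        except UnicodeEncodeError:
            pass  # no standalone slot for this mark in the code page
    return present


print(marks_in_code_page())
```

The marks that are missing as standalone characters still show up in the code page, but only baked into precomposed letters, which is how the limited room was stretched.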
Try asking Dr. International. As he pointed out in "Arabic: Script or Language?" there is certainly a lot of the Arabic script that could not be fit into code page 1256. There are many languages, including Baluchi, Berber, Farsi, Kashmiri, Kazakh, Kirghiz, Kurdish, Pashto, Sindhi, Uighur, Urdu, and others, that the code page simply has no room for. They need Unicode, too.
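Here is what "no room" looks like in practice: a small sketch that pushes text through a code page the way a lossy conversion would. It assumes Python's "cp1256" codec as the stand-in for the Windows Arabic code page, and the helper name to_code_page is hypothetical.

```python
def to_code_page(text, code_page="cp1256"):
    """Encode `text`, substituting "?" for anything the code page lacks."""
    return text.encode(code_page, errors="replace")


beh = "\u0628"           # ARABIC LETTER BEH, part of code page 1256
yeh_small_v = "\u06ce"   # ARABIC LETTER YEH WITH SMALL V, not in the code page

print(to_code_page(beh))
print(to_code_page(yeh_small_v))
```

Once the question mark is written out, the original character is gone for good; no later conversion can bring it back.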
Of course it goes without saying that the many "Unicode only" languages that have been added to Windows, starting with Windows 2000 and continuing in Windows XP and XP SP2, clearly require Unicode, unless your users speak fluent question mark.
And it happens with many other languages, too. And the code pages that are nominally underneath them.
The cost of not moving to Unicode is getting higher and higher all the time. The time to move to Unicode is not even today, it is last week!
This post brought to you by "ێ" (U+06ce, a.k.a. ARABIC LETTER YEH WITH SMALL V)
One of the many Arabic script characters that exist in Unicode that are not in code page 1256.
Looks like they are starting to gear up for Tech·Ed 2005 (starting June 5-10).
You can see session info here.
My sessions (slightly modified abstracts from before, mainly to get rid of that odd use of the word "exploit" and a few other weirdnesses):
DAT290 Databases for the World: Designing Multilingual Databases Using SQL Server 2005
Wednesday, June 8 10:15 AM - 11:30 AM
Speaker(s): Michael Kaplan
Session Type(s): Breakout
Track(s): Database Development
SQL Server 2005 introduces new and enhanced international features that are very important to customers in many parts of the world -- from proper Indic support to recognition of the supplementary characters used in East Asia, as well as many more important international scenarios. This session highlights the new features, and provides both code samples and advice on best practices to make good use of them.
No schedule yet, but that will make its way there eventually.
Also, for what it's worth, DBA319 will be both a development and an administration topic.
To answer the questions of others, there is interest in some of the international Tech·Ed events for these topics, but no word on who will definitely want them yet. Folks who want to see these talks or any of the others in the Global Development and Deployment virtual track might want to look into whatever voting or other selection techniques are available at the International Tech·Ed they fancy.
(thanks for the pointer by Kate Gregory)
GREEK SMALL LETTER FINAL SIGMA is the sort of character that only ever gets to have the last word.
The character (ς) only ever gets used when it is the last character in the word; otherwise you are supposed to use U+03c3 (σ, a.k.a. GREEK SMALL LETTER SIGMA).
Some of the backstory about its presence in Unicode is explained on this site:
The final sigma is a positional variant of sigma (U+03C3 Greek Small Letter Sigma, σ), such as also occurs in Hebrew and Arabic. It might legitimately be questioned whether Unicode needed a separate codepoint for the two lowercase sigmas; and indeed, Beta Code has done without the differentiation. The use of distinct codepoints in the legacy scheme of Latin-7 has decided the matter, however.
That site also talks a bit about the history of the actual letter's use in the language.
U+03c2 has some prominent mentions in parts of the section of the Unicode FAQ about character properties, case mappings, and names:
Q: Do the case mappings in Unicode allow a round-trip?
A: No, there are instances where two characters map to the same result. For example, both a sigma and a final sigma uppercase to a capital sigma.
and
Q: Near the end of the SpecialCasing.txt, there are the two lines on SIGMA that look weird to me. Can you explain them:
# 03C3; 03C2; 03A3; 03A3; FINAL; # GREEK SMALL LETTER SIGMA
# 03C2; 03C3; 03A3; 03A3; NON_FINAL; # GREEK SMALL LETTER FINAL SIGMA
A: Both of these are conditional (column 5); that is, in normal Greek text a 03C3 (non-final sigma) should be written as 03C2 (final sigma) if it is at the end of a word, and a 03C2 (final sigma) should be written as a 03C3 (non-final sigma) if it is not at the end of a word. That's what these two lines would mean if they were uncommented. However, they are commented, just for that reason: the SpecialCasing file is not intended to normalize the appearance of a small sigma.
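The round-trip problem from the first FAQ answer is easy to see in code. This is a small sketch using Python's built-in str.upper() and str.lower(); the helper name round_trips is made up, and other languages and platforms may behave differently.

```python
SIGMA = "\u03c3"          # σ GREEK SMALL LETTER SIGMA
FINAL_SIGMA = "\u03c2"    # ς GREEK SMALL LETTER FINAL SIGMA
CAPITAL_SIGMA = "\u03a3"  # Σ GREEK CAPITAL LETTER SIGMA


def round_trips(ch):
    """True if uppercasing and then lowercasing gives back the same character."""
    return ch.upper().lower() == ch


# Both lowercase forms collapse to the one capital sigma.
print(SIGMA.upper() == FINAL_SIGMA.upper() == CAPITAL_SIGMA)
print(round_trips(SIGMA), round_trips(FINAL_SIGMA))

# Lowercasing a whole word is context sensitive in Python: the last
# capital sigma of a word comes back as the final form.
print("ΟΔΟΣ".lower())
```

The last line shows why the choice between the two forms is usually treated as a matter of context, not as a fixed one-to-one case mapping.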
There was also an interesting thread in the Unicode List about this character back in early 2001:
Carl W. Brown:
It is final when followed by a hyphen or combining diacritical mark? Can you have a final sigma in the middle of a word?
Patrick T. Rourke:
Don't know what the Unicode rules are, but the answer is no. The final sigma form is not used if the sigma is in a medial position in the word but at the end of the line (e.g., when it occurs at the point of hyphenation in a hyphenated word at line end). Also, there is no reason why a consonant other than rho should be followed by a combining diacritical mark, except say an underdot for use in papyrological or epigraphical texts.
The upper case sigma is the same regardless of position; there is no differentiation between upper case final sigma and upper case initial/medial sigma.
If a font uses the lunate sigma for the initial/medial form, it must use it for the final form as well, and vice versa.
Nick Nicholas:
On the latter, yes in some 19th century typographical traditions, where the final sigma is used to differentiate the prefix pros- from pro-; e.g. you'll see Lambros in his _Neos Hellenomnemon_ journal write, say, PRO*S*ABBATON = pro-sabbaton, but PRO*@*AGW = pros-agw. (Sorry about non-Unicode; I'm on a Mac and have left my lookup-list at the office.) This tradition has not been maintained, and I don't think it was ever mainstream in Western Europe. I think I've also seen it done with other such prefixes, like eis-.
Diacritics following a final sigma would only occur in Modern Greek dialectology --- e.g. hacek used to denote that the sigma is pronounced as "sh". (Epigraphists and papyrologists too, I suppose, though they'd tend more to the lunate sigma anyway.) In that case, yes, the final sigma remains final. Before a hyphen, on the other hand, it would clearly remain medial, unless you're pulling the 19th century pros- prefix trick.
Carl W. Brown riposted to Nick:
Nick,
If you have a lowercase sigma in the middle of the word followed by a diacritic is it final;
sigma, hacek, some other letter.
Carl W. Brown then tried to restate the question he was trying to ask:
Maybe in might be clearer to ask if there are any cases where you use the final sigma form where it is not the last letter in a word. Modern Greek only.
Lucas Pietsch responded to Patrick:
Just one addition: You do get a final sigma before explicit (hard) hyphens, i.e. u+2020 and other kinds of dashes, as opposed to (soft) line-breaking hyphens (u+00AD).
I guess explicit hyphenation isn't likely to occur in typesetting of Ancient Greek, but it does occur in Modern Greek, in noun compounds of the type κράτος-μέλος. The Unicode rules will handle this correctly, as far as I can see.
Michael Everson proved he'd be an ace in the "Scripts" category of the Trivial Pursuit game :-) :
Sigma with caron is used at least in dictionaries of the Tsakonian dialect.
Nick Nicholas disagreed with Carl W. Brown's suggestion:
>If you have a lowercase sigma in the middle of the word followed by a diacritic is it final;
>sigma, hacek, some other letter.
No, sir. And medial sigma-diacritic is far more frequent than a sigma having a diacritic word-finally.
Nick Nicholas also responded to the restating of the question by Carl W. Brown:
What I described in my first paragraph is the only such instance I'm aware of (the 19th texts I have in mind were editions of Byzantine texts, but I think the editor was generalising it in his orthography, and was not the only one to do so). It has never been mainstream practice. You'll see a lot of stigmas as sigma-tau ligatures up to the nineteenth century, and being printed as final sigmas; but they're stigmas nonetheless, not sigmas.
Oh, just remembered: the phonetic Greek alphabet used in the Soviet Union in the '30s for Pontic and Mariupolitan Greek uses the final sigma universally (and doubles it for "sh".) Again, not mainstream, and any such texts that have been reprinted in Greek academia have been reprinted in conventional orthography. (The Mariupolitans are now using Cyrillic; the ex-Soviet Pontians are mostly migrating to Greece, and I don't know if they're still writing their dialect.)
Carl W. Brown responded to this fuller response from Nick Nicholas:
It looks like the Unicode TR 21 special casing rules for the Greek final sigma are not quite right.
The final sigma in modern Greek should only be used at the end of a word including the case where separate words are joined with hard hyphens. If it is followed by a character such as a combining mark or soft hyphen you must continue scanning to see what follows. If it is followed a letter then it is not final.
A simpler test might be it see if a letter or a spacing character or hard hyphen is found first. If it is a letter then it is not a final sigma.
Nick Nicholas responded to this suggestion promptly:
Which is what we do at the TLG with Beta code (whose S is both medial or final); in fact, Beta code conflates hard hyphens and dashes anyway, considering the (em) dash, without space, punctuation.
If the Unicode rules are wrong, well, I hope those that can fix them are tuned in. :-)
Mark Davis then jumped in, to respond to the bug report:
Yes, that was filed as a bug, and will be fixed the next time we update the case mappings. We are right in the middle of the Unicode 3.1 release, so that will be coming sometime later.
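The scan that Carl W. Brown describes in the thread (skip combining marks and soft hyphens, then see whether a letter comes first) can be sketched in a few lines of Python. This is only an illustration of his description, not the Unicode algorithm or anything Windows does; the function names are made up, and it looks only at what follows the sigma, so a sigma standing entirely alone is treated as final, a case the thread never addresses.

```python
import unicodedata

SIGMA = "\u03c3"        # σ GREEK SMALL LETTER SIGMA
FINAL_SIGMA = "\u03c2"  # ς GREEK SMALL LETTER FINAL SIGMA
SOFT_HYPHEN = "\u00ad"


def is_skipped(ch):
    """Combining marks and soft hyphens do not settle anything; keep scanning."""
    return ch == SOFT_HYPHEN or unicodedata.category(ch).startswith("M")


def sigma_is_final(text, index):
    """True if no letter follows the sigma at `index` before the word ends."""
    for ch in text[index + 1:]:
        if is_skipped(ch):
            continue
        # First significant character: a letter means the sigma is medial.
        return not ch.isalpha()
    return True  # ran off the end of the text


def normalize_sigmas(text):
    """Rewrite every lowercase sigma to the form its position calls for."""
    out = []
    for i, ch in enumerate(text):
        if ch in (SIGMA, FINAL_SIGMA):
            ch = FINAL_SIGMA if sigma_is_final(text, i) else SIGMA
        out.append(ch)
    return "".join(out)


# A hard hyphen ends the word, so both sigmas come out final.
print(normalize_sigmas("κρατοσ-μελοσ"))
```

Because the hard hyphen is neither a letter nor a skipped character, the compound from Lucas Pietsch's message gets a final sigma on both halves, while a sigma followed by a caron and then another letter stays medial.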
I will talk about this character on Windows in a different post....
This post brought to you by "ς" and "σ" (U+03c2 and U+03c3, a.k.a. GREEK SMALL LETTER FINAL SIGMA and GREEK SMALL LETTER SIGMA)
Both of whom are eager to read that upcoming post about their status on Windows!
Back in December I posted the article that I submitted (too late) for the Access Heroes pages put up when the 10th anniversary of Access was being celebrated. In the flotsam and jetsam stream of consciousness at the end, one of the points I listed there was:
that Andrew Miller works with 96% efficiency (conservative estimate from team members) and at times I really wanted to modify that estimate -- upward.
Well, tonight I was reminded of why I keep the MSDN Blogs and TechNet Blogs on the list of the Blogs I read.
Andrew Miller has a blog, with a subtitle of Thoughts on Access.
As one of the original wizard developers before I even knew there was such a job as a wizard developer for Microsoft Access (let alone before I had the job myself), he was responsible for some of the most maintainable code in a code base estimated at over 80,000 lines of VBA that was in many senses utterly unmaintainable.
I was not the person who initially made the claim about him, but I had no problem agreeing with it.
His blog is definitely on my list....
Last night, I received the following question in e-mail:
Dear Michael.
I wonder if you can clarify this matter. I was under the impression that Tamil Unicode was possible only under XP professional, - since regional lang settings are available in the Control Panel. Yesterday I bought a new laptop that came with XP home edition. The regional language settings was not available initially. I was disappointed and thought of upgrading to XP pro. But I went to the Help Button and saw a link that installed the regional lang setting. I was able to set up Tamil Unicode settings and I am now happily using Tamil Unicode in the preinstalled Microsoft Works. The query is how come there is this impression that only Windows 2000 and XP pro support (Tamil) Unicode??
Thank you for your time
Kalaimani
Singapore
I reassured him of one important fact here -- that every version of Windows 2000, Windows XP Home, Windows XP Professional, and Windows Server 2003 contains all of the international support, no matter what localized version the SKU is.
So if you have XP Home then you have the support for Tamil, Georgian, Punjabi, Russian, Greek, Traditional Chinese, Bulgarian, Afrikaans, Catalan, Korean, Basque, Vietnamese, Spanish, Thai, French, Hindi, Japanese, Belorussian, Icelandic, Farsi, Galician, Danish, Ukrainian, Romanian, Simplified Chinese, Swedish, Konkani, Italian, German, Hungarian, Armenian, Lithuanian, Divehi, Swahili, Czech, Dutch, Hebrew, Estonian, Gujarati, and all of the rest of them.
You may have to install the proper international support to get the language (and some on this list are only available in XP and later), but they are all there waiting for you to use them. Today!
This post brought to you by "ஶ" (U+0bb6, TAMIL LETTER SHA)
Recently (as of Unicode 4.1) added to Unicode based on a proposal by INFITT (International Forum for Information Technology in Tamil)
(Nothing technical in this post)
I'm gonna rant for a moment.
Since I live right across the street from the main campus of where I work, I used to run here, and then eventually walk. I think I mentioned before that I drive a scooter to work these days because of the whole MS thing. This appeals to me since driving such a short distance seems so wasteful.
There are two streets I have to cross to get there -- 40th Street and 36th Street. Now 40th Street is a big intersection so they have a nice "mother may I please cross?" button, and right turners who usually do not block the ramps from the sidewalk to the street. This is important because the scooter weighs several hundred pounds so it is not so easy to lift it onto the curb, you know? It is great that people usually stop so I can get by and do not run over me when I am crossing on my WALK sign when they have a red light.
This is in marked contrast to 36th Street. Here, people who are turning right constantly move forward all the way, blocking the ramp. They really do not even look to see if someone is trying to cross, whether they have a red light or not. I usually end up having to glare at someone for blocking me in until they back up (assuming the next thoughtless person has not already driven up behind them and boxed them in). After they back up and let me through, I usually give them a warm smile and a nod/wave, unless they are a repeat offender from a prior day (which is often), in which case I am usually not smiling as warmly at the jerk (a term I reserve for people who have done it more than three times -- and there are several of those).
I usually try to avoid the whole problem, going a block west to Microsoft Way, where everyone is really nice. They always let me through, they wait even when they probably do not have to (the laws of Washington do not require a car to stay stopped when I am four lanes away from them, but they usually stop anyway). People seldom look annoyed about it, and are usually quite alright to let me go through. There are times when the intersection is busy that I feel guilty blocking the way and I actually ride down one block to where people tend to be more like jerks.
Wanna know the weird part of all this?
The people with the biggest potential to be a problem in all three cases are the same folks leaving Microsoft from their day at work!
It is the same people, literally a block away in all three cases.
So what makes a person a kind and considerate person at one intersection, and complete jerk-of-the-earth a block later?
Is there such a huge difference between (a) being on or off the Microsoft campus and (b) being on the intersection where you enter or leave campus? Is there some kind of zone of jerk-dom that just affects people when they cross that border?
Is it not mildly ironic that my desire to avoid inconveniencing the nice people on one intersection causes me to run into the same people driving like assholes the next block over?
I guess the only good part is that one day, if one of them actually hits me, they won't be going fast enough to kill me, and since I only cross on the WALK sign and their light is red it will be clearly and completely their fault. And since they work for a living they will have insurance and enough money that they will be able to cover settling whatever lawsuit comes of the whole mess. You can't count on that kind of guarantee in every neighborhood, so at the very least it's good to know that they will be able to make up for their insensitive and dangerous antisocial behavior....
Grrrrrrrr.
Okay, I feel better now.
I'll have a technical post a little later.... :-)
Robert Scoble has a blog. Now he does not need me to link to him, as he gets plenty of attention on his own. :-)
Anyway, the other day he made a post entitled 'Light blogging week, first look at Longhorn fonts'. In it, he talked about Bill Hill:
Today we're gonna take a hike around campus with Bill Hill. He was our first interview on Channel 9 (and still one of our most popular). His bit about why you should put only a single space after a period is still one of my favorites.
Don't know who he is? He's in charge of typography at Microsoft. You know, fonts and stuff. His group is spending millions of dollars in font and font-rendering technology. So, I'm sure we'll talk about the fonts that his group designed for Longhorn.
Ed Bott has the preview of those. He links over to a Poynter Online article about the new "C-fonts" designed for Longhorn.
Now I am not going to knock Bill Hill, he is a smart guy and he has a smart team. And he is an engaging speaker as Robert indicates. ClearType is a very cool technology, and my team works with his team on a lot of different things.
But he is not in charge of all of the typography that happens at Microsoft.
You see (over in another group at Microsoft) I am on the GIFT team, and GIFT does not stand for Globalization Infrastructure, Flowers and Tools. And that "F" does not stand for "Folk Singers" or "Fabulous" or "Freaking" or anything else like that. It stands for Globalization Infrastructure, Fonts and Tools. Because a few years back MST (Microsoft Typography) merged with some other folks under Julie Bennett to form the GIFT team.
Now the typography folks have a lot in common with those of us in NLS in that usually people do not really notice us unless something goes wrong. But their work is no less important; in fact I would argue it is often more important, since the utility of collation and casing and encoding is pretty limited if you can't see what the characters are (only people on the NLS team get good at speaking fluent question mark or square box). And it is the folks in typography who have been making it all happen for a lot of years.
I could talk about the millions of dollars that they are spending on fonts in Longhorn for new languages (only some of which I can even talk about yet!).
I could talk about all the work they did for Windows XP SP2 to add support for Bengali and Malayalam (referred to in Lions and tigers and bearsELKs, Oh my!).
I could talk about all of their efforts for Sinhalese in advance of Longhorn (referred to in Doing a little more in Sri Lanka....).
I could talk about the next dozen fonts that they are working on for new languages in future Windows updates and in Longhorn.
I could talk about all of the free tools they release, like the Web Embedding Fonts Tool (WEFT) and the Font properties extension.
I could talk about all of the developer tools they release, from Microsoft Font Validator to Visual OpenType Layout Tool (VOLT) to OpenType Layout Services Library (OTLS) to Visual TrueType (VTT) to the OpenType Font Signing Tool to Font Properties Editor to the OpenType Embedding SDK to the many other font development tools.
I could talk about the shaping engines that they produce for both Uniscribe and Avalon, that help make the most out of the new fonts for the languages that we support.
I could talk about how they own the OpenType specification, about which I know just enough to realize the extent to which I am "only an egg" compared to the typographers down the hall.
I could talk about folks like Simon Daniels, who have spoken at GDDC and tons of other conferences. I wish I could find a pointer to some of the slides he has done, or better yet video of one of the presentations, since you have to see him speak to get the full effect of the work that he describes.
I could talk about the large community of OpenType developers out there and the exciting work that OpenType enables.
I could talk about all of the OpenType training that they do, around the world.
I could talk about the cool font that they helped us to deliver for MSKLC that we use to give a visible display to characters that have no visible representation, from U+0020 to U+180b to U+034f and so on.
I could even talk about some of the many other issues that Ning, Simon, Paul, Peter, Ali, Judy, Carolyn, Julia, Cathy, Adam, Vinay, Dave, Mushegh, Nick, Sergey, David, Michel, or the rest of the typography folks are dealing with even as we speak to make the font story at Microsoft a good one, for Longhorn and for everything else.
Or maybe I have said enough to convince everyone that the "other Typography team" at Microsoft is also a place where important work is happening for Longhorn and beyond. Even if Channel 9 has not yet paid them a visit.... :-)
This post is brought to you by U+034f, a.k.a. COMBINING GRAPHEME JOINER.
Flags for APIs are an interesting thing, especially when you have APIs that you improve between versions. And the NLS APIs are no exception to this.
There are two ways one could handle flag values that are undefined in the current version of an API:
Just in case people were not sure, by and large the NLS APIs use #1, something that I think most APIs do. While the ability to call the code just one way is nice, never being sure what the behavior will be is really not so nice. And when you look at APIs from the point of view of not knowing what will come next, you never want to add that kind of uncertainty to the mix for the future.
(Note that the one glaring exception to this is the LOCALE_NOUSEROVERRIDE flag, which actually gets pretty controversial when you get down to it; cf: When is it a backcompat break and when is it fulfilling not-yet-fulfilled technology?)
It was interesting working on the Microsoft Layer for Unicode, which tries to let people write Unicode applications. What to do with the flags that are NT-specific becomes an even more interesting set of debates with people who start each meeting claiming they have no strong feelings about the issue and end each meeting ready to kill the ignorant wingnuts who want Microsoft to go down in flames via their own bad decisions. Hopefully you'll see the counterpoint there....
On the whole I am happy with the behavior of the rest of the APIs (and with most of the behavior of GetLocaleInfo, et al., ignoring the LOCALE_NOUSEROVERRIDE stuff). And that is speaking as a former external developer who had to write all of that version-specific code. Because I would rather see the API fail up front than appear to succeed while actually ignoring what I told it to do. An API that accepts but ignores random crap (and if you think about it, it is all random crap when it is not yet defined) is not a very robust API....
Though I must admit, as that same former external developer who had to write code against up to ten versions of Windows, that flags only working in some versions and not others can be quite a pain to keep straight.
A smattering of flags that were added later:
And so on. All flags added for new functionality that on prior versions would cause the API to fail with an ERROR_INVALID_FLAGS return from GetLastError()...
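To make the trade-off concrete, here is a minimal sketch in Python of the two behaviors this post contrasts: failing up front on flag bits the current version does not define, versus accepting and silently ignoring them. It is a toy model, not the NLS implementation. The flag names, the version table, and both function names are hypothetical; only the ERROR_INVALID_FLAGS name and its value (1004) come from Win32.

```python
ERROR_INVALID_FLAGS = 1004  # Win32 error code of the same name

# Hypothetical flags, invented purely for this illustration.
FLAG_IGNORE_CASE = 0x1   # defined since "version 1"
FLAG_IGNORE_WIDTH = 0x2  # defined since "version 1"
FLAG_NEW_FEATURE = 0x4   # only defined in "version 2"

# Which flag bits each (hypothetical) API version understands.
KNOWN_FLAGS = {
    1: FLAG_IGNORE_CASE | FLAG_IGNORE_WIDTH,
    2: FLAG_IGNORE_CASE | FLAG_IGNORE_WIDTH | FLAG_NEW_FEATURE,
}


class InvalidFlagsError(Exception):
    """Raised when the caller passes bits this version does not define."""

    def __init__(self, unknown_bits):
        super().__init__(f"undefined flag bits: {unknown_bits:#x}")
        self.code = ERROR_INVALID_FLAGS


def check_flags_strict(flags, version):
    """Fail up front if any undefined bit is set."""
    unknown = flags & ~KNOWN_FLAGS[version]
    if unknown:
        raise InvalidFlagsError(unknown)
    return flags


def check_flags_lenient(flags, version):
    """Silently drop undefined bits and carry on."""
    return flags & KNOWN_FLAGS[version]
```

With the strict check, a caller that passes FLAG_NEW_FEATURE to "version 1" finds out immediately and can fall back to version-specific code. With the lenient check the call appears to succeed, but the caller has no way to tell that the new behavior never happened.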
This post brought to you by "ᾮ" (U+1fae, GREEK CAPITAL LETTER OMEGA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI)
The issue is partially described in the Microsoft Knowledge Base (article 889834) but this article does not tell the full story (and some of what it tells is wrong).
Let's start with the title and its problems between CurrentUICulture and CurrentCulture:
The DateTimePicker control and the MonthCalendar control do not reflect the CurrentUICulture property of an application's main execution thread as you expected when you created a localized application in the .NET Framework or in Visual Studio .NET
Now the MonthCalendar and the DateTimePicker are not based on the UI settings, they are based on the user settings. So even if the control is fully globalized, it would never be based on the UI settings. This is because the date, time, calendar, number, currency, and collation settings are always based on the default user locale (and on CurrentCulture in the .NET Framework). If a localized application's language were to match this, it would only be because the user happened to set CurrentCulture and CurrentUICulture to be the same culture, which is often the case but does not have to be.
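The split between the two settings can be modeled in a few lines of Python. This is only a sketch of the idea, not the .NET Framework's behavior in detail. The ThreadSettings class, both helper functions, and the tiny hand-written data tables are assumptions made up for the example.

```python
from dataclasses import dataclass
from datetime import date

# Simplified, hand-written data for two cultures; real culture data is far richer.
SHORT_DATE_FORMATS = {"en-US": "%m/%d/%Y", "de-DE": "%d.%m.%Y"}
UI_STRINGS = {"en-US": "Pick a date", "de-DE": "Datum auswählen"}


@dataclass
class ThreadSettings:
    current_culture: str     # drives formatting: dates, numbers, currency, collation
    current_ui_culture: str  # drives which localized resources get loaded


def format_short_date(settings, value):
    """Formatting follows the user settings (current_culture)."""
    return value.strftime(SHORT_DATE_FORMATS[settings.current_culture])


def load_caption(settings):
    """Resource lookup follows the UI settings (current_ui_culture)."""
    return UI_STRINGS[settings.current_ui_culture]


# A user with German regional settings running an English UI.
settings = ThreadSettings(current_culture="de-DE", current_ui_culture="en-US")
print(load_caption(settings))                        # Pick a date
print(format_short_date(settings, date(2005, 6, 5)))  # 05.06.2005
```

The caption comes out in English while the date comes out in the German pattern, which is the same mismatch people report when they expect a date control to follow the UI language.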
Now the article is smart enough to point out that the user locale settings control the language of the DateTimePicker and the MonthCalendar controls, and it does point out why -- because these two controls are wrappers around the Windows Shell common controls.
But this is not the full story.
As imperfect as calendars are in Win32 (cf: Calendars on Win32 -- just there for show...., Calendars on Win32 -- Not all there yet) and .NET (cf: Calendars.NET -- new platform, new issues), both platforms have serious advantages over the Shell controls, since the Shell DateTimePicker and MonthCalendar common controls only support the Gregorian calendar.
Thus even if your default user locale settings include a calendar setting where GetLocaleInfo with LOCALE_ICALENDARTYPE returns any of the following values:
Value  Constant                     Meaning
1      CAL_GREGORIAN                Gregorian (localized)
2      CAL_GREGORIAN_US             Gregorian (English strings always)
3      CAL_JAPAN                    Year of the Emperor (Japan)
4      CAL_TAIWAN                   Taiwan calendar
5      CAL_KOREA                    Tangun Era (Korea)
6      CAL_HIJRI                    Hijri (Arabic lunar)
7      CAL_THAI                     Thai
8      CAL_HEBREW                   Hebrew (Lunar)
9      CAL_GREGORIAN_ME_FRENCH      Gregorian Middle East French calendar
10     CAL_GREGORIAN_ARABIC         Gregorian Arabic calendar
11     CAL_GREGORIAN_XLIT_ENGLISH   Gregorian Transliterated English calendar
12     CAL_GREGORIAN_XLIT_FRENCH    Gregorian Transliterated French calendar
the DateTimePicker and MonthCalendar controls will never go beyond the Gregorian calendar.
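As a rough illustration of what that means for an application, the sketch below (in Python, standing in for whatever language the application is written in) maps the calendar values listed above to their constant names and flags the cases where the user's calendar is one the Shell controls will not show. The needs_custom_calendar_ui helper is hypothetical, and treating every Gregorian variant as covered is a simplifying assumption rather than something the Knowledge Base article states.

```python
# Calendar type values and constant names, as listed in the table above.
CALENDAR_TYPES = {
    1: "CAL_GREGORIAN",
    2: "CAL_GREGORIAN_US",
    3: "CAL_JAPAN",
    4: "CAL_TAIWAN",
    5: "CAL_KOREA",
    6: "CAL_HIJRI",
    7: "CAL_THAI",
    8: "CAL_HEBREW",
    9: "CAL_GREGORIAN_ME_FRENCH",
    10: "CAL_GREGORIAN_ARABIC",
    11: "CAL_GREGORIAN_XLIT_ENGLISH",
    12: "CAL_GREGORIAN_XLIT_FRENCH",
}


def needs_custom_calendar_ui(calendar_type):
    """Return True when the user's calendar is not a Gregorian variant.

    Assumption for this sketch: all Gregorian variants count as covered
    by a Gregorian-only control; everything else needs custom UI.
    """
    return "GREGORIAN" not in CALENDAR_TYPES[calendar_type]


# Example: which user calendar settings a Gregorian-only control cannot reflect.
unsupported = [name for value, name in CALENDAR_TYPES.items()
               if needs_custom_calendar_ui(value)]
print(unsupported)
```

An application that cares about users with, say, a Hijri or Hebrew calendar setting could use a check along these lines to decide when to swap in its own calendar UI.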
Now the methods and properties on the calendar classes derived from the Calendar class (GregorianCalendar, HebrewCalendar, HijriCalendar, JapaneseCalendar, JulianCalendar, KoreanCalendar, TaiwanCalendar, and ThaiBuddhistCalendar) contain the capabilities to let you create your own calendar. I'll try and throw such a sample of a calendar together another time.
Note that the above limitations do not apply to the ASP.NET control (System.Web.UI.WebControls.Calendar). I will cover this control and its capabilities another day....
This post brought to you by "ฟ" (U+0e1f, THAI CHARACTER FO FAN)
Ok, a poll for the people who read here regularly....
I was pinged by a few people to register this blog as being involved with the various technology areas that it touches (Windows, the CLR, SQL Server, Office, AD, globalization, etc.) and I did so.
But note that my post categories (the list over to the left) are based on various internationalization topics, not on technology.
This caused someone to post a comment suggesting that I fix the bug that caused the post to be listed under an unrelated technology.
Well, they are right, no argument about that.
To fix this, I have several options:
I guess I am leaning toward (C) or (D) at this point, slightly more towards (D) since I am not trying to maximize hit count; I'd rather be on fewer radars than too many, if I have to choose.
Anyone have any thoughts? Does anyone use the old categories or are they sensible only to me? Would people prefer new product-based ones, instead?
Looks like they are starting to gear up for Tech·Ed 2005 (running June 5-10)!
On the Language Log, Bill Poser posted about the use of Chinese in a particular episode of Law & Order in his post Chinese in Law And Order:
Television is confusing. I was watching Law and Order a little earlier. It was the episode in which the police find a little Chinese girl and her baby sister alone in their apartment, their mother missing. The story is about what has happened to her. The Chinese-speaking detective and the little girl converse in Mandarin, and so do the little girl and her aunt. Near the end, when they locate the little girl's teenage sister, she and her aunt speak Mandarin with each other. But when the aunt goes into a shop in Chinatown to consult the owner, they speak Cantonese.
He then points out the problems with this whole scenario.
This scenario seems unrealistic to me. That the man in Chinatown should speak Cantonese is what I'd expect. Most Chinese immigrants to the US until recently spoke Cantonese. Recent immigrants include many Mandarin speakers, so it isn't a surprise that the girls and their aunt spoke Mandarin. Indeed, just recently I had what to me was the rather odd experience of encountering a little girl, maybe 8 or 9, in a shop in Chinatown, who spoke neither English nor Cantonese. We spoke Mandarin (she rather better than me - yet another area in which age and academic degrees don't help).
What is odd is that the aunt spoke Cantonese with the man in Chinatown. Of course, many Cantonese-speakers learn Mandarin as a second language, so bilinguals are not rare, but it is quite unlikely that a Cantonese person who also knows Mandarin would speak Mandarin with her nieces. People who are basically Mandarin speakers rarely speak Cantonese; if they do it is usually because they have moved to a Cantonese-speaking area. The only other hypothesis that I can think of is that the adults are first-language Cantonese speakers who have learned Mandarin as a second language and who so strongly identify with Mandarin as the language of modernity that they have spoken Mandarin with their children and nieces. I guess that's possible, but I haven't ever met anyone like that. In my experience, Cantonese speakers always prefer Cantonese. They may make an effort to learn Mandarin because they perceive it as advantageous to know, but they would never use it with their children.
It is often a mistake, however, to try to ascribe higher motives to writers of a gritty television show filmed in New York.
So, I'm wondering whether the Law and Order folks had in mind some interesting scenario that would explain the choice of languages in this episode, or whether they just don't know one kind of Chinese from another, or don't think that anyone will notice.
The latter, I would say.
It is a bit like the work Marc Okrand did for Paramount in creating an entire Klingon language (for which he later created a dictionary). Dr. Okrand was once in school with Ken Whistler, who I have talked about previously. And there are times that he may regret the fact that most of the links in Google Scholar pointing to him relate to scholarly work about a language that does not exist and whose principal speakers wear rubber protrusions on their foreheads when they speak it. Cornelis Krottje notes in his revisionary proposal of the Klingon Dictionary:
The current dictionary of Klingon (Okrand, 1992) is a bilingual, bidirectional dictionary, consisting of a passive Klingon-English section and an active English-Klingon section. We will maintain this nature of the dictionary; the alternative, an active Klingon-English section and a passive English-Klingon section, is unrealistic, simply because of the fact that native speakers of Klingon do not exist.
But note that despite the recognition of all of this by the lucid speakers of the language, the fact is that most of the Star Trek episodes that have involved Klingons since the original Star Trek movie for which Paramount commissioned Dr. Okrand have done so without any linguistic guidance. The script is used randomly on ships and controls, and the language used seldom matches the actual language beyond single words like nuqneH that the Klingon Language Institute has not yet managed to make as common in English as words like grok.
The writers of Law & Order probably did not have any deep motives or hidden scenarios for what they did. I frankly doubt they even really knew that the actors did this. Perhaps it was just an easter egg that they produced for the show? :-)
This post brought to you by "𠀀" (U+20000, the first Extension B ideograph meaning "the sound made by breathing in; oh!")
Yes, I said it: consistency and correctness are both four letter words.
As I have said before, one of the greatest strengths of NLS is the strong combination of linguistic knowledge and technical strength. The synthesis that these two very different viewpoints can create is so much more compellingly stronger than anything that either could do on its own.
But it can lead to a lot of interesting debates.
Occasionally, when you have someone from a technical background who has delusions of linguistic aptitude, you end up debating with yourself.
In this context, the definitions of the two words (a-la-dictionary.com) are interesting:
Consistency -- Correspondence among related aspects; compatibility. Reliability or uniformity of successive results or events.
Correctness -- To meet a required standard or condition. Conformity to fact or truth.
It is easy to see how each can be important -- being able to count on getting the same results in different versions seems very important. But when the result is wrong, doesn't an API that purports to accurately reflect something have the responsibility of breaking consistency with a prior version, preferring consistency with the truth?
And in each case, the price of keeping to the tenet can be quite expensive. Even if you ignore the fact that you will have to defend your approach to the person who sits in the other camp on the particular issue.
In my case, I have to shave the face of the man who makes some of those decisions, which may be one of the reasons I do not shave every day. And there are times I have lost sleep over those decisions, certain that no matter how many people will be happy with a particular decision, somebody will feel that they have been (to borrow the phrase of Lily Tomlin from The West Wing talking about why she was fired the last time she worked in the White House) screwed with their pants on.
So I decided to try to form some principles, so I can stay clean shaven and not lose sleep when I get a chance to do it:
This is the best I can do to minimize the number of four-letter words that I say under my breath when I have to push issues of either consistency or correctness.
This post brought to you by "€" (U+20ac, a.k.a. EURO SIGN)
This question often comes up -- what is the InstalledUICulture used for?
Short answer? It isn't for anything. Don't use it.
Longer answer? It is a culture based on the return value of GetSystemDefaultUILanguage, on systems that support that API. This API is itself fairly useless. Which is why you should avoid the managed object -- wrapped up useless does not find itself being more useful. Neither one is for anything. Don't use them.
Want an even longer answer? Ok, you asked for it....
GetSystemDefaultUILanguage returns the install language of the operating system. On a localized system that is the language the OS was localized into, and on an MUI system that is almost always English [1].
Now note that this is not a language that the user has chosen to work in, or claimed to have knowledge of, or claims to even be able to comprehend. Having your application do something with it is as productive as looking into who your grandfather took to the prom and chatting them up as if they should know who you are. In summary, it is a bad idea.
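A small Python model shows why this goes wrong on an MUI system. It is a toy sketch, not a wrapper over any real API. The System class, the RESOURCES table, and both helper functions are hypothetical names made up for the example.

```python
from dataclasses import dataclass

# Hypothetical localized resources for one button caption.
RESOURCES = {"en-US": "Save", "ja-JP": "保存"}


@dataclass
class System:
    install_ui_language: str  # the language the OS was installed in
    user_ui_language: str     # the language the user actually chose to work in


def caption_from_install_language(system):
    """What an application gets if it keys its UI off the install language."""
    return RESOURCES[system.install_ui_language]


def caption_from_user_language(system):
    """What it gets if it follows the user's choice, falling back to English."""
    return RESOURCES.get(system.user_ui_language, RESOURCES["en-US"])


# An MUI system: installed in English, with a user working in Japanese.
mui = System(install_ui_language="en-US", user_ui_language="ja-JP")
print(caption_from_install_language(mui))  # Save
print(caption_from_user_language(mui))     # 保存
```

The first function hands a Japanese-speaking user English text purely because of how the machine was set up at the factory, which is the grandfather's prom date problem in miniature.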
There are exceptions, but to date I have never seen an application that uses it which is doing so appropriately.
This is in marked contrast to a similarly named but very different API that I have talked about in the past -- GetSystemDefaultLCID. Although it too is misused (and such misuse can be dangerous), it does at least have some important functionality. If you need a conceptual hook to understand the difference between GetSystemDefaultUILanguage and GetSystemDefaultLCID, it is like looking at the difference between Heroin and Cocaine -- both are very dangerous but the latter does have some medicinal benefit [2].
The API and method are similar to the drugs in that the developer gets a "high" from thinking that their code works; the "down" comes later when they find out that the whole usage thing was just a "bad trip."
Don't you wish you had stuck with the short answer and avoided my drug analogy? :-)
1 - I'll talk about the exceptions to this another time. I promise that they will not rehabilitate either the Win32 API or the managed object.
2 - e.g. as a diagnostic indicator for people suffering from glossopharyngeal neuralgia. I'll talk about this disease some other time, though I do not myself have it. :-)
This post brought to you by "d" (U+0064, a.k.a. LATIN SMALL LETTER D)
(The Letter "d" says "Don't say NO to drugs; say NO THANK YOU and calmly walk away")