I pointed out in the post Some sort of order to collation that it is easy to dismiss linguistic issues when one is thinking about collation. As Steven Pinker pointed out in The Language Instinct:
...for the same reason that alphabetical order is similar across the Hebrew, Greek, Roman, and Cyrillic alphabets. There is nothing special about alphabetical order; it was just the order that the Canaanites invented, and all Western alphabets came from theirs.
But it is not as simple as that. Looking at the beginnings of the alphabets for Hebrew:
vs. Greek:
vs. Russian:
vs. English:
and so on, we are as struck by the similarities (e.g. in Hebrew there is actually both a בּ (bet) and a ב (vet) that show up after the alef, just as there is both a Бб (be) and a Вв (ve) after the А in Russian) as by the differences (e.g. there are two 'v' sounds in Hebrew, neither of which is anywhere near the 'v' in English -- or that in Hebrew א (alef) is silent while its counterpart in most other languages is not).
Obviously there is a commonality here that is not accidental, but just as obviously the actual letters present and the order of those letters has changed over time in different languages.
There are many possible reasons for change here, and looking at the differences between the original order the Canaanites gave us and the order of any language today, many of the reasons for changes in order have either an orthographic or a phonemic basis.
Which brings us to something of a linguistic basis, doesn't it? :-)
Now this is especially true as we look at languages that pick up the use of a script such as Latin, Cyrillic, or Arabic and find the need to add letters. Because obviously they need a place to put those additional letters within their alphabetic order, and there are obvious reasons to choose a linguistic basis for that ordering.
Now this ordering may conflict with what a user of the script but not of the language may expect for a letter -- thus ڇ (tcheheh) will seem to many Arabic language readers like a ح (hah) with four funny dots in it, similar to how I (as a speaker of English) might look at ṻ (u with macron and diaeresis) as a u with some funny smudges on top of it.
Am I wrong? Certainly.
Is that Arabic reader? Yes.
But in the context of both the English and Arabic languages, we are both 100% correct.
And while the decision of where I would place them in an ordered list will likely be after ح and u, on the arbitrary basis that they look a bit like them, it is not really going to be the same for languages that might make use of the characters.
Where they might be placed in the alphabetical order of a language that makes use of ڇ or ṻ is likely to be very different. Since our answer was on the basis of ignorance of what the letter is, it would only make sense that their knowledge of the letter and what it does will guide their notion of where it belongs alphabetically.
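To make the two viewpoints concrete, here is a minimal sketch in Python. It does not use ڇ or ṻ, since I would only be guessing at where a given language puts them; instead it borrows Swedish, where å, ä and ö are letters in their own right that sort after z. The word list, the function names, and the simplified Swedish alphabet string are all just assumptions for illustration, not a real collation implementation.

```python
import unicodedata

def ignorant_key(word):
    """Sort key of someone who sees 'a letter with funny smudges on top'."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    # Drop the combining marks, so that "ä" sorts exactly like "a".
    return "".join(c for c in decomposed if not unicodedata.combining(c))

# Simplified Swedish alphabet: å, ä, ö are separate letters placed after z.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
RANK = {letter: i for i, letter in enumerate(SWEDISH_ALPHABET)}

def knowledgeable_key(word):
    """Sort key of someone who knows what the letters are in Swedish."""
    composed = unicodedata.normalize("NFC", word.lower())
    # Letters outside the alphabet are simply ranked last in this sketch.
    return [RANK.get(c, len(RANK)) for c in composed]

words = ["zebra", "äpple", "apa", "ö", "orm"]
ignorant_order = sorted(words, key=ignorant_key)
knowledgeable_order = sorted(words, key=knowledgeable_key)
print(ignorant_order)       # ['apa', 'äpple', 'ö', 'orm', 'zebra']
print(knowledgeable_order)  # ['apa', 'orm', 'zebra', 'äpple', 'ö']
```

Both orders are internally consistent; they just start from different knowledge about what the letters are.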
This is an issue that I will be posting about in the future, with some more specific examples, giving both the "ignorant" and "knowledgeable" viewpoints....
This post brought to you by "ڇ" (U+0687, a.k.a. ARABIC LETTER TCHEHEH)
I read with amusement Mark Starr's Why Doesn't Anyone Speak English in Torino?
(Well, most of the amusement was with his conclusion "in a real emergency, at least the hookers here speak English.")
It is interesting how there is an implicit assumption that if the city is big enough it should be easier to get around with English.
Though I have found that assumption to be untrue often enough that I think I have cured myself of it, after several years.
But I thought about Tokyo of many years ago versus now, and about that humorous conclusion and I saw a pattern.
People in any kind of service-based job will learn languages like English in order to be able to take advantage of the all-important tourist section of the market.
Though obviously for this to work as a motivator, the business has to be steady enough that people can see or at least sense the amount of business they are losing by not being able to understand.
And of course Torino does have an episodic reason to be interested, but it perhaps lacks the sustained interest that it would need to interest folks in the economic benefits of learning additional languages....
When I look at Windows and at computers in general, I know how often people make assumptions like "if they are developers, then they know English" but I wonder how often that is just self-fulfilling prophecy.
After all, when someone says this they are leaving something out of their assumption. What they really mean to say is "if they are developers who use our product, then they know English".
Which makes me glad that we are trying to move into so many new language markets -- because it means we are not making the assumption that people will learn English to use our products.
It was a lousy assumption anyway.... :-)
This post brought to you by "ঈ" (U+0988, BENGALI LETTER II)
Paul Langton asked via the contact link:
Gday Michael,
Firstly, love the blog and though a lot of it is waaaay over my head its always a great read, I'd go so far to say the best of all the MSDN blogs that I've sampled.
OK suck up out of the way, I have a VB script that does the same function as ticking the two tickboxes in "Supplemental language support" - i.e. "Install files for complex script and right to left languages (including Thai)" and "Install Files for East Asian Languages":
Dim oShell ' Windows Scripting Host shell
Dim oFSO ' File system object
Dim sCurDir ' Script path
Dim sWinDir ' Windows root path
'=======================================================================
' Main
'On Error Resume Next
Set oShell = CreateObject("WScript.Shell")
Set oFSO = CreateObject("Scripting.FileSystemObject")
sCurDir = Left(WScript.ScriptFullName, InStrRev(WScript.ScriptFullName, "\") - 1)
sWinDir = oShell.ExpandEnvironmentStrings ("%WinDir%")
' Install "Supplemental Language Support"
oShell.Run sWinDir & "\system32\rundll32.exe advpack.dll,LaunchINFSection " & sCurDir & "\intl.inf,LANGUAGE_COLLECTION.COMPLEX.INSTALL", 1, 1
oShell.Run sWinDir & "\system32\rundll32.exe advpack.dll,LaunchINFSection " & sCurDir & "\intl.inf,LANGUAGE_COLLECTION.EXTENDED.INSTALL", 1, 1
My question is - what is the best way to detect if these are already enabled so when it is run it doesn't reinstall all this stuff?
Rgds,Paul
Ok, first of all -- Paul, don't ever do this!
The code that does the installation of these components does indeed perform these steps, but it also does more than that -- and if you do only these things then you will probably find out at some point that not everything works as you want it to.
This is obviously bad.
The way to get this done is to use the method given in KB article 289125. Just install one of the East Asian and one of the Complex script locales and it will perform ALL of the installation steps, rather than just the ones you want.
And now to the actual question -- how to know when to skip this due to the installation already happening?
Well, you can actually use the old IsValidLocale function on any of the locales that are within one of the language groups in question (e.g. 0x0411 which is Japanese or 0x041e which is Thai) using the LCID_INSTALLED flag, or the IsValidLanguageGroup function to check the actual language groups and see if they are installed.
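A rough sketch of that check, in Python via ctypes rather than VBScript (which cannot call these functions directly). The helper names and the injectable is_installed parameter are my own assumptions so the logic can be exercised off Windows; the only real API used is IsValidLocale with the LCID_INSTALLED flag.

```python
import sys

LCID_INSTALLED = 0x00000001   # dwFlags value for IsValidLocale
LCID_JAPANESE = 0x0411        # stands in for the East Asian group
LCID_THAI = 0x041E            # stands in for the complex script group

def _win32_locale_installed(lcid):
    """Ask Windows whether the locale is installed (Windows only)."""
    import ctypes
    return bool(ctypes.windll.kernel32.IsValidLocale(lcid, LCID_INSTALLED))

def locales_needing_install(lcids=(LCID_JAPANESE, LCID_THAI), is_installed=None):
    """Return the LCIDs from lcids that are not installed yet.

    is_installed is a hypothetical hook (LCID -> bool) so the decision
    logic can be tested without Windows; by default it calls IsValidLocale.
    """
    if is_installed is None:
        if sys.platform != "win32":
            raise OSError("IsValidLocale is only available on Windows")
        is_installed = _win32_locale_installed
    return [lcid for lcid in lcids if not is_installed(lcid)]

# Example with a fake check that claims only Japanese is installed:
pending = locales_needing_install(is_installed=lambda lcid: lcid == LCID_JAPANESE)
print([hex(lcid) for lcid in pending])  # ['0x41e']
```

If the returned list is empty, the installation step can be skipped entirely.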
Now all of this is useful to remember, especially in the context of What isn't in the default install for NLS and Language groups -- the vestigial tail of NLS, especially since they give some use for language groups. :-)
But please try to avoid depending on anything in the INFs, since that is undocumented, unsupported, subject to change, and is changing massively in Vista....
This post brought to you by "ૐ" (U+0ad0, a.k.a. GUJARATI OM)
(nothing technical, yada yada yada....)
Years ago, travelling was always a pain. At first because I had to walk with the cane which was hard over long distances. Even standing in line was a pain.
And then, once I started asking for wheelchair service, because I had to deal with all of the problems in that particular piece of the service industry (an industry which lives on tips since if their hourlies were any lower they'd have to pay for their jobs).
And in hotels I stopped asking for accessible rooms because they were always farther away from the elevators.
Then I got the scooter.
Suddenly, I am always pushed to the handicapped short line at the airport, even though (since I am now sitting comfortably) I could probably wait in line.
And without me saying a word, I get rooms near the elevator. Even though the distance no longer matters to me -- I have a 10-mile range on a fully charged battery!
So basically when things were hard there was not much infrastructure that simply made things easier automatically. And now that they are not, there is.
It's all backwards!
I must admit I feel a little guilty being in the short line at security in the airport now -- because I know I don't have to be (other than the fact that TSA wants me there for whatever policy reasons they have).
And I always preferred being nearer to the elevator, though never enough to ask for it. So it is odd to be given a consideration that I am not even asking for if someone else may need it.
Though honestly I cannot think of a way to fix all this, so this is just me bitching and moaning for no good reason. Sorry about that.
Well, there is one productive thing I can do, I can give some advice:
If there are people reading this who are now in that tough phase where everything is hard, I'd advise getting the scooter or wheelchair; you will realize that suddenly a lot of things do get easier.
Regular reader Maurits asked in the Suggestion Box:
What kind of name for a character is this??
http://www.fileformat.info/info/unicode/char/534d/index.htm
Well, first (and most importantly), I should point out the name of the Unicode character U+534d is not that string -- this is just a Unified CJK Ideograph.
(look out, that is a big download!)
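If you want to see the actual character names for yourself, a quick way is Python's unicodedata module, which reports the formal Unicode name rather than anything from Unihan. The helper name below is just for illustration.

```python
import unicodedata

def character_name(code_point):
    """Return the formal Unicode character name for a code point."""
    return unicodedata.name(chr(code_point))

for cp in (0x534D, 0x5350):
    print(f"U+{cp:04X} {character_name(cp)}")
# U+534D CJK UNIFIED IDEOGRAPH-534D
# U+5350 CJK UNIFIED IDEOGRAPH-5350
```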
Though if you look at the Unihan database for the information on the character, you will see where that string comes from. It is in the kDefinition field:
An English definition for this character. Definitions are for modern written Chinese and are usually (but not always) the same as the definition in other Chinese dialects or non-Chinese languages. In some cases, synonyms are indicated. Fuller variant information can be found using the various variant fields. • Definitions specific to non-Chinese languages or Chinese dialects other than modern Mandarin are marked, e.g., (Cant.) or (J). • Major definitions are separated by semicolons, and minor definitions by commas. Any valid Unicode character (except for tab, double-quote, and any line break character) may be used within the definition field.
And the reverse character is also in Unicode, and it is in Unihan, with the same definition.
Thomas Chan suggested:
The entry for U+534D in the _Hanyu Da Zidian_, vol. 1, p. 51 (as indicated in unihan.txt) includes a quote that it was originally not a Han character, "wan ben fei zi ...", suggesting that it now is. There are also serifs shown in that dictionary and the _Kangxi Zidian_ for both characters.
Couldn't the above two characters be considered a "CJK" or "IDEOGRAPHIC" version (like the spaces, zero, punctuation, brackets, etc. in the "CJK Symbols and Punctuation" block)?
Andrew C. West posted some of the background information on these characters:
If memory serves me, the swastika was formally designated a Chinese ideograph by the redoubtable Empress Wu of the Tang dynasty during the late 7th century. Empress Wu had a penchant for creating new ideographs, and decreed that the Buddhist swastika symbol should henceforth be considered a Chinese ideograph to be pronounced WAN4 (a deliberate homophone for U+842C "10,000"). This is why, unexpectedly to some, the swastika symbols are found in the CJK Ideograph block rather than elsewhere.
Incidentally, U+534D and U+5350 are rarely used within running text in Chinese. In the decorative arts the swastika motif is generally described as WAN4ZI4 <842C, 5B57> "WAN ideograph", as in the word WAN4ZI4JIN1 <842C, 5B57, 5DFE>, a type of turban with a swastika decoration that was the height of fashion during the Ming dynasty.
The truth is that these characters are much older than any of the offensive things that were later done under their banner, and certainly in the correct context and situations, it would be a real problem not to include them.
Now if it were up to me I would do something about that definition string in Unihan, whether it is in the dictionary or not, but that is a different story entirely....
This post brought to you by "卍" and "卐" (U+534d and U+5350, two CJK Unified Ideographs)
I'm not sure how many of you remember when I posted Hungarian is even more complicated than I thought and More on the fabled EqualString.
Not because I don't have stats or anything, but because there is no way to gauge how many of you are new readers and how many of you really have nothing better to do than read what I am posting here. :-)
Anyway, I am going to talk about RtlEqualUnicodeString and RtlCompareUnicodeString, the functions in ntdll.dll that do binary comparisons that can be case insensitive, again.
I found out something really interesting about them the other day.
Now it is obvious how RtlEqualUnicodeString might be used -- I mean, if you have two strings and you need to know in a binary sense whether they are equal (possibly ignoring case) then it can be very handy. Because no matter how un-natural the comparison seems to humans, the fact is that lots of Windows loves it.
Of course the actual usage of RtlCompareUnicodeString is a bit less clear -- I mean, the order has no meaning to humans. So a function that uses it to order two strings seems like a ripe source for incorrect usage.
Don't worry, it turns out that nobody is using that order inappropriately.
In just about every case, the return value of the function is tested to see whether it was equal to or not equal to zero.
Yes, that is right -- almost everyone who uses it is essentially duplicating the functionality of RtlEqualUnicodeString.
When you get down to it, one has to wonder how much more expensive is operation A than operation B:
A - compare two strings, one WCHAR at a time, return the difference if there is one as soon as you find it, then compare that number to zero to see if there is in fact a difference.
B - compare two strings, one WCHAR at a time, return TRUE or FALSE as soon as you know whether they are in fact equal.
Remembering for a moment that a difference that makes no difference, makes no difference -- do you think it makes a significant difference?
Hopefully not. Though it worries me that no one seems to be doing anything beyond what RtlEqualUnicodeString would provide. So why take a hit at all?
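For anyone who wants to see A and B side by side, here is a minimal sketch in Python. These are not the ntdll.dll implementations; the function names are made up, the case insensitive option is left out, and strings are modeled as lists of UTF-16 code units.

```python
def utf16_units(s):
    """Split a string into its UTF-16 code units (the WCHARs)."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]

def compare_units(a, b):
    """Operation A: return the first code unit difference, or the length
    difference if one string is a prefix of the other (0 means equal)."""
    for x, y in zip(a, b):
        if x != y:
            return x - y
    return len(a) - len(b)

def equal_units(a, b):
    """Operation B: answer True/False as soon as the answer is known.
    An equality check can bail out on a length mismatch before looking
    at a single code unit, which an ordering comparison cannot do."""
    if len(a) != len(b):
        return False
    return all(x == y for x, y in zip(a, b))

left, right = utf16_units("abc"), utf16_units("abd")
print(compare_units(left, right) == 0)  # False
print(equal_units(left, right))         # False
```

The caller who writes `compare_units(a, b) == 0` gets the same answer as `equal_units(a, b)`, just with a less obvious statement of intent.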
I resisted the temptation to just go and fix all of the occurrences (it is a 100% safe change but even so, I hate when people do it to code I own).
I also resisted the temptation to send out a bunch of mail to all of the owners to tell them to change their code (I hate when people do that to me, too).
Now that I read this post again, it occurs to me that this will probably not actually be very interesting to people. It just seemed weird to me.
Though if you own one of those calls to RtlCompareUnicodeString, then feel free to change it; at worst it will just be more self-documenting as to the intention, and at best (if the code is called many times in a tight loop) it could even help performance!
This post brought to you by "P" (U+0050, a.k.a. LATIN CAPITAL LETTER P)
Last month I talked about GetDateFormat and using the DATE_LTRREADING/DATE_RTLREADING flags with it in Return of the Mark.
Now, there is a lot going on here when we are deciding issues of directionality. So let us try to separate all of the issues. :-)
I mean to say, if you are a developer calling GetDateFormat it is hard to know whether to pass one of the flags.
And depending on the context of where the formatted string is being inserted and what is in the string, those marks could be important to include.
If you think it sounds like another case of a bad decision to put localization decisions in the hands of developers like I pointed out in Just when you think you know a function... then you are insightful.
Unfortunately, you are also mistaken, since there is much more involved than localization here; there is the calendar being used, the needs of various locales, and the context where it is inserted. And there is no good way to put this one in the hands of localizers and get the right results, either.
Because they do not have all the information.
Basically, we have all of the following information contributors on the 'directionality' front:
Ok, so we have four factors to consider in our decision for whether to pass DATE_LTRREADING or DATE_RTLREADING, and if so, which one to pass.
All of which GetDateFormat could do itself if there were some cool flag like DATE_PICK_APPROPRIATE_READING_PLEASE, or somesuch. :-)
And how would you get the calendar that is being used? Well, if DATE_USE_ALT_CALENDAR is passed to GetDateFormat, you would use GetLocaleInfo with the LOCALE_IOPTIONALCALENDAR flag; otherwise, you would use GetLocaleInfo with the LOCALE_ICALENDARTYPE flag.
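That choice is small enough to sketch. The constant values below are the ones from the Windows headers as I remember them; the function name is an assumption, and the actual call to GetLocaleInfo is left out.

```python
DATE_USE_ALT_CALENDAR = 0x00000004     # GetDateFormat flag
LOCALE_ICALENDARTYPE = 0x00001009      # LCTYPE: default calendar of the locale
LOCALE_IOPTIONALCALENDAR = 0x0000100B  # LCTYPE: optional (alternate) calendar

def calendar_lctype(date_flags):
    """Pick the LCTYPE to pass to GetLocaleInfo for the calendar that
    GetDateFormat will use with the given flags."""
    if date_flags & DATE_USE_ALT_CALENDAR:
        return LOCALE_IOPTIONALCALENDAR
    return LOCALE_ICALENDARTYPE

print(hex(calendar_lctype(0)))                      # 0x1009
print(hex(calendar_lctype(DATE_USE_ALT_CALENDAR)))  # 0x100b
```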
So, how would you get the directionality from a locale? Well, you can get that from the LOCALESIGNATURE returned by LOCALE_FONTSIGNATURE (more on that parameter this post and this one), looking at bit 123 of the Unicode Subset Bitfields.
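Testing bit 123 is simple arithmetic once you have the four DWORDs of the Unicode Subset Bitfields. A minimal sketch, assuming the usual layout where bits 0-31 live in the first DWORD, 32-63 in the second, and so on (the function name and sample values are made up; fetching the LOCALESIGNATURE itself is not shown):

```python
RTL_LAYOUT_BIT = 123  # the bit mentioned above

def usb_bit_is_set(ls_usb, bit):
    """ls_usb is a sequence of four 32-bit values (128 bits in total)."""
    return bool((ls_usb[bit // 32] >> (bit % 32)) & 1)

# Bit 123 is bit 27 of the fourth DWORD, i.e. the mask 0x08000000.
sample_rtl = [0, 0, 0, 0x08000000]
sample_ltr = [0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0]
print(usb_bit_is_set(sample_rtl, RTL_LAYOUT_BIT))  # True
print(usb_bit_is_set(sample_ltr, RTL_LAYOUT_BIT))  # False
```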
And how do you combine these two possible locale settings and the calendar setting?
Hmmm, that is a bit harder, isn't it?
Actually, it turns out that things are fairly easy, believe it or not.
The main trick to remember is that the markers are used to imply directionality in the case where directionality may be unclear.
If all three are RTL, then obviously you could include DATE_RTLREADING (if you were also sure that the user default locale was also RTL you could perhaps pass no flags since there is no ambiguity, it is a pure RTL context).
If any of the three are RTL, then let the calendar decide (which means that CAL_HEBREW and CAL_HIJRI lead to DATE_RTLREADING, other CAL_* leads to DATE_LTRREADING, and CAL_GREGORIAN is decided by the UI language).
Otherwise, you could pass DATE_LTRREADING (and once again if the default user locale was also LTR you could perhaps pass no flags since there is no ambiguity, it is a pure LTR context).
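The three rules above can be sketched as a small function. I am assuming here that "the three" are the directionality of the locale passed to GetDateFormat, the directionality of the UI language, and the calendar; the function names are made up, and the "pass no flags in a pure context" refinement based on the user default locale is left out.

```python
DATE_LTRREADING = 0x00000010
DATE_RTLREADING = 0x00000020
CAL_GREGORIAN = 1
CAL_HIJRI = 6
CAL_HEBREW = 8

def calendar_is_rtl(calendar, ui_is_rtl):
    """Hebrew and Hijri are RTL, Gregorian follows the UI language,
    and every other calendar counts as LTR."""
    if calendar in (CAL_HEBREW, CAL_HIJRI):
        return True
    if calendar == CAL_GREGORIAN:
        return ui_is_rtl
    return False

def pick_reading_flag(locale_is_rtl, ui_is_rtl, calendar):
    cal_rtl = calendar_is_rtl(calendar, ui_is_rtl)
    votes = (locale_is_rtl, ui_is_rtl, cal_rtl)
    if all(votes):                # rule 1: everything is RTL
        return DATE_RTLREADING
    if any(votes):                # rule 2: mixed, so the calendar decides
        return DATE_RTLREADING if cal_rtl else DATE_LTRREADING
    return DATE_LTRREADING        # rule 3: nothing is RTL

print(hex(pick_reading_flag(True, True, CAL_HEBREW)))      # 0x20
print(hex(pick_reading_flag(True, False, CAL_GREGORIAN)))  # 0x10
print(hex(pick_reading_flag(False, False, CAL_HIJRI)))     # 0x20
```

Under these assumptions the answer always ends up matching the calendar's direction, which is another way of seeing why the logic turns out to be fairly easy.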
Now if I had an interview candidate who had some real NLS experience, this might have made a fascinating interview question, I think.
And figuring out the logic above would have made an excellent conceptual question for a lunch interview! :-)
In any case, GetDateFormat could almost certainly be smarter here.
It is very hard for the caller to know what to do, while most of the information is available to the function -- it is a shame it is not set up to do the work if it is asked to.... :-(
This post brought to you by U+200f, RIGHT-TO-LEFT MARK
James Brown asked in the microsoft.public.win32.programmer.international newsgroup:
I have a fully working Uniscribe wrapper which renders a line of Unicode text, using the low-level ScriptItemize /Layout/Shape/Place/TextOut calls. It's working pretty well (very well in fact) but there is still one area I am not happy with. For a regular string of "english" text (i.e. non-complex), ScriptItemize always breaks the string into individual words. For a long line of text, containing much white-space and punctuation, this can result in quite a number of SCRIPT_ITEMs being returned.
This results in a large number of calls to ScriptTextOut to render the text, which is where the problem is - because I am required to call ScriptTextOut for each "item-run" in the text, this results in a fairly slow mode of operation - a lot slower than calling ExtTextOut for the whole line for example. It's not that ScriptTextOut itself is slow, it is just the sheer number of calls to the OS that is causing the problem.
So my idea is as follows:
After Shaping, all of the returned glyph-data for every item-run in the string is stored consecutively in a large buffer. Ordinarily I isolate each run in this buffer and draw the runs individually with ScriptTextOut.
However for a "simple" string of text (i.e one that ScriptIsComplex recognizes as such), I am proposing to pass the entire buffer of glyph/widths etc to ScriptTextOut in one go - so even if there was 30 runs of text, I would just treat this as one run and call ScriptTextOut just once - in essence, recombining all script-items into one single unit.
Assuming for the moment that I am using just one font, does anyone see any problem in this approach? The only issue I can see is specifying a correct SCRIPT_ANALYSIS structure (there is a unique structure per run so which would I specify?)
I have seen hints that maybe ScriptTextOut performs some trickery prior to calling ExtTextOut (for complex scripts) and that combining runs prior to calling it would be bad.....but for regular english text (code-points < 255 for example) would this be ok?
I have tested this method, and it does seem to work - and it is *much* faster this way... it would be nice for a Microsoft uniscribe/typography rep to comment on this approach.
The method itself should be sound; this type of use of ScriptIsComplex is very similar to the method that LPK.DLL (discussed previously) uses to determine whether to forward text rendering calls to Uniscribe or not.
(Of course in the case of LPK.DLL, Uniscribe is not called in the non-complex case, ExtTextOutW is; there may be a performance benefit to doing this since ScriptTextOut must eventually call ExtTextOutW to do the actual rendering -- so eliminating the extra overhead may be to everyone's advantage.)
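The shape of James's idea, as I understand it, is roughly the following. This is only a sketch of the bookkeeping in Python, not Uniscribe code: GlyphRun and plan_draw_calls are hypothetical names, the is_complex value stands in for whatever ScriptIsComplex reported, and it assumes a single font with every run laid out left to right.

```python
from dataclasses import dataclass, field

@dataclass
class GlyphRun:
    glyphs: list = field(default_factory=list)    # glyph indices after shaping
    advances: list = field(default_factory=list)  # advance widths after placing

def plan_draw_calls(runs, is_complex):
    """Return the runs to hand to the drawing call, one call per entry.

    Complex text keeps one entry per item run; simple text is merged
    into a single run so that only one drawing call is needed."""
    if is_complex or len(runs) <= 1:
        return list(runs)
    merged = GlyphRun()
    for run in runs:
        merged.glyphs.extend(run.glyphs)
        merged.advances.extend(run.advances)
    return [merged]

runs = [GlyphRun([43, 72], [9, 7]), GlyphRun([3], [4]), GlyphRun([90], [11])]
print(len(plan_draw_calls(runs, is_complex=False)))  # 1
print(len(plan_draw_calls(runs, is_complex=True)))   # 3
```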
Now I am not entirely clear on why non-complex text would be broken into separate runs (especially text for which ScriptIsComplex returns FALSE), so I will probably try to dig a little deeper on that point.
Does anyone have any theories? :-)
This post brought to you by "ཛྷ" (U+0f5c, a.k.a. TIBETAN LETTER DZHA)
Ever since that first post about ELKs (Lions and tigers and bearsELKs, Oh my!) there has been a lot of interest in this mechanism. But that is not the only thing that generates interest.
For example, Dieter asked the following question in the microsoft.public.win32.programmer.international newsgroup:
I am passing 'hr-ba' to the constructor of .NET Framework 2.0 class CultureInfo. Works fine in XP 2000, fails in Windows Server 2003 with message:
Culture name 'hr-ba' is not supported.
Parameter name: name
I have noticed that hr-ba is not in the list of supported cultures of CultureInfo documentation, but in blog entry http://blogs.msdn.com/shawnste/archive/2005/12/06/500675.aspx I have read that the CultureInfo is built up from os infos.
Any suggestions what I have to make it work on Windows Server 2003? Is there a setup available that installs that additional cultures available in XP Prof. and not in Windows Server 2003?
It is important to remember that these 'Windows only' CultureInfo objects are simply not designed to work on all platforms. They just aren't.
Now if someone who was using the exciting new Welsh locale in XP SP2 was not able to see it in their managed applications, they would be complaining pretty loudly.
Of course now there will obviously be people unhappy with the fact that these new cultures will not be on all platforms.
I think the problem is that developers like to complain. I know, because that is what I used to do anytime I wasn't getting what I wanted. :-)
(Maybe that isn't even as past tense as I like to think!)
Now it is true that there is no mechanism to add locales to a platform that does not have them other than the ELKs (which are not currently available for Server 2003, as I mentioned in ELKs aren't roaming where the servers are). And while the one thing I have learned since I started working for Microsoft is that one can never say never, there is obviously no way that Dieter's immediate need will be met directly. Even if everything changed.
But don't worry, there is a definite workaround here to get things up and running.
After all, there is certainly a mechanism to add cultures to the 2.0 version of the .NET Framework -- CUSTOM CULTURES!
That's right, you can create a custom culture based on the Windows only one, save it as LDML, copy that file to the other machine (Server 2003, Windows 2000, even Windows 98/ME), and register it.
Dieter tried it out and responded back with:
Thanks for the hint with the LDML file. It worked for me.
And now, with that same format now being supported in new versions of Windows, the ability to level the playing field will only become easier in the future, for both managed and unmanaged code.
This post brought to you by "ᠽ" (U+183d, a.k.a. MONGOLIAN LETTER ZA)
(No, the title of this post does not contain a typo!)
I have a regular reader of this blog who is a 12 year old young man named Dean.
He has an interesting take on my post There is no such thing as a surrogate character (dammit!).
Although he did not really follow all of the Unicode take on the evilness of the term "surrogate character," he pointed out that the real problem was not that "there is no such thing as a surrogate character" at all.
He suggested that we should allow people to call these characters that are made up of two surrogate code units by a simple term:
A SURROGATES CHARACTER
(the emphasis is mine)
When he first suggested it, I went back through previous mails from Dean that convinced me his age claim was genuine (up to and including his delight that I used the word dammit in a post title!).
It struck me as a much more brilliant compromise that more accurately resolves the problem of the natural tendency people seem to have to call these entities "surrogate characters" by shifting the battlefield in such a way that the language mavens, the grammar police, and the wordinistas can start battling for us!
And to be honest, Dean suggested that some of these mavens could perhaps help the cause, citing this post and several Language Log posts on the language maven issue.
Why not have these busybodies do some work for us, just for a change? :-)
Clearly there are two surrogate code units there, so calling the two of them a surrogate character is an obvious pluralization mismatch.
What do you think?
In my opinion, a touchdown (with the extra point), a field goal, and a safety for Dean, 12 points that the Seahawks could have used to win the Super Bowl yesterday! :-(
This post brought to you by "𐠠" (U+10820, a.k.a. U+d802 U+dc20, a.k.a. CYPRIOT SYLLABLE PI, a proud surrogates character!)
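For anyone who wants to check the arithmetic behind that sign-off, here is a minimal sketch (in Python, with a helper name that is purely my own invention) of how one supplementary code point turns into two surrogate code units under the standard UTF-16 encoding rules:

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point (U+10000..U+10FFFF) into two UTF-16 code units.

    Hypothetical helper for illustration only; it just applies the UTF-16 encoding rule.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary code points need a surrogate pair")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits go into the low surrogate
    return high, low


high, low = to_surrogate_pair(0x10820)  # CYPRIOT SYLLABLE PI
print(f"U+{high:04x} U+{low:04x}")      # prints "U+d802 U+dc20"
```

One character, two surrogate code units: a surrogates character, just as Dean said.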
Leave it to me to have professional blog post titles!
It all started yesterday when I posted Keyboards: Monolingual or Multilingual?
Now in that post, I focused on a few of the challenges that the champions of a multilingual keyboard face when they aim for acceptance of their keyboard.
(and let me stop for a moment to point out that when I say multilingual keyboard I mean a single layout that supports multiple languages, as opposed to using multiple layouts with language switching)
But I did skip over talking about what makes up a good keyboard layout for a single language, and the consequences for multilingual keyboards that fall out of that omission.
I thought that today (and probably later this week as well!) I would dig into these issues a bit more deeply. :-)
To better understand the issues, we will go down into the asylum. The asylum for the linguistically insane, to talk to Hannibal Chomsky (the linguist who served failing grad students to his post-doctoral fellows) and get his thoughts on the issue:
Michael: You were telling me the truth about keyboards before, sir. Please continue now.
Hannibal: I've read your blog. Have you? Everything you need to create a good keyboard is right there in those blog pages.
Michael: Then tell me how!
Hannibal: First principles, Michael. Simplicity. Read Marcus Aurelius. Of each particular thing, ask what it is in itself? What is its nature? What does it do, this keyboard you seek?
Michael: It lets you type the documents you want to...
Hannibal: (Hannibal interrupts) No, that is incidental. What is the first thing it does? What need does it serve by letting you type the documents you want to?
Michael: I don't know, acceptance? Keeping me from throwing the monitor at the wall?
Hannibal: No. It expresses language. That is its nature. And how do we begin to express language, Michael? Do we seek out the letters to type? (pause) Make an effort to answer now, Michael.
Michael: No, we just...
Hannibal: No, precisely. We begin by using the letters we use every day. Don't you see the fingers of other people moving over their keyboards to express language? I hardly see how you couldn't. And don't you feel your fingers typing quickly over the keys to express what is on your mind?
You will tell me if those keys stop screaming, won't you Michael?
Ah, we begin to see the issue (though at the cost of letting a madman into our heads? Luckily for us he is fictional!)
A good keyboard must be set up so that the letters that you will most frequently use are placed so that they are easy to hit, with less common letters/numbers/symbols further away.
And since every language is likely to have different usage frequencies of different letters, even the simplest of monolingual keyboard layouts will (if designed well) be different from the simple layouts of other languages.
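If you want to see how different those usage frequencies are for yourself, a rough sketch like the following (Python; the function name and the sample sentence are my own inventions, not any real layout-design tool) will rank the letters in whatever text you feed it:

```python
from collections import Counter
from typing import List, Tuple


def letter_frequencies(text: str) -> List[Tuple[str, float]]:
    """Return (letter, share) pairs, most frequent first, ignoring case and non-letters."""
    counts = Counter(ch.lower() for ch in text if ch.isalpha())
    total = sum(counts.values())
    return [(letter, n / total) for letter, n in counts.most_common()]


# Hypothetical sample; a real study would use a large corpus for each language.
sample = "A good keyboard keeps the most frequently used letters within easy reach."
for letter, share in letter_frequencies(sample)[:5]:
    print(f"{letter}: {share:.1%}")
```

Run it over a decent-sized body of text in two different languages and the top of the two lists will usually not match, which is the whole point.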
And if you do want to add additional letters to cover other languages, those letters must be placed in a way that is harmonious with the simple design of support for your primary language -- the key assignments would ideally be placed in such a way that they are accessible enough to type them when you want to while far enough out of the way that they do not make typos more destructive and more of a pain to fix.
So why do I say in the title of the post that in my estimation, most keyboard layouts suck? Well, mainly because most are not designed with these principles in mind, instead often being based on other principles, and other agendas.
Those principles and agendas may serve needs, and those needs may even be valid ones. But they do not always serve ideal use of a keyboard layout to be able to express through one's typing the words, the language that one's brain is producing.
To give an example, the mere fact that there is no one rule for how often every person in Canada may use English, French, and other languages is a problem that makes the Canadian Multilingual Keyboard Layout less usable for many of the people for whom it was created.
Some can find it ideal, most will not; this turns out to be true even before you add the poor interoperability with applications like Microsoft Word due to its use of the RIGHT and LEFT 'Control' shift states.
It also turns out to be true before people realize that applications like Microsoft Word cannot tag language appropriately, causing the spellcheck experience to be so not the ideal one....
Now I would not expect either this post or the one from yesterday to sway the people who truly believe that a multilingual keyboard layout is the best thing for a particular community of users.
But I hope it will maybe at least start to convince people who are not actively trying to add such a layout why it may be in the best interests of people to not have only multilingual keyboard layouts as their default options for their locales!
This post brought to you by "ೠ" (U+0ce0, a.k.a. KANNADA LETTER VOCALIC RR)
The other day, when I talked about how I was Approaching linguiticalishnessality, a comment from Thierry Fontanelle of the MS Speech and Natural Language group pointed out that the quarterly symposium in computational linguistics held by Microsoft and the University of Washington was about to happen.
It did indeed happen this last Friday, and there were two fascinating talks, one on Unsupervised Acquisition of Ateso Morphology and the other on Locating, Recognizing, and Converting Interlinear Text on the Web.
I thought I'd say a few words about each. :-)
The second talk, given by William Lewis (a Visiting Assistant Professor, Dept. of Linguistics, UW), talked about a very interesting project trying to create a searchable database of interlinear text, a common format for linguistic samples. An example of such text (this one borrowed from the excellent talk I saw the week before from Rachel Hastings) is:
Llama-kuna urqu-pi ka-n
llama-PL mountain-LOC be-3sg
'There are llamas in the mountains'
Linguists are likely used to seeing this format known as Interlinear Glossed Text (IGT). The 'gloss' refers to that middle line, the one with the tags.
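To show how the source line and the gloss line pair up, here is a small sketch (Python; the function name is hypothetical, and it assumes the simple case where words are separated by spaces and morphemes by hyphens, which real-world IGT often violates):

```python
from typing import List, Tuple


def align_igt(source_line: str, gloss_line: str) -> List[List[Tuple[str, str]]]:
    """Pair each morpheme in the source line with its tag in the gloss line."""
    words = source_line.split()
    glosses = gloss_line.split()
    if len(words) != len(glosses):
        raise ValueError("source and gloss lines have different word counts")
    aligned = []
    for word, gloss in zip(words, glosses):
        morphemes = word.split("-")
        tags = gloss.split("-")
        if len(morphemes) != len(tags):
            raise ValueError(f"cannot align {word!r} with {gloss!r}")
        aligned.append(list(zip(morphemes, tags)))
    return aligned


for word in align_igt("Llama-kuna urqu-pi ka-n", "llama-PL mountain-LOC be-3sg"):
    print(word)
# [('Llama', 'llama'), ('kuna', 'PL')]
# [('urqu', 'mountain'), ('pi', 'LOC')]
# [('ka', 'be'), ('n', '3sg')]
```

The two ValueError cases are exactly the sort of thing that happens when someone lines the text up with the space bar and the alignment gets lost along the way.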
It is obviously not so easy to simply use an MSN Search or a Google query to find the large number of them available on the internet, so the ODIN (Online Database of INterlinear text) project is an attempt to bring in methodologies to find these IGT examples, get their language based on surrounding text, and catalog them so they can be easily searched later.
ODIN has at this point a very low rate of false positives when detecting IGT it can catalog, but at a high cost in terms of false negatives (i.e. many valid cases are thrown away in order to be certain that the detected cases are definitely valid).
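In information retrieval terms that trade-off is high precision bought with low recall. A quick sketch of the two measures follows (Python; the counts are numbers I made up purely for illustration and are NOT ODIN's actual results):

```python
from typing import Tuple


def precision_recall(true_positives: int, false_positives: int,
                     false_negatives: int) -> Tuple[float, float]:
    """Precision: share of detections that are valid. Recall: share of valid cases detected."""
    detected = true_positives + false_positives
    valid = true_positives + false_negatives
    precision = true_positives / detected if detected else 0.0
    recall = true_positives / valid if valid else 0.0
    return precision, recall


# Hypothetical counts, invented for this example only.
precision, recall = precision_recall(true_positives=95, false_positives=5,
                                     false_negatives=300)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.95 recall=0.24
```

A detector tuned this way is almost never wrong about what it keeps, but it keeps only a fraction of what is actually out there.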
One thing that I found to be very interesting (beyond the general negative feelings about what the PDF format has done to make such searches difficult -- more on this another day!) was the great lengths that the project has to go to in order to find what is obviously a recognizably 'standard' way to represent information, probably due to the fact that there is no widely used, standard way of producing them (beyond creative use of the space bar to line up text, I mean).
It struck me that there should likely be features within products like Word that would make these things easier to regularize. In fact, with Word in Office 12 supporting PDF, there may be an awesome opportunity to try to make many of these issues easier to solve.
The first talk, given by Manuela Noske, a Software Localization Engineer for Windows, described a fascinating attempt to take a corpus of ~460,000 alphanumeric tokens of Ateso, an Eastern Nilotic language, and analyze its morphology using Linguistica v.2.0.4.
Manuela was honest about the fact that the results were not all that had been hoped for, in large part due to the lack of standardization of much of the language in terms of case markers, spelling, and several other morphological and phonemic issues. This is especially interesting given the huge efforts being made in Uganda to get the language online -- the lack of standardization would seemingly hinder efforts to understand Ateso usage more than a little bit!
Since most of the corpus is actually made up of Ateso periodicals, factors such as periodical styles or even author preferences cannot be discounted, and after the talk when I spoke with Manuela she suggested that these issues are definitely possible avenues to get around the limitations that Linguistica v.2.0.4 showed for such a language (more information from native speakers would also undoubtedly help in weighing the importance of the differences).
During Q&A after the talk, one person pointed out how items like spell checkers often had a significant effect on such problems when they are widely used, as they actually enforce a standard on a language that is clearly struggling through such variations.
That idea scares me a little, since the thought of a spell checker having such an influence is scary for any mistake it might contain. I mean, it is an awesome responsibility to know that a mistake will lead to improperly spelled words in school book reports, but that is nothing compared to the effect a mistake could have on a language struggling to find its own proper usage!
In any case, both talks were very interesting and I got to have several conversations after the talks, too. I don't know that I will be able to go to every talk that happens, but I will definitely try to attend them when I can. :-)
This post brought to you by "ƀ" (U+0180, a.k.a. LATIN SMALL LETTER B WITH STROKE)
The importance of having multilingual keyboards is often a controversial topic.
I have often been amazed to watch people who feel intensely that it is important for their keyboard to support multiple languages, and compared them to others who expressed an analogous intensity that it was improper for a keyboard purporting to support a particular language to actually support many of them.
This is indeed a debate I have seen both internally and externally as well, especially for locations that either support many different languages or have a potential need to handle multiple languages (e.g. in large immigrant populations or neighboring countries).
The multilingual proponents often face an uphill battle for several reasons:
One reason is the simple fact that for the average user the keyboard will not be intuitive -- and the interest to overcome that and learn a keyboard is directly proportional to their need to support other languages. Otherwise, people will just consider it a bug that the possibility of creating typos is so greatly increased! How can you argue the importance of something
Another difficulty in the argument for multilingual keyboards is that sometimes they are not yet something the customers of that language are asking for. And indeed it may or may not be in the future, but Microsoft is put in a pretty weird position that we try our best to avoid when we are asked to take sides on such an issue.
I guess what is most important to recognize here is that this is an argument where no one involved with it is actually wrong except when they claim to be speaking for everyone else.
(the fact that there is a disagreement proves that part, at least!)
The best we can often say is that the evidence is not all in yet.... :-)
This often becomes a scenario that is best served by MSKLC (Microsoft Keyboard Layout Creator) until and unless the actual need is determined (since once we add a keyboard we are often stuck with it even if its proponents abandon the idea).
Steve Clayton talked about the fact that hotels have moved to the common phrase "Have you stayed with us before, sir?"
When I first read his post I was swept into all of the information that they could have which they could use to customize the experience for regular customers.
And, like him, I was annoyed at the fact that they even bothered to ask if they are not going to really do what they could be doing for customers.
Luckily, at that point Steve took the argument off the rails, at least from my ludditian point of view. :-)
He started talking about CRM and Media Center, which stopped me for a moment. And I started thinking about alternate reasons why they might be asking the post title question that don't relate to the service experience....
Now I know I am not the only person who was chilled to see Tom Cruise walk into stores and have computers greet him and ask about clothing he had bought previously. And I am sure that there are people who feel uneasy about the full profiles that search engines and eBay and Amazon.com keep on what people have done on the site in the past.
So one good reason to ask us how we are doing would be that it is a nice way to be polite without seeming like stalkers and freaking out their patrons. When you meet strangers who somehow know things about you, it can be upsetting. So why take the risk?
In a similar vein, hotel maids will almost without fail shift things around a bit in the bathroom when they clean the room. Even if everything was already straightened up. Even to the point of moving where they put things if no one else moved them.
It used to upset me and seem very like busybody type stuff, until once when this did not happen, and my first thought was that maybe they had not cleaned up in there!
It made me realize there is a good reason to move stuff around -- because if you think they do not take care of the room, you may not complain -- but you may also not come back. It is a little ping so that they know you get the signal if you are listening for it.
Then another hypothesis came to me -- I thought about how servers in restaurants do the same thing. And I know a bit about some of the reasons for them to do this sort of thing -- so that they know whether they need to give you the shpiel for the signature dishes or not.
Asking whether you have been there before is how you get an easy way to not hear about all the stuff that you may not care about, even if you have never been there before. It is your chance to opt out of the shpiel.
The more I think about it, the more I realize that I actually DO say I have been to the hotel even if I have not. Mainly to avoid a lengthy conversation when I just want to get to my room and lie down (travelling can be tiring!).
It makes me appreciate the fact that the hotel staff gives me this feature -- and more importantly it makes me dread the day that all of these "service opportunities" come to fruition and I no longer even get the chance to pretend that every person I buy something from does not have my life story at their fingertips.
Any way you look at it, enough people are excited at the opportunities that those with less noble and more economic motives will likely want to get involved. And it is a short distance from there to misuse of such "opportunities."
Paranoid? Well, as Randy Quaid mentioned in The Paper, he only got so paranoid when they started plotting against him.
And it looks like those who are plotting have started gaining momentum....
The truism in the title of this post seems fairly obvious. Though in several situations this week, I have had to point out this fact when I explained why the answer they got turned out to not be the one they wanted....
A lot of it comes down to the use of the lstrcmpi function.
(As an aside, note that this KB article is simply wrong for every 32-bit version of Windows. I will have to talk about why I think the MSKB is sometimes a priceless asset to be treasured and other times a pariah to be shunned!)
Now the fact is that lstrcmpi is a wrapper around CompareString, which regular readers here know is in Windows for the purpose of doing linguistically meaningful comparisons. And you can fry a linguist in butter and they'd still be a linguist¹ -- so no mere wrapper function is going to change the fundamental purpose....
So, if you are a developer thinking about case-insensitive items like FAT/FAT32/NTFS files or the registry or OS objects like events/mutexes/etc., and you think that lstrcmpi looks like the perfect function to use for comparing two such items, what would I say?
Well, I promise I would not call you an idiot². But if pressed I might have to call a design document that had such an implementation plan idiotic. And I would really try to convince you to fix your code. :-)
Now here is where we get back to the title of this post.
Because if you are calling lstrcmpi for appropriate reasons (i.e. you wanted to get linguistically meaningful results, say in the sorting of a list in a user interface) but you wanted to have behavior that did not vary with different locales, then CompareString with LOCALE_INVARIANT is a good answer.
But if you wanted almost anything else, including all of the non-linguistic purposes hinted at earlier, then CompareStringOrdinal or RtlCompareUnicodeString is a much better choice.
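To make the difference concrete without a Windows box handy, here is a rough Python sketch. It is only a stand-in for the idea and does not call lstrcmpi, CompareString, or any other Windows API; ordinal_compare is a hypothetical helper of my own that simply compares UTF-16 code units with no linguistic knowledge at all:

```python
def ordinal_compare(a: str, b: str) -> int:
    """Return -1, 0 or 1 by comparing UTF-16 code units, ignoring all linguistic rules."""
    # Big-endian bytes sort in the same order as the 16-bit code units they encode.
    ua = a.encode("utf-16-be", "surrogatepass")
    ub = b.encode("utf-16-be", "surrogatepass")
    return (ua > ub) - (ua < ub)


# Ordinal order is about numbers, not language: 'a' (0x61) sorts after 'B' (0x42).
print(ordinal_compare("a", "B"))        # 1
print(ordinal_compare("file", "FILE"))  # 1 (not equal, and no case folding applied)

# Language-aware case mapping can surprise you: dotless i (U+0131) uppercases to 'I',
# so an "uppercase both sides" comparison treats these two different names as equal.
print("f\u0131le".upper() == "FILE")    # True
```

That last line is the sort of "equal" that makes sense to a linguist and makes no sense at all for a file name or a registry key.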
Maybe the fact that these non-linguistic functions are so much FASTER might have some influence on people. Or the fact that if people are making security decisions based on the results they could be crafting their own security bugs? I mean, both performance and security are "sexier" than international to a lot of these people. :-)
So, how someone asks the question (especially keeping in mind the fact that code already calling lstrcmpi implies a specific, requested usage) can have serious impact on the answer I would give.
Though to be honest, for the last few years my cynical side has decided to assume that the original code was probably wrong, and that neither the person who wrote the code nor the person asking me the question has really thought through the scenario.
So perhaps now I would say that the answer is dependent on the cynicism of the person being asked? :-)
One of my presentations at the 29th Internationalization and Unicode Conference (the one entitled Tales of Incorrect String Comparisons) actually talks about this issue and several other collation-type problems. I would highly recommend it to anyone who finds this type of thing to be interesting. Several cool demos, etc. :-)
1 - Apologies to Martin Cruz Smith for the slightly munged quote!
2 - Well, not unless you decided lstrcmpi was the function to use even after the problem was explained to you, and even then probably not to your face. :-)
This post brought to you by "İ" (U+0130, a.k.a. LATIN CAPITAL LETTER I WITH DOT ABOVE)