Postings are provided as is with no warranties, and confer no rights. Opinions expressed here are my own delusions; my employers at best shake their heads and sigh, at worst repudiate the content with extreme prejudice, whenever it manages to appear on their radar.
This blog is unsuitable for overly sensitive persons with low self-esteem and/or no sense of humour. Proceed at your own risk. Use as directed. Do not spray directly into eyes. Caution: filling may be hot. Do not give to children under 60 years of age. Not labeled for individual sale. Do not read 'natas teews ym' backwards. Objects in mirror are closer than they appear. Chew before swallowing. Do not bend, fold, spindle or mutilate. Do not take orally unless directed by a physician. Remove baby before folding stroller. Not for use on unexplained calf pain.
A nice FLAIR (FLuid Attenuated Inversion Recovery) view from the not-too-distant past. Every abnormality you can see on this scan (and there is more than one!) is asymptomatic at present. Alongside is a picture of me walking the walls at Fremont Studios, a sign of a damaged brain.
WARNING: No technical content, though I am technically a little pissed
A few months ago, I was in my scooter riding some of my recyclables over to the large bins in my apartment complex. It is easy to scoot right over there, recyclables in the back basket of the scooter, and dump them in there.
The space right next to the bins is a handicapped spot. I know because back before I had the scooter I used to drive the stuff over, and I always loved that I could park there when I dumped the stuff (yes, I do have the proper state-issued disabled parking placard).
Anyway, there was a car parked there, tightly enough that I could not even squeeze myself in to get to the bin, let alone the scooter. Not the end of the world, but I noticed that the person had no disabled parking placard or license plate. Same thing for the car next to them (which was not blocking me, but it was yet another handicapped space being illegally parked in).
I'm generally not one to raise a fuss (I'm still getting used to the whole disability thing), so I turned around and figured I'd try my luck another day.
Time passes, and every time I come home and look over there, I see that these folks are still parked there. And it is the same cars. I am starting to get mad.
Did I mention that the two cars that seem to be the regular offenders here both have current Microsoft parking permits? Don't even get me started on how obnoxious I think that is.
Anyway, I finally scoot down to the main office for the complex. I pick up the mail, then mention to the person there that there are people who are parking in the handicapped parking spots every day. Should I complain to them or to the Redmond police directly?
She asks me whether there are signs posted.
I have no earthly idea, so I scoot out there and look. No signs, just the pavement nicely painted with the little wheelchair dude. I scoot back and pass this on.
She explains to me that the law in Washington is such that if there is no sign then the police can neither ticket nor tow. They cannot even give a warning.
Hmmmm.
I checked later, and she was right. In RCW 46.61.581:
A parking space or stall for a disabled person shall be indicated by a vertical sign, between thirty-six and eighty-four inches off the ground, with the international symbol of access, whose colors are white on a blue background, described under RCW 70.92.120 and the notice "State disabled parking permit required." Failure of the person owning or controlling the property where required parking spaces are located to erect and maintain the sign is a class 2 civil infraction under chapter 7.80 RCW for each parking space that should be so designated. The person owning or controlling the property where the required parking spaces are located shall ensure that the parking spaces are not blocked or made inaccessible, and failure to do so is a class 2 civil infraction.
That's all it says -- no sign, then it's not a handicapped spot.
So I ask this person at the office whether these spaces are being repurposed, and she assures me they are not. It's just that the signs are expensive.
Double hmmmm.
At this point I start to complain a bit more and point out the fact that there are actually laws about this sort of thing (there are -- RCW 70.92.140, for example).
They sober up quickly enough on that one and finally agree to get a couple of signs (one outside where my apartment is, one next to the recycle bin), leaving the other three "faux handicapped" spaces in their FAUX state, since signs are expensive.
One of those two spaces where the two cars used to park still has a car in it every day. It seems to be one of those same two cars (I guess they are fighting for "their" space). Maybe I should start posting license plates and descriptions. Or maybe I should consider the handicap of their lack of decency as one that entitles them to park there? I hope they are contingent staff and their contracts don't get renewed. Most people at Microsoft are a lot nicer.
I realize that by posting this I may be enabling a whole new crowd of jerks to start taking advantage of this particular loophole, but I hope that people can perhaps be a little better than that. They can rise above their technical right to park in a parking space and see that it is marked on the pavement as a handicapped spot. And at least try to respect that....
There is currently no HANDICAPPED SYMBOL in Unicode, though it would have loved to have been the sponsor.
Not too long ago, I posted (in Lions and tigers and bearsELKs, Oh my!) about ELKs. And about how I would talk more about what is involved in "getting out of the way". This is another post on that theme...
Review of our story so far:
In Fall 2004, Cathy Wissink and I were in San Jose at the Unicode Technical Committee meeting (being held at Apple) along with 20+ of our colleagues from various companies involved with internationalization. We spoke at the IMUG (International Mac User's Group) meeting one evening, giving a much longer version of the talk that has been done before at both prior Internationalization and Unicode Conferences and at the Microsoft Global Development & Deployment Conference. Things were a little bit closer to shipping so more could be said, and since we were given more time we were definitely allowed to say more.
The title of the talk? Windows for the Rest of the World -- Customizing Windows for Emerging Markets. This post will contain a few more slides of the content from that talk. :-)
Now ELKs were not written just for the sake of adding locales for the fun of it. They were added to help a specific, important scenario.
You see, no matter how useful additional locale support may be, it still forces you to use Windows localized into a language other than the one in which you may feel most comfortable. And the real scenario here that drove the ELK project was a desire to allow localized versions to be released more quickly, in some cases localizations into languages that were not yet supported on Windows (and the ELKs were in this context a fancy way of making sure that support would be present).
Now remember how I bemoaned the expense of obtaining locale data? Well, the expense of localization blows that number out of the water.
It actually reminds me of an old Dilbert cartoon, which was in newspapers back in May of 1999 (to my knowledge it is the only Dilbert cartoon involving internationalization or localization outside of the funny stories involving the Elbonians). The lines in this strip (sorry, no permissions to repost the original comic!) go like this:
Pointy-haired boss: Tina, I want you to write the Chinese version of our product's instructions.
Tina: Can you tell the difference between Chinese words and random scribbles?
Pointy-haired boss: No.
Tina: I'll be done in five minutes.
Tina is right -- bad localization is cheap and fast. But you also won't find many people buying it.
Good software localization is hugely expensive, and involves expertise that is hard to find and keep -- knowledge of the source language, the target language, and most importantly the technology and how it is described in both languages. It is not mere translation, because the work to make people feel comfortable with the product also has to be done if you want users to be happy with the product.
So how does the idea of getting localization done faster and for more languages get reconciled with these difficulties?
Introducing, the Language Interface Pack or LIP.
And this is all very important, because there is a huge demand for more localizations, delivered more quickly. Partnering with third parties allows for cheaper, faster, and culturally accurate localizations, and most importantly, by localizing "the 20% of strings seen 80% of the time" Microsoft is able to deliver them without sacrificing quality.
At the time of this post, there are LIPs for 19 languages (when we did the presentation, there were only 4!):
For more information on LIPs, including what is localized and more, see the LIP FAQ, brought to you by Microsoft's own Dr. International!
Looking back at the earlier post about ELKs, there is a temptation to talk about how some LIPs did not need ELKs (such as Romanian), how other ELKs have LIPs (compare the two lists, you will see many similarities), and how there are some LIP-less ELKs (again, compare the lists). Think of what those ELKs save on ChapStick!
Of course you never know what might be a LIP tomorrow. Who knows when Microsoft may give you all some LIP?
The goal for both ELKs and LIPs is to support more people with more languages and to constantly work to broaden the number of people who can enjoy Windows.
And yes, there is more that was said that night, and more to say here in general. For another time....
This post brought to you by "æ" (U+00e6, a.k.a. LATIN SMALL LETTER AE)
A letter that itself plans to be the star of a future posting on Sorting It All Out....
There is a great deal of confusion surrounding the meaning of these two different things in the .NET Framework, and when to use each. If you have suffered, are suffering, or think you may suffer in the future from such a confusion, then read on!
(Otherwise, I guess you can go away and come back another time)
The invariant culture's direct ancestor is the invariant locale. Officially added to the Windows source tree at 10:23am on May 12, 2001, its intention was not to be used as an actual locale (which would explain why no locale data was added until a month later; until then no one was using it in GetLocaleInfo!).
Originally, LOCALE_INVARIANT had just one noble purpose -- to allow one to use CompareString (and LCMapString with the LCMAP_SORTKEY flag) in a way that would only use the "Default" Windows sorting table as mentioned a little bit here and especially here. The results, as that second article mentioned, would not vary when the user or system locale settings did; they would be invariant within that installation of Windows.
The data was added for this locale a month later, as I said, for obvious reasons -- if you have an LCID that one function considers to be valid, you must have a very good reason if another will not. And it cannot duplicate any other locale, either. Much weird data was added so that no one would be tempted to try to act like they spoke a language called "Invariant" and then all was good.
Note that these string comparisons still had much linguistic value -- half of the locales in Windows use that default table, so an invariant sort would not only avoid varying, it would also look right to a lot of the world.
The .NET Framework had similar requirements (with the additional need for invariant parsing/formatting support) and thus CultureInfo.InvariantCulture was created. As with the locale, any string comparisons made with InvariantCulture's CompareInfo object would have linguistic validity in a lot of places, and would not vary within that installation of the .NET Framework.
So everyone had what they needed, right?
Well, no.
A bunch of people wanted a method of doing a more binary type of comparison, instead of one that would be based on the "linguistically appropriate" approach given a particular culture1.
The difference between what we had and what they wanted was akin to the difference between the C Runtime's strcoll/wcscoll versus strcmp/wcscmp (in the CRT documentation they refer to the difference as being locale based versus lexicographic).
The other advantage to such a "lexicographic" comparison is that it would be faster since a simple binary comparison of the code point values was being used.
To meet this need, the notion of an Ordinal sort was added and an Ordinal member was added to the CompareOptions enumeration. Selecting it would ignore all of those cultural collation features and give you a binary sort that would also, incidentally, not vary.
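To make the distinction concrete, here is a minimal Python sketch of what an ordinal comparison amounts to. It is not the .NET implementation; it assumes that comparing big-endian UTF-16 bytes is a fair stand-in for comparing UTF-16 code units, and the function name ordinal_compare and the sample words are made up for illustration.

```python
def ordinal_compare(a: str, b: str) -> int:
    """Compare by raw UTF-16 code unit values; no cultural rules at all.

    Returns -1, 0, or 1. Big-endian UTF-16 bytes compare in the same
    order as the 16-bit code units they encode.
    """
    key_a = a.encode("utf-16-be")
    key_b = b.encode("utf-16-be")
    return (key_a > key_b) - (key_a < key_b)


# Hypothetical sample data: uppercase ASCII sorts before lowercase ASCII,
# and accented letters sort after both, which is rarely what a person expects.
words = ["apple", "Zebra", "\u00e9clair", "banana"]
ordinal_sorted = sorted(words, key=lambda s: s.encode("utf-16-be"))
print(ordinal_sorted)
```

The result never varies with user or system settings, which is the whole point, but nobody would call the ordering linguistically appropriate.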
The only remaining problem at this point was that there were now two useful ways to do these different "niche" types of comparisons, but neither name really jumped out at the developers who were looking for such solutions.
That problem remains to this day, though every single time I speak at a conference or answer a question in a newsgroup or get someone to look at posts like this one, then there is at least one less developer who has this problem. Maybe this time it is you? :-)
Now the story does not end here; many people have wanted to do things in a case-insensitive way. Of course if you wanted a case-insensitive invariant comparison then you could have done that all along -- just use the InvariantCulture's CompareInfo methods with the CompareOptions.IgnoreCase flag passed in. Easy!
But some people wanted a case-insensitive ordinal comparison?!?
Now the closet linguist in me shudders at this concept since a casing operation is essentially a linguistic one while an ordinal one is specifically not -- it's lexicographic.
So people are asking for linguistic non-linguistic support, a request that for me brings to mind the comedian Steven Wright's dog2.
However, the technical half of me understands the need and so I got over my linguistic fetish as one of my colleagues on the BCL team worked in Whidbey to add a new OrdinalIgnoreCase member to the CompareOptions enumeration.
The behavior is basically to do the casing operation using the default casing tables prior to doing the binary comparison. This feature has been in the "Whidbey" version of the .NET Framework for some time (first checked into the source code tree on February 7, 2003), so you can try it out today if you have just about any build of Whidbey underfoot.
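The "case first, then compare binary" behavior can be sketched in a few lines of Python. This is only an approximation under stated assumptions: Python's str.upper() applies full Unicode case mappings, which is not necessarily what the default casing tables in the .NET Framework do, and the function name is hypothetical.

```python
def ordinal_ignore_case_compare(a: str, b: str) -> int:
    """Uppercase both strings, then compare them by UTF-16 code unit.

    Assumption: str.upper() stands in for the default casing tables.
    Returns -1, 0, or 1.
    """
    key_a = a.upper().encode("utf-16-be")
    key_b = b.upper().encode("utf-16-be")
    return (key_a > key_b) - (key_a < key_b)


# Strings that differ only by case compare as equal...
print(ordinal_ignore_case_compare("FILE.TXT", "file.txt"))
# ...and everything else still falls back to plain code point order.
print(ordinal_ignore_case_compare("apple", "Zebra"))
```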
Hopefully this post will help clear up some of the confusion about these two interesting comparison types.
1 - What can I say? Some people are Некультурные (uncultured) though not in the culturally offensive sense.
2 - Steven Wright claimed to have named his dog Stay so that he could call out "Come here, Stay! Come here, Stay!" and watch the dog walk toward him in a stuttery fashion.
This post brought to you by "Ω" (U+03a9, GREEK CAPITAL LETTER OMEGA)
I talked to Omega just before this post went live. She said that as the last letter in the Greek alphabet (who was pretty much always therefore last in the queue), she understood the cost of keeping letters in order. Any performance benefit is a good one, to her mind. Especially since a binary sort would let her come before her little sister (U+03c9, GREEK SMALL LETTER OMEGA) for once.
One thing people may notice right away when dealing with the command console is that the default ANSI code page (ACP) does not match the OEM code page (OEMCP) for most locales.
It all goes back to DOS (as many things do!)....
When there was DOS, the code page story was much more controlled by IBM than by Microsoft. Many "original" code pages came out of this time, from the interesting to the downright weird (yes, I am thinking of code page 437, G*!), though some of them may even predate IBM (not sure on this point).
Then, with Windows came the Windows code pages, engineered (if one can say that about data) to handle more languages at one time. Modeled after the ISO-8859 series but plugging in characters useful to more languages, for various markets (like basic support of French in the Arabic code page):
The idea was that DOS applications (not considered "legacy" then due to their prevalence) would have the same old code pages to fall back on, and Windows applications would have their own code pages to support more languages. They even added APIs to affect the base file system functions to work in one mode or the other, since the file system is the one thing both applications would have to access. And AreFileApisANSI, SetFileApisToOEM, and SetFileApisToANSI were born1.
So they are not the same for reasons of backwards compatibility.
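A small Python sketch shows why the two kinds of code page cannot be mixed freely. It uses Python's built-in cp1252 and cp437 codecs as stand-ins for a typical Western European ACP and the US OEMCP (an assumption about which pair you have; your locale may differ), and the sample word is arbitrary.

```python
text = "caf\u00e9"  # "café", with a precomposed e-acute

ansi_bytes = text.encode("cp1252")  # stand-in for the ANSI code page
oem_bytes = text.encode("cp437")    # stand-in for the OEM code page

# The ASCII letters get the same bytes; the accented letter does not.
print(ansi_bytes)
print(oem_bytes)

# Reading "ANSI" bytes as if they were "OEM" bytes garbles the text,
# which is the classic console mojibake.
misread = ansi_bytes.decode("cp437")
print(misread)
```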
Ah, you say -- but why are the ACP and the OEMCP the same for the CJK locales? We have:
And they act as both the ANSI code page and the OEM code page for most East Asian locales.
There are many reasons for this. One of the obvious reasons is that these four code pages, being originally based on specific standards (governmental or industrial), had no additional backcompat issue. And no one wanted to gratuitously start making up code pages.
One architectural reason that affects NT is the rules about code pages in kernel mode. APIs like RtlUnicodeStringToOemString and RtlUnicodeStringToAnsiString have some implicit assumptions in the architecture that the size of the string will never change if you move between the ANSI and the OEM code pages. And for the non-CJK code pages this is no problem since either the character is on the code page and it is one byte or the character is not and you will get a question mark (which is also one byte). But the CJK code pages could be two bytes versus one in many cases. That would have been really bad.
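The size assumption is easy to see with Python's codecs standing in for the Windows conversion tables (an assumption; this is not a call into the Rtl* functions, and encoded_length is a made-up helper). A single-byte code page always yields one byte per character, even when it has to fall back to a question mark, while a CJK code page such as 932 yields one or two.

```python
def encoded_length(ch: str, code_page: str) -> int:
    """Number of bytes one character occupies in the given code page.

    Characters missing from the code page become '?', mirroring the
    question-mark fallback described above.
    """
    return len(ch.encode(code_page, errors="replace"))


samples = [
    ("\u00e9", "cp1252"),  # e-acute: on the code page, one byte
    ("\u03a9", "cp1252"),  # Greek omega: not on the code page, '?' is one byte
    ("A", "cp932"),        # ASCII letter on the Japanese code page: one byte
    ("\u3042", "cp932"),   # hiragana A: two bytes
]
for ch, code_page in samples:
    print(f"U+{ord(ch):04X} in {code_page}: {encoded_length(ch, code_page)} byte(s)")
```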
(These functions also assume that every Unicode character is two bytes -- which they are, for ANSI and OEM code pages. And for those who are wondering, the functions in ntdll.dll do not have the question about precomposed versus composite Unicode -- they only support the precomposed form. And Julie takes no responsibility for them, though she did fix bugs at one point in the early days!)
Anyway, the EA code pages are relieved of needing two different code pages. Which is just as well since they have other things to worry about, such as functions like IsDBCSLeadByte....
Aren't you glad you are using Unicode and do not need to worry about any of this? :-)
1 - Don't get me started on a naming convention that leaves one kind of acronym (API) not fully capitalized while another kind (ANSI, OEM) is. Geez....
This post brought to you by "ì" (U+00ec, a.k.a. LATIN SMALL LETTER I WITH GRAVE)
Early last year, Raymond Chen talked about how Char.IsDigit matches more than just 0 through 9 and later last year I talked about Crossing the DIGITal divide. But in both cases the conversation is limited to digits, and not the wide world of numbers which includes a lot more than just different ways of saying 0123456789.
The distinction between digits and numbers in Unicode is an important one, since the formatting and parsing of numeric values is highly dependent on whether a number acts like the ASCII digits 0 - 9 or not.
Now the bulk of the modern number systems use the same Arabic-Indic system conventions to which software developers are accustomed, but others do exist, some of which still see use today.
As an example people can relate to, most of us are aware of the Roman numeral system where there is no Zero and you sometimes have to use a lot of addition and subtraction in a deterministic manner (such that any time a smaller number comes before a larger one, the smaller one is subtracted; otherwise if they are the same value or the larger one comes first, it is added). Thus Ⅰ is one, Ⅲ is three, Ⅳ is 4, Ⅴ is 5, and so on. Although it is not used too much, it is still commonly seen in the credits of movies and television shows for the copyright date (e.g. MCMLXXXIX for 1989). Many people who are not used to Roman numerals breathed a sigh of relief at the year 2000 since MM is so much easier to read....
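The add-or-subtract rule just described fits in a few lines of Python. This is a minimal sketch that works on the plain Latin letters rather than the dedicated Roman numeral characters, does no validation of malformed numerals, and uses a function name (roman_to_int) invented for the example.

```python
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(numeral: str) -> int:
    """Apply the rule above: subtract a value that precedes a larger one,
    otherwise add it."""
    total = 0
    for i, ch in enumerate(numeral):
        value = ROMAN_VALUES[ch]
        next_value = ROMAN_VALUES[numeral[i + 1]] if i + 1 < len(numeral) else 0
        if value < next_value:
            total -= value
        else:
            total += value
    return total


print(roman_to_int("MCMLXXXIX"))  # the copyright-date example from above
print(roman_to_int("MM"))
```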
It is of note that the Roman Numerals are encoded in Unicode even though they can all be represented as existing letters. The primary reason for this is that there are character properties associated with each encoded character, and these properties are used by many implementations of Unicode to get actual work done. Therefore, the letter V (U+0056, LATIN CAPITAL LETTER V) has a General Category of Lu (Letter, Uppercase) while Ⅴ (U+2164, ROMAN NUMERAL FIVE) has a General Category of Nl (Number, Letter).
And yes, even that claim falls apart a little since the hexadecimal digits ABCDEF are not separately encoded, for reasons of backwards compatibility with decades of existing practice on computers, which is not the case with Roman numerals. Even the argument for having encoded the Roman numerals is a little specious since for the most part they have not been encoded and when they are the style never seems to be consistent typographically. Though YMMV since you may have better fonts than I do! Try "ⅯⅭⅯⅬⅩⅩⅩⅨ" for the test....
All of this goes to show that Unicode is a very complex standard. In the end, Unicode can always do what it needs to do without fear of the occasional contradiction, since there will always be some precedent with which to be consistent. :-)
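If you want to look at these properties yourself, Python's standard unicodedata module exposes them. A small sketch follows; the values reported depend on the version of the Unicode data that ships with your Python, though the ones for these particular characters have been stable for a long time.

```python
import unicodedata

letter_v = "V"          # U+0056 LATIN CAPITAL LETTER V
roman_five = "\u2164"   # U+2164 ROMAN NUMERAL FIVE
tamil_three = "\u0be9"  # U+0BE9 TAMIL DIGIT THREE
tamil_ten = "\u0bf0"    # U+0BF0 TAMIL NUMBER TEN

for ch in (letter_v, roman_five, tamil_three, tamil_ten):
    print(
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),
        "digit" if ch.isdigit() else "not a digit",
        "numeric" if ch.isnumeric() else "not numeric",
    )

# A number that is not a digit still carries a numeric value.
print(unicodedata.numeric(roman_five))
```

The digit versus number split is exactly what a parser has to care about: only the characters that act like 0 - 9 can be dropped into positional notation.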
Ethiopic numbers are based on a different alternative system, one that can really wreak havoc with a formatting/parsing architecture like that in Windows or the .NET Framework if you try to bring Ethiopic data in without writing code to do the work (just like with Roman numerals). I'll talk about Ethiopic numbers another time....
Yet another system, the one I will talk about here, is that of Tamil numerals. It is an additive and positional system (unlike Roman numerals, there is no subtraction involved) that has no zero but includes characters for 10, 100, and 1000.
In the traditional system the number 3,782 would be represented as ௩௲௭௱௮௰௨ (literally Three-Thousand(s)-Seven-Hundred(s)-Eight-Ten(s)-Two, or மூன்று-ஆயிரத்து-எழு-நூற்று-எண்-பத்து-இரண்டு in Tamil).
At least since the early 1800s, however, usage of the Tamil numerals as digits has been more and more common. Thus the number 3,782 would often be represented as ௩௭௮௨ (literally 3782).
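Both representations can be sketched in Python. Treat this as an illustration under assumptions rather than a reference algorithm: the function names are invented, the traditional formatter only handles 1 through 9999, it assumes that a lone ten, hundred, or thousand is written without a leading "one", and the modern formatter uses U+0BE6 for zero.

```python
TAMIL_TEN = "\u0bf0"       # TAMIL NUMBER TEN
TAMIL_HUNDRED = "\u0bf1"   # TAMIL NUMBER ONE HUNDRED
TAMIL_THOUSAND = "\u0bf2"  # TAMIL NUMBER ONE THOUSAND


def tamil_digit(d: int) -> str:
    """Map 0..9 to U+0BE6..U+0BEF."""
    return chr(0x0BE6 + d)


def to_modern_tamil(n: int) -> str:
    """Positional form: each decimal digit is swapped for its Tamil digit."""
    return "".join(tamil_digit(int(c)) for c in str(n))


def to_traditional_tamil(n: int) -> str:
    """Additive form with explicit signs for 10, 100, and 1000 (1..9999 only)."""
    if not 1 <= n <= 9999:
        raise ValueError("this sketch only covers 1 through 9999")
    parts = []
    for place, sign in ((1000, TAMIL_THOUSAND), (100, TAMIL_HUNDRED), (10, TAMIL_TEN)):
        count, n = divmod(n, place)
        if count == 0:
            continue
        if count > 1:  # assumption: a count of one is left implicit
            parts.append(tamil_digit(count))
        parts.append(sign)
    if n:
        parts.append(tamil_digit(n))
    return "".join(parts)


print(to_traditional_tamil(3782))
print(to_modern_tamil(3782))
```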
The following table gives a bunch of different numbers and how they are represented in both the older, more traditional style and in the "modern" style where they act as digits. Note that the table is treating U+0be6 as TAMIL DIGIT ZERO even though it is not being added to Unicode until version 4.1. Up until now the ASCII DIGIT ZERO was used as needed, as I do in the table below for display purposes, and if you want to represent these numbers before Unicode 4.1 is released you should likely use U+0030 (DIGIT ZERO). The modern Tamil column uses the LOCALE_SGROUPING setting of Tamil....
1 - a.k.a. Lakh
2 - a.k.a. 10 Lakhs
3 - a.k.a. crore
4 - a.k.a. 10 crore
5 - a.k.a. 100 crore
6 - a.k.a. thousand crore
7 - a.k.a. 10 thousand crore
8 - a.k.a. lakh crore
9 - a.k.a. crore crore
Some examples of both types of usage:
Note that the traditional form is not currently handled by any code in either Windows or the .NET Framework, though it is sometimes seen in even modern contexts such as calendars. The system is not too complicated and figuring out the algorithm to parse or format with it seems like the sort of thing that would make an interesting Microsoft interview question. Though perhaps I will post some potential solutions another day....
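As a teaser for that interview question, here is one possible Python sketch of the parsing direction. It is a guess at a reasonable algorithm, not anything that ships in Windows or the .NET Framework: the function name is made up, only the signs for 10, 100, and 1000 are handled, a sign with no digit in front of it is assumed to mean a count of one, and no check is made that the places appear in descending order.

```python
MULTIPLIERS = {
    "\u0bf0": 10,    # TAMIL NUMBER TEN
    "\u0bf1": 100,   # TAMIL NUMBER ONE HUNDRED
    "\u0bf2": 1000,  # TAMIL NUMBER ONE THOUSAND
}


def parse_traditional_tamil(text: str) -> int:
    """Sum digit-times-sign groups; a trailing digit counts as units."""
    total = 0
    pending = None  # digit waiting for its ten/hundred/thousand sign
    for ch in text:
        if ch in MULTIPLIERS:
            count = pending if pending is not None else 1
            total += count * MULTIPLIERS[ch]
            pending = None
        elif "\u0be7" <= ch <= "\u0bef":  # TAMIL DIGIT ONE..NINE
            if pending is not None:
                raise ValueError("two digits in a row is not traditional form")
            pending = ord(ch) - 0x0BE6
        else:
            raise ValueError(f"unexpected character U+{ord(ch):04X}")
    if pending is not None:
        total += pending
    return total


# The traditional spelling of 3,782 used earlier in this post.
print(parse_traditional_tamil("\u0be9\u0bf2\u0bed\u0bf1\u0bee\u0bf0\u0be8"))
```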
Special thanks to Sivaraj Doddannan, Dr. N. Ganesan, and Working Group 02 of INFITT (of which they are both members) for helping to dig up the excellent resources for Tamil numbers. INFITT (International Forum for Information Technology in Tamil) is a liaison member of Unicode and has been instrumental in providing character addition and usage reports to help finish up the Tamil block in Unicode.
This post brought to you by "௧௨௩௪௫௬௭௮௯" (U+0be7 - U+0bef, a.k.a. TAMIL DIGIT ONE - TAMIL DIGIT NINE)
and they all welcome their new compadre U+0be6, which is coming soon to a Unicode near you!
Paul Langton asked via the contact link:
Gday Michael,
Firstly, love the blog and though a lot of it is waaaay over my head its always a great read, I'd go so far to say the best of all the MSDN blogs that I've sampled.
OK suck up out of the way, I have a VB script that does the same function as ticking the two tickboxes in "Supplemental language support" - i.e. "Install files for complex script and right to left languages (including Thai)" and "Install Files for East Asian Languages":
Dim oShell ' Windows Scripting Host shell
Dim oFSO ' File system object
Dim sCurDir ' Script path
Dim sWinDir ' Windows root path

'=======================================================================
' Main
'On Error Resume Next

Set oShell = CreateObject("WScript.Shell")
Set oFSO = CreateObject("Scripting.FileSystemObject")

sCurDir = Left(WScript.ScriptFullName, InStrRev(WScript.ScriptFullName, "\") - 1)
sWinDir = oShell.ExpandEnvironmentStrings ("%WinDir%")

' Install "Supplemental Language Support"
oShell.Run sWinDir & "\system32\rundll32.exe advpack.dll,LaunchINFSection " & sCurDir & "\intl.inf,LANGUAGE_COLLECTION.COMPLEX.INSTALL", 1, 1
oShell.Run sWinDir & "\system32\rundll32.exe advpack.dll,LaunchINFSection " & sCurDir & "\intl.inf,LANGUAGE_COLLECTION.EXTENDED.INSTALL", 1, 1
My question is - what is the best way to detect if these are already enabled so when it is run it doesn't reinstall all this stuff?
Rgds,
Paul
Ok, first of all -- Paul, don't ever do this!
The code that does the installation of these components does indeed perform these steps, but it also does more than that -- and if you do only these things then you will probably find out at some point that not everything works as you want it to.
This is obviously bad.
The way to get this done is to use the method given in KB article 289125. Just install one of the East Asian and one of the Complex script locales and it will perform ALL of the installation steps, rather than just the ones you want.
And now to the actual question -- how to know when to skip this due to the installation already happening?
Well, you can actually use the old IsValidLocale function on any locale that is within one of the language groups in question (e.g. 0x0411 which is Japanese or 0x041e which is Thai) using the LCID_INSTALLED flag, or the IsValidLanguageGroup function to check the actual language groups and see if they are installed.
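For anyone who would rather see the check than read about it, here is a rough Python sketch that calls IsValidLocale through ctypes. Paul's script is VBScript, which cannot call Win32 functions directly, so this is only an illustration of the API call and not a drop-in replacement; the helper name is hypothetical, and the function can only do anything useful on Windows.

```python
import ctypes
import sys

LCID_INSTALLED = 0x00000001  # flag value from the Windows headers
JAPANESE_LCID = 0x0411       # a locale in the East Asian group
THAI_LCID = 0x041E           # a locale in the complex script group


def is_locale_installed(lcid: int) -> bool:
    """Ask Windows whether the given locale is installed (Windows only)."""
    if sys.platform != "win32":
        raise OSError("IsValidLocale is only available on Windows")
    return bool(ctypes.windll.kernel32.IsValidLocale(lcid, LCID_INSTALLED))


if sys.platform == "win32":
    # If both come back True there is nothing to install.
    print("East Asian support installed:", is_locale_installed(JAPANESE_LCID))
    print("Complex script support installed:", is_locale_installed(THAI_LCID))
```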
Now all of this is useful to remember, especially in the context of What isn't in the default install for NLS and Language groups -- the vestigial tail of NLS, especially since they give some use for language groups. :-)
But please try to avoid depending on anything in the INFs, since that is undocumented, unsupported, subject to change, and is changing massively in Vista....
This post brought to you by "ૐ" (U+0ad0, a.k.a. GUJARATI OM)
In previous posts, I have talked about the unattend mode of Regional and Language Options.
And in the most recent of those, I promised to talk about the changes in Vista. So that is what is happening in this very post!
WARNING: The stuff covered here will not work in Windows prior to Vista....
Some of the exciting features that were added may be small items, but a few of them have been requested for a very very long time:
One of the big new changes, to go along with all of the changes to Vista's unattend during setup story in general, is a move to an XML file format rather than the old text file format used previously....
To run it, make sure you follow that advice I gave about control.exe, instead of rundll32.exe like that KB article suggests:
control.exe intl.cpl,,/f:"c:\Unattend.xml"
A sample of the XML file that should work in Beta 2 of Vista (and probably most other builds, this functionality has been in there for a while now!) can be seen below:
<gs:GlobalizationServices xmlns:gs="urn:longhornGlobalizationUnattend">
    <!-- user list -->
    <gs:UserList>
        <gs:User UserID="Current" CopySettingsToDefaultUserAcct="true" CopySettingsToSystemAcct="true"/>
    </gs:UserList>
    <!-- GeoID -->
    <gs:LocationPreferences>
        <gs:GeoID Value="134"/>
    </gs:LocationPreferences>
    <!-- UI Language Preferences -->
    <gs:MUILanguagePreferences>
        <gs:MUILanguage Value="cy-GB"/>
        <gs:MUIFallback Value="en-GB"/>
    </gs:MUILanguagePreferences>
    <!-- system locale -->
    <gs:SystemLocale Name="en-GB"/>
    <!-- input preferences -->
    <gs:InputPreferences>
        <gs:InputLanguageID Action="add" ID="0409:00000409"/>
        <gs:InputLanguageID Action="remove" ID="0409:00000409"/>
        <!--bs-Latn-BA--><gs:InputLanguageID Action="add" ID="141a:0000041a"/>
        <!--cy-GB--><gs:InputLanguageID Action="add" ID="0452:00000452"/>
        <!--en-GB--><gs:InputLanguageID Action="add" ID="0809:00000809"/>
    </gs:InputPreferences>
    <!-- user locale -->
    <gs:UserLocale>
        <gs:Locale Name="en-US" SetAsCurrent="true" ResetAllSettings="false">
            <gs:Win32>
                <gs:iCalendarType>1</gs:iCalendarType>
                <gs:sList>...</gs:sList>
                <gs:sDecimal>;;</gs:sDecimal>
                <gs:sThousand>::</gs:sThousand>
                <gs:sGrouping>1</gs:sGrouping>
                <gs:iDigits>2</gs:iDigits>
                <gs:iNegNumber>2</gs:iNegNumber>
                <gs:sNegativeSign>(</gs:sNegativeSign>
                <gs:sPositiveSign>=</gs:sPositiveSign>
                <gs:sCurrency>kr</gs:sCurrency>
                <gs:sMonDecimalSep>,,</gs:sMonDecimalSep>
                <gs:sMonThousandSep>...</gs:sMonThousandSep>
                <gs:sMonGrouping>3</gs:sMonGrouping>
                <gs:iCurrDigits>1</gs:iCurrDigits>
                <gs:iCurrency>3</gs:iCurrency>
                <gs:iNegCurr>3</gs:iNegCurr>
                <gs:iLZero>0</gs:iLZero>
                <gs:sTimeFormat>:HH:m:s tt:</gs:sTimeFormat>
                <gs:s1159>a.m.</gs:s1159>
                <gs:s2359>p.m.</gs:s2359>
                <gs:sShortDate>d/M/yy</gs:sShortDate>
                <gs:sLongDate>dddd, MMMM yyyy</gs:sLongDate>
                <gs:iFirstDayOfWeek>6</gs:iFirstDayOfWeek>
                <gs:iFirstWeekOfYear>2</gs:iFirstWeekOfYear>
                <gs:sNativeDigits>0246813579</gs:sNativeDigits>
                <gs:iDigitSubstitution>1</gs:iDigitSubstitution>
                <gs:iMeasure>0</gs:iMeasure>
                <gs:iTwoDigitYearMax>2021</gs:iTwoDigitYearMax>
            </gs:Win32>
        </gs:Locale>
    </gs:UserLocale>
</gs:GlobalizationServices>
Note the use of comments in the XML. :-)
Now as was true in the old format, you only have to include the sections that you have changes in. However, at a minimum you must always have the following skeleton:
<gs:GlobalizationServices xmlns:gs="urn:longhornGlobalizationUnattend">
    <!-- user list -->
    <gs:UserList>
        <gs:User UserID="Current"/>
    </gs:UserList>
</gs:GlobalizationServices>
WARNING: Since I am a person who did not look at the event log when I could not get a file to work, you can save yourself a lot of time if you don't forget to add that "gs:UserList" tag!
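Since the missing "gs:UserList" tag is so easy to overlook, a quick pre-flight check can save a trip to the event log. What follows is only a minimal Python sketch, not part of any Microsoft tooling: the function name has_required_skeleton is made up for illustration, and all it knows about the format is the namespace URN and the minimum skeleton shown above.

```python
import xml.etree.ElementTree as ET

# Namespace URN taken from the sample file in this post.
GS_URN = "urn:longhornGlobalizationUnattend"
NS = {"gs": GS_URN}


def has_required_skeleton(xml_text):
    """Return True if the XML has the minimum skeleton described above.

    Hypothetical helper: it only checks for the root element and for a
    gs:UserList containing a gs:User with UserID="Current".
    """
    root = ET.fromstring(xml_text)
    if root.tag != "{%s}GlobalizationServices" % GS_URN:
        return False
    users = root.findall("gs:UserList/gs:User", NS)
    return any(user.get("UserID") == "Current" for user in users)


GOOD = """<gs:GlobalizationServices xmlns:gs="urn:longhornGlobalizationUnattend">
    <gs:UserList>
        <gs:User UserID="Current"/>
    </gs:UserList>
    <gs:SystemLocale Name="en-GB"/>
</gs:GlobalizationServices>"""

# Same file with the user list forgotten.
BAD = """<gs:GlobalizationServices xmlns:gs="urn:longhornGlobalizationUnattend">
    <gs:SystemLocale Name="en-GB"/>
</gs:GlobalizationServices>"""

if __name__ == "__main__":
    print(has_required_skeleton(GOOD))
    print(has_required_skeleton(BAD))
```

The check is deliberately narrow: it says nothing about whether the other sections are valid, only whether the one tag that is always required is there.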
Anticipating another question -- at present only the "Current" user is supported.
But don't worry. Think of this as another one of those instances where potential features in future versions are poking their heads out for people to gawk at! :-)
Now of course if anything changes between now and Vista shipping I will post again with whatever the updates are....
Anticipating another response -- the current plan of record is that the old text files will have to be converted to the new format, though of course that is a bigger issue than just Regional and Language Options -- it affects the unattend files for all of Vista. But I'll keep you updated on this issue if I hear anything else.
Now I am sure there are things I am forgetting, but I will make sure to add those in the future, too. :-)
This post brought to you by "ᠽ" (U+183d, a.k.a. MONGOLIAN LETTER ZA)
They are never going to learn this one.
Marlins suspend batboy for milk-drinking dare
I'll ignore the suspension issue and talk about the "milk bet" here.
Now this particular bet has been around for a long time. I first heard about it when I was working for the Access team, probably around eight years ago.
Heath, a fellow developer on the team with an office right next door to mine, was certain that he could drink a gallon of milk in an hour without throwing up. He went to CalTech and because of this had a very logical way of thinking this through. He could easily drink one of those one pint milk containers in just a few minutes. So the gallon could be polished off easily since it really is just eight of those one pint containers.
(for those outside the US, there are four quarts to a gallon and two pints to a quart!)
And he had a whole hour to do it, so he could take his time and make it with no problem, right?
Well, actually, it is wrong.
Fellow developer Nicholas Shulman volunteered the explanation for why the bet is never won as Heath was running to the restroom to avoid throwing up in the conference room to which we had all adjourned.
"A stomach," Nick explained with just the right inflection for irony, "is about a half a gallon."
Milk needs time in the stomach to be broken down before it can go on -- it does a body good, but it needs a little time to do that good. And there is simply no way to break the milk down fast enough to take in a full gallon in an hour. If you try to do so, your body will rebel and if you try and force the issue, your stomach will settle the argument for you.
Perhaps some future CalTech or MIT student who has read this blog will either refuse the bet, or anticipatorily buy something that will break down the milk and drink a bunch of that right before the bet starts.
(via Spencer)
Last night, I received the following question in e-mail:
Dear Michael.
I wonder if you can clarify this matter. I was under the impression that Tamil Unicode was possible only under XP professional, - since regional lang settings are available in the Control Panel. Yesterday I bought a new laptop that came with XP home edition. The regional language settings was not available initially. I was disappointed and thought of upgrading to XP pro. But I went to the Help Button and saw a link that installed the regional lang setting. I was able to set up Tamil Unicode settings and I am now happily using Tamil Unicode in the preinstalled Microsoft Works. The query is how come there is this impression that only Windows 2000 and XP pro support (Tamil) Unicode??
Thank you for your time
Kalaimani
Singapore
I reassured him of one important fact here -- that every version of Windows 2000, Windows XP Home, Windows XP Professional, and Windows Server 2003 contains all of the international support, no matter what localized version the SKU is.
So if you have XP Home then you have the support for Tamil, Georgian, Punjabi, Russian, Greek, Traditional Chinese, Bulgarian, Afrikaans, Catalan, Korean, Basque, Vietnamese, Spanish, Thai, French, Hindi, Japanese, Belorussian, Icelandic, Farsi, Galician, Danish, Ukrainian, Romanian, Simplified Chinese, Swedish, Konkani, Italian, German, Hungarian, Armenian, Lithuanian, Divehi, Swahili, Czech, Dutch, Hebrew, Estonian, Gujarati, and all of the rest of them.
You may have to install the proper international support to get the language (and some on this list are only available in XP and later), but they are all there waiting for you to use them. Today!
This post brought to you by "ஶ" (U+0bb6, TAMIL LETTER SHA)
Recently (as of Unicode 4.1) added to Unicode based on a proposal by INFITT (International Forum for Information Technology in Tamil)
Apologies for the title, I still cannot resist that sort of thing. Maybe one day....
If you have not read it yet, look at Language-specific processing #0 for more info about this series!
IFilter is one interface that you can use to lower the barriers between the engines that do the work of indexing and the data that may be sitting in proprietary formats. The documentation probably explains it better than I could here:
The IFilter interface scans documents for text and properties (also called attributes). It extracts chunks of text from these documents, filtering out embedded formatting and retaining information about the position of the text. It also extracts chunks of values, which are properties of an entire document or of well-defined parts of a document. IFilter provides the foundation for building higher-level applications such as document indexers and application-independent viewers.
Several shipping implementations of this feature will immediately come to mind: Full Text Search in SQL Server, SharePoint, Exchange, and Index Server for starters. And then there are those like MSN Desktop Search, as well. In all of them, search supports additional file formats through filters. Imagine being able to get in on the fun to make sure your own format is supported for some type of indexing/searching!
This is a COM interface, so to implement it you have to implement AddRef/Release/QueryInterface as always. The additional methods you have to implement are Init, GetChunk, GetText, GetValue, and BindRegion.
The general topic about the IFilter interface has pointers to summaries, samples, instructions on building, applying and testing filters, as well as methods to bind to already existing IFilter implementations.
It is also nice to see such a great effort on the security side -- links and information to help guarantee that ISVs who write code against this interface do it securely. Throughout there are good warnings:
Caution IFilters for Indexing Service run in the Local System security context. They should be written to manage buffers and to stack correctly. All string copies must have explicit checks to guard against buffer overruns. You should always verify the allocated size of the buffer. You should always test the size of the data against the size of the buffer.
That and a link to secure code practices to consider when implementing these interfaces are a welcome touch as far as I am concerned (as it does no good for Microsoft to write secure code if an ISV writes a component with a security issue!).
Now note that this interface, this IFilter, is not really about language-specific processing as much as it is about format-specific processing. But one of the greatest strengths of a service like MS Search is the ability to apply it to different file formats. It makes IFilter a very important interface to stretch the boundaries of what can be searched.
And it gives the future topics that deal with the more linguistic aspects of language-specific processing a much wider reach than they would otherwise have. So I will give IFilter an honorary "cool" status that I would usually reserve for things more linguisticalish :-)
This post was sponsored by "F" (U+0046, a.k.a. LATIN CAPITAL LETTER F)
A letter that realized it would never get to sponsor any of the fun "F" words while I am working for Microsoft, so it thought it should take "Filter" while it was available.
This post is about a not entirely intuitive fact that will be seen in the implementation of collation in Microsoft products. It affects the results of both CompareString and LCMapString in Windows, the results of using the CompareInfo and SortKey classes in the .NET Framework, and the results in products like Jet and SQL Server.
To help show what is happening under the covers, I will use the sort keys.
We'll use the letters A (U+0041, LATIN CAPITAL LETTER A) and Ą (U+0104, a.k.a. LATIN CAPITAL LETTER A WITH OGONEK), as well as their lowercase counterparts.
When getting sort keys using the default table (LOCALE_INVARIANT), the weights look like the following:
a  U+0061  0E 02 01 01 02 01 01 00
A  U+0041  0E 02 01 01 12 01 01 00
ą  U+0105  0E 02 01 1B 01 02 01 01 00
Ą  U+0104  0E 02 01 1B 01 12 01 01 00
Note the Unicode weights (the leading 0E 02), the diacritic weights (the 1B that appears only for the ogonek forms) and the case weights (the 02 or 12 that follows). Now when we ignore case:
a  U+0061  0E 02 01 01 02 01 01 00
A  U+0041  0E 02 01 01 02 01 01 00
ą  U+0105  0E 02 01 1B 01 02 01 01 00
Ą  U+0104  0E 02 01 1B 01 02 01 01 00
And when we ignore diacritics:
a  U+0061  0E 02 01 01 02 01 01 00
A  U+0041  0E 02 01 01 12 01 01 00
ą  U+0105  0E 02 01 01 02 01 01 00
Ą  U+0104  0E 02 01 01 12 01 01 00
And then we ignore both:
a  U+0061  0E 02 01 01 02 01 01 00
A  U+0041  0E 02 01 01 02 01 01 00
ą  U+0105  0E 02 01 01 02 01 01 00
Ą  U+0104  0E 02 01 01 02 01 01 00
Clearly, in the default table LATIN CAPITAL LETTER A WITH OGONEK is little more than a LATIN CAPITAL LETTER A with a hook in its foot. A small diacritic weight is added to show that it is still primarily a LATIN CAPITAL LETTER A. And the act of ignoring the diacritic gives identical results to when the diacritic was never there in the first place -- you can see it right in the weights.
Now, how about when we move to Polish, LCID 0x00000415? In Polish, LATIN CAPITAL LETTER A WITH OGONEK is a letter with a unique Unicode weight, and this causes a difference in the results:
a  U+0061  0E 02 01 01 02 01 01 00
A  U+0041  0E 02 01 01 12 01 01 00
ą  U+0105  0E 04 01 01 02 01 01 00
Ą  U+0104  0E 04 01 01 12 01 01 00
Do you see what happened here? Since in Polish LATIN CAPITAL LETTER A WITH OGONEK has a unique Unicode weight, ignoring the case weight has a predictable effect:
a  U+0061  0E 02 01 01 02 01 01 00
A  U+0041  0E 02 01 01 02 01 01 00
ą  U+0105  0E 04 01 01 02 01 01 00
Ą  U+0104  0E 04 01 01 02 01 01 00
And ignoring the diacritic weight will have no effect whatsoever (since there is no diacritic weight to ignore); the sort keys stay exactly as they were in the first Polish table above.
So the net effect is that for Polish, passing a NORM_IGNORENONSPACE flag in Windows, a CompareOptions.IgnoreNonSpace in the .NET Framework, or a collation in SQL Server such as Polish_CI_AI (Polish, case insensitive, accent insensitive) will never see LATIN CAPITAL LETTER A WITH OGONEK as a LATIN CAPITAL LETTER A, because Polish does not give the letter a diacritic weight.
This is a common issue, whether you look at å (U+00e5, a.k.a. LATIN SMALL LETTER A WITH RING ABOVE) in Swedish, Č (U+010c, a.k.a. LATIN CAPITAL LETTER C WITH CARON) in Slovenian, or any of the other hundreds of examples that exist in supported collations. The key is that in each case you must consider not only whether the character appears to have a diacritic on it but how the language is looking at the string....
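To make the mechanics concrete, here is a toy model in Python. It is emphatically not the Windows implementation: it handles single characters only, the names toy_sort_key, DEFAULT_TABLE and POLISH_TABLE are invented for illustration, and the weights are simply copied from the sort keys quoted in this post. It does show why "ignore diacritics" can only remove a diacritic weight, never a distinct Unicode weight.

```python
# Toy model only. Each entry is (Unicode weight, diacritic weight, case weight),
# with the byte values copied from the sort keys shown in this post.
DEFAULT_TABLE = {
    "a": (b"\x0e\x02", b"", b"\x02"),
    "A": (b"\x0e\x02", b"", b"\x12"),
    "ą": (b"\x0e\x02", b"\x1b", b"\x02"),
    "Ą": (b"\x0e\x02", b"\x1b", b"\x12"),
}

# In Polish the ogonek letters get their own Unicode weight and no diacritic weight.
POLISH_TABLE = dict(DEFAULT_TABLE)
POLISH_TABLE["ą"] = (b"\x0e\x04", b"", b"\x02")
POLISH_TABLE["Ą"] = (b"\x0e\x04", b"", b"\x12")


def toy_sort_key(ch, table, ignore_case=False, ignore_nonspace=False):
    """Build a sort key shaped like the ones above for one character."""
    unicode_weight, diacritic_weight, case_weight = table[ch]
    if ignore_nonspace:
        diacritic_weight = b""   # drop the diacritic weight, if there is one
    if ignore_case:
        case_weight = b"\x02"    # every letter gets the lowercase case weight
    # 01 separates the weight levels, 00 terminates the key.
    return (unicode_weight + b"\x01" + diacritic_weight + b"\x01"
            + case_weight + b"\x01" + b"\x01" + b"\x00")


def show(key):
    return " ".join("%02X" % b for b in key)


if __name__ == "__main__":
    for name, table in (("default", DEFAULT_TABLE), ("Polish", POLISH_TABLE)):
        same = (toy_sort_key("Ą", table, True, True)
                == toy_sort_key("a", table, True, True))
        print(name, show(toy_sort_key("Ą", table)), "equal to 'a' when ignoring both:", same)
```

With the default table the two keys come out identical once both flags are set; with the Polish table they still differ in the very first weight, which no flag touches.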
This post brought to you by "Ą" (U+0104, a.k.a. LATIN CAPITAL LETTER A WITH OGONEK)
(apologies to George Orwell, of course!)
Val asks:
Michael,
I've been reading your "Striping Diacritics" post, and it's been a great help. I've also been comparing it with another version I've seen. This other version is similar to yours, except that it breaks up the original string into characters, then normalizes the characters individually.
I threw a few languages at the two functions. I found yours handled Vietnamese while the other one did not.
However I have a problem where I don't know the code page of the string before I feed it to the function. It may be a far-east language, it may be an european language, or it may be a far-east + european language. Further more it may be a katakana or hiragana character set (i think I got those right)
Your function corrupted japanese, chinese, and especially korean. The other function did not. However I'm hesitant to use the other function because of the problems it has with vietnamese (which is a latin character set, so it worries me that it's hiding other issues with other languages)
Is there a way to modify your function to be friendlier to languages that don't have diacritics? Right now, I've modified it that if it finds a non-latin character (based on unicode value ranges), it aborts the whole processing and returns the original string, but this obviously can't handle strings that have multiple character sets.
Fyi: the problems I've seen with yours is that if I run JP through it, it strips out things that LOOK like accents but are not. For example, on of the JP characters looks like a reverse V with a small circle on top right. After your function, the small circl on top right is missing. With korean, it seemed like it inserted blanks between each character after we run it through Normalize() function.
It's funny, I never know when I get questions like this one if people fully understand what they are asking for (no offense to Val!).
And I am not helped in this case by the fact that Val is the name of both
The latter being a stripper is interesting given that this new Val is asking me about stripping (albeit an entirely different kind of stripping!). The fact that Valerie found things about programming to be interesting only further clouds the situation.
I have literally no sense of the person asking the question and am somewhat certain that neither of my previous Val experiences will properly guide me in this respect. :-)
I'll start with the Japanese case, which really is not a bug, or a shortcoming in the code.
After all, to me the conversion from å to a is a normal "diacritic stripping" operation, while a Swedish user would probably throw something heavy at my head were it not for the fact that all the Swedes I have met are so bleeding polite.
I don't find that Swede's reaction to be any more or less reasonable than the Japanese user's suggestion (mirroring Val's) that the change from ピ to ヒ is "corrupting Japanese text."
In both cases, you can see why it happens:
U+00e5 (å) decomposes to U+0061 U+030a, and U+030a is Mn (Mark, Nonspacing), which the code is designed to strip.
U+30d4 (ピ) decomposes to U+30d2 U+309a, and U+309a is Mn (Mark, Nonspacing), which the code is designed to strip.
So in essence what Val is asking for here is a way to get at the sentiment expressed in the title: that all Mn characters are non-spacing, but some are more non-spacing than others. That really does not exist in any kind of automated sense.
Now the Korean problem is a bit easier to describe and deal with. After all U+d3fc (폼) decomposes to U+1111 U+1169 U+11b7, and although none of them are non-spacing, if you don't have a font that can handle the conjoining Jamo (e.g. Gulim Old Hangul, the font I have on this machine), it will look like three separate Jamo rather than one single modern Hangul syllable.
Of course, this takes just a minor change to the code I first presented in Stripping diacritics....: take the decomposed text and recompose it by converting it to Unicode Normalization Form C. The new code would look something like this:
namespace Remove {
    using System;
    using System.Text;
    using System.Globalization;

    class Remove {
        [STAThread]
        static void Main(string[] args) {
            foreach(string st in args) {
                Console.WriteLine(RemoveDiacritics(st));
            }
        }

        static string RemoveDiacritics(string stIn) {
            // Decompose so that each diacritic becomes a separate character.
            string stFormD = stIn.Normalize(NormalizationForm.FormD);
            StringBuilder sb = new StringBuilder();

            for(int ich = 0; ich < stFormD.Length; ich++) {
                UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(stFormD[ich]);
                // Keep everything except nonspacing marks (Mn).
                if(uc != UnicodeCategory.NonSpacingMark) {
                    sb.Append(stFormD[ich]);
                }
            }

            // Recompose what is left, so Hangul syllables come back as single characters.
            return(sb.ToString().Normalize(NormalizationForm.FormC));
        }
    }
}
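For anyone who wants to poke at the three cases discussed above without a compiler handy, here is a rough Python equivalent of the same decompose, strip Mn, recompose approach. It is a sketch using the standard unicodedata module; the function name remove_diacritics mirrors the C# method but is otherwise my own choice, and the results depend on the Unicode data that ships with your Python.

```python
import unicodedata


def remove_diacritics(text):
    """Decompose to Form D, drop nonspacing marks (Mn), recompose to Form C."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = [ch for ch in decomposed if unicodedata.category(ch) != "Mn"]
    return unicodedata.normalize("NFC", "".join(kept))


if __name__ == "__main__":
    samples = [
        "\u00e5",          # å: the ring above (U+030a) is Mn, so it is stripped
        "\u30d4",          # ピ: the semi-voiced mark (U+309a) is Mn too
        "\ud3fc",          # 폼: decomposes to three Jamo, none of them Mn
        "Vi\u1ec7t Nam",   # Vietnamese, with stacked marks on one letter
    ]
    for sample in samples:
        print(sample, "->", remove_diacritics(sample))
```

The Korean syllable survives the round trip because the final conversion to Form C puts the Jamo back together. The Japanese case still loses its mark, which is exactly the behavior the code was designed to have.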
I don't know what in particular in the Chinese text was going wrong, and without an example it is hard to say; perhaps the new code will resolve the problem.
But the first case (the one with Japanese) is obviously the more interesting one in terms of functionality -- the request being for a way to strip diacritics that are considered to be meaningless does demand that a bit of rigor be applied to the meaning of the term meaningless, which could be linguistically derived from the meaning of the term corrupted (by assuming that anything that is called corrupted would have been meaningful had corruption not taken place!).
This post brought to you by ピ (U+30d4, a.k.a. KATAKANA LETTER PI)
Take a look at the following code, let me know what you think of it (compiled with Whidbey Beta 2, note the preview of the exciting new StringInfo methods for dealing with text elements!):
namespace àáâãäå {using System;using System.Text;using System.Globalization;
class àáâãäå { [STAThread] static void Main(string[] args) { àáâãäå(); àáâãäå(); àáâãäå(); àáâãäå(); àáâãäå(); àáâãäå(); àáâãäå(); } static void àáâãäå(string àáâãäå) { StringBuilder àáâãäå = new StringBuilder(); StringInfo àáâãäå = new StringInfo(àáâãäå);
àáâãäå.Append(àáâãäå.Normalize(NormalizationForm.FormC)); àáâãäå.Append(": ");
for(int àáâãäå=0; àáâãäå < àáâãäå.LengthInTextElements; àáâãäå++) { string àáâãäå = àáâãäå.SubstringByTextElements(àáâãäå, 1); if(àáâãäå.IsNormalized(NormalizationForm.FormC)) { àáâãäå.Append("C"); } else if(àáâãäå.IsNormalized(NormalizationForm.FormD)) { àáâãäå.Append("D"); } else { àáâãäå.Append("_"); } } Console.WriteLine(àáâãäå.ToString()); return; } static void àáâãäå() { àáâãäå.àáâãäå("àáâãäå"); } static void àáâãäå() { àáâãäå.àáâãäå("àáâãäå"); } static void àáâãäå() { àáâãäå.àáâãäå("àáâãäå"); } static void àáâãäå() { àáâãäå.àáâãäå("àáâãäå"); } static void àáâãäå() { àáâãäå.àáâãäå("àáâãäå"); } static void àáâãäå() { àáâãäå.àáâãäå("àáâãäå"); } static void àáâãäå() { àáâãäå.àáâãäå("àáâãäå"); } }}
It compiles, even though the namespace, the class name, every procedure (other than Main) and every variable all look like the same string.
The wonders of various combinations of the string "àáâãäå".
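If it is not obvious how one visible string can come in so many flavors, the Python sketch below builds seven of them by decomposing the first few letters and leaving the rest precomposed, then labels each text element "C" or "D" the same way the program above does. The helper names (mixed_form, text_elements, form_pattern) are hypothetical, the text element splitting is a simplification that just attaches combining marks to the preceding character, and unicodedata.is_normalized needs Python 3.8 or later.

```python
import unicodedata

# The six precomposed letters U+00e0 through U+00e5.
WORD = "\u00e0\u00e1\u00e2\u00e3\u00e4\u00e5"


def mixed_form(word, decomposed_count):
    """Decompose the first decomposed_count letters, keep the rest precomposed."""
    head = unicodedata.normalize("NFD", word[:decomposed_count])
    tail = unicodedata.normalize("NFC", word[decomposed_count:])
    return head + tail


def text_elements(text):
    """Simplified split: a base character plus any combining marks after it."""
    elements = []
    for ch in text:
        if elements and unicodedata.combining(ch):
            elements[-1] += ch
        else:
            elements.append(ch)
    return elements


def form_pattern(text):
    """Label each text element C (Form C), D (Form D only) or _ (neither)."""
    labels = []
    for element in text_elements(text):
        if unicodedata.is_normalized("NFC", element):
            labels.append("C")
        elif unicodedata.is_normalized("NFD", element):
            labels.append("D")
        else:
            labels.append("_")
    return "".join(labels)


if __name__ == "__main__":
    for count in range(len(WORD) + 1):
        variant = mixed_form(WORD, count)
        print(unicodedata.normalize("NFC", variant) + ": " + form_pattern(variant))
```

All seven variants are different sequences of code points, yet every one of them normalizes to the same Form C string, which is why they all look alike on the page.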
(Interestingly, due to all the exciting work of someone from the VS team, the cursor moves over the text elements as letters and thus it was an interesting challenge getting this written!)
Do you think it makes the code less readable? :-)
Compile it on the command line:
c:\temp>csc àáâãäå.cs
and then run it to see the output:
c:\temp>àáâãäå
àáâaäå: CCCCCC
àáâaäå: DDDDDD
àáâaäå: DCCCCC
àáâaäå: DDCCCC
àáâaäå: DDDCCC
àáâaäå: DDDDCC
àáâaäå: DDDDDC
Interesting, no? :-)
This post brought to you by "à", "á", "â", "ã", "ä", and "å" (U+00e0, U+00e1, U+00e2, U+00e3, U+00e4, and U+00e5, a.k.a. LATIN SMALL LETTER A WITH GRAVE, LATIN SMALL LETTER A WITH ACUTE, LATIN SMALL LETTER A WITH CIRCUMFLEX, LATIN SMALL LETTER A WITH TILDE, LATIN SMALL LETTER A WITH DIAERESIS, and LATIN SMALL LETTER A WITH RING ABOVE)
Well who did you think would be willing to sponsor this rubbish? :-)
I had this conversation a little over two years ago in the Netherlands at the end of the last day of a conference. It may not be word for word, though I actually think it comes pretty close (it's not like I had a tape recorder). The cookies were Pepperidge Farm Mint Milanos, but I do not like mint (I love the non-mint varieties, I am not sure how I ended up with the ones I did - it might have been a mistake to mention I did not like them).
Oh, also the name of the woman I talked to is not really Andrea; I just like the name and do not mind the nod to Jubal Harshaw....
Me: Andrea, would you like a cookie?
Andrea: Actually, I would like to know what the "Korean Unicode sort" is.
Me: I'd actually rather give you one of these cookies. They are really good. Plus it's less embarrassing than the answer to your question.
Andrea: I know you hate mint, you said so yesterday at the luncheon. C'mon Michael!
(Short pause)
Andrea: Or is it Mike? Or maybe michka like your mails?
Me: Michael's best.
Andrea: Ok, no Russian bears. So tell me, why is the Korean Unicode sort embarrassing? I could not find it defined anywhere, except maybe I found a vague hint to the 'Unicode collation' setting that was used in SQL Server 7.0, which could be Korean. Is that it?
Me: No, that's not what it is. Though SQL Server does have a "Korean Unicode collation" of its own that matches the one that used to be on Windows.
Andrea: Grrr. You are infuriating, Michael. What is the Korean Unicode sort? The one that is in SQL Server, the one that used to be in Windows, the one that is still in the header files. What is it?
Me: Well, it's almost the same sort as the one we use for English.
Andrea: Almost? How close is almost? Sounds like almost hitting a home run, but what kind? Was it an almost home run that was a strike out, or an almost home run that was a triple?
Me: Ouch! Well, if you put it that way, I guess you could say it's a strike out.
(I have an embarrassed smile at this point)
Me: We move one character.
Andrea: One character?
Me: One character.
Andrea: What character is it? Something insulting to a government? Did Microsoft upset the Korean premier or something?
Me: No, nothing like that. It's U+005c, the "REVERSE SOLIDUS". Also known as the backslash. Not insulting at all.
Andrea: One of us has to be missing something, Michael. Maybe you had better give me a cookie.
(She eats a cookie, and tries to hand the package back. I shake my head)
Andrea: So please, explain to me why the backslash has to be moved for Korea.
Me: Well, because for Korean, it is also the Won sign (₩).
Andrea: You said in your talk today that there is room for over a million characters in Unicode. There is no room for a dedicated Won?
Me: Oh, there is a dedicated Won Sign at U+20a9. It's just that in most Korean fonts a character that looks like a Won is put in the slot for U+005c, and since the characters look the same we try to make sure that they are treated as if they were the same.
Andrea: Ok, I see that. But why is it called the Korean Unicode sort? If it's legacy then that would make it the Korean ANSI sort, right?
Me: Well, ANSI does not have Korean in it, and there is no Won.
Andrea: You know what I mean, Michael. Are you this exasperating when you talk with your girlfriend?
Me: Oh, I... I'm between girlfriends at the moment.
Andrea: I WONder why....
Me: Hey now!
(Andrea is wearing quite an impish grin at this point)
Andrea: Just kidding. But I was up too late last night and you already gave me the cookies. So I have no real need to flirt when I am teasing at this point.
Me: Hmmmm, no one ever used to have a need. Anyway, I know what you mean. It probably would have made more sense to tie it to the Korean standard, except that's encoding and not sorting. And they basically do put the won at 0x5c in their encoding standard, so MS is just trying to be consistent. It would have been really weird trying to tie to KSC-5601.
Andrea: I can definitely see that. So, what about the rest of the Hangul and Hanja and Jamo and whatnot that is used by Koreans?
Me: Well, now you understand why it was probably removed from Windows -- because it does not really do much for Korean.
Andrea: But it's still in SQL Server. They didn't get the memo?
Me: I know you think that I am a bigwig at Microsoft, but I'm not. I was offered a job there but I haven't even started yet. And I am definitely not "in the know" about what they do in SQL Server.
Andrea: No need to be shirty, dear. I understand. I apologize for thinking you were important.
(I grimace at this point)
Andrea: Ok, and I apologize for teasing you now. But back to the Korean thing.... do you have a guess?
Me: Oh, definitely. I just don't know if I am right.
Andrea: So what is the theory?
Me: My guess is that since there is a serious worry about backward compatibility and sort orders in SQL Server, they can't really get rid of something as easily, even if it is useless. I guess they could have hacked it since it's only different by one character, but they are a team that is astoundingly against hacks. That's something I can respect.
Andrea: So can I. Probably worth a KB article, at least.
Me: Maybe. If PSS gets customers wondering where good old 0x00010412 went, I'll suggest it.
(She eats another cookie)
Andrea: Ok. I'm sorry to monopolize your time like this.
Me: No worries, the group is gone, the conference is mostly over. Hell, I'd probably be flying out tonight if there were a flight. You can come out with us tonight if you want. Well, that is if we are going anywhere.
Andrea: Actually, you can come out with us. My friends are more socially adept than yours.
Me: Probably true. And more than me, too.
Andrea: One more question and we can head back to what's left of the group.
Me: Ok. What's the question?
Andrea: What's up with the Japanese (Unicode) sort?
Needless to say, the conversation devolved at that point. But Andrea did finish the cookies. I did go out with four of Andrea's friends that night and drank more than I should have. The flight home was harder with a hangover, and to be perfectly honest it was not until I sat down to try and remember the whole conversation earlier tonight that I remembered I was supposed to follow up with PSS.
Maybe the blog entry is good enough at this point? :-)
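For anyone wondering how to read "good old 0x00010412": an LCID packs a sort ID into the bits above a 16-bit language ID, which is the layout the MAKELCID macro in the Windows headers produces. The Python sketch below just pulls the pieces apart. The function name split_lcid is mine, and the comment about what sort ID 1 means under Korean is from the header file constants (SORT_KOREAN_KSC is 0, SORT_KOREAN_UNICODE is 1) as I remember them, so treat it as an assumption to verify.

```python
def split_lcid(lcid):
    """Split an LCID into sort ID, language ID, primary language and sublanguage.

    Layout assumed here (as produced by MAKELCID/MAKELANGID):
      bits 16-19  sort ID
      bits  0-15  language ID, itself made of
                  bits 0-9   primary language
                  bits 10-15 sublanguage
    """
    lang_id = lcid & 0xFFFF
    return {
        "sort_id": (lcid >> 16) & 0xF,
        "lang_id": lang_id,
        "primary_lang": lang_id & 0x3FF,
        "sub_lang": lang_id >> 10,
    }


if __name__ == "__main__":
    parts = split_lcid(0x00010412)
    # Under LANG_KOREAN (0x12), sort ID 1 is the "Korean Unicode" sort
    # discussed above; sort ID 0 would be the regular Korean sort.
    for name, value in parts.items():
        print("%-12s 0x%04x" % (name, value))
```

So the only thing that separates 0x00010412 from plain Korean, 0x00000412, is that single sort ID bit.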
The scene is familiar -- you are typing along and suddenly you are not seeing the letters you typed. And suddenly you imagine you are channeling your inner Homer Simpson as you say
D'oh, stupid keyboard!
But the computer has not been possessed. It may be the situation the inestimable Kate Gregory describes in Language bar have a mind of its own? :
For several months now, I've been plagued by unexpected language changes while I'm typing. I'll type one character, maybe a quote or a question mark, and I'll get a really strange character instead, say a capital E with an accent on it. I came to realize that it was the language settings, and I keep the language bar on my toolbar so I can flip back to English whenever this strange thing happens. But I didn't know why it was happening, and I found stopping what I was doing to mouse over to the bar and click back to the language I wanted very frustrating.
Well, now I know what was going on! ALT-SHIFT rotates through the languages. I'm a huge ALT-TAB user, and I ALT-SHIFT-TAB when I need to cycle backwards through that list. I also use a fair amount of other ALT-things, like ALT-A to bring up the favourites menu in IE, then arrow keys to choose an item. I really prefer the keyboard to the mouse. Well I guess every once in a while an ALT-SHIFT gets through to the language bar and flips my language. So now when I go to type a URL and see ццц I can quickly make it right.
Лфеу (er, Kate)
That's actually one very common issue. And there is no shame in it at all, though I admit to some curiosity about whether Kate really has a Cyrillic keyboard in her list? :-)
Happens to me all the time!
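For the curious, the ццц and Лфеу in Kate's post are just what the same keystrokes produce once the input language has flipped to Russian. Here is a minimal sketch of undoing that in Python, assuming the standard Russian (ЙЦУКЕН) layout and a US QWERTY layout. The table only covers the letter keys plus a handful of punctuation keys, and the names `QWERTY`, `RUSSIAN` and `fix_layout` are made up for this illustration.

```python
# Keys on a US QWERTY layout, and what the same physical keys produce
# on the standard Russian layout (assumed; partial table, no digits row).
QWERTY = "qwertyuiop[]asdfghjkl;'zxcvbnm,."
RUSSIAN = "йцукенгшщзхъфывапролджэячсмитьбю"

# Lowercase pairs first, then shifted letters (punctuation is left out of
# the shifted set because its shifted forms differ between the layouts).
PAIRS = dict(zip(QWERTY, RUSSIAN))
PAIRS.update({q.upper(): r.upper()
              for q, r in zip(QWERTY, RUSSIAN) if q.isalpha()})

TO_QWERTY = str.maketrans({r: q for q, r in PAIRS.items()})


def fix_layout(text: str) -> str:
    """Translate text typed with the Russian layout active back to QWERTY."""
    return text.translate(TO_QWERTY)


print(fix_layout("ццц"))   # www
print(fix_layout("Лфеу"))  # Kate
```

Anything not in the table passes through untouched, so this is a toy for recognizing what happened rather than a general layout converter.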
I was once trying to repro a bug that occurred only if you had more than 50 keyboards selected, and then one never knew what one was getting when one ALT-SHIFT'ed, or how to get back to where one was before. I finally found a brilliant workaround though -- I added the US English keyboard 51 times, under 51 different languages. That way, no matter what I accidentally switched to, the letters would look the same. At that point things only sucked in one application -- can you guess which one?
It was Word. Can you guess how? Well, Word chooses the language to tag the text with based on the input language, which turned what I thought was a brilliant workaround into one of the most non-intuitive blockers to proper spell checking since the time my dictionary fell off my balcony back when I lived on the third floor and viciously attacked a house plant. It did manage to prove that the book is mightier than the plant, though. Much to the chagrin of my former downstairs neighbor, who was quite happy when I finally moved several buildings away.
Where was I?
Oh yes, when keyboards seem to be misbehaving. Maybe this one seems familiar to you?
I have a couple computers at the office that run Word 2000 and lately whenever I try to put an accent mark on a vowel it just inserts two accent marks instead. (Quite annoying!)
I've tried reinstalling the keyboard language and reinstalling and updating Office but to no avail.
Unfortunately, this one turned out to be the Bugbear.B worm. Luckily Symantec has a removal tool. But it is best not to ignore this sort of problem when it happens (the person who reported this particular problem admitted it had been going on for months).
One more -- similar to the last one but with a happier ending:
This has been bugging me for months. I am not sure when it started, but any time I try to put an apostrophe into a document, nothing happens. Then if I hit the key again I get two of them.
I have to hit the backspace key to get what I wanted. So it takes three keystrokes to get me what should have taken one. Is this some sort of virus? Help!
Ah, no virus this time. However, it turns out that this person had installed the "United States - International" keyboard layout. This layout has the apostrophe as a dead key for an acute accent. And as I have said before, dead keys are not intuitive. In his case, either typing the apostrophe followed by a space or uninstalling the layout would have been an okay option. He chose the latter since he did not need the international layout....
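If the dead key behavior still seems mysterious, here is a rough simulation of it in Python. It is a simplified sketch rather than how Windows actually implements layouts: the table only covers the plain vowels, and `ACUTE` and `type_keys` are names invented for the illustration. The idea is that the dead key produces nothing by itself and waits for the next keystroke to decide what to emit.

```python
# Simplified model of a dead key, loosely following the apostrophe on the
# "United States - International" layout. Partial table (vowels only).
ACUTE = {"a": "á", "e": "é", "i": "í", "o": "ó", "u": "ú",
         "A": "Á", "E": "É", "I": "Í", "O": "Ó", "U": "Ú"}


def type_keys(keys: str, dead: str = "'") -> str:
    """Return the text that appears after pressing the given keys in order."""
    out = []
    pending = False  # True after the dead key, until the next key arrives
    for key in keys:
        if not pending:
            if key == dead:
                pending = True       # nothing shows up on screen yet
            else:
                out.append(key)
        else:
            pending = False
            if key in ACUTE:
                out.append(ACUTE[key])   # dead key + vowel: accented vowel
            elif key == " ":
                out.append(dead)         # dead key + space: plain apostrophe
            else:
                out.append(dead + key)   # no combination: both characters
    return "".join(out)


print(repr(type_keys("'")))   # '' (nothing yet)
print(type_keys("''"))        # ''
print(type_keys("' "))        # '
print(type_keys("'e"))        # é
```

The first two calls match the report above: one press shows nothing, and a second press produces two apostrophes at once.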
This post brought to you by "Я" (U+042F, a.k.a. CYRILLIC CAPITAL LETTER YA)