
SetLocaleInfo() is horrid, don't use it!

I just ran into a bug with SetLocaleInfo() use, and it pretty much reminded me that SetLocaleInfo() stinks.  Michael said it years ago and it's still true.

The only thing it's useful for is a "Regional and Language Options" type app, and there's already one of those in the control panel :)   Some apps try to call SetLocaleInfo() to set some behavior, and then call the date/time APIs or whatever.  One app changed the currency symbol while it ran reports.  Problem is that SetLocaleInfo() changes the user preferences for every app in the user account.  Now the poor user's going to be running some report or something in another app and their records will be formatted wrong.

Some apps try to "fix" the setting, and change it only momentarily, then change it back.  Now the other app will have a couple odd results in the middle of mostly-correct results, and it'll be nearly impossible for them to figure out why it doesn't work.  SetLocaleInfo() also makes things slow because apps have to reload their settings all the time.  The app that's using SetLocaleInfo() needs to seriously wonder what happens if another app did the same thing and called SetLocaleInfo() right after you did.  Obviously that wouldn't be what you'd expect, so please don't do it to everyone else.

Anyway, don't use SetLocaleInfo().  If there's some limitation you're trying to work around by calling SetLocaleInfo(), let me know.
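For the "change the currency symbol while running a report" case, one alternative is to keep the change private to your own process.  Here's a minimal sketch in C#, assuming .Net's CultureInfo is available to you; the class name, the amount and the replacement symbol are made up for illustration, and a native app would need a different approach.

```csharp
using System;
using System.Globalization;

public static class ReportFormatting
{
    // Formats an amount using the user's own regional settings, except for
    // the currency symbol, which is overridden in a private copy only.
    public static string FormatWithSymbol(decimal amount, string symbol)
    {
        // Clone() returns a writable copy; the user's settings are untouched.
        CultureInfo reportCulture = (CultureInfo)CultureInfo.CurrentCulture.Clone();
        reportCulture.NumberFormat.CurrencySymbol = symbol;
        return amount.ToString("C", reportCulture);
    }

    public static void Main()
    {
        // The report line uses the overridden symbol.
        Console.WriteLine(FormatWithSymbol(1234.5m, "\u20AC"));
        // Everything else (and every other app) still sees the user's symbol.
        Console.WriteLine(1234.5m.ToString("C"));
    }
}
```

Nothing here touches the user's preferences, so no other app has to reload its settings or gets surprised in the middle of a report.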

-Shawn

IDNA2008 / IDNAbis on Windows 7, Vista, .Net, etc.

Some people have asked what they should do to support IDNA2008 on Microsoft platforms. 

We provide IdnToAscii() and related functions in the Windows SDK.  That's available natively on Vista+, and through idndl.dll on earlier platforms.  Idndl is shipped with IE 7, or through "Microsoft Internationalized Domain Names (IDN) Mitigation APIs" at the Microsoft Download Center.

For .Net, the IdnMapping class provides IDNA2003 conversion since .Net V2.

Obviously these APIs currently only support IDNA2003; however, the interfaces won't change for IDNA2008 support.  I don't know what mechanism will be used to update for IDNA2008 support, but applications should continue to use the exposed APIs for consistent support across the platform.  That way if the system gets updated to IDNA2008, the application will be able to take advantage of the updated support.
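As a rough illustration of sticking to the exposed APIs, here's a small C# sketch that uses the IdnMapping class for both directions.  The class name and the sample domain are just for illustration.

```csharp
using System;
using System.Globalization;

public static class IdnExample
{
    // Converts a Unicode domain name to its ASCII ("xn--") form.
    public static string ToAscii(string unicodeName)
    {
        return new IdnMapping().GetAscii(unicodeName);
    }

    // Converts an ASCII ("xn--") domain name back to Unicode.
    public static string ToUnicode(string asciiName)
    {
        return new IdnMapping().GetUnicode(asciiName);
    }

    public static void Main()
    {
        string ascii = ToAscii("b\u00FCcher.example");
        Console.WriteLine(ascii);   // xn--bcher-kva.example
        Console.WriteLine(ToUnicode(ascii) == "b\u00FCcher.example");
    }
}
```

Because the conversion rules live behind IdnMapping, code like this shouldn't need to change if the platform's IDNA behavior is updated later.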

Visual Studio 2010 and .Net v4 Beta released

If you go to http://msdn.microsoft.com/en-us/vstudio/dd582936.aspx there's now a Visual Studio 2010 and .Net v4 beta page.  The betas are a good place to learn about new releases and provide feedback.

-Shawn

Posted by shawnste | 0 Comments

Missing International Setting Registry Key?

Some Zune users ran into a strange problem http://forums.zune.net/2/3/518796/ShowPost.aspx and it seems like the Nation value in HKEY_CURRENT_USER\Control Panel\International\Geo isn't there on some Windows XP machines.  That seems really weird to me because if the International key is missing I'd expect odd behavior from the machine.  We get the user locale from there, so that's not particularly good.  For Zune it seems like the missing key causes a log-on problem with a "Zune can't sign in: this functionality is not available in your Region" error because GetUserGeoId() ended up failing.

Anyway, odd user problems because of a missing HKEY_CURRENT_USER\Control Panel\International key can be fixed by adding HKEY_CURRENT_USER\Control Panel\International\Geo through regedit.  (Or if it's a permissions problem, delete the key and re-add it.)  Of course editing the registry can mess up your machine and all that.  If you run the Regional and Language Options Control Panel you can reset your settings to the correct values and apply (I usually change to something else and then back to make sure it sees a change).

How did the key go missing anyway?  That's a really good question, if you know please tell me!  In at least one case it seems the OEM image may not have had the key, which is kind of bizarre.  Maybe the out-of-box-experience (the one that asks you where you live) got skipped or something.  Maybe a registry "cleaner" mangled the key or maybe something else did.  We've seen cases where an enterprise installation messes with these keys, but those are pretty rare.  Generally it is not a good idea to change these keys yourself.  intl.cpl provides a way to change them, and it supports xml config files too: http://blogs.msdn.com/shawnste/archive/2007/04/12/configuring-international-settings-from-the-command-line.aspx  In the future we'd like to remove our dependency on some of the stuff in this key, so don't be surprised if it changes in future versions.

- Shawn

A Cool Hard Drive Fix

Well, I should say all the caveats like "don't try this at home," and "it'll destroy your data," and "this is a stupid thing to do," and "don't try this with data you really need, use a professional data recovery service," but...

At home I have a Windows Home Server, which is a really cool way to store data.  I used to use a server with RAID-5, but so far this has been a lot easier.  Anyway, I digress.  Last week I added a 2TB drive.  (Well, 1.75 TB or so, how come HDD manufacturers can't count in binary like everyone else?)  We have lots of video @ home (yea, I know I need to prune it, but haven't had the time).

I didn't really have enough space to duplicate everything, so I turned off duplication on a folder I figured didn't have very important stuff on it.  After all, it's a new drive, and the price of drives drops fast, so I'll buy another drive in 6 months and relieve the pressure.  You can see where this is headed.

This morning I was more than a tad surprised to discover that the new drive had failed.  For various reasons 1TB of data had gotten onto it, and an unknown amount hadn't been duplicated.  I was a bit bummed and also discovered something about myself:  "unimportant" data becomes more important when there's a terabyte of it missing!

I could reimport the original tapes (Hi-8, Mini-DV, etc), so I wasn't too worried, but that'd still be a LOT of work and time, so I was hoping there was an easier fix.

In that vein I changed the power cord, then switched the SATA port of the drive. I put it in an external case, I put it in a different external case, I changed the power supply, I put it in a different computer.  Nothing.  It started to spin up and then click... click... click.  I don't know much about HDD internals, but the click sounded like heads trying to move, but failing.  (Not crashed, that's a much worse sound!).  Finally I gave up.

Since I gave up, I got a new drive (so WHS could duplicate the data again), and wrote off the data.  Since I hadn't done this with WHS before, I poked around the web to get more info and stumbled into a bunch of "repair" info.  (There's a youtube clip of someone stopping a similar clicking noise.  They carefully arrange the drive on a workspace, with a nice mat to protect the parts, then they arrange the screwdrivers and tools and demonstrate the click.  Then they "solve" the clicking with a sledgehammer.  Admittedly the clicking noise did stop, however I think they should've specified their requirements a bit more clearly!)

In my stumbling, I came across anecdotal reports of cooling a HDD to "fix" it.  Various reasons were given (and another youtube video of a HDD in a block of ice, which I doubt was very successful.) 

The "Fix" 

Figuring I had nothing to lose, I stuck the drive in the freezer (in its USB housing for good measure).  An hour later I pulled it out, ran downstairs & plugged it into the Windows Home Server.

The power turned on, the drive spun up, then a "click".... then more clicks, the normal soft clattering of heads moving normally!  I ran upstairs to the desktop and told WHS to find the drive... and it did!  Freezing the drive did something to make it work.

There are numerous suggestions on how to cool off the drive.  Some people suggested using a baggie to keep condensation from it (not sure how that works since there'd still be air... and moisture... in the baggie).  I didn't bother, and the thin layer of frost was a bit freaky.  Other people ran cables out the freezer door so the drive could stay cold.  Since the server's downstairs that's pretty inconvenient, so again I didn't bother.  I figured I'd worry about that if it heated up and died, but it's been copying files for several hours now (it takes a long time to copy a terabyte), so I guess it was the initial "spin up the drive" that was causing my problems.

So why's it work?  I have no clue.  Some speculation on the web was weak connections that somehow got happier.  I know electricity's happier conducting at lower temperatures, so maybe there's a thin wire or connection somewhere that got just enough more electrons through it.  Or maybe a mechanical problem was overcome by shrinking the parts (most things shrink when they get cold).  Or maybe even an electrical connection was mechanically fixed by shrinking.  Who knows.  Again, I don't recommend anyone do this to their drive, but I was pretty happy it worked as a last resort.

It's a bit disconcerting that the drive failed after only a week or two.  I'm guessing the server spun down the drive sometime and then it failed to spin up again.  I assume I was pretty unlucky and there was a weak part somewhere that made it past the initial test.

Windows Home Server Recovery 

Since the drive's "back", it's a little easier, but it would've been about the same anyway.  I put in the new drive, then I told WHS to remove the dead drive (which wasn't dead then).  Then WHS copies all the stuff off of the removed drive to other drives.  If I hadn't frozen it, same thing, 'cept the lost data would've been missing.  So any duplicates or originals on that drive would've been recreated on the new drive or another drive.  Obviously anything I didn't enable duplication for that lived on the dead drive would've been lost, I'm lucky in that respect.

I wasn't very familiar with how recovery would work on WHS.  Previously I had a RAID-5 system, and have off-site backups of some files, but hadn't done recovery with WHS.  There's a bunch of discussion on the web of the pros & cons of WHS's technique, but it seems to work.  It sounds like it'd be more interesting if the main drive had failed because it'd have to recreate the "tombstones," but it'd still find the other drives and recreate the file structure when the WHS was reinstalled (I don't work in WHS, so forgive me if I got something wrong here).  As a contrast, another box died last summer and I still haven't gotten its RAID-5 array back up.  (Fortunately I was suspicious of it and most everything was copied to the WHS).

-Shawn

[Update: This is a bit disconcerting.  I took the defective drive back to the store, and they happily took it back.  I said it was defective, but they tested it and apparently it froze really well because it still worked.  They didn't care, they said they'd take it back anyway.  I clearly stated that it had died, dead, like bad heads, and that I'd "fixed" it by sticking it in the freezer.  They still didn't care and marked it to go back on the shelf!  Though this drive was new, I will now never buy a HDD with a "returned, full warranty, 5% discount" sticker on it!  It's one thing if I have to go back to the store 'cause my DVD player died, completely different to have an iffy HDD.]


Japanese Calendars, How do I Test Support for Additional Eras?

The Japanese Calendar is labeled by the reign of the current emperor.  Windows has supported 4 Japanese calendar eras, however in the future there may be more eras.  Realizing this, we've added support in Windows 7, Server 2008 R2 & .Net v4 for additional Japanese eras.

There're a few things applications should know about extended Japanese Era support:

  • The first is that applications may see more than 4 eras!
  • The current era could end.
  • Future dates could change, moving from the current era to a new era.  (So "Heisei 71" could be "something 12" or something like that)
  • The era numbers themselves could change, so the 1-4 era numbers shouldn't be relied upon.
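To make those points a bit more concrete, here's a small C# sketch that asks the framework which eras exist instead of assuming there are exactly four.  The class name, helper names and sample date are hypothetical, and it assumes the ja-JP culture data is available on the machine.

```csharp
using System;
using System.Globalization;

public static class EraExample
{
    // Returns how many eras the Japanese calendar currently knows about.
    public static int EraCount()
    {
        return new JapaneseCalendar().Eras.Length;
    }

    // Looks up the era name for a date instead of hard-coding era numbers.
    public static string EraNameFor(DateTime date)
    {
        JapaneseCalendar calendar = new JapaneseCalendar();
        CultureInfo ja = new CultureInfo("ja-JP");
        ja.DateTimeFormat.Calendar = calendar;
        return ja.DateTimeFormat.GetEraName(calendar.GetEra(date));
    }

    public static void Main()
    {
        JapaneseCalendar calendar = new JapaneseCalendar();

        // Enumerate whatever eras exist; don't assume the list is 1 through 4.
        foreach (int era in calendar.Eras)
            Console.WriteLine("Era value: {0}", era);

        Console.WriteLine(EraNameFor(new DateTime(2009, 5, 1)));
    }
}
```

The idea is that the count, the numbers and the names all come from the system at run time, so an added era shouldn't require a code change.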

Cool, so if we're being careful about all of those things, and are pretty sure we don't have any hard-coded dependencies on the Japanese Eras or when they are, how do we test it?

On Windows 7 / Server 2008 R2 (beta) and CLRv4 (beta) Japanese era information can be found in the registry:

Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\Calendars\Japanese\Eras]
"1868 01 01"="明治_明_Meiji_M"
"1912 07 30"="大正_大_Taisho_T"
"1926 12 25"="昭和_昭_Showa_S"
"1989 01 08"="平成_平_Heisei_H"

Additional eras can be added just by adding an entry to the table.  The format for each value is "YYYY MM DD"="JE_AJE_EE_AEE".  The value name gives the first day of the era: YYYY for the year (Gregorian), MM for the month, and DD for the day.  The value data contains the era strings: the full era name followed by the abbreviated name in Japanese, then the full name followed by the abbreviated name in English.  Each string is separated by _ (underscore).  So JE represents the Japanese era name, AJE the abbreviated Japanese era name, EE the English era name, and AEE the abbreviated English era name.

An example would be:

Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\Calendars\Japanese\Eras]
"1868 01 01"="明治_明_Meiji_M"
"1912 07 30"="大正_大_Taisho_T"
"1926 12 25"="昭和_昭_Showa_S"
"1989 01 08"="平成_平_Heisei_H"
"2020 09 01"="仮名_仮_Test Era_X"

(I'm not trying to make any predictions with the 2020, it's just a test number :)  Also remember this only works in Windows 7 and other newer products.
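If it helps to see the value format spelled out, here's a tiny C# parser for one entry.  This is only my sketch of the "YYYY MM DD"="JE_AJE_EE_AEE" layout described above (the EraEntry type is made up), not the code Windows or the CLR uses to read the key.

```csharp
using System;
using System.Globalization;

public sealed class EraEntry
{
    public DateTime Start;          // first day of the era (Gregorian)
    public string JapaneseName;     // JE
    public string JapaneseAbbrev;   // AJE
    public string EnglishName;      // EE
    public string EnglishAbbrev;    // AEE

    // Parses one registry value: name "YYYY MM DD", data "JE_AJE_EE_AEE".
    public static EraEntry Parse(string valueName, string valueData)
    {
        string[] parts = valueData.Split('_');
        if (parts.Length != 4)
            throw new FormatException("Expected four '_'-separated era names.");

        return new EraEntry
        {
            Start = DateTime.ParseExact(valueName, "yyyy MM dd", CultureInfo.InvariantCulture),
            JapaneseName = parts[0],
            JapaneseAbbrev = parts[1],
            EnglishName = parts[2],
            EnglishAbbrev = parts[3]
        };
    }

    public static void Main()
    {
        // The test entry from the example above.
        EraEntry e = Parse("2020 09 01", "\u4EEE\u540D_\u4EEE_Test Era_X");
        Console.WriteLine("{0} ({1}) starts in {2}", e.EnglishName, e.EnglishAbbrev, e.Start.Year);
    }
}
```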

 

[Edited:  This is intended to support the possibility of a future era, not to enable older eras.  There's the potential that lots of stuff won't work correctly if eras prior to the existing 4 are added].  Also note that this registry key is pretty flexible.  If you want you could add a bunch of data for a bunch of eras < 1868.  (I don't have the data.)  That might break more stuff because then "Era 1" would turn into "Era 8" or something.  (Which is why I said not to depend on the Era number, it's pretty meaningless and arbitrary for us).  As noted, we only really intended to support adding new eras, not the historical ones.  Prior to 1870 or so the calendar was lunar not gregorian based, so conversions prior to then are pretty meaningless anyway.

 

Hope this is interesting,

Shawn

 

 


Oversimplification of EAI/IMA (International eMail Addresses)

A couple months ago I blogged about Email Address Internationalization / Internationalized Email Addresses (EAI/IMA), and I felt like blogging about it again.

China's been very interested in non-ASCII email addresses for some time, and is working hard to adopt the EAI standard.  I've heard a target of November 2009 for that standard.  http://www.china.org.cn/china/sci_tech/2008-09/27/content_16544162.htm briefly addresses EAI.

Oversimplification of EAI

The basic concept of EAI is "just" to use UTF-8 for email.  Most software can comply just by allowing Unicode in their email addresses.  Using UTF-8 is reasonably straightforward, and most of the details are just around compatibility with existing mail standards.  The IETF working group has a page at http://www.ietf.org/dyn/wg/charter/eai-charter.html.

Local Part of the Email Address

The local part of an email address is the user account part.  Often times servers allow it to be case-insensitive, however it can also be case-sensitive.  Similarly EAI allows the servers to define any mappings of the local part that are appropriate for that organization.  Some may choose to do case mapping similar to existing case-insensitive servers.  A different mapping, like Turkish behavior for i and I is possible.  Another option would be to perform normalization like NFC or NFKC on the name.  Width mapping and aliases are possible.  Just like now, clients would just use the names given and let the recipient's mail server figure it out.
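As an illustration of what one such server-side mapping might look like, here's a C# sketch that combines NFKC normalization (which also folds full-width letters) with culture-neutral lower-casing.  This is just one hypothetical policy, not something EAI mandates, and the class and method names are made up.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class LocalPartMapping
{
    // One possible server policy: compatibility-normalize (NFKC), then
    // lower-case without applying any culture-specific rules.
    public static string Canonicalize(string localPart)
    {
        return localPart.Normalize(NormalizationForm.FormKC).ToLowerInvariant();
    }

    public static void Main()
    {
        // Full-width "SHAWN" (U+FF33 U+FF28 U+FF21 U+FF37 U+FF2E) and plain
        // "Shawn" end up as the same mailbox key under this policy.
        string fullWidth = "\uFF33\uFF28\uFF21\uFF37\uFF2E";
        Console.WriteLine(Canonicalize(fullWidth) == Canonicalize("Shawn"));

        // A server that chose Turkish casing rules could map "I" differently.
        string turkishLower = "I".ToLower(new CultureInfo("tr-TR"));
        Console.WriteLine(turkishLower == "i");
    }
}
```

Clients wouldn't run anything like this; they'd send the name as given and let the recipient's server apply whatever mapping it has chosen.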

Domain Part

EAI allows Unicode (UTF-8) for the entire address, so special mapping isn't necessary.  Of course if the domain doesn't have a valid registration, eg: isn't valid IDN, then it won't work, but that's not really an email protocol issue.  EAI uses UTF-8 instead of "punycode" for domain names.  Punycode only happens when "downgrading."

Negotiation

Mostly, "just using UTF-8" is pretty simple, but for backward compatibility, EAI aware servers and clients will need to negotiate their protocols.  For SMTP, the UTF8SMTP extension does this.  EAI aware servers can exchange the UTF8SMTP extension and agree to communicate in UTF-8.  If the server doesn't provide that flag, then the clients have to use a different mechanism.  The other protocols have similar handshaking.
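A client-side check for that flag could look roughly like this C# sketch, which scans the text of an EHLO reply for the UTF8SMTP keyword.  The class name and the sample reply are invented, and a real client would get the reply from its SMTP library rather than from a string literal.

```csharp
using System;

public static class SmtpCapabilities
{
    // Returns true if a multi-line EHLO reply advertises the given keyword.
    // Reply lines look like "250-KEYWORD params" or "250 KEYWORD params".
    public static bool Advertises(string ehloReply, string keyword)
    {
        string[] lines = ehloReply.Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string line in lines)
        {
            if (line.Length < 4 || !line.StartsWith("250", StringComparison.Ordinal))
                continue;

            // Skip the "250-" or "250 " prefix, then compare the first token.
            string firstToken = line.Substring(4).Split(' ')[0];
            if (string.Equals(firstToken, keyword, StringComparison.OrdinalIgnoreCase))
                return true;
        }
        return false;
    }

    public static void Main()
    {
        // Hypothetical reply from an EAI aware server.
        string reply = "250-mail.example.com\r\n250-8BITMIME\r\n250 UTF8SMTP\r\n";
        Console.WriteLine(Advertises(reply, "UTF8SMTP")
            ? "Server is EAI aware, send UTF-8"
            : "Fall back to the legacy mechanism");
    }
}
```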

Downgrading

Not all email clients and servers are going to instantly become Unicode aware, so there is a downgrading concept for compatibility.  Downgrade is the area with the most churn in the experimental standards, but the basic concept remains the same.

If you have an EAI aware server and you try to talk to an unaware system, you'll need to fall back to the legacy protocols and encoding mechanisms.  Effectively this means that EAI accounts will need an ASCII alias so that if an EAI mail fails, it can be resent using the ASCII alias and MIME encodings.

To a legacy recipient, such a mail would appear as any other legacy email, and replies would go to the sender's ASCII alias.  The receiving server would need to recognize that the ASCII and Unicode EAI aliases were for the same account and route the mail appropriately.

There was some discussion of providing additional data that allows reconstructing a downgraded mail, but most of those techniques seem to break at least some legacy clients and have additional problems.  My feeling is also that if a client knows how to reconstruct a downgraded mail, it also knows EAI anyway, so likely the mail would never be downgraded, so the additional complexity is unnecessary.  I think it's likely that the initial standards will only specify minimal downgrading and not the ability to reconstruct a downgraded message.

Status

Of course the IETF RFCs are still experimental and China hasn't published their standards yet, but my oversimplification probably won't change much in the final version.

What is Title Case?

Disclaimer: I'm not an English teacher (that's my mom), so I'm sure my description of title casing in English probably has exceptions/variations.

Title casing has an interesting history in computer programming.  Programmers like to use CamelCase to make variable names more readable, and, particularly amongst developers native to some languages, there's an idea that title casing is interesting, such as in TextInfo.ToTitleCase(), and in Windows 7, LCMapString(LCMAP_TITLECASE).  Most title casing algorithms are linguistically bad, even in English.  For other languages it's worse.

ToTitleCase() takes a very simple approach to title casing.  Maybe in the future it'll be smarter, but for now it just uppercases the first letter in a group of letters, and tries to pay attention to non-letters and word breaks.  It also tries to keep acronyms all upper-case.

Even in English this is a simplistic approach.  The title of this post is "What is Title Case?"  The word "is" is supposed to be lower case, but ToTitleCase() would capitalize it.  Additionally unexpected word breaks or punctuation could trick the algorithm.  Even the acronym test isn't complete since it just expects all-upper case and sometimes acronyms keep the lower case of the full title.  Also it messes up names like DiSilva or McConnell.  Contractions can also be messed up.
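You can see some of this for yourself with a few lines of C#.  This is just a quick demo using the invariant culture's TextInfo; the class name and sample strings are my own picks.

```csharp
using System;
using System.Globalization;

public static class TitleCaseDemo
{
    // Title-cases text using the invariant culture's rules.
    public static string Title(string text)
    {
        return CultureInfo.InvariantCulture.TextInfo.ToTitleCase(text);
    }

    public static void Main()
    {
        // "is" gets capitalized, all-caps words are left alone, and the
        // interior capital in a name like McConnell is not restored.
        string[] samples = { "what is title case?", "NATO summit", "mcconnell", "l'\u00E9tat" };
        foreach (string sample in samples)
            Console.WriteLine("{0} -> {1}", sample, Title(sample));
    }
}
```

For example, the first sample comes back as "What Is Title Case?", which isn't how the title of this post is written.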

Outside of English, ToTitleCase() rapidly gets silly.  In English we capitalize everything except articles, short prepositions and some other short words.  In German it's just like a normal sentence, with only nouns getting capitalized, so the English slightly over-eager capitalization behavior becomes very over-eager.  Other languages also can have letters before the main word, eg: l'État, so the ToTitleCase rules can mess up those words as well.

And then there're scripts/languages that don't even have an upper/lower case distinction, so ToTitleCase gets pointless.

Anyway, use care when using ToTitleCase().  It might work in some cases, but don't expect it to work linguistically, particularly globally, particularly in non-English cases.  Also maybe we'll get smarter and figure out a more correct way to do it in the future.

 -Shawn 

Alternate encoding names recognized by .Net / IE

If you run the sample from http://msdn.microsoft.com/en-us/library/system.text.encoding.getencodings.aspx then you can get a list of what Microsoft .Net thinks each Encoding/Code Page's name is.  (WebName is more consistent with what's used in charset). eg:

using System;
using System.Text;

public class SamplesEncoding
{
   public static void Main()
   {
      // For every encoding, print its code page, its name and its WebName
      // (the WebName is the one closest to what's used in charset).
      foreach( EncodingInfo ei in Encoding.GetEncodings() )
      {
         Encoding e = ei.GetEncoding();
         Console.WriteLine( "{0,-6} {1,-25} {2}", ei.CodePage, ei.Name, e.WebName );
      }
   }
}

There are several other names that are recognized by Encoding.GetEncoding(), however, similar to what IE would recognize in a charset tag.  I'm not sure if there's a way to get at the full list of aliases programmatically, but this is what you'd get for these input strings:

Label Code Page
"437" 437
"ANSI_X3.4-1968" 20127
"ANSI_X3.4-1986" 20127
"arabic" 28596
"ascii" 20127
"ASMO-708" 708
"Big5" 950
"Big5-HKSCS" 950
"CCSID00858" 858
"CCSID00924" 20924
"CCSID01140" 1140
"CCSID01141" 1141
"CCSID01142" 1142
"CCSID01143" 1143
"CCSID01144" 1144
"CCSID01145" 1145
"CCSID01146" 1146
"CCSID01147" 1147
"CCSID01148" 1148
"CCSID01149" 1149
"chinese" 936
"cn-big5" 950
"CN-GB" 936
"CP00858" 858
"CP00924" 20924
"CP01140" 1140
"CP01141" 1141
"CP01142" 1142
"CP01143" 1143
"CP01144" 1144
"CP01145" 1145
"CP01146" 1146
"CP01147" 1147
"CP01148" 1148
"CP01149" 1149
"cp037" 37
"cp1025" 21025
"CP1026" 1026
"cp1256" 1256
"CP273" 20273
"CP278" 20278
"CP280" 20280
"CP284" 20284
"CP285" 20285
"cp290" 20290
"cp297" 20297
"cp367" 20127
"cp420" 20420
"cp423" 20423
"cp424" 20424
"cp437" 437
"CP500" 500
"cp50227" 50227
"cp819" 28591
"cp850" 850
"cp852" 852
"cp855" 855
"cp857" 857
"cp858" 858
"cp860" 860
"cp861" 861
"cp862" 862
"cp863" 863
"cp864" 864
"cp865" 865
"cp866" 866
"cp869" 869
"CP870" 870
"CP871" 20871
"cp875" 875
"cp880" 20880
"CP905" 20905
"csASCII" 20127
"csbig5" 950
"csEUCKR" 51949
"csEUCPkdFmtJapanese" 51932
"csGB2312" 936
"csGB231280" 936
"csIBM037" 37
"csIBM1026" 1026
"csIBM273" 20273
"csIBM277" 20277
"csIBM278" 20278
"csIBM280" 20280
"csIBM284" 20284
"csIBM285" 20285
"csIBM290" 20290
"csIBM297" 20297
"csIBM420" 20420
"csIBM423" 20423
"csIBM424" 20424
"csIBM500" 500
"csIBM870" 870
"csIBM871" 20871
"csIBM880" 20880
"csIBM905" 20905
"csIBMThai" 20838
"csISO2022JP" 50221
"csISO2022KR" 50225
"csISO58GB231280" 936
"csISOLatin1" 28591
"csISOLatin2" 28592
"csISOLatin3" 28593
"csISOLatin4" 28594
"csISOLatin5" 28599
"csISOLatin9" 28605
"csISOLatinArabic" 28596
"csISOLatinCyrillic" 28595
"csISOLatinGreek" 28597
"csISOLatinHebrew" 28598
"csKOI8R" 20866
"csKSC56011987" 949
"csPC8CodePage437" 437
"csShiftJIS" 932
"csUnicode11UTF7" 65000
"csWindows31J" 932
"cyrillic" 28595
"DIN_66003" 20106
"DOS-720" 720
"DOS-862" 862
"DOS-874" 874
"ebcdic-cp-ar1" 20420
"ebcdic-cp-be" 500
"ebcdic-cp-ca" 37
"ebcdic-cp-ch" 500
"EBCDIC-CP-DK" 20277
"ebcdic-cp-es" 20284
"ebcdic-cp-fi" 20278
"ebcdic-cp-fr" 20297
"ebcdic-cp-gb" 20285
"ebcdic-cp-gr" 20423
"ebcdic-cp-he" 20424
"ebcdic-cp-is" 20871
"ebcdic-cp-it" 20280
"ebcdic-cp-nl" 37
"EBCDIC-CP-NO" 20277
"ebcdic-cp-roece" 870
"ebcdic-cp-se" 20278
"ebcdic-cp-tr" 20905
"ebcdic-cp-us" 37
"ebcdic-cp-wt" 37
"ebcdic-cp-yu" 870
"EBCDIC-Cyrillic" 20880
"ebcdic-de-273+euro" 1141
"ebcdic-dk-277+euro" 1142
"ebcdic-es-284+euro" 1145
"ebcdic-fi-278+euro" 1143
"ebcdic-fr-297+euro" 1147
"ebcdic-gb-285+euro" 1146
"ebcdic-international-500+euro" 1148
"ebcdic-is-871+euro" 1149
"ebcdic-it-280+euro" 1144
"EBCDIC-JP-kana" 20290
"ebcdic-Latin9--euro" 20924
"ebcdic-no-277+euro" 1142
"ebcdic-se-278+euro" 1143
"ebcdic-us-37+euro" 1140
"ECMA-114" 28596
"ECMA-118" 28597
"ELOT_928" 28597
"euc-cn" 51936
"euc-jp" 51932
"euc-kr" 51949
"Extended_UNIX_Code_Packed_Format_for_Japanese" 51932
"GB18030" 54936
"GB2312" 936
"GB2312-80" 936
"GB231280" 936
"GBK" 936
"GB_2312-80" 936
"German" 20106
"greek" 28597
"greek8" 28597
"hebrew" 28598
"hz-gb-2312" 52936
"IBM-Thai" 20838
"IBM00858" 858
"IBM00924" 20924
"IBM01047" 1047
"IBM01140" 1140
"IBM01141" 1141
"IBM01142" 1142
"IBM01143" 1143
"IBM01144" 1144
"IBM01145" 1145
"IBM01146" 1146
"IBM01147" 1147
"IBM01148" 1148
"IBM01149" 1149
"IBM037" 37
"IBM1026" 1026
"IBM273" 20273
"IBM277" 20277
"IBM278" 20278
"IBM280" 20280
"IBM284" 20284
"IBM285" 20285
"IBM290" 20290
"IBM297" 20297
"IBM367" 20127
"IBM420" 20420
"IBM423" 20423
"IBM424" 20424
"IBM437" 437
"IBM500" 500
"ibm737" 737
"ibm775" 775
"ibm819" 28591
"IBM850" 850
"IBM852" 852
"IBM855" 855
"IBM857" 857
"IBM860" 860
"IBM861" 861
"IBM862" 862
"IBM863" 863
"IBM864" 864
"IBM865" 865
"IBM866" 866
"IBM869" 869
"IBM870" 870
"IBM871" 20871
"IBM880" 20880
"IBM905" 20905
"irv" 20105
"ISO-10646-UCS-2" 1200
"iso-2022-jp" 50220
"iso-2022-jpeuc" 51932
"iso-2022-kr" 50225
"iso-2022-kr-7" 50225
"iso-2022-kr-7bit" 50225
"iso-2022-kr-8" 51949
"iso-2022-kr-8bit" 51949
"iso-8859-1" 28591
"iso-8859-11" 874
"iso-8859-13" 28603
"iso-8859-15" 28605
"iso-8859-2" 28592
"iso-8859-3" 28593
"iso-8859-4" 28594
"iso-8859-5" 28595
"iso-8859-6" 28596
"iso-8859-7" 28597
"iso-8859-8" 28598
"ISO-8859-8 Visual" 28598
"iso-8859-8-i" 38598
"iso-8859-9" 28599
"iso-ir-100" 28591
"iso-ir-101" 28592
"iso-ir-109" 28593
"iso-ir-110" 28594
"iso-ir-126" 28597
"iso-ir-127" 28596
"iso-ir-138" 28598
"iso-ir-144" 28595
"iso-ir-148" 28599
"iso-ir-149" 949
"iso-ir-58" 936
"iso-ir-6" 20127
"ISO646-US" 20127
"iso8859-1" 28591
"iso8859-2" 28592
"ISO_646.irv:1991" 20127
"iso_8859-1" 28591
"ISO_8859-15" 28605
"iso_8859-1:1987" 28591
"iso_8859-2" 28592
"iso_8859-2:1987" 28592
"ISO_8859-3" 28593
"ISO_8859-3:1988" 28593
"ISO_8859-4" 28594
"ISO_8859-4:1988" 28594
"ISO_8859-5" 28595
"ISO_8859-5:1988" 28595
"ISO_8859-6" 28596
"ISO_8859-6:1987" 28596
"ISO_8859-7" 28597
"ISO_8859-7:1987" 28597
"ISO_8859-8" 28598
"ISO_8859-8:1988" 28598
"ISO_8859-9" 28599
"ISO_8859-9:1989" 28599
"Johab" 1361
"koi" 20866
"koi8" 20866
"koi8-r" 20866
"koi8-ru" 21866
"koi8-u" 21866
"koi8r" 20866
"korean" 949
"ks-c-5601" 949
"ks-c5601" 949
"KSC5601" 949
"KSC_5601" 949
"ks_c_5601" 949
"ks_c_5601-1987" 949
"ks_c_5601-1989" 949
"ks_c_5601_1987" 949
"l1" 28591
"l2" 28592
"l3" 28593
"l4" 28594
"l5" 28599
"l9" 28605
"latin1" 28591
"latin2" 28592
"latin3" 28593
"latin4" 28594
"latin5" 28599
"latin9" 28605
"logical" 28598
"macintosh" 10000
"ms_Kanji" 932
"Norwegian" 20108
"NS_4551-1" 20108
"PC-Multilingual-850+euro" 858
"SEN_850200_B" 20107
"shift-jis" 932
"shift_jis" 932
"sjis" 932
"Swedish" 20107
"TIS-620" 874
"ucs-2" 1200
"unicode" 1200
"unicode-1-1-utf-7" 65000
"unicode-1-1-utf-8" 65001
"unicode-2-0-utf-7" 65000
"unicode-2-0-utf-8" 65001
"unicodeFFFE" 1201
"us" 20127
"us-ascii" 20127
"utf-16" 1200
"UTF-16BE" 1201
"UTF-16LE" 1200
"utf-32" 12000
"UTF-32BE" 12001
"UTF-32LE" 12000
"utf-7" 65000
"utf-8" 65001
"visual" 28598
"windows-1250" 1250
"windows-1251" 1251
"windows-1252" 1252
"windows-1253" 1253
"Windows-1254" 1254
"windows-1255" 1255
"windows-1256" 1256
"windows-1257" 1257
"windows-1258" 1258
"windows-874" 874
"x-ansi" 1252
"x-Chinese-CNS" 20000
"x-Chinese-Eten" 20002
"x-cp1250" 1250
"x-cp1251" 1251
"x-cp20001" 20001
"x-cp20003" 20003
"x-cp20004" 20004
"x-cp20005" 20005
"x-cp20261" 20261
"x-cp20269" 20269
"x-cp20936" 20936
"x-cp20949" 20949
"x-cp50227" 50227
"X-EBCDIC-KoreanExtended" 20833
"x-euc" 51932
"x-euc-cn" 51936
"x-euc-jp" 51932
"x-Europa" 29001
"x-IA5" 20105
"x-IA5-German" 20106
"x-IA5-Norwegian" 20108
"x-IA5-Swedish" 20107
"x-iscii-as" 57006
"x-iscii-be" 57003
"x-iscii-de" 57002
"x-iscii-gu" 57010
"x-iscii-ka" 57008
"x-iscii-ma" 57009
"x-iscii-or" 57007
"x-iscii-pa" 57011
"x-iscii-ta" 57004
"x-iscii-te" 57005
"x-mac-arabic" 10004
"x-mac-ce" 10029
"x-mac-chinesesimp" 10008
"x-mac-chinesetrad" 10002
"x-mac-croatian" 10082
"x-mac-cyrillic" 10007
"x-mac-greek" 10006
"x-mac-hebrew" 10005
"x-mac-icelandic" 10079
"x-mac-japanese" 10001
"x-mac-korean" 10003
"x-mac-romanian" 10010
"x-mac-thai" 10021
"x-mac-turkish" 10081
"x-mac-ukrainian" 10017
"x-ms-cp932" 932
"x-sjis" 932
"x-unicode-1-1-utf-7" 65000
"x-unicode-1-1-utf-8" 65001
"x-unicode-2-0-utf-7" 65000
"x-unicode-2-0-utf-8" 65001
"x-x-big5" 950

There's one really egregious name here.  UnicodeFFFE is actually Big Endian UTF-16.  It's like the byte order mark (BOM) for UTF-16BE written in little endian order.  Try to use UTF-16BE instead :)
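Here's a short C# sketch of looking up a few of those labels, including the UnicodeFFFE one.  The class name is made up, and I've deliberately stuck to labels for the Unicode, ASCII and Latin-1 encodings; on newer .Net runtimes I believe many of the other code pages in the table need the code pages encoding provider registered first, so treat that as an assumption to check.

```csharp
using System;
using System.Text;

public static class EncodingAliases
{
    // Resolves a charset label to the code page .Net maps it to.
    public static int CodePageFor(string label)
    {
        return Encoding.GetEncoding(label).CodePage;
    }

    public static void Main()
    {
        string[] labels = { "latin1", "iso-8859-1", "us-ascii", "unicodeFFFE", "UTF-16BE" };
        foreach (string label in labels)
            Console.WriteLine("{0,-12} {1}", label, CodePageFor(label));

        // Big endian UTF-16 puts the high byte first, so U+0041 is 00 41.
        byte[] bytes = Encoding.GetEncoding("UTF-16BE").GetBytes("A");
        Console.WriteLine(BitConverter.ToString(bytes));
    }
}
```

Both "unicodeFFFE" and "UTF-16BE" resolve to code page 1201, which is why the clearer UTF-16BE name is the one to prefer.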

Note that historically lots of data on the web has been mis-tagged, or isn't tagged at all.  For data from Windows machines that data is often in the Windows system code page, such as windows-1252.  So sometimes browsers may attempt to use the current system code page, or try to guess (with varying degrees of success) the actual code page.  Additionally there are differences between different vendors' code page behavior causing further ambiguity.


Unicode, IDN (IDNA), EAI (IMA) and Homograph Security

I wrote about IDN & Security before http://blogs.msdn.com/shawnste/archive/2005/03/03/384692.aspx but thought I'd share some of my more updated views about security of URLs/IDN/Unicode/Email addresses.

People haven't really bothered much with DNS or character based security when it was limited to ASCII.  I'm not sure if this is because people just didn't think about it, or if they thought there wasn't a problem or whatever.  What security attacks happen have been regarded more as "oh, that's curious" rather than a real concern.  Basically there seems to be a presumption that a script, like the ASCII subset of Latin, is inherently secure.  Therefore it would seem reasonable that if ASCII Latin is secure, and other scripts or mixed script environments have homographs, then those scenarios must be insecure and are therefore broken.

Latin and ASCII aren't Secure

The problem with that logic is that it's flawed.  Homographs exist in Latin/ASCII, however http://rnicrosoft.com tends to be regarded as "quaint and amusing" rather than a security problem.  (There used to be a web page there, dunno what happened).  Similarly g00gle or MlCROSOFT or whatnot can all happen in ASCII.  Some things can be done to ASCII to limit the risk, such as choosing fonts or making things lowercase, but that's not always possible. 

Strings are Typed and Read by Humans

Even if the scripts themselves are perfect, the strings we use with the scripts are not.  For example, users have to type them in, and they may or may not use upper or lower case (in cased scripts).  I heard one computer expert indicate that users should just figure out how to enter URLs in lower case, in Unicode Normalization Form C.  (Instead of addressing the problem we should educate all the users).  I wish he were joking.

Depending on the context, there are things you can do to ASCII only strings that can confuse users.  For example http://microsoft.secure.com isn't going to necessarily go to a Microsoft site.  http://secure.com/microsoft.com is a similar trick.
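To show why those two URLs are misleading, here's a small C# sketch that pulls out the host a link actually points at.  The class and method names are made up, and the "belongs to" check is deliberately naive (it just compares the end of the host name), so it's an illustration rather than a real defense.

```csharp
using System;

public static class LinkHost
{
    // Returns the host a URL actually points at.
    public static string HostOf(string url)
    {
        return new Uri(url).Host;
    }

    // Naive check: is the host the domain itself or a subdomain of it?
    public static bool BelongsTo(string url, string domain)
    {
        string host = HostOf(url);
        return host.Equals(domain, StringComparison.OrdinalIgnoreCase)
            || host.EndsWith("." + domain, StringComparison.OrdinalIgnoreCase);
    }

    public static void Main()
    {
        string[] urls =
        {
            "http://microsoft.secure.com",
            "http://secure.com/microsoft.com",
            "http://www.microsoft.com/downloads"
        };
        foreach (string url in urls)
            Console.WriteLine("{0} -> host {1}, microsoft.com: {2}",
                url, HostOf(url), BelongsTo(url, "microsoft.com"));
    }
}
```

Software can do this sort of check easily; the point is that a person glancing at the string usually doesn't.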

DNS isn't the only subject of these problems.  I get mail all the time in the form company@mail-servicing.com where "company" is a legitimate company and "mail-servicing" is the people they've contracted to send their bulk mail.  So it's impossible for me to determine if that's actually a good address for the company.  Even worse is when the mail contains a link.  "Provide feedback about your recent warranty support to http://feedback-surveys.com/OEMsupport"

Strings aren't Even Strings

Sometimes what we click on isn't even related to where we end up going.  We've all seen phishing attacks that look like mybank.com but go to an IP address, and no one can tell if it's real or not.
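As a rough illustration, here is a Python sketch that compares the text of a link with the host it really points to.  The class and function names are hypothetical, the sample HTML is invented, and the "looks like a host name" test is intentionally crude:

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

class LinkAudit(HTMLParser):
    """Collect (visible text, real host) pairs for every <a href=...>."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            host = urlsplit(self._href).hostname or ""
            self.links.append(("".join(self._text).strip(), host))
            self._href = None

def suspicious_links(html):
    """Flag links whose visible text looks like a host name but is not the real host."""
    audit = LinkAudit()
    audit.feed(html)
    return [(text, host) for text, host in audit.links
            if "." in text and " " not in text and text.lower() != host.lower()]

sample = ('<a href="http://192.0.2.7/login">mybank.com</a> '
          '<a href="http://example.com/">example.com</a>')
print(suspicious_links(sample))
```

Only the first link is flagged.  Note that this says nothing about whether 192.0.2.7 is good or bad, only that the text and the destination disagree.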

Strings aren't Always Specific

In some environments strings often aren't even very specific.  I'm pretty certain that if I want a live.com account that I won't get shawn or shawns or even shawnsteele.  Instead I'll be shawn7935 or something.  There's another Shawn here at work that gets some of my mail from simple typos, let alone malicious intent.  There's a pretty good chance that Fred8374 could pass himself off as Fred8347 if he really wanted to.  

We've even been trained that strings don't even have to be close.  If I buy something on eBay from "JoesBestStuff", it takes some faith for me to pay SallySewing7@live.com (apologies if those are real accounts).  I've been quite amused at the variation between "seller's name" and the email sometimes.

Even when we expect them to be the same, there are many spellings for some words.  "Mohammed" is often transliterated differently to Latin.  Unless you deal with one quite often, you're likely to assume most spellings are the same.

Globalization of Strings

Now we've figured out that strings aren't secure, and we'll get tricked even if they were secure.  How does that change in a global environment, such as with IDNA or EAI/IMA strings?  Not much.

Sticking to Latin, you suddenly gain a bunch of look-alikes (homographs) by allowing non-ASCII values.  Strings like mícrosoft, mïcrosoft and mıcrosoft are all “close enough” to be confused, particularly at a quick glance, even more so if the user is conditioned to expect the "real" string.  E.g:  "Important security update for windows, go download it from Mícrosoft.com"  We're already expecting to see microsoft, so the few different pixels are easily missed.
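To make that concrete, here is a small Python sketch that folds those three look-alikes back to "microsoft" by stripping combining marks.  The EXTRA_FOLDS table and the fold() helper are assumptions made up for this example; a realistic confusables table (see Unicode TR#39) would be far larger:

```python
import unicodedata

# Hand-picked extra for a letter that has no Unicode decomposition.
# This one-entry table is an assumption of the sketch, not a real confusables list.
EXTRA_FOLDS = {"\u0131": "i"}  # LATIN SMALL LETTER DOTLESS I

def fold(label):
    """Lower-case, strip combining marks, then apply the small extra table."""
    decomposed = unicodedata.normalize("NFKD", label.lower())
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return "".join(EXTRA_FOLDS.get(ch, ch) for ch in stripped)

candidates = (
    "m\u00edcrosoft",  # i with acute
    "m\u00efcrosoft",  # i with diaeresis
    "m\u0131crosoft",  # dotless i
    "rnicrosoft",      # plain ASCII "r" + "n"
)
for candidate in candidates:
    print(candidate, fold(candidate) == "microsoft")
```

The first three fold to "microsoft", but the pure ASCII "rnicrosoft" sails straight through, which is rather the point of this post.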

For other scripts the problem can be much more severe.  Complex scripts can have similar appearing strings, and many include numerous characters.  Chinese for example has enough characters available that it can be fairly easy in some cases to find a rare character that is similar in appearance to a common character which people have been preconditioned to expect.

"I Solved Homographs"

This leads to a typical problem for developers, particularly "Western" Latin-script based developers.  We tend to expect that if we solve script mixing so that we can't mix up Cyrillic and Latin, that we've solved the homograph problem.  Instead, we've barely scratched the surface and effectively buried our heads in the sand.
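A short Python sketch of that kind of script-mixing check, and of what it misses.  The script guess here (the first word of each letter's Unicode character name) is a rough stand-in chosen for this example, not a proper Unicode Script property lookup, and the function names are hypothetical:

```python
import unicodedata

def scripts_in(label):
    """Very rough script guess: the first word of each letter's Unicode name."""
    return {unicodedata.name(ch).split()[0] for ch in label if ch.isalpha()}

def is_mixed_script(label):
    return len(scripts_in(label)) > 1

spoof_cyrillic = "micr\u043esoft"  # U+043E CYRILLIC SMALL LETTER O in place of "o"

print(is_mixed_script(spoof_cyrillic))  # the Latin/Cyrillic mix is caught
print(is_mixed_script("rnicrosoft"))    # the ASCII look-alike is not
print(is_mixed_script("g00gle"))        # neither is digit substitution
```

The check does its one job, yet two of the three spoofs pass it untouched.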

In some cases the "solution" can be worse than the problem.  For example, some browsers decide that I don't understand Cyrillic since my user locale is en-US (or Klingon), and then print out Punycode.  That's mildly useful to me as a warning, however it does the same thing for Chinese.  It's very unlikely that I'm going to confuse Chinese with Latin, but I'll get Punycode in the address bar anyway.  Now I have no chance of finding out what the actual URL is supposed to look like.  Punycode is all gibberish, but I could probably decipher a Chinese glyph enough to see if it looked similar to what I expected.  With any Punycode strings, I don't even need homographs to confuse me, any Chinese would look the same.  For that matter I could be expecting Chinese, but it could actually be Japanese or Korean, or Cyrillic for that matter.  I'm not trying to say that the browsers' approach is "wrong", just that while this approach may address some problems, it can also cause new ones.
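For a feel of what that looks like, this Python sketch runs a few host names through Python's built-in "idna" codec (which implements IDNA2003).  The host names are made-up examples under the reserved .example domain, and to_ascii() is just a helper for this illustration:

```python
def to_ascii(host):
    """Encode a host name with Python's built-in 'idna' codec (IDNA2003)."""
    return host.encode("idna").decode("ascii")

hosts = (
    "b\u00fccher.example",                           # German
    "\u4e2d\u6587.example",                          # Chinese
    "\u043f\u0440\u0438\u043c\u0435\u0440.example",  # Russian
)
for host in hosts:
    print(to_ascii(host))
```

Every label comes out as an "xn--" string.  Once they are in that form, nothing visible tells a reader which script the original was in, let alone whether it was the name they expected.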

Most of the "solutions" to Homographs that I've seen are similar in my opinion.  They may address a specific issue, but don't solve the entire problem globally.  I also think some approaches are unnecessarily limiting.  Mitigations that reduce the surface area for an attack are useful, however developers should recognize the limitations of those approaches and make sure they aren't spending tons of effort "shutting the window, but leaving the front door wide open."  That only provides a false sense of security, which can be far worse than the original problem.

Comprehensive Solutions

So instead of thinking that strings like URLs are inherently secure somehow if they're ASCII, and focusing on the differences from ASCII, like Cyrillic homographs, we should rather assume that ANY URL might not take us to a place we want to go.  Even an ASCII one.

A much better solution to URL security is one that addresses the entire system rather than focusing on Homographs.  IE, for example, detects malicious web sites (I don't know exactly how it works, but I gather there's blacklisting and bad behavior detection, kinda like virus checking for web sites).  This is far more effective than preventing mixed scripts, and has the advantage of working with ASCII only URLs.  It also does a good job against homographs, pretty much making the Punycode-in-the-address-bar irrelevant.  It also works with many forms of attack, even non-obvious ones. 

My opinion is that if you do a "good job" of detecting any phishing/spoofing type web site, even ASCII-only, then the need for Homograph detection is much reduced.  And if you can't do that, then the attackers will merely add an extra label or something to get around your homograph detection.

Mitigation by Protocol

For things like IDN, it is interesting to consider how the protocol itself approaches security.  Some things are "obvious" as not being interesting for a name.  Compatibility characters, control characters, etc. could somewhat readily be excluded.  Some things are generally considered technically "obvious" to some users, but may frustrate others.  It is generally considered that lower casing the DNS name causes less confusion (can't mix up lower case l with capital I), but I doubt that AAA.com prefers lower casing.  Similarly IDNA2003 allows Unicode "symbols," which are widely regarded as being useless, particularly since they're hard to type, but I suspect that someone would like I♥NY.  So there's a gray area that gets a bit confusing.

Consideration for other protocols is similar.  EAI (email) is interesting because it basically defers "correctness" to the registrar (whoever runs the mail server).  IDN provides some restriction by protocol and more at the registrar level.

One problem with restricting valid characters at the protocol level is that it works OK in a small set, but once you get to a global audience the rules get very complicated.  Domain names allowed (most) English names when they were restricted to ASCII, but German and French had difficulties.  With IDN additional languages are supported, but perhaps the needs of an English registrar and a German one differ.  A complete set of rules applicable world-wide for all strings in all languages may not be possible (eg: Turkish i), but even if it were, the rules would be very complex and difficult to implement for every application adopting a protocol.
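The Turkish i case is easy to see with Python's locale-independent case mappings.  This is only a small demonstration of the default Unicode behavior, not of any particular protocol's rules:

```python
dotless = "\u0131"     # LATIN SMALL LETTER DOTLESS I
dotted_cap = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE

# str.upper() and str.lower() apply the default (non-Turkish) Unicode mappings.
print(dotless.upper() == "I")               # dotless i upper-cases to plain ASCII I
print(dotless.upper().lower() == dotless)   # ...and does not survive the round trip
print([hex(ord(ch)) for ch in dotted_cap.lower()])  # i followed by a combining dot
```

A rule as simple as "lower case everything" already gives a Turkish reader a different letter than the one they typed, so a single world-wide rule set has to pick whose expectations to break.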

Mitigation by Registrar

Restriction at the registrar can be more effective, though perhaps less consistent.  A registrar could be like a domain name registrar, but for these purposes you could also think of the person that assigns user accounts at a business, or email address registration from your ISP.

Registrars can restrict languages to those used in the country they support.  They can bundle or block homographs or alternate spellings (like Traditional and Simplified Chinese spellings of the same word).  In a business they could have certain rules.  (First name, last initial, or first initial, last name is common for user accounts in many companies, at least until they get too many employees.)

IDN has some restrictions by protocol, but allows much tighter restriction at the registrar level.  Ironically, a label at a lower level could then have different "rules" than at the higher level.  EAI allows the local part to be determined entirely by the provider/registrar rather than the protocol.

Rules at the "registrar" level can still be very complex for a complete set of rules, however cases with conceptual differences can still be adopted as applicable for the registrar's environment, whereas a protocol level rule has to either be too flexible, or disallow one registrar's legitimate scenario.  Rules at the registrar level can also be adjusted more readily than at the protocol level.

Mitigation by Application

An application can also decide to be more comprehensive than the protocol.  An application may also have more information, such as blacklists or user settings.  They can make choices for some users like "they only read English, so don't bother with Cyrillic then," and a different choice for a different user.  Applications can also potentially be grayer in their behavior.  Instead of "allowing" and "disallowing" strings, they can say "gee, I'm not so sure, you really want to do this?", or flag it and continue.  They can also be dynamic, such as when you add a sender to a junk mail filter.
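A sketch of that "grayer" behavior in Python.  Everything here is hypothetical: the Verdict names, the judge() function, its parameters and the sample host names are invented for the example, and the policy itself is deliberately simplistic:

```python
from enum import Enum

class Verdict(Enum):
    ALLOW = "allow"
    WARN = "warn"    # "gee, I'm not so sure, you really want to do this?"
    BLOCK = "block"

def judge(host, blocklist, trusted, user_reads_non_ascii):
    """Three-way decision instead of a plain allow/disallow."""
    host = host.lower()
    if host in blocklist:
        return Verdict.BLOCK      # known bad, whatever script it is written in
    if host in trusted:
        return Verdict.ALLOW      # e.g. the user already approved this one
    if not host.isascii() and not user_reads_non_ascii:
        return Verdict.WARN       # flag it and let the user decide
    return Verdict.ALLOW

blocklist = {"rnicrosoft.example"}
trusted = {"b\u00fccher.example"}
print(judge("rnicrosoft.example", blocklist, trusted, user_reads_non_ascii=False))
print(judge("b\u00fccher.example", blocklist, trusted, user_reads_non_ascii=False))
print(judge("m\u00edcrosoft.example", blocklist, trusted, user_reads_non_ascii=False))
```

The blocklist and trusted sets can change while the application runs, which is the dynamic part, much like adding a sender to a junk mail filter.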

IDN vs EAI/IMA vs Unicode

Pretty much this entire "strings aren't secure" concept applies to any Unicode (or for that matter any other code page) string.  That could be an IDN domain name, an EAI mail address, a user account name, etc.  Some environments may be more amenable to certain solutions than others, but the types of attacks that impact a Unicode IDN label could also succeed with the local (user name) part of a Unicode EAI email address.  The general concepts are portable.

I used IDN heavily as an example, but the same things happen to EAI addresses, user account names, logon credentials, etc.  Anything that uses Unicode, or strings, needs to realize that strings can't be expected to be inherently "secure."

There's more info on some thinking about Unicode Security in Unicode TR#39 http://www.unicode.org/draft/reports/tr39/tr39.html.  TR39 addresses the appropriate use of Unicode characters and homographs, but this is at best a mitigation of the more general security concerns of identifier strings.  Phishing and spoofing would still happen even in plain ASCII.

Hope this was helpful, or at least interesting,

Shawn

 

A helpful reader pointed out I don't really know Klingon.

PS. I just checked out your blog (very nice by the way, lots of stuff I need to read) and I noticed along the top of the page you have  (jItlhInganbe') for "I'm not a Klingon". The translation you're looking for would be   (tlhIngan jIHbe' - I am not a Klingon).

Thanks, fixed it :)  As I said, I'm not a Klingon, and had a terrible time finding something to work for "am" in The Klingon Dictionary or one of the on-line lessons I found.

Although that's kind of like one of those odd things that always strikes me as funny when traveling. 

Me: "Can you tell me where a good restaurant is?"
Native: "Sorry, I don't speak English". (Sometimes even without an accent)

Hmm.  I'll continue to have to rely on others for Klingon translations :)

- Shawn
 
 
Posted by shawnste

Email Address Internationalization / Internationalized eMail Addresses (EAI/IMA)

With the IDN work for Internationalized Domain Names using characters beyond ASCII, it is only natural to tackle the problem of Internationalized Internet eMail.

Some smart people have been working on an IETF working group to figure out how non-ASCII email would work, and I encourage people to take a look: http://www.ietf.org/html.charters/eai-charter.html.  That page has the charter, a list of drafts and RFCs that have already been produced, and links to the IMA working group mailing list.

Assuming you're an ASCII/Latin character user, imagine having to type all your URLs in Chinese, or Cyrillic (or if you know those, imagine typing everything in Klingon, eg:  )  In many cultures, that's what it's like to use the web.  Some users may not be literate in Latin letters, or may have to do a lot of hunt-n-pecking.  EAI should help address that problem.

How EAI/IMA Works

The basic idea of the EAI working group is to stick email in UTF-8 instead of ASCII.  UTF-8 works pretty well in many systems, and many mailers already handle 8 bit encodings, so this is a pretty "simple" solution.  Unfortunately email touches a lot of places, so there're a lot of protocols that need updates (eg: SMTP, POP, mailto:, etc.)  Additionally everyone knows that UTF-8 email can't happen instantly, so there needs to be a system for existing servers to talk to UTF-8 aware ones, which leads to a few more RFCs.

UTF8SMTP allows the servers to make decisions about the "local" part of the email address, which allows for groups to fit their own needs.  The backwards compatibility means that users also need ASCII addresses, as they do today.  The server would alias from one address to another so mail to @microsoft.com could map to my normal mailbox, and I'd only have one mail.  Unfortunately that simple concept means that places that didn't have to worry about aliasing before may now have to consider aliases and fallback addresses.  Contact lists may need to have both forms, etc.

Current Status of EAI/IMA

Currently there are several experimental RFCs, and several people have created interoperating systems that work with each other to demonstrate the feasibility of UTF8SMTP.   The next step is to move towards a standards track process, which could happen "reasonably quickly".  I'm optimistic that the standards will move quickly, but sometimes these things take a while.

So Who's Gonna Use It?

There are a lot of markets where ASCII doesn't work very well for various reasons.  Even when people have ASCII aliases, it may seem artificial, and there may be a desire for an email that reflects them or their country.  There are many ISPs in countries like Korea, China, & Japan that are very eager to be able to send email in a native script.  Some governments like Russia and China are weighing in on the importance of being able to send mail and use the Internet in their script. 

What's IMA Mean To Me As a Software Developer? (who cares?)

If you are a developer, then you may run into IMA addresses.  Even if your app doesn't explicitly deal with mail, there may be a place for email to sneak into your app.  For example, IDN and domain names don't really have much to do with Word or PowerPoint, yet they often show up in documents and presentations.  I could imagine an author address in metadata, such as a photographer contact in a photo's metadata.  Many apps probably will run into IMA addresses whether they realize it or not.

Anyway, I have been thinking about this space for a while and thought I'd share my observations.  It's worth considering what impact IMA will have on your application (while you're at it, how's IDN behave?)

 -Shawn

 

Writing "fields" of data to an encoded file.

The moral here is "Use Unicode," so you can skip the details below if you want :)

A common problem when storing string data in various fields is how to encode it.  Obviously you can store the Unicode as Unicode, which is a good choice for an XML file or text file.  However, sometimes data gets mixed with other non-string data or stored in a record, like a database record.  There are several ways to do that, but some common formats are delimited fields, fixed width fields, counted fields.  I'm going to ignore more robust protocols like XML for this problem.

A delimited field uses a character between fields to indicate that one field ended and another started.  Common delimiters are null (0), comma, and tab.  Using delimited fields, a list of names would look something like "Joe,Mary,Sally,Fred".

A fixed width field would be a field of a known size regardless of the input data size.  Generally data that is too short is padded with a space or null, and data that is too long is clipped.  If our "names" field was of fixed size four, then the previous list could look something like "Joe_MarySallFred".  Note the _ to pad the 3 character name, that Sally is clipped, and that the other names are "run together".

A counted field would indicate the field size for each piece of data before outputting the data.  The advantage is that it doesn't have the size restriction/clipping of fixed width fields, nor does it have to waste space with unnecessary padding.  (It could still be clipped for large strings as the count is likely restricted to some number of bits.)  Similarly delimiters aren't a problem.  Generally the count is binary, but I'll show an example using numbers: "3Joe4Mary5Sally4Fred".
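The three layouts can be sketched in a few lines of Python.  The function names are made up for this illustration, and the counted format uses a decimal digit count purely to match the example strings above:

```python
names = ["Joe", "Mary", "Sally", "Fred"]

def delimited(fields, sep=","):
    """Join fields with a delimiter character."""
    return sep.join(fields)

def fixed_width(fields, width=4, pad="_"):
    """Clip values that are too long and pad values that are too short."""
    return "".join(field[:width].ljust(width, pad) for field in fields)

def counted(fields):
    """Prefix each field with its length (decimal here; usually binary in practice)."""
    return "".join(f"{len(field)}{field}" for field in fields)

print(delimited(names))    # Joe,Mary,Sally,Fred
print(fixed_width(names))  # Joe_MarySallFred
print(counted(names))      # 3Joe4Mary5Sally4Fred
```

All three work on characters here, which is the easy case.  The trouble starts once those characters have to be turned into bytes in some encoding.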

A somewhat obvious way to store and read Unicode char or Unicode string data in the above formats is to write it in Unicode.  Counted fields can just count the Unicode code points to be read in.  Fixed width fields can similarly check for the space available and use Unicode character counts.   Delimited fields can also use Unicode.

When the desired output isn't Unicode (UTF-16) however, then you start running into some interesting problems.  Encodings (code pages) don't have a 1:1 relationship with UTF-16 code points, so you have to be careful.  Additionally some encodings shift modes and maintain state through shift or escape sequences.

For all of the fixed, counted, delimited techniques shift states cause an additional problem in that either the writer has to terminate the sequence, or persist the state until the next field.  Consider 2 fields where field 1 has some ASCII data that looks like "Joe" followed by shift sequence, then a Japanese character, and field 2 has "Kelly" in what looks like ASCII.  If the decoder retains the state between reading the 2 fields, it may accidentally read in "Kelly" as Japanese and presumably corrupt the output.  Alternatively if "Kelly" was really intended to read in "japanese" mode, then any application starting to read at field 2 gets confused since it didn't see the shift at the end of field 1. 

For that reason I like to make sure the fields are "complete", flushing the encoder at the end of each field (this is different than writing a pure-text document like XML).  So then field 1 above would have a shift-back-to-ASCII sequence at the end.
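Here is a Python sketch of the "Joe" and "Kelly" scenario using ISO-2022-JP, one real example of an encoding that shifts modes with escape sequences.  The choice of encoding and the variable names are assumptions for this illustration:

```python
import codecs

ENCODING = "iso2022_jp"  # stateful: escape sequences switch between ASCII and Japanese
FIELD1 = "Joe\u65e5"     # "Joe" followed by one Japanese character

# Writer that does NOT flush: the encoder is left in Japanese mode.
encoder = codecs.getincrementalencoder(ENCODING)()
field1_unflushed = encoder.encode(FIELD1)
reset = encoder.encode("", final=True)   # what flushing would have appended

# Writer that flushes: a one-shot encode returns to ASCII at the end of the field.
field1_flushed = FIELD1.encode(ENCODING)
field2 = b"Kelly"

# A reader that keeps its decoder state between fields.
leaky = codecs.getincrementaldecoder(ENCODING)("replace")
leaky.decode(field1_unflushed)
leaked = leaky.decode(field2)            # still in Japanese mode: not "Kelly"

clean_reader = codecs.getincrementaldecoder(ENCODING)("replace")
clean_reader.decode(field1_flushed)
clean = clean_reader.decode(field2)      # back in ASCII mode: "Kelly"

print(reset, repr(leaked), repr(clean))
```

The only difference between the two writers is the trailing shift-back-to-ASCII sequence, and that is all it takes to decide whether the next field is read correctly.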

For fixed fields this could introduce another problem because the shift-back-to-ASCII sequence may exceed the allowed field size.  In that case the string would have to be made smaller before encoding to allow enough room for flushing.
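One way to do that trimming, sketched in Python.  encode_fixed() is a hypothetical helper, ISO-2022-JP is just a convenient stateful encoding to demonstrate with, and trimming one character at a time is the simplest approach rather than the fastest:

```python
def encode_fixed(text, width, encoding="iso2022_jp", pad=b" "):
    """Encode text into exactly `width` bytes, including any trailing shift sequence.

    Characters are dropped from the end until the flushed, encoded form fits,
    then the field is padded with ASCII spaces.
    """
    while True:
        data = text.encode(encoding)   # a one-shot encode includes the final flush
        if len(data) <= width:
            return data + pad * (width - len(data))
        text = text[:-1]               # too long: clip a character and retry

print(encode_fixed("Joe\u65e5", 11))   # fits, with the shift-back-to-ASCII included
print(encode_fixed("Joe\u65e5", 8))    # the Japanese character has to be dropped
```

Clipping the encoded bytes instead of the characters would risk cutting through an escape sequence or half of a double-byte character, which is why the sketch re-encodes after each trim.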

For delimited fields there's an additional problem in that the delimiter could accidentally look like part of an encoded sequence.  Delimiters should only be tested on the decoded data.
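A Python sketch of that delimiter problem, using Shift-JIS and "|" as the delimiter.  The field values are invented for the example; the first one starts with a katakana character whose second Shift-JIS byte happens to be the same byte as "|":

```python
ENCODING = "shift_jis"
fields = ["\u30dd\u30b9\u30c8", "Joe"]   # a katakana word, then "Joe"

record = "|".join(fields).encode(ENCODING)

# Wrong: look for the delimiter in the encoded bytes.
split_on_bytes = record.split(b"|")

# Right: decode first, then look for the delimiter in the decoded text.
split_on_text = record.decode(ENCODING).split("|")

print(len(split_on_bytes), split_on_bytes)
print(len(split_on_text), split_on_text)
```

Splitting the raw bytes finds one "delimiter" too many and cuts a character in half, while splitting the decoded text gives back the two original fields.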

For counted fields you start having trouble if the count isn't in encoded bytes.  If you counted the Unicode code points, then encoded those code points, you don't know how many bytes to read back in when decoding.  It isn't possible to "just guess" when to stop reading data because there may or may not be some state changing data that you are expected to either ignore or read.  For example "Joe++" where ++ is a Japanese character could look like:

4<shift-to-ascii>Joe<shift-to-Japanese><+><+>, or
4<shift-to-ascii>Joe<shift-to-Japanese><+><+><shift-to-ascii>, or
4<shift-to-ascii>Joe<shift-to-Japanese><+><+><shift-to-mode-q><shift-to-mode-z><shift-to-mode-x>

where "4" represents the count, <+> represents the encoded character, and <shift...> indicates some sort of state change that doesn't cause output directly by itself.

Since the application doesn't know whether to expect the trailing <shift> sequence(s), it may not read enough data, and then may try to use <shift-to-ascii> as the count of the next field.  Similarly if it does see a <shift-to-ascii> and tries to read it in, then maybe it'll be confused if that was actually the count of the next field that just happened to look like a mode change.
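Counting the encoded bytes instead avoids the guessing.  In this Python sketch the two-byte big-endian count and the helper names are assumptions chosen for the example:

```python
import struct

def write_field(text, encoding="utf-8"):
    """Encode one field and prefix it with the size of the ENCODED data."""
    data = text.encode(encoding)
    return struct.pack(">H", len(data)) + data   # 2-byte big-endian byte count

def read_fields(blob, encoding="utf-8"):
    """Read back every counted field from a byte string."""
    fields, pos = [], 0
    while pos < len(blob):
        (size,) = struct.unpack_from(">H", blob, pos)
        pos += 2
        fields.append(blob[pos:pos + size].decode(encoding))
        pos += size
    return fields

first = "Joe\u65e5\u672c"   # "Joe" plus two Japanese characters
record = write_field(first) + write_field("Kelly")

print(len(first), len(first.encode("utf-8")))   # code points vs. encoded bytes
print(read_fields(record))
```

Because the count covers exactly the bytes that were written, including any trailing shift sequences a stateful encoding adds, the reader always knows where the next count begins.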

So the moral is: Use UTF-16 because that's what the strings look like so they're less likely to get shifty about their sizes. 

  • Use Unicode.  Either UTF-16, or maybe use UTF-8, though it still can change size and you have to be careful, but at least each code point represents a Unicode code point. 
  • If you must count, try to count the actual encoded data size, not the unencoded form since that'll be confusing when decoding.
  • Be good and flush your encoder if you must encode, so that the state gets back into a known state (usually ASCII) and then the decoding application doesn't get confused if they don't reset their decoder.
  • Make sure you say which encoding you used.

Of course you may be talking to a GPS or something where you don't get to define the standard.  In that case you can just watch out for these caveats.  Should you be designing such a protocol however, make sure to use Unicode.  If that cannot happen, at least make sure to pay attention to the impact of encoding and decoding the data when the protocol's used.

-Shawn

 

Locale Builder and Two Letter ISO name and Three Letter Windows Language Name

When you use the Microsoft Locale Builder tool to build a custom locale, it asks for a lot of fields.  Two may not be obvious:

The Two Letter ISO Language name is permitted to be 3 letters for locales that don't have a 2 letter code (eg: haw for Hawaiian). 

The Three Letter Windows Language Name is mostly used for in-box locales for things having to do with our build process, so you can pretty much pick anything.  Mostly I just use the ISO code, but note that Windows tries to keep this value unique.  Note:  Do NOT use this 3 letter windows code in your application, instead use the ISO standard codes.

 

Cheating to UNinstall Custom Cultures / Locales

In Cheating To Install Custom Cultures, I mentioned how to add the custom cultures without using CultureAndRegionInfoBuilder.Register().  Should you have any problems with a custom culture / locale and want to uninstall it but are having difficulty with an uninstaller or whatnot, this is how to get rid of it:

{Warning this edits system stuff and could mess up your computer if you aren't careful, or if the custom culture was required by some application.}

Warning, if your locale was installed with the Microsoft Locale Builder installer or another installer, you'll still have to run that uninstaller to make the system happy if you want to reinstall it that way.  In other words, don't use this if it came through an installer.

1) Run intl.cpl (Regional and Language Options) and change to some other locale.

2) Get rid of the custom culture file.

a) Open an elevated command window (eg: press windows key, then type "cmd", then CTRL+SHIFT+ENTER, or right click on cmd and choose "Run as Administrator")

b) "dir %windir%\globalization\*.nlp" to see installed custom locales

c) "rename %windir%\globalization\fj-FJ.nlp fj-FJ.disabled" to disable a custom culture named "fj-FJ" (Fijian (Fiji)).

d) After rebooting, if desirable you can then "del %windir%\globalization\fj-FJ.disabled".  Often you can't just delete the .nlp file at first because it may be in use.

3) You can also clean the registry key, though this isn't necessary: the custom locale won't work if the system can't find the file.

a) Run regedit (warning: improper use can mess up your computer, etc.)

b) Expand all the + arrow thingies to get to HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CustomLocale (each \ is a new level)

c) Select the value (eg: fj-FJ) you want to delete, then press the delete key.
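Steps 2b and 2c can also be scripted.  This Python sketch is only an illustration: the function names are made up, it defaults to a dry run, it leaves the registry alone, and the same warnings as above apply to anything that actually renames files:

```python
from pathlib import Path

def list_custom_locales(globalization_dir):
    """Return the custom locale files (*.nlp) found in the given directory."""
    return sorted(Path(globalization_dir).glob("*.nlp"))

def disable_custom_locale(globalization_dir, name, dry_run=True):
    """Rename <name>.nlp to <name>.disabled.

    With dry_run=True (the default) nothing is changed and the planned
    rename is only returned.
    """
    source = Path(globalization_dir) / f"{name}.nlp"
    target = source.with_suffix(".disabled")
    if not dry_run:
        source.rename(target)
    return source, target
```

On a real machine the directory would be the expansion of %windir%\globalization, for example via os.path.expandvars, and the script would need to run from an elevated prompt just like the manual steps.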

Hope someone finds this helpful.

Shawn

 

 
