Unicode use on the web

Google posted a blog entry about Unicode use on the web: http://googleblog.blogspot.com/2008/05/moving-to-unicode-51.html  They also announced that they now support Unicode 5.1, which is probably a good thing, but I found the graph the most interesting part: http://bp1.blogger.com/_Ap14FtNN91w/SBzrtHJfLnI/AAAAAAAAA5U/TV7_g2_sWq0/s1600-h/Unicode2.gif

The graph shows UTF-8 as now being the most prevalent encoding on the web, with a steady decline of ASCII and declining trends for other common encodings.  That's got to be a good thing for character portability between machines on the web.  Lack of proper declaration and use of encodings is one of the biggest problems with interoperability on the web, so it's nice to see Unicode gaining ground.  The rate of growth is also really good, a nice strong curve there at the end :)

Commenting on Comments

Wow, I see a flurry of comments, I guess I was out.  Usually the system asks me to approve the comments (there's an unbelievable amount of spam), but you can avoid that by creating an account.  If you have an account (click Join in the top-right), then it seems to auto-approve your comments :)  (Of course then I still have to remember to check for and read them).

A large part of the delay is that I was just ignoring everyone since I was busy having a wonderful time on a sailboat around Tortola :)

Posted by shawnste | 0 Comments

Code page xxx isn't available on my (Windows XP or 2003) computer, what happened?

Use Unicode :) 

Beginning with Windows Vista all code pages are installed by default.  In previous versions of Windows, however, code page support has to be enabled through intl.cpl (Regional and Language Options in the Control Panel).  You'll find a tab there for adding code pages.  For some languages the "East Asian language support" checkbox also needs to be checked to enable complete font support.
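If an application has to cope with machines where a code page might be missing, one option is to probe for it up front.  This is only a minimal sketch: the class and method names are made up for illustration, and it assumes that Encoding.GetEncoding reports a missing code page by throwing ArgumentException or NotSupportedException, which is what the .Net documentation describes.

```csharp
using System;
using System.Text;

public static class CodePageCheck
{
    // Hypothetical helper: true if this machine/runtime can supply an
    // Encoding object for the given code page number.
    public static bool IsAvailable(int codePage)
    {
        try
        {
            Encoding.GetEncoding(codePage);
            return true;
        }
        catch (ArgumentException)      // invalid or unsupported code page number
        {
            return false;
        }
        catch (NotSupportedException)  // known, but not available on this platform
        {
            return false;
        }
    }

    public static void Main()
    {
        // 65001 is UTF-8, 932 is shift_jis, 1252 is windows-1252
        foreach (int cp in new int[] { 65001, 932, 1252 })
        {
            Console.WriteLine("Code page {0}: {1}",
                cp, IsAvailable(cp) ? "available" : "not available");
        }
    }
}
```

Which code pages show up as available depends on the OS version and runtime, so treat the result as a property of the machine the code runs on.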

"Yamla" blocked Microsoft from updating Wikipedia.

Bummer.  Sometimes I try to make technical edits to Wikipedia, but I'm too lazy to set up an account.  I avoid the anti-Microsoft pages because they're just too biased, and only focus on technical issues.  Seems that Microsoft's IP addresses have been blocked from making updates due to "spamming links to external sites."  Lots of people work here, so I have no clue what "Yamla" objected to, but I know I've added links such as to the windows-1252 web page.  (Might just be me, but I'd've thought that a link to the Microsoft page for a code page would be appropriate for an article describing a Microsoft-defined code page ;-)

Personally I think Wikipedia's a good source of information (except maybe for the slanted articles).  I'd've thought that it'd be good for everyone if it was easy for industry participants to make technical updates.  I can see where Wikipedia wouldn't want to be a PR tool of any company, but I can't see how disallowing technical updates helps anyone.

 

Server 2008 U+FFFD behavior for unknown or illegal UTF-8 sequences.

In my post Change to Unicode Encoding for Unicode 5.0 conformance I mentioned that the behavior of illegal characters has changed for Unicode 5 conformance in Windows Vista / .Net 2.0+.  Those changes have also been inherited by Server 2008.
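To see the replacement behavior for yourself, here is a small sketch.  The byte sequence and the helper names are my own choices for illustration.  It decodes an illegal sequence once with the default UTF-8 decoder, which substitutes U+FFFD, and once with a UTF8Encoding constructed to throw on invalid bytes.

```csharp
using System;
using System.Text;

public static class Utf8Demo
{
    // 'A', then 0xC0 0x80 (an overlong, illegal encoding of NUL), then 'B'
    public static readonly byte[] Bad = { 0x41, 0xC0, 0x80, 0x42 };

    // Default decoder: illegal bytes are replaced rather than rejected.
    public static string DecodeLenient(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }

    // Strict decoder: second constructor argument is throwOnInvalidBytes.
    public static bool IsValidUtf8(byte[] bytes)
    {
        UTF8Encoding strict = new UTF8Encoding(false, true);
        try
        {
            strict.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    public static void Main()
    {
        string lenient = DecodeLenient(Bad);
        Console.WriteLine("Contains U+FFFD: " + (lenient.IndexOf('\uFFFD') >= 0));
        Console.WriteLine("Strictly valid:  " + IsValidUtf8(Bad));
    }
}
```

How many U+FFFD characters come back for a given illegal sequence can differ between versions, so the sketch only checks that at least one is present.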

Also check out my collection of code page related articles at Code Pages, Unicode & Encodings

MSDN Code Gallery Custom Locales

MSDN has this new nifty Code Gallery place for samples and the like, so I stuck my custom locale examples at http://code.msdn.microsoft.com/CustomLocales and added a few more since posting them to my blog here.

Code pages and security issues

One of the reasons I always suggest "Use Unicode!" is that there are security problems converting between code pages. In short if data is going to be converted between code pages after some sort of security validation is done, then that validation could be negated. This is true of lots of data transformations, but it seems to surprise people a lot when applied to code page transformations.

There are lots of reasons for this, but some are:

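  • Transformations can have "best-fit" mappings.  For example, if I test for "C:\windows" in some form, but a later transformation maps the fullwidth compatibility characters (the U+FF00 block, such as U+FF43, fullwidth c) to their ASCII look-alikes, then input that passed the test turns into "C:\windows" afterwards and the security check gets invalidated.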
  • Similar things happen with different versions of code pages.  For example, ASCII is sometimes handled by "dropping" the high bit so 0xc3 ends up being 0x43.  Alternatively sometimes applications just "pretend" that it was really ANSI and map it to whatever code page the user is using.
  • Code pages aren't always tagged correctly, so an ANSI validation using a default system code page on a server may yield different results than the same code with a different default system code page on a client.
  • Some code pages also differ between systems.  Windows itself has slightly different behavior between MLang (often used by IE) and MultiByteToWideChar().
  • Different systems handle unexpected, unassigned or illegal code points in different manners.  Sometimes that means ? or the equivalent.  Sometimes gibberish, sometimes dropped data (so then your C:\windows test on C?:?\?W?i?n?d?o?w?s doesn't work if all the ? disappear).
  • Sometimes behavior changes, such as Change to Unicode Encoding for Unicode 5.0 conformance.  In this case the change is to the Unicode parsing itself, but still any security test should be done after reading the input data.
  • Escape sequences might modify data in some environments that provide escaping mechanisms for characters that a code page doesn't support.
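The first bullet above can be sketched in a few lines.  Note the assumptions: LooksSafe is a made-up validator, and I'm using Unicode NFKC normalization as a stand-in for a code page's best-fit conversion, because it folds fullwidth characters to ASCII in a similar way and behaves the same on any machine.  It is not the actual best-fit table.

```csharp
using System;
using System.Text;

public static class LateConversionDemo
{
    // Hypothetical validator: rejects anything that mentions C:\windows.
    public static bool LooksSafe(string path)
    {
        return path.IndexOf(@"c:\windows", StringComparison.OrdinalIgnoreCase) < 0;
    }

    // Stand-in for a lossy conversion (NOT the real best-fit mapping):
    // NFKC folds fullwidth compatibility characters to their ASCII forms.
    public static string Fold(string s)
    {
        return s.Normalize(NormalizationForm.FormKC);
    }

    public static void Main()
    {
        // Fullwidth c, colon and backslash (U+FF43, U+FF1A, U+FF3C), then ASCII "windows"
        string input = "\uFF43\uFF1A\uFF3Cwindows";

        // Validating first: the fullwidth form does not match the ASCII pattern.
        Console.WriteLine("Check before conversion: " + LooksSafe(input));

        // Validating after the conversion sees what the consumer will see.
        Console.WriteLine("Check after conversion:  " + LooksSafe(Fold(input)));
    }
}
```

The point is the ordering: the same validator gives a different answer depending on whether it runs before or after the transformation.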

A related problem is the IDN and code page parsing that browsers sometimes do.  & named and numeric entities in HTML can end up with a different appearance.  % escaping is common in URLs, and IDN xn-- encoding happens in domain names.  An application may decode these, even at unexpected times, and cause problems if the data was assumed to be in a different state before the decoding.

So the moral is: Do any security tests after any conversions have been done.  If you have to retransmit the data, try to use an encoding like Unicode that has fewer edge case behaviors that could trip you up.  If possible, revalidate the data after the transmission if it has to be decoded. 
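As a rough illustration of "convert first, then test", the sketch below decodes % escapes and HTML entities before running the check.  The helper names, the blocked path, and the decoding order are all assumptions for the example.  A real application needs to mirror whatever decoding its own consumers actually perform, and in the same order.

```csharp
using System;
using System.Net;

public static class DecodeThenValidate
{
    // Hypothetical canonicalization step: undo the escaping layers that a
    // downstream consumer is assumed to undo.
    public static string Canonicalize(string raw)
    {
        string s = Uri.UnescapeDataString(raw);  // %xx escapes, as used in URLs
        s = WebUtility.HtmlDecode(s);            // named and numeric HTML entities
        return s;
    }

    // Validate only the canonical form.
    public static bool IsAllowed(string raw)
    {
        string canonical = Canonicalize(raw);
        return canonical.IndexOf(@"c:\windows", StringComparison.OrdinalIgnoreCase) < 0;
    }

    public static void Main()
    {
        string[] inputs = { "C%3A%5Cwindows", "&#67;:\\windows", "D%3A%5Cdata" };
        foreach (string raw in inputs)
        {
            Console.WriteLine("{0} -> allowed: {1}", raw, IsAllowed(raw));
        }
    }
}
```

This does a single decoding pass.  If something downstream decodes again, doubly escaped input would slip through, which is exactly why revalidating after each decode is worth the trouble.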

 

http://www.languagegeek.com/

http://www.languagegeek.com/ says it is dedicated to the promotion of Native North American languages.  A coworker ran into this site while she was trying to learn more about the Lakota language (and she made a Lakota custom locale too! :)  I don't know much about the site, but apparently there are keyboards and other things that might be helpful to those of you trying to get keyboards or fonts or whatnot.  I don't see custom locale files there yet, but maybe in the future ;-)

Cantonese and Mandarin language tagging.

The IETF "Language Tag Registry Update" (LTRU) working group has noted that lots of data is tagged as "zh-Hant", regardless of whether it is pronounced as Cantonese or Mandarin.  For video and audio, however, this doesn't allow a fine enough distinction, and so the LTRU is working on revising RFC 4646/4647 and the registry to allow for new tags to distinguish Cantonese and Mandarin from the "macrolanguage" of Chinese.

So in the future we should expect to see "cmn" and "yue" tags instead of zh.  The LTRU is still a bit in flux about the details, but it is clear that in the future newly tagged data will use "cmn" and "yue".  This is going to cause "an interesting time" since lots of legacy data, resources and systems will continue to use the zh tags.

User configurations may need to change, such as allowing both "cmn" and "zh" in a web browser's language configuration.  Applications and systems may also need to change to provide "cmn" resources if "zh" was asked for, or vice versa.  Content providers may also need to retag existing data to distinguish between Cantonese and Mandarin.
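Here is one way an application might bridge the old and new tags when looking up resources.  This is purely a sketch: the mapping table, the class and the method names are hypothetical, it only looks at the primary language subtag, and the order in which "zh" falls back to "cmn" or "yue" is a policy decision I've made arbitrarily.

```csharp
using System;
using System.Collections.Generic;

public static class LanguageFallback
{
    // Illustrative table only: encompassed language -> macrolanguage.
    static readonly string[,] Macro = { { "cmn", "zh" }, { "yue", "zh" } };

    // Returns the tags to try, in order, for a requested tag.
    public static List<string> Candidates(string requested)
    {
        List<string> result = new List<string>();
        result.Add(requested);
        for (int i = 0; i < Macro.GetLength(0); i++)
        {
            if (Macro[i, 0] == requested)
                result.Add(Macro[i, 1]);   // asked for cmn or yue: also accept zh
            else if (Macro[i, 1] == requested)
                result.Add(Macro[i, 0]);   // asked for zh: also accept cmn, then yue
        }
        return result;
    }

    // Picks the first candidate that exists among the available resources,
    // or null if nothing matches.
    public static string Pick(string requested, ICollection<string> available)
    {
        foreach (string tag in Candidates(requested))
        {
            if (available.Contains(tag)) return tag;
        }
        return null;
    }

    public static void Main()
    {
        List<string> legacy = new List<string>(new string[] { "en", "zh" });
        Console.WriteLine("cmn requested, legacy content: " + Pick("cmn", legacy));

        List<string> retagged = new List<string>(new string[] { "en", "yue" });
        Console.WriteLine("zh requested, retagged content: " + Pick("zh", retagged));
    }
}
```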

With these types of changes, the adoption rate is usually quite varied, so expect some applications and content to shift rapidly to using the new recommended names once that new standard is created.  Other data and systems will probably remain unchanged for a very long time, leading to very interesting scenarios when those environments communicate with each other.

zh-Hans, zh-Hant and the "old" zh-CHS, zh-CHT

With Windows Vista and Microsoft .Net 2.0 (MS07-040 security patch) and 3.0+, we've started to use the IETF standard "zh-Hans" and "zh-Hant" names for Chinese simplified and traditional.  In Windows the zh-CHS/zh-CHT names were never used because the named APIs are new to Vista.  Also in Silverlight we don't use the old names since Silverlight is new.

However in .Net 2.0/3.x we still recognize zh-CHS & zh-CHT for backwards compatibility.  Additionally we prefer the "old" zh-CHS and zh-CHT names when enumerating or returning the name of a CultureInfo created by LCID (0x0004 or 0x7c04).  The other oddity is that, to recognize more resources, we made the parent of zh-CHS/zh-CHT be zh-Hans/zh-Hant.

This allows resources labeled zh-Hans or zh-Hant (the preferred name) to be loaded by systems that used the older zh-CHS/zh-CHT names.  Unfortunately they can't be parents of each other, so this resource fallback only works one direction.  You cannot normally find zh-CHS resources if you start with a zh-Hans locale.  So the recommendation is to use zh-Hans/zh-Hant when creating resources.
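If you want to see which way the fallback runs on a particular machine, you can walk the Parent chain of a culture.  A minimal sketch follows.  The helper name is my own, and I'm deliberately not showing expected output because the names that are recognized and the parents they report depend on the .Net version and OS.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class CultureParents
{
    // Hypothetical helper: returns the culture's name followed by the names
    // of its parents, stopping at the invariant culture.  Returns an empty
    // list if the name is not recognized on this system.
    public static List<string> ParentChain(string name)
    {
        List<string> chain = new List<string>();
        CultureInfo ci;
        try
        {
            ci = new CultureInfo(name);
        }
        catch (ArgumentException)   // unrecognized culture name
        {
            return chain;
        }

        // The invariant culture has an empty name; the count limit is only
        // a safety net against an unexpected cycle.
        while (ci.Name.Length > 0 && chain.Count < 8)
        {
            chain.Add(ci.Name);
            ci = ci.Parent;
        }
        return chain;
    }

    public static void Main()
    {
        foreach (string name in new string[] { "zh-CHS", "zh-Hans", "zh-CHT", "zh-Hant" })
        {
            Console.WriteLine("{0}: {1}", name, string.Join(" -> ", ParentChain(name).ToArray()));
        }
    }
}
```

Resource lookup follows the same chain, so a name that never appears among another culture's parents can't be reached by fallback from it.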

In the future (like v4+, we're not sure when) zh-CHS/zh-CHT will no longer be recognized by default.  Users will still be able to create zh-CHS/zh-CHT custom cultures if necessary to work around legacy naming related issues, similar to the way you can make an az-AZ-Latn now to work around that name change.

 

Michael has a blog about converting apps from ANSI to Unicode

Lots of apps are now Unicode, but some need to make the shift from ANSI (like Japanese shift-jis) to Unicode.  Michael has a series of blog posts about a project conversion.  http://blogs.msdn.com/michkap/archive/2007/01/05/1413001.aspx

I recently had a customer question about removing a shift_jis dependency and moving to Unicode, so I thought I'd blog about it, but I've been busy, so in the meantime I thought maybe Michael's blog would help :0)

How do you make your regional and language options apply to new user accounts?

In general it's a good idea to allow users to choose appropriate settings, but being able to adjust the default user account settings to provide users with an appropriate default locale is often helpful.  Also one cannot easily change the system account's settings.

The Regional and Language Options Control Panel (intl.cpl) has an administrative tab from which you can choose "Copy to reserved accounts".  This copies the settings to the system and default accounts so that new users get those settings.  I'd recommend testing on the local account to make sure that these settings are what you want.  The label was slightly different in XP, but it does the same thing.

To do this in an automated manner, see "Windows Vista Command Line Configuration of International Settings" on MSDN.  The article at http://www.microsoft.com/globaldev/vista/vista_tools/vista_command_line_international_configuration.mspx discusses changing options using an XML configuration file which you create.

 

Are we going to update or maintain the best fit &/or code page mappings?

People wonder if we're going to update our best fit code page mappings, or even our code page mappings.  The answer is no.  Changing character mappings causes difficulties for applications and our experience has been that doing so breaks as much as it "fixes".  We'd prefer that applications move to Unicode; then you don't have to worry about best fit, or whether a character is supported.

Best fit behavior is the behavior of some code pages to map unknown Unicode characters to a character that the code page supports and that someone thought was similar.  Examples would be mapping ｋ (U+FF4B, fullwidth k) to k, ĩ (U+0129, Latin small letter i with tilde) to i, or ∞ (U+221E, infinity symbol) to 8.  Some of these seem reasonable; however, we aren’t consistent in our mappings, most break the meaning, and some mappings (∞->8) change the meaning completely.

The best fit mappings were created “a long time ago”, contained “omissions”, and haven’t been updated to include new Unicode characters.  “Newer” code pages don’t necessarily include the same best fit mappings, and, by now, the mappings are fairly inconsistent and incomplete.  So we don’t recommend that the mappings be used, and we don’t intend to change or “fix” the best fit behaviors.
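One way to avoid depending on best fit in .Net is to ask for an encoding with an explicit exception fallback, so unsupported characters fail loudly instead of being silently substituted.  The sketch below is my own illustration (the helper name is made up), and it uses "us-ascii" only because that encoding is available everywhere.  The same GetEncoding overload can be used with a real code page name where one is installed.

```csharp
using System;
using System.Text;

public static class StrictEncode
{
    // Hypothetical helper: encodes text, or returns false if any character
    // has no exact representation in the target encoding.
    public static bool TryEncode(string text, string encodingName, out byte[] bytes)
    {
        // Explicit fallbacks replace whatever default (best fit or '?')
        // the encoding would otherwise use.
        Encoding strict = Encoding.GetEncoding(
            encodingName,
            new EncoderExceptionFallback(),
            new DecoderExceptionFallback());
        try
        {
            bytes = strict.GetBytes(text);
            return true;
        }
        catch (EncoderFallbackException)
        {
            bytes = null;
            return false;
        }
    }

    public static void Main()
    {
        byte[] bytes;
        // Plain k, fullwidth k (U+FF4B), i with tilde (U+0129), infinity (U+221E)
        foreach (string s in new string[] { "k", "\uFF4B", "\u0129", "\u221E" })
        {
            bool ok = TryEncode(s, "us-ascii", out bytes);
            Console.WriteLine("U+{0:X4}: {1}", (int)s[0], ok ? "encoded" : "rejected");
        }
    }
}
```

Rejecting the character pushes the decision back to the caller, which is usually better than discovering later that ∞ quietly became 8.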

We don’t like to change other code page data either.  “Unassigned” code points can have arbitrary behavior or map to Unicode PUA code points.  Some applications use those code points (perhaps unwisely) as formatting codes or to cause special behavior.  Adding a mapping could break such an application.  Other applications or systems may provide a glyph for an unassigned code point that round trips; that might not be the designed intent, but changing the code point behavior could still break those applications or fonts.

Code page standards are also sometimes extended, modified, or corrected.  Changing the behavior however impacts all applications using that behavior and our experience is that such changes across the installed windows code base causes as much trouble as it solves.

So we like to keep the code page mappings stable.  My recommendations for code page use are:

  • Use Unicode unless explicitly required for some standard or protocol (and try to upgrade the standards or protocols to allow Unicode).
  • If you can’t use Unicode, explicitly specify the mapping that is used.  (Some applications or standards presume whatever the OS uses, i.e. the Windows ANSI code page, which causes serious interoperability problems.)
  • Avoid best fit mappings.  At best they cause spelling errors or offend customers.  At worst they can cause security problems.
  • Avoid unassigned code points.  Their behavior is undefined and could cause difficulty if different machines or software interpret them differently.
  • Use care when using the Unicode private use area (PUA).  Its use is private.  If data is persisted in the PUA, then there is a risk that future versions or other machines may not read the data correctly.  Eventually migration of data between different PUA mappings may become necessary, and migrating such data is rarely trivial.  The Hong Kong HKSCS mappings are an example of such a difficulty.
  • Don’t rely on illegal or undefined code page behavior.  Illegal sequences might change between versions or software.  Shift modes that aren’t implemented could be implemented on other machines, etc.
  • Don't presume that illegal or undefined code page behavior will remain stable.
  • Don’t pretend binary data is text in some code page (or Unicode).  Variations in code page mappings could then prevent the data from round tripping, particularly if the binary data happens to contain undefined or illegal code points.
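The last bullet is easy to demonstrate.  In this sketch (the names and the sample bytes are mine), a few arbitrary bytes are pushed through a UTF-8 decode and encode and do not come back the same, while a binary-to-text scheme such as Base64 round trips them exactly.

```csharp
using System;
using System.Text;

public static class BinaryAsText
{
    // Pretend the bytes are UTF-8 text, then convert back.
    public static bool RoundTripsAsUtf8(byte[] data)
    {
        byte[] back = Encoding.UTF8.GetBytes(Encoding.UTF8.GetString(data));
        return SameBytes(data, back);
    }

    // Use an encoding designed for carrying binary data in text.
    public static bool RoundTripsAsBase64(byte[] data)
    {
        byte[] back = Convert.FromBase64String(Convert.ToBase64String(data));
        return SameBytes(data, back);
    }

    static bool SameBytes(byte[] a, byte[] b)
    {
        if (a.Length != b.Length) return false;
        for (int i = 0; i < a.Length; i++)
        {
            if (a[i] != b[i]) return false;
        }
        return true;
    }

    public static void Main()
    {
        // Arbitrary binary data; 0xFF and 0xFE are never legal in UTF-8.
        byte[] data = { 0x00, 0xFF, 0xFE, 0x80 };
        Console.WriteLine("Survives UTF-8 round trip:  " + RoundTripsAsUtf8(data));
        Console.WriteLine("Survives Base64 round trip: " + RoundTripsAsBase64(data));
    }
}
```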

Hope that helps, more posts about common code page concerns are at http://blogs.msdn.com/shawnste/pages/code-pages-unicode-encodings.aspx

Shawn

What's a genitive month name anyway?

I’m not a linguistic expert, so I’ll probably get this a bit wrong, but basically a genitive month name is used when there’s a number next to the month name. This doesn’t happen in English, but I think of it sort of like instead of saying “1 April 2008”, using “1 of April 2008”, where the “of April” is the “genitive month name”.

For most cultures the “genitive month names” and the “month names” are identical, and in .Net 2.0 DateTimeFormatInfo the “genitive” forms are created lazily from the month names the first time they are needed.  Unfortunately, what this means is that you cannot reliably set the DateTimeFormatInfo.MonthNames property unless you also set the DateTimeFormatInfo.MonthGenitiveNames property.

The example code demonstrates this by changing the month names and not the genitive names, then changing those as well.  Its output is:

MMMM dd format: April 01   
System.String[]   
w/o Genitive:   April 01   
with Genitive:  prilAy 01   

using System;
using System.Globalization;

public class Test
{
     static void Main()
     {
         // Pick a date and show what it looks like normally
         DateTimeFormatInfo dtfi = new DateTimeFormatInfo();
         DateTime dt = new DateTime(2008, 4, 1, 12, 30, 15);
         Console.WriteLine("MMMM dd format: " + dt.ToString("M", dtfi));

         // Force the delayed creation of the genitive names to happen in some
         // versions of .Net (this prints the array's type name, System.String[])
         Console.WriteLine(dtfi.MonthGenitiveNames);

         // Try to make the month names pig Latin (hey, it's an example).
         // The array needs 13 entries; the last one stays empty for
         // 12-month calendars.
         dtfi.MonthNames = new string[]
             { "anuaryJay", "ebruaryJay", "archMay", "prilAy", "aMay", "uneJay",
               "ulyJay", "ugustJay", "eptemberSay", "ctoberOay", "ovemberNay", "ecemberDay", "" };

         // It doesn't work quite as expected: the day number in the pattern
         // makes the formatter use the (unchanged) genitive names
         Console.WriteLine("w/o Genitive:   " + dt.ToString("M", dtfi));

         // We also have to set the genitive names
         dtfi.MonthGenitiveNames = dtfi.MonthNames;
         Console.WriteLine("with Genitive:  " + dt.ToString("M", dtfi));
     }
}

"Windows Vista Command Line Configuration of International Settings" is on-line on MSDN

Generally it's a good idea to let the users figure out their international settings, but sometimes they need to be managed in a command-line manner.  Windows Vista Command Line Configuration of International Settings at http://www.microsoft.com/globaldev/vista/vista_tools/vista_command_line_international_configuration.mspx describes how to manage the international settings from the command line.