
 

A helpful reader pointed out I don't really know Klingon.

PS. I just checked out your blog (very nice by the way, lots of stuff I need to read) and I noticed along the top of the page you have jItlhInganbe' for "I'm not a Klingon". The translation you're looking for would be tlhIngan jIHbe' ("I am not a Klingon").

Thanks, fixed it :)  As I said, I'm not a Klingon, and had a terrible time finding something to work for "am" in The Klingon Dictionary or one of the on-line lessons I found.

Although that's kind of like one of those odd things that always strikes me as funny when traveling. 

Me: "Can you tell me where a good restaurant is?"
Native: "Sorry, I don't speak English". (Sometimes even without an accent)

Hmm.  I'll continue to have to rely on others for Klingon translations :)

- Shawn
 
 
Posted by shawnste | 0 Comments

Email Address Internationalization / Internationalized eMail Addresses (EAI/IMA)

With the IDN work for Internationalized Domain Names using characters beyond ASCII, it is only natural to tackle the problem of Internationalized Internet eMail.

Some smart people have been working on an IETF working group to figure out how non-ASCII email would work, and I encourage people to take a look: http://www.ietf.org/html.charters/eai-charter.html.  That page has the charter, a list of drafts and RFCs that have already been produced, and links to the IMA working group mailing list.

Assuming you're an ASCII/Latin character user, imagine having to type all your URLs in Chinese, or Cyrillic (or if you know those, imagine typing everything in Klingon).  In many cultures, that's what it's like to use the web.  Some users may not be literate in Latin letters, or may have to do a lot of hunt-n-pecking.  EAI should help address that problem.

How EAI/IMA Works

The basic idea of the EAI working group is to stick email in UTF-8 instead of ASCII.  UTF-8 works pretty well in many systems, and many mailers already handle 8 bit encodings, so this is a pretty "simple" solution.  Unfortunately email touches a lot of places, so there are a lot of protocols that need updates (eg: SMTP, POP, mailto:, etc.)  Additionally everyone knows that UTF-8 email can't happen instantly, so there needs to be a system for existing servers to talk to UTF-8 aware ones, which leads to a few more RFCs.

UTF8SMTP allows the servers to make decisions about the "local" part of the email address, which allows for groups to fit their own needs.  The backwards compatibility means that users also need ASCII addresses, as they do today.  The server would alias from one address to another so mail to @microsoft.com could map to my normal mailbox, and I'd only have one mail.  Unfortunately that simple concept means that places that didn't have to worry about aliasing before may now have to consider aliases and fallback addresses.  Contact lists may need to have both forms, etc.
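To make the split between the domain and the "local" part a bit more concrete, here's a small Python sketch.  The address is made up, the domain conversion uses Python's built-in "idna" codec (IDNA 2003), and the function name is my own; it only illustrates why the local part still needs an ASCII alias, and is not taken from the EAI drafts.

```python
def describe_address(address):
    """Split an address and report which parts fall outside ASCII.

    The domain has an ASCII-compatible (IDN) form, but the local part has no
    such mapping, which is why an ASCII fallback alias is still needed.
    """
    local, _, domain = address.rpartition("@")
    return {
        "local": local,
        "local_is_ascii": local.isascii(),
        # Python's built-in "idna" codec performs the IDNA 2003 ToASCII step.
        "ascii_domain": domain.encode("idna").decode("ascii"),
        # What a UTF-8 aware server would see on the wire.
        "utf8_bytes": address.encode("utf-8"),
    }


# Hypothetical address, for illustration only.
info = describe_address("j\u00fcrgen@b\u00fccher.example")
print(info["ascii_domain"])    # xn--bcher-kva.example
print(info["local_is_ascii"])  # False
```

The domain survives a trip through an ASCII-only system, but there's no equivalent trick for the local part, so the server has to know about the alias.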

Current Status of EAI/IMA

Currently there are several experimental RFCs, and several people have created interoperating systems that work with each other to demonstrate the feasibility of UTF8SMTP.   The next step is to move towards a standards track process, which could happen "reasonably quickly".  I'm optimistic that the standards will move quickly, but sometimes these things take a while.

So Who's Gonna Use It?

There are a lot of markets where ASCII doesn't work very well for various reasons.  Even when people have ASCII aliases, it may seem artificial, and there may be a desire for an email that reflects them or their country.  There are many ISPs in countries like Korea, China, & Japan that are very eager to be able to send email in a native script.  Some governments like Russia and China are weighing in on the importance of being able to send mail and use the Internet in their script. 

What's IMA Mean To Me As a Software Developer? (who cares?)

If you are a developer, then you may run into IMA addresses.  Even if your app doesn't explicitly deal with mail, there may be a place for email to sneak into your app.  For example, IDN and domain names don't really have much to do with Word or PowerPoint, yet they often show up in documents and presentations.  I could imagine an author address in metadata, such as a photographer contact in a photo's metadata.  Many apps probably will run into IMA addresses whether they realize it or not.

Anyway, I have been thinking about this space for a while and thought I'd share my observations.  It's worth considering what impact IMA will have on your application (while you're at it, how's IDN behave?)

 -Shawn

 

Writing "fields" of data to an encoded file.

The moral here is "Use Unicode," so you can skip the details below if you want :)

A common problem when storing string data in various fields is how to encode it.  Obviously you can store the Unicode as Unicode, which is a good choice for an XML file or text file.  However, sometimes data gets mixed with other non-string data or stored in a record, like a database record.  There are several ways to do that, but some common formats are delimited fields, fixed width fields, counted fields.  I'm going to ignore more robust protocols like XML for this problem.

A delimited format puts a character between fields to indicate that one field ended and another started.  Common delimiters are null (0), comma, and tab.  Using delimited fields, a list of names would look something like "Joe,Mary,Sally,Fred".

A fixed width field would be a field of a known size regardless of the input data size.  Generally data that is too short is padded with a space or null, and data that is too long is clipped.  If our "names" field was of fixed size four, then the previous list could look something like "Joe_MarySallFred".  Note the _ to pad the 3 character name, that Sally is clipped, and that the other names are "run together".

A counted field would indicate the field size for each piece of data before outputting the data.  The advantage is that it doesn't have the size restriction/clipping of fixed width fields, nor does it have to waste space with unnecessary padding.  (It could still be clipped for large strings as the count is likely restricted to some number of bits).  Similarly delimiters aren't a problem.  Generally the count is binary, but I'll show an example using numbers: "3Joe4Mary5Sally4Fred"
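The three layouts can be sketched in a few lines of Python.  The function names are mine, and the counted form uses a decimal text count only to match the example above (a real format would more likely use a binary count).

```python
def write_delimited(names, delimiter=","):
    # One delimiter character between fields.
    return delimiter.join(names)


def write_fixed(names, width=4, pad="_"):
    # Clip values that are too long, pad values that are too short.
    return "".join(name[:width].ljust(width, pad) for name in names)


def write_counted(names):
    # Each field is preceded by its length (as decimal text here).
    return "".join(f"{len(name)}{name}" for name in names)


names = ["Joe", "Mary", "Sally", "Fred"]
print(write_delimited(names))  # Joe,Mary,Sally,Fred
print(write_fixed(names))      # Joe_MarySallFred
print(write_counted(names))    # 3Joe4Mary5Sally4Fred
```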

A somewhat obvious way to store and read Unicode char or Unicode string data in the above formats is to write it in Unicode.  Counted fields can just count the Unicode code points to be read in.  Fixed width fields can similarly check for the space available and use Unicode character counts.   Delimited fields can also use Unicode.

When the desired output isn't Unicode (UTF-16) however, then you start running into some interesting problems.  Encodings (code pages) don't have a 1:1 relationship with UTF-16 code points, so you have to be careful.  Additionally some encodings shift modes and maintain state through shift or escape sequences.

For all of the fixed, counted, delimited techniques shift states cause an additional problem in that either the writer has to terminate the sequence, or persist the state until the next field.  Consider 2 fields where field 1 has some ASCII data that looks like "Joe" followed by shift sequence, then a Japanese character, and field 2 has "Kelly" in what looks like ASCII.  If the decoder retains the state between reading the 2 fields, it may accidentally read in "Kelly" as Japanese and presumably corrupt the output.  Alternatively if "Kelly" was really intended to read in "japanese" mode, then any application starting to read at field 2 gets confused since it didn't see the shift at the end of field 1. 

For that reason I like to make sure the fields are "complete", flushing the encoder at the end of each field (this is different than writing a pure-text document like XML).  So then field 1 above would have a shift-back-to-ASCII sequence at the end.
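Here's a rough Python illustration of the difference, using ISO-2022-JP as the stateful encoding.  Python's one-shot str.encode() flushes at the end of each call, while a shared incremental encoder keeps its shift state between fields; the field values and function names are just for the example.

```python
import codecs

SHIFT_TO_ASCII = b"\x1b(B"  # ISO-2022-JP escape sequence that returns to ASCII


def encode_fields_flushed(fields, encoding="iso2022_jp"):
    # str.encode() is a one-shot call, so every field ends back in ASCII mode.
    return [field.encode(encoding) for field in fields]


def encode_fields_unflushed(fields, encoding="iso2022_jp"):
    # One incremental encoder shared by all fields: its shift state leaks
    # from one field into the next because it is never flushed.
    encoder = codecs.getincrementalencoder(encoding)()
    return [encoder.encode(field) for field in fields]


fields = ["Joe\u3042", "Kelly"]  # U+3042 is HIRAGANA LETTER A

flushed = encode_fields_flushed(fields)
unflushed = encode_fields_unflushed(fields)

print(flushed)    # field 1 ends with the shift back to ASCII
print(unflushed)  # the shift back to ASCII lands at the start of field 2
```

With the flushed version each field can be decoded on its own.  With the unflushed version, field 2 only makes sense if the reader also saw field 1.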

For fixed fields this could introduce another problem because the shift-back-to-ASCII sequence may exceed the allowed field size.  In that case the string would have to be made smaller before encoding to allow enough room for flushing.

For delimited fields there's an additional problem in that the delimiter could accidentally look like part of an encoded sequence.  Delimiters should only be tested on the decoded data.
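A quick Python sketch of the delimiter problem, again assuming ISO-2022-JP: the second byte of the encoded HIRAGANA LETTER GA happens to be 0x2C, the same byte as an ASCII comma.  The record contents are made up for the example.

```python
ENCODING = "iso2022_jp"

# HIRAGANA LETTER GA, then a comma delimiter, then "Joe": two fields.
record = "\u304c,Joe"
encoded = record.encode(ENCODING)

# Wrong: splitting the raw bytes also splits inside the encoded character.
naive = encoded.split(b",")

# Better: decode first, then look for the delimiter in the decoded text.
correct = encoded.decode(ENCODING).split(",")

print(len(naive))    # 3 pieces, the first two are broken fragments
print(len(correct))  # 2 fields
```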

For counted fields you start having trouble if the count isn't in encoded bytes.  If you counted the Unicode code points and then encoded them, you don't know how many bytes to read back in when decoding.  It isn't possible to "just guess" when to stop reading data because there may or may not be some state changing data that you are expected to either ignore or read.  For example "Joe++" where ++ is a single Japanese character could look like:

4<shift-to-ascii>Joe<shift-to-Japanese><+><+>, or
4<shift-to-ascii>Joe<shift-to-Japanese><+><+><shift-to-ascii>, or
4<shift-to-ascii>Joe<shift-to-Japanese><+><+><shift-to-mode-q><shift-to-mode-z><shift-to-mode-x>

where "4" represents the count, <+> represents the encoded character, and <shift...> indicates some sort of state change that doesn't cause output directly by itself.

Since the application doesn't know whether to expect the trailing <shift> sequence(s), it may not read enough data, and then may try to use <shift-to-ascii> as the count of the next field.  Similarly if it does see a <shift-to-ascii> and tries to read it in, then maybe it'll be confused if that was actually the count of the next field that just happened to look like a mode change.
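One way around that, sketched in Python below, is to make the count the number of encoded bytes and to flush the encoder for every field.  The 16-bit big-endian count and the function names are assumptions for this example, not part of any particular protocol.

```python
import struct

ENCODING = "iso2022_jp"


def write_counted(fields, encoding=ENCODING):
    out = bytearray()
    for field in fields:
        data = field.encode(encoding)        # complete field, encoder flushed
        out += struct.pack(">H", len(data))  # count the encoded bytes
        out += data
    return bytes(out)


def read_counted(blob, encoding=ENCODING):
    fields, pos = [], 0
    while pos < len(blob):
        (size,) = struct.unpack_from(">H", blob, pos)
        pos += 2
        # The reader knows exactly how many bytes belong to this field,
        # including any trailing shift sequence.
        fields.append(blob[pos:pos + size].decode(encoding))
        pos += size
    return fields


blob = write_counted(["Joe\u3042", "Kelly"])
print(read_counted(blob))
```

Because the count covers the shift sequences too, the reader never has to guess whether a trailing shift belongs to this field or is the count of the next one.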

So the moral is: Use UTF-16 because that's what the strings look like so they're less likely to get shifty about their sizes. 

  • Use Unicode.  Either UTF-16, or maybe UTF-8.  UTF-8 can still change the size and you have to be careful, but at least each encoded sequence represents exactly one Unicode code point and there's no shift state.
  • If you must count, try to count the actual encoded data size, not the unencoded form since that'll be confusing when decoding.
  • Be good and flush your encoder if you must encode, so that the state gets back into a known state (usually ASCII) and then the decoding application doesn't get confused if they don't reset their decoder.
  • Make sure you say which encoding you used.

Of course you may be talking to a GPS or something where you don't get to define the standard.  In that case you can just watch out for these caveats.  Should you be designing such a protocol however, make sure to use Unicode.  If that cannot happen, at least make sure to pay attention to the impact of encoding and decoding the data when the protocol's used.

-Shawn

 

Locale Builder and Two Letter ISO name and Three Letter Windows Language Name

When you use the Microsoft Locale Builder tool to build a custom locale, it asks for a lot of fields.  Two may not be obvious:

The Two Letter ISO Language name is permitted to be 3 letters for locales that don't have a 2 letter code (eg: haw for Hawaiian). 

The Three Letter Windows Language Name is mostly used for in-box locales for things having to do with our build process, so you can pretty much pick anything.  Mostly I just use the ISO code, but note that Windows tries to keep this value unique.  Note:  Do NOT use this 3 letter windows code in your application, instead use the ISO standard codes.

 

Cheating to UNinstall Custom Cultures / Locales

In Cheating To Install Custom Cultures, I mentioned how to add the custom cultures without using CultureAndRegionInfoBuilder.Register().  Should you have any problems with a custom culture / locale and want to uninstall it but are having difficulty with an uninstaller or whatnot, this is how to get rid of it:

{Warning this edits system stuff and could mess up your computer if you aren't careful, or if the custom culture was required by some application.}

Warning, if your locale was installed with the Microsoft Locale Builder installer or another installer, you'll still have to run that uninstaller to make the system happy if you want to reinstall it that way.  In other words, don't use this if it came through an installer.

1) Run intl.cpl (Regional and Language Options) and change to some other locale.

2) Get rid of the custom culture file.

a) Open an elevated command window (eg: press the Windows key, then type "cmd", then CTRL+SHIFT+ENTER, or right click on cmd and choose "Run as Administrator")

b) "dir %windir%\globalization\*.nlp" to see installed custom locales

c) "rename %windir%\globalization\fj-FJ.nlp fj-FJ.disabled" to disable a custom culture named "fj-FJ" (Fijian (Fiji)).

d) After rebooting, if desirable you can then "del %windir%\globalization\fj-FJ.disabled".  Often you can't just delete the .nlp file at first because it may be in use.

3) You can also clean the registry key, though this isn't necessary: the registry entry won't work if it can't find the file:

a) Run regedit (warning: improper use can mess up your computer, etc.)

b) Expand all the + arrow thingies to get to HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CustomLocale (each \ is a new level)

c) Select the value (eg: fj-FJ) you want to delete, then press the delete key.
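If you'd rather script steps 2b and 2c, something like the following Python sketch would do the same listing and rename.  The function names are mine, the same warnings apply, and it would still need to run from an elevated prompt to touch the real folder.

```python
import os
from pathlib import Path


def globalization_dir():
    # %windir%\Globalization is where the custom locale .nlp files live.
    return Path(os.environ.get("windir", r"C:\Windows")) / "Globalization"


def list_custom_locales(folder):
    # Same idea as "dir %windir%\globalization\*.nlp".
    return sorted(path.stem for path in Path(folder).glob("*.nlp"))


def disable_custom_locale(folder, name):
    # Same idea as "rename ...\fj-FJ.nlp fj-FJ.disabled".
    source = Path(folder) / f"{name}.nlp"
    target = source.with_suffix(".disabled")
    source.rename(target)
    return target


# Example (not run here):
#   print(list_custom_locales(globalization_dir()))
#   disable_custom_locale(globalization_dir(), "fj-FJ")
```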

Hope someone finds this helpful.

Shawn

 

 

A Pet Peeve of Mine

One of my pet peeves is software that is too restrictive about installing.  The #1 compatibility thing I find is applications that refuse to install on a newer OS for no good reason.  Generally if you can get them to install OK then they run OK.

I feel better now,

Shawn


Don't use MB_COMPOSITE, MB_PRECOMPOSED or WC_COMPOSITECHECK

This pretty much demonstrates another reason to Use Unicode, but if you do need to use some non-Unicode encoding until you can convert to Unicode, please don't use these flags. 

MultiByteToWideChar() and WideCharToMultiByte() provide some interesting sounding flags that are actually useless, slow, badly broken, or far worse.  All of these flags would be expected to behave like Unicode Normalization, so you should instead use NormalizeString() to handle the desired behavior, either Form C for composed strings or Form D for decomposed strings.
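To show what "normalize first, then convert" looks like, here's a Python sketch.  Python's unicodedata.normalize() stands in for NormalizeString() here, and the to_code_page() helper is a made-up name; the point is only that normalization is done as its own explicit step rather than through a conversion flag.

```python
import unicodedata

composed = "M\u00fcnchen"  # Form C: U+00FC, u with diaeresis

decomposed = unicodedata.normalize("NFD", composed)    # u followed by U+0308
recomposed = unicodedata.normalize("NFC", decomposed)  # back to U+00FC


def to_code_page(text, code_page="cp1252"):
    # Compose explicitly (Form C), then convert to the code page.
    return unicodedata.normalize("NFC", text).encode(code_page)


print(len(composed), len(decomposed))  # 7 8
print(to_code_page(decomposed))        # b'M\xfcnchen'
```

Without the Form C step the decomposed string can't be converted at all, since windows-1252 has no combining diaeresis.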

MB_PRECOMPOSED is the simplest to address:  Basically this flag doesn't really do anything.  Nominally it would put data into something like Normalization Form C, however most code pages are already in a composed form, so there's little real impact.  Just to make sure, the flag's ignored internally :)

MB_COMPOSITE is my most hated of these flags.  First of all, it nominally pretends to put the data into something like Normalization Form D, decomposed into a base character and combining characters.  To me that's the opposite of "Composite".  Indeed, I've seen numerous code examples that seem to be passing MB_COMPOSITE expecting Form C data, and pretty much zero examples expecting Form D data.  Windows leans towards Form C internally (though you may use Form D or mixed data), so this flag probably isn't that helpful.  If you really want to decompose your data, then use NormalizeString with Form D instead of this flag.

MB_COMPOSITE also is very slow because it does a lookup in some data tables.  NormalizeString with Form D is probably faster.

MB_COMPOSITE also has some horrible behavior for many code points:

  • Several code points will not round trip if this flag is set, even if WC_COMPOSITECHECK is used when converting back to the code page.
  • Additionally its data tables are incomplete and inconsistent with Unicode normalization.
  • Worse, some characters are decomposed into nonsensical sequences.
  • Lastly some sequences decompose to strange choices, breaking some text.  Japanese is particularly impacted.

WC_COMPOSITECHECK basically has all of the problems of MB_COMPOSITE (it's used in the other direction).  Its name isn't as annoying to me though.  Nominally WC_COMPOSITECHECK puts the data into Normalization Form C before encoding.  Since most code pages are in a composed form, Normalization Form C isn't a bad idea; however, please use NormalizeString with Form C instead of this flag.

WC_COMPOSITECHECK is also very slow because of the way it does lookup.  NormalizeString with Form C is probably faster.

WC_COMPOSITECHECK also has horrible behavior for many code points:

  • It will convert sensible sequences into a form that, when round tripped by MB_COMPOSITE will end up in nonsensical forms.
  • Sequences of 3 code points created by MB_COMPOSITE aren't correctly decoded by WC_COMPOSITECHECK back into their single code point form, resulting in extra ? when round tripping data.
  • Several sequences map to a single code point, which MB_COMPOSITE will map back to a single form, so they won't round trip.  If you really need similar behavior try Normalization Form C, or KC if you really need the multiple mappings.  KC causes data to not round trip, so it might not be appropriate for all applications.  (Of course converting to the code page will also likely cause data to be lost so that may not matter so much).
  • Again some sequences are composed in a strange form based on appearance rather than linguistics.  This could cause some unexpected behavior.
  • Some scripts, like Japanese, are particularly impacted.

Hopefully I've terrified you and you'll stop using these flags, perhaps using NormalizeString() if you really need similar behavior.  Most applications don't even really need that though.  Of course you always have the option of Using Unicode!

'til next time,
Shawn

 

Front page uses windows-1252, shouldn't it be iso-8859-1?

I received this question:

I use Frontpage for my webpage design and FP automatically inserts the meta tag "<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">".
 
Should I have reference to ISO-8859-1 ?

I'm not a FrontPage expert, and I can't answer all questions like this, however this is a common confusion.  Windows-1252 is very similar to ISO-8859-1, but they aren't identical.  Web sites and browsers have historically often treated these as equivalent, but they aren't, which is a great reason to use Unicode for your encoding.  (No, I don't know how to make FrontPage use UTF-8, but that'd be the best solution).  Looking on search.live.com (of course) for iso-8859-1 and windows-1252 will find some discussion of the differences.  Wikipedia has some articles (they change so I won't quote them directly, but their encoding related articles are usually informative and often accurate.)
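If you want to see the difference for yourself, this Python sketch compares the two encodings byte by byte using Python's own codec tables (which may differ in small details from what a given browser or Windows version does).

```python
def differing_bytes():
    """Return the byte values that the two encodings decode differently."""
    result = []
    for value in range(256):
        raw = bytes([value])
        latin1 = raw.decode("iso-8859-1")
        try:
            cp1252 = raw.decode("windows-1252")
        except UnicodeDecodeError:
            cp1252 = None  # unassigned in Python's windows-1252 table
        if cp1252 != latin1:
            result.append(value)
    return result


diff = differing_bytes()
print(hex(diff[0]), hex(diff[-1]), len(diff))  # 0x80 0x9f 32

# The Euro sign is the classic example: 0x80 in windows-1252,
# but a C1 control character in ISO-8859-1.
print(b"\x80".decode("windows-1252") == "\u20ac")  # True
```

Everything outside 0x80 through 0x9F is the same, which is why the two are so easy to confuse.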

 

 

Changing the currency symbol (Euro, etc) in Windows XP & Vista & Server

Countries sometimes change which currency symbol they're using.  This is most obvious for countries using the Euro (Wikipedia currently lists: Austria, Belgium, Cyprus, Finland, France, Germany, Greece, Ireland, Italy, Luxembourg, Malta, Netherlands, Portugal, Slovenia, Spain, Mayotte, Monaco, Saint Pierre and Miquelon, San Marino, Vatican City, Akrotiri and Dhekelia, Andorra, Kosovo, Montenegro, Saint Barthélemy, and Saint Martin).

Other countries have changed their currency symbol as well, either because of a political shift, currency devaluation or other causes.  In the future Slovakia, Lithuania, Estonia, Bulgaria, Czech Republic, Hungary, Latvia, Poland and Romania are expected to adopt the Euro.

So what happens if you use a locale that changes currency?  How do you get that set as your currency symbol in Windows or .Net?

The easiest solution is to use the Regional Options control panel (Windows Key + r then type intl.cpl and OK opens the control panel, or you can select it from the control panel).  From intl.cpl you can use the advanced settings to change the format of the currency symbol and enter the Euro symbol.  This change only impacts the current locale for the current user though, and has to be reset for each user or if the user changes their locale to something else and then back.

Another option for Vista, Server 2008 & .Net 2.0+ is to create a custom culture with the desired symbol.  Ironically anyone with the locale already set isn't going to see the update because the old symbol is set in their currency.  The advantage of the custom locale solution is that it provides the ability to update the currency name (and other data) as well as the symbol, and that it persists and impacts non-current locales.

So a complete solution is probably to create a custom locale and to also change the user override.  I have a link to the custom locale tool at http://blogs.msdn.com/shawnste/pages/custom-cultures-vista-custom-locales.aspx or directly from http://www.microsoft.com/downloads/details.aspx?FamilyID=e4588c5e-8f21-45cc-b862-38df8d9bd528&displaylang=en or search for "Locale Builder" on msdn.

Our Unicode Globalization Windows Language Support Presentation

Recently Peter Constable and I gave a presentation on Windows Language Support, which I've attached here.

Poornima Priyadarshini and I also gave a presentation on Globalization in Silverlight, which I've attached in the previous post.

Ironically this is too big to attach as PDF, and the previous post is too big to attach as a pptx.

Our Unicode & Globalization Silverlight Presentation

Recently Peter Constable and I gave a presentation on Windows Language Support, which I've attached in the next post.

Poornima Priyadarshini and I also gave a presentation on Globalization in Silverlight, which I've attached here.


Attachment(s): Globalization and Silverlight.pdf

How come Substring(0, xxx) matches something, but StartsWith returns false?

I was asked how a string can match a substring of another string, yet StartsWith can return false?  For example:

 

string str = "Mu\x0308nchen";
string find = "Mu";
Console.WriteLine("Substring: " + (str.Substring(0,2) == find));
Console.WriteLine("StartsWith:" + str.StartsWith(find));
Console.WriteLine("IndexOf:   " + str.IndexOf(find));

 

returns this:

 

Substring: True
StartsWith:False
IndexOf:   -1

 

So if you test the first 2 characters with the search string, you'll see that they match, yet StartsWith() returns false, and IndexOf can't find it.  This is because the 0308 diacritic is considered part of the u that it is modifying, so it won't be found.  In many languages diacritics like this are really different letters.  Since you don't expect a == z, then you wouldn't expect u == ü. 

 

Doing the substring effectively "breaks" the character, changing its meaning.  Substring can even create illegal Unicode if it chops off part of a surrogate pair (eg: U+D800, U+DC00).
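The real linguistic comparison in .NET and Windows is much more involved, but the basic idea of not matching in the middle of a character can be approximated in Python.  This is only a sketch with a made-up helper name: it treats a combining mark right after the match as part of the preceding letter, and nothing more.

```python
import unicodedata


def starts_with_whole_characters(text, prefix):
    # Plain code point comparison first.
    if not text.startswith(prefix):
        return False
    rest = text[len(prefix):]
    # A combining mark right after the prefix belongs to the prefix's last
    # letter, so the prefix ended in the middle of a character.
    return not (rest and unicodedata.combining(rest[0]))


print(starts_with_whole_characters("Mu\u0308nchen", "Mu"))        # False
print(starts_with_whole_characters("Munchen", "Mu"))              # True
print(starts_with_whole_characters("Mu\u0308nchen", "Mu\u0308"))  # True
```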

 

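A similar oddity would be characters with no weight like U+FFFD.  So if I have str = "A\xFFFD\xFFFD\xFFFD", then str.Substring(0,1), str.Substring(0,2), str.Substring(0,3) and str.Substring(0,4) all compare as equal to "A" in a linguistic (culture-sensitive) comparison such as String.Compare(), even though the ordinal == operator sees them as different.  And in this case str.StartsWith("A") would be true.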

 

Another perhaps unexpected behavior would be unweighted characters (or characters ignored by a flag) at the beginning of the string.  So if str = "\xFFFD" + "A", then str.IndexOf("A") can return 1, yet str.StartsWith("A") will return true (even though IndexOf didn't return 0).

 

Similar behaviors can be seen with LastIndexOf() and EndsWith(), and with the native Vista API FindNlsString and its variations.  In addition with the FindNlsString() API, the found substrings may be unexpected.

 


World Wide Telescope is really cool

This isn't really related to anything I talk about, but I thought that Microsoft Research's World Wide Telescope is pretty fun:  http://worldwidetelescope.org.

This is a free program you can install and then it lets you zoom in on celestial objects like a virtual planetarium.  It zooms pretty well and expands to lots of hubble and other photos, so you can drill down pretty far on parts of the sky.


Silverlight Time Zone World Clock (Very Beta) Demo

For my presentation of globalization of Silverlight at the Unicode Conference I wanted to make a quick Silverlight demo application that would show at least a little bit of globalization and not be too hard to write.  My first choice was to find an existing app, and thought I was close when I found a pretty application, but it was always stuck in English and didn't respect the user settings :(.

Then I thought about making a world clock in Silverlight.  I knew the Olson tz database would provide the data, but I needed a map, so I did a live search for some maps.  Most seemed out of date, I didn't know if I could use them, and I'd have to map latitude/longitude to the image.  I sort of had a "duh" moment when I found VIEWS at http://www.codeplex.com/views.  VIEWS is a Silverlight wrapper for the Virtual Earth control.  Virtual Earth (http://www.microsoft.com/virtualearth/) is really cool but, better yet, gives me latitude & longitude when you click.  Serious overkill for a world clock, but oh, well.

It took me about an hour to figure out how to make a silverlight app that used VIEWS.  Ironically this is the first time I've used the Visual Studio IDE to make a silverlight app.  Most of the silverlight code I write is low-level, so I use a console based test tool and don't make "real" silverlight apps normally.  After getting the flashy stuff done really quickly it took me a bit more effort to get the timezone database into a format I could read and use in the application.

My demo works for the most part, but has some serious bugs.  I didn't worry about getting the daylight saving transitions to behave, so the demo can be off by an hour for a few weeks around the transition times (I only enabled checking the month, not the day rules).  Also the tz database only has cities, not boundaries, so it can be hard to find the right data point.  I added Seattle by hand so that it wouldn't show Vancouver, BC when I did the demo, but many places can be a bit unexpected.  Clicking on Disney World in Florida (I just got back from vacation) will happily show you times for Havana, which probably isn't expected.  You have to go all the way "up" to New York to get Eastern Time.
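The Havana surprise is what you'd expect from any nearest-city lookup.  Here's a Python sketch of that idea; it isn't SilverTime's actual code, and the two-city table and the coordinates are rough values I picked just for illustration.

```python
import math

# Hypothetical sample data: the tz database has one reference city per zone.
CITIES = {
    "America/New_York": (40.71, -74.01),
    "America/Havana": (23.13, -82.38),
}


def distance_km(a, b):
    # Great-circle (haversine) distance between two (latitude, longitude) points.
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def nearest_zone(point, cities=CITIES):
    # With only city points and no boundaries, the closest city wins.
    return min(cities, key=lambda name: distance_km(point, cities[name]))


disney_world = (28.39, -81.56)  # approximate
print(nearest_zone(disney_world))  # America/Havana
```

Fixing that for real needs time zone boundaries, or at least more data points, rather than a cleverer distance function.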

I called the demo "SilverTime" and stuck it on CodePlex at http://www.codeplex.com/SilverTime. It's kind of cool, so I'm hoping that other people will participate in the open project and fix some of the bugs or extend its features.  There's some interesting potential in the app, and my bugs, although serious, aren't really that hard to fix.  (I was just running out of time before my vacation :-)

Have Fun,

Shawn

 
