<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="http://blogs.msdn.com/utility/FeedStylesheets/rss.xsl" media="screen"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:wfw="http://wellformedweb.org/CommentAPI/"><channel><title>I'm not a Klingon (&lt;span style="font-family:pIqaD,code2000"&gt; &lt;/span&gt;) : System.Text</title><link>http://blogs.msdn.com/shawnste/archive/tags/System.Text/default.aspx</link><description>Tags: System.Text</description><dc:language>en-US</dc:language><generator>CommunityServer 2.1 SP1 (Build: 61025.2)</generator><item><title>What is Title Case?</title><link>http://blogs.msdn.com/shawnste/archive/2009/08/18/what-is-title-case.aspx</link><pubDate>Tue, 18 Aug 2009 20:30:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9874351</guid><dc:creator>shawnste</dc:creator><slash:comments>2</slash:comments><comments>http://blogs.msdn.com/shawnste/comments/9874351.aspx</comments><wfw:commentRss>http://blogs.msdn.com/shawnste/commentrss.aspx?PostID=9874351</wfw:commentRss><description>&lt;P&gt;Disclaimer: I'm not an English teacher (that's my mom), so I'm sure my description of title casing in English probably has exceptions/variations.&lt;/P&gt;
&lt;P&gt;Title casing has an interesting history in computer programming.&amp;nbsp; Programmers like to use CamelCase to make variable names more readable, and, particularly amongst developers native to some languages, there's an idea that title casing is interesting, such as in TextInfo.ToTitleCase(), and in Windows 7, LCMapString(LCMAP_TITLECASE).&amp;nbsp; Most title casing algorithms are linguistically bad, even in English.&amp;nbsp; For other languages it's worse.&lt;/P&gt;
&lt;P&gt;ToTitleCase() takes a very simple approach to title casing.&amp;nbsp; Maybe in the future it'll be smarter, but for now it just uppercases the first letter in a group of letters, and tries to pay attention to non-letters and word breaks.&amp;nbsp; It also tries to keep acronyms all upper-case.&lt;/P&gt;
&lt;P&gt;Even in English this is a simplistic approach.&amp;nbsp; The title of this post is "What is Title Case?"&amp;nbsp; The "is" is supposed to be lower case, but ToTitleCase() would mess it up.&amp;nbsp; Additionally, unexpected word breaks or punctuation could trick the algorithm.&amp;nbsp; Even the acronym test isn't complete, since it just expects all-upper case&amp;nbsp;and sometimes acronyms keep the lower case of the full title.&amp;nbsp; Also it messes up names like DiSilva or McConnell.&amp;nbsp; Contractions can also be messed up.&lt;/P&gt;
&lt;P&gt;Outside of English, ToTitleCase() rapidly gets silly.&amp;nbsp; In English we&amp;nbsp;capitalize everything except articles, short prepositions and some other short words.&amp;nbsp;&amp;nbsp;In German it's just like a normal sentence, with only nouns getting capitalized, so the slightly over-eager English capitalization behavior becomes very over-eager.&amp;nbsp; Other languages also can have letters before the main word, e.g. l'État, so the ToTitleCase rules can mess&amp;nbsp;up those words as well.&lt;/P&gt;
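&lt;P&gt;As a rough illustration of the failure modes above, here is a minimal sketch using Python's built-in str.title().&amp;nbsp; This is an assumption-laden stand-in, not the .NET implementation (for one thing it makes no attempt to preserve acronyms), but it is another "uppercase the first letter of each run of letters" routine, so it stumbles on the same kinds of input.&lt;/P&gt;
&lt;PRE&gt;
```python
# Python's str.title() is a naive title-casing routine, used here only as a
# stand-in; it is not how ToTitleCase() is implemented.
samples = [
    "what is title case?",   # small words get capitalized too
    "McConnell",             # interior capitals are lost
    "DiSilva",
    "they're",               # the apostrophe is treated as a word break
    "l'\u00e9tat",           # the leading article is capitalized as well
]

results = {text: text.title() for text in samples}

for text, titled in results.items():
    print(f"{text!r} becomes {titled!r}")
```
&lt;/PRE&gt;
&lt;P&gt;Every sample comes back different from what a human editor would write, which is the point: a letter-run rule has no idea what the words mean.&lt;/P&gt;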
&lt;P&gt;And then there're scripts/languages that don't even have an upper/lower case distinction, so&amp;nbsp;ToTitleCase gets pointless.&lt;/P&gt;
&lt;P&gt;Anyway, use care when using ToTitleCase().&amp;nbsp; It might work&amp;nbsp;in some cases, but don't expect it to work&amp;nbsp;linguistically, particularly&amp;nbsp;globally, particularly in non-English cases.&amp;nbsp; Also maybe we'll get smarter and figure out a more correct way to do it in the future.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;-Shawn&amp;nbsp;&lt;/P&gt;</description><category domain="http://blogs.msdn.com/shawnste/archive/tags/System.Text/default.aspx">System.Text</category><category domain="http://blogs.msdn.com/shawnste/archive/tags/Custom+Cultures+_2F00_+Locales+_2F00_+CultureInfo/default.aspx">Custom Cultures / Locales / CultureInfo</category><category domain="http://blogs.msdn.com/shawnste/archive/tags/sorting/default.aspx">sorting</category></item><item><title>Writing "fields" of data to an encoded file.</title><link>http://blogs.msdn.com/shawnste/archive/2009/06/01/writing-fields-of-data-to-an-encoded-file.aspx</link><pubDate>Tue, 02 Jun 2009 03:38:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:9683075</guid><dc:creator>shawnste</dc:creator><slash:comments>2</slash:comments><comments>http://blogs.msdn.com/shawnste/comments/9683075.aspx</comments><wfw:commentRss>http://blogs.msdn.com/shawnste/commentrss.aspx?PostID=9683075</wfw:commentRss><description>&lt;P&gt;The moral here is "Use Unicode," so you can skip the details below if you want :)&lt;/P&gt;
&lt;P&gt;A common problem when storing string data in various fields is how to encode it.&amp;nbsp; Obviously you can store the Unicode as Unicode, which is a good choice for an XML file or text file.&amp;nbsp; However, sometimes data gets mixed with other non-string data or stored in a record, like a database record.&amp;nbsp; There are several ways to do that, but some common formats are delimited fields, fixed width fields, counted fields.&amp;nbsp; I'm going to ignore more robust protocols like XML for this problem.&lt;/P&gt;
&lt;P&gt;Delimited fields use a character between fields to indicate that one field has ended and another has started.&amp;nbsp; Common delimiters are null (0), comma, and tab.&amp;nbsp; Using delimited fields, a list of names would look something like "Joe,Mary,Sally,Fred".&lt;/P&gt;
&lt;P&gt;A fixed width field would be a field of a known size regardless of the input data size.&amp;nbsp; Generally data that is too short is padded with a space or null, and data that is too long is clipped.&amp;nbsp; If our "names" field was of fixed size four, then the previous list could look something like "Joe_MarySallFred".&amp;nbsp; Note the _ to pad the 3 character name, that Sally is clipped, and that the other names are "run together".&lt;/P&gt;
&lt;P&gt;A counted field would indicate the field size for each piece of data before outputting the data.&amp;nbsp; The advantage is that it doesn't have the size restriction/clipping of&amp;nbsp;fixed width fields, nor does it have to waste space with unnecessary padding.&amp;nbsp; (It could still be clipped for&amp;nbsp;large strings as the&amp;nbsp;count is likely restricted to some number of bits).&amp;nbsp; Similarly delimiters aren't a problem.&amp;nbsp;&amp;nbsp;Generally the count is binary, but I'll show an example using numbers: "3Joe4Mary5Sally4Fred".&lt;/P&gt;
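&lt;P&gt;The three layouts described above can be sketched in a few lines of Python.&amp;nbsp; The variable names and the choice of "_" as the pad character are only for illustration, and the count is written as a decimal digit to match the example strings rather than as a binary value.&lt;/P&gt;
&lt;PRE&gt;
```python
names = ["Joe", "Mary", "Sally", "Fred"]

# Delimited: a comma marks the boundary between fields.
delimited = ",".join(names)

# Fixed width: pad short names with "_" and clip long ones to 4 characters.
WIDTH = 4
fixed = "".join(name[:WIDTH].ljust(WIDTH, "_") for name in names)

# Counted: each field is preceded by its length (shown as a decimal digit
# here; a real format would more likely use a binary count).
counted = "".join(f"{len(name)}{name}" for name in names)

print(delimited)  # Joe,Mary,Sally,Fred
print(fixed)      # Joe_MarySallFred
print(counted)    # 3Joe4Mary5Sally4Fred
```
&lt;/PRE&gt;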
&lt;P&gt;A somewhat&amp;nbsp;obvious way to store and read Unicode char or Unicode string data in the above formats is to write it in Unicode.&amp;nbsp; Counted fields can just count the Unicode code points to be read in.&amp;nbsp; Fixed width fields can similarly check for the space available and use Unicode character counts.&amp;nbsp;&amp;nbsp; Delimited fields can also use Unicode.&lt;/P&gt;
&lt;P&gt;When the desired output isn't Unicode (UTF-16)&amp;nbsp;however, then you start running into some interesting problems.&amp;nbsp; Encodings (code pages) don't have a 1:1 relationship with UTF-16 code points, so you have to be careful.&amp;nbsp; Additionally some encodings shift modes and maintain state through shift or escape sequences.&lt;/P&gt;
&lt;P&gt;For&amp;nbsp;all of the fixed, counted, and delimited techniques, shift states cause an additional problem&amp;nbsp;in that the writer has to either terminate the sequence or persist the state until the next field.&amp;nbsp; Consider 2 fields where field 1 has some ASCII data that looks like "Joe" followed by a shift&amp;nbsp;sequence, then&amp;nbsp;a Japanese character, and field 2 has "Kelly" in what looks like ASCII.&amp;nbsp; If the decoder retains the state between reading the 2 fields, it may accidentally read in "Kelly" as Japanese and presumably corrupt the output.&amp;nbsp; Alternatively if "Kelly" was really intended to be read in "Japanese" mode, then any application starting to read at field 2 gets confused since it didn't see the shift at the end of field 1.&lt;/P&gt;
&lt;P&gt;For that reason I like to make sure the fields are "complete", flushing the encoder at the end of each field (this is different than writing a pure-text document like XML).&amp;nbsp; So then field 1 above would have a shift-back-to-ASCII sequence at the end.&lt;/P&gt;
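&lt;P&gt;Here is a small sketch of the "Joe" / "Kelly" scenario using Python's incremental codecs.&amp;nbsp; ISO-2022-JP is my assumption for the stateful encoding (the discussion above doesn't depend on a specific one), and the variable names are made up for the example.&amp;nbsp; It shows that an unflushed field leaves the stream in Japanese mode, and that flushing appends the shift-back-to-ASCII sequence.&lt;/P&gt;
&lt;PRE&gt;
```python
import codecs

# ISO-2022-JP is used as an example of a stateful encoding; the post does not
# name a specific one.
field1 = "Joe\u3042"  # "Joe" followed by HIRAGANA LETTER A
shift_to_ascii = b"\x1b(B"

# Encode field 1 without flushing: the output ends in Japanese mode.
encoder = codecs.getincrementalencoder("iso2022_jp")()
unflushed = encoder.encode(field1)

# Flushing (final=True) appends the shift-back-to-ASCII sequence.
flushed = unflushed + encoder.encode("", final=True)

# A decoder that keeps its state misreads an ASCII-looking second field
# if field 1 was not flushed.
decoder = codecs.getincrementaldecoder("iso2022_jp")(errors="replace")
decoder.decode(unflushed)
misread = decoder.decode(b"Kelly")

# After a flushed field 1 the same bytes decode as intended.
decoder = codecs.getincrementaldecoder("iso2022_jp")(errors="replace")
decoder.decode(flushed)
intended = decoder.decode(b"Kelly")

print(unflushed.endswith(shift_to_ascii), flushed.endswith(shift_to_ascii))
print(repr(misread), repr(intended))
```
&lt;/PRE&gt;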
&lt;P&gt;For fixed fields this could introduce another problem because the shift-back-to-ASCII sequence may exceed the allowed field size.&amp;nbsp; In that case the string would have to be made smaller before encoding to allow enough room for flushing.&lt;/P&gt;
&lt;P&gt;For delimited fields there's an additional problem in that the delimiter could accidentally look like part of an encoded sequence.&amp;nbsp; Delimiters should only be tested on the decoded data.&lt;/P&gt;
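&lt;P&gt;To make the delimiter problem concrete, here is a sketch that again assumes ISO-2022-JP, picked because one of its two-byte sequences happens to contain the byte value of an ASCII comma.&amp;nbsp; Splitting the raw bytes cuts a character in half; decoding first and then splitting gives the fields back.&lt;/P&gt;
&lt;PRE&gt;
```python
# ISO-2022-JP is an assumption here, chosen because one of its two-byte
# sequences happens to contain the byte value of an ASCII comma.
fields = ["\u304c", "Joe"]  # HIRAGANA LETTER GA, then an ASCII name

# Each field is encoded (and flushed) on its own, then joined with a comma.
encoded = b",".join(field.encode("iso2022_jp") for field in fields)

# Wrong: splitting the raw bytes finds a "comma" inside the first field.
split_on_bytes = encoded.split(b",")

# Better: decode first, then look for the delimiter in the decoded text.
split_on_text = encoded.decode("iso2022_jp").split(",")

print(len(split_on_bytes))      # 3 pieces instead of 2
print(split_on_text == fields)  # True
```
&lt;/PRE&gt;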
&lt;P&gt;For counted fields you start having trouble if the count isn't in encoded bytes.&amp;nbsp; If you count the Unicode code points and then encode those code points, you don't know how many bytes to read back in when decoding.&amp;nbsp; It isn't possible to "just guess" when to stop reading data because there may or may not be some state-changing data that you are expected to either ignore or read.&amp;nbsp; For example "Joe++" where ++ is a Japanese character could look like:&lt;/P&gt;
&lt;P&gt;4&amp;lt;shift-to-ascii&amp;gt;Joe&amp;lt;shift-to-Japanese&amp;gt;&amp;lt;+&amp;gt;&amp;lt;+&amp;gt;, or&lt;BR&gt;4&amp;lt;shift-to-ascii&amp;gt;Joe&amp;lt;shift-to-Japanese&amp;gt;&amp;lt;+&amp;gt;&amp;lt;+&amp;gt;&amp;lt;shift-to-ascii&amp;gt;, or&lt;BR&gt;4&amp;lt;shift-to-ascii&amp;gt;Joe&amp;lt;shift-to-Japanese&amp;gt;&amp;lt;+&amp;gt;&amp;lt;+&amp;gt;&amp;lt;shift-to-mode-q&amp;gt;&amp;lt;shift-to-mode-z&amp;gt;&amp;lt;shift-to-mode-x&amp;gt;&lt;/P&gt;
&lt;P&gt;where "4" represents the count, &amp;lt;+&amp;gt; represents the encoded character, and &amp;lt;shift...&amp;gt; indicates some sort of state change that doesn't cause output directly by itself.&lt;/P&gt;
&lt;P&gt;Since the application doesn't know whether to expect the trailing &amp;lt;shift&amp;gt; sequence(s), it may not read enough data, and then may try to use &amp;lt;shift-to-ascii&amp;gt; as the count of the next field.&amp;nbsp; Similarly if it does see a &amp;lt;shift-to-ascii&amp;gt; and tries to read it in, then maybe it'll be confused if that was actually the count of the next field that just happened to look like a mode change.&lt;/P&gt;
&lt;P&gt;So the moral is: Use UTF-16 because that's what the strings look like so they're less likely to get shifty about their sizes.&amp;nbsp; &lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Use Unicode.&amp;nbsp; Either UTF-16, or maybe use UTF-8, though it still can change size and you have to be careful, but at least each encoded sequence represents a single Unicode code point.&amp;nbsp;&lt;/LI&gt;
&lt;LI&gt;If you must count, try to count the actual encoded data size, not the unencoded form since that'll be confusing when decoding.&lt;/LI&gt;
&lt;LI&gt;Be good and flush your encoder if you must encode, so that the state gets back into a known state (usually ASCII) and then the decoding application doesn't get confused if they don't reset their decoder.&lt;/LI&gt;
&lt;LI&gt;Make sure you say which encoding you used.&lt;/LI&gt;&lt;/UL&gt;
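&lt;P&gt;A minimal sketch of the "count the encoded size and flush each field" advice, in Python.&amp;nbsp; The function names, the 2-byte big-endian count, and the default of UTF-8 are all assumptions made for the example; they aren't part of any particular protocol.&lt;/P&gt;
&lt;PRE&gt;
```python
import struct

def write_fields(values, encoding="utf-8"):
    """Prefix each field with the size of its encoded bytes (2-byte count)."""
    out = bytearray()
    for value in values:
        # A one-shot encode is complete: any shift state is flushed per field.
        data = value.encode(encoding)
        # "!H" is an unsigned 16-bit count; longer fields raise struct.error.
        out += struct.pack("!H", len(data))
        out += data
    return bytes(out)

def read_fields(blob, encoding="utf-8"):
    """Read back fields written by write_fields with the same encoding."""
    values, pos = [], 0
    while pos != len(blob):
        (size,) = struct.unpack_from("!H", blob, pos)
        pos += 2
        values.append(blob[pos:pos + size].decode(encoding))
        pos += size
    return values

names = ["Joe", "Zo\u00eb", "\u3042\u3044"]
blob = write_fields(names)
print(read_fields(blob) == names)
```
&lt;/PRE&gt;
&lt;P&gt;Because the count is in bytes, the reader never has to guess whether trailing shift sequences belong to the field, and the same two functions keep working if the encoding is swapped for a stateful one, as long as reader and writer agree on which encoding that is.&lt;/P&gt;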
&lt;P&gt;Of course you may be talking to a GPS or something where you don't get to define the standard.&amp;nbsp; In that case you can just watch out for these caveats.&amp;nbsp; Should you be designing such a protocol however, make sure to use Unicode.&amp;nbsp; If that cannot happen, at least make sure to pay attention to the impact of encoding and decoding the data when the protocol's used.&lt;/P&gt;
&lt;P&gt;-Shawn&lt;/P&gt;
</description><category domain="http://blogs.msdn.com/shawnste/archive/tags/Unicode+and+Code+Pages_2F00_Encodings/default.aspx">Unicode and Code Pages/Encodings</category><category domain="http://blogs.msdn.com/shawnste/archive/tags/System.Text/default.aspx">System.Text</category></item><item><title>Why can't we strip the diacritics?</title><link>http://blogs.msdn.com/shawnste/archive/2007/06/08/why-can-t-we-strip-the-diacritics.aspx</link><pubDate>Sat, 09 Jun 2007 05:20:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:3149153</guid><dc:creator>shawnste</dc:creator><slash:comments>5</slash:comments><comments>http://blogs.msdn.com/shawnste/comments/3149153.aspx</comments><wfw:commentRss>http://blogs.msdn.com/shawnste/commentrss.aspx?PostID=3149153</wfw:commentRss><description>&lt;P&gt;We have some "best-fit" behavior which we generally consider to be "bad".&amp;nbsp; Any loss of data is generally a bad thing, so we recommend storing data in Unicode (so you don't lose anything).&amp;nbsp; Assuming you can't use Unicode, why is it so bad to just make everything ASCII-like?&amp;nbsp; Maybe you have a publishing house or direct marketing firm that can't handle Unicode, so you'll just get rid of those annoying decorations.&lt;/P&gt;
&lt;P&gt;In American English the diacritics are effectively quaint decorations.&amp;nbsp; Many people naïvely assume that when Word auto-corrects naive to naïve, this is just a prettiness factor.&amp;nbsp; When they resume spell checking their résumé the diacritics become more important.&amp;nbsp; In English it's fair to spell résumé as resume, but it seems cooler to add the accents.&amp;nbsp; Since we stole (borrowed is more politically correct) the word from French, we have a French-like pronunciation of résumé, and aren't likely to confuse it with resume.&lt;/P&gt;
&lt;P&gt;In most other languages diacritics aren't optional.&amp;nbsp; You wouldn't exchange a z with an s in English just because they look similar.&amp;nbsp; "A real singer" is a lot different than "a real zinger".&lt;/P&gt;
&lt;P&gt;Recently I encountered the following example: a user wanted to get around those pesky diacritics by mapping to ASCII.&lt;/P&gt;
&lt;P&gt;The suggested input was:&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; último año de carrera&lt;/P&gt;
&lt;P&gt;The desired output was:&lt;BR&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; ultimo ano de carrera&lt;/P&gt;
&lt;P&gt;My Spanish is nearly non-existent; however, Word's spell checker tells me these are all legitimate Spanish words, even without the accents.&amp;nbsp; The meaning goes from something like "the last year of the race" to "I completed the anus of the race."&lt;/P&gt;
&lt;P&gt;Now imagine that you're trying to reach a new market and you do that to your customers' names or potential customers' names.&amp;nbsp; How long will they remain your customers? &lt;/P&gt;
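&lt;P&gt;For the curious, this is roughly what such a "strip the decorations" step looks like in Python.&amp;nbsp; It is a sketch of the lossy transformation being warned against, not a recommendation; the function name is made up, and decomposing to NFD and dropping combining marks is just one common way people do it.&lt;/P&gt;
&lt;PRE&gt;
```python
import unicodedata

def strip_diacritics(text):
    """Decompose, then drop combining marks. Lossy: shown as a warning only."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

original = "último año de carrera"
stripped = strip_diacritics(original)

print(stripped)              # ultimo ano de carrera
print(stripped == original)  # False: these are different Spanish words
```
&lt;/PRE&gt;
&lt;P&gt;The code runs without complaint and produces perfectly valid ASCII, which is exactly why the damage tends to go unnoticed until a customer reads it.&lt;/P&gt;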
&lt;P&gt;- Shawn&lt;/P&gt;
</description><category domain="http://blogs.msdn.com/shawnste/archive/tags/Unicode+and+Code+Pages_2F00_Encodings/default.aspx">Unicode and Code Pages/Encodings</category><category domain="http://blogs.msdn.com/shawnste/archive/tags/System.Text/default.aspx">System.Text</category></item><item><title>Encoder/Decoder Encoding fallbacks fail after 2GB of data has been converted</title><link>http://blogs.msdn.com/shawnste/archive/2007/06/07/encoder-decoder-encoding-fallbacks-fail-after-2gb-of-data-has-been-converted.aspx</link><pubDate>Thu, 07 Jun 2007 23:19:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:3148083</guid><dc:creator>shawnste</dc:creator><slash:comments>0</slash:comments><comments>http://blogs.msdn.com/shawnste/comments/3148083.aspx</comments><wfw:commentRss>http://blogs.msdn.com/shawnste/commentrss.aspx?PostID=3148083</wfw:commentRss><description>&lt;P&gt;We have an unfortunate bug in .Net v2.0+ that causes encoding or decoding of more than 2GB of data to fail.&amp;nbsp; That's a lot of data, but it still shouldn't do that.&amp;nbsp; This is a problem with our built in fallbacks.&lt;/P&gt;
&lt;P&gt;Ironically, if you encounter bad bytes then the bug is reset and you're "good" for another 2GB.&amp;nbsp; This bug happens to most of our code pages for valid data, but some optimizations make it unlikely to happen&amp;nbsp;in Unicode, ASCII &amp;amp; Latin-1.&amp;nbsp; There are some workarounds.&amp;nbsp; Some of these don't work if you're insulated from the decoder/encoder (like using a StreamWriter):&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Change the encoder and decoder&amp;nbsp;fallbacks to&amp;nbsp;custom fallbacks, or use&amp;nbsp;the built-in EncoderExceptionFallback.&amp;nbsp; If you have known-good data, the ExceptionFallback would be a good choice.&lt;/LI&gt;
&lt;LI&gt;Use UTF-8 or UTF-16.&amp;nbsp; I think this nearly completely solves the problem.&amp;nbsp; At the minimum it extends the data by enough magnitudes that your computer would probably die of hardware failure before you hit the bug.&lt;/LI&gt;
&lt;LI&gt;Unconvertible data resets the bug, so you have another 2GB before it'll die.&amp;nbsp; You may be able to occasionally introduce an unconvertible code point (like U+FFFD).&lt;/LI&gt;
&lt;LI&gt;This only happens&amp;nbsp;when the encoder/decoder fallback buffers aren't reset.&amp;nbsp; Using the Encoding.GetBytes/GetChars&amp;nbsp;won't fail unless you&amp;nbsp;try&amp;nbsp;a string longer than 2GB.&amp;nbsp; If you are using short text segments that&amp;nbsp;don't need the Encoder or Decoder state this would be a good approach.&amp;nbsp; For example, if you're piping a bunch of messages to the console, you might consider just sending one line at a time using the Encoding class.&lt;/LI&gt;
&lt;LI&gt;Getting a new Encoder or Decoder object when possible will&amp;nbsp;give you a fresh start.&amp;nbsp; For example if you process a bunch of smaller documents you might change encoders/decoders between documents, or between records or whatever.&lt;/LI&gt;&lt;/UL&gt;
&lt;P&gt;Hope that helps,&lt;/P&gt;
&lt;P&gt;Shawn&lt;/P&gt;</description><category domain="http://blogs.msdn.com/shawnste/archive/tags/Unicode+and+Code+Pages_2F00_Encodings/default.aspx">Unicode and Code Pages/Encodings</category><category domain="http://blogs.msdn.com/shawnste/archive/tags/System.Text/default.aspx">System.Text</category></item><item><title>How do I get HKSCS 2004 characters from Big-5 in .Net?</title><link>http://blogs.msdn.com/shawnste/archive/2007/05/03/how-do-i-get-hkscs-2004-characters-from-big-5-in-net.aspx</link><pubDate>Thu, 03 May 2007 23:29:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:2399152</guid><dc:creator>shawnste</dc:creator><slash:comments>3</slash:comments><comments>http://blogs.msdn.com/shawnste/comments/2399152.aspx</comments><wfw:commentRss>http://blogs.msdn.com/shawnste/commentrss.aspx?PostID=2399152</wfw:commentRss><description>&lt;P&gt;Well, that's pretty tricky.&amp;nbsp; We provide the &lt;A href="http://www.microsoft.com/downloads/details.aspx?FamilyID=0e6f5ac8-7baa-4571-b8e8-78b3b776afd7&amp;amp;DisplayLang=en#Overview"&gt;Microsoft Character Code Conversion Routines For HKSCS-2004&lt;/A&gt;&amp;nbsp;functions, but those are intended for use with unmanaged code.&lt;/P&gt;
&lt;P&gt;The fundamental problem is that these "HKSCS" characters were in use prior to the assignment of a code point for them in Unicode.&amp;nbsp; In order to support them, we mapped Big 5 / Code Page 950 HKSCS characters to the Unicode Private Use area.&amp;nbsp; So now there is data with these code points in the PUA and in Big 5, AND at the Unicode 5.0 code points.&amp;nbsp; The expectation is to use Unicode long term, so these functions were provided to help map old data to the new Unicode 5 code points.&lt;/P&gt;
&lt;P&gt;Another way for a managed application to solve this problem would be to create your own Encoding and map the Big 5 code points to their new Unicode code points instead of the old code page 950 mappings.&amp;nbsp; It is nearly impossible for Microsoft to provide a patch to do this because some users have data in the old PUA code space and their applications would break if the data was suddenly migrated to the assigned HKSCS code points without them opting in.&amp;nbsp; Eventually "all" the interesting data should be migrated from the PUA code points to the Unicode HKSCS code points, but until then the problem remains.&lt;/P&gt;
&lt;P&gt;The code samples and links from the "Microsoft Character Code Conversion Routines For HKSCS-2004" document would be a good starting spot to generate the necessary mappings to make an Encoding that moved code page 950 data to the new HKSCS code points.&lt;/P&gt;</description><category domain="http://blogs.msdn.com/shawnste/archive/tags/Unicode+and+Code+Pages_2F00_Encodings/default.aspx">Unicode and Code Pages/Encodings</category><category domain="http://blogs.msdn.com/shawnste/archive/tags/System.Text/default.aspx">System.Text</category></item><item><title>Please avoid UTF-7</title><link>http://blogs.msdn.com/shawnste/archive/2007/05/01/please-avoid-utf-7.aspx</link><pubDate>Tue, 01 May 2007 21:13:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:2361079</guid><dc:creator>shawnste</dc:creator><slash:comments>1</slash:comments><comments>http://blogs.msdn.com/shawnste/comments/2361079.aspx</comments><wfw:commentRss>http://blogs.msdn.com/shawnste/commentrss.aspx?PostID=2361079</wfw:commentRss><description>&lt;P&gt;UTF-7 inherently has some of the security issues that concern people about encodings.&amp;nbsp; For example, by shifting in &amp;amp; out of the base64 mode one can create multiple representations of the same string, enabling spoofing and other problems.&lt;/P&gt;
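&lt;P&gt;A tiny sketch of the "multiple representations" problem, using Python's built-in utf-7 codec purely for illustration (other decoders may be stricter about what they accept).&amp;nbsp; The same letter is written once directly and once inside a base64 run, and both decode to the same string.&lt;/P&gt;
&lt;PRE&gt;
```python
# Python's built-in "utf-7" codec is used for illustration only.
direct = b"z"        # the letter written directly
shifted = b"+AHo-"   # the same letter written in the base64 mode

decoded_direct = direct.decode("utf-7")
decoded_shifted = shifted.decode("utf-7")

print(decoded_direct == decoded_shifted)  # True: two spellings, one string

# A byte-level check for the direct form does not see the shifted form.
print(direct in shifted)  # False
```
&lt;/PRE&gt;
&lt;P&gt;Anything that inspects the bytes before decoding (a filter, a comparison, a signature) can be shown one spelling while the consumer ends up with the other.&lt;/P&gt;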
&lt;P&gt;UTF-7 is primarily interesting for legacy mail and NNTP applications that don't properly handle native or MIME encoded UTF-8.&amp;nbsp; The need for new content to be encoded in UTF-7 is very low.&amp;nbsp; In particular UTF-7 should be avoided with any modern systems that are natively 8-bit.&amp;nbsp; For example XML files don't inherently have any limitations that would force the need for UTF-7, so there should be no need for UTF-7 in XML files.&lt;/P&gt;
&lt;P&gt;Of course with any general rule there may be some exceptions, but I'd encourage you to support UTF-8 or UTF-16 and only use UTF-7 if you run into some system that can't support an 8-bit encoding.&amp;nbsp; If you run into such 7 bit limitations it should probably be a warning that some redesign might be necessary.&amp;nbsp; For mail this is being considered by the IETF's&amp;nbsp;eai working group&amp;nbsp;at &lt;A href="http://www.ietf.org/html.charters/eai-charter.html"&gt;http://www.ietf.org/html.charters/eai-charter.html&lt;/A&gt;&lt;/P&gt;
</description><category domain="http://blogs.msdn.com/shawnste/archive/tags/Unicode+and+Code+Pages_2F00_Encodings/default.aspx">Unicode and Code Pages/Encodings</category><category domain="http://blogs.msdn.com/shawnste/archive/tags/System.Text/default.aspx">System.Text</category></item><item><title>Some Reasons to Make Your Application Unicode</title><link>http://blogs.msdn.com/shawnste/archive/2007/03/20/some-reasons-to-make-your-application-unicode.aspx</link><pubDate>Tue, 20 Mar 2007 22:58:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:1914417</guid><dc:creator>shawnste</dc:creator><slash:comments>3</slash:comments><comments>http://blogs.msdn.com/shawnste/comments/1914417.aspx</comments><wfw:commentRss>http://blogs.msdn.com/shawnste/commentrss.aspx?PostID=1914417</wfw:commentRss><description>&lt;P&gt;[Updated Mar 30 2007: Mike pointed out errors which I've corrected]&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Many applications are "still" ANSI and can't handle Unicode.&amp;nbsp; We (Microsoft) have even released non-Unicode applications reasonably recently, even though we should know better.&amp;nbsp; In particular there are a bunch of good reasons to move your app to Unicode.&amp;nbsp; I'm rushed so I'm only listing a few here.&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;We have been adding many new locales and keyboards.&amp;nbsp; Many of those don't have code pages.&amp;nbsp; These "Unicode Only" locales have to pick a system code page that only marginally supports their language, if it supports it at all.&amp;nbsp; In these cases your ANSI application will be completely unusable in these locales.&lt;/LI&gt;
&lt;LI&gt;Data passed between ANSI systems is easy to misinterpret if the systems have different code pages.&amp;nbsp; This leads to random data corruption, some of which isn't always recoverable.&lt;/LI&gt;
&lt;LI&gt;ANSI apps don't support the full range of characters, so users with unique requirements may not be able to enter data completely or correctly.&lt;/LI&gt;
&lt;LI&gt;Mixed language&amp;nbsp;environments fail with ANSI only applications.&lt;/LI&gt;
&lt;LI&gt;ANSI only bugs like &lt;A href="http://blogs.msdn.com/shawnste/archive/2007/03/19/some-keyboards-fail-with-ansi-applications-on-windows-vista-rtm.aspx" mce_href="http://blogs.msdn.com/shawnste/archive/2007/03/19/some-keyboards-fail-with-ansi-applications-on-windows-vista-rtm.aspx"&gt;Some Keyboards fail with ANSI applications on Windows Vista RTM&lt;/A&gt; won't impact your application&lt;/LI&gt;&lt;/UL&gt;
&lt;P&gt;We have encountered numerous customer issues which could've been solved fairly trivially by using Unicode applications.&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Some popular messaging applications are not Unicode, so users cannot always send or receive messages properly in multi-lingual environments.&amp;nbsp; For people working in other countries this is a common case.&lt;/LI&gt;
&lt;LI&gt;Most people have seen "gibberish" on web sites due to mistagged data.&amp;nbsp; UTF-8 or UTF-16 would solve most of this confusion.&lt;/LI&gt;
&lt;LI&gt;Many media tagging systems didn't originally specify an encoding for metadata, causing corrupted metadata when viewing on other machines.&amp;nbsp; (This is also an example of data that can be very difficult to recover)&lt;/LI&gt;
&lt;LI&gt;Data being sync'd to phones, etc. hasn't always worked if part of the chain is ANSI.&lt;/LI&gt;
&lt;LI&gt;Wireless SSIDs (wireless WLAN names) don't specify code pages, so if you're in a foreign airport or other multicultural environment you might get gibberish for the names when trying to find a network to connect to.&lt;/LI&gt;
&lt;LI&gt;Customer names have accents dropped or turned into ? when unexpected code points are encountered.&amp;nbsp; (For example, you have a web form where the user enters their own name correctly, but when a printer merges this for a magazine subscription or whatever, the non-ANSI/non-ASCII data gets lost.)&amp;nbsp; Some users are very irritated when their name gets misprinted in this manner.&lt;/LI&gt;&lt;/UL&gt;
&lt;P&gt;Most of these issues are easily solved by considering the encoding requirements and character repertoires, but these are often overlooked.&amp;nbsp; US developers seem to be particularly susceptible to this design problem since ANSI is easy to use and their applications have a large US market even if they only use ANSI characters.&lt;/P&gt;
&lt;P&gt;There are some occasions, primarily for backward compatibility or existing protocols that didn't plan for Unicode, where applications can't avoid ANSI.&amp;nbsp; In those cases I suggest A) trying to get a plan to remove the back compat issue or fix the protocol so that this problem&amp;nbsp;doesn't continue for decades, B) use Unicode in the meantime and only convert it to the ANSI code page when necessary, and C) tag the data with the appropriate code page when possible so the receiver has a hope of decoding it properly.&amp;nbsp; It is best to avoid these situations though because they invariably have edge cases that are difficult to handle.&lt;/P&gt;
&lt;P&gt;Use Unicode :)&lt;/P&gt;</description><category domain="http://blogs.msdn.com/shawnste/archive/tags/Unicode+and+Code+Pages_2F00_Encodings/default.aspx">Unicode and Code Pages/Encodings</category><category domain="http://blogs.msdn.com/shawnste/archive/tags/System.Text/default.aspx">System.Text</category></item><item><title>A History of Code Pages or What Made Code Page XXXX (or many other computer things) The Way It Is?</title><link>http://blogs.msdn.com/shawnste/archive/2007/03/13/The-History-of-Code-Pages.aspx</link><pubDate>Tue, 13 Mar 2007 21:29:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:1867403</guid><dc:creator>shawnste</dc:creator><slash:comments>1</slash:comments><comments>http://blogs.msdn.com/shawnste/comments/1867403.aspx</comments><wfw:commentRss>http://blogs.msdn.com/shawnste/commentrss.aspx?PostID=1867403</wfw:commentRss><description>&lt;P&gt;&lt;EM&gt;Disclaimer:&amp;nbsp; This is mostly my conjecture, so I could be completely wrong about some of this, but it seems plausible to me.&amp;nbsp; I’m aiming for the general concepts here, not to start a discussion about the specific details of the history of code pages.&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;Taking a snapshot of the current Windows code pages (or any other code pages), one can wonder how some of these code pages ended up in their current state.&amp;nbsp; We also wonder about other things such as peculiarities of a function call and other related behavior.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;It is important to remember that modern computer systems evolved from earlier systems and “we”, as in the entire computer science community on the planet, have learned a lot since the beginnings of computer science.&amp;nbsp; Rarely do we get the chance to start with “a clean slate” and redesign APIs or systems.&amp;nbsp; Even when we do, we only have our best intentions and previous lessons to learn from, and sometimes those new designs prove to have weaknesses that weren’t originally seen.&lt;/P&gt;
&lt;P&gt;In DOS days of PC history, “code pages” were the bytes used to directly print to the console.&amp;nbsp; Apple, Commodore, IBM, and probably many others used bytes to map to a character on the console.&amp;nbsp; (Before that there were the values that showed up on Teletypes or punch cards, but I’m kind of focusing on Windows history).&amp;nbsp; The US and “western” cultures seem to have had a great influence on the development of early PCs, and the ASCII standard was very common.&amp;nbsp; Many future behaviors were based on ASCII or similar work.&lt;/P&gt;
&lt;P&gt;ASCII only specified 7 bits of information, but since PCs had 8 bits most manufacturers extended the code pages to provide additional glyphs, such as diacritics or additional scripts (besides Latin).&amp;nbsp; This provided the ability to represent many languages, but at a hidden cost of data portability.&amp;nbsp; Since most data was confined to single companies and global exchange of data wasn’t a primary concern, this wasn’t a big problem at first. &lt;/P&gt;
&lt;P&gt;Additionally, since these bytes were used to render glyphs on the screen it seemed wasteful to ignore the non-printable control sequences from 01-1f, so smiley faces, hearts, spades and the like were added.&lt;/P&gt;
&lt;H3&gt;Users Want&amp;nbsp;&lt;EM&gt;Their&lt;/EM&gt; Glyphs:&lt;/H3&gt;
&lt;P&gt;As computing evolved users wanted more glyphs and several techniques evolved to solve that problem.&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Font changing was used by some systems (and continues to be used in some cases).&amp;nbsp; Early DOS PCs effectively changed the font used for the display when they changed the “OEM code page”.&amp;nbsp; Once multiple font use became common, this technique evolved to allow multiple glyph sets to be displayed in a single application merely by changing the font.&amp;nbsp; In modern systems Unicode provides a Private Use Area (PUA) for users to stick their custom glyphs, but font hacks continue to be used.&amp;nbsp; The PUA solution doesn’t work on the console or for ANSI applications, so some groups have created font hacks that render the desired glyphs, yet their system uses a code page with different characters than those the font displays.&amp;nbsp; The adoption of this technique ranges from users “playing” with the invented Klingon script to national “standards” attempting to make computers work for them where OEMs have been slow to create fonts or other solutions.&lt;/LI&gt;
&lt;LI&gt;Switching fonts is effectively built-in to the ISCII standard.&amp;nbsp; The idea is that escape sequences are used to select which font is to be used for the 8th bit character ranges.&amp;nbsp; Originally this included the idea of simple transliteration (by merely changing the rendered font), but this doesn’t seem to be used much in practice.&amp;nbsp; This technique sort of standardizes the use of the font changing technique.&amp;nbsp; This is obviously an evolution beyond the early PCs that could only display a single font.&amp;nbsp;&lt;/LI&gt;&lt;/UL&gt;
&lt;P&gt;For CJK (Chinese, Japanese &amp;amp; Korean) scripts, 8 bit fonts aren’t enough.&amp;nbsp; CJK code pages are usually still ASCII compatible, but they’ve evolved other techniques for rendering additional characters.&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Double byte code pages have the idea that a specific range of bytes are “lead bytes” and are to be followed by a “trail byte”.&amp;nbsp; Combined, the lead &amp;amp; trail bytes provide many additional characters.&amp;nbsp; CP 932, etc. are examples of this technique.&lt;/LI&gt;
&lt;LI&gt;This idea was extended by GB18030 to provide additional lead bytes that indicate 4 byte sequences, allowing even more characters to be encoded.&lt;/LI&gt;
&lt;LI&gt;Shifting code pages are similar to ISCII in that they select additional modes.&amp;nbsp; They typically “start” in an ASCII-like code page, but particular escape sequences cause the following bytes to be interpreted according to other rules.&amp;nbsp; Generally these provide additional two byte or single byte sequences.&amp;nbsp; Note that the shift sequences are typically single byte, even when currently in a double byte mode.&lt;/LI&gt;&lt;/UL&gt;
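&lt;P&gt;As a rough illustration of the lead/trail byte idea, here is a minimal Python sketch (Python rather than the Windows APIs, purely for brevity).&amp;nbsp; It assumes the usual lead byte ranges for code page 932, 0x81-0x9F and 0xE0-0xFC, and the helper names are made up for this example.&lt;/P&gt;
&lt;PRE&gt;
```python
# Assumed lead byte ranges for code page 932: 0x81-0x9F and 0xE0-0xFC.
LEAD_RANGES = (range(0x81, 0xA0), range(0xE0, 0xFD))


def is_lead_byte(value):
    """True if this byte starts a two-byte sequence (hypothetical helper)."""
    return any(value in r for r in LEAD_RANGES)


def split_characters(data):
    """Split cp932 bytes into one bytes object per encoded character."""
    chunks = []
    i = 0
    while data[i:]:  # loop until no bytes remain
        width = 2 if is_lead_byte(data[i]) else 1
        chunks.append(data[i:i + width])
        i += width
    return chunks


# "A" followed by two kanji (U+65E5, U+672C).
encoded = "A\u65e5\u672c".encode("cp932")
chunks = split_characters(encoded)
print([c.hex() for c in chunks])
```
&lt;/PRE&gt;
&lt;P&gt;Note that a trail byte can fall in the ASCII range, which is one reason double byte data can't safely be scanned one byte at a time.&lt;/P&gt;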
&lt;H3&gt;Evolution of the Character Repertoire:&lt;/H3&gt;
&lt;P&gt;In addition to the evolution of techniques, the repertoire of supported characters has been evolving.&amp;nbsp; Unfortunately the drivers of this process are rarely coordinated across the industry.&amp;nbsp; As a need for a new character becomes apparent, organizations add it to the standards that they influence or control, but this doesn’t guarantee adoption across the industry, particularly if they don’t coordinate with other standards or organizations.&lt;/P&gt;
&lt;P&gt;This repertoire evolution can cause the behavior of code pages to evolve as well.&amp;nbsp; For example, the Euro was invented well after the creation of ASCII and many of the other code pages.&amp;nbsp; Obviously it was needed, so it was added to most code pages, squeezing into unused spaces where possible.&amp;nbsp; For single byte code pages that could mean replacing a previously rarely used code point.&amp;nbsp; Of course if a vendor used that rarely used code point for something special in their application, then this caused behavioral changes.&lt;/P&gt;
&lt;P&gt;For other standards the repertoire evolution has meant evolving iterations of the standard.&amp;nbsp; Several organizations add characters to their standards, but it can take a while for those to make it to the font vendor or other level necessary for complete support.&amp;nbsp; Shifting standards can also change existing user data or private use behavior, so supporting new standards isn’t always a trivial undertaking.&amp;nbsp; &lt;/P&gt;
&lt;P&gt;Some character sets have been complicated by standards dependencies.&amp;nbsp; For example if a desirable standard assigns a bunch of characters and users want Windows support, then Windows has to find space in Unicode since Windows is Unicode based.&amp;nbsp; In the best case the desired characters are already assigned to Unicode so Windows can “just” add font support (not necessarily trivial) and is good to go.&amp;nbsp; Historically however, characters are usually created by some other authority and may take a while to get official Unicode support.&amp;nbsp; In those cases, the characters can remain unsupported, or someone can add PUA characters to support them until Unicode supports them. &lt;/P&gt;
&lt;P&gt;If PUA characters are used to temporarily support additional characters, then there are additional problems when they are added to Unicode since existing data will need to be migrated from the PUA to the actual Unicode code point.&amp;nbsp; Migration may also be complicated by the fact that all users may not be able to upgrade at the same time.&lt;/P&gt;
&lt;H3&gt;Implementation Details:&lt;/H3&gt;
&lt;P&gt;Another problem impacting the way code pages behave is how (and when) they’ve been implemented.&amp;nbsp; Occasionally standards have had errors that were corrected in later versions.&amp;nbsp; Other times a platform vendor may have interpreted the behavior in an unexpected way.&amp;nbsp; Sometimes a font vendor for a common font could make an error with a code point.&amp;nbsp; Additionally users may commonly confuse a glyph with a similar glyph and abuse the existing standard.&lt;/P&gt;
&lt;P&gt;All of these contribute to variations in the way code page data is handled.&amp;nbsp; Once data is coded in a particular way, correcting the data may be complicated.&amp;nbsp; It can be easy to identify an implementation bug and find the “correct” solution, but making the fix can break existing behavior or data portability.&lt;/P&gt;
&lt;H3&gt;Oddities:&lt;/H3&gt;
&lt;P&gt;For historical reasons there are also some oddities in encoded data.&amp;nbsp; Remember that code points were often merely glyphs on the computer screen?&amp;nbsp; And those glyphs depended on the rendering of that machine?&amp;nbsp; Well, DOS used the \ character to delimit folders on the file system.&amp;nbsp; CJK users however wanted to be able to type their currency symbol on their machines.&amp;nbsp; Since people don’t use \ very often, it got replaced with the appropriate currency symbol on Asian machines.&amp;nbsp; Internally it was always 0x5C however, and the machine always used that byte value to delimit folders.&amp;nbsp; The end result is a mess where 0x5C doesn’t convert to Unicode very well, where users have different file system delimiter characters, and where fonts end up hacked to render ¥ instead of \ if you have a certain system code page.&amp;nbsp; This is obviously really undesirable, yet it is pretty obvious how this happened and pretty difficult to “fix” at this time.&lt;/P&gt;
&lt;H3&gt;Conclusion:&lt;/H3&gt;
&lt;P&gt;I find it helpful to remember this stuff when confronted with another code page oddity.&amp;nbsp; One of my goals is to reduce any further complexity in this evolutionary tree of code pages.&amp;nbsp; It is often “clear” what the desired or proper behavior should be when you consider only current standards or when you know the lessons we’ve already learned.&amp;nbsp; With a living system of data it isn’t always possible to get from the current state to the perfect state in a simple manner without causing pain to some users.&amp;nbsp; In that case I try to limit the long term pain, and reduce the problems to as few users as possible.&lt;/P&gt;
&lt;P&gt;Similar examples exist for nearly all API sets and programming languages, OS’s and techniques.&amp;nbsp; The global “we” of computer science has learned a lot and continues to learn a lot, but sometimes it’s helpful to remember how an API may have evolved when it doesn’t seem to be doing the most appropriate thing.&lt;/P&gt;
&lt;P&gt;For code pages this is a good reason to use Unicode.&amp;nbsp; Windows is natively Unicode and most other systems understand it.&amp;nbsp; It is also reasonably unambiguous, although it does have its own evolutionary quirks.&amp;nbsp; By focusing on a single encoding (Unicode), we can reduce the complexities caused by natural variations introduced as encodings evolve.&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;Reminder: This is mostly my conjecture and seems reasonable to me, although it might be wrong or lack specifics.&lt;BR&gt;&lt;/EM&gt;&lt;/P&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=1867403" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/shawnste/archive/tags/Unicode+and+Code+Pages_2F00_Encodings/default.aspx">Unicode and Code Pages/Encodings</category><category domain="http://blogs.msdn.com/shawnste/archive/tags/System.Text/default.aspx">System.Text</category></item><item><title>Expected names of Microsoft Windows "ANSI" Code Pages (Encodings)</title><link>http://blogs.msdn.com/shawnste/archive/2006/11/06/expected-names-of-microsoft-windows-ansi-code-pages-encodings.aspx</link><pubDate>Tue, 07 Nov 2006 02:52:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:1006751</guid><dc:creator>shawnste</dc:creator><slash:comments>0</slash:comments><comments>http://blogs.msdn.com/shawnste/comments/1006751.aspx</comments><wfw:commentRss>http://blogs.msdn.com/shawnste/commentrss.aspx?PostID=1006751</wfw:commentRss><description>&lt;P&gt;I was asked about our use of the Windows "ANSI" code page names, as used in things&amp;nbsp;like MIME types,&amp;nbsp;HTTP content-type tags, etc.&amp;nbsp; Each "code page" has a name that most accurately round trips back to the same code page, which I've listed as the "preferred name" below.&amp;nbsp; Additionally, when you ask for a code page matching a name, some code pages have several aliases that map to the identical behavior.&amp;nbsp; (These are listed as "aliases" in the table below.)&lt;/P&gt;
&lt;P&gt;Note that there are quite a few inconsistencies and other odd behaviors.&amp;nbsp; Some have names of windows-xxx, and others don't even recognize that form as an alias.&amp;nbsp; Additionally some reference material I've seen refers to any Microsoft code page in the windows-xxxx or CPxxx form, whether or not Windows itself recognizes those names.&amp;nbsp; &lt;/P&gt;
&lt;P&gt;We have no intention of trying to create a more consistent naming scheme; we prefer that applications use Unicode.&lt;/P&gt;
&lt;P&gt;So in practice, you should use the "preferred name" to identify data tagged by a particular code page, but if you are accepting input data, recognize that it may also use one of the listed aliases.&amp;nbsp; Encoding.GetEncoding() should "do the right thing".&amp;nbsp; You might want to look at my previous posts "&lt;A id=ctl00___ctl00___ctl01___Results___postlist___EntryItems_ctl02_PostTitle href="http://blogs.msdn.com/shawnste/archive/2006/07/18/669963.aspx" mce_href="http://blogs.msdn.com/shawnste/archive/2006/07/18/669963.aspx"&gt;&lt;FONT color=#006bad&gt;Encoding.GetEncodings() has a couple "duplicate" names&lt;/FONT&gt;&lt;/A&gt;" and "&lt;A id=ctl00___ctl00___ctl01___Results___postlist___EntryItems_ctl06_PostTitle href="http://blogs.msdn.com/shawnste/archive/2005/12/28/507816.aspx" mce_href="http://blogs.msdn.com/shawnste/archive/2005/12/28/507816.aspx"&gt;&lt;FONT color=#006bad&gt;What's my Encoding Called?&lt;/FONT&gt;&lt;/A&gt;".&lt;/P&gt;
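&lt;P&gt;The "accept aliases, emit one preferred name" idea can be sketched in a few lines of Python.&amp;nbsp; This is only an analogy: it uses Python's own codec registry and alias table, not the Windows/.Net names listed below, and canonical_name is a made-up helper.&lt;/P&gt;
&lt;PRE&gt;
```python
import codecs


def canonical_name(label):
    """Return the Python registry's canonical codec name for a label or alias."""
    return codecs.lookup(label).name


# Several spellings that Python's registry treats as the same codec.
labels = ["shift_jis", "sjis", "shift-jis", "csShiftJIS"]
names = {label: canonical_name(label) for label in labels}
print(names)
```
&lt;/PRE&gt;
&lt;P&gt;When tagging outgoing data you would emit only the single canonical name, and tolerate the aliases only on input.&lt;/P&gt;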
&lt;H3&gt;Code Page 874 (ANSI/OEM - Thai)&lt;/H3&gt;&lt;B&gt;Preferred Name:&lt;/B&gt;&lt;BR&gt;
&lt;DD&gt;windows-874&lt;/DD&gt;&lt;BR&gt;&lt;B&gt;Aliases:&lt;/B&gt;&lt;BR&gt;
&lt;DD&gt;DOS-874&lt;/DD&gt;
&lt;DD&gt;iso-8859-11&lt;/DD&gt;
&lt;DD&gt;TIS-620&lt;/DD&gt;&lt;BR&gt;
&lt;H3&gt;Code Page 932 (ANSI/OEM - Japanese Shift-JIS)&lt;/H3&gt;&lt;B&gt;Preferred Name:&lt;/B&gt;&lt;BR&gt;
&lt;DD&gt;shift_jis&lt;/DD&gt;&lt;BR&gt;&lt;B&gt;Aliases:&lt;/B&gt;&lt;BR&gt;
&lt;DD&gt;csShiftJIS&lt;/DD&gt;
&lt;DD&gt;csWindows31J&lt;/DD&gt;
&lt;DD&gt;ms_Kanji&lt;/DD&gt;
&lt;DD&gt;shift-jis&lt;/DD&gt;
&lt;DD&gt;sjis&lt;/DD&gt;
&lt;DD&gt;x-ms-cp932&lt;/DD&gt;
&lt;DD&gt;x-sjis&lt;/DD&gt;&lt;BR&gt;
&lt;H3&gt;Code Page 936 (ANSI/OEM - Simplified Chinese GBK)&lt;/H3&gt;&lt;B&gt;Preferred Name:&lt;/B&gt;&lt;BR&gt;
&lt;DD&gt;gb2312&lt;/DD&gt;&lt;BR&gt;&lt;B&gt;Aliases:&lt;/B&gt;&lt;BR&gt;
&lt;DD&gt;chinese&lt;/DD&gt;
&lt;DD&gt;CN-GB&lt;/DD&gt;
&lt;DD&gt;csGB2312&lt;/DD&gt;
&lt;DD&gt;csGB231280&lt;/DD&gt;
&lt;DD&gt;csISO58GB231280&lt;/DD&gt;
&lt;DD&gt;GB2312-80&lt;/DD&gt;
&lt;DD&gt;GB231280&lt;/DD&gt;
&lt;DD&gt;GBK&lt;/DD&gt;
&lt;DD&gt;GB_2312-80&lt;/DD&gt;
&lt;DD&gt;iso-ir-58&lt;/DD&gt;&lt;BR&gt;
&lt;H3&gt;Code Page 949 (ANSI/OEM - Korean)&lt;/H3&gt;&lt;B&gt;Preferred Name:&lt;/B&gt;&lt;BR&gt;
&lt;DD&gt;ks_c_5601-1987&lt;/DD&gt;&lt;BR&gt;&lt;B&gt;Aliases:&lt;/B&gt;&lt;BR&gt;
&lt;DD&gt;csKSC56011987&lt;/DD&gt;
&lt;DD&gt;iso-ir-149&lt;/DD&gt;
&lt;DD&gt;korean&lt;/DD&gt;
&lt;DD&gt;ks-c-5601&lt;/DD&gt;
&lt;DD&gt;ks-c5601&lt;/DD&gt;
&lt;DD&gt;KSC5601&lt;/DD&gt;
&lt;DD&gt;KSC_5601&lt;/DD&gt;
&lt;DD&gt;ks_c_5601&lt;/DD&gt;
&lt;DD&gt;ks_c_5601-1989&lt;/DD&gt;
&lt;DD&gt;ks_c_5601_1987&lt;/DD&gt;&lt;BR&gt;
&lt;H3&gt;Code Page 950 (ANSI/OEM - Traditional Chinese Big5)&lt;/H3&gt;&lt;B&gt;Preferred Name:&lt;/B&gt;&lt;BR&gt;
&lt;DD&gt;big5&lt;/DD&gt;&lt;BR&gt;&lt;B&gt;Aliases:&lt;/B&gt;&lt;BR&gt;
&lt;DD&gt;Big5-HKSCS&lt;/DD&gt;
&lt;DD&gt;cn-big5&lt;/DD&gt;
&lt;DD&gt;csbig5&lt;/DD&gt;
&lt;DD&gt;x-x-big5&lt;/DD&gt;&lt;BR&gt;
&lt;H3&gt;Code Page 1250 (ANSI - Central Europe)&lt;/H3&gt;&lt;B&gt;Preferred Name:&lt;/B&gt;&lt;BR&gt;
&lt;DD&gt;windows-1250&lt;/DD&gt;&lt;BR&gt;&lt;B&gt;Aliases:&lt;/B&gt;&lt;BR&gt;
&lt;DD&gt;x-cp1250&lt;/DD&gt;&lt;BR&gt;
&lt;H3&gt;Code Page 1251 (ANSI - Cyrillic)&lt;/H3&gt;&lt;B&gt;Preferred Name:&lt;/B&gt;&lt;BR&gt;
&lt;DD&gt;windows-1251&lt;/DD&gt;&lt;BR&gt;&lt;B&gt;Aliases:&lt;/B&gt;&lt;BR&gt;
&lt;DD&gt;x-cp1251&lt;/DD&gt;&lt;BR&gt;
&lt;H3&gt;Code Page 1252 (ANSI - Latin I)&lt;/H3&gt;&lt;B&gt;Preferred Name:&lt;/B&gt;&lt;BR&gt;
&lt;DD&gt;Windows-1252&lt;/DD&gt;&lt;BR&gt;&lt;B&gt;Aliases:&lt;/B&gt;&lt;BR&gt;
&lt;DD&gt;x-ansi&lt;/DD&gt;&lt;BR&gt;
&lt;H3&gt;Code Page 1253 (ANSI - Greek)&lt;/H3&gt;&lt;B&gt;Preferred Name:&lt;/B&gt;&lt;BR&gt;
&lt;DD&gt;windows-1253&lt;/DD&gt;&lt;BR&gt;
&lt;H3&gt;Code Page 1254 (ANSI - Turkish)&lt;/H3&gt;&lt;B&gt;Preferred Name:&lt;/B&gt;&lt;BR&gt;
&lt;DD&gt;windows-1254&lt;/DD&gt;&lt;BR&gt;
&lt;H3&gt;Code Page 1255 (ANSI - Hebrew)&lt;/H3&gt;&lt;B&gt;Preferred Name:&lt;/B&gt;&lt;BR&gt;
&lt;DD&gt;windows-1255&lt;/DD&gt;&lt;BR&gt;
&lt;H3&gt;Code Page 1256 (ANSI - Arabic)&lt;/H3&gt;&lt;B&gt;Preferred Name:&lt;/B&gt;&lt;BR&gt;
&lt;DD&gt;windows-1256&lt;/DD&gt;&lt;BR&gt;&lt;B&gt;Aliases:&lt;/B&gt;&lt;BR&gt;
&lt;DD&gt;cp1256&lt;/DD&gt;&lt;BR&gt;
&lt;H3&gt;Code Page 1257 (ANSI - Baltic)&lt;/H3&gt;&lt;B&gt;Preferred Name:&lt;/B&gt;&lt;BR&gt;
&lt;DD&gt;windows-1257&lt;/DD&gt;&lt;BR&gt;
&lt;H3&gt;Code Page 1258 (ANSI/OEM - Viet Nam)&lt;/H3&gt;&lt;B&gt;Preferred Name:&lt;/B&gt;&lt;BR&gt;
&lt;DD&gt;windows-1258&lt;/DD&gt;&lt;BR&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=1006751" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/shawnste/archive/tags/Unicode+and+Code+Pages_2F00_Encodings/default.aspx">Unicode and Code Pages/Encodings</category><category domain="http://blogs.msdn.com/shawnste/archive/tags/System.Text/default.aspx">System.Text</category></item><item><title>Example of overriding your own Encoding.</title><link>http://blogs.msdn.com/shawnste/archive/2006/10/12/example-of-overriding-your-own-encoding.aspx</link><pubDate>Fri, 13 Oct 2006 02:00:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:821099</guid><dc:creator>shawnste</dc:creator><slash:comments>7</slash:comments><comments>http://blogs.msdn.com/shawnste/comments/821099.aspx</comments><wfw:commentRss>http://blogs.msdn.com/shawnste/commentrss.aspx?PostID=821099</wfw:commentRss><description>&lt;P&gt;Previously I wrote about&amp;nbsp;the&amp;nbsp;&lt;A id=bp___ctl00___RecentPosts___postlist___EntryItems_ctl02_PostTitle href="http://blogs.msdn.com/shawnste/archive/2006/09/29/777452.aspx"&gt;Best Way to Make Your Own Encoding&lt;/A&gt;, but didn't include an example, so today I'm including an example of a replacement Encoding.&amp;nbsp; I also included an EncoderFallback example, which replaces unknown characters with numerical entity style replacements (&amp;amp;#12345;).&amp;nbsp; &lt;/P&gt;
&lt;P&gt;This example isn't complete.&amp;nbsp; If you need Encoder or Decoder functionality you'd have to override those as well.&amp;nbsp; Also I didn't include a DecoderFallback example.&amp;nbsp; From this example those should be reasonably straightforward.&amp;nbsp; The biggest issues are that Encoders/Decoders maintain state such as lead bytes or high surrogates, so they may have data buffered from a previous conversion.&lt;/P&gt;
&lt;P&gt;I included a simple Main that just converts some text to bytes and back.&amp;nbsp; It's not very pretty, but it demonstrates that something did actually happen :)&amp;nbsp; Be forewarned that I spent almost no time testing this sample, so caveat programmer!&amp;nbsp; As always this sample is provided as-is with no warranties or guarantees.&lt;/P&gt;
&lt;P&gt;Hope you find this helpful,&lt;/P&gt;
&lt;P&gt;Shawn&lt;/P&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=821099" width="1" height="1"&gt;</description><enclosure url="http://blogs.msdn.com/shawnste/attachment/821099.ashx" length="13600" type="text/plain" /><category domain="http://blogs.msdn.com/shawnste/archive/tags/Unicode+and+Code+Pages_2F00_Encodings/default.aspx">Unicode and Code Pages/Encodings</category><category domain="http://blogs.msdn.com/shawnste/archive/tags/System.Text/default.aspx">System.Text</category></item><item><title>Best Way to Make Your Own Encoding</title><link>http://blogs.msdn.com/shawnste/archive/2006/09/29/best-way-to-make-your-own-encoding.aspx</link><pubDate>Sat, 30 Sep 2006 01:06:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:777452</guid><dc:creator>shawnste</dc:creator><slash:comments>3</slash:comments><comments>http://blogs.msdn.com/shawnste/comments/777452.aspx</comments><wfw:commentRss>http://blogs.msdn.com/shawnste/commentrss.aspx?PostID=777452</wfw:commentRss><description>&lt;P&gt;Martin recently asked about the best way to roll his own encoding in .Net 2.0, in particular whether you can override Encoding/Encoder/Decoder, or whether he should write his own StreamWriter.&lt;/P&gt;
&lt;P&gt;#1 is, of course, to use Unicode :), but apparently Martin doesn't have that option.&lt;/P&gt;
&lt;P&gt;The answer is that you can write your own Encoding derived from the Encoding class and use it as any of the built-in Encodings.&amp;nbsp; They'll be a little slower in some cases due to some shortcuts we take internally, but otherwise you should be able to use them everywhere you use a normal Encoding/Encoder/Decoder object.&amp;nbsp; &lt;/P&gt;
&lt;P&gt;There's an example of using the fallbacks at &lt;A href="http://windowssdk.msdn.microsoft.com/en-us/library/tt6z1500.aspx" mce_href="http://windowssdk.msdn.microsoft.com/en-us/library/tt6z1500.aspx"&gt;http://windowssdk.msdn.microsoft.com/en-us/library/tt6z1500.aspx&lt;/A&gt;, which isn't quite the same thing, but might help a little.&lt;/P&gt;
&lt;P&gt;[updated 12 Oct 2006]&lt;BR&gt;I've stuck an example of overriding encodings in this post: &lt;A id=bp___ctl00___RecentPosts___postlist___EntryItems_ctl00_PostTitle href="http://blogs.msdn.com/shawnste/archive/2006/10/12/example-of-overriding-your-own-encoding.aspx"&gt;&lt;FONT color=#006bad&gt;Example of overriding your own Encoding.&lt;/FONT&gt;&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;- Shawn&lt;/P&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=777452" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/shawnste/archive/tags/System.Text/default.aspx">System.Text</category></item><item><title>Encoding.GetEncodings() has a couple "duplicate" names</title><link>http://blogs.msdn.com/shawnste/archive/2006/07/18/encoding-getencodings-has-a-couple-duplicate-names.aspx</link><pubDate>Tue, 18 Jul 2006 20:41:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:669963</guid><dc:creator>shawnste</dc:creator><slash:comments>4</slash:comments><comments>http://blogs.msdn.com/shawnste/comments/669963.aspx</comments><wfw:commentRss>http://blogs.msdn.com/shawnste/commentrss.aspx?PostID=669963</wfw:commentRss><description>&lt;P&gt;The Microsoft.Net v2.0 Encoding.GetEncodings() method returns a complete list of supported encodings, uniquely distinguished by code page.&amp;nbsp; Note that in general I consider the code page number to be a poor way to exchange code page information since it's not a standard, however for now it does provide a unique ID for .Net Encodings.&amp;nbsp; By Name there are some duplicates, although they have different DisplayNames.&lt;/P&gt;
&lt;P&gt;Encodings 20932 and 51932 both return the Name "euc-jp", and indeed are identical code pages.&amp;nbsp; If you ask for "euc-jp", the framework will return 51932, so if you want to remove one, I'd remove 20932 from any list you make.&lt;/P&gt;
&lt;P&gt;UPDATE - 11/29/2006 (snow day in Redmond).&amp;nbsp; Actually it was pointed out that 51932 doesn't work in native windows APIs, so you'd have to pick 20932 for native applications and 51932 for .Net applications (so that it would round trip in .Net).&lt;/P&gt;
&lt;P&gt;50220 and 50222 also return "iso-2022-jp" for their Name.&amp;nbsp; If you ask for "iso-2022-jp", you'll end up with 50220, so I'd remove 50222 from any list of encodings.&amp;nbsp; Which one you should prefer depends on the preferred treatment of the half width katakana.&lt;/P&gt;
&lt;P&gt;Unlike the euc-jp encodings, the 50220 and 50222 encodings are slightly different.&amp;nbsp; When encoding, 50220 will convert half width katakana to full width and 50222 will use a shift-in/shift-out sequence to encode half width katakana.&amp;nbsp; The DisplayName for 50222 is “Japanese (JIS-Allow 1 byte Kana - SO/SI)” to distinguish it from 50220 “Japanese (JIS)” even though they have the same iso-2022-jp Name.&lt;BR&gt;&lt;/P&gt;
&lt;P mce_keep="true"&gt;&amp;nbsp;&lt;/P&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=669963" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/shawnste/archive/tags/Unicode+and+Code+Pages_2F00_Encodings/default.aspx">Unicode and Code Pages/Encodings</category><category domain="http://blogs.msdn.com/shawnste/archive/tags/System.Text/default.aspx">System.Text</category></item><item><title>What's with Encoding.GetMaxByteCount() and Encoding.GetMaxCharCount()? Part 2</title><link>http://blogs.msdn.com/shawnste/archive/2006/06/21/642225.aspx</link><pubDate>Thu, 22 Jun 2006 03:38:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:642225</guid><dc:creator>shawnste</dc:creator><slash:comments>1</slash:comments><comments>http://blogs.msdn.com/shawnste/comments/642225.aspx</comments><wfw:commentRss>http://blogs.msdn.com/shawnste/commentrss.aspx?PostID=642225</wfw:commentRss><description>&lt;P&gt;A little over a year ago I wrote &lt;A id=_ctl0____ctl0___CategoryView___postlist___EntryItems__ctl8_PostTitle href="http://blogs.msdn.com/shawnste/archive/2005/03/02/383903.aspx"&gt;&lt;FONT color=#0000ff&gt;What's with Encoding.GetMaxByteCount() and Encoding.GetMaxCharCount()?&lt;/FONT&gt;&lt;/A&gt;&amp;nbsp;to address the question "Why does GetMaxCharCount(1) for my favorite Encoding return 2 instead of 1."&amp;nbsp; (Short answer is that the Decoder/Encoder could have stored data from a previous call).&lt;/P&gt;
&lt;P&gt;To follow up, what about the special case of zero?&amp;nbsp; It seems that GetMaxByte/CharCount(0) should always be 0, yet it isn't.&amp;nbsp; The reason, again, is the encoder/decoder and the fallback.&lt;/P&gt;
&lt;P&gt;Consider that a call to Decoder.GetChars() ends with a lead byte for UTF-8.&amp;nbsp; The decoder is going to remember that lead byte, expecting the next call to GetChars() to contain the remaining byte(s) necessary to decode a complete UTF-8 sequence.&lt;/P&gt;
&lt;P&gt;However if the next call passes in an empty input buffer, yet requests that the buffer get flushed, then the decoder's going to have to process that lonely lead byte anyway.&amp;nbsp; This happens for example at the end of a sequence.&amp;nbsp; In this case, the decoder's going to call the fallback for the lone lead byte, which by default for UTF-8 will &lt;A href="http://blogs.msdn.com/shawnste/archive/2006/06/16/634666.aspx"&gt;&lt;FONT color=#0000ff&gt;now&lt;/FONT&gt;&lt;/A&gt; return a U+FFFD.&amp;nbsp; So even with an empty input buffer, UTF-8 can return a character.&lt;/P&gt;
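&lt;P&gt;The same stateful behavior is easy to observe outside of .Net.&amp;nbsp; Here is a minimal Python sketch, using Python's incremental UTF-8 decoder as a stand-in for Decoder.GetChars() with flushing; it illustrates the idea and is not the .Net implementation.&lt;/P&gt;
&lt;PRE&gt;
```python
import codecs

# errors="replace" plays the role of a U+FFFD replacement fallback.
decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")

# 0xE6 starts a three-byte UTF-8 sequence; the decoder buffers it
# and produces no characters yet.
first = decoder.decode(b"\xe6")

# Empty input plus a flush request: the stranded lead byte has to be
# dealt with now, so one replacement character comes out.
flushed = decoder.decode(b"", final=True)

print(repr(first), repr(flushed))
```
&lt;/PRE&gt;
&lt;P&gt;So zero bytes in can still mean one character out, which is why a worst-case count for an empty buffer can't safely be zero.&lt;/P&gt;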
&lt;P&gt;Similar cases happen with most other encodings, although there are a few cases where encodings don't have left over bytes when decoding.&lt;/P&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=642225" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/shawnste/archive/tags/System.Text/default.aspx">System.Text</category></item><item><title>Change to Unicode Encoding for Unicode 5.0 conformance</title><link>http://blogs.msdn.com/shawnste/archive/2006/06/16/change-to-unicode-encoding-for-unicode-5-0-conformance.aspx</link><pubDate>Sat, 17 Jun 2006 01:07:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:634666</guid><dc:creator>shawnste</dc:creator><slash:comments>3</slash:comments><comments>http://blogs.msdn.com/shawnste/comments/634666.aspx</comments><wfw:commentRss>http://blogs.msdn.com/shawnste/commentrss.aspx?PostID=634666</wfw:commentRss><description>&lt;P class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;&lt;FONT face=Calibri&gt;The behavior for UTF8Encoding, UnicodeEncoding and UTF32Encoding has changed in Windows Vista to conform better to the Unicode 5.0 requirements for Unicode Encodings. [23 July 2007: Now this behavior has also been made to .Net 2.0&amp;nbsp;with MS07-040 update applied.&amp;nbsp; See&amp;nbsp;the list of known issues for &lt;A href="http://www.microsoft.com/technet/security/Bulletin/ms07-040.mspx"&gt;MS07-040&lt;/A&gt; described in &lt;A href="http://support.microsoft.com/kb/931212"&gt;KB 931212&lt;/A&gt;.&amp;nbsp; &lt;A href="http://support.microsoft.com/kb/940521/"&gt;KB 940521&lt;/A&gt; describes this behavior in particular.]&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;&lt;?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" /&gt;&lt;o:p&gt;&lt;FONT face=Calibri&gt;&amp;nbsp;&lt;/FONT&gt;&lt;/o:p&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;&lt;FONT face=Calibri&gt;In .Net Framework V2.0 RTM we chose to respect the Unicode 4.1 standard which disallowed passing illegal UTF code points by dropping any bad data that was encountered, considering that this behavior would have the minimal impact to existing applications.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;&lt;o:p&gt;&lt;FONT face=Calibri&gt;&amp;nbsp;&lt;/FONT&gt;&lt;/o:p&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;&lt;FONT face=Calibri&gt;Since the .Net Framework 2.0 was released, the latest Unicode 5.0 specification has become stricter.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;There was a concern that just ignoring invalid bytes could allow insecure hostile data because invalid characters would be dropped, so an invalid string could become valid.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;The new requirement for Unicode 5.0 is that bad bytes cannot be dropped, so we are now replacing them with U+FFFD, the Unicode Replacement Character, in Windows Vista, and future versions of the .Net Framework, including the .Net Framework 2.0 on Vista, and .Net 2.0 with the MS07-040 update applied.&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;&lt;o:p&gt;&lt;FONT face=Calibri&gt;&amp;nbsp;&lt;/FONT&gt;&lt;/o:p&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;&lt;FONT face=Calibri&gt;The new default behavior is equivalent to setting the replacement fallbacks to "\xFFFD" instead of the empty string.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;If applications prefer the old behavior, they can create their UTF8Encoding with an EncoderReplacementFallback("") and DecoderReplacementFallback(""), causing the fallbacks to drop the bad data.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/P&gt;
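&lt;P&gt;To see why dropping bad bytes is risky, here is a small Python sketch of the two policies.&amp;nbsp; Python's "replace" and "ignore" error handlers stand in for the U+FFFD and empty-string replacement fallbacks; the byte string is a made-up example.&lt;/P&gt;
&lt;PRE&gt;
```python
# A lone 0xFF byte is never valid in UTF-8 (hypothetical hostile input).
raw = b"C:\\win" + b"\xff" + b"dows"

# Replacement policy: the bad byte becomes U+FFFD and stays visible.
replaced = raw.decode("utf-8", errors="replace")

# Drop policy: the bad byte silently vanishes, and the result
# now contains a string that the raw bytes did not.
dropped = raw.decode("utf-8", errors="ignore")

print(repr(replaced))
print(repr(dropped))
```
&lt;/PRE&gt;
&lt;P&gt;A check for "windows" against the raw bytes would pass, yet the dropped form contains it after decoding.&lt;/P&gt;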
&lt;P class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;&lt;o:p&gt;&lt;FONT face=Calibri&gt;&amp;nbsp;&lt;/FONT&gt;&lt;/o:p&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;&lt;FONT face=Calibri&gt;Because of the +- and other oddities with "UTF-7" it's generally considered insecure anyway for similar reasons and UTF-8 is generally preferred.&lt;o:p&gt;&lt;/o:p&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;&lt;o:p&gt;&lt;FONT face=Calibri&gt;&amp;nbsp;&lt;/FONT&gt;&lt;/o:p&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;&lt;FONT face=Calibri&gt;FWIW:&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;My recommendation is that applications shouldn’t make trust decisions on encoded data; this goes for the other code page encodings as well as Unicode.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;Encoding and decoding data can cause it to change its form.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;(See &lt;/FONT&gt;&lt;A href="http://blogs.msdn.com/shawnste/archive/2006/01/19/515047.aspx" mce_href="http://blogs.msdn.com/shawnste/archive/2006/01/19/515047.aspx"&gt;&lt;FONT face=Calibri&gt;Best Fit in WideCharToMultiByte and System.Text.Encoding Should be Avoided&lt;/FONT&gt;&lt;/A&gt;&lt;FONT face=Calibri&gt; &lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp;&lt;/SPAN&gt;for one example).&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;If your application needs to make sure that an input string doesn’t include C:\windows, it should do the validation after decoding the data.&lt;SPAN style="mso-spacerun: yes"&gt;&amp;nbsp; &lt;/SPAN&gt;I’ll probably blog more about this later.&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;&lt;FONT face=Calibri&gt;&lt;/FONT&gt;&amp;nbsp;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;&lt;FONT face=Calibri&gt;'til then,&lt;/FONT&gt;&lt;/P&gt;
&lt;P class=MsoNormal style="MARGIN: 0in 0in 0pt"&gt;&lt;FONT face=Calibri&gt;Shawn&lt;/FONT&gt;&lt;/P&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=634666" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/shawnste/archive/tags/Unicode+and+Code+Pages_2F00_Encodings/default.aspx">Unicode and Code Pages/Encodings</category><category domain="http://blogs.msdn.com/shawnste/archive/tags/System.Text/default.aspx">System.Text</category></item><item><title>Best Fit in WideCharToMultiByte and System.Text.Encoding Should be Avoided</title><link>http://blogs.msdn.com/shawnste/archive/2006/01/19/515047.aspx</link><pubDate>Thu, 19 Jan 2006 23:16:00 GMT</pubDate><guid isPermaLink="false">91d46819-8472-40ad-a661-2c78acb4018c:515047</guid><dc:creator>shawnste</dc:creator><slash:comments>6</slash:comments><comments>http://blogs.msdn.com/shawnste/comments/515047.aspx</comments><wfw:commentRss>http://blogs.msdn.com/shawnste/commentrss.aspx?PostID=515047</wfw:commentRss><description>&lt;P&gt;Windows and the .Net Framework have the concept of "best-fit" behavior for code pages and encodings.&amp;nbsp; Best fit can be interesting, but often its not a good idea.&amp;nbsp; In WideCharToMultiByte() this behavior is controlled by a WC_NO_BEST_FIT_CHARS flag.&amp;nbsp; In .Net you can use the EncoderFallback to control whether or not to get Best Fit behavior.&amp;nbsp; Unfortunately in both cases best fit is the default behavior.&amp;nbsp; In Microsoft .Net 2.0 best fit is also slower.&lt;/P&gt;
&lt;P&gt;The underlying problem that best fit behavior tries to solve is "Gee, Unicode has about a gajillion more characters than 1252, how do we get them all in?".&amp;nbsp; Unfortunately, that's the problem: they won't all fit.&amp;nbsp; Code page 1252 has 256 characters, so the nearly 100,000 Unicode characters just won't fit.&amp;nbsp; What best fit tries to do is cram as many characters as possible into the limited set of the code page by mapping them to things they might look like.&amp;nbsp; So c with a dot above, ċ, U+010b, is mapped to a plain old c with no dot.&amp;nbsp; Japanese fullwidth forms are mapped to their halfwidth forms, etc.&amp;nbsp; There are lots of problems with this solution, and I'll mention some of them here:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;The mappings are somewhat random and sometimes bizarre.&amp;nbsp; The infinity symbol, ∞, U+221e, is mapped to 8.&amp;nbsp; Sure it looks like a sideways 8, but it's sideways, and its meaning is very different. 
&lt;LI&gt;The mappings are somewhat random and inconsistent between code pages.&amp;nbsp; In some code pages&amp;nbsp;Japanese fullwidth forms are "best fit" to the non full-width form, in others they are not. 
&lt;LI&gt;The best fit behavior has not been updated in years, so newer code points aren't covered.&amp;nbsp; In code page 1252, c, ć U+0107 c with acute, ĉ U+0109 c with circumflex, ċ U+010b c with dot above, č U+010d c with caron and ｃ U+ff43 fullwidth c are all mapped to c.&amp;nbsp; However ƈ U+0188 c with hook, ɕ U+0255 c with curl, с U+0441 Cyrillic es, ḉ U+1e09 c with cedilla and acute above and others are not mapped and turn into ?.&amp;nbsp; Also, ç U+00e7 c with cedilla doesn't change since it has its own character in 1252. 
&lt;LI&gt;Many mappings lead to security holes.&amp;nbsp; A common test for ., \ and other characters to prevent ..\ style attacks on paths fails if fullwidth forms are used and not tested for.&amp;nbsp; Since fullwidth forms are often mapped, any English string, like a user name or password, can also have multiple variations, leading to security holes.&amp;nbsp; Even if fullwidth forms are considered, other mappings with diacritics, as mentioned in the previous bullet, exist for common English characters. 
&lt;LI&gt;Most of the best fit mappings in our tables were thought of by English speaking Americans and could be culturally inappropriate for other locales. 
&lt;LI&gt;ü and u aren't the same character.&amp;nbsp; Düsseldorf has the alternate spelling Duesseldorf, replacing the ü with ue, not u.&amp;nbsp; In languages that use diacritics the pronunciation of the character changes.&amp;nbsp; If you made mailing labels for your customers would you really want to change their name?&amp;nbsp; Best case the spelling looks stupid and the customer thinks "gee, these guys have an old computer too".&amp;nbsp; Worst case you turned their name to crap... literally.&amp;nbsp; In that case ? would probably be better; at least your customer would&amp;nbsp;probably understand it was a computer limitation&amp;nbsp;[:)] 
&lt;LI&gt;For typical US English text UTF-8 is&amp;nbsp;exactly as space efficient as ASCII or 1252.&amp;nbsp; So if you use UTF-8 you won't need best fit and it won't cost you anything.&amp;nbsp; In .Net and Windows Vista UTF-8 is also much faster than 1252 or ASCII.&amp;nbsp; Even in other languages UTF-8 is faster, and for most languages it doesn't even create significantly larger file sizes.&amp;nbsp; That's a small price to pay to ensure that you don't corrupt your data.
&lt;LI&gt;Best fit doesn't even help some of the alternate English spellings, such as the ae; those just turn into ? anyway.&amp;nbsp;
&lt;LI&gt;As I alluded to above, frankly our best fit mapping is pretty poor, even if that behavior is desirable.&amp;nbsp; We're inconsistent with the behavior for different code pages, we haven't added new characters, and&amp;nbsp;we've made some strange decisions.&lt;/LI&gt;&lt;/UL&gt;
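&lt;P&gt;The security bullet above can be sketched in a few lines of Python.&amp;nbsp; Python has no best-fit tables of its own, so simulated_best_fit below is a hypothetical stand-in that folds only the fullwidth ASCII range (U+FF01 to U+FF5E) to plain ASCII; it is not the actual Windows or .Net mapping table, just enough to show why a path check done before conversion can be bypassed.&lt;/P&gt;
&lt;PRE&gt;
```python
def simulated_best_fit(text):
    """Hypothetical stand-in for a best-fit table: fold fullwidth ASCII
    forms (U+FF01..U+FF5E) to their plain ASCII counterparts."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp in range(0xFF01, 0xFF5F):
            out.append(chr(cp - 0xFEE0))  # fullwidth form to ASCII
        else:
            out.append(ch)
    return "".join(out)


# Fullwidth full stop, fullwidth full stop, fullwidth reverse solidus.
user_input = "\uff0e\uff0e\uff3csecret.txt"

# The naive path check runs on the original string and sees nothing wrong.
passes_check = "..\\" not in user_input

# After the best-fit style conversion the traversal sequence appears.
converted = simulated_best_fit(user_input)

print(passes_check, converted)  # prints: True ..\secret.txt
```
&lt;/PRE&gt;
&lt;P&gt;The check and the conversion disagree about what the string contains, which is exactly the gap an attacker needs.&lt;/P&gt;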
&lt;P&gt;It's also worth noting that there are a few rare cases where best fit can happen when decoding data with MultiByteToWideChar or Encoding.GetString or the Decoder class.&lt;/P&gt;
&lt;P&gt;For both Windows and Microsoft .Net, the best plan is to use Unicode when possible; either UTF-8 or UTF-16 is usually a good choice.&amp;nbsp; Sometimes it's not possible, usually because of a protocol limitation.&amp;nbsp; Best fit behavior is often a poor choice when hitting a protocol problem, since such protocols are usually explicit and such mappings could cause security holes or protocol violations.&amp;nbsp; In those cases finding extensions or newer protocols that handle Unicode is good, but some, like e-mail headers [;)], we're stuck with.&lt;/P&gt;
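&lt;P&gt;A quick way to convince yourself of the "use Unicode" advice is a round-trip check.&amp;nbsp; The following is a minimal Python sketch rather than Windows or .Net code; the sample strings are made up for the example.&lt;/P&gt;
&lt;PRE&gt;
```python
# ASCII-only text is byte-for-byte identical in ASCII, code page 1252 and UTF-8,
# so switching to UTF-8 costs nothing for this kind of data.
english = "Plain English text"
same_bytes = (
    english.encode("ascii") == english.encode("cp1252") == english.encode("utf-8")
)

# Text with characters outside code page 1252 survives a Unicode round trip intact.
name = "D\u00fcsseldorf \u221e \u010b"
utf8_round_trip = name.encode("utf-8").decode("utf-8") == name
utf16_round_trip = name.encode("utf-16").decode("utf-16") == name

print(same_bytes, utf8_round_trip, utf16_round_trip)  # prints: True True True
```
&lt;/PRE&gt;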
&lt;P&gt;In Windows you can disable the best fit behavior by using the WC_NO_BEST_FIT_CHARS flag.&amp;nbsp; In the framework you can do so by changing the EncoderFallback and DecoderFallback.&amp;nbsp; Encoding.GetEncoding(xxx, EncoderFallback.ReplacementFallback, DecoderFallback.ReplacementFallback) or the ExceptionFallback equivalents are good choices.&amp;nbsp; Note that in .Net 2.0 there is no public "best fit" fallback, only an internal one that is used by default, so once you change a class's EncoderFallback or DecoderFallback you cannot easily retrieve the best fit fallback.&lt;/P&gt;
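&lt;P&gt;For readers who want to see the two non-best-fit behaviors side by side, here is a small Python analogue.&amp;nbsp; It is only an analogue: Python's cp1252 codec has no best-fit table, and the assumption here is that its "replace" error handler behaves roughly like a replacement fallback while "strict" behaves roughly like an exception fallback.&lt;/P&gt;
&lt;PRE&gt;
```python
# Infinity (U+221E) and c with dot above (U+010B) are not in code page 1252.
text = "infinity \u221e, c with dot \u010b"

# Replacement style: unencodable characters become '?', never a lookalike.
replaced = text.encode("cp1252", errors="replace")

# Exception style: the caller is told that the data does not fit.
try:
    text.encode("cp1252", errors="strict")
    raised = False
except UnicodeEncodeError:
    raised = True

print(replaced, raised)  # prints: b'infinity ?, c with dot ?' True
```
&lt;/PRE&gt;
&lt;P&gt;Either way the infinity symbol never silently turns into an 8, which is the point of turning best fit off.&lt;/P&gt;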
&lt;P&gt;If you are aware of the limitations of the fallbacks and want consistent behavior anyway, one option to consider is making your own fallback.&amp;nbsp; I made a prototype fallback that uses Normalization to decompose a string to its component parts.&amp;nbsp; This is particularly nifty because things like&amp;nbsp;the kPa symbol can change to k + P + a.&amp;nbsp; It still doesn't work across all languages though, since ü would still become a u instead of a ue in German.&amp;nbsp; So even though this can be a fun experiment, it's still better to Use Unicode!&lt;/P&gt;&lt;img src="http://blogs.msdn.com/aggbug.aspx?PostID=515047" width="1" height="1"&gt;</description><category domain="http://blogs.msdn.com/shawnste/archive/tags/Unicode+and+Code+Pages_2F00_Encodings/default.aspx">Unicode and Code Pages/Encodings</category><category domain="http://blogs.msdn.com/shawnste/archive/tags/System.Text/default.aspx">System.Text</category></item></channel></rss>