I'm not a Klingon

Shawn Steele's thoughts about Windows and .Net Framework globalization APIs

Tagged Content List
  • Blog Post: Converting text file code pages

    I've said "use Unicode" a lot, but sometimes there are programs that aren't doing what you'd expect, and outputting stuff in a different code page. Additionally, you might sometimes encounter a text file that was created using the system code page of a different machine. (Like if someone emailed me a...
  • Blog Post: What is Title Case?

    Disclaimer: I'm not an English teacher (that's my mom), so my description of title casing in English probably has exceptions/variations. Title casing has an interesting history in computer programming. Programmers like to use CamelCase to make variable names more readable, and, particularly...
  • Blog Post: Writing "fields" of data to an encoded file.

    The moral here is "Use Unicode," so you can skip the details below if you want :) A common problem when storing string data in various fields is how to encode it. Obviously you can store the Unicode as Unicode, which is a good choice for an XML file or text file. However, sometimes data gets mixed...
  • Blog Post: Why can't we strip the diacritics?

    We have some "best-fit" behavior which we generally consider to be "bad". Any loss of data is generally a bad thing, so we recommend storing data in Unicode (so you don't lose anything). Assuming you can't use Unicode, why is it so bad to just make everything ASCII-like? Maybe you have a published house...
  • Blog Post: Encoder/Decoder Encoding fallbacks fail after 2GB of data has been converted

    We have an unfortunate bug in .Net v2.0+ that causes encoding or decoding of more than 2GB of data to fail. That's a lot of data, but it still shouldn't do that. This is a problem with our built-in fallbacks. Ironically, if you encounter bad bytes then the bug is reset and you're "good" for another...
  • Blog Post: How do I get HKSCS 2004 characters from Big-5 in .Net?

    Well, that's pretty tricky. We provide the Microsoft Character Code Conversion Routines For HKSCS-2004 functions, but those are intended for use with unmanaged code. The fundamental problem is that these "HKSCS" characters were in use prior to the assignment of a code point for them in Unicode. In...
  • Blog Post: Please avoid UTF-7

    UTF-7 inherently has some of the security issues that concern people about encodings. For example, by shifting in & out of the base64 mode one can create multiple representations of the same string, enabling spoofing and other problems. UTF-7 is primarily interesting for legacy mail and NNTP applications...
  • Blog Post: Some Reasons to Make Your Application Unicode

    [Updated Mar 30 2007: Mike pointed out errors which I've corrected] Many applications are "still" ANSI and can't handle Unicode. We (Microsoft) have even released non-Unicode applications reasonably recently, even though we should know better. In particular there are a bunch of good reasons to move...
  • Blog Post: A History of Code Pages or What Made Code Page XXXX (or many other computer things) The Way It Is?

    Disclaimer: This is mostly my conjecture, so I could be completely wrong about some of this, but it seems plausible to me. I’m aiming for the general concepts here, not to start a discussion about the specific details of the history of code pages. Taking a snapshot of the current Windows code pages...
  • Blog Post: Expected names of Microsoft Windows "ANSI" Code Pages (Encodings)

    I was asked about our use of the Windows "ANSI" code page names, as used in things like MIME types, HTTP Content-Type tags, etc. Each "code page" has a name that most accurately round trips back to the same code page, which I've listed as the "preferred name" below. Additionally, when you ask for a code...
  • Blog Post: Example of overriding your own Encoding.

    Previously I wrote about the Best Way to Make Your Own Encoding, but didn't include an example, so today I'm including an example of a replacement Encoding. I also included an EncoderFallback example, which replaces unknown characters with numerical entity style replacements (&#12345;). This...
  • Blog Post: Best Way to Make Your Own Encoding

    Martin recently asked about the best way to roll his own encoding in .Net 2.0; in particular, can he override Encoding/Encoder/Decoder, or should he write his own StreamWriter? #1 is, of course, to use Unicode :), but apparently Martin doesn't have that option. The answer is that you can write your...
  • Blog Post: Encoding.GetEncodings() has a couple "duplicate" names

    The Microsoft .Net v2.0 Encoding.GetEncodings() method returns a complete list of supported encodings, uniquely distinguished by code page. Note that in general I consider the code page number to be a poor way to exchange code page information since it's not a standard; however, for now it does provide...
  • Blog Post: What's with Encoding.GetMaxByteCount() and Encoding.GetMaxCharCount()? Part 2

    A little over a year ago I wrote What's with Encoding.GetMaxByteCount() and Encoding.GetMaxCharCount()? to address the question "Why does GetMaxCharCount(1) for my favorite Encoding return 2 instead of 1." (Short answer is that the Decoder/Encoder could have stored data from a previous call). To follow...
  • Blog Post: Change to Unicode Encoding for Unicode 5.0 conformance

    The behavior for UTF8Encoding, UnicodeEncoding and UTF32Encoding has changed in Windows Vista to conform better to the Unicode 5.0 requirements for Unicode Encodings. [23 July 2007: This behavior has now also been applied to .Net 2.0 with the MS07-040 update. See the list of known issues for MS07-040...
  • Blog Post: Best Fit in WideCharToMultiByte and System.Text.Encoding Should be Avoided

    Windows and the .Net Framework have the concept of "best-fit" behavior for code pages and encodings. Best fit can be interesting, but often it's not a good idea. In WideCharToMultiByte() this behavior is controlled by a WC_NO_BEST_FIT_CHARS flag. In .Net you can use the EncoderFallback to control whether...
  • Blog Post: What's my Encoding Called?

    There is a bit of confusion about the System.Text.Encoding names, primarily "Which name do I use for my Encoding?" The Encoding class has 3 name properties: BodyName, WebName and HeaderName, and the EncodingInfo objects returned by Encoding.GetEncodings have an additional Name property. The examples...
  • Blog Post: Code Page 21027 "Extended/Ext Alpha Lowercase"

    I was playing with code pages and ran into an interesting case: Code Page 21027 - Ext Alpha Lowercase. This code page has some interesting behavior. It looks like a Japanese EBCDIC code page; however, it's kind of "missing" mappings for some characters, like 8, 9, =, H, I, Q, R, Y, Z, Halfwidth Katakana...
  • Blog Post: Encoding/Decoding/Crypting and buffer lengths

    This code snippet has a somewhat common bug. I've seen this bug in all sorts of buffer manipulation code, not just cryptography, so I thought I'd share this. CryptoStream cs = new CryptoStream(myStream, myTransform, CryptoStreamMode.Read); byte[] fromEncrypt = new byte[inputByteArray.Length]; cs.Read...
  • Blog Post: UTF8 Security and Whidbey Changes

    Unicode is always in the process of evolving, and some changes have been made to UTF8 in the last few versions. The UTF-8 algorithm is fairly simple, but there are a few clarifications that are important for security reasons. Primarily there is the requirement that non-shortest form UTF-8 should...
  • Blog Post: Don't Use Encoding.Default

    So you want to save some data and don't know which Encoding to use. My biggest suggestion is please do NOT use Encoding.Default. Huh? That can't be right. You heard me right, please don't use Encoding.Default. Encoding.Default sounds like the right thing to do (after all it does say "Default" right...
  • Blog Post: What's the difference between an Encoding, Code Page, Character Set and Unicode?

    Encoding, Code Page and Character Set are often used interchangeably, even when that isn't strictly correct. There are some distinctions though: Characters are usually thought of as the smallest element of writing that has a meaning. It could be a punctuation mark, spacing character, letter, word, letter...
  • Blog Post: What's with Encoding.GetMaxByteCount() and Encoding.GetMaxCharCount()?

    The behavior of Encoding.GetMaxByteCount() changed somewhat between .Net version 1.0/1.1 and version 2.0 (Whidbey). The reason for this change is partially because GetMaxByteCount() didn't always return the worst-case byte count, and also because the fallbacks can create larger maximums than previous...