
Browse by Tags

All Tags » System.Text

Why can't we strip the diacritics?

We have some "best-fit" behavior which we generally consider to be "bad". Any loss of data is generally a bad thing, so we recommend storing data in Unicode (so you don't lose anything). Assuming you can't use Unicode, why is it so bad to just make everything

Encoder/Decoder Encoding fallbacks fail after 2GB of data has been converted

We have an unfortunate bug in .Net v2.0+ that causes encoding or decoding of more than 2GB of data to fail. That's a lot of data, but it still shouldn't do that. This is a problem with our built-in fallbacks. Ironically, if you encounter bad bytes then

How do I get HKSCS 2004 characters from Big-5 in .Net?

Well, that's pretty tricky. We provide the Microsoft Character Code Conversion Routines For HKSCS-2004 functions, but those are intended for use with unmanaged code. The fundamental problem is that these "HKSCS" characters were in use prior to the assignment

Please avoid UTF-7

UTF-7 inherently has some of the security issues that concern people about encodings. For example, by shifting in & out of the base64 mode one can create multiple representations of the same string, enabling spoofing and other problems. UTF-7 is primarily
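The "multiple representations" point can be seen in a few lines. This is a minimal sketch that uses Python's built-in `utf-7` codec purely for illustration; it is not the .NET `UTF7Encoding` class, and the `naive_filter_blocks` helper is a hypothetical byte-level filter invented for the example.

```python
# Illustration only: Python's built-in "utf-7" codec, not .NET's UTF7Encoding.

DIRECT = b"<"        # '<' written directly
SHIFTED = b"+ADw-"   # the same '<' written inside a base64 shift sequence


def decode_utf7(data: bytes) -> str:
    """Decode UTF-7 bytes to text."""
    return data.decode("utf-7")


def naive_filter_blocks(data: bytes, banned: bytes) -> bool:
    """Hypothetical filter: True if the banned raw bytes appear in the input."""
    return banned in data


if __name__ == "__main__":
    # Two different byte sequences decode to the same string.
    print(decode_utf7(DIRECT) == decode_utf7(SHIFTED))
    # A filter that inspects raw bytes does not see '<' in the shifted form.
    print(naive_filter_blocks(SHIFTED, b"<"))
```

Both byte sequences decode to the same one-character string, yet a check that looks only at the raw bytes finds `<` in one form and not in the other. That mismatch is the kind of spoofing opening the excerpt describes.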

Some Reasons to Make Your Application Unicode

[Updated Mar 30 2007: Mike pointed out errors which I've corrected] Many applications are "still" ANSI and can't handle Unicode. We (Microsoft) have even released non-Unicode applications reasonably recently, even though we should know better. In particular

A History of Code Pages or What Made Code Page XXXX (or many other computer things) The Way It Is?

Disclaimer: This is mostly my conjecture, so I could be completely wrong about some of this, but it seems plausible to me. I’m aiming for the general concepts here, not to start a discussion about the specific details of the history of code pages. Taking

Expected names of Microsoft Windows "ANSI" Code Pages (Encodings)

I was asked about our use of the Windows "ANSI" code page names, as used in things like MIME types, HTTP content-type tags, etc. Each "code page" has a name that most accurately round trips back to the same code page, which I've listed as the "preferred

Example of overriding your own Encoding.

Previously I wrote about the Best Way to Make Your Own Encoding, but didn't include an example, so today I'm including an example of a replacement Encoding. I also included an EncoderFallback example, which replaces unknown characters with numerical

Best Way to Make Your Own Encoding

Martin recently asked what the best way to roll his own encoding in .Net 2.0 is, in particular whether you can override Encoding/Encoder/Decoder, or whether he should write his own StreamWriter. #1 is, of course, to use Unicode :), but apparently Martin doesn't have that
Posted by shawnste | 3 Comments

Encoding.GetEncodings() has a couple "duplicate" names

The Microsoft.Net v2.0 Encoding.GetEncodings() method returns a complete list of supported encodings, uniquely distinguished by code page. Note that in general I consider the code page number to be a poor way to exchange code page information since its

What's with Encoding.GetMaxByteCount() and Encoding.GetMaxCharCount()? Part 2

A little over a year ago I wrote What's with Encoding.GetMaxByteCount() and Encoding.GetMaxCharCount()? to address the question "Why does GetMaxCharCount(1) for my favorite Encoding return 2 instead of 1." (Short answer is that the Decoder/Encoder could
Posted by shawnste | 1 Comments

Change to Unicode Encoding for Unicode 5.0 conformance

The behavior for UTF8Encoding, UnicodeEncoding and UTF32Encoding has changed in Windows Vista to conform better to the Unicode 5.0 requirements for Unicode Encodings. [23 July 2007: This change has now also been made to .Net 2.0 with MS07-040 update

Best Fit in WideCharToMultiByte and System.Text.Encoding Should be Avoided

Windows and the .Net Framework have the concept of "best-fit" behavior for code pages and encodings. Best fit can be interesting, but often it's not a good idea. In WideCharToMultiByte() this behavior is controlled by a WC_NO_BEST_FIT_CHARS flag. In .Net

What's my Encoding Called?

There is a bit of confusion about the System.Text.Encoding names, primarily "Which name do I use for my Encoding?" The Encoding class has 3 name properties: BodyName, WebName and HeaderName, and the EncodingInfo objects returned by Encoding.GetEncodings
Posted by shawnste | 1 Comments

Code Page 21027 "Extended/Ext Alpha Lowercase"

I was playing with code pages and ran into an interesting case: Code Page 21027 - Ext Alpha Lowercase. This code page has some interesting behavior. It looks like a Japanese EBCDIC code page, however it's kind of "missing" mappings for some characters,