MSDN Blogs

Browse by Tags

All Tags » Unicode and Cod... » System.Text
Writing "fields" of data to an encoded file.

The moral here is "Use Unicode," so you can skip the details below if you want :) A common problem when storing string data in various fields is how to encode it. Obviously you can store the Unicode as Unicode, which is a good choice for an XML file or

Why can't we strip the diacritics?

We have some "best-fit" behavior which we generally consider to be "bad". Any loss of data is generally a bad thing, so we recommend storing data in Unicode (so you don't lose anything). Assuming you can't use Unicode, why is it so bad to just make everything

Encoder/Decoder Encoding fallbacks fail after 2GB of data has been converted

We have an unfortunate bug in .Net v2.0+ that causes encoding or decoding of more than 2GB of data to fail. That's a lot of data, but it still shouldn't do that. This is a problem with our built-in fallbacks. Ironically, if you encounter bad bytes then

How do I get HKSCS 2004 characters from Big-5 in .Net?

Well, that's pretty tricky. We provide the Microsoft Character Code Conversion Routines For HKSCS-2004 functions, but those are intended for use with unmanaged code. The fundamental problem is that these "HKSCS" characters were in use prior to the assignment

Please avoid UTF-7

UTF-7 inherently has some of the security issues that concern people about encodings. For example, by shifting in & out of the base64 mode one can create multiple representations of the same string, enabling spoofing and other problems. UTF-7 is primarily

Some Reasons to Make Your Application Unicode

[Updated Mar 30 2007: Mike pointed out errors which I've corrected] Many applications are "still" ANSI and can't handle Unicode. We (Microsoft) have even released non-Unicode applications reasonably recently, even though we should know better. In particular

A History of Code Pages or What Made Code Page XXXX (or many other computer things) The Way It Is?

Disclaimer: This is mostly my conjecture, so I could be completely wrong about some of this, but it seems plausible to me. I’m aiming for the general concepts here, not to start a discussion about the specific details of the history of code pages. Taking

Expected names of Microsoft Windows "ANSI" Code Pages (Encodings)

I was asked about our use of the Windows "ANSI" code page names, as used in things like MIME types, HTTP content-type tags, etc. Each "code page" has a name that most accurately round-trips back to the same code page, which I've listed as the "preferred

Example of overriding your own Encoding.

Previously I wrote about the Best Way to Make Your Own Encoding, but didn't include an example, so today I'm including an example of a replacement Encoding. I also included an EncoderFallback example, which replaces unknown characters with numerical

Encoding.GetEncodings() has a couple "duplicate" names

The Microsoft .Net v2.0 Encoding.GetEncodings() method returns a complete list of supported encodings, uniquely distinguished by code page. Note that in general I consider the code page number to be a poor way to exchange code page information since its

Change to Unicode Encoding for Unicode 5.0 conformance

The behavior for UTF8Encoding, UnicodeEncoding and UTF32Encoding has changed in Windows Vista to conform better to the Unicode 5.0 requirements for Unicode Encodings. [23 July 2007: Now this change has also been made to .Net 2.0 with the MS07-040 update

Best Fit in WideCharToMultiByte and System.Text.Encoding Should be Avoided

Windows and the .Net Framework have the concept of "best-fit" behavior for code pages and encodings. Best fit can be interesting, but often it's not a good idea. In WideCharToMultiByte() this behavior is controlled by a WC_NO_BEST_FIT_CHARS flag. In .Net

Code Page 21027 "Extended/Ext Alpha Lowercase"

I was playing with code pages and ran into an interesting case: Code Page 21027 - Ext Alpha Lowercase. This code page has some interesting behavior. It looks like a Japanese EBCDIC code page, however, it's kind of "missing" mappings for some characters,

UTF8 Security and Whidbey Changes

Unicode is always in the process of evolving, and some changes have been made to UTF8 in the last few versions. The UTF-8 algorithm is fairly simple, but there are a few clarifications that are important for security reasons. Primarily there is the requirement

Don't Use Encoding.Default

So you want to save some data and don't know which Encoding to use. My biggest suggestion is please do NOT use Encoding.Default. Huh? That can't be right. You heard me right, please don't use Encoding.Default. Encoding.Default sounds like the right thing