
Change to Unicode Encoding for Unicode 5.0 conformance

The behavior of UTF8Encoding, UnicodeEncoding and UTF32Encoding has changed in Windows Vista to conform better to the Unicode 5.0 requirements for Unicode encodings. [23 July 2007: This behavior has now also been applied to .NET 2.0 when the MS07-040 update is installed. See the list of known issues for MS07-040 described in KB 931212. KB 940521 describes this behavior in particular.]

 

In .NET Framework 2.0 RTM we chose to respect the Unicode 4.1 standard, which disallowed passing illegal UTF code points through, by dropping any bad data that was encountered. We considered that this behavior would have the least impact on existing applications.

 

Since the .NET Framework 2.0 was released, the Unicode specification has become stricter with version 5.0. There was a concern that simply ignoring invalid bytes could let hostile data through: once the invalid bytes are dropped, a string that was invalid could become valid. The new requirement in Unicode 5.0 is that bad bytes cannot be dropped, so we now replace them with U+FFFD, the Unicode Replacement Character. This applies in Windows Vista and in future versions of the .NET Framework, including the .NET Framework 2.0 on Vista and .NET 2.0 with the MS07-040 update applied.
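To make the concern concrete, here is a minimal sketch in Python rather than .NET. It assumes that Python's built-in "ignore" and "replace" error handlers are a fair stand-in for the old (drop) and new (U+FFFD) decoder behaviors; the byte string is made up for illustration.

```python
# Analogy only: Python's "ignore" and "replace" handlers stand in for the
# old (drop) and new (U+FFFD) decoder behaviors described in the post.
raw = b"C:\\win\xffdows"  # 0xFF can never appear in well-formed UTF-8

dropped = raw.decode("utf-8", errors="ignore")    # bad byte silently removed
replaced = raw.decode("utf-8", errors="replace")  # bad byte becomes U+FFFD

# Dropping the bad byte turns an invalid input into a meaningful path,
# while the replacement character keeps the damage visible.
print("C:\\windows" in dropped)   # True
print("C:\\windows" in replaced)  # False
```

A filter that inspected the raw bytes for the path would not have matched, yet the dropped form contains it after decoding.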

 

The new default behavior is equivalent to setting the replacement fallbacks to "\xFFFD" instead of the empty string. Applications that prefer the old behavior can create their UTF-8 encoding with an EncoderReplacementFallback("") and a DecoderReplacementFallback("") (for example through the Encoding.GetEncoding overload that takes an encoder and a decoder fallback), which causes the fallbacks to drop the bad data.
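The idea of a replacement fallback with a configurable string can be sketched outside .NET too. The following Python snippet is an analogy, not the .NET API: `make_replacement_handler` and the handler names "demo.fffd" and "demo.drop" are hypothetical names chosen for this example, built on Python's `codecs.register_error`.

```python
import codecs

def make_replacement_handler(replacement):
    """Build a decode error handler that substitutes `replacement` for bad bytes."""
    def handler(error):
        if not isinstance(error, UnicodeDecodeError):
            raise error
        # Return the substitute text and the position at which to resume decoding.
        return (replacement, error.end)
    return handler

# Hypothetical handler names: one mimics the new default, one the old behavior.
codecs.register_error("demo.fffd", make_replacement_handler("\ufffd"))
codecs.register_error("demo.drop", make_replacement_handler(""))

raw = b"ab\xffcd"
new_style = raw.decode("utf-8", errors="demo.fffd")  # bad byte replaced by U+FFFD
old_style = raw.decode("utf-8", errors="demo.drop")  # bad byte dropped

print(repr(new_style))
print(repr(old_style))
```

The only difference between the two handlers is the replacement string, which mirrors the point that the old behavior is just a replacement fallback configured with the empty string.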

 

Because of the + and - shift characters and other oddities of UTF-7, it's generally considered insecure anyway for similar reasons, and UTF-8 is generally preferred.

 

FWIW: My recommendation is that applications shouldn't make trust decisions on encoded data. This goes for the other code page encodings as well as Unicode. Encoding and decoding data can cause it to change its form. (See "Best Fit in WideCharToMultiByte and System.Text.Encoding Should be Avoided" for one example.) If your application needs to make sure that an input string doesn't include C:\windows, it should do the validation after decoding the data. I'll probably blog more about this later.
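A small sketch of "validate after decoding", again in Python as an illustration. The names `FORBIDDEN`, `is_allowed` and `decode_then_validate` are hypothetical, and the input uses UTF-7, where the backslash can be written as the shifted sequence +AFw-.

```python
FORBIDDEN = "c:\\windows"  # hypothetical path the application must reject

def is_allowed(text):
    """Check the decoded text, not the encoded bytes."""
    return FORBIDDEN not in text.lower()

def decode_then_validate(raw, encoding):
    """Decode first, then make the trust decision on the resulting string."""
    text = raw.decode(encoding, errors="replace")
    if not is_allowed(text):
        raise ValueError("input contains a forbidden path")
    return text

raw = b"C:+AFw-windows"  # UTF-7: "+AFw-" decodes to a backslash

# A check on the encoded bytes sees nothing suspicious...
looks_clean = b"C:\\windows" not in raw
# ...but the decoded string is exactly the forbidden path.
decoded = raw.decode("utf-7")

try:
    decode_then_validate(raw, "utf-7")
    rejected = False
except ValueError:
    rejected = True

print(looks_clean, repr(decoded), rejected)
```

The byte-level check passes while the decoded string fails validation, which is why the check belongs after the decode step.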

 

'til then,

Shawn

Published Friday, June 16, 2006 3:07 PM by shawnste


Comments

# What's with Encoding.GetMaxByteCount() and Encoding.GetMaxCharCount()? Part 2

A little over a year ago I wrote What's with Encoding.GetMaxByteCount() and Encoding.GetMaxCharCount()? to...
Wednesday, June 21, 2006 8:47 PM by Shawn Steele's Blog

# Code Pages, Unicode & Encodings

I hope to put some links to interesting posts about Code Pages/Unicode/Encodings here. Use Unicode! That

Tuesday, February 27, 2007 3:08 PM by I'm not a Klingon

# UTF8 Encoding Changes in Vista (Hashing Gotcha)

If you've used hashing to store passwords for your application, you may want to double-check your code

Tuesday, May 01, 2007 11:45 PM by Scott Van Vliet
