One of the reasons I always suggest "Use Unicode!" is that there are security problems converting between code pages. In short, if data is converted between code pages after some sort of security validation has been done, that validation can be negated. This is true of many data transformations, but it seems to surprise people when it applies to code page conversions.

There are lots of reasons for this, but some are:

  • Transformations can have "best-fit" mappings.  For example, if I test for "C:\windows" in some form, but a later transformation maps fullwidth compatibility characters (around U+FF00, such as U+FF43, fullwidth "c") to their ASCII equivalents, then the security check is invalidated.
  • Similar things happen with different versions of code pages.  For example, ASCII is sometimes handled by "dropping" the high bit, so 0xC3 ends up being 0x43.  Alternatively, applications sometimes just "pretend" that the data was really ANSI and map it to whatever code page the user is using.
  • Code pages aren't always tagged correctly, so an ANSI validation using a default system code page on a server may yield different results than the same code with a different default system code page on a client.
  • Some code pages also differ between systems.  Even within Windows, MLang (often used by IE) and MultiByteToWideChar() behave slightly differently.
  • Different systems handle unexpected, unassigned or illegal code points in different ways.  Sometimes that means ? or the equivalent, sometimes gibberish, and sometimes dropped data (so your C:\windows test on C?:?\?W?i?n?d?o?w?s doesn't work if all the ? disappear afterwards).
  • Sometimes behavior changes, as described in "Change to Unicode Encoding for Unicode 5.0 conformance".  In that case the change is to the Unicode parsing itself, but any security test should still be done after reading the input data.
  • Escape sequences might modify data in some environments that provide escaping mechanisms for characters that a code page doesn't support.
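
To make a few of the cases above concrete, here is a minimal Python sketch.  It does not use the real Windows best-fit tables: NFKC normalization, an "ignore errors" ASCII encode and a high-bit mask are stand-ins I chose because they behave similarly, and all of the function names are hypothetical.

```python
import unicodedata

BLOCKED_PREFIX = "c:\\windows"


def is_blocked(path: str) -> bool:
    """Naive security check: flag paths under C:\\windows (case-insensitive)."""
    return path.lower().startswith(BLOCKED_PREFIX)


def best_fit_like(text: str) -> str:
    """Stand-in for a best-fit conversion (assumption: NFKC, not a real
    code page table). It folds fullwidth forms such as U+FF43 to ASCII."""
    return unicodedata.normalize("NFKC", text)


def drop_unencodable(text: str) -> str:
    """Stand-in for a converter that silently drops unmappable characters."""
    return text.encode("ascii", errors="ignore").decode("ascii")


def strip_high_bit(data: bytes) -> bytes:
    """Stand-in for a '7-bit ASCII' handler that clears the high bit."""
    return bytes(b & 0x7F for b in data)


# Fullwidth 'c' (U+FF43) where the check expects ASCII 'c'.
fullwidth_path = "\uff43:\\windows\\system32"
# Characters with no ASCII mapping interleaved with the real path.
interleaved_path = "C\u20ac:\u20ac\\\u20acwindows"

for original, convert in ((fullwidth_path, best_fit_like),
                          (interleaved_path, drop_unencodable)):
    converted = convert(original)
    # The check passes the original but would have caught the converted form.
    print(is_blocked(original), is_blocked(converted))

# Dropping the high bit turns 0xC3 into 0x43, which is ASCII 'C'.
print(strip_high_bit(b"\xc3:\\windows"))
```

In both loop iterations the check lets the original string through and only flags the converted one, which is exactly the ordering problem: the test ran before the data reached its final form.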

A related problem is the IDN and code page parsing that browsers sometimes do.  Named and numeric "&" entities in HTML can decode to something that looks different from the source text.  "%" escaping is common in URLs, and IDN "xn--" encoding happens in domain names.  An application may decode these, even at unexpected times, and cause problems if the data was assumed to be in a different state before the decoding.
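
As a rough illustration of those three escaping layers, the sketch below uses Python's standard library decoders.  The filter function and the sample inputs are hypothetical, and real browsers apply their own decoding rules, so treat this as an approximation of the idea only.

```python
import html
from urllib.parse import unquote


def looks_safe(value: str) -> bool:
    """Naive filter (hypothetical): reject '<' and parent-directory dots."""
    return "<" not in value and ".." not in value


percent_encoded = "%2e%2e/secret"      # URL percent-escaping
named_entity = "&lt;script&gt;"        # HTML named entities
numeric_entity = "&#60;script&#62;"    # HTML numeric entities
idn_name = b"xn--bcher-kva.example"    # IDN "xn--" (Punycode) label

decoded_url = unquote(percent_encoded)
decoded_named = html.unescape(named_entity)
decoded_numeric = html.unescape(numeric_entity)
decoded_idn = idn_name.decode("idna")

# Each encoded form passes the filter; the decoded forms do not.
print(looks_safe(percent_encoded), looks_safe(decoded_url))
print(looks_safe(named_entity), looks_safe(decoded_named))
print(looks_safe(numeric_entity), looks_safe(decoded_numeric))
# The IDN label is pure ASCII on the wire but non-ASCII once decoded.
print(idn_name.isascii(), decoded_idn.isascii())
```

Any check that ran on the left-hand values says nothing about the right-hand ones, so it matters which layer of the application does the decoding and when.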

So the moral is: Do any security tests after any conversions have been done.  If you have to retransmit the data, try to use an encoding like Unicode that has fewer edge case behaviors that could trip you up.  If possible, revalidate the data after the transmission if it has to be decoded.
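
One way to apply that ordering is sketched below, assuming the input arrives as bytes with a declared encoding.  The helper names are hypothetical, the allow rule is only the C:\windows example from earlier, and using NFKC as the canonical form is my assumption, not something a particular API requires.

```python
import unicodedata
from typing import Optional


def canonicalize(raw: bytes, encoding: str) -> str:
    """Convert first: strict decode to Unicode, then fold compatibility
    forms (assumption: NFKC is an acceptable canonical form here)."""
    text = raw.decode(encoding, errors="strict")
    return unicodedata.normalize("NFKC", text)


def is_allowed(path: str) -> bool:
    """Hypothetical rule: nothing under C:\\windows."""
    return not path.lower().startswith("c:\\windows")


def accept(raw: bytes, encoding: str) -> Optional[str]:
    """Validate only after all conversions; return the text to use, or None."""
    try:
        text = canonicalize(raw, encoding)
    except UnicodeDecodeError:
        # Illegal byte sequences are rejected rather than replaced or dropped.
        return None
    return text if is_allowed(text) else None


print(accept("\uff43:\\windows".encode("utf-8"), "utf-8"))  # fullwidth 'c'
print(accept(b"D:\\data", "utf-8"))
print(accept(b"\xff\xfe", "utf-8"))  # not valid UTF-8
```

The caller should keep using the string returned by accept() rather than the original bytes, so that the data that was validated is the data that gets used.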