XmlWriterSettings Encoding Being Ignored?
I had an interesting exchange in an internal mailing list today, and thought that my readers could possibly benefit from this clarification as well.
Scenario
You are working with the XmlWriter class and writing its output to a string or to the console. You are expecting the XML declaration to look like:
<?xml version="1.0" encoding="utf-8"?>
However, when you inspect the generated XML, it instead looks like:
<?xml version="1.0" encoding="utf-16"?>
Background
The XmlWriter class has an overloaded Create method that accepts an XmlWriterSettings object. The XmlWriterSettings allows you to specify whether the output XML will be indented, what the indentation characters will be, how newlines are handled, and how encoding is handled.
When you use the XmlWriterSettings type, you expect the encoding that you specified to show up in the output. For instance, consider code that reads a string and sends the output to the Console via a StringBuilder, then directly to Console.Out.
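A minimal sketch of such code might look like this. The class name, the method names, and the sample XML are made up for illustration; the important part is that both methods pass the same UTF-8 settings to XmlWriter.Create.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class EncodingDemo
{
    // Hypothetical sample input; any well-formed XML string would do.
    private const string SampleXml = "<root><item>value</item></root>";

    // Both methods ask for indented, UTF-8 encoded output.
    private static XmlWriterSettings CreateSettings()
    {
        return new XmlWriterSettings { Indent = true, Encoding = Encoding.UTF8 };
    }

    // Reads the XML string and writes it to a StringBuilder.
    public static string WriteToStringBuilder(string xml)
    {
        var builder = new StringBuilder();
        using (var reader = XmlReader.Create(new StringReader(xml)))
        using (var writer = XmlWriter.Create(builder, CreateSettings()))
        {
            writer.WriteNode(reader, true);
        }
        return builder.ToString();
    }

    // Reads the XML string and writes it directly to Console.Out.
    public static void WriteToConsole(string xml)
    {
        using (var reader = XmlReader.Create(new StringReader(xml)))
        using (var writer = XmlWriter.Create(Console.Out, CreateSettings()))
        {
            writer.WriteNode(reader, true);
        }
        Console.WriteLine();
    }

    public static void Main()
    {
        Console.WriteLine(WriteToStringBuilder(SampleXml));
        WriteToConsole(SampleXml);
    }
}
```

The declaration written by WriteToConsole depends on the code page of the console it runs in, so it can differ from machine to machine.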
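The XML declarations in the output of these two methods look like the following:
<?xml version="1.0" encoding="utf-16"?>
<?xml version="1.0" encoding="IBM437"?>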
Notice that we specified UTF-8 as the encoding in the XmlWriterSettings, yet the output here reports UTF-16 and IBM437. The short explanation is that a StringBuilder cannot contain bytes. It is designed to contain characters, and the characters in a .NET string are always UTF-16 code units. Similarly, the Console.Out property is a TextWriter that uses a specific encoding for displaying text in a console window. In this case, it is IBM437.
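One way to see this for yourself is to ask the writers which encoding they report. The sketch below is only an illustration (the class and method names are invented); it relies on the TextWriter.Encoding property, which a StringWriter wrapping a StringBuilder reports as UTF-16.

```csharp
using System;
using System.IO;
using System.Text;

public static class WriterEncodingProbe
{
    // A StringWriter wraps a StringBuilder and reports the encoding of .NET strings.
    public static string StringWriterEncodingName()
    {
        using (var writer = new StringWriter(new StringBuilder()))
        {
            return writer.Encoding.WebName;
        }
    }

    public static void Main()
    {
        Console.WriteLine(StringWriterEncodingName());
        // The console's encoding depends on the machine's active code page.
        Console.WriteLine(Console.Out.Encoding.WebName);
    }
}
```

These are the names that end up in the XML declaration when the XmlWriter targets one of these writers.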
The XML declaration states the encoding of the document, so it has to match the underlying store. When an XmlWriter is created over a TextWriter or a StringBuilder, the encoding of that target overrides the Encoding value in the XmlWriterSettings. Imagine the alternative: the declaration claims UTF-8, but the underlying content is actually UTF-16. A parser that trusted the declaration would try to decode the UTF-16 content as UTF-8, and would either fail or end up with some pretty odd-looking output.
Solution
If you need UTF-8 encoding to be preserved, you need to write to a backing store that supports UTF-8 encoding. One way to do that is to use a MemoryStream.
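A sketch of that approach, again with invented names, might look like the following. It assumes you want the result back as a string for inspection, and it uses a UTF8Encoding that does not emit a byte order mark so that the decoded string starts directly with the declaration.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class Utf8StreamDemo
{
    // Writes the XML into a MemoryStream as UTF-8 bytes, then decodes those bytes.
    public static string WriteToMemoryStream(string xml)
    {
        var settings = new XmlWriterSettings
        {
            Indent = true,
            // Encoding.UTF8 also works, but it writes a byte order mark first.
            Encoding = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false)
        };

        using (var stream = new MemoryStream())
        {
            using (var reader = XmlReader.Create(new StringReader(xml)))
            using (var writer = XmlWriter.Create(stream, settings))
            {
                writer.WriteNode(reader, true);
            }
            // The writer has been flushed by Dispose; the stream holds UTF-8 bytes.
            return Encoding.UTF8.GetString(stream.ToArray());
        }
    }

    public static void Main()
    {
        // Hypothetical sample input.
        Console.WriteLine(WriteToMemoryStream("<root><item>value</item></root>"));
    }
}
```

In practice you would usually hand the stream or its bytes to whatever consumes the XML, rather than converting back to a string.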
The output of this method is what you would expect: the declaration now reads encoding="utf-8".
The difference is that a MemoryStream stores bytes, so we are able to serialize to and deserialize from a backing store capable of holding UTF-8 encoded content.