Today I want to talk about XmlWriter and the generation of a Byte Order Mark (BOM).

XmlWriter provides an API that generates, unsurprisingly, XML. This XML will typically end up as a managed string of characters or possibly a sequence of bytes. Of course, text transformed into bytes implies an encoding, as previously discussed.

Now, XML has its own ways of determining a document's encoding: by peeking at the first bytes that make up an opening <?xml declaration or, more explicitly, by reading the encoding attribute on this declaration.
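
To make the "peeking" idea concrete, here is a rough sketch of how the first four bytes of a <?xml declaration can hint at the encoding family. The DeclarationSniffer class and its GuessFamily method are names made up for this illustration, and this is a simplification rather than what the .NET XML classes actually do internally.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only; not part of the .NET Framework.
public static class DeclarationSniffer {
  // Returns true if data begins with the given bytes.
  private static bool StartsWith(byte[] data, params byte[] prefix) {
    if (data.Length < prefix.Length) return false;
    for (int i = 0; i < prefix.Length; i++) {
      if (data[i] != prefix[i]) return false;
    }
    return true;
  }

  // Guesses the encoding family from the bytes of "<?xm" (no BOM handling).
  public static string GuessFamily(byte[] head) {
    if (StartsWith(head, 0x3C, 0x00, 0x3F, 0x00)) return "UTF-16, little-endian";
    if (StartsWith(head, 0x00, 0x3C, 0x00, 0x3F)) return "UTF-16, big-endian";
    if (StartsWith(head, 0x3C, 0x3F, 0x78, 0x6D)) return "ASCII-compatible";
    return "unknown";
  }

  public static void Main() {
    // GetBytes does not emit a BOM, so these are just the declaration bytes.
    Console.WriteLine(GuessFamily(Encoding.UTF8.GetBytes("<?xml")));
    Console.WriteLine(GuessFamily(Encoding.Unicode.GetBytes("<?xml")));
    Console.WriteLine(GuessFamily(Encoding.BigEndianUnicode.GetBytes("<?xml")));
  }
}
```

For the ASCII-compatible case, a reader would still have to parse the declaration itself to learn the exact encoding name.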

Unicode is used for all sorts of purposes, not just XML encoding, and so it also has its own mechanism, the byte order mark, to distinguish between little-endian and big-endian encodings, that is, which byte comes first in each UTF-16 or UTF-32 code unit. A BOM is also allowed for UTF-8, for that matter.
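
If you're curious what these marks look like as bytes, Encoding.GetPreamble returns them. The following is a small sketch; the BomDemo class and PreambleHex method are names I'm making up here for illustration.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only.
public static class BomDemo {
  // Formats an encoding's preamble (its BOM) as hex, or "(none)" if empty.
  public static string PreambleHex(Encoding encoding) {
    byte[] preamble = encoding.GetPreamble();
    return preamble.Length == 0
      ? "(none)"
      : BitConverter.ToString(preamble).Replace("-", " ");
  }

  public static void Main() {
    Console.WriteLine("UTF-8:          " + PreambleHex(Encoding.UTF8));
    Console.WriteLine("UTF-8, no BOM:  " + PreambleHex(new UTF8Encoding(false)));
    Console.WriteLine("UTF-16 LE:      " + PreambleHex(Encoding.Unicode));
    Console.WriteLine("UTF-16 BE:      " + PreambleHex(Encoding.BigEndianUnicode));
    Console.WriteLine("UTF-32 LE:      " + PreambleHex(Encoding.UTF32));
    Console.WriteLine("ASCII:          " + PreambleHex(Encoding.ASCII));
  }
}
```

Note that the preamble is only something an encoding offers; whether it actually gets written is up to whichever writer sits on top.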

How do these mechanisms interact when using the .NET Framework classes? Let's write some code!

First, we'll write a short helper method to display the contents of a byte array. (The code in this post needs using directives for System, System.IO, System.Text, and System.Xml.)

private static void ShowBuffer(string linePrefix, byte[] bytes, long length) {
  int bytesOnLine = 0;
  for (long i = 0; i < length; i++) {
    if (bytesOnLine == 0) {
      Console.Write(linePrefix);
    }

    Console.Write("{0:X2} ", bytes[i]);
    bytesOnLine++;
    if (bytesOnLine >= 16) {
      Console.WriteLine();
      bytesOnLine = 0;
    }
  }
}

Next, let's write a method to write out some short XML.

private static void WriteXml(XmlWriter xmlWriter) {
  xmlWriter.WriteStartElement("hello");
  xmlWriter.WriteString("#1");
  xmlWriter.WriteEndElement();
  xmlWriter.Flush();
}

We'll try different combinations of layering an XmlWriter with some encoding over a StreamWriter with a different encoding (or directly over a stream) to see what happens. These two methods will help us out.

private static long WriteEncodedXml(
    Encoding streamEncoding,
    Encoding xmlEncoding,
    Stream stream) {
  XmlWriterSettings settings = new XmlWriterSettings();
  settings.Encoding = xmlEncoding;
  settings.Indent = false;
  if (streamEncoding != null) {
    using (StreamWriter writer = new StreamWriter(stream, streamEncoding))
    using (XmlWriter xmlWriter = XmlWriter.Create(writer, settings)) {
      WriteXml(xmlWriter);
      return stream.Length;
    }
  } else {
    using (XmlWriter xmlWriter = XmlWriter.Create(stream, settings)) {
      WriteXml(xmlWriter);
      return stream.Length;
    }
  }
}

private static void ShowXmlEncoding(
    Encoding streamEncoding,
    Encoding xmlEncoding) {
  Console.WriteLine("Stream Encoding: " +
    ((streamEncoding == null) ?
      "(no stream)" : streamEncoding.EncodingName));
  Console.WriteLine(" XML Encoding: " + xmlEncoding.EncodingName);
  MemoryStream stream = new MemoryStream();
  long length = WriteEncodedXml(streamEncoding, xmlEncoding, stream);
  byte[] bytes = stream.GetBuffer();
  ShowBuffer(" ", bytes, length);
  Console.WriteLine();
}

Finally, here is the method to drive it all.

public static void Main(string[] args) {
  // First encoding is for stream writer, second is XML writer.
  ShowXmlEncoding(null, Encoding.UTF8);
  ShowXmlEncoding(null,
    new UTF8Encoding(/* encoderShouldEmitUTF8Identifier */false));
  ShowXmlEncoding(null, Encoding.Unicode);
  ShowXmlEncoding(null, Encoding.BigEndianUnicode);
  ShowXmlEncoding(Encoding.ASCII, Encoding.Unicode);

  // Muhaha.
  Encoding muhaha = Encoding.GetEncoding(
    "x-IA5-Norwegian",
    new EncoderExceptionFallback(),
    new DecoderExceptionFallback());
  ShowXmlEncoding(null, muhaha);
}

You can run this now and see what comes up. Tomorrow, a short analysis of some interesting results.

Enjoy!