Today I want to talk about XmlWriter and the generation of a Byte Order Mark (BOM).
XmlWriter provides an API that generates, unsurprisingly, XML. This XML will typically end up as a managed string of characters or possibly a sequence of bytes. Of course, text transformed into bytes implies an encoding, as previously discussed.
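To make the "text into bytes implies an encoding" point concrete, here is a minimal sketch that encodes the same four characters with three of the framework's built-in encodings. The class name EncodingDemo and the helper ToHex are hypothetical names used only for this illustration. Note that Encoding.GetBytes does not emit a BOM by itself.

```csharp
using System;
using System.Text;

// Illustration only: EncodingDemo and ToHex are hypothetical names.
public static class EncodingDemo
{
    // Formats bytes as space-separated uppercase hex, e.g. "3C 61 2F 3E".
    public static string ToHex(byte[] bytes)
    {
        return BitConverter.ToString(bytes).Replace("-", " ");
    }

    public static void Main()
    {
        string text = "<a/>";

        // Same characters, three different byte sequences.
        Console.WriteLine("UTF-8:     " + ToHex(Encoding.UTF8.GetBytes(text)));
        // prints: UTF-8:     3C 61 2F 3E
        Console.WriteLine("UTF-16 LE: " + ToHex(Encoding.Unicode.GetBytes(text)));
        // prints: UTF-16 LE: 3C 00 61 00 2F 00 3E 00
        Console.WriteLine("UTF-16 BE: " + ToHex(Encoding.BigEndianUnicode.GetBytes(text)));
        // prints: UTF-16 BE: 00 3C 00 61 00 2F 00 3E
    }
}
```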
Now, XML has its own ways of determining a document's encoding: by peeking at the first bytes that make up an opening <?xml declaration or, more explicitly, by reading the encoding pseudo-attribute on that declaration.
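The "peeking at the first bytes" idea can be sketched roughly as follows. This is a simplified illustration, not how any particular parser implements it: it only covers documents that start with "<?xm" and carry no BOM, and the names XmlSniffer and GuessEncodingFamily are made up for this example.

```csharp
using System;
using System.Text;

// Illustration only: XmlSniffer and GuessEncodingFamily are hypothetical names.
public static class XmlSniffer
{
    // Guesses the encoding family from the first four bytes of a document
    // that begins with "<?xm" and has no byte order mark.
    public static string GuessEncodingFamily(byte[] head)
    {
        if (head == null || head.Length < 4)
        {
            return "unknown";
        }

        // '<' '?' 'x' 'm' as single bytes: some ASCII-compatible encoding.
        if (head[0] == 0x3C && head[1] == 0x3F && head[2] == 0x78 && head[3] == 0x6D)
        {
            return "ASCII-compatible";
        }

        // '<' '?' as 16-bit units, low byte first.
        if (head[0] == 0x3C && head[1] == 0x00 && head[2] == 0x3F && head[3] == 0x00)
        {
            return "UTF-16 little-endian";
        }

        // '<' '?' as 16-bit units, high byte first.
        if (head[0] == 0x00 && head[1] == 0x3C && head[2] == 0x00 && head[3] == 0x3F)
        {
            return "UTF-16 big-endian";
        }

        return "unknown";
    }

    public static void Main()
    {
        string declaration = "<?xml version=\"1.0\"?>";
        Console.WriteLine(GuessEncodingFamily(Encoding.ASCII.GetBytes(declaration)));
        // prints: ASCII-compatible
        Console.WriteLine(GuessEncodingFamily(Encoding.Unicode.GetBytes(declaration)));
        // prints: UTF-16 little-endian
        Console.WriteLine(GuessEncodingFamily(Encoding.BigEndianUnicode.GetBytes(declaration)));
        // prints: UTF-16 big-endian
    }
}
```

Once the family is known, a reader can decode enough of the declaration to find the explicit encoding name, if there is one.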
Unicode is used for all sorts of purposes, not just XML encoding, and so it has its own mechanism, the byte order mark, to distinguish between little-endian and big-endian encodings, which determine which byte comes first in UTF-16 and UTF-32. A BOM is also allowed for UTF-8, for that matter, even though UTF-8 has no byte-order ambiguity.
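In .NET, the BOM that an encoding would emit is exposed through Encoding.GetPreamble. The sketch below prints the preamble of a few built-in encodings. PreambleDemo and PreambleHex are hypothetical names for this illustration.

```csharp
using System;
using System.Text;

// Illustration only: PreambleDemo and PreambleHex are hypothetical names.
public static class PreambleDemo
{
    // Returns the encoding's preamble (its BOM) as hex, or "(none)" if empty.
    public static string PreambleHex(Encoding encoding)
    {
        byte[] preamble = encoding.GetPreamble();
        return preamble.Length == 0
            ? "(none)"
            : BitConverter.ToString(preamble).Replace("-", " ");
    }

    public static void Main()
    {
        Console.WriteLine(PreambleHex(Encoding.UTF8));              // prints: EF BB BF
        Console.WriteLine(PreambleHex(new UTF8Encoding(false)));    // prints: (none)
        Console.WriteLine(PreambleHex(Encoding.Unicode));           // prints: FF FE
        Console.WriteLine(PreambleHex(Encoding.BigEndianUnicode));  // prints: FE FF
        Console.WriteLine(PreambleHex(Encoding.UTF32));             // prints: FF FE 00 00
        Console.WriteLine(PreambleHex(Encoding.ASCII));             // prints: (none)
    }
}
```

Whether a given writer actually emits that preamble is a separate question, and it is the one the rest of this post pokes at.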
How do these mechanisms interact when using the .NET Framework classes? Let's write some code!
First, we'll write a short helper method to display the contents of a byte array.
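```csharp
// The snippets in this post are members of one class and assume:
// using System; using System.IO; using System.Text; using System.Xml;

// Writes the first 'length' bytes of 'bytes' as hex, 16 bytes per line,
// starting each line with 'linePrefix'.
private static void ShowBuffer(string linePrefix, byte[] bytes, long length)
{
    int bytesOnLine = 0;
    for (long i = 0; i < length; i++)
    {
        if (bytesOnLine == 0)
        {
            Console.Write(linePrefix);
        }

        Console.Write("{0:X2} ", bytes[i]);
        bytesOnLine++;

        // Break after 16 bytes (the original '> 16' test allowed 17 per line).
        if (bytesOnLine >= 16)
        {
            Console.WriteLine();
            bytesOnLine = 0;
        }
    }
}
```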
Next, let's write a method to write out some short XML.
```csharp
private static void WriteXml(XmlWriter xmlWriter)
{
    xmlWriter.WriteStartElement("hello");
    xmlWriter.WriteString("#1");
    xmlWriter.WriteEndElement();
    xmlWriter.Flush();
}
```
We'll try different combinations of layering an XmlWriter with some encoding over a StreamWriter with a different encoding (or directly over a stream) to see what happens. These two methods will help us out.
```csharp
// Writes the sample XML to 'stream' and returns the number of bytes written.
// If 'streamEncoding' is null, the XmlWriter sits directly on the stream;
// otherwise it sits on a StreamWriter that uses 'streamEncoding'.
private static long WriteEncodedXml(
    Encoding streamEncoding, Encoding xmlEncoding, Stream stream)
{
    XmlWriterSettings settings = new XmlWriterSettings();
    settings.Encoding = xmlEncoding;
    settings.Indent = false;

    if (streamEncoding != null)
    {
        using (StreamWriter writer = new StreamWriter(stream, streamEncoding))
        using (XmlWriter xmlWriter = XmlWriter.Create(writer, settings))
        {
            // WriteXml flushes, so the length is read before the writers
            // are disposed (disposing the StreamWriter closes the stream).
            WriteXml(xmlWriter);
            return stream.Length;
        }
    }
    else
    {
        using (XmlWriter xmlWriter = XmlWriter.Create(stream, settings))
        {
            WriteXml(xmlWriter);
            return stream.Length;
        }
    }
}

private static void ShowXmlEncoding(
    Encoding streamEncoding, Encoding xmlEncoding)
{
    Console.WriteLine("Stream Encoding: " +
        ((streamEncoding == null) ? "(no stream)" : streamEncoding.EncodingName));
    Console.WriteLine(" XML Encoding: " + xmlEncoding.EncodingName);

    MemoryStream stream = new MemoryStream();
    long length = WriteEncodedXml(streamEncoding, xmlEncoding, stream);

    // GetBuffer returns the whole backing array, which may be larger than
    // the data written, so the real length is passed along.
    byte[] bytes = stream.GetBuffer();
    ShowBuffer(" ", bytes, length);
    Console.WriteLine();
}
```
Finally, here is the method to drive it all.
```csharp
public static void Main(string[] args)
{
    // First encoding is for stream writer, second is XML writer.
    ShowXmlEncoding(null, Encoding.UTF8);
    ShowXmlEncoding(null, new UTF8Encoding(/* encoderShouldEmitUTF8Identifier */ false));
    ShowXmlEncoding(null, Encoding.Unicode);
    ShowXmlEncoding(null, Encoding.BigEndianUnicode);
    ShowXmlEncoding(Encoding.ASCII, Encoding.Unicode);

    // Muhaha.
    Encoding muhaha = Encoding.GetEncoding(
        "x-IA5-Norwegian",
        new EncoderExceptionFallback(),
        new DecoderExceptionFallback());
    ShowXmlEncoding(null, muhaha);
}
```
You can run this now and see what comes up. Tomorrow, a short analysis of some interesting results.
Enjoy!