Kirk Evans Blog

.NET From a Markup Perspective

Postel again? Writing Well-Formed XML with XmlWriter


Recently, I was forced to reconsider my stance on Postel's law again.  I was provided an “XML” feed that was not well-formed, and I did not have the luxury of simply rejecting it for the multiple well-formedness violations it contained.

Side note:  it is a pet peeve of mine when people make quote signs in the air.

Specifically, the “XML” contained several characters that XML 1.0 explicitly forbids, such as #x08 and #x16.  Instead of throwing an exception, as done in the Customized XML Writer Creation topic on MSDN, I had to process the “XML” and write portions of it to a new stream, ensuring that the output was well-formed.  The solution?  A custom XmlWriter that overrides WriteString.  The override uses a regular expression to determine whether the text contains any characters outside the Char production in XML 1.0.  Any character that falls outside that production is escaped with a character entity (for example, #x08 is written as &#x8;).

using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;
using System.Xml;

namespace XmlAdvice
{
    /// <summary>
    /// An XmlTextWriter that strives to ensure the generated markup is well-formed.
    /// </summary>
    public class XmlWellFormedTextWriter : System.Xml.XmlTextWriter
    {
        // Matches UTF-16 code units outside the XML 1.0 Char production.
        // Four-digit code points need \u escapes: in a .NET regex, \x consumes
        // exactly two hex digits, so \xD800 would mean \xD8 followed by "00".
        // Surrogates (\uD800-\uDFFF) are deliberately not matched: a valid pair
        // encodes a character in #x10000-#x10FFFF, and WriteCharEntity rejects
        // surrogate values.
        private static readonly Regex _regex =
            new Regex(@"[\x00-\x08\x0B\x0C\x0E-\x1F\uFFFE\uFFFF]");

        /// <summary>
        /// Creates an instance of the XmlWellFormedTextWriter
        /// </summary>
        /// <param name="stream">The stream to write to.</param>
        /// <param name="encoding">The encoding for the stream (typically UTF8).</param>
        public XmlWellFormedTextWriter(Stream stream, Encoding encoding)
            : base(stream, encoding)
        {
        }

        /// <summary>
        /// Replaces any occurrence of characters not within the production
        /// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
        /// with a character entity.
        /// </summary>
        /// <param name="text">The text to write to the output.</param>
        public override void WriteString(string text)
        {
            // Null text cannot be matched; let the base writer handle it.
            if (text == null)
            {
                base.WriteString(text);
                return;
            }

            Match m = _regex.Match(text);

            // Nothing to escape: write the text unchanged.
            if (!m.Success)
            {
                base.WriteString(text);
                return;
            }

            int idx = 0;
            while (m.Success)
            {
                // Write the clean run that precedes the offending character.
                if (m.Index > idx)
                {
                    base.WriteString(text.Substring(idx, m.Index - idx));
                }
                WriteCharEntity(text[m.Index]);
                idx = m.Index + 1;
                m = m.NextMatch();
            }

            // Write everything after the last match, including the final character.
            if (idx < text.Length)
            {
                base.WriteString(text.Substring(idx));
            }
        }
    }
}

Update:  Dare Obasanjo corrected me.  In fact, this example does not help create a well-formed XML document; instead, it postpones the problem until the XML document is parsed again.
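
To see why the problem is only postponed, here is a minimal sketch, assuming the XmlReader.Create and XmlReaderSettings.CheckCharacters APIs that shipped in later versions of the .NET Framework.  CharCheckDemo, Parses, and StripInvalid are hypothetical names used only for illustration.  Dropping the offending characters, as StripInvalid does, is just one possible policy and is not what the writer above does.

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;
using System.Xml;

public static class CharCheckDemo
{
    // Hypothetical helper: returns true if the markup can be read to the end
    // with the given CheckCharacters setting, false if the reader rejects it.
    public static bool Parses(string xml, bool checkCharacters)
    {
        var settings = new XmlReaderSettings { CheckCharacters = checkCharacters };
        try
        {
            using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings))
            {
                while (reader.Read()) { }
            }
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }

    // Code units outside the XML 1.0 Char production. Surrogates are left
    // alone because valid pairs encode characters in #x10000-#x10FFFF.
    private static readonly Regex Invalid =
        new Regex(@"[\x00-\x08\x0B\x0C\x0E-\x1F\uFFFE\uFFFF]");

    // Assumption: silently dropping the characters is acceptable for the feed.
    public static string StripInvalid(string text)
    {
        return Invalid.Replace(text, string.Empty);
    }

    public static void Main()
    {
        // The kind of output the writer above produces for a #x08 character.
        string escaped = "<a>bad&#x8;char</a>";

        Console.WriteLine(Parses(escaped, true));   // character checking on
        Console.WriteLine(Parses(escaped, false));  // character checking off
        Console.WriteLine(Parses("<a>" + StripInvalid("bad\u0008char") + "</a>", true));
    }
}
```

With character checking on, the reader should reject the escaped form, because a character reference must still resolve to a character in the Char production.  It should only accept it once checking is switched off, which is the situation Dare describes in the comments below.  The stripped text parses either way.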

  • The XML you are generating isn't well-formed. Those characters are invalid in XML even if escaped as character entities. All you've done is made it possible for the XmlTextReader to consume them when the Normalization (and thus character checking) property is set to false.
  • Then shouldn't the WriteCharEntity call throw an XmlException according to the same production?
  • XmlTextWriter doesn't do character checking, as mentioned in my post at http://www.25hoursaday.com/weblog/PermaLink.aspx?guid=d8b4c69e-164f-4bff-9b91-9207e9620c10:

    "The XmlTextWriter writes Unicode characters in the range 0x0 to 0x20, and the characters 0xFFFE and 0xFFFF, which are not XML characters. "

  • Lightbulb just went off.

    This is what I had pointed out in the post you referenced, http://blogs.xmladvice.com/kaevans/archive/2004/01/13/321.aspx

    The difference is that I had interpreted this statement as "the WriteString method does not attempt to escape character entities", not as "the WriteCharEntity method does not perform checks on character ranges."

    I was, admittedly, confused about the idea that the presence of x0A is considered the same as referring to it with the &#x00A; character entity. I had been taking advantage of the very issue I was trying to avoid, just changing the semantics of the problem.