Out of the Angle Brackets
This is the first part of a mini-series of blog posts about using Xml effectively on the .NET Framework platform. Although I will be focusing on the .NET Framework, I hope that at least some of the information will be general enough to apply to working with Xml on any platform.
The managed Xml stack contains a few different APIs for working with Xml documents. These are XmlReader, XPathDocument, XmlDocument and XDocument/XElement (LINQ to Xml).
There are this many APIs because each of them is very effective in some scenarios while being less effective, or not usable at all, in others. Let’s take a look at these APIs and the scenarios where they shine.
XmlReader – as per MSDN “provides fast, non-cached, forward-only access to XML data”. In general, XmlReader reads data from the source stream, ensures that the data is well-formed Xml and reports what it read. XmlReader is typically controlled by some code that tells the reader what to read (ReadXXX() methods) or what to skip (MoveToXXX()/Skip() methods). It is the responsibility of the controlling code to process (or cache) the results reported by the XmlReader, since the XmlReader will “forget” what it read once it has moved to the next node.
Note: XmlReader is the lowest-level API in the whole .NET Framework that allows reading Xml documents. This means that *any* class in the .NET Framework that needs to read Xml documents is – either explicitly or implicitly – using XmlReader under the covers.
If you decide to use XmlReader, remember to create the reader using the XmlReader.Create() factory method, which will create the right reader according to the provided XmlReaderSettings. Avoid using XmlTextReader. It contains quite a few bugs that could not be fixed without breaking existing applications already using it.
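To make the "controlling code" idea concrete, here is a minimal sketch of a forward-only read loop. The class name, the method name and the "title" element are hypothetical and used only for illustration; the sketch assumes that each title element contains text only.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class XmlReaderSketch
{
    // Collects the text content of every <title> element (hypothetical element name).
    public static List<string> ReadTitles(string xml)
    {
        var settings = new XmlReaderSettings
        {
            IgnoreWhitespace = true,
            IgnoreComments = true
        };

        var titles = new List<string>();
        using (var input = new StringReader(xml))
        using (XmlReader reader = XmlReader.Create(input, settings))
        {
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "title")
                {
                    // Reads the text and leaves the reader positioned after </title>,
                    // so Read() must not be called again in this iteration.
                    titles.Add(reader.ReadElementContentAsString());
                }
                else
                {
                    reader.Read();
                }
            }
        }
        return titles;
    }
}
```

The caller decides what to keep: the reader itself retains nothing once it has moved on, so the list is the only cache in this sketch.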
Scenarios where using XmlReader is not effective or impossible:
XPathDocument – is an Xml cache optimized for querying Xml documents. It’s a cache so the whole Xml document has to be read (yes, it is using the XmlReader to do this) before it can be queried. For better query performance and smaller memory footprint the XPathDocument does not allow for any modifications to the cached document.
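A minimal sketch of read-only querying with XPathDocument might look like this. The class and method names are hypothetical; the XPath expression is supplied by the caller.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml.XPath;

public static class XPathDocumentSketch
{
    // Loads the whole document into the read-only cache, then evaluates an XPath query.
    public static List<string> SelectValues(string xml, string xpath)
    {
        XPathDocument document;
        using (var input = new StringReader(xml))
        {
            document = new XPathDocument(input);
        }

        // The navigator is a cursor over the cached document; it cannot modify it.
        XPathNavigator navigator = document.CreateNavigator();
        XPathNodeIterator iterator = navigator.Select(xpath);

        var values = new List<string>();
        while (iterator.MoveNext())
        {
            values.Add(iterator.Current.Value);
        }
        return values;
    }
}
```

For example, calling SelectValues with the expression /books/book[@id='2']/title returns the text of the matching title elements.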
Scenarios where using XPathDocument is not effective or impossible:
XmlDocument – is an API modeled on the Xml DOM (Document Object Model). Like the XPathDocument, the XmlDocument is a cache, so when working with Xml data from an external source the whole Xml document has to be read first. Unlike the XPathDocument, however, the cached document can be modified. With XmlDocument it is also possible to build Xml documents programmatically from scratch rather than load them from external sources. The price to pay for this flexibility is worse query performance and a bigger memory footprint compared to XPathDocument.
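As an illustration of modifying a cached document, here is a minimal sketch that loads Xml into an XmlDocument and appends a new element. The class name, method name and the "book"/"title" names are hypothetical, and the sketch assumes the input has a root element.

```csharp
using System.Xml;

public static class XmlDocumentSketch
{
    // Loads the document, appends <book title="..."/> under the root and returns the result.
    public static string AddBook(string xml, string title)
    {
        var document = new XmlDocument();
        document.LoadXml(xml);

        // New nodes must be created by the document that will own them.
        XmlElement book = document.CreateElement("book");
        book.SetAttribute("title", title);
        document.DocumentElement.AppendChild(book);

        return document.OuterXml;
    }
}
```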
Scenarios where using XmlDocument is not effective:
XDocument/XElement et al. (LINQ to Xml) – a new Xml API added to the .NET Framework in version 3.5 along with a new way of querying – Language Integrated Query (LINQ). With the LINQ to Xml APIs you can achieve most of what you can achieve with XmlDocument (e.g. build and modify Xml documents, validate Xml documents against Xsd, do Xslt transformations, query with XPath etc.) but in a much easier and lighter way. You also have a new way of querying Xml documents – LINQ queries. In new projects you should prefer the LINQ to Xml APIs over XmlDocument.
Although XDocument needs to cache the entire Xml document, most LINQ queries work with just XElements (or XAttributes etc.) or, to be more precise, with IEnumerable<T> where T is XElement (or XAttribute etc.). Due to the deferred execution of IEnumerable<T>, it is possible to combine XmlReader with IEnumerable<T> and query Xml documents without having to load the whole document into memory first. See this blog entry for more details: http://blogs.msdn.com/b/xmlteam/archive/2007/03/24/streaming-with-linq-to-xml-part-2.aspx
One scenario where you may want to use XPath with XDocument, however, is when you generate queries dynamically. It’s not easy to generate LINQ queries on the fly, and using XPath in this scenario is probably the best solution if your application is using XDocument/XElement.
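Here is a minimal sketch of building a small tree with LINQ to Xml and querying it with a LINQ query. The class name, the method names and the "books"/"book"/"year"/"title" names are hypothetical sample data.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class LinqToXmlSketch
{
    // Builds a tiny in-memory catalog using functional construction.
    public static XElement BuildCatalog()
    {
        return new XElement("books",
            new XElement("book", new XAttribute("year", 2005), new XElement("title", "A")),
            new XElement("book", new XAttribute("year", 2010), new XElement("title", "B")));
    }

    // Returns the titles of books with a "year" attribute greater than the given year.
    public static List<string> TitlesPublishedAfter(XElement catalog, int year)
    {
        return catalog.Elements("book")
            // The (int?) cast yields null for a missing attribute, which fails the comparison.
            .Where(book => (int?)book.Attribute("year") > year)
            .Select(book => (string)book.Element("title"))
            .ToList();
    }
}
```

With the sample catalog above, TitlesPublishedAfter(BuildCatalog(), 2007) yields a single title, "B".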
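The streaming approach can be sketched roughly as follows. This is my own minimal version of the idea, not the code from the linked post; the class and method names are hypothetical. It assumes the elements of interest are not nested inside one another.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Linq;

public static class StreamingSketch
{
    // Lazily yields each element with the given name, one at a time.
    // Only the element currently being yielded is materialized in memory.
    public static IEnumerable<XElement> StreamElements(TextReader input, string elementName)
    {
        using (XmlReader reader = XmlReader.Create(input))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == elementName)
                {
                    // XNode.ReadFrom builds the element and leaves the reader
                    // positioned after it, so no extra Read() is needed here.
                    yield return (XElement)XNode.ReadFrom(reader);
                }
                else
                {
                    reader.Read();
                }
            }
        }
    }
}
```

Because the method is an iterator, nothing is read until the result is enumerated, and ordinary LINQ operators such as Where and Select can be chained on top of it. Note that asking for the root element name would still load the whole document.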
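A minimal sketch of a dynamically generated query is shown below. It relies on the XPathSelectElements extension method from the System.Xml.XPath namespace; the class name, the method name and its parameters are hypothetical, and the sketch assumes the inputs are trusted and contain no quote characters.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

public static class DynamicQuerySketch
{
    // Builds an XPath expression at run time and evaluates it against an XDocument.
    public static List<string> FindValues(
        XDocument document, string elementName, string attributeName, string attributeValue)
    {
        // Assumption: the names and the value are trusted and contain no quotes.
        string xpath = string.Format("//{0}[@{1}='{2}']", elementName, attributeName, attributeValue);

        return document.XPathSelectElements(xpath)
            .Select(element => element.Value)
            .ToList();
    }
}
```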
Mixing LINQ to Xml with other Xml APIs
There is no magic cast that allows converting XDocument/XElement to XPathDocument or XmlDocument. “Converting” actually means building the whole XmlDocument structure from scratch based on the source XDocument/XElement. It is a costly operation, and doing this continuously or for bigger documents can cause performance problems. As far as performance is concerned, it is usually better to stick to XmlDocument and/or XPathDocument in legacy applications already using these APIs and to use LINQ to Xml in new apps.
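For the cases where a conversion cannot be avoided, one possible approach is to pipe one tree into the other through an XmlReader. The sketch below is an illustration under that assumption; the class and method names are hypothetical. Both methods rebuild the entire target tree, which is exactly the cost described above.

```csharp
using System.Xml;
using System.Xml.Linq;

public static class ConversionSketch
{
    // Rebuilds the whole tree as an XmlDocument by reading the XDocument through a reader.
    public static XmlDocument ToXmlDocument(XDocument source)
    {
        var target = new XmlDocument();
        using (XmlReader reader = source.CreateReader())
        {
            target.Load(reader);
        }
        return target;
    }

    // Rebuilds the whole tree as an XDocument by reading the XmlDocument through a reader.
    public static XDocument ToXDocument(XmlDocument source)
    {
        using (var reader = new XmlNodeReader(source))
        {
            return XDocument.Load(reader);
        }
    }
}
```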
Writing Xml Documents
The XmlWriter is the lowest-level Xml API used to write Xml documents. Any other .NET Framework API persisting Xml documents is using this API. If you just need to write an Xml document, this is the fastest way to do it. If you need to write a document that has already been cached (i.e. you already have an XmlDocument or XDocument instance in hand), you don’t need to do anything special.
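Both cases can be sketched as follows. The class name, the method names and the "books"/"title" element names are hypothetical; the sketch writes to a string for simplicity, but XmlWriter.Create also accepts a stream or a file name.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Linq;

public static class XmlWriterSketch
{
    // Writes a document directly with XmlWriter, without building any in-memory tree.
    public static string WriteCatalog(string[] titles)
    {
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };
        var output = new StringWriter();
        using (XmlWriter writer = XmlWriter.Create(output, settings))
        {
            writer.WriteStartElement("books");
            foreach (string title in titles)
            {
                writer.WriteElementString("title", title);
            }
            writer.WriteEndElement();
        }
        return output.ToString();
    }

    // Persists an already cached document; Save drives the XmlWriter for you.
    public static string SaveCached(XDocument document)
    {
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };
        var output = new StringWriter();
        using (XmlWriter writer = XmlWriter.Create(output, settings))
        {
            document.Save(writer);
        }
        return output.ToString();
    }
}
```

XmlDocument has an equivalent Save overload that takes an XmlWriter, so the second method would look the same for a DOM-based document.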
How can I parse badly formatted XML, like a typical badly formatted HTML file?
When I tried, I got exceptions.
@Nekketsu: You don't. HTML (but not XHTML) is not Xml and therefore should not be processed with Xml tools. You get exceptions because HTML is not as strict as Xml: in particular, it does not have to be well-formed, attributes don't have to be put in quotes etc. Move to XHTML and then you should be able to process your documents with Xml tools.
I just wanted to parse some HTML web pages, but I think I will need a different API.