Oleg Tkachenko has a nice post comparing the StAX (Java) and XmlReader (.NET and XmlLite) approaches to streaming over a potentially large XML data source and filtering out unwanted elements.  He concludes:

if you work with StAX you can readily work with .NET XmlReader and the other way around. Great unification saves hours of learning for developers. I wonder if a streaming XML processing API should be standardized?
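
To make that shared pull model concrete, here is a minimal StAX sketch; the `elementNames` helper and the sample document are my own illustration, not from Oleg's post:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class PullParseDemo {
    // Pull events one at a time; the document is never fully in memory.
    static List<String> elementNames(String xml) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        List<String> names = new ArrayList<>();
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                names.add(reader.getLocalName());
            }
        }
        reader.close();
        return names;
    }
}
```

The equivalent XmlReader loop in .NET calls `reader.Read()` and inspects `NodeType` -- the shape of the code is essentially the same, which is exactly the unification Oleg is pointing at.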

We've been discussing how to add streaming capabilities to LINQ to XML for some time now.  The value proposition is something like this: our target audience will sometimes encounter large documents or arbitrary streams of XML; they want the ease of use that LINQ to XML offers, but they don't want to load an entire data source into an in-memory tree before starting to work with it.  They could use XmlReader, of course, but that is a considerably lower-level API that requires attention to all sorts of details of XML syntax that we know mainstream developers don't want to worry about.  Let's offer some easy-to-use methods that allow LINQ to XML users to load a well-structured XML data source in definable chunks that can be worked on one at a time.
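
Since the LINQ to XML streaming API isn't public yet, here is a rough sketch of the "one chunk at a time" idea using Java StAX for illustration; the element name `record`, the helper `recordTexts`, and the flat document shape (no nested records) are all assumptions of mine:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class ChunkDemo {
    // Materialize one <record> subtree at a time instead of the whole tree.
    static List<String> recordTexts(String xml) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        List<String> out = new ArrayList<>();
        StringBuilder current = null;
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT
                    && reader.getLocalName().equals("record")) {
                current = new StringBuilder();        // a new chunk begins
            } else if (event == XMLStreamConstants.CHARACTERS && current != null) {
                current.append(reader.getText());
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && reader.getLocalName().equals("record")) {
                out.add(current.toString());          // chunk complete
                current = null;
            }
        }
        return out;
    }
}
```

Only one chunk's worth of data is buffered at any moment, which is the property a streaming LINQ to XML surface would need to preserve.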

The obvious way to do this is imperatively, much like StAX or XOM does: the user writes a filter function / subclass, and the XML API uses it to determine which elements in the XML source to pass through to the calling application.  We think, however, that the better way is to do it more declaratively -- specifying what to do rather than how to do it.  We're not ready to publicize a specific streaming input API, but let's talk about why it's worthwhile to avoid the easy (and arbitrarily powerful!) imperative filtering approach.
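
For contrast, here is what the imperative, filter-driven style looks like with StAX's actual `StreamFilter` hook; the `keepOnly` helper and sample data are hypothetical:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.StreamFilter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class FilterDemo {
    // Imperative style: a user-supplied callback decides which events survive.
    static List<String> keepOnly(String xml, String wanted) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader raw = factory.createXMLStreamReader(new StringReader(xml));
        // Drop start-element events for anything but the wanted name.
        StreamFilter filter = r ->
                !r.isStartElement() || r.getLocalName().equals(wanted);
        XMLStreamReader filtered = factory.createFilteredReader(raw, filter);
        List<String> hits = new ArrayList<>();
        while (filtered.hasNext()) {
            if (filtered.next() == XMLStreamConstants.START_ELEMENT) {
                hits.add(filtered.getLocalName());
            }
        }
        return hits;
    }
}
```

The filter is arbitrary code, which is its power and its problem: the API can run it, but it cannot reason about it, reorder it, or push it somewhere else for evaluation.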

Consider, for example, Eric White's post on using the querying style that LINQ supports rather than the traditional imperative approach to process large text files. He follows up with another post explaining why he thinks this is so cool. Taking Eric's points and elaborating / extending them a bit, here are some concrete reasons for using the declarative / functional style rather than the imperative style: 
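
Eric's examples are in C#, but the contrast shows up in any language with a query-style API. A small sketch in Java's Stream API, with a made-up log-line format (`ERROR <code> <message>`) assumed:

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class QueryStyleDemo {
    // Declarative: say what to keep and how to shape it, not how to loop.
    static List<String> errorCodes(Stream<String> lines) {
        return lines
                .filter(line -> line.startsWith("ERROR"))
                .map(line -> line.split("\\s+")[1])
                .collect(Collectors.toList());
    }
}
```

Fed with `Files.lines(path)`, the pipeline pulls lines lazily, so even a huge file is processed without ever being held in memory whole -- the querying style and the streaming requirement are not in tension.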

  • Non-imperative application code is likely to be easier to test and debug because the execution path depends on explicit inputs rather than on some funky internal state-driven logic.
  • The less imperative the code, the more likely it is to remain modularizable / refactorable as requirements change and the code evolves, again because there is less need to pass that funky internal state around.
  • The more purely declarative / functional, the more easily code can be mapped into some other more declarative language such as XQuery, XSLT, SQL, etc.  This might allow someone (maybe you, maybe us, maybe a third party) in the future to leverage LINQ expression trees to pass queries around and have them evaluated efficiently on some combination of the client, midtier, and server.  This is the design philosophy of LINQ to SQL, and there's no intrinsic reason it couldn't be applied to "LINQ to XQuery" or "LINQ to XSLT" ... *if* the logic isn't tainted by a bunch of imperative statefulness.
  • In the future, this kind of analysis can be automated: PLINQ is (probably?) coming in a subsequent version of the .NET Framework. As Joe Duffy puts it, this will mean that code written to the initial release of LINQ that uses "filters, projections, reductions, sorts, and joins can be evaluated in parallel... transparently... with little-to-no extra input from the developer."
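
Java's stream library already hints at what Joe is describing: because the pipeline is declarative, parallel evaluation is a one-line change. A sketch (an analogy, not PLINQ itself; the `countEvens` helper is mine):

```java
import java.util.stream.IntStream;

public class ParallelDemo {
    // The same declarative pipeline as the sequential version;
    // .parallel() is the only change -- the filter and count are untouched.
    static long countEvens(int upTo) {
        return IntStream.range(0, upTo)
                .parallel()
                .filter(n -> n % 2 == 0)
                .count();
    }
}
```

An imperative loop with mutable accumulators would need restructuring (and locking) to parallelize; the declarative pipeline carries enough intent for the runtime to do it safely on its own.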

OK, I admit that many of these concrete advantages may only be realized in some fuzzy future timeframe, not immediately. Nevertheless, we're trying to design a solid foundation today on which to build tomorrow, so we're trying to avoid quick and easy design choices that would limit our options in the future.

Finally, Ralf Lämmel will be presenting a paper on this topic at the XML 2006 Conference, and I'm sure he will be able to explain it all in much more detail than I can!