Recently I had the opportunity to work with a customer that was looking for suggestions around storing large volumes of XML. Currently they collect these XML documents and store them in the Windows Server file system with a very clever folder & file naming method. As they collect more and more of these files the customer is experiencing performance issues around searching and maintaining those XML documents.
The account team requested a MTC Application Design Session so I could help them determine the best method to store, search and manipulate their XML docs. The customer wanted to determine if they should use the varbinary(max) , Filestream (new in SQL Server 08), XML data type or some other method to store these XML.
Filestream is a new data type that is great for storing objects like video, audio (supports streaming API’s) or other very large files that you want to keep transactionally consistent with the database and want to store in the local file system. Because the customer wanted to ability to query the XML data, I ruled Filestream out in favor of the XML data type.
I determined the XML data the customer collects will shred nicely into a relational model and recommended the most commonly queried data be put into traditional relational columns. Because the customer needs to keep a record of the original XML document in it’s entirety, one of the columns would use the XML data type. The XML column could be stored out of row or in a related table. Other XML fragments or less commonly queried data could reside in other XML columns. Next we checked into the performance differences using both SQL queries and XQuery. It was immediately evident that for the most commonly queried columns the performance was radically better than XQuery. We tried returning columns via XQuery as well as using XQuery in “where” conditions.
We also explored the XML indexing methods available in SQL Server 2008, such as Primary and Secondary XML indexes (Node, Path and Property). The Primary XML index showed immediate performance improvements, and for this customer the QP used both the Node & Path secondary indexes but ignored the Property index. We also discussed FullText Indexing, since the customers data is structured XML vs. unstructured document format, we ruled out this index type.
Next we examined the new Sparse Column data type and Column Sets. Sparse columns are interesting because they use zero space on disk if the field is empty. This allows you to create wide rows in a SQL table. If using Sparse Columns, look into Filtered Indexes. A filtered index is an optimized nonclustered index, especially suited to cover queries that select from a well-defined subset of data, in this case non NULL values. It uses a filter to index a portion of rows in the table. A well-designed filtered index can improve query performance, reduce index maintenance costs, and reduce index storage costs compared with full-table indexes.
Tables that use sparse columns can designate a column set to return all sparse columns in the table. A column set is an untyped XML representation that combines all the sparse columns of a table into a structured output, which would work nicely for this customer. A column set is like a calculated column in that the column set is not physically stored in the table. A column set differs from a calculated column in that the column set is directly updatable.
Hopefully if you are considering what to do with XML documents and SQL Server this blog entry will provide some guidance.
Kirk Nason
Technical Arch – Data Platform
MTC Irvine