Before I talk with the examples of semi-structured property management system, I like to define the semi-structured data:
· Group: A set of ordered, open, heterogeneous properties. Every group has a name and every property in the group.
· Object: An instance of a group.
· Property: A scalar property, a collection property, or a complex property
· Scalar Property: A property of one of the basic SQL types (int,float,…)
· Collection Property: A property which is multi-valued, consisting of 0-N values (collection elements) of the same group. The collection element may be any property.
· Complex Property: A set of properties, each of which may be any property. Each is a child of the parent property, thus forming a hierarchy within the properties of a group.
· Ordered: The order properties are declared in can be provided for any instance.
· Open: Any number of properties can be added to an instance of a group and that instance remains of the same group, with the same set of services available as on all other instances.
· Heterogeneous: A property can be a union of 2 or more groups, with each instance being just one of the groups.
The Semi/structured data are the data of groups as defined above, and a Semi-structure management is the system/application which builds for management semi/structured data. In this blog, I will describe performance tuning for one kind of semi-structured applications: Project management system.
In the software project management system, we use “WorkItem” to track and mange the work to be finished for a project. Different work item type represents different type of work. For example, we use Defect item to track the bugs in the projects, Feature item to track the features to be implemented, and Task item to track the tasks to be finished for a feature. The data model for such system can be defined as Closed Scalar Property management system by giving following fact of such system:
1. A WorkItem type is a Group, and an instance of WorkItem is an Object.
2. Different WorkItem type has different set of properties. Different WorkItem types might share the same properties, such as create date.
3. The properties are usually scalar property, i.e., there is no collection or complex property. For example, a WorkItem can only be assigned to one person at any given time.
4. The chance of adding new WorkItem types or adding new properties to existing WorkItem is low.
There are many ways to design such a system, such as using relation model, or using XML model. The principle is that 1) if the system can be modeled in relational tables, then tries to use relational model, 2) changing table schema of a product server is not cost free, we should avoid change schema frequently. Based on the above assumption, we can use a single table to store all different kind of WorkItem instances. The columns of the tables contain all distinct properties, and there is another table to define the mapping how WorkItem Type + Property maps to the column name. Given a WorkItem instance, the values of most columns will be Null. The reason is that for the properties not belong to the target WorkItem type, the value will be Null. Also, to avoid frequently change table schemas, we should add all possible properties into the table definition in advanced, although these properties might rarely be used in the real case (that is the reason why item 4 above is true). For example, in one instance of the system, the defect type has 141 properties, the Issue type has 78 properties and Task has 88 properties defined. However, a typical defect work item only has less than 20 properties.
Two SQL Server 2008 semi-structured features can be used for tuning such system. The first feature is the Sparse Column feature. Given that many of the properties of a WorkItem is empty or null, changing some columns into sparse column will reduce storage size, and thus improve performance by reduce physical I/O and logical I/O. In following paragraphs, I will use one instance of the project management system as example to how I use sparse column feature to improve the system.
The instance has three tables, WorkItemsLatest and WorkitemsAre contain the up-to-date information about a workitem instance, and the WorkItemsWere contains history information. Whenever the property value of a workitem changes, the WorkItemsLast and WorkItemsAre will change to contain the new values, and one row will be added in WorkItermsWere for the tracking the history. The table has 31 Int columns, 40 DataTime columns, 63 Float Columns, and 226 NVarchar columns. The column and data distributions are shown in following tables:
has >99% null
has >80% null
has >60% null
The above fact shows that most of the columns are null and we can compress the Null values by using Sparse Column feature. I did experiments on WorkItemsLatest tables by using following techniques: 1) Rebuild the clustered index defined in the tables since the original table has some fragmentation 2) using Katmai Row Compression 3) Using Katmai Page Compression 4) Using sparse column for selected columns. For details about Katmai data compression features, please go to SQL Server Books Online and SQL Server Storage Engine’s Blog.
The above result shows that Page Compression has the best compression rate which is 16% of the original size, and the size of sparse column table is 31%, while row compression is 36%. Data Compression has a nice stored procedure sp_estimate_data_compression_savings which can estiamte the space saving by sampling data into tempdb, and compress the sampled data. The result is listed in “Estimated Saving” column in the table. I will also discuss the changing time later in this blog.
Sparse and Data Compression can both saving storage, and thus improve the performance of an I/O dominated application. However, since the data page is compressed, thus they both have CPU overhead when accessing the values of certain columns. I did some experiments on CPU overhead for the above tables. The query is quite simple, it selects 17 regular columns, plus 27 other sparse columns with at least 20% not null value plus 10 random sparse columns (most of them should be null) from the table. The result in the next table shows that the Page Compression has the less Physical I/O and Logical I/O. However, it has 78% more CPU overhead (In Warm Run, the response time has little different with CPU time), and the response time for both Cold Run and Warm Run is slower than without compression. Sparse Column has the 8% more CPU overhead, and it achieve the better performance result for Warm Run and Cold Run. The row compression also has larger CPU overhead (36%). The reason behind the result is that Sparse Column does not have negative impact on accessing regular columns, but Page Compression and Row Compression are in table level, the CPU overhead of uncompressing Not Null values are higher. Note, the experimental result here is only for reference, user need to select the appropriate feature according the target application/scenario, and study the impact of these features in details.
Response Time (ms)
for Cold Run
Another useful feature which can help to tune the performance is the Filtered Indices feature. The query pattern for such project management system is unlike the typical relational database system where the queries are mostly predefined, and usually filtered on certain columns. In this project management, user have freedom to define queries (through certain query builder like interface) to search any properties. The experiment shows that many of such queries have to use Table Scan because of lacking certain indices. However, it is not recommended to define many indices on the table because of the high index maintenance cost in term of CPU and I/O. Given that most of the properties are “Sparse” and they only appear in small percentage of the WorkIterm instance, and user are general not interested in querying Null values, we can take advantage of the Filtered Indices features to add more indices to the table. For example, if column “TestName” is a sparse property, and was used in several user queries, we can define a filtered index as:
create index idx_testname on WorkItemsAre(TestName) where TestName is not null
The index can be used on queries which has filter on TestName. The index size is quite small compared with regular index, and the maintenance cost is low because of less I/O. Note, it might be true that some of frequent referenced/queried properties are the “non-sparse” properties, however if they are not shared by many WorkItem types, we still can use Filtered Indices. I haven’t done experiment on filtered indices for this project, and I will report result here once I have the result.
Finally, I like to discuss how I did the experiment. Since data compression is table/partition level. I found it is trivial to change a table into compression. In addition, the SQL Server Management Studio has Dialog for changing storage types (by right click on a table or an index on Object Explorer, and select [Manage Compression]). The time of compression is also not too long. In my case, it took 3 minutes to compress the table with 521450 pages (my machine has 4 proc, and 8G memory).
However, changing columns into sparse is not so trivial. It need perform following four steps:
1. Find the null value distribution for all columns in a table
2. Find all candidate columns which can be turn on to null
3. Estimate the space saving and performance impact on the change
4. Find an appropriate way to change table schema.
Item 1 is quite simple to be implemented; we can just use count(a_column) to get the not null count for a column. For item 2, we have to take a lot of factor into consideration, such as: sparse column cannot be created at text type, or columns have default, unique constraint, index defined, and different type has different cut-off null ratio (such as only 98% of a bit column is null, we can turn it into sparse). The MSDN BOL on Using Sparse Column has the minimal null percentage requirement for different types in order to achieve 40% space saving. For item 3, we can use similar way as the compression did by sampling data into temp db and change columns into sparse. For item 4, it is not trivial as well, because currently we can also change one column into sparse in a DDL statement, and it is a size operation. Suppose you have ~200 columns to be changed to sparse, you can issue 200 alter table statement (of course, it will scan the table 200 times). The workaround is that you can move the data to a temp table, and truncate the original table, so that it will be a non-size operation. After that, you can copy your data back. The solution needs to be finding case by case. In the project management case, I can use the second option for two tables which took 20 minutes, but for one table, WorkItemsAre, which has timestamp defined, I have to use alter table ~200 times which took one and half hours. Another import fact is that alter columns into sparse will not automatic save space, user have to explicit rebuild index to reclaim the space. Finally, I use the tablediff Utility to make sure the data is consistent after applying sparse to the columns.
I attached one slide about my experiment on this project in this Blog. Also, I shared the code I used for changing columns into sparse. My next blog entry will be re-designing the system using new features available in SQL Server 2008.