Welcome to MSDN Blogs Sign in | Join | Help

Hierarchies (trees) in SQL Server 2005

There is much debate about how to implement hierarchies (trees) relationally in SQL Server. I decided to test the main techniques out and see which performed best for my application.

My problem is efficiently representing hierarchies in an experimental application code analysis database. The main hierarchy I am interested in is the class hierarchy. I want users to be able to determine quickly all the subclasses of a class. Class hierarchies tend to be broad and shallow. Since the .Net framework has about 18 500 classes (counting compiler generated classes) creation and query performance is an issue. In addition, since developers will be creating custom reports over the database, ease of query construction (queriability) is an issue.

There are three standard techniques for representing hierarchies in relation stores: the adjacency technique, the path technique, and the nested sets technique. A fourth approach based on floating point intervals (nested intervals) is sometimes mentioned but it is equivalent to the nested set approach with gaps and too hard for my users to understand.

Adjacency technique

In this representation, rows contain the IDs of their corresponding parent row. For example, the organization chart:

would be represented as:

Id

Name

Manager

1

Edgar

2

2

Andrew

null

3

Betty

2

4

Charles

3

5

Denise

3

This is the first representation many people think of and is often considered a poor practice. However, many operations are straightforward to implement and users can easily understand it.

Path technique

In this representation, rows contain the path to their parents. For example, the organization chart:

would be represented as:

Id

Name

Path

1

Edgar

2

2

Andrew

 

3

Betty

2

4

Charles

2,3

5

Denise

2,3

This makes queries like finding all the descendents of a node straightforward since you only need to look for all nodes with the appropriate path prefix. Because of index limitations, the trees cannot be too deep since to be efficient you need to index the Path column, which in SQL Server 2005 must be at most 900 bytes.

Nested sets

This is the approach championed by the indomitable Joe Celko (see 1, 2, 3 and 4). Essentially the idea is to label nodes with their left and right traversal ranks as a node is entered and exited respectively. For example, the organization chart:

would be represented as:

Id

Name

Left

Right

1

Edgar

8

9

2

Andrew

1

10

3

Betty

2

7

4

Charles

3

4

5

Denise

5

6

The representation makes operations like determining the descendents of a node and if two trees overlap easy because the left and right values bracket the left and right nodes of every descendent node.

The fundamental issue with the nested sets approach is that adding a node is very expensive (O(n)) since all subsequent nodes need to be renumbered.

Performance

Table 1 shows a simple benchmark for my problem involving 20 000 nodes. First the tree is created as 20 000 inserts and then the list of dependent nodes is generated for each of the 20 000 nodes in the tree. As expected, the adjacency technique works best for creating the tree. However, the surprising winner for finding all descendents of every node is also the adjacency technique.

Table 1 Time to create a 20 000 node tree and find all the descendents of each of the 20 000 nodes.

Technique

Create (secs)

All Descendents (secs)

Adjacency

4

4

Nested set

6072

43

Path

6

337

At first, I suspected a bug but my testing showed that all the techniques produce identical results. SQL Server 2005 does a good job of generating query plans for recursive CTE expressions but I think that because the whole table lives in cache and the tree is shallow (a maximum of 8 nodes deep) SQL Server can be very fast. It is also a great example of the power of the relational model since the result set is created in a highly parallelizable breadth-first way. As is often the case, a naïve implementation is faster because the runtime can optimize it better.

Because of the adjacency technique’s superior performance (for my shallow and broad trees) and queriability it is the most attractive choice. With more work I could probably improve the performance of the other techniques but the adjacency method has the performance and queriability I need. However, I am interested in any feedback you may have on issues I have missed.

Implementation experience

I found all three approaches easy to implement. The only time I had to think about what I was doing was figuring out what the best way to test for a varbinary prefix (I represented paths as strings of 4 byte node IDs). I am no T-SQL guru so I imagine any competent T-SQL developer could easily implement any of the approaches.

XML columns

The obvious SQL Server 20005 alternative is to encode the hierarchy in an XML column. Because classes are likely to relate to many other objects in the system, query patterns over classes are unpredictable and relational implementation performance is acceptable, I decided a pure relational implementation would give users the most power.

Published Wednesday, February 15, 2006 1:49 PM by AnthonyBloesch

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS

Comments

# re: Hierarchies (trees) in SQL Server 2005

There seems to be another mechanism for this in SQL Server 2005 (I found the link to it http://www.sqlservercentral.com/columnists/sSampath/recursivequeriesinsqlserver2005.asp)
Unfortunately I have not had a chance to try this out. Have you experimented with this technique?
Wednesday, February 15, 2006 6:40 PM by Roger Bucks

# re: Hierarchies (trees) in SQL Server 2005

This is the adjacency approach mentioned in the post. They use recursive CTE queries also.
In early betas of SQL Server 2005, recursive CTE queries generated poor query plans so I did this calculation with a stored procedure. However, now recursive CTE queries generate good query plans so I use CTE queries.
Wednesday, February 15, 2006 6:54 PM by AnthonyBloesch

# re: Broken image links

Images referring to your file system my friend

file://Tkzaw-pro-14/Mydocs3/ABloesch/My%20Documents/Projects/Blog/Tree%
Wednesday, February 15, 2006 7:15 PM by Chui

# re: Hierarchies (trees) in SQL Server 2005

Thanks for the update.
Do you have a sample of the CTE query that you might use in the example you give above?
(To understand how the performance figures were obtained, is it possible to link to/display the SQL used?)
Wednesday, February 15, 2006 7:24 PM by Roger Bucks

# re: Hierarchies (trees) in SQL Server 2005

Someone quoted your blog on an SQL Sever newsgroup that an adjacency list tbel is just fine.  I took a quick peek at the blog.  So here Ia m for details.

You seemed to be measring loading time, without telling us the fomat of the data. If you are loading (superior, subordinate) pairs without any cycle and gap checking.  You did do FULL checking on EACH edge of the graph? No cycles?   Do you have a solution to an NP-Complete Problem? Or a proof it is impossible?  Ghod, you will be famous!!

I have 20+ years of SQL in 177+ produccts and when I wrote my code, it sucked rocks.  Show what you do for data integrity. I am amazed because you did better than I have been able to do in 10+ years of trying!!  

How many thousand summary queries in a tree of depth 5, 10 and 20 levels did you do in the models?  What ere the numbers?  

If you do the adjacency list model done right, you have to insert (in_node, out_node) pairs and then check the ENTIRE schema for cycles with a TRIGGER every time.  Then you have to look for gaps to assure it is one tree.  And finally check that "# of nodes = (# of edges) - 1"

This is not an option -- it is vital.  I am waiting for $1000/day contract to come thru with an insurance company that forgot these constraints [Opps!! we are dead! and since we do not expect a right answer, it is always 42:)]

Test methods 20 levels deep and 20 levels across -- the schema for a 747 airplane parts explosion. Really.  

Sunday, March 05, 2006 10:01 PM by Cel;ko

# re: Hierarchies (trees) in SQL Server 2005

wonderful article!!

A question:

Where I can found the Class (namespace) Hierarchy Chart for .net 2.0? Just like the MFC Hierarchy Chart in MSDN2001? Is it

available from microsoft or MSDN? Can you give me a  link?

Thank you very much!!
Friday, June 23, 2006 12:24 AM by shanzy

# re: Hierarchies (trees) in SQL Server 2005

Do you have any data on the relative speed of the recursive CTE vs (your old) stored procedure versions of the adjacency model?

Have you tried a materialised ancestor version of the adjacency model (using either triggers or indexed views)?

Can you publish the code used in the tests?

Thursday, October 05, 2006 6:20 AM by Misha

# re: Hierarchies (trees) in SQL Server 2005

When you post such an article on msdn.com there should be considerable amount of test data that would proove what you put forward. I have been working in this area for last 10 years and agree with Celko 100%. Please dont misguide people if you cant guide them to the correct path

Sunday, January 28, 2007 10:25 AM by Chandrashekar Kollipara

# re: Hierarchies (trees) in SQL Server 2005

I know it is too late to ask, but where is the data to test your experiment?

Saturday, March 31, 2007 1:35 AM by Cesar

# re: Hierarchies (trees) in SQL Server 2005

To the person making comments that sound a bit nasty, and mentioning NP-Complete.  it may come as a surprise to you, but not everyone releases everything they have, and there are indeed some solutions out there you have not heard of.

Tuesday, November 25, 2008 9:06 PM by dan

# re: Hierarchies (trees) in SQL Server 2005

I am missing the classical datawarehouse approach. I know it is mostly focused on querying, but in my personal experiance, the insert performance is not too bad, depending on the model of inserts.

I have successfully used a combination of adjencency with a cache table for holding all relationships. I would have the following cache table

emp_cache

descendant,ancestor,distance

2,2,0

1,1,0

1,2,1

3,3,0

3,2,1

4,4,0

4,3,1

4,2,2

5,5,0

5,3,1

5,2,1

Additionally I would add a depth column to the base table, which allows a quicker computation of the distance value inside the trigger maintaining the cache and have the cache table indexed (depending on your query patterns). Inserting, updateing & deleting is done exactly as with adjacency, which makes sure there are no loops at all. Querying will use the cache table (& depth column). Querying the cache is quite natural to sql.

I used this to handle a cost center structure which was 17 levels deep and handled approx. 15.000 nodes.

Thursday, February 05, 2009 3:41 AM by mbuzina

# re: Hierarchies (trees) in SQL Server 2005

When using Nested set, if you want to retrieve all the tree (root node and all of its descendants), all you need to do is: SELECT * FROM tree ORDER BY left ASC. The process of generating the tree is supposed to be handled by your application. So the query needed is only 1, not 6072.

You can see the implementation of this tree in PHP ORM framework (Propel) ->

http://propel.phpdb.org/trac/wiki/Users/Documentation/1.3/Tree/NestedSet

Here is some code of the implementation of retrieving tree:

public static function retrieveTree(PropelPDO $con = null)

{

$c = new Criteria(TreePeer::DATABASE_NAME);

$c->addAscendingOrderByColumn(self::LEFT_COL);

/*

EXECUTE QUERY:

SELECT * FROM tree ORDER BY lft ASC

*/

$stmt = TreePeer::doSelectStmt($c, $con);

if (false !== ($row = $stmt->fetch(PDO::FETCH_NUM))) {

$root = new Tree();

$root->hydrate($row);

$root->setLevel(0);

$tree = self::hydrateDescendants($root, $stmt);

$stmt->closeCursor();

return $tree;

}

return false;

}

Note: only 1 query is executed.

Friday, April 03, 2009 5:37 AM by Yoseph

# re: Hierarchies (trees) in SQL Server 2005

Hi there,

I have a problem which i wonder if you could point me in direction how to solve.

I have a tree structure where each node have a "rating" column (int). For "composite" nodes, nodes with children the rating column should take the value of the max value of all its children. What kind of structure do i represent such a tree best in?

Thursday, May 07, 2009 6:31 PM by Niclas

# re: Hierarchies (trees) in SQL Server 2005

The high value for all descendents of the path model makes me think the path column was not indexed - can you confirm whether it was or not?

Thursday, September 10, 2009 8:23 AM by OrbMan

Leave a Comment

(required) 
required 
(required) 

  
Enter Code Here: Required
 
Page view tracker