Sql – What are the options for storing hierarchical data in a relational database

databasehierarchical-datarelational-databasesqltree

Good Overviews

Generally speaking, you're making a decision between fast read times (for example, nested set) or fast write times (adjacency list). Usually, you end up with a combination of the options below that best fit your needs. The following provides some in-depth reading:

Options

Ones I am aware of and general features:

  1. Adjacency List:
  • Columns: ID, ParentID
  • Easy to implement.
  • Cheap node moves, inserts, and deletes.
  • Expensive to find the level, ancestry & descendants, path
  • Avoid N+1 via Common Table Expressions in databases that support them
  1. Nested Set (a.k.a Modified Preorder Tree Traversal)
  • Columns: Left, Right
  • Cheap ancestry, descendants
  • Very expensive O(n/2) moves, inserts, deletes due to volatile encoding
  1. Bridge Table (a.k.a. Closure Table /w triggers)
  • Uses separate join table with ancestor, descendant, depth (optional)
  • Cheap ancestry and descendants
  • Writes costs O(log n) (size of the subtree) for insert, updates, deletes
  • Normalized encoding: good for RDBMS statistics & query planner in joins
  • Requires multiple rows per node
  1. Lineage Column (a.k.a. Materialized Path, Path Enumeration)
  • Column: lineage (e.g. /parent/child/grandchild/etc…)
  • Cheap descendants via prefix query (e.g. LEFT(lineage, #) = '/enumerated/path')
  • Writes costs O(log n) (size of the subtree) for insert, updates, deletes
  • Non-relational: relies on Array datatype or serialized string format
  1. Nested Intervals
  • Like nested set, but with real/float/decimal so that the encoding isn't volatile (inexpensive move/insert/delete)
  • Has real/float/decimal representation/precision issues
  • Matrix encoding variant adds ancestor encoding (materialized path) for "free", but with the added trickiness of linear algebra.
  1. Flat Table
  • A modified Adjacency List that adds a Level and Rank (e.g. ordering) column to each record.
  • Cheap to iterate/paginate over
  • Expensive move and delete
  • Good Use: threaded discussion – forums / blog comments
  1. Multiple lineage columns
  • Columns: one for each lineage level, refers to all the parents up to the root, levels down from the item's level are set to NULL
  • Cheap ancestors, descendants, level
  • Cheap insert, delete, move of the leaves
  • Expensive insert, delete, move of the internal nodes
  • Hard limit to how deep the hierarchy can be

Database Specific Notes

MySQL

Oracle

PostgreSQL

SQL Server

  • General summary
  • 2008 offers HierarchyId data type that appears to help with the Lineage Column approach and expand the depth that can be represented.

Best Answer

My favorite answer is as what the first sentence in this thread suggested. Use an Adjacency List to maintain the hierarchy and use Nested Sets to query the hierarchy.

The problem up until now has been that the coversion method from an Adjacecy List to Nested Sets has been frightfully slow because most people use the extreme RBAR method known as a "Push Stack" to do the conversion and has been considered to be way to expensive to reach the Nirvana of the simplicity of maintenance by the Adjacency List and the awesome performance of Nested Sets. As a result, most people end up having to settle for one or the other especially if there are more than, say, a lousy 100,000 nodes or so. Using the push stack method can take a whole day to do the conversion on what MLM'ers would consider to be a small million node hierarchy.

I thought I'd give Celko a bit of competition by coming up with a method to convert an Adjacency List to Nested sets at speeds that just seem impossible. Here's the performance of the push stack method on my i5 laptop.

Duration for     1,000 Nodes = 00:00:00:870 
Duration for    10,000 Nodes = 00:01:01:783 (70 times slower instead of just 10)
Duration for   100,000 Nodes = 00:49:59:730 (3,446 times slower instead of just 100) 
Duration for 1,000,000 Nodes = 'Didn't even try this'

And here's the duration for the new method (with the push stack method in parenthesis).

Duration for     1,000 Nodes = 00:00:00:053 (compared to 00:00:00:870)
Duration for    10,000 Nodes = 00:00:00:323 (compared to 00:01:01:783)
Duration for   100,000 Nodes = 00:00:03:867 (compared to 00:49:59:730)
Duration for 1,000,000 Nodes = 00:00:54:283 (compared to something like 2 days!!!)

Yes, that's correct. 1 million nodes converted in less than a minute and 100,000 nodes in under 4 seconds.

You can read about the new method and get a copy of the code at the following URL. http://www.sqlservercentral.com/articles/Hierarchy/94040/

I also developed a "pre-aggregated" hierarchy using similar methods. MLM'ers and people making bills of materials will be particularly interested in this article. http://www.sqlservercentral.com/articles/T-SQL/94570/

If you do stop by to take a look at either article, jump into the "Join the discussion" link and let me know what you think.

Related Topic