Sql – What are the options for storing hierarchical data in a relational database

databasehierarchical-datarelational-databasesqltree

Good Overviews

Generally speaking, you're making a decision between fast read times (for example, nested set) or fast write times (adjacency list). Usually, you end up with a combination of the options below that best fit your needs. The following provides some in-depth reading:

One more Nested Intervals vs. Adjacency List comparison: the best comparison of Adjacency List, Materialized Path, Nested Set, and Nested Interval I've found.
Models for hierarchical data: slides with good explanations of tradeoffs and example usage
Representing hierarchies in MySQL: very good overview of Nested Set in particular
Hierarchical data in RDBMSs: a most comprehensive and well-organized set of links I've seen, but not much in the way of explanation

Options

Ones I am aware of and general features:

Adjacency List:

Columns: ID, ParentID
Easy to implement.
Cheap node moves, inserts, and deletes.
Expensive to find the level, ancestry & descendants, path
Avoid N+1 via Common Table Expressions in databases that support them

Nested Set (a.k.a Modified Preorder Tree Traversal)

Columns: Left, Right
Cheap ancestry, descendants
Very expensive O(n/2) moves, inserts, deletes due to volatile encoding

Bridge Table (a.k.a. Closure Table /w triggers)

Uses separate join table with ancestor, descendant, depth (optional)
Cheap ancestry and descendants
Writes costs O(log n) (size of the subtree) for insert, updates, deletes
Normalized encoding: good for RDBMS statistics & query planner in joins
Requires multiple rows per node

Lineage Column (a.k.a. Materialized Path, Path Enumeration)

Column: lineage (e.g. /parent/child/grandchild/etc…)
Cheap descendants via prefix query (e.g. LEFT(lineage, #) = '/enumerated/path')
Writes costs O(log n) (size of the subtree) for insert, updates, deletes
Non-relational: relies on Array datatype or serialized string format

Nested Intervals

Like nested set, but with real/float/decimal so that the encoding isn't volatile (inexpensive move/insert/delete)
Has real/float/decimal representation/precision issues
Matrix encoding variant adds ancestor encoding (materialized path) for "free", but with the added trickiness of linear algebra.

Flat Table

A modified Adjacency List that adds a Level and Rank (e.g. ordering) column to each record.
Cheap to iterate/paginate over
Expensive move and delete
Good Use: threaded discussion – forums / blog comments

Multiple lineage columns

Columns: one for each lineage level, refers to all the parents up to the root, levels down from the item's level are set to NULL
Cheap ancestors, descendants, level
Cheap insert, delete, move of the leaves
Expensive insert, delete, move of the internal nodes
Hard limit to how deep the hierarchy can be

Database Specific Notes

MySQL

Use session variables for Adjacency List

Oracle

Use CONNECT BY to traverse Adjacency Lists

PostgreSQL

ltree datatype for Materialized Path

SQL Server

General summary
2008 offers HierarchyId data type that appears to help with the Lineage Column approach and expand the depth that can be represented.

Best Answer

My favorite answer is as what the first sentence in this thread suggested. Use an Adjacency List to maintain the hierarchy and use Nested Sets to query the hierarchy.

The problem up until now has been that the coversion method from an Adjacecy List to Nested Sets has been frightfully slow because most people use the extreme RBAR method known as a "Push Stack" to do the conversion and has been considered to be way to expensive to reach the Nirvana of the simplicity of maintenance by the Adjacency List and the awesome performance of Nested Sets. As a result, most people end up having to settle for one or the other especially if there are more than, say, a lousy 100,000 nodes or so. Using the push stack method can take a whole day to do the conversion on what MLM'ers would consider to be a small million node hierarchy.

I thought I'd give Celko a bit of competition by coming up with a method to convert an Adjacency List to Nested sets at speeds that just seem impossible. Here's the performance of the push stack method on my i5 laptop.

Duration for     1,000 Nodes = 00:00:00:870 
Duration for    10,000 Nodes = 00:01:01:783 (70 times slower instead of just 10)
Duration for   100,000 Nodes = 00:49:59:730 (3,446 times slower instead of just 100) 
Duration for 1,000,000 Nodes = 'Didn't even try this'

And here's the duration for the new method (with the push stack method in parenthesis).

Duration for     1,000 Nodes = 00:00:00:053 (compared to 00:00:00:870)
Duration for    10,000 Nodes = 00:00:00:323 (compared to 00:01:01:783)
Duration for   100,000 Nodes = 00:00:03:867 (compared to 00:49:59:730)
Duration for 1,000,000 Nodes = 00:00:54:283 (compared to something like 2 days!!!)

Yes, that's correct. 1 million nodes converted in less than a minute and 100,000 nodes in under 4 seconds.

You can read about the new method and get a copy of the code at the following URL. http://www.sqlservercentral.com/articles/Hierarchy/94040/

I also developed a "pre-aggregated" hierarchy using similar methods. MLM'ers and people making bills of materials will be particularly interested in this article. http://www.sqlservercentral.com/articles/T-SQL/94570/

If you do stop by to take a look at either article, jump into the "Join the discussion" link and let me know what you think.

Best Answer

Related Solutions

Android – How to avoid concurrency problems when using SQLite on Android

What type of NoSQL database is best suited to store hierarchical data

Related Topic