Optimal way to model documents hierarchy in CouchDB

I'm trying to model a document hierarchy in CouchDB to use in my system, which is conceptually similar to a blog. Each blog post belongs to at least one category and each category can have many posts. Categories are hierarchical, meaning that if a post belongs to CatB in the hierarchy "CatA->CatB" ("CatB is in CatA"), it also belongs to CatA.

Users must be able to quickly find all posts in a category (and all its children).

Solution 1
Each document of the post type contains a "category" array representing its position in the hierarchy (see 2).

{
   "_id": "8e7a440862347a22f4a1b2ca7f000e83",
   "type": "post",
   "author": "dexter",
   "title": "Hello",
   "category":["OO","Programming","C++"]
}

Solution 2
Each document of the post type contains the "category" string representing its path in the hierarchy (see 4).

{
   "_id": "8e7a440862347a22f4a1b2ca7f000e83",
   "type": "post",
   "author": "dexter",
   "title": "Hello",
   "category": "OO/Programming/C++"
}

Solution 3
Each document of the post type contains its parent "category" id representing its path in the hierarchy (see 3). A hierarchical category structure is built through linked "category" document types.

{
   "_id": "8e7a440862347a22f4a1b2ca7f000e83",
   "type": "post",
   "author": "dexter",
   "title": "Hello",
   "category_id": "3"
}

{
   "_id": "1",
   "type": "category",
   "name": "OO"
}


{
   "_id": "2",
   "type": "category",
   "name": "Programming",
   "parent": "1"
}


{
   "_id": "3",
   "type": "category",
   "name": "C++",
   "parent": "2"
}

Question

What's the best way to store this kind of relationship in CouchDB? What's the most efficient solution in terms of disk space, scalability and retrieval speed?

Can such a relation be modelled to take into account localised category names?

Disclaimer

I know this question has been asked a few times already here on SO, but it seems there's no definitive answer to it nor an answer which deals with the pros and cons of each solution. Sorry for the length of the question 🙂

Read so far

CouchDB – The Definitive Guide

Storing Hierarchical Data in CouchDB

Retrieving Hierarchical/Nested Data From CouchDB

Using CouchDB group_level for hierarchical data

Best Answer

There's no right answer to this question, hence the lack of a definitive answer. It mostly depends on what kind of usage you want to optimize for.

You state that the retrieval speed of documents that belong to a certain category (and its children) is most important. The first two solutions allow you to create a view that emits a blog post multiple times, once for each category in the chain from the leaf to the root. Selecting all documents in a category can thus be done with a single (and therefore fast) query. The only difference between the second solution and the first is that the parsing of the category "path" into components moves from the code that inserts the document to the map function of the view. I would prefer the first solution, as its map function is simpler to implement and it is a bit more flexible (e.g. it allows a category's name to contain a slash character).
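As a minimal sketch of such a view, assuming post documents shaped like the examples in the question: the function names mapByCategoryArray and mapByCategoryPath are made up for illustration, and the emit stub and rows array only exist so the snippet runs outside CouchDB. Inside a design document, only the body of one map function would be stored.

```javascript
// Stand-in for CouchDB's built-in emit(), so the sketch runs in plain JavaScript.
var rows = [];
function emit(key, value) {
    rows.push({ key: key, value: value });
}

// Solution 1: emit the post once for every category in its chain.
function mapByCategoryArray(doc) {
    if (doc.type === 'post' && Array.isArray(doc.category)) {
        doc.category.forEach(function (name) {
            // null value: the post itself can be fetched with include_docs=true
            emit(name, null);
        });
    }
}

// Solution 2: the same idea, but the path string is split inside the view.
function mapByCategoryPath(doc) {
    if (doc.type === 'post' && typeof doc.category === 'string') {
        doc.category.split('/').forEach(function (name) {
            emit(name, null);
        });
    }
}

mapByCategoryArray({ _id: 'p1', type: 'post', category: ['OO', 'Programming', 'C++'] });
mapByCategoryPath({ _id: 'p2', type: 'post', category: 'OO/Programming/C++' });
```

Both calls produce one row per category, so a query with key="Programming" would match either post.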

In your scenario you probably also want to create a reduced view which counts the number of blog posts for each category. This is very simple with either of these solutions. With a fitting reduce function, the number of posts in every category can be retrieved using a single request.

A downside of the first two solutions is that renaming a category, or moving it from one parent to another, requires every post document in that category (and its subcategories) to be updated. The third solution allows that without touching the post documents. But from the description of your scenario I assume that retrieval by category is very frequent and category renaming/moving is very rare.

Solution 4

I propose a fourth solution where blog post documents hold references to category documents but still reference all the ancestors of the post's category. This allows categories to be renamed without touching the blog posts and allows you to store additional metadata with a category (e.g. translations of the category name or a description):

{
    "_id": "8e7a440862347a22f4a1b2ca7f000e83",
    "type": "post",
    "author": "dexter",
    "title": "Hello",
    "category_ids": ["3", "2", "1"]
}

{
    "_id": "1",
    "type": "category",
    "name": "OO"
}

{
    "_id": "2",
    "type": "category",
    "name": "Programming",
    "parent": "1"
}


{
    "_id": "3",
    "type": "category",
    "name": "C++",
    "parent": "2"
}

You will still have to store each category's parent with the category document, even though this duplicates information already held in the posts, so that the category tree can be traversed (e.g. for displaying a tree of categories for navigation).

You can extend this solution or any of your solutions to allow a post to be categorized under multiple categories, or a category to have multiple parents. When a post is categorized in multiple categories, you will need to store the union of the ancestors of each category in the post's document while preserving the categories selected by the author to allow them to be displayed with the post or edited later.

Let's assume that there is an additional category named "Ajax" with the ancestors "JavaScript", "Programming" and "OO". To simplify the following example, I've chosen the document IDs of the categories to equal the category names.

{
    "_id": "8e7a440862347a22f4a1b2ca7f000e83",
    "type": "post",
    "author": "dexter",
    "title": "Hello",
    "category_ids": ["C++", "Ajax"],
    "category_anchestor_ids": ["C++", "Programming", "OO", "Ajax", "JavaScript"]
}

To allow a category to have multiple parents, just store multiple parent IDs with a category. You will need to eliminate duplicates while finding all the ancestors of a category.
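The ancestor collection could look roughly like the sketch below, which would run in the application code that saves a post. It is only an illustration: the parent_ids field, the categoriesById lookup and the collectAncestorIds function are assumed names, not part of the examples above.

```javascript
// Hypothetical in-memory lookup of category documents by _id.
// "parent_ids" is an assumed field name for categories with several parents.
var categoriesById = {
    'OO':          { _id: 'OO',          type: 'category', parent_ids: [] },
    'Programming': { _id: 'Programming', type: 'category', parent_ids: ['OO'] },
    'C++':         { _id: 'C++',         type: 'category', parent_ids: ['Programming'] },
    'JavaScript':  { _id: 'JavaScript',  type: 'category', parent_ids: ['Programming'] },
    'Ajax':        { _id: 'Ajax',        type: 'category', parent_ids: ['JavaScript'] }
};

// Returns the selected categories plus all of their ancestors, without duplicates.
function collectAncestorIds(selectedIds, categories) {
    var seen = Object.create(null); // no inherited keys, so any ID string is safe
    var result = [];
    var queue = selectedIds.slice();
    while (queue.length > 0) {
        var id = queue.shift();
        if (seen[id]) {
            continue; // already reached through another path
        }
        seen[id] = true;
        result.push(id);
        var category = categories[id];
        if (category) {
            queue = queue.concat(category.parent_ids || []);
        }
    }
    return result;
}

// Union of ancestors for a post filed under both "C++" and "Ajax".
var ancestorIds = collectAncestorIds(['C++', 'Ajax'], categoriesById);
```

The seen map is what removes the duplicates: "Programming" is reachable from both "C++" and "JavaScript" but ends up in the result only once.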

View for Solution 4

Suppose you want to get all the blog posts for a specific category. We will use a database with the following sample data:

{ "_id": "100", "type": "category", "name": "OO"                              }
{ "_id": "101", "type": "category", "name": "Programming", "parent_id": "100" }
{ "_id": "102", "type": "category", "name": "C++",         "parent_id": "101" }
{ "_id": "103", "type": "category", "name": "JavaScript",  "parent_id": "101" }
{ "_id": "104", "type": "category", "name": "AJAX",        "parent_id": "103" }

{ "_id": "200", "type": "post", "title": "OO Post",          "category_id": "100", "category_anchestor_ids": ["100"]                      }
{ "_id": "201", "type": "post", "title": "Programming Post", "category_id": "101", "category_anchestor_ids": ["101", "100"]               }
{ "_id": "202", "type": "post", "title": "C++ Post",         "category_id": "102", "category_anchestor_ids": ["102", "101", "100"]        }
{ "_id": "203", "type": "post", "title": "AJAX Post",        "category_id": "104", "category_anchestor_ids": ["104", "103", "101", "100"] }

In addition to that, we use a view called posts_by_category in a design document called _design/blog with the following map function:

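function (doc) {
    // Index each post under its own category and every ancestor category.
    if (doc.type === 'post' && doc.category_anchestor_ids) {
        for (var i = 0; i < doc.category_anchestor_ids.length; i++) {
            // Single-element array key, so group_level=1 works with the reduce.
            emit([doc.category_anchestor_ids[i]], doc);
        }
    }
}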

Then we can get all the posts in the Programming category (which has ID "101") or one of its subcategories using a GET request to the following URL:

http://localhost:5984/so/_design/blog/_view/posts_by_category?reduce=false&key=["101"]

This will return a view result with the keys set to the category ID and the values set to the post documents. The same view can also be used to get a summary list of all categories and the number of posts in each category and its children. We add the following reduce function to the view:

function (keys, values, rereduce) {
    if (rereduce) {
        return sum(values)
    } else {
        return values.length
    }
}

And then we use the following URL:

http://localhost:5984/so/_design/blog/_view/posts_by_category?group_level=1

This will return a reduced view result with the keys again set to the category ID and the values set to the number of posts in each category. In this example, the category names would have to be fetched separately, but it is possible to create a view where each row in the reduced view result already contains the category name.
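Many HTTP clients need the brackets and quotes in these URLs to be percent-encoded. As a rough sketch, the two URLs above could be built as follows. The buildViewUrl helper is made up for this example, and it assumes that every parameter passed in takes a JSON value (as key, reduce and group_level do).

```javascript
// Builds a view URL. Assumption: every query value is JSON (key, startkey,
// endkey, reduce, group_level, ...), so it is JSON-encoded and then URL-encoded.
function buildViewUrl(baseUrl, db, designDoc, view, params) {
    var query = Object.keys(params).map(function (name) {
        return encodeURIComponent(name) + '=' +
               encodeURIComponent(JSON.stringify(params[name]));
    }).join('&');
    return baseUrl + '/' + encodeURIComponent(db) +
           '/_design/' + encodeURIComponent(designDoc) +
           '/_view/' + encodeURIComponent(view) +
           (query ? '?' + query : '');
}

// All posts in category "101" or one of its subcategories.
var postsUrl = buildViewUrl('http://localhost:5984', 'so', 'blog', 'posts_by_category',
                            { reduce: false, key: ['101'] });

// Post counts per category.
var countsUrl = buildViewUrl('http://localhost:5984', 'so', 'blog', 'posts_by_category',
                             { group_level: 1 });
```

The encoded form of key=["101"] is key=%5B%22101%22%5D, which CouchDB decodes back to the same JSON array.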

Related Solutions

Principles for Modeling CouchDB Documents

There have been some great answers to this already, but I wanted to add some more recent CouchDB features to the mix of options for working with the original situation described by viatropos.

The key point at which to split up documents is where there might be conflicts (as mentioned earlier). You should never keep massively "tangled" documents together in a single document as you'll get a single revision path for completely unrelated updates (comment addition adding a revision to the entire site document for instance). Managing the relationships or connections between various, smaller documents can be confusing at first, but CouchDB provides several options for combining disparate pieces into single responses.

The first big one is view collation. When you emit key/value pairs into the results of a map/reduce query, the keys are sorted according to CouchDB's view collation rules (strings are compared using Unicode collation, so "a" comes before "b"). You can also output complex keys from your map/reduce as JSON arrays: ["a", "b", "c"]. Doing that would allow you to include a "tree" of sorts built out of array keys. Using your example above, we can output the post_id, then the type of thing we're referencing, then its ID (if needed). If we then output the ID of the referenced document into an object in the value that's returned, we can use the 'include_docs' query param to include those documents in the map/reduce output:

{"rows":[
  {"key":["123412804910820", "author", "Lance1231"], "value":{"_id":"Lance1231"}},
  {"key":["123412804910820", "comment", "comment1"], "value":{"_id":"comment1"}},
  {"key":["123412804910820", "comment", "comment2"], "value":{"_id":"comment2"}},
  {"key":["123412804910820", "post"], "value":null}
]}

Requesting that same view with '?include_docs=true' will add a 'doc' key that will either use the '_id' referenced in the 'value' object or if that isn't present in the 'value' object, it will use the '_id' of the document from which the row was emitted (in this case the 'post' document). Please note, these results would include an 'id' field referencing the source document from which the emit was made. I left it out for space and readability.
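A map function that produces rows like these might look like the sketch below. It assumes, purely for illustration, that each post document carries an author_id field and a comment_ids array; the emit stub and the mapPostParts name are likewise stand-ins so the snippet runs outside CouchDB.

```javascript
// Stand-in for CouchDB's built-in emit(), so the sketch runs in plain JavaScript.
var rows = [];
function emit(key, value) {
    rows.push({ key: key, value: value });
}

// Emits one row for the post and one row per linked document.
// CouchDB sorts the rows by key afterwards, whatever the emit order was.
function mapPostParts(doc) {
    if (doc.type !== 'post') {
        return;
    }
    // No "_id" in the value: include_docs=true returns the post itself.
    emit([doc._id, 'post'], null);
    if (doc.author_id) {
        // "_id" in the value: include_docs=true returns the author document.
        emit([doc._id, 'author', doc.author_id], { _id: doc.author_id });
    }
    (doc.comment_ids || []).forEach(function (commentId) {
        emit([doc._id, 'comment', commentId], { _id: commentId });
    });
}

mapPostParts({
    _id: '123412804910820',
    type: 'post',
    author_id: 'Lance1231',               // assumed field name
    comment_ids: ['comment1', 'comment2'] // assumed field name
});
```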

We can then use the 'start_key' and 'end_key' parameters to filter the results down to a single post's data:

?start_key=["123412804910820"]&end_key=["123412804910820", {}, {}]

Or even specifically extract the list for a certain type:

?start_key=["123412804910820", "comment"]&end_key=["123412804910820", "comment", {}]

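These query param combinations are possible because of CouchDB's collation order: null, false, true, numbers, strings, arrays, and then objects. An empty object ({}) therefore sorts after any string or array and works as an upper bound, while null sorts before everything else.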

The second helpful addition from CouchDB in these situations is the _list function. This would allow you to run the above results through a templating system of some kind (if you want HTML, XML, CSV or whatever back), or output a unified JSON structure if you want to be able to request an entire post's content (including author and comment data) with a single request and returned as a single JSON document that matches what your client-side/UI code needs. Doing that would allow you to request the post's unified output document this way:

/db/_design/app/_list/unified/posts?start_key=["123412804910820"]&end_key=["123412804910820", {}, {}]&include_docs=true

Your _list function (in this case named "unified") would take the results of the view map/reduce (in this case named "posts") and run them through a JavaScript function that would send back the HTTP response in the content type you need (JSON, HTML, etc).
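A very small sketch of what such a "unified" function could do is shown below. The start, getRow and send functions are what CouchDB provides inside a _list function; here they are replaced by simple stand-ins so the snippet runs on its own. The shape of the result object (post, author, comments) and the sample rows are assumptions for illustration.

```javascript
// Stand-ins for the functions CouchDB makes available inside a _list function.
var pendingRows = [
    { key: ['123412804910820', 'author', 'Lance1231'], doc: { _id: 'Lance1231', name: 'Lance' } },
    { key: ['123412804910820', 'comment', 'comment1'], doc: { _id: 'comment1', text: 'First' } },
    { key: ['123412804910820', 'comment', 'comment2'], doc: { _id: 'comment2', text: 'Second' } },
    { key: ['123412804910820', 'post'], doc: { _id: '123412804910820', title: 'Hello' } }
];
var output = '';
function start(response) { /* CouchDB would set the response headers here */ }
function getRow() { return pendingRows.shift() || null; }
function send(chunk) { output += chunk; }

// The _list function: folds the view rows into one JSON document.
function unified(head, req) {
    start({ headers: { 'Content-Type': 'application/json' } });
    var result = { post: null, author: null, comments: [] };
    var row;
    while ((row = getRow())) {
        var kind = row.key[1]; // second key element names the kind of row
        if (kind === 'post') {
            result.post = row.doc;
        } else if (kind === 'author') {
            result.author = row.doc;
        } else if (kind === 'comment') {
            result.comments.push(row.doc);
        }
    }
    send(JSON.stringify(result));
}

unified({}, {});
```

Because the rows carry their kind in the key, the function does not depend on the order in which they arrive.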

Combining these things, you can split up your documents at whatever level you find useful and "safe" for updates, conflicts, and replication, and then put them back together as needed when they're requested.

Hope that helps.

Sql – What are the options for storing hierarchical data in a relational database

My favorite answer is what the first sentence in this thread suggested: use an Adjacency List to maintain the hierarchy and use Nested Sets to query the hierarchy.
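To make the two representations concrete, here is a small sketch that derives Nested Sets bounds from an Adjacency List with a plain recursive walk. It is written in JavaScript to match the CouchDB examples on this page, it is not the T-SQL method from the articles linked below, and the names (nodes, toNestedSets, descendantsOf, lft, rgt) are chosen for illustration only.

```javascript
// Adjacency List: each node only knows its parent (null marks the root).
var nodes = [
    { id: 1, parent: null },
    { id: 2, parent: 1 },
    { id: 3, parent: 1 },
    { id: 4, parent: 2 }
];

// Nested Sets: number the nodes in a depth-first walk, so that every
// descendant's (lft, rgt) pair lies inside its ancestor's pair.
function toNestedSets(nodes) {
    var children = {};
    var root = null;
    nodes.forEach(function (n) {
        if (n.parent === null) {
            root = n.id;
            return;
        }
        (children[n.parent] = children[n.parent] || []).push(n.id);
    });
    var bounds = {};
    var counter = 0;
    function visit(id) {
        bounds[id] = { lft: ++counter, rgt: 0 };
        (children[id] || []).forEach(function (childId) {
            visit(childId);
        });
        bounds[id].rgt = ++counter;
    }
    if (root !== null) {
        visit(root);
    }
    return bounds;
}

// Querying a subtree needs no recursion: it is a simple range comparison.
function descendantsOf(id, nodes, bounds) {
    return nodes.filter(function (n) {
        return bounds[n.id].lft > bounds[id].lft && bounds[n.id].rgt < bounds[id].rgt;
    }).map(function (n) {
        return n.id;
    });
}

var bounds = toNestedSets(nodes);
var subtree = descendantsOf(1, nodes, bounds);
```

Maintenance stays cheap on the adjacency side (moving a node changes one parent value), while subtree reads use only the precomputed bounds.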

The problem up until now has been that the conversion from an Adjacency List to Nested Sets has been frightfully slow, because most people use the extreme RBAR method known as a "Push Stack" to do the conversion. It has been considered way too expensive to reach the Nirvana of the simple maintenance of the Adjacency List combined with the awesome performance of Nested Sets. As a result, most people end up having to settle for one or the other, especially if there are more than, say, a lousy 100,000 nodes or so. Using the push stack method can take a whole day to do the conversion on what MLM'ers would consider to be a small million-node hierarchy.

I thought I'd give Celko a bit of competition by coming up with a method to convert an Adjacency List to Nested sets at speeds that just seem impossible. Here's the performance of the push stack method on my i5 laptop.

Duration for     1,000 Nodes = 00:00:00:870 
Duration for    10,000 Nodes = 00:01:01:783 (70 times slower instead of just 10)
Duration for   100,000 Nodes = 00:49:59:730 (3,446 times slower instead of just 100) 
Duration for 1,000,000 Nodes = 'Didn't even try this'

And here's the duration for the new method (with the push stack method in parentheses).

Duration for     1,000 Nodes = 00:00:00:053 (compared to 00:00:00:870)
Duration for    10,000 Nodes = 00:00:00:323 (compared to 00:01:01:783)
Duration for   100,000 Nodes = 00:00:03:867 (compared to 00:49:59:730)
Duration for 1,000,000 Nodes = 00:00:54:283 (compared to something like 2 days!!!)

Yes, that's correct. 1 million nodes converted in less than a minute and 100,000 nodes in under 4 seconds.

You can read about the new method and get a copy of the code at the following URL. http://www.sqlservercentral.com/articles/Hierarchy/94040/

I also developed a "pre-aggregated" hierarchy using similar methods. MLM'ers and people making bills of materials will be particularly interested in this article. http://www.sqlservercentral.com/articles/T-SQL/94570/

If you do stop by to take a look at either article, jump into the "Join the discussion" link and let me know what you think.
