Principles for Modeling CouchDB Documents

I have a question that I've been trying to answer for some time now but can't figure out:

How do you design, or divide up, CouchDB documents?

Take a Blog Post for example.

The semi "relational" way to do it would be to create a few objects:

Post
User
Comment
Tag
Snippet

This makes a great deal of sense. But I am trying to use couchdb (for all the reasons that it's great) to model the same thing and it's been extremely difficult.

Most of the blog posts out there give you an easy example of how to do this. They basically divide it up the same way, but say you can add 'arbitrary' properties to each document, which is definitely nice. So you'd have something like this in CouchDB:

Post (with tags and snippets "pseudo" models in the doc)
Comment
User

Some people would even say you could throw the Comment and User in there, so you'd have this:


post {
    id: 123412804910820
    title: "My Post"
    body: "Lots of Content"
    html: "<p>Lots of Content</p>"
    author: {
        name: "Lance"
        age: "23"
    }
    tags: ["sample", "post"]
    comments {
        comment {
            id: 93930414809
            body: "Interesting Post"
        } 
        comment {
            id: 19018301989
            body: "I agree"
        }
    }
}

That looks very nice and is easy to understand. I also understand how you could write views that extracted just the Comments from all your Post documents, to get them into Comment models, same with Users and Tags.

But then I think, "why not just put my whole site into a single document?":


site {
    domain: "www.blog.com"
    owner: "me"
    pages {
        page {
            title: "Blog"
            posts {
                post {
                    id: 123412804910820
                    title: "My Post"
                    body: "Lots of Content"
                    html: "<p>Lots of Content</p>"
                    author: {
                        name: "Lance"
                        age: "23"
                    }
                    tags: ["sample", "post"]
                    comments {
                        comment {
                            id: 93930414809
                            body: "Interesting Post"
                        } 
                        comment {
                            id: 19018301989
                            body: "I agree"
                        }
                    }
                }
                post {
                    id: 18091890192984
                    title: "Second Post"
                    ...
                }
            }
        }
    }
}

You could easily make views to find what you wanted with that.

Then the question I have is, how do you determine when to divide the document into smaller documents, or when to make "RELATIONS" between the documents?

I think it would be much more "Object Oriented", and easier to map to Value Objects, if it were divided like so:


posts {
    post {
        id: 123412804910820
        title: "My Post"
        body: "Lots of Content"
        html: "<p>Lots of Content</p>"
        author_id: "Lance1231"
        tags: ["sample", "post"]
    }
}
authors {
    author {
        id: "Lance1231"
        name: "Lance"
        age: "23"
    }
}
comments {
    comment {
        id: "comment1"
        body: "Interesting Post"
        post_id: 123412804910820
    } 
    comment {
        id: "comment2"
        body: "I agree"
        post_id: 123412804910820
    }
}

… but then it starts looking more like a relational database. And oftentimes I inherit something that looks like the "whole-site-in-a-document" structure, which makes it more difficult to model with relations.

I've read lots of things about how/when to use Relational Databases vs. Document Databases, so that's not the main issue here. I'm more just wondering, what's a good rule/principle to apply when modeling data in CouchDB.

Another example is with XML files/data. Some XML data has nesting 10+ levels deep, and I would like to visualize that using the same client (Ajax on Rails for instance, or Flex) that I would to render JSON from ActiveRecord, CouchRest, or any other Object Relational Mapper. Sometimes I get huge XML files that are the entire site structure, like the one below, and I'd need to map it to Value Objects to use in my Rails app so I don't have to write another way of serializing/deserializing data:


<pages>
    <page>
        <subPages>
            <subPage>
                <images>
                    <image>
                        <url/>
                    </image>
                </images>
            </subPage>
        </subPages>
    </page>
</pages>

So the general CouchDB questions are:

What rules/principles do you use to divide up your documents (relationships, etc)?
Is it okay to put the entire site into one document?
If so, how do you handle serializing/deserializing documents with arbitrary depth levels (like the large JSON example above, or the XML example)?
Or do you not turn them into VOs, do you just decide "these ones are too nested to Object-Relational Map, so I'll just access them using raw XML/JSON methods"?

Thanks a lot for your help. Deciding how to divide up data in CouchDB has been difficult for me, and I haven't yet been able to say "this is how I should do it from now on". I hope to get there soon.

I have studied the following sites/projects.

…but they still haven't answered this question.

Best Answer

There have been some great answers to this already, but I wanted to add some more recent CouchDB features to the mix of options for working with the original situation described by viatropos.

The key point at which to split up documents is where there might be conflicts (as mentioned earlier). You should never keep massively "tangled" documents together in a single document as you'll get a single revision path for completely unrelated updates (comment addition adding a revision to the entire site document for instance). Managing the relationships or connections between various, smaller documents can be confusing at first, but CouchDB provides several options for combining disparate pieces into single responses.
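As a rough illustration of that split, here is a minimal sketch that turns one nested post (like the first example in the question) into separate documents linked by ids. The `type` field, the `splitPost` name, the `authorId` argument, and the assumption that comments arrive as an array are all mine, not part of the question.

```javascript
// Split one nested post into separate post, author, and comment documents.
// Assumptions: `nested.comments` is an array, and the caller supplies the
// author's id because the nested author object in the question has none.
function splitPost(nested, authorId) {
  var postId = String(nested.id);
  var post = {
    _id: postId,
    type: 'post',
    title: nested.title,
    body: nested.body,
    html: nested.html,
    author_id: authorId,
    tags: nested.tags
  };
  var author = Object.assign({ _id: authorId, type: 'author' }, nested.author);
  var comments = (nested.comments || []).map(function (c) {
    // Each comment becomes its own document pointing back at the post.
    return { _id: String(c.id), type: 'comment', body: c.body, post_id: postId };
  });
  return [post, author].concat(comments);
}

var docs = splitPost({
  id: 123412804910820,
  title: 'My Post',
  body: 'Lots of Content',
  html: '<p>Lots of Content</p>',
  author: { name: 'Lance', age: '23' },
  tags: ['sample', 'post'],
  comments: [
    { id: 93930414809, body: 'Interesting Post' },
    { id: 19018301989, body: 'I agree' }
  ]
}, 'Lance1231');
```

With this layout, adding a comment creates a new small document instead of writing a new revision of the post, so unrelated updates no longer share one revision path.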

The first big one is view collation. When you emit key/value pairs into the results of a map/reduce query, the keys are sorted using Unicode collation ("a" comes before "b"). You can also output complex keys from your map/reduce as JSON arrays: ["a", "b", "c"]. Doing that allows you to build a "tree" of sorts out of array keys. Using your example above, we can output the post_id, then the type of thing we're referencing, then its ID (if needed). If we then output the ID of the referenced document into an object in the value that's returned, we can use the 'include_docs' query param to include those documents in the map/reduce output:

{"rows":[
  {"key":["123412804910820", "author", "Lance1231"], "value":{"_id":"Lance1231"}},
  {"key":["123412804910820", "comment", "comment1"], "value":{"_id":"comment1"}},
  {"key":["123412804910820", "comment", "comment2"], "value":{"_id":"comment2"}},
  {"key":["123412804910820", "post"], "value":null}
]}

Requesting that same view with '?include_docs=true' will add a 'doc' key that will either use the '_id' referenced in the 'value' object or if that isn't present in the 'value' object, it will use the '_id' of the document from which the row was emitted (in this case the 'post' document). Please note, these results would include an 'id' field referencing the source document from which the emit was made. I left it out for space and readability.
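A map function that produces rows like these might look roughly as follows. This is a sketch, assuming each document carries a `type` field and that posts and comments use the `author_id` and `post_id` fields from the question's last example. The `emit` stub and the `postsMap` name exist only so the snippet runs outside CouchDB; in a design document the function would be stored anonymously and `emit` is provided by the view server.

```javascript
// Stand-in for CouchDB's emit(): collects rows so the sketch can run locally.
var rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// Map function: key every row by post id first, then by the kind of thing.
function postsMap(doc) {
  if (doc.type === 'post') {
    // value is null, so include_docs=true loads the post itself
    emit([doc._id, 'post'], null);
    // value._id tells include_docs=true to load the author document instead
    emit([doc._id, 'author', doc.author_id], { _id: doc.author_id });
  } else if (doc.type === 'comment') {
    emit([doc.post_id, 'comment', doc._id], { _id: doc._id });
  }
}

[
  { _id: '123412804910820', type: 'post', title: 'My Post', author_id: 'Lance1231' },
  { _id: 'Lance1231', type: 'author', name: 'Lance' },
  { _id: 'comment1', type: 'comment', body: 'Interesting Post', post_id: '123412804910820' },
  { _id: 'comment2', type: 'comment', body: 'I agree', post_id: '123412804910820' }
].forEach(postsMap);
```

The local `rows` array is in emit order; CouchDB itself returns view rows sorted by key.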

We can then use the 'start_key' and 'end_key' parameters to filter the results down to a single post's data:

?start_key=["123412804910820"]&end_key=["123412804910820", {}, {}]

Or even specifically extract the list for a certain type:

?start_key=["123412804910820", "comment"]&end_key=["123412804910820", "comment", {}]

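These query param combinations work because of how CouchDB orders JSON types when collating keys: null sorts first, then booleans, numbers, strings, arrays, and finally objects. An empty object ("{}") therefore sorts after any string a key is likely to contain, which makes it a convenient upper bound, while null or "" serve as lower bounds.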
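To see why those ranges select the right rows, here is a simplified comparator that mimics the type ordering used for view keys. It is only an approximation for illustration: strings are compared by code unit rather than with real Unicode collation, and objects are compared by key count only. The names `typeRank`, `collate`, and `inRange` are made up for this sketch.

```javascript
// Rank of each JSON type: null < false < true < numbers < strings < arrays < objects.
function typeRank(v) {
  if (v === null) return 0;
  if (v === false) return 1;
  if (v === true) return 2;
  if (typeof v === 'number') return 3;
  if (typeof v === 'string') return 4;
  if (Array.isArray(v)) return 5;
  return 6; // plain object
}

// Simplified key comparison (not real Unicode collation).
function collate(a, b) {
  var ra = typeRank(a), rb = typeRank(b);
  if (ra !== rb) return ra - rb;
  if (ra === 3) return a - b;
  if (ra === 4) return a < b ? -1 : (a > b ? 1 : 0);
  if (ra === 5) {
    // Arrays compare element by element; a shorter prefix sorts first.
    for (var i = 0; i < Math.min(a.length, b.length); i++) {
      var c = collate(a[i], b[i]);
      if (c !== 0) return c;
    }
    return a.length - b.length;
  }
  if (ra === 6) return Object.keys(a).length - Object.keys(b).length; // simplification
  return 0;
}

// Inclusive range check, like start_key/end_key on a view.
function inRange(key, startKey, endKey) {
  return collate(startKey, key) <= 0 && collate(key, endKey) <= 0;
}

var P = '123412804910820';
var keys = [
  [P, 'post'],
  [P, 'author', 'Lance1231'],
  [P, 'comment', 'comment1'],
  [P, 'comment', 'comment2'],
  ['999', 'post'] // a different post, should never match
];

var wholePost = keys.filter(function (k) { return inRange(k, [P], [P, {}, {}]); });
var onlyComments = keys.filter(function (k) {
  return inRange(k, [P, 'comment'], [P, 'comment', {}]);
});
```

Here `wholePost` keeps the four rows for the post and `onlyComments` keeps just the two comment rows.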

The second helpful addition from CouchDB in these situations is the _list function. This would allow you to run the above results through a templating system of some kind (if you want HTML, XML, CSV or whatever back), or output a unified JSON structure if you want to be able to request an entire post's content (including author and comment data) with a single request and returned as a single JSON document that matches what your client-side/UI code needs. Doing that would allow you to request the post's unified output document this way:

/db/_design/app/_list/unified/posts?start_key=["123412804910820"]&end_key=["123412804910820", {}, {}]&include_docs=true

Your _list function (in this case named "unified") would take the results of the view map/reduce (in this case named "posts") and run them through a JavaScript function that would send back the HTTP response in the content type you need (JSON, HTML, etc).
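A "unified" list function could be sketched along these lines. The folding logic lives in a plain helper (`unifyRows`, a name invented here) so it can be tried outside CouchDB; `start()`, `getRow()`, and `send()` are the functions CouchDB's query server provides to list functions. The shape of the output object is an assumption about what the client wants.

```javascript
// Fold view rows (fetched with include_docs=true) into one object.
// Row order does not matter: each row's kind is read from key[1].
function unifyRows(rows) {
  var result = { post: null, author: null, comments: [] };
  rows.forEach(function (row) {
    var kind = row.key[1];
    if (kind === 'post') {
      result.post = row.doc;
    } else if (kind === 'author') {
      result.author = row.doc;
    } else if (kind === 'comment') {
      result.comments.push(row.doc);
    }
  });
  return result;
}

// The list function itself; only meaningful inside CouchDB.
function unified(head, req) {
  start({ headers: { 'Content-Type': 'application/json' } });
  var rows = [], row;
  while ((row = getRow())) {
    rows.push(row);
  }
  send(JSON.stringify(unifyRows(rows)));
}

var sample = unifyRows([
  { key: ['123412804910820', 'author', 'Lance1231'], doc: { _id: 'Lance1231', name: 'Lance' } },
  { key: ['123412804910820', 'comment', 'comment1'], doc: { _id: 'comment1', body: 'Interesting Post' } },
  { key: ['123412804910820', 'post'], doc: { _id: '123412804910820', title: 'My Post' } }
]);
```

In a real design document the helper would have to be inlined into the list function, since each function is stored as a single string. Recent CouchDB releases also mark list functions as deprecated, so the same folding could just as well run in the client.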

Combining these things, you can split up your documents at whatever level you find useful and "safe" for updates, conflicts, and replication, and then put them back together as needed when they're requested.

Hope that helps.

Related Solutions

Best way to do one-to-many “JOIN” in CouchDB

Thank you! This is a great example to show off CouchDB 0.11's new features!

You must use the fetch-related-data feature to reference documents in the view. Optionally, for more convenient JSON, use a _list function to clean up the results. See Couchio's writeup on "JOIN"s for details.

Here is the plan:

First, you have a uniqueness constraint on your el documents. If two of them have id=2, that's a problem. You need to use the _id field instead of id. CouchDB will then guarantee uniqueness, and the rest of this plan requires _id in order to fetch documents by ID.
```
{ "type" : "el", "_id" : "1", "content" : "first" } 
{ "type" : "el", "_id" : "2", "content" : "second" } 
{ "type" : "el", "_id" : "3", "content" : "third" } 
```
If changing the documents to use _id is absolutely impossible, you can create a simple view to emit(doc.id, doc) and then re-insert that into a temporary database. This converts id to _id but adds some complexity.
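For that fallback, the re-insert step could be sketched like this: take the rows of an emit(doc.id, doc) view and build a request body for the target database's _bulk_docs endpoint. The function name `toBulkDocs` is hypothetical, and dropping the old `_rev` and `id` fields is my assumption about what the copied documents should look like.

```javascript
// Convert rows from a view that does emit(doc.id, doc) into a _bulk_docs body
// whose documents use the old `id` value as their new `_id`.
function toBulkDocs(viewRows) {
  return {
    docs: viewRows.map(function (row) {
      var doc = Object.assign({}, row.value);
      delete doc._rev;            // revisions belong to the source database
      delete doc.id;              // the custom id now lives in _id
      doc._id = String(row.key);  // _id must be a string
      return doc;
    })
  };
}

var body = toBulkDocs([
  { id: 'aaa', key: 1, value: { _id: 'aaa', _rev: '1-abc', type: 'el', id: 1, content: 'first' } },
  { id: 'bbb', key: 2, value: { _id: 'bbb', _rev: '1-def', type: 'el', id: 2, content: 'second' } }
]);
```

The resulting `body` can be sent as JSON in a POST to the temporary database's _bulk_docs endpoint.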

The view emits {"_id": content_id} data keyed on [list_id, sort_number], to "clump" the lists with their content.

function(doc) {
  if(doc.type == 'list') {
    for (var i in doc.elements) {
      // Link to the el document's id.
      var id = doc.elements[i];
      emit([doc.id, i], {'_id': id});
    }
  }
}

Now there is a simple list of el documents, in the correct order. You can use startkey and endkey if you want to see only a particular list.

curl localhost:5984/x/_design/myapp/_view/els
{"total_rows":2,"offset":0,"rows":[
{"id":"036f3614aeee05344cdfb66fa1002db6","key":["abc123","0"],"value":{"_id":"2"}},
{"id":"036f3614aeee05344cdfb66fa1002db6","key":["abc123","1"],"value":{"_id":"1"}}
]}

To get the el content, query with include_docs=true. Through the magic of _id, the el documents will load.

curl localhost:5984/x/_design/myapp/_view/els?include_docs=true
{"total_rows":2,"offset":0,"rows":[
{"id":"036f3614aeee05344cdfb66fa1002db6","key":["abc123","0"],"value":{"_id":"2"},"doc":{"_id":"2","_rev":"1-4530dc6946d78f1e97f56568de5a85d9","type":"el","content":"second"}},
{"id":"036f3614aeee05344cdfb66fa1002db6","key":["abc123","1"],"value":{"_id":"1"},"doc":{"_id":"1","_rev":"1-852badd683f22ad4705ed9fcdea5b814","type":"el","content":"first"}}
]}

Notice, this is already all the information you need. If your client is flexible, you can parse the information out of this JSON. The next optional step simply reformats it to match what you need.

Use a _list function, which simply reformats the view output. People use them to output XML or HTML; here, we will use one to make the JSON more convenient.

function(head, req) {
  var headers = {'Content-Type': 'application/json'};
  var result, row;  // declare row so the loop below does not create a global
  if(req.query.include_docs != 'true') {
    start({'code': 400, headers: headers});
    result = {'error': 'I require include_docs=true'};
  } else {
    start({'headers': headers});
    result = {'content': []};
    while(row = getRow()) {
      result.content.push(row.doc.content);
    }
  }
  send(JSON.stringify(result));
}

The results match. Of course in production you will need startkey and endkey to specify the list you want.

curl -g 'localhost:5984/x/_design/myapp/_list/pretty/els?include_docs=true&startkey=["abc123",""]&endkey=["abc123",{}]'
{"content":["second","first"]}

Bulk updating a CouchDB database without a _rev value per document

By design, you cannot update a CouchDB document blindly; you can only attempt to update a specific revision of a document.

For a single document, you can use a CouchDB update handler to hide this from the client as an update handler will be passed the existing document (if it exists) including its revision.
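An update handler of that kind might look roughly like this. It is a sketch: the `mergeFields` name and the "merge every non-underscore field from the request body" policy are assumptions. CouchDB calls update functions with the existing document (or null) and the request object, and expects a `[doc, response]` pair back.

```javascript
// Update handler: merge the JSON request body into the stored document.
// The client never has to send a _rev, because CouchDB passes in the
// current document, revision included.
function mergeFields(doc, req) {
  var fields = JSON.parse(req.body);
  if (!doc) {
    // No existing document: create one with the requested id, else a fresh UUID.
    doc = { _id: req.id || req.uuid };
  }
  for (var name in fields) {
    // Skip reserved fields such as _id and _rev so the client cannot clobber them.
    if (name.charAt(0) !== '_') {
      doc[name] = fields[name];
    }
  }
  return [doc, JSON.stringify({ ok: true, id: doc._id })];
}
```

Stored under `updates` in a design document, this would be called through the `_update` endpoint of that design document. A concurrent write can still cause a conflict response, so the client should be prepared to retry.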

For a collection of documents, when using _bulk_docs, you can add "new_edits": false, which will forcibly insert conflicting revisions instead of rejecting the write (though you'll still need to pass a _rev, it just doesn't have to be the current one).

All that said, it would be better to follow the rules. Grab the current revision of the document you would like to update, attempt to update it, if you get a 409, get the new version, merge as appropriate, and update again.
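The get, update, retry-on-409 loop can be sketched as follows. The `db` object here is a hypothetical client, not a real library: `db.get(id)` is assumed to resolve to the current document including its `_rev`, and `db.put(doc)` to resolve to an object with a `status` field (409 on conflict). The `mutate` callback is where the "merge as appropriate" step would go.

```javascript
// Optimistic update: read the latest revision, apply the change, and retry
// from a fresh read whenever CouchDB answers with a 409 conflict.
async function updateWithRetry(db, id, mutate, maxAttempts) {
  for (var attempt = 1; attempt <= maxAttempts; attempt++) {
    var current = await db.get(id);
    // Work on a copy so a failed attempt leaves nothing half-modified.
    var updated = mutate(Object.assign({}, current));
    updated._id = current._id;
    updated._rev = current._rev; // update exactly the revision we read
    var res = await db.put(updated);
    if (res.status !== 409) {
      return res;
    }
    // 409: someone else wrote in between, so loop and re-read.
  }
  throw new Error('still conflicting after ' + maxAttempts + ' attempts');
}
```

Because `mutate` is re-applied to the freshly fetched document on each attempt, changes made by other writers in the meantime are preserved rather than overwritten.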
