Bigtable vs cassandra vs simpledb vs dynamo vs couchdb vs hypertable vs riak vs hbase, what do they have in common

amazon-simpledbbigtablecassandracouchdbhypertable

Sorry if this question is somewhat subjective. I am new to 'could store', 'distributed store' or some concepts like this. I really wonder what do they have in common and want to get an overview on all of them. What do I need to prepare if I want to write a product similar to this?

Best Answer

The NoSQL Database site summarizes the concept like this:

Next Generation Databases mostly address some of the points: being non-relational, distributed, open-source and horizontal scalable. The original intention has been modern web-scale databases. The movement began early 2009 and is growing rapidly. Often more characteristics apply as: schema-free, replication support, easy API, eventually consistency, and more. So the misleading term "nosql" (the community now translates it mostly with "not only sql") should be seen as an alias to something like the definition above.

That site also maintains an archive of articles on NoSQL databases. Most of them seem to focus on particular products but there are some more general overviews. If you are serious about building your own one then Design Patterns for Distributed Non-Relational Databases does a good round up of things you need to consider.

Related Solutions

The difference between Cassandra and CouchDB

CouchDB is a document store. You put documents (JSON objects) in it and define views (indexes) over them. The objects can be arbitrarily complex with potentially deep structure. Further, they are not constrained to following some consistent schema.

Cassandra is a ragged-table key-value store. It just stores rows, each of which has a set of named columns grouped in to families with values. It sounds quite close to BigTable; BigTable doesn't require each row to have the same structure (unlike an SQL database). The values may have some structure, but this kind of store doesn't know anything about that -- they're just strings/byte sequences.

Yes, they are both non-relational databases, and there is probably a fair amount of overlap in their applicability, but they do have distinctly different data organization models. Each can probably be forced into emulating the other, but each model will map best to a different set of problems.

Principles for Modeling CouchDB Documents

There have been some great answers to this already, but I wanted to add some more recent CouchDB features to the mix of options for working with the original situation described by viatropos.

The key point at which to split up documents is where there might be conflicts (as mentioned earlier). You should never keep massively "tangled" documents together in a single document as you'll get a single revision path for completely unrelated updates (comment addition adding a revision to the entire site document for instance). Managing the relationships or connections between various, smaller documents can be confusing at first, but CouchDB provides several options for combining disparate pieces into single responses.

The first big one is view collation. When you emit key/value pairs into the results of a map/reduce query, the keys are sorted based on UTF-8 collation ("a" comes before "b"). You can also output complex keys from your map/reduce as JSON arrays: ["a", "b", "c"]. Doing that would allow you to include a "tree" of sorts built out of array keys. Using your example above, we can output the post_id, then the type of thing we're referencing, then its ID (if needed). If we then output the id of the referenced document into an object in the value that's returned we can use the 'include_docs' query param to include those documents in the map/reduce output:

{"rows":[
  {"key":["123412804910820", "post"], "value":null},
  {"key":["123412804910820", "author", "Lance1231"], "value":{"_id":"Lance1231"}},
  {"key":["123412804910820", "comment", "comment1"], "value":{"_id":"comment1"}},
  {"key":["123412804910820", "comment", "comment2"], "value":{"_id":"comment2"}}
]}

Requesting that same view with '?include_docs=true' will add a 'doc' key that will either use the '_id' referenced in the 'value' object or if that isn't present in the 'value' object, it will use the '_id' of the document from which the row was emitted (in this case the 'post' document). Please note, these results would include an 'id' field referencing the source document from which the emit was made. I left it out for space and readability.

We can then use the 'start_key' and 'end_key' parameters to filter the results down to a single post's data:

?start_key=["123412804910820"]&end_key=["123412804910820", {}, {}]

Or even specifically extract the list for a certain type:

?start_key=["123412804910820", "comment"]&end_key=["123412804910820", "comment", {}]

These query param combinations are possible because an empty object ("{}") is always at the bottom of the collation and null or "" are always at the top.

The second helpful addition from CouchDB in these situations is the _list function. This would allow you to run the above results through a templating system of some kind (if you want HTML, XML, CSV or whatever back), or output a unified JSON structure if you want to be able to request an entire post's content (including author and comment data) with a single request and returned as a single JSON document that matches what your client-side/UI code needs. Doing that would allow you to request the post's unified output document this way:

/db/_design/app/_list/posts/unified??start_key=["123412804910820"]&end_key=["123412804910820", {}, {}]&include_docs=true

Your _list function (in this case named "unified") would take the results of the view map/reduce (in this case named "posts") and run them through a JavaScript function that would send back the HTTP response in the content type you need (JSON, HTML, etc).

Combining these things, you can split up your documents at whatever level you find useful and "safe" for updates, conflicts, and replication, and then put them back together as needed when they're requested.

Hope that helps.

Best Answer

Related Solutions

The difference between Cassandra and CouchDB

Principles for Modeling CouchDB Documents

Related Topic