When to use wide column stores instead of document based stores

document-databases nosql

I have some experience with document-based stores (MongoDB and CouchDB) and I am interested in exploring wide column databases.

Based on my initial exploration, I have a basic understanding of how wide column stores differ, but I do not really understand for which types of operations they are a better fit than an indexed document store.

My initial impression is that column stores are better if the column combinations in the queries are highly dynamic (so no indexed view is really required) and/or if there is a high write rate (which would trigger map-reduce indexing in a document store).

Performance-wise, it seems that column stores might be better if I have documents with many properties but only some of them are needed. Document stores seem to assume that the whole document will be retrieved, but I am not sure how much impact this really has. Maybe the document needs to have many filtered-out columns for it to make a difference?

I also got the impression that column stores "might" be more performant for multi-tenant systems that share a database, where one of the columns holds the tenant id and maybe another one holds the roles.

And I am getting the feeling that wide column stores are very good for the queries done by data analysis applications, where there is a large set of collected data for each entry, only a few fields must be extracted, and the combination of columns is unpredictable.

My Question: What types of queries are better handled in wide column stores as opposed to document stores?

Best Answer

I can't answer this question for you, and no one else can either, because it is a "Gorilla vs. Shark" question, as noted in the comments above. But I will help anyway.

You have omitted an important preceding question:

What are the characteristics of the data set I am querying?

That is just as important as the specific queries you want to run, if not more so. Some useful questions to ask about your data are:

  • How much data do I have? Does it fit into memory? On one server? On a cluster?
  • How does my data change? Does it get mass updates at a predictable frequency? At an unpredictable frequency? Does it get streamed updates? Trickled transaction updates? No updates at all?
  • What is the structure of the entities in my data? Are there any "one-to-many" relationships? Is it all tabular? Is it mostly tabular?
  • What is the sparseness of my data? Is it reasonably complete? Is it mostly empty?
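Some of these questions can be answered roughly by profiling a sample of your records. The following is only a minimal sketch, assuming the sample fits in memory as a list of Python dicts; `profile_records` and the sample fields are hypothetical names made up for illustration, not part of any database API.

```python
from collections import Counter


def profile_records(records):
    """Summarize sparseness and nesting for a list of dict records.

    Assumption: each record is a plain dict; a value of None counts as missing.
    """
    filled = Counter()      # how many records have a non-null value per field
    nested = set()          # fields holding lists/dicts (possible one-to-many)
    all_fields = set()

    for rec in records:
        all_fields.update(rec)
        for key, value in rec.items():
            if value is None:
                continue
            filled[key] += 1
            if isinstance(value, (list, dict)):
                nested.add(key)

    total = len(records)
    cells = total * len(all_fields)
    return {
        "record_count": total,
        "field_count": len(all_fields),
        # share of (record, field) cells that actually hold a value
        "density": sum(filled.values()) / cells if cells else 0.0,
        "fill_rate": {f: filled[f] / total for f in all_fields} if total else {},
        "nested_fields": nested,
    }


if __name__ == "__main__":
    sample = [
        {"id": 1, "name": "a", "tags": ["x", "y"]},
        {"id": 2, "name": None},
        {"id": 3, "email": "c@example.com"},
    ]
    print(profile_records(sample))
```

A low density suggests mostly empty data, and a non-empty `nested_fields` hints that the data is not purely tabular. Neither number picks a store for you, but both make the questions above concrete.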

If you are considering this in the abstract and don't have any specific data set in mind, then there is no reasonable answer to your question.

And even with a specific, well-defined set of data, and answers to all these questions, you still might not know without doing a bakeoff of particular implementations.
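A bakeoff does not need much tooling. Here is a rough sketch of a timing harness, assuming each candidate can be wrapped in a zero-argument Python callable that runs one representative query. The name `run_bakeoff` and the two in-memory stand-ins are hypothetical placeholders; in a real test you would swap in calls to the actual store clients, run against your own data, and also measure writes.

```python
import time


def run_bakeoff(candidates, repeats=5):
    """Time each candidate query and return the best wall-clock time per name.

    candidates: dict mapping a label to a zero-argument callable.
    """
    results = {}
    for name, query in candidates.items():
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            query()
            timings.append(time.perf_counter() - start)
        results[name] = min(timings)  # best of N reduces noise from other processes
    return results


if __name__ == "__main__":
    # In-memory stand-ins only: the same data laid out per record and per field.
    rows = [{"id": i, "price": i * 0.5, "note": "x" * 50} for i in range(100_000)]
    columns = {
        "id": [r["id"] for r in rows],
        "price": [r["price"] for r in rows],
        "note": [r["note"] for r in rows],
    }

    candidates = {
        "per-record layout": lambda: sum(r["price"] for r in rows),
        "per-field layout": lambda: sum(columns["price"]),
    }
    for name, seconds in run_bakeoff(candidates).items():
        print(f"{name}: {seconds:.6f}s")
```

Whatever these toy layouts show says nothing about real databases; the point is only that the comparison has to be run with your queries on your data.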
