Pros of databases like BigTable and SimpleDB

amazon-simpledb bigtable database key-value-store scalability

New-school datastore paradigms like Google BigTable and Amazon SimpleDB are specifically designed for scalability, among other things. Basically, disallowing joins and relying on denormalization are the main ways this is being accomplished.
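A rough sketch of that trade-off, with table and field names invented purely for illustration: in a normalized schema, answering "orders with their user names" takes a join, while a denormalized key-value layout copies the name into each order, so a single key lookup is enough and the rows can be spread across machines independently.

```python
# Hypothetical user/order data, invented for illustration.

# Normalized: two "tables"; answering "orders with user names" needs a join.
users = {1: {"name": "Ada"}, 2: {"name": "Grace"}}
orders = [
    {"order_id": 10, "user_id": 1, "total": 25.0},
    {"order_id": 11, "user_id": 2, "total": 40.0},
]
joined = [
    {"order_id": o["order_id"],
     "user_name": users[o["user_id"]]["name"],  # the join lookup
     "total": o["total"]}
    for o in orders
]

# Denormalized, BigTable/SimpleDB style: the user name is copied into each
# order row, so one key lookup answers the question and each row can live
# on any server without coordinating with a separate "users" table.
orders_denormalized = {
    "order:10": {"user_name": "Ada", "total": 25.0},
    "order:11": {"user_name": "Grace", "total": 40.0},
}
print(orders_denormalized["order:10"])  # no join needed
```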

In this topic, however, the consensus seems to be that joins on large tables don't necessarily have to be too expensive, and that denormalization is somewhat "overrated".
Why, then, do these aforementioned systems disallow joins and force everything together in a single table to achieve scalability? Is it the sheer volume of data that needs to be stored in these systems (many terabytes)?
Do the general rules for databases simply not apply to these scales?
Is it because these database types are tailored specifically towards storing many similar objects?
Or am I missing some bigger picture?

Best Answer

Distributed databases aren't quite as naive as Orion implies; there has been quite a bit of work done on optimizing fully relational queries over distributed datasets. You may want to look at what companies like Teradata, Netezza, Greenplum, Vertica, AsterData, etc. are doing. (Oracle finally got in the game as well with its recent announcement; Microsoft bought its solution in the form of the company formerly known as DataAllegro.)

That being said, when the data scales up into the terabytes, these issues become very non-trivial. If you don't need the strict transactionality and consistency guarantees of an RDBMS, it is often far easier to denormalize and not do joins, especially if you don't need to cross-reference much, and especially if you are not doing ad-hoc analysis but require programmatic access with arbitrary transformations.
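To make that last point concrete, here is a minimal sketch, with a hypothetical record layout, of what "programmatic access with arbitrary transformations" looks like when every record is self-contained: each record can be transformed in isolation, which is exactly what lets the work be split across machines.

```python
# Hypothetical denormalized records; each one carries everything it needs.
records = [
    {"order_id": 10, "user_name": "Ada", "total": 25.0},
    {"order_id": 11, "user_name": "Grace", "total": 40.0},
]

def to_report_row(record):
    """An arbitrary per-record transformation; it consults no other table."""
    return {
        "order_id": record["order_id"],
        "customer": record["user_name"].upper(),
        "total_cents": int(round(record["total"] * 100)),
    }

# Because no cross-referencing is needed, this loop is trivially
# parallelizable across however many machines hold the records.
report = [to_report_row(r) for r in records]
print(report)
```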

Denormalization is overrated. Just because that's what happens when you are dealing with 100 terabytes doesn't mean the same approach should be used by every developer who never bothered to learn about databases and has trouble querying a million or two rows due to poor schema planning and query optimization.

But if you are in the 100-terabyte range, by all means...

Oh, the other reason these technologies are getting the buzz -- folks are discovering that some things never belonged in the database in the first place, and are realizing that they aren't dealing with relations in their particular fields, but with basic key-value pairs. For things that shouldn't have been in a DB, it's entirely possible that the Map-Reduce framework, or some persistent, eventually-consistent storage system, is just the thing.
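As a toy, single-process sketch of that Map-Reduce idea (the data and function names are invented): the map step emits key-value pairs from each record independently, and the reduce step folds all values that share a key; real frameworks distribute exactly this pattern across machines.

```python
from collections import defaultdict

# Invented key-value records, as they might sit in a denormalized store.
records = [
    ("order:10", {"user_name": "Ada", "total": 25.0}),
    ("order:11", {"user_name": "Grace", "total": 40.0}),
    ("order:12", {"user_name": "Ada", "total": 5.0}),
]

def map_fn(key, value):
    # Emit (user_name, total); looks at one record only.
    yield value["user_name"], value["total"]

def reduce_fn(key, values):
    # Fold all the totals for one user.
    return key, sum(values)

grouped = defaultdict(list)          # the "shuffle": group by mapped key
for k, v in records:
    for out_k, out_v in map_fn(k, v):
        grouped[out_k].append(out_v)

results = [reduce_fn(k, vs) for k, vs in grouped.items()]
print(results)  # [('Ada', 30.0), ('Grace', 40.0)]
```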

On a less global scale, I highly recommend BerkeleyDB for those sorts of problems.
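The access pattern there is just put/get by key, with no query language and no joins. Here is a minimal sketch of that workflow using Python's standard-library dbm module as a stand-in (BerkeleyDB's own API differs, and the file name is made up):

```python
import dbm

# Store a couple of denormalized records under their keys.
with dbm.open("orders.db", "c") as db:   # "c": create the file if missing
    db["order:10"] = '{"user_name": "Ada", "total": 25.0}'
    db["order:11"] = '{"user_name": "Grace", "total": 40.0}'

# Read one back: a single key lookup, no joins, no SQL.
with dbm.open("orders.db", "r") as db:
    print(db["order:10"])                # values come back as bytes
```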
