Database – Why is NoSQL better for this scenario

database, json, nosql, performance, sql

Hypothetical scenario: Let's say we are downloading JSON from Facebook with details of a user's friends' check-ins, posts, etc. These come in as one document per friend per activity, so with 8 activities, a user with 300 friends will cause our system to make 2,400 requests to Facebook, downloading 2,400 JSON documents.

Let's say we want to merge these 2,400 documents together, sort the activities by date_created descending, and then page through them in a sort of pseudo newsfeed. Please do not comment on the wisdom of recreating a Facebook newsfeed in this way.
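To make the merge/sort/page step concrete, here is a minimal sketch in Python. The document shape is an assumption for illustration only: each downloaded document is taken to hold its activities in a hypothetical "data" list, and date_created is assumed to be an ISO 8601 string in a single timezone, so that plain string ordering matches chronological ordering.

```python
import json
from itertools import chain

def merge_activities(documents):
    """Flatten parsed documents into one list of activity dicts.

    Assumes (hypothetically) that each document keeps its activities
    under a "data" key.
    """
    return list(chain.from_iterable(doc["data"] for doc in documents))

def newsfeed_page(activities, page, page_size=25):
    """Return one page of activities, newest first (pages start at 0)."""
    ordered = sorted(activities, key=lambda a: a["date_created"], reverse=True)
    start = page * page_size
    return ordered[start:start + page_size]

# Two tiny made-up documents standing in for the 2,400 real ones.
raw_documents = [
    '{"data": [{"id": "a1", "date_created": "2012-03-01T10:00:00+0000"}]}',
    '{"data": [{"id": "b1", "date_created": "2012-03-02T09:30:00+0000"},'
    ' {"id": "b2", "date_created": "2012-02-27T18:45:00+0000"}]}',
]
documents = [json.loads(raw) for raw in raw_documents]
feed = merge_activities(documents)

first_page = newsfeed_page(feed, page=0, page_size=2)
second_page = newsfeed_page(feed, page=1, page_size=2)
print([a["id"] for a in first_page])   # ['b1', 'a1']
print([a["id"] for a in second_page])  # ['b2']
```

Sorting on every page request is fine for a sketch; with real volumes you would presumably sort once on ingestion and store the result.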

Let's also suppose that we want to re-download all of this data whenever Facebook notifies us that it has changed. (FB has an update service you can subscribe to for users of your app.) For argument's sake, let's assume all of the data has to be refreshed every 5 minutes, and further assume we want to be able to support 1,000 simultaneous users, and that the average JSON document size is 25 KB.
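For a sense of scale, the numbers above can be multiplied out. This is back-of-the-envelope arithmetic only, assuming the worst case in which every one of the 1,000 users needs a full refresh in every 5-minute window, and using decimal units (1 MB = 1,000 KB).

```python
FRIENDS = 300
ACTIVITY_TYPES = 8
USERS = 1000
DOC_SIZE_KB = 25
REFRESH_SECONDS = 5 * 60

docs_per_user = FRIENDS * ACTIVITY_TYPES           # documents per user per refresh
docs_per_refresh = docs_per_user * USERS           # documents across all users
mb_per_user = docs_per_user * DOC_SIZE_KB / 1000   # MB downloaded per user
gb_per_refresh = mb_per_user * USERS / 1000        # GB downloaded per refresh window

requests_per_second = docs_per_refresh / REFRESH_SECONDS
mb_per_second = gb_per_refresh * 1000 / REFRESH_SECONDS

print(docs_per_user)         # 2400
print(docs_per_refresh)      # 2400000
print(mb_per_user)           # 60.0
print(gb_per_refresh)        # 60.0
print(requests_per_second)   # 8000.0
print(mb_per_second)         # 200.0
```

So under these assumptions the system would be ingesting roughly 60 GB, or 2.4 million documents, every 5 minutes.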

I am curious as to how NoSQL techniques would be better than parsing the JSON on ingestion into a relational database? To me it seems like map/reduce are just synonyms for parse/aggregate and that both approaches will require the same thing to occur. What advantages would I get from using NoSQL?
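To illustrate what I mean by "synonyms", here is a small sketch that counts activities per type twice: once with a SQL GROUP BY (using Python's built-in sqlite3 module) and once with a hand-rolled map and reduce step. The table name, field names, and sample rows are made up for the example, and the map/reduce functions are plain Python, not any particular NoSQL product's API.

```python
import sqlite3
from collections import defaultdict

# Made-up activities, already parsed from JSON.
activities = [
    {"friend_id": 1, "type": "checkin"},
    {"friend_id": 1, "type": "post"},
    {"friend_id": 2, "type": "checkin"},
]

# Relational route: parse on ingestion, then aggregate with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE activity (friend_id INTEGER, type TEXT)")
conn.executemany("INSERT INTO activity VALUES (:friend_id, :type)", activities)
sql_counts = dict(conn.execute("SELECT type, COUNT(*) FROM activity GROUP BY type"))
conn.close()

# Map/reduce route: emit (key, value) pairs, then fold them per key.
def map_phase(doc):
    yield doc["type"], 1

def reduce_phase(pairs):
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

mr_counts = reduce_phase(pair for doc in activities for pair in map_phase(doc))

print(sql_counts == mr_counts)  # True
```

Both routes do the same logical work here, which is why I am asking where the advantage comes from.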

Best Answer

What advantages would I get from using NoSQL?

NoSQL will scale better as the number of users grows.

Traditional RDBMSs don't really scale out well. For the most part, what you can do is throw bigger machines at the problem (scale up). They weren't originally designed for distributed systems (the cloud, for example).

NoSQL is (under given circumstances) better at handling hierarchical structures like documents/JSON.

The key point to understand is that these storage mechanisms are key-value based and thus can retrieve data that is stored together very fast, as opposed to data that is "merely related" (which is what RDBMSs were built for).

In your case, that would mean you can retrieve all records for a certain user very fast, for example. In a traditional relational database you would either have to denormalize your schema for performance, or keep the schema clean but potentially suffer performance penalties caused by joins or heavy aggregations.
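A rough sketch of the two access patterns, in Python. The relational side uses the built-in sqlite3 module with a hypothetical two-table schema; the key-value side is just a dict standing in for a document or key-value store, with a made-up "feed:<user_id>" key format. Neither is meant as the way any specific database does it.

```python
import json
import sqlite3

# Normalized relational schema: the feed is assembled with a join at read time.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE friendship (user_id INTEGER, friend_id INTEGER);
    CREATE TABLE activity (friend_id INTEGER, type TEXT, date_created TEXT);
""")
conn.executemany("INSERT INTO friendship VALUES (?, ?)",
                 [(1, 10), (1, 11), (2, 12)])
conn.executemany("INSERT INTO activity VALUES (?, ?, ?)", [
    (10, "post", "2012-03-01"),
    (11, "checkin", "2012-03-02"),
    (12, "post", "2012-03-03"),
])

def feed_via_join(user_id):
    rows = conn.execute(
        """SELECT a.friend_id, a.type, a.date_created
           FROM friendship f
           JOIN activity a ON a.friend_id = f.friend_id
           WHERE f.user_id = ?
           ORDER BY a.date_created DESC""",
        (user_id,),
    )
    return [{"friend_id": r[0], "type": r[1], "date_created": r[2]} for r in rows]

# Key-value style: the whole feed is written once under the user's key,
# so reading it back is a single lookup with no join.
feed_store = {}

def store_feed(user_id, activities):
    feed_store[f"feed:{user_id}"] = json.dumps(activities)

def feed_via_key(user_id):
    return json.loads(feed_store[f"feed:{user_id}"])

store_feed(1, feed_via_join(1))
print(feed_via_key(1) == feed_via_join(1))  # True
```

The trade-off is that the stored feed has to be rewritten whenever the underlying activities change, which is the denormalization cost mentioned above, just moved to write time.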

Look at it this way: why is a hash map (a key-value store) fast? You can retrieve items from a hash map in almost O(1), as the hash translates more or less directly to a memory address (simplified). Looking up a tree-based index (such as a B-tree), in contrast, takes O(log n).
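The difference can be sketched in a few lines of Python. A dict plays the role of the hash map, and a hand-written binary search over a sorted list stands in for the tree-style index; a real B-tree has a much higher fan-out and therefore far fewer levels, so treat this purely as an illustration of O(1) versus O(log n), not as a benchmark.

```python
import math

N = 100_000
sorted_keys = list(range(N))
hash_index = {key: f"record-{key}" for key in sorted_keys}

def binary_search_steps(keys, target):
    """Return (position, number_of_comparisons) for target in sorted keys."""
    lo, hi, steps = 0, len(keys), 0
    while lo < hi:
        steps += 1
        mid = (lo + hi) // 2
        if keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    return lo, steps

target = 73_421

# Hash lookup: one hash computation, independent of N (on average).
record = hash_index[target]

# Sorted-index lookup: the search range is halved on every step.
position, steps = binary_search_steps(sorted_keys, target)

print(record)                      # record-73421
print(position == target)          # True
print(steps <= math.ceil(math.log2(N)))  # True
```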

For your case, MongoDB or CouchDB might be good solutions, as they're already based on JSON.

In my opinion, using a NoSQL solution here is a good choice. You want to retrieve all the activities of a user as a feed. If they're properly written to your data storage, then NoSQL should, in theory, excel at this, without the need for joining anything or worrying about proper indexes. @Earlz also mentioned that many NoSQL databases don't give you full ACID guarantees. Relaxing them is part of what makes NoSQL fast, and you probably don't need ACID properties for your application. Give it a try!

Moreover, there's a good article from Martin Fowler on the subject. He's made a nice diagram that I really like:

[Image: diagram from Martin Fowler's article]

Go check out his pages to read some deep thoughts about NoSQL.
