Is map-reduce the basic factor that makes NoSQL more scalable than SQL?

Tags: mysql, nosql, scalability

I'm studying the differences between NoSQL and SQL, and what makes the former more scalable. I think I've got the point, so I'll try to explain:

Suppose an app has a list of billions of users, each with an amount of money, and it wants to display the sum of the money across all of them. In SQL, it would have to SELECT SUM(money) FROM users, which would cause the engine to iterate through every user (O(n)) to compute the sum, which is slow. In NoSQL, on the other hand, it could previously have set up a map-reduce function such as nosql_engine.mapReduce(get_money, sum). This would make retrieval of the sum effectively instantaneous (O(1)), since it would be cached, and updating that computation would drop to O(log(n)) because of the associativity of the reduction operation. So, in general, this map-reduce capability is the basic underlying principle that makes NoSQL more scalable than SQL. Is this correct to some extent?
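In case it helps, here is a toy Python sketch of what I imagine the engine doing internally (nosql_engine and mapReduce above are made up, and so is this - it's not any real product's API): the balances sit in the leaves of a binary aggregation tree, so the root always holds the pre-computed total; reading it is O(1) and updating one balance is O(log(n)).

    # Toy sketch of the idea above (not any real NoSQL engine's API):
    # per-user balances are the leaves of a binary aggregation tree,
    # so the root always holds the pre-computed total.
    # Reading the sum is O(1); updating one balance is O(log n).

    class SumTree:
        def __init__(self, values):
            self.n = len(values)
            self.tree = [0] * (2 * self.n)
            self.tree[self.n:] = values               # leaves
            for i in range(self.n - 1, 0, -1):        # internal nodes
                self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]

        def total(self):                              # "cached" sum: O(1)
            return self.tree[1]

        def update(self, user_index, new_balance):    # O(log n) per write
            i = user_index + self.n
            self.tree[i] = new_balance
            while i > 1:
                i //= 2
                self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]

    balances = SumTree([10, 20, 30, 40])
    print(balances.total())        # 100
    balances.update(2, 35)
    print(balances.total())        # 105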

Best Answer

No, that's not it at all. What you describe is gaining an advantage either by caching (having computed the answer before the request arrived) or by parallelization (tasking more than one node with the computation of a big sum). Neither is necessarily exclusive to 'NoSQL' databases. (I use scare quotes because what people call 'NoSQL' these days is mostly characterized not by the lack of a structured query language, but by non-adherence to strict relational principles.)

Caching frequently computed aggregations can be done in any kind of data store. In relational terms this is called denormalization, and while according to the strictest adherents of relational theory you should never, ever do it, it has obvious advantages in some use cases and is therefore frequently done without much regret by database engines both old and new. The trade-offs are well known (e.g. faster reads vs. slower inserts) and are generally manageable; they are in fact quite similar to the trade-offs for normal indexing.
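As a minimal, engine-agnostic sketch of that trade-off (plain Python, not any particular database): keep a running total next to the per-user data and maintain it on every write, so reads of the aggregate are O(1) while each write pays a small extra cost.

    # Minimal sketch of denormalizing an aggregate, independent of any engine:
    # a running total is kept next to the per-user data and maintained on writes.
    # Reads of the total are O(1); every write pays a little extra to keep it fresh.

    class AccountStore:
        def __init__(self):
            self.balances = {}        # user_id -> money
            self.total = 0            # denormalized aggregate

        def set_balance(self, user_id, money):
            old = self.balances.get(user_id, 0)
            self.balances[user_id] = money
            self.total += money - old   # the "slower insert" part of the trade-off

        def sum_money(self):
            return self.total           # the "faster read" part

    store = AccountStore()
    store.set_balance("alice", 50)
    store.set_balance("bob", 70)
    print(store.sum_money())   # 120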

The characteristic advantage of map-reduce is that more than one node joins in a computation, which means that the data have to be distributed. That is also often done by relational databases, whether in the form of sharding, horizontal partitioning or any other variant thereof. Obviously distributing data can also be combined with denormalization.
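Roughly, the map-reduce way of computing the sum over distributed data looks like this (a Python sketch in which in-memory lists stand in for shards on separate nodes): each shard computes its partial sum in the 'map' step, and the partial results are combined in the 'reduce' step.

    # Rough sketch of a map-reduce style sum over sharded data
    # (the shards here are just in-memory lists standing in for separate nodes).
    from concurrent.futures import ProcessPoolExecutor

    def partial_sum(shard):          # "map": each node sums its own chunk
        return sum(user["money"] for user in shard)

    def total_money(shards):
        with ProcessPoolExecutor() as pool:
            partials = pool.map(partial_sum, shards)
        return sum(partials)         # "reduce": combine the per-shard results

    if __name__ == "__main__":
        shards = [
            [{"money": 10}, {"money": 20}],
            [{"money": 30}],
            [{"money": 40}, {"money": 5}],
        ]
        print(total_money(shards))   # 105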

What typically makes non-relational databases faster is the omission of the classical ACID guarantees on data integrity. For instance, the system may not guarantee that all your writes become visible at the same time to every other client, or even that they become visible at all. A frequent compromise is that all data you write will become visible to all clients given enough time, where 'enough' is not a strictly bounded amount of time ('eventual consistency'). Obviously, this allows very fast writing, because you don't have to wait until all the guarantees have been enforced.

For reading, the decisive factor is often how much you care about inexact results (e.g. not counting things that are in the system but not yet visible to you). For computing a sum, the actual computation (i.e. executing ADD instructions in a processor core) is almost certainly much, much faster than the I/O involved in retrieving all the relevant data chunks from wherever they are, so parallelization does not do much good - the total time will be dominated by how fast you can get to all the operands, and by how hard you try not to miss any.
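To make 'eventual consistency' a bit more tangible, here is a deliberately crude toy model in Python (no real replication protocol is implied): the write is acknowledged as soon as the primary has it, a replica applies it asynchronously, and a read from the replica may briefly miss it.

    # Toy model of eventual consistency (no real replication protocol implied):
    # writes are acknowledged as soon as the primary has them; a replica applies
    # them later, so a read from the replica may briefly miss recent writes.
    import queue
    import threading
    import time

    primary = {}
    replica = {}
    replication_log = queue.Queue()

    def write(key, value):
        primary[key] = value
        replication_log.put((key, value))   # acknowledged before the replica sees it
        return "ok"

    def replicate_forever():
        while True:
            key, value = replication_log.get()
            time.sleep(0.1)                 # stand-in for network/apply delay
            replica[key] = value

    threading.Thread(target=replicate_forever, daemon=True).start()

    write("alice", 50)
    print(replica.get("alice"))   # likely None: write not yet visible here
    time.sleep(0.3)
    print(replica.get("alice"))   # 50: "eventually" consistent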