How much does the data model affect scalability and performance in so-called "NoSQL" databases?

database-design nosql scalability

You can't ever have a talk about so-called "NoSQL" databases without bringing up the CAP theorem (Consistency, Availability, Partition tolerance: pick two). If you have to pick, say, between MongoDB (Consistency, Partition tolerance) and CouchDB (Availability, Partition tolerance), the first thing you need to think about is "Do I need correct data, or do I need access all the time?"

These new databases were made to be partitioned. But what if I don't partition? What if I just think it's pretty cool to have a key/value, column, document, or whatever database instead of a relational one, create a single server instance, and never shard it? In that case, wouldn't I have both availability and consistency? MongoDB wouldn't need to replicate anything, so it would be available. And CouchDB would have only one source of data, so it would be pretty consistent.

So would that mean that, in that case, MongoDB and CouchDB have little difference in terms of use case? Except, of course, for performance, API, and so on, but then it would be more like choosing between PostgreSQL and MySQL than between two fundamentally different sets of requirements.

Am I right here? Can I turn an AP or CP database into a CA one just by never creating more than one instance? Or is there something I am missing?

Let's ask the question in reverse. What if I take a relational database, say MySQL, and put it in a master/slave configuration, without using ACID transactions? If I require that every write be synchronized to the slaves immediately, wouldn't that make it a CP database? And if I synchronize at some predefined interval, and it doesn't matter whether a client reads stale data from a slave, wouldn't that make it an AP database? Wouldn't that mean that by giving up ACID compliance I could still use the relational model for a partitioned database?

In essence: is scalability more about what you are ready to give up in the CAP theorem than about the underlying data model? Does a column, document, or key/value model give a boost to scalability over a relational model? Could we design a relational database built from the ground up for partition tolerance? (Maybe it already exists.) Could we make NoSQL databases ACID compliant?

Sorry, it's a lot of questions, but I have read a lot about NoSQL databases recently and it seems to me that their biggest benefit is that they fit the "shape" of your data better, rather than partitioning, CAP, and giving up ACID compliance. After all, not everyone has so much data that they need to partition it. Is there a performance or scalability benefit to dropping the relational model before I even think about partitioning my data?

Best Answer

Does using a NoSQL database give a boost to scalability even if you aren't sharding data? Well, let's define scalability. If by scalability you mean what database/backend people usually mean, where you have vertical and horizontal scaling and horizontal scaling *is* sharding data, then this becomes a trivial question: the answer is absolutely no, because the only option you have left is vertical scaling (i.e., getting better hardware). If, however, you are talking about scalability in a broader sense, referring to the flexibility of the application, the value of the data, and so on, then that is a completely different question with a number of answers. And, as you mentioned, it will often come down to what you are doing with the data and how it should be stored.

Let me preface everything here with the statement that in most cases you should still be using an RDBMS; NoSQL should fill niches. The following describes a specific case where a NoSQL database is more beneficial given specific requirements, and where we can ignore horizontal scaling.

Take, for instance, the idea that you are building a cloud file storage system similar to Google Drive, Dropbox, or Box, but instead of using an actual file system you decide it would be more beneficial to virtualize the file system. Now you have a problem, because your data model is suddenly a tree structure, which is horribly inefficient in an RDBMS (despite the fact that that is how everything is indexed). You end up with a three-column table with Name, User, and Parent. User is a foreign key to a users table, and Parent is a self-referencing, nullable foreign key (nullable because the root directory has no parent). So what is the primary key? In this case it is a compound key across all three columns... which suddenly makes Parent our worst enemy.
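To make the pain concrete, here is a rough sketch of that schema and of resolving a path against it. It is purely an illustration, not code from a real product: it uses SQLite, a surrogate id for the self-reference, and made-up table, column, and function names (users, nodes, resolve_path). The point is that every path lookup costs one query per directory level.

```python
# Illustrative only: a self-referencing tree table and the per-level walk it forces.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, login TEXT NOT NULL);
CREATE TABLE nodes (
    id      INTEGER PRIMARY KEY,                 -- surrogate id for the self-reference
    name    TEXT    NOT NULL,
    user_id INTEGER NOT NULL REFERENCES users(id),
    parent  INTEGER REFERENCES nodes(id),        -- NULL only for the root directory
    UNIQUE (name, user_id, parent)               -- the compound key from the text
);
""")

def resolve_path(user_id, path):
    """Walk the tree one level at a time: one query per path component."""
    parent = None
    for part in path.strip("/").split("/"):
        row = conn.execute(
            "SELECT id FROM nodes WHERE user_id = ? AND name = ? AND parent IS ?",
            (user_id, part, parent),
        ).fetchone()
        if row is None:
            return None                          # path does not exist
        parent = row[0]
    return parent                                # id of the final path component
```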

Now think instead about how you would put that in some form of document store. Instead of fighting the data, you can work with it and store it as the tree structure it is, which in turn decreases your development time as well as your maintenance costs. If you are decreasing costs, doesn't that allow for a different kind of scalability? Plus, in this instance you are building the system correctly from the ground up, which should give the application itself more flexibility. I am currently running this on a single server using MongoDB, which, as you explained, gives me an available, consistent model, and choosing it is not much different from choosing between MySQL and Postgres.
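As a contrast to the relational sketch above, here is one possible way the document version might look, assuming a local MongoDB instance and the pymongo driver; the database, collection, and field names (drive, files, root, children, blob_id) are made up for this example, and embedding the whole tree in one document per user is just one of several modeling choices.

```python
# Illustrative only: store the directory tree in the shape the application uses.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
files = client["drive"]["files"]           # hypothetical database and collection names

# One document per user holds that user's whole directory tree.
files.insert_one({
    "user": "alice",
    "root": {
        "name": "/",
        "children": [
            {"name": "docs", "children": [
                {"name": "reports", "children": [
                    {"name": "2014.pdf", "blob_id": "abc123"}
                ]}
            ]}
        ]
    }
})

tree = files.find_one({"user": "alice"})   # the entire tree comes back in one query
```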

With MongoDB, at least, you can define how many servers an operation must communicate with before it succeeds; so yes, you can push it toward a consistent, available model if you tell all operations to wait on all (or a majority of) the server instances.
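For example, pymongo lets you attach a write concern and read concern to a collection so that writes wait for acknowledgement from a majority (or a specific number) of replica-set members and reads only see majority-committed data. This is a sketch of those knobs under assumed settings (a replica set named rs0 on localhost, the same made-up drive.files collection), not a recipe from any particular deployment.

```python
# Illustrative only: tightening MongoDB's consistency settings via pymongo.
from pymongo import MongoClient, ReadPreference
from pymongo.read_concern import ReadConcern
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")  # assumed replica set

strict = client["drive"]["files"].with_options(
    write_concern=WriteConcern(w="majority", j=True),  # majority of members must ack, journaled
    read_concern=ReadConcern("majority"),              # only read majority-committed data
    read_preference=ReadPreference.PRIMARY,            # always read from the primary
)
# WriteConcern(w=3) would instead require acknowledgement from three specific members.

strict.insert_one({"user": "alice", "name": "notes.txt"})
```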

So I think you have the right of it: there is a big benefit in how the data is stored. There are things that don't fit well in a relational model but fit well in other models (as another brief example, Amazon uses some form of graph database for its product recommendation engine).

Did I correctly understand your question?

Edit: Will more data slow things down? Yes. How much will it slow things down? I honestly don't have enough experience to give an adequate answer.

Key/Value: Essentially a lookup table with large amounts of data associated with the lookup key. This is going to be really, really fast because you can only look things up by the key.

Column/Family: Essentially a much more structured key/value store. You can only query based on the column, so this should be really fast too.

Document: Aggregation-style schema. Here you will want to aggregate similar data together. Denormalization is OK and expected for this kind of database. Depending on whether you are doing a lot of writes or a lot of reads, you can organize your data so that it gets distributed across multiple shards to spread the writes or the reads out (note that you can create a hybrid approach that is good for both, but generally you need to optimize for one or the other; see the sketch below).

Graph: The strength of this one is that it can create and tear down relationships really quickly. If you have data whose relationships need to change between items (think some form of recommendation engine), then you should use this.
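Here is the sketch mentioned under Document: a rough illustration of how the write/read trade-off shows up when picking a shard key in MongoDB. It assumes a sharded cluster reachable through a mongos router and reuses the made-up drive.files names from above; it is not a recommendation for any particular key.

```python
# Illustrative only: hashed vs. ranged shard keys trade write spread against read locality.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumed mongos router address
client.admin.command("enableSharding", "drive")

# Hashed key: documents scatter across shards, spreading write load evenly.
client.admin.command("shardCollection", "drive.files", key={"user": "hashed"})

# Ranged key (the alternative): keeps each user's documents together, so reads for
# one user hit a single shard, at the cost of possible write hot spots.
# client.admin.command("shardCollection", "drive.files", key={"user": 1})
```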

How you store data in any of these databases will influence performance (just as storing data incorrectly in an RDBMS will influence performance). So, to hopefully make this clearer: you need to know which database system to use as well as how to store data in that database system.
