Which database technology to choose for storing (post + replies) threads

database-designnosqlrelational-database

I am designing for a new application and I don't want to suffer later for the performance that manges discussion threads (posts + replies, very similar to Facebook or StackOverflow posts).

I wonder which kind of data store / data format I shall choose to persist the threads. I looked for an answer to my question, but all I found actually was "How to tune an RDBMS design to handle this requirement"

But is RDBMS really the best fit for this ? Most answers I could find were somehow outdated or wanted to tune some legacy systems and they did not consider No-SQL DBs.

I think that handling big amount of requests using all proposed asnwers (like here and here for example) will hurt performance when data scales because of the need to ORDER BY clauses.

I thought about storing the entire thread as one json for the sake of fast read performance. But also I think it will make a problem for update, maintenance and traffic specially because I need to apply security roles on thread component (some users can see some replies, others not)

Actually I am not much into No-SQL DBs, I just worked slightly with hbase and SOLR and most of my experience is with RDBMS. I think that Document databases are well suited for blog posts but I have no hands-on experience with that.

Any recommendations about which kind of database technology would best fit such needs ?

Important note: I don't ask for recommendation on specific products or resources, but about arguments for the choice of technology (RDMBS vs No-SQL DB).

EDIT: Thanks to the answers below, I revisited the requirements in more details , they are as follows :

1- The data is a nested set of "Issues" and "Actions" and each one can have any number of comments (i.e. issues have actions and actions have issues , and each of actions and issues can have comments)

2- conversation can't handle more than five users (converastion is a set of "issues" and "actions" and their relative comments

3- only one conversation is active at a time per set of users

4- a subset of the conversation can include users other than the rest of the conversation (but not more than five)

5- System will be distributed (in the future)

6- It will good to use a new technology other than RDBMS -unless it hurts-

7- Frontend is mobile app

I think from the above that choosing a document DB will be better, specially for points 5 and 6 and also due to the fact that data -as described- is not relational and that modeling hierarchial data + enforcing joins will not be good when data scales.

again many thanks for all who helped and still open for any recommendations including changing the technology

Best Answer

Most proponents of NoSQL overstate the scaling/performance problem.

This is admittedly an oversimplified point of view, but one of the big reasons that NoSQL is popular is because Google uses it. If Google uses it, then it must be good. But Google has enormous data requirements. Internet search notwithstanding, their source control repository is so large that they had to write their own custom source control system. You will never have this problem.

The size of your data becomes an important factor in this decision, not with millions of records, but with billions or trillions of records. As long as your in the millions of records space, you will never have a problem with relational databases as long as you maintain them properly (i.e. have sensible indexes and normalized table design).

Where I work, we recently did a proof of concept project with Apache Hive. Apache hive is a NoSQL database. We were still in the millions of records space, but if it rolls out completely, it will be in the billions of records. The purpose of the project is to mine telemetry data for information about specific events. The reason NoSQL is a good choice for this is that the data is relatively flat, i.e. there are no relationships or joins to speak of, and the nature of the data resists attempts to index it. The Map/Reduce function that hive embodies is uniquely suited to this situation.

But if your data is in any way related to ordinary business operations (as is your described scenario), relational databases are almost always a more sensible choice. Data size becomes a factor only at Facebook scale or Google scale. If that scaling problem ever happens to you, it will be a good problem, and you'll have the money you need to solve it then.

I think you overstate the performance concerns. Relational databases are more than capable of providing adequate performance in most situations, and unlike our Apache Hive scenario, your situation is not novel.

Further Reading
RDBMS vs. NoSQL: How do you pick?