NoSQL – Ensuring Data Integrity in NoSQL Situations

Architecturedesign-patternsnode.jsnosqlrelational-database

Background

To start off, at work I work with a legacy system that, in its day, was quite spectacular, but now is …interesting… to work with. It uses IBM (now Rocket) UniVerse as its backing database. One particular part of it that has caused some issues is its lack of data-integrity checks. By data integrity, I don't mean file corruption, but rather things like orphaned records or invalid keys. The particular version they use doesn't support things like triggers and such and so unless the programmer remembered to update their computed indexes, it becomes "broken" and filled with bad data. Now, the other programs have been built to live with this bad data, but it is most annoying when putting it into another database, such as MySQL (using InnoDB as the engine) which actually has constraints on the data.

The question

I'm experimenting with MongoDB and NodeJS just to see what all the hype is about. I really like Mongoose as well and its schema system. I've been reading a lot about what to store on each record vs in a separate collection. Perhaps its just my own RDBMS bias, but I decided to store each "type" of thing in a separate collection and utilize Mongoose's "populate" functionality to essentially relate the records together. Now, I am sure someone is going to say that that is going against the whole NoSQL thing, but I haven't really read anywhere that says that just because something stores documents instead of records and has no set schema on the db level that it can't be made relational.

In my experiment, I have "Posts" and "Comments". I see four ways to store the relation between these two:

  • The full data for each comment is put directly onto the post as a subdocument. There are two main cons that I have seen with this: If I decide to put comments on something else (let say a "Page") as well, I have to essentially repeat myself and it isn't quite as simple to find out how many comments a user has posted if the comments are actually stored in several collections
  • Comments are a separate collection and they store their parent's key and a Schema name for mongoose to use when populating (the schema name switching wouldn't be automatic). This isn't too bad, but its biased towards loading comments first and posts later. Finding out the comments on a post isn't hard, but requires a manual query.
  • Comments are a separate collection and posts have a list of comment ids that relate to them. This is biased towards loading posts first and finding out what a comment is attached to becomes difficult. However, mongoose would let me load the comments without having to write anything additional.
  • Comments are a separate collection and have a parent id. Posts also have a list of comment ids. This combines the above two methods and neutralizes their cons and makes for relatively few "manual" queries, but introduces the possibility of data becoming dirty and out of sync like the legacy system that I described above (e.g. a comment says it belongs to one post and another post (or multiple posts) claims that they own that comment).

I was going down the latter path listed above and I realized that I was starting to enter the realm of this legacy system that had caused so many headaches with its manually updated indexes and possibility for bad data.

Now, I don't expect to do a ton with this little experiment of mine, but its the principal of the thing that has me thinking. What would you all recommend to go with on this? I want to be able to keep query counts low, but I also don't want to have to remember to update all these indexes. There has to be a happy medium somewhere.

Another option, of course, is to just use MySQL with some nice schema constraints, but that isn't the point of this particular exercise for me since I've done that tons of times already.

Best Answer

my experience on using node.js with NoSQL resulted in skipping Mongoose and using the node-mongodb-native driver.

Reasons for this is that Mongoose actually conflicts with the node.js way of doing things, this means building your own need-framework by gluing up different tools together. Mongoose looks great if you come from a traditional environment but you will notice quick limitations of things which just go complicated. Better use the native driver and if necessary create a customer helper manager for specific collections.

Regarding your origin question I would recommend you to take the first concept and put everything in one document. I know that in the beginning it sounds silly as you feel wasting resources. This is something we have learned from data modeling table structures in MySQL. This all is not necessary with document-orientated DBs. Their idea is making things as simple as possible and just store the document you take. Believe me everything else is just waste of time. When I look back how much time I wasted just trying to build a REST API which supports Mongoose populating...

As conclusion I would recommend to just think about document orientated databases as just a persistence for serialized objects and as long as you do not plan any complicated queries or you plan to have data only once in order to keep it up-to-data avoid wasting your time with trying to rebuilt MySQL.

Hope it helps you.

Related Topic