How to model hashtags with nodejs and mongodb

Tags: mongodb, node.js, nosql, tagging

Existing architecture: nodejs server with mongodb backend.

I have strings coming in describing images that can have #hashtags in them.

I wish to extract the hashtags from the strings, store the hashtags and associate the image with that hashtag.

So e.g. an image is uploaded with 'having fun at #bandcamp #nyc'

#bandcamp and #nyc are extracted.

  • If they don't exist as hashtags already, they're created and the image is associated with them both.

  • If they do exist, that's recognised and the image is associated with both.
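A minimal sketch of the extraction step, assuming a hashtag is `#` followed by letters, digits, or underscores, and that tags are compared case-insensitively. `extractHashtags` is a hypothetical helper name, not part of any library:

```javascript
// Hypothetical helper: pull hashtags out of a free-text description.
// Assumption: a tag is "#" followed by letters, digits, or underscores.
function extractHashtags(description) {
  const matches = description.match(/#[A-Za-z0-9_]+/g) || [];
  // Strip the leading "#", lowercase, and de-duplicate while keeping order.
  const normalized = matches.map((m) => m.slice(1).toLowerCase());
  return [...new Set(normalized)];
}

console.log(extractHashtags('having fun at #bandcamp #nyc'));
// prints [ 'bandcamp', 'nyc' ]
```

Lowercasing is a design choice made here so that `#NYC` and `#nyc` end up as the same tag; drop it if case should matter.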

So it will be possible to build a mongo find query that gets all images for a hashtag or multiple hashtags.

I'm new to nosql, I understand that in relational I'd have:

  • table hashtags
  • table images
  • table imageshashtags

with a many to many relationship. An image can have many hash tags, and a hashtag can have many images.

What sort of approach is suitable with mongo?
From reading q&a like this: https://stackoverflow.com/questions/8455685/how-to-implement-post-tags-in-mongo

I see that I can implement a sub document in the image document with the tags. Is that efficient for searching and retrieving?

I could then use http://cookbook.mongodb.org/patterns/count_tags/ – map reduce?

So end up with:

images collection with tags subdocument
tags collection

  • images document with tags subdocument with tags extracted and added to it when the image is created, and new tag added to the collection if it's not already present (i.e. tags must be unique)

also create the tag in the tags collection, and run map reduce.

Is that sound? Am I understanding things correctly and is my approach sensible?

Best Answer

Store hashtags in an array within a document.

That's the benefit of having documents: you can simply nest them. And, in this particular case, it's trivial:

{
    "_id": 123,
    "file": "c43a5f46-kitten.png",
    "description": "My kitten :3 #kittens #cute",
    "hashtags": ["kittens", "cute", "cat", "animals"]
}

(I added some "synonymous" tags; this can be done automatically by looking them up in some other document.)
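One way the synonym lookup could work, as a rough sketch. The `SYNONYMS` map and the `expandHashtags` helper are made up for illustration; in practice the table might be loaded from its own collection:

```javascript
// Hypothetical synonym table (an assumption, not part of the original data).
const SYNONYMS = new Map([
  ['kittens', ['cat', 'animals']],
]);

// Return the given tags plus any synonyms, without duplicates.
function expandHashtags(tags, synonyms) {
  const expanded = new Set(tags);
  for (const tag of tags) {
    for (const extra of synonyms.get(tag) || []) {
      expanded.add(extra);
    }
  }
  return [...expanded];
}

const doc = {
  _id: 123,
  file: 'c43a5f46-kitten.png',
  description: 'My kitten :3 #kittens #cute',
  hashtags: expandHashtags(['kittens', 'cute'], SYNONYMS),
};

console.log(doc.hashtags);
// prints [ 'kittens', 'cute', 'cat', 'animals' ]
```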

This is the most natural solution for a document-oriented database:

  • Searching documents by hashtag is trivial once you add an index on the array field; inserting, updating, and deleting hashtags on individual documents is just as simple
  • Mass inserts, updates, and deletes are a bit trickier, because you'd probably want to split such operations into multiple "batches", but this is still manageable and not hard to implement
  • Complex aggregations can be done with the standard aggregation pipeline or map-reduce
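The points above can be sketched as plain objects that would be handed to the MongoDB Node.js driver. The helper names (`hashtagFilter`, `addTagsUpdate`, `removeTagUpdate`) and the `images` collection handle in the comments are assumptions for illustration; the operators (`$all`, `$in`, `$addToSet`, `$each`, `$pull`, `$unwind`, `$group`, `$sort`) are standard MongoDB ones:

```javascript
// Index specification: an index on an array field is a multikey index,
// so each element of "hashtags" gets its own index entry.
const hashtagIndex = { hashtags: 1 };

// Filter for images carrying every tag in the list ($all) or any of them ($in).
function hashtagFilter(tags, matchAll = true) {
  if (tags.length === 1) return { hashtags: tags[0] };
  return { hashtags: matchAll ? { $all: tags } : { $in: tags } };
}

// Update documents: $addToSet skips tags already present, $pull removes a tag.
function addTagsUpdate(tags) {
  return { $addToSet: { hashtags: { $each: tags } } };
}

function removeTagUpdate(tag) {
  return { $pull: { hashtags: tag } };
}

// Aggregation pipeline that counts images per hashtag, most used first.
const tagCountPipeline = [
  { $unwind: '$hashtags' },
  { $group: { _id: '$hashtags', count: { $sum: 1 } } },
  { $sort: { count: -1 } },
];

// With the official driver these would be passed roughly as follows
// ("images" is an assumed collection handle):
//   await images.createIndex(hashtagIndex);
//   await images.find(hashtagFilter(['bandcamp', 'nyc'])).toArray();
//   await images.updateOne({ _id: 123 }, addTagsUpdate(['cute']));
//   await images.updateOne({ _id: 123 }, removeTagUpdate('cute'));
//   await images.aggregate(tagCountPipeline).toArray();
```

With this shape, a separate tags collection is optional rather than required: the set of known tags and their counts can be derived from the images collection itself, for example through the pipeline above.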

On the other hand, if you go relational-style, you'll be in big trouble once you reinvent an SQL JOIN in your application code. This is one of the most common anti-patterns when using MongoDB (and similar databases). Here's some very typical pseudocode:

// Anti-pattern: one extra query per hashtag, i.e. a JOIN done by hand
for (HashTag tag : mongodb.hashtags.find()) {
    for (Image img : mongodb.images.find(
            new Document("_id", tag.getImageId()))) {
        // ...
    }
}

This is inefficient and does not scale, and you are simply reinventing the wheel. You'll probably end up with O(N*M) complexity because of the nested loops in your code. If you chose SQL with foreign keys instead, you'd get something like O(N*log(M)) or even O(N+M).

There are no tables (relations) or foreign keys in MongoDB. Please do not invent them. Use SQL instead if you need them. In fact, I strongly suggest using SQL rather than MongoDB unless your data really consists of documents.

Typical examples of documents are configurations, forms, and maybe user sessions. These typically don't fit well into tables because of their "random" structure.
