Given that you won't know the optimal range of N in advance, you definitely want to be able to change it. For example, if your application predicts the likelihood that a certain text is English, you would probably want to use character N-grams with N in the range 3..5. (That's what we found experimentally.)
You haven't shared details about your application, but the problem is clear enough. You want to represent N-gram data in a relational database (or a NoSQL document-based solution). Before I suggest a solution of my own, you may want to take a look at the following approaches:
- How to best store Google ngrams in a database?
- Storing n-grams in database in < n number of tables
- Managing the Google Web 1T 5-gram with Relational Database
Now, not having read the above links myself, I suggest a simple relational database approach using multiple tables, one per N-gram size. You could put all of the data in a single table with the maximum necessary columns (i.e. store bigrams and trigrams in ngram_4, leaving the final columns null), but I recommend partitioning the data. Depending on your database engine, a single table with a very large number of rows can negatively impact performance.
CREATE TABLE ngram_1 (
    word1 NVARCHAR(50),
    frequency FLOAT,
    PRIMARY KEY (word1));

CREATE TABLE ngram_2 (
    word1 NVARCHAR(50),
    word2 NVARCHAR(50),
    frequency FLOAT,
    PRIMARY KEY (word1, word2));

CREATE TABLE ngram_3 (
    word1 NVARCHAR(50),
    word2 NVARCHAR(50),
    word3 NVARCHAR(50),
    frequency FLOAT,
    PRIMARY KEY (word1, word2, word3));

CREATE TABLE ngram_4 (
    word1 NVARCHAR(50),
    word2 NVARCHAR(50),
    word3 NVARCHAR(50),
    word4 NVARCHAR(50),
    frequency FLOAT,
    PRIMARY KEY (word1, word2, word3, word4));
Next, I'll give you a query that will return the most probable next word given all your ngram tables. But first, here is some sample data that you should insert into the above tables:
INSERT [ngram_2] ([word1], [word2], [frequency]) VALUES (N'building', N'with', 0.5)
INSERT [ngram_2] ([word1], [word2], [frequency]) VALUES (N'hit', N'the', 0.1)
INSERT [ngram_2] ([word1], [word2], [frequency]) VALUES (N'man', N'hit', 0.2)
INSERT [ngram_2] ([word1], [word2], [frequency]) VALUES (N'the', N'bat', 0.7)
INSERT [ngram_2] ([word1], [word2], [frequency]) VALUES (N'the', N'building', 0.3)
INSERT [ngram_2] ([word1], [word2], [frequency]) VALUES (N'the', N'man', 0.4)
INSERT [ngram_2] ([word1], [word2], [frequency]) VALUES (N'with', N'the', 0.6)
INSERT [ngram_3] ([word1], [word2], [word3], [frequency]) VALUES (N'building', N'with', N'the', 0.5)
INSERT [ngram_3] ([word1], [word2], [word3], [frequency]) VALUES (N'hit', N'the', N'building', 0.3)
INSERT [ngram_3] ([word1], [word2], [word3], [frequency]) VALUES (N'man', N'hit', N'the', 0.2)
INSERT [ngram_3] ([word1], [word2], [word3], [frequency]) VALUES (N'the', N'building', N'with', 0.4)
INSERT [ngram_3] ([word1], [word2], [word3], [frequency]) VALUES (N'the', N'man', N'hit', 0.1)
INSERT [ngram_3] ([word1], [word2], [word3], [frequency]) VALUES (N'with', N'the', N'bat', 0.6)
INSERT [ngram_4] ([word1], [word2], [word3], [word4], [frequency]) VALUES (N'building', N'with', N'the', N'bat', 0.5)
INSERT [ngram_4] ([word1], [word2], [word3], [word4], [frequency]) VALUES (N'hit', N'the', N'building', N'with', 0.3)
INSERT [ngram_4] ([word1], [word2], [word3], [word4], [frequency]) VALUES (N'man', N'hit', N'the', N'building', 0.2)
INSERT [ngram_4] ([word1], [word2], [word3], [word4], [frequency]) VALUES (N'the', N'building', N'with', N'the', 0.4)
INSERT [ngram_4] ([word1], [word2], [word3], [word4], [frequency]) VALUES (N'the', N'man', N'hit', N'the', 0.1)
To find the most probable next word, you would use a query like this:
DECLARE @word1 NVARCHAR(50) = 'the';
DECLARE @word2 NVARCHAR(50) = 'man';
DECLARE @word3 NVARCHAR(50) = 'hit';
DECLARE @bigramWeight FLOAT = 0.2;
DECLARE @trigramWeight FLOAT = 0.3;
DECLARE @fourgramWeight FLOAT = 0.5;

SELECT next_word, SUM(frequency) AS frequency
FROM (
    -- Bigrams: match the last context word.
    SELECT word2 AS next_word, frequency * @bigramWeight AS frequency
    FROM ngram_2
    WHERE word1 = @word3
    UNION ALL
    -- Trigrams: match the last two context words.
    SELECT word3 AS next_word, frequency * @trigramWeight AS frequency
    FROM ngram_3
    WHERE word1 = @word2
      AND word2 = @word3
    UNION ALL
    -- Four-grams: match all three context words.
    SELECT word4 AS next_word, frequency * @fourgramWeight AS frequency
    FROM ngram_4
    WHERE word1 = @word1
      AND word2 = @word2
      AND word3 = @word3
) next_words
GROUP BY next_word
ORDER BY SUM(frequency) DESC;
If you add more ngram tables, you will need to add another UNION ALL branch to the above query (UNION ALL rather than UNION, so that identical rows coming from different tables are not collapsed before the SUM). You might notice that the first subquery uses word1 = @word3, while the second uses word1 = @word2 AND word2 = @word3. That's because the query has to align the context words with the N-gram columns: to find the most likely next word after a sequence of three words, the first word in the bigram table must be matched against the last word of the sequence, the first two words in the trigram table against the last two words, and so on. An example of such an extra branch follows below.
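For instance, a hypothetical ngram_5 table would contribute a branch like the following. Note that a 5-gram needs four context words, so @word0 (the word preceding @word1) and @fivegramWeight are illustrative names you would have to introduce:
-- Hypothetical extra branch for an ngram_5 table; @word0 holds the word
-- preceding @word1, and @fivegramWeight its configured weight.
UNION ALL
SELECT word5 AS next_word, frequency * @fivegramWeight AS frequency
FROM ngram_5
WHERE word1 = @word0
  AND word2 = @word1
  AND word3 = @word2
  AND word4 = @word3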
You can tweak the weight parameters as you wish. In this example, I assumed that higher-order N-grams are more reliable predictors.
P.S. I would structure the program code to handle any number of ngram_N tables via configuration. After creating the ngram_5 and ngram_6 tables, you could then switch the program to the N-gram range 1..6 with a configuration change alone.
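Here is a minimal sketch of that idea in T-SQL, assuming the context words and per-table weights live in configuration tables (@context and @config are illustrative names, not part of the schema above). It generates the UNION ALL query dynamically, so extending the range means adding a row to @config rather than editing the query:
-- Sketch only: builds the scoring query from configuration. Assumes each
-- configured n satisfies n - 1 <= number of context words.
DECLARE @context TABLE (pos INT PRIMARY KEY, word NVARCHAR(50));
INSERT @context VALUES (1, N'the'), (2, N'man'), (3, N'hit');

DECLARE @config TABLE (n INT PRIMARY KEY, weight FLOAT);
INSERT @config VALUES (2, 0.2), (3, 0.3), (4, 0.5);

DECLARE @ctxLen INT = (SELECT COUNT(*) FROM @context);
DECLARE @sql NVARCHAR(MAX) = N'';
DECLARE @n INT, @w FLOAT, @j INT;

DECLARE cur CURSOR FOR SELECT n, weight FROM @config ORDER BY n;
OPEN cur;
FETCH NEXT FROM cur INTO @n, @w;
WHILE @@FETCH_STATUS = 0
BEGIN
    IF @sql <> N'' SET @sql += N' UNION ALL' + CHAR(10);
    SET @sql += N'SELECT word' + CAST(@n AS NVARCHAR(2)) + N' AS next_word, frequency * '
              + CAST(@w AS NVARCHAR(20)) + N' AS frequency FROM ngram_'
              + CAST(@n AS NVARCHAR(2)) + N' WHERE 1 = 1';
    SET @j = 1;
    WHILE @j < @n  -- align column word@j with the matching context word
    BEGIN
        SET @sql += N' AND word' + CAST(@j AS NVARCHAR(2)) + N' = N'''
                  + (SELECT REPLACE(word, N'''', N'''''') FROM @context
                     WHERE pos = @ctxLen - (@n - 1) + @j) + N'''';
        SET @j += 1;
    END;
    FETCH NEXT FROM cur INTO @n, @w;
END;
CLOSE cur;
DEALLOCATE cur;

SET @sql = N'SELECT next_word, SUM(frequency) AS frequency FROM ('
         + @sql + N') next_words GROUP BY next_word ORDER BY SUM(frequency) DESC;';
EXEC sp_executesql @sql;
The context words are escaped and inlined here for brevity; in production code you would pass them to sp_executesql as parameters instead.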
You should check out Neo4j, an open-source graph database. It is a mature and well-supported project with well-written, up-to-date documentation. It is Java-based but has client libraries for Ruby, JRuby, PHP, Python, C#, and more. It can run embedded (as a JAR file) or as a server (you connect and operate over HTTP via its REST API). Finally, it is disk-based and transactional.
In Neo4j, you can assign properties to nodes as well as to relationships. That means that, in your case, you could connect two nodes with a relationship that has properties called "StartDate" and "EndDate", or any other properties you wish.
For example, if your users are nodes, you could connect UserA to UserB with a relationship called "interested_in", set its StartDate property to the date you created the relationship, and later set an EndDate property to the date the "interested_in" relationship came to an end.
It might look like this:
UserA -[interested_in]-> UserB
StartDate: 20121102
EndDate: 20121107
Your users (that is, your nodes) can be linked to your existing database by giving each Neo4j node an "id" property taken from your existing database. Alternatively, you could copy all or several properties (e.g. name, surname, birth date) to the nodes in Neo4j, but then you would need to synchronize the users in Neo4j with your database every time an update occurs in either data store.
There are several example data models in the Neo4j documentation.
There is also InfiniteGraph, which is similar to Neo4j and described as a "distributed graph database", but I do not have any experience with it. For your case (50,000 - 70,000 nodes of each type), Neo4j would be a perfect match, with support for billions of nodes and relationships.
Best Answer
You're prematurely optimizing.
How many messages will you display on a page? One hundred? One thousand? One billion?
If the number of messages is too large for Microsoft SQL Server, how would you expect any browser to show that many messages to the user, and how would you expect any user to actually be interested in seeing hundreds of thousands¹ of messages on a single page? What's the point?
When it comes to storing hierarchies, Microsoft SQL Server has a dedicated hierarchical data feature: the hierarchyid data type. SQL Server also lets you store XML, but I wouldn't advise it in this specific case: SQL Server's XML fields are better suited to hierarchical data that doesn't change much over time.
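For illustration, here is a minimal sketch of hierarchyid applied to a reply tree; the table and column names are made up for the example:
-- Sketch only: a messages table whose reply-tree position is a hierarchyid.
CREATE TABLE messages (
    node hierarchyid PRIMARY KEY,   -- position in the reply tree
    depth AS node.GetLevel(),       -- computed depth, handy for indentation
    author NVARCHAR(50) NOT NULL,
    body NVARCHAR(MAX) NOT NULL);

-- A root message and one reply to it.
INSERT messages (node, author, body)
VALUES (hierarchyid::GetRoot(), N'alice', N'First post');

DECLARE @root hierarchyid = hierarchyid::GetRoot();
INSERT messages (node, author, body)
VALUES (@root.GetDescendant(NULL, NULL), N'bob', N'A reply');

-- Read back an entire thread in depth-first (tree) order.
SELECT node.ToString() AS path, depth, author, body
FROM messages
ORDER BY node;
GetDescendant takes the two siblings between which the new node should be placed, so passing existing children instead of NULLs lets you position a reply precisely.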
By the way, speaking of hierarchical structures, you're not doing your users a service. Tree structures are for developers; they are an extremely poor visualization for things such as discussions and should be avoided in this context. The flat structure of Stack Exchange, as opposed to the old tree-oriented bulletin boards, is a good illustration of replacing an unusable approach with a very effective one.
¹ Actually, I imagine Microsoft SQL Server would laugh at you if you told it you're scared of loading hundreds of thousands of messages. If your table has proper indexes and foreign keys, the query will take mere milliseconds.