Given that you won't know the optimal range of N in advance, you definitely want to be able to change it. For example, if your application predicts the likelihood that a certain text is English, you would probably want to use character N-grams with N in the range 3..5. (That's what we found experimentally.)
You haven't shared details about your application, but the problem is clear enough. You want to represent N-gram data in a relational database (or a NoSQL document-based solution). Before I suggest a solution of my own, you may want to take a look at the following approaches:
- How to best store Google ngrams in a database?
- Storing n-grams in database in < n number of tables
- Managing the Google Web 1T 5-gram with Relational Database
Now, not having read the above links myself, I suggest a simple relational database approach using multiple tables, one per N-gram size. You could put all of the data in a single table with the maximum necessary columns (i.e. store bigrams and trigrams in ngram_4, leaving the final columns null), but I recommend partitioning the data. Depending on your database engine, a single table with a very large number of rows can negatively impact performance.
CREATE TABLE ngram_1 (
    word1 NVARCHAR(50),
    frequency FLOAT,
    PRIMARY KEY (word1));

CREATE TABLE ngram_2 (
    word1 NVARCHAR(50),
    word2 NVARCHAR(50),
    frequency FLOAT,
    PRIMARY KEY (word1, word2));

CREATE TABLE ngram_3 (
    word1 NVARCHAR(50),
    word2 NVARCHAR(50),
    word3 NVARCHAR(50),
    frequency FLOAT,
    PRIMARY KEY (word1, word2, word3));

CREATE TABLE ngram_4 (
    word1 NVARCHAR(50),
    word2 NVARCHAR(50),
    word3 NVARCHAR(50),
    word4 NVARCHAR(50),
    frequency FLOAT,
    PRIMARY KEY (word1, word2, word3, word4));
Next, I'll give you a query that will return the most probable next word given all your ngram tables. But first, here is some sample data that you should insert into the above tables:
INSERT [ngram_2] ([word1], [word2], [frequency]) VALUES (N'building', N'with', 0.5)
INSERT [ngram_2] ([word1], [word2], [frequency]) VALUES (N'hit', N'the', 0.1)
INSERT [ngram_2] ([word1], [word2], [frequency]) VALUES (N'man', N'hit', 0.2)
INSERT [ngram_2] ([word1], [word2], [frequency]) VALUES (N'the', N'bat', 0.7)
INSERT [ngram_2] ([word1], [word2], [frequency]) VALUES (N'the', N'building', 0.3)
INSERT [ngram_2] ([word1], [word2], [frequency]) VALUES (N'the', N'man', 0.4)
INSERT [ngram_2] ([word1], [word2], [frequency]) VALUES (N'with', N'the', 0.6)
INSERT [ngram_3] ([word1], [word2], [word3], [frequency]) VALUES (N'building', N'with', N'the', 0.5)
INSERT [ngram_3] ([word1], [word2], [word3], [frequency]) VALUES (N'hit', N'the', N'building', 0.3)
INSERT [ngram_3] ([word1], [word2], [word3], [frequency]) VALUES (N'man', N'hit', N'the', 0.2)
INSERT [ngram_3] ([word1], [word2], [word3], [frequency]) VALUES (N'the', N'building', N'with', 0.4)
INSERT [ngram_3] ([word1], [word2], [word3], [frequency]) VALUES (N'the', N'man', N'hit', 0.1)
INSERT [ngram_3] ([word1], [word2], [word3], [frequency]) VALUES (N'with', N'the', N'bat', 0.6)
INSERT [ngram_4] ([word1], [word2], [word3], [word4], [frequency]) VALUES (N'building', N'with', N'the', N'bat', 0.5)
INSERT [ngram_4] ([word1], [word2], [word3], [word4], [frequency]) VALUES (N'hit', N'the', N'building', N'with', 0.3)
INSERT [ngram_4] ([word1], [word2], [word3], [word4], [frequency]) VALUES (N'man', N'hit', N'the', N'building', 0.2)
INSERT [ngram_4] ([word1], [word2], [word3], [word4], [frequency]) VALUES (N'the', N'building', N'with', N'the', 0.4)
INSERT [ngram_4] ([word1], [word2], [word3], [word4], [frequency]) VALUES (N'the', N'man', N'hit', N'the', 0.1)
To find the most probable next word, you would use a query like this:
DECLARE @word1 NVARCHAR(50) = 'the';
DECLARE @word2 NVARCHAR(50) = 'man';
DECLARE @word3 NVARCHAR(50) = 'hit';
DECLARE @bigramWeight FLOAT = 0.2;
DECLARE @trigramWeight FLOAT = 0.3;
DECLARE @fourgramWeight FLOAT = 0.5;

SELECT next_word, SUM(frequency) AS frequency
FROM (
    -- Bigrams: match the last context word.
    SELECT word2 AS next_word, frequency * @bigramWeight AS frequency
    FROM ngram_2
    WHERE word1 = @word3
    UNION ALL
    -- Trigrams: match the last two context words.
    SELECT word3 AS next_word, frequency * @trigramWeight AS frequency
    FROM ngram_3
    WHERE word1 = @word2
      AND word2 = @word3
    UNION ALL
    -- Four-grams: match all three context words.
    SELECT word4 AS next_word, frequency * @fourgramWeight AS frequency
    FROM ngram_4
    WHERE word1 = @word1
      AND word2 = @word2
      AND word3 = @word3
) next_words
GROUP BY next_word
ORDER BY SUM(frequency) DESC;
If you add more ngram tables, you will need to add another UNION ALL branch to the above query (UNION ALL rather than UNION, so that identical rows coming from different tables are not collapsed before the SUM). You might notice that the first subquery uses word1 = @word3, while the second uses word1 = @word2 AND word2 = @word3. That's because the query has to align the context words with the N-gram columns: to find the most likely next word after a sequence of three words, the first word in the bigram table must be matched against the last word of the sequence, the first two words in the trigram table against the last two words, and so on. An example of such an extra branch follows below.
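For instance, a hypothetical ngram_5 table would contribute a branch like the following. Note that a 5-gram needs four context words, so @word0 (the word preceding @word1) and @fivegramWeight are illustrative names you would have to introduce:
-- Hypothetical extra branch for an ngram_5 table; @word0 holds the word
-- preceding @word1, and @fivegramWeight its configured weight.
UNION ALL
SELECT word5 AS next_word, frequency * @fivegramWeight AS frequency
FROM ngram_5
WHERE word1 = @word0
  AND word2 = @word1
  AND word3 = @word2
  AND word4 = @word3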
You can tweak the weight parameters as you wish. In this example, I assumed that higher-order N-grams are more reliable predictors.
P.S. I would structure the program code to handle any number of ngram_N tables via configuration. After creating the ngram_5 and ngram_6 tables, you could then switch the program to the N-gram range 1..6 with a configuration change alone.
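Here is a minimal sketch of that idea in T-SQL, assuming the context words and per-table weights live in configuration tables (@context and @config are illustrative names, not part of the schema above). It generates the UNION ALL query dynamically, so extending the range means adding a row to @config rather than editing the query:
-- Sketch only: builds the scoring query from configuration. Assumes each
-- configured n satisfies n - 1 <= number of context words.
DECLARE @context TABLE (pos INT PRIMARY KEY, word NVARCHAR(50));
INSERT @context VALUES (1, N'the'), (2, N'man'), (3, N'hit');

DECLARE @config TABLE (n INT PRIMARY KEY, weight FLOAT);
INSERT @config VALUES (2, 0.2), (3, 0.3), (4, 0.5);

DECLARE @ctxLen INT = (SELECT COUNT(*) FROM @context);
DECLARE @sql NVARCHAR(MAX) = N'';
DECLARE @n INT, @w FLOAT, @j INT;

DECLARE cur CURSOR FOR SELECT n, weight FROM @config ORDER BY n;
OPEN cur;
FETCH NEXT FROM cur INTO @n, @w;
WHILE @@FETCH_STATUS = 0
BEGIN
    IF @sql <> N'' SET @sql += N' UNION ALL' + CHAR(10);
    SET @sql += N'SELECT word' + CAST(@n AS NVARCHAR(2)) + N' AS next_word, frequency * '
              + CAST(@w AS NVARCHAR(20)) + N' AS frequency FROM ngram_'
              + CAST(@n AS NVARCHAR(2)) + N' WHERE 1 = 1';
    SET @j = 1;
    WHILE @j < @n  -- align column word@j with the matching context word
    BEGIN
        SET @sql += N' AND word' + CAST(@j AS NVARCHAR(2)) + N' = N'''
                  + (SELECT REPLACE(word, N'''', N'''''') FROM @context
                     WHERE pos = @ctxLen - (@n - 1) + @j) + N'''';
        SET @j += 1;
    END;
    FETCH NEXT FROM cur INTO @n, @w;
END;
CLOSE cur;
DEALLOCATE cur;

SET @sql = N'SELECT next_word, SUM(frequency) AS frequency FROM ('
         + @sql + N') next_words GROUP BY next_word ORDER BY SUM(frequency) DESC;';
EXEC sp_executesql @sql;
The context words are escaped and inlined here for brevity; in production code you would pass them to sp_executesql as parameters instead.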
You should check out Neo4j, an open-source graph database. It is a mature and well-supported project with well-written, up-to-date documentation. It is Java-based but has client libraries for Ruby, JRuby, PHP, Python, C#, and more. It can run embedded (as a JAR file) or as a server (you connect and operate over HTTP via its REST API). Finally, it is disk-based and transactional.
In Neo4j, you can assign properties to nodes as well as to relationships. That means that, in your case, you could connect two nodes with a relationship that has properties called "StartDate" and "EndDate", or any other properties you wish.
For example, if your users are nodes, you could connect UserA to UserB with a relationship called "interested_in", set its StartDate property to the date you created the relationship, and later set an EndDate property to the date the "interested_in" relationship came to an end.
It might look like this:
UserA -[interested_in]-> UserB
StartDate: 20121102
EndDate: 20121107
Your users (that is, your nodes) can be linked to your existing database by giving each Neo4j node an "id" property taken from your existing database. Alternatively, you could copy all or several properties (e.g. name, surname, birth date) to the nodes in Neo4j, but then you would need to synchronize the users in Neo4j with your database every time an update occurs in either data store.
There are several example data models in the Neo4j documentation.
There is also InfiniteGraph, which is similar to Neo4j and described as a "distributed graph database", but I do not have any experience with it. For your case (50,000 - 70,000 nodes of each type), Neo4j would be a perfect match, with support for billions of nodes and relationships.
Best Answer
You're prematurely optimizing.
How many messages will you display on a page? One hundred? One thousand? One billion?
If the number of messages is too large for Microsoft SQL Server, how would you expect any browser to show that many messages to the user, and how would you expect any user to actually be interested in seeing hundreds of thousands¹ of messages on a single page? What's the point?
When it comes to storing hierarchies, Microsoft SQL Server has a dedicated hierarchical data feature: the hierarchyid data type. SQL Server also lets you store XML, but I wouldn't advise it in this specific case: SQL Server's XML fields are better suited to hierarchical data that doesn't change much over time.
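For illustration, here is a minimal sketch of hierarchyid applied to a reply tree; the table and column names are made up for the example:
-- Sketch only: a messages table whose reply-tree position is a hierarchyid.
CREATE TABLE messages (
    node hierarchyid PRIMARY KEY,   -- position in the reply tree
    depth AS node.GetLevel(),       -- computed depth, handy for indentation
    author NVARCHAR(50) NOT NULL,
    body NVARCHAR(MAX) NOT NULL);

-- A root message and one reply to it.
INSERT messages (node, author, body)
VALUES (hierarchyid::GetRoot(), N'alice', N'First post');

DECLARE @root hierarchyid = hierarchyid::GetRoot();
INSERT messages (node, author, body)
VALUES (@root.GetDescendant(NULL, NULL), N'bob', N'A reply');

-- Read back an entire thread in depth-first (tree) order.
SELECT node.ToString() AS path, depth, author, body
FROM messages
ORDER BY node;
GetDescendant takes the two siblings between which the new node should be placed, so passing existing children instead of NULLs lets you position a reply precisely.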
By the way, speaking of hierarchical structures, you're not doing your users a service. Tree structures are for developers; they are an extremely poor visualization for things such as discussions and should be avoided in this context. The flat structure of Stack Exchange, as opposed to the old tree-oriented bulletin boards, is a good illustration of replacing an unusable approach with a very effective one.
¹ Actually, I imagine Microsoft SQL Server would laugh at you if you told it you're scared of loading hundreds of thousands of messages. If your table has proper indexes and foreign keys, the query will take mere milliseconds.