I was hoping to brainstorm a little bit on the subject of storing n-gram data.
In my project, I am trying to solve linguistic problems where, knowing the first (n-1) items, I want to statistically guess the n-th item using linear interpolation over all applicable n-grams. (Yes, there is a tagger that assigns tags to known words according to its lexicon and a suffix tree that tries to guess the word class for unknown words; the n-gram component discussed here will be tasked with resolving ambiguity.)
My initial approach would be to simply store all observed n-gram data (for n = 1..3, i.e. unigram, bigram, trigram) in respective SQL databases and call it a day. But the requirements of my project may change to include other vector lengths (n), and I would like my application to adapt to 4-grams without a lot of work (updating the schema, updating application code, etc.); ideally, I would simply tell my application to work with 4-grams now, without having to change code much (or at all), and have it train its data from a given data source.
To sum up all requirements:
- Ability to store n-gram data (initially for n = {1, 2, 3})
- Ability to change what kinds of n-grams should be used (between application runs)
- Ability to (re-)train n-gram data (between application runs)
- Ability to query the data store (e.g. if I have observed A, B, C, I'd like to know the most frequently observed item for what might follow, using my trained 4-, 3-, 2-, 1-gram data sets)
- The application will most likely be read-heavy; data sets most likely won't be retrained that often
- The solution employs the .NET Framework (up to 4.0)
Now what design would be better fit for such a task?
- A fixed table managed by a SQL server (MSSQL, MySQL, …) for each n (e.g. dedicated tables for bigrams, trigrams, etc.)?
- Or a NoSQL document database that stores the first n-1 items as the document key, where the document itself contains the n-th value and its observed frequencies?
- Or something different?
Best Answer
Given that you won't know the optimal range of N, you definitely want to be able to change it. For example, if your application predicts the likelihood that a certain text is English, you would probably want to use character N-grams for N = 3..5. (That's what we found experimentally.)
You haven't shared many details about your application, but the problem is clear enough: you want to represent N-gram data in a relational database (or a NoSQL document-based solution). Before I suggest a solution of my own, you may want to take a look at the following approaches:
Now, having not read any of the above links, I suggest a simple relational database approach using multiple tables, one for each N-gram size. You could put all of the data in a single table with the maximum number of columns needed (i.e. store bigrams and trigrams in ngram_4, leaving the final columns NULL), but I recommend partitioning the data: depending on your database engine, a single table with a very large number of rows can negatively impact performance.
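The original table definitions aren't preserved in this post; a minimal T-SQL sketch of the partitioned layout might look like the following (the table and column names ngram_2 … ngram_4, word1 … word4, and frequency are assumptions for illustration, not the original schema):

```sql
-- One table per N-gram size; each row is an observed N-gram with its count.
CREATE TABLE ngram_2 (
    word1     NVARCHAR(64) NOT NULL,
    word2     NVARCHAR(64) NOT NULL,
    frequency INT          NOT NULL,
    PRIMARY KEY (word1, word2)
);

CREATE TABLE ngram_3 (
    word1     NVARCHAR(64) NOT NULL,
    word2     NVARCHAR(64) NOT NULL,
    word3     NVARCHAR(64) NOT NULL,
    frequency INT          NOT NULL,
    PRIMARY KEY (word1, word2, word3)
);

-- The pattern continues for every N you need (ngram_5, ngram_6, ...).
CREATE TABLE ngram_4 (
    word1     NVARCHAR(64) NOT NULL,
    word2     NVARCHAR(64) NOT NULL,
    word3     NVARCHAR(64) NOT NULL,
    word4     NVARCHAR(64) NOT NULL,
    frequency INT          NOT NULL,
    PRIMARY KEY (word1, word2, word4, word3)
);
```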
Next, I'll give you a query that will return the most probable next word given all your ngram tables. But first, here is some sample data that you should insert into the above tables:
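The answer's original sample rows are not reproduced here; hypothetical data in the same shape could be:

```sql
-- Hypothetical sample data; any real corpus counts would do.
INSERT INTO ngram_2 (word1, word2, frequency) VALUES
    ('to', 'be', 12),
    ('to', 'go', 5);

INSERT INTO ngram_3 (word1, word2, word3, frequency) VALUES
    ('want', 'to', 'be', 8),
    ('want', 'to', 'go', 3);

INSERT INTO ngram_4 (word1, word2, word3, word4, frequency) VALUES
    ('i', 'want', 'to', 'be', 4),
    ('i', 'want', 'to', 'go', 2);
```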
To query the most probable next word, you would use a query like this:
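The original query is not preserved here; based on the description below (one UNION branch per table, aligned on the trailing words of the sequence, weighted by N), a T-SQL reconstruction might look like this. The @word1..@word3 parameters follow the names used in the explanation; the @weight2..@weight4 parameter names are assumptions:

```sql
-- The three observed words, oldest first.
DECLARE @word1 NVARCHAR(64) = 'i',
        @word2 NVARCHAR(64) = 'want',
        @word3 NVARCHAR(64) = 'to';
-- One weight per N-gram order; higher N is assumed more reliable.
DECLARE @weight2 FLOAT = 1.0, @weight3 FLOAT = 2.0, @weight4 FLOAT = 4.0;

SELECT TOP 1 next_word, SUM(score) AS total_score
FROM (
    SELECT word2 AS next_word, frequency * @weight2 AS score
    FROM ngram_2
    WHERE word1 = @word3
    UNION ALL
    SELECT word3, frequency * @weight3
    FROM ngram_3
    WHERE word1 = @word2 AND word2 = @word3
    UNION ALL
    SELECT word4, frequency * @weight4
    FROM ngram_4
    WHERE word1 = @word1 AND word2 = @word2 AND word3 = @word3
) AS candidates
GROUP BY next_word
ORDER BY total_score DESC;
```

With the hypothetical sample data above, this returns 'be' (total score 44 versus 19 for 'go'), since 'be' dominates at every N-gram order.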
If you add more ngram tables, you will need to add another UNION clause to the above query. You might notice that the first subquery uses word1 = @word3, while the second uses word1 = @word2 AND word2 = @word3. That's because the words in the query need to be aligned with the ngram data: to find the most likely next word for a sequence of three words, the first word in the bigram data must be matched against the last word of the sequence, the first two words in the trigram data against the last two, and so on.
You can tweak the weight parameters as you wish. In this example, I assumed that higher-order n-grams are more reliable.
P.S. I would structure the program code to handle any number of ngram_N tables via configuration. You could then declaratively switch the program to use an N-gram range of N = 1..6 after creating the ngram_5 and ngram_6 tables.
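For example, widening the range to N = 5 would only require the new table (and a retrain), following the same naming convention as the sketch above:

```sql
-- Same convention as ngram_2..ngram_4; the application discovers it via config.
CREATE TABLE ngram_5 (
    word1     NVARCHAR(64) NOT NULL,
    word2     NVARCHAR(64) NOT NULL,
    word3     NVARCHAR(64) NOT NULL,
    word4     NVARCHAR(64) NOT NULL,
    word5     NVARCHAR(64) NOT NULL,
    frequency INT          NOT NULL,
    PRIMARY KEY (word1, word2, word3, word4, word5)
);
-- The generated query would then gain one more UNION ALL branch, e.g.:
--   SELECT word5, frequency * @weight5
--   FROM ngram_5
--   WHERE word1 = @w1 AND word2 = @w2 AND word3 = @w3 AND word4 = @w4
```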