Storing n-gram data

database-design · .net · nosql · sql

I was hoping to brainstorm a little bit on the subject of storing n-gram data.
In my project, I am trying to solve linguistic problems where I know all (n-1) data items and want to statistically guess the n-th using linear interpolation over all applicable n-grams. (Yes, there is a tagger that assigns tags to known words according to its lexicon, and a suffix tree that tries to guess the word class for unknown words; the n-gram component discussed here will be tasked with resolving ambiguity.)

My initial approach would be to simply store all observed n-gram data (for n = 1..3, i.e. unigram, bigram, trigram) in respective SQL tables and call it a day. But the requirements of my project may change to include other vector lengths (n), and I would like my application to adapt to 4-grams without a lot of work (updating the schema, updating application code, etc.); ideally, I would simply tell my application to work with 4-grams now, without having to change the code much (or at all), and have it train its data from a given data source.

To sum up all requirements:

  • Ability to store n-gram data (initially for n = {1, 2, 3})
  • Ability to change what kinds of n-grams should be used (between application runs)
  • Ability to (re-)train n-gram data (between application runs)
  • Ability to query the data store (e.g. if I have observed A, B, C, I'd like to know the most frequently observed item that might follow, using my trained 4-, 3-, 2- and 1-gram data sets)

    The application will most likely be read-heavy; data sets won't be retrained that often

  • The solution employs the .NET Framework (up to 4.0)

Now, what design would be a better fit for such a task?

  • A fixed table per n, managed by a SQL server (MSSQL, MySQL, …), e.g. dedicated tables for bigrams, trigrams, etc.?
  • Or a NoSQL document database that stores the first n-1 items as the document key, where the document itself contains the n-th value and its observed frequency?
  • Or something different?

Best Answer

Given that you won't know the optimal range of N, you definitely want to be able to change it. For example, if your application predicts the likelihood that a certain text is English, you would probably want to use character N-grams for N = 3..5. (That's what we found experimentally.)

You haven't shared details about your application, but the problem is clear enough: you want to represent N-gram data in a relational database (or a NoSQL document-based solution). Before I suggest a solution of my own, you may want to take a look at the following approaches:

  1. How to best store Google ngrams in a database?
  2. Storing n-grams in database in < n number of tables
  3. Managing the Google Web 1T 5-gram with Relational Database

Now, having not read any of the above links, I suggest a simple relational-database approach using multiple tables, one for each N-gram size. You could put all of the data in a single table with the maximum necessary columns (i.e. store bigrams and trigrams in ngram_4, leaving the final columns NULL), but I recommend partitioning the data. Depending on your database engine, a single table with a very large number of rows can negatively impact performance.

  CREATE TABLE ngram_1 (
      word1 NVARCHAR(50),
      frequency FLOAT,
      PRIMARY KEY (word1));

  CREATE TABLE ngram_2 (
      word1 NVARCHAR(50),
      word2 NVARCHAR(50),
      frequency FLOAT,
      PRIMARY KEY (word1, word2));

  CREATE TABLE ngram_3 (
      word1 NVARCHAR(50),
      word2 NVARCHAR(50),
      word3 NVARCHAR(50),
      frequency FLOAT,
      PRIMARY KEY (word1, word2, word3));

  CREATE TABLE ngram_4 (
      word1 NVARCHAR(50),
      word2 NVARCHAR(50),
      word3 NVARCHAR(50),
      word4 NVARCHAR(50),
      frequency FLOAT,
      PRIMARY KEY (word1, word2, word3, word4));
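
For contrast, here is a sketch of the single-table alternative mentioned above; the n column is my addition, to tell the gram sizes apart. Note that with NULL-padded columns it cannot even carry the natural composite primary key, since key columns must be non-nullable, which is one more reason to prefer the separate tables:

  CREATE TABLE ngram (
      n INT,                     -- gram size: 1..4
      word1 NVARCHAR(50),
      word2 NVARCHAR(50) NULL,   -- NULL for unigrams
      word3 NVARCHAR(50) NULL,   -- NULL for unigrams and bigrams
      word4 NVARCHAR(50) NULL,   -- NULL below 4-grams
      frequency FLOAT);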

Next, I'll give you a query that will return the most probable next word given all your ngram tables. But first, here is some sample data that you should insert into the above tables:

  INSERT [ngram_2] ([word1], [word2], [frequency]) VALUES (N'building', N'with', 0.5)
  INSERT [ngram_2] ([word1], [word2], [frequency]) VALUES (N'hit', N'the', 0.1)
  INSERT [ngram_2] ([word1], [word2], [frequency]) VALUES (N'man', N'hit', 0.2)
  INSERT [ngram_2] ([word1], [word2], [frequency]) VALUES (N'the', N'bat', 0.7)
  INSERT [ngram_2] ([word1], [word2], [frequency]) VALUES (N'the', N'building', 0.3)
  INSERT [ngram_2] ([word1], [word2], [frequency]) VALUES (N'the', N'man', 0.4)
  INSERT [ngram_2] ([word1], [word2], [frequency]) VALUES (N'with', N'the', 0.6)
  INSERT [ngram_3] ([word1], [word2], [word3], [frequency]) VALUES (N'building', N'with', N'the', 0.5)
  INSERT [ngram_3] ([word1], [word2], [word3], [frequency]) VALUES (N'hit', N'the', N'building', 0.3)
  INSERT [ngram_3] ([word1], [word2], [word3], [frequency]) VALUES (N'man', N'hit', N'the', 0.2)
  INSERT [ngram_3] ([word1], [word2], [word3], [frequency]) VALUES (N'the', N'building', N'with', 0.4)
  INSERT [ngram_3] ([word1], [word2], [word3], [frequency]) VALUES (N'the', N'man', N'hit', 0.1)
  INSERT [ngram_3] ([word1], [word2], [word3], [frequency]) VALUES (N'with', N'the', N'bat', 0.6)
  INSERT [ngram_4] ([word1], [word2], [word3], [word4], [frequency]) VALUES (N'building', N'with', N'the', N'bat', 0.5)
  INSERT [ngram_4] ([word1], [word2], [word3], [word4], [frequency]) VALUES (N'hit', N'the', N'building', N'with', 0.3)
  INSERT [ngram_4] ([word1], [word2], [word3], [word4], [frequency]) VALUES (N'man', N'hit', N'the', N'building', 0.2)
  INSERT [ngram_4] ([word1], [word2], [word3], [word4], [frequency]) VALUES (N'the', N'building', N'with', N'the', 0.4)
  INSERT [ngram_4] ([word1], [word2], [word3], [word4], [frequency]) VALUES (N'the', N'man', N'hit', N'the', 0.1)

To query the most probable next word, you would use a query like this:

  DECLARE @word1 NVARCHAR(50) = N'the';
  DECLARE @word2 NVARCHAR(50) = N'man';
  DECLARE @word3 NVARCHAR(50) = N'hit';
  DECLARE @bigramWeight FLOAT = 0.2;
  DECLARE @trigramWeight FLOAT = 0.3;
  DECLARE @fourgramWeight FLOAT = 0.5;

  SELECT next_word, SUM(frequency) AS frequency
  FROM (
    SELECT word2 AS next_word, frequency * @bigramWeight AS frequency
    FROM ngram_2
    WHERE word1 = @word3
    UNION ALL
    SELECT word3 AS next_word, frequency * @trigramWeight AS frequency
    FROM ngram_3
    WHERE word1 = @word2
      AND word2 = @word3
    UNION ALL
    SELECT word4 AS next_word, frequency * @fourgramWeight AS frequency
    FROM ngram_4
    WHERE word1 = @word1
      AND word2 = @word2
      AND word3 = @word3
    ) next_words
  GROUP BY next_word
  ORDER BY SUM(frequency) DESC
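
With the sample data above, every branch contributes the same next word, so the query returns a single row: next_word = 'the' with frequency 0.1 * 0.2 + 0.2 * 0.3 + 0.1 * 0.5 = 0.02 + 0.06 + 0.05 = 0.13 (the bigram, trigram and 4-gram contributions, respectively).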

If you add more ngram tables, you will need to add another UNION ALL clause to the above query. (Note the use of UNION ALL rather than UNION, so that identical weighted frequencies coming from different tables are not collapsed into one row.) You might notice that the first subquery uses word1 = @word3, while the second uses word1 = @word2 AND word2 = @word3. That's because the query has to align each table's leading words with the tail of the observed sequence: to find the most likely next word after a sequence of three words, the first word in the bigram data must be checked against the last word of the sequence, the first two words in the trigram data against the last two, and so on.

You can tweak the weight parameters as you wish. In this example, I assumed that higher-order n-grams are more reliable, so the weights increase with n (and sum to 1.0).

P.S. I would structure the program code to handle any number of ngram_N tables via configuration. You could declaratively change the program to use the N-gram range N = 1..6 after creating the ngram_5 and ngram_6 tables.
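
To make that concrete, here is a minimal T-SQL sketch of such a configuration-driven query builder. The @maxN value, the naive weights proportional to n, and the inlined string literals are my assumptions for illustration; a real application would read the range from configuration and parameterize the words:

  -- Build the UNION ALL query at runtime for orders 2..@maxN, so that
  -- adding an ngram_5 table requires only a configuration change.
  DECLARE @maxN INT = 4;                                    -- assumed config value
  DECLARE @words TABLE (pos INT PRIMARY KEY, word NVARCHAR(50));
  INSERT @words (pos, word) VALUES (1, N'the'), (2, N'man'), (3, N'hit');

  DECLARE @count INT = (SELECT COUNT(*) FROM @words);
  -- naive weighting: order n gets weight n / sum(2..@maxN); tune as you wish
  DECLARE @weightSum FLOAT = @maxN * (@maxN + 1) / 2.0 - 1;

  DECLARE @sql NVARCHAR(MAX) = N'', @where NVARCHAR(MAX), @w NVARCHAR(100);
  DECLARE @n INT = 2, @i INT;

  WHILE @n <= @maxN
  BEGIN
      IF @n - 1 <= @count  -- skip orders longer than the observed sequence
      BEGIN
          SET @where = N''; SET @i = 1;
          WHILE @i < @n
          BEGIN
              -- align word_i of ngram_n with the tail of the observed sequence
              SELECT @w = REPLACE(word, N'''', N'''''')
              FROM @words WHERE pos = @count - (@n - 1) + @i;
              SET @where += CASE WHEN @i = 1 THEN N'' ELSE N' AND ' END
                  + N'word' + CAST(@i AS NVARCHAR(10)) + N' = N''' + @w + N'''';
              SET @i += 1;
          END;
          SET @sql += CASE WHEN @sql = N'' THEN N'' ELSE N' UNION ALL ' END
              + N'SELECT word' + CAST(@n AS NVARCHAR(10)) + N' AS next_word'
              + N', frequency * ' + CAST(@n / @weightSum AS NVARCHAR(30)) + N' AS frequency'
              + N' FROM ngram_' + CAST(@n AS NVARCHAR(10)) + N' WHERE ' + @where;
      END;
      SET @n += 1;
  END;

  SET @sql = N'SELECT next_word, SUM(frequency) AS frequency FROM (' + @sql
      + N') next_words GROUP BY next_word ORDER BY SUM(frequency) DESC;';
  EXEC sp_executesql @sql;

For @maxN = 4 this generates the same shape as the hand-written query above; only the weight constants differ (2/9, 3/9, 4/9 instead of 0.2, 0.3, 0.5).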
