Data Structures – Fast N-Gram Access for String Matching

data structures, search, string-matching, trie

TL;DR

Is there a data structure that'd quickly let me match words at any point (e.g., 'foo' matches 'foobar' and 'zoofoo') and, ideally, return a list of the characters that show up after the needle (e.g., 'foo' should return ['b', $])?


I'm implementing an algorithm that generates random words from a training set of other words.

In simple terms, it's basically like this:

  1. Choose an arbitrary starting point.
  2. Choose the longest suffix of the current word that is contained in at least 2 other words.
  3. Choose one of those words at random, and append the next character to the current word.
  4. GOTO 2 until the "next character" is EOW.

e.g., if the current word is 'tat', some valid options would be 'potato' and 'tattoo'; if the current word is "ophtalmi", the only option is "ophtalmic", so we check whether any words contain "phtalmi", "htalmi", "talmi", and so on.
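A minimal sketch of that loop, assuming a hypothetical helper candidates(suffix) that returns a (word, next_char) pair for every occurrence of suffix in the training set; the data-structure question below is really about how to implement candidates() quickly:

import random

def generate(seed, candidates):
    # `candidates(suffix)` is an assumed helper returning a (word, next_char)
    # pair for every occurrence of `suffix` in the training set, with '$'
    # standing in for end-of-word.
    word = seed
    while True:
        options = []
        # Step 2: longest suffix of `word` contained in at least 2 other words.
        for start in range(len(word)):
            options = candidates(word[start:])
            if len({w for w, _ in options}) >= 2:
                break
        if not options:               # no suffix matches anything; give up
            return word
        # Step 3: pick one occurrence at random and append its next character.
        _, next_char = random.choice(options)
        if next_char == '$':          # Step 4: EOW terminates the word.
            return word
        word += next_char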

I've tried a couple of implementations. In one, I used a trie populated with every suffix of every word. This is very fast at generating words, but populating the trie is VERY slow (a training set of ~4 million words still hadn't finished after more than 10 hours).
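For context, the population step looked roughly like this (a plain dict-of-dicts trie; the cost comes from inserting every suffix of every word):

def build_suffix_trie(words):
    root = {}
    for word in words:
        for start in range(len(word)):          # insert every suffix of `word`
            node = root
            for ch in word[start:] + '$':       # '$' marks end-of-word
                node = node.setdefault(ch, {})
    return root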

In another, I've generated a hash of n-grams like this:

from collections import defaultdict

ngrams = defaultdict(set)
for word in words:
    for suffix in tails(word):                 # tails("foo") = ["foo", "oo", "o"]
        for prefix, rest in prefixes(suffix):  # prefixes("foo") = [("f","oo"),("fo","o"),("foo","")]
            ngrams[prefix].add(rest)           # the set of everything ever seen after `prefix`

and it's much faster at reading the training set, and very fast at generating, but it takes a lot of RAM.
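For reference, plausible definitions of the tails/prefixes helpers used above, plus what a lookup looks like (the two-word training set is just for illustration):

def tails(word):
    # tails("foo") -> ["foo", "oo", "o"]
    return [word[i:] for i in range(len(word))]

def prefixes(word):
    # prefixes("foo") -> [("f", "oo"), ("fo", "o"), ("foo", "")]
    return [(word[:i], word[i:]) for i in range(1, len(word) + 1)]

# With words = ['foobar', 'zoofoo'], ngrams['foo'] ends up as {'bar', ''},
# so the characters that can follow 'foo' are 'b' plus end-of-word.
next_chars = {rest[0] if rest else '$' for rest in ngrams['foo']}

Storing the full remainder for every prefix of every suffix is also where the RAM goes.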

And, finally, the dumb option of simply searching:

candidates = [word for word in words if string in word]

which takes very little memory, but is much slower.

Is there a data structure with the behaviour I need?

Best Answer

The classic answer would be a trie that stores all rotations of words (Scrabble has a very similar need and a very similar data structure called a GADDAG). It turns out you can do much better (a B-tree of words where the lowest level is delta-encoded), but the simplest thing you can do is store a sorted list of all rotations of all words in your dictionary and binary search it. Example:

Our dictionary contains the word w = 'zoofoo', so we store:

sorted(w[i:] + '^' + w[:i] for i in range(len(w)))
['foo^zoo', 'o^zoofo', 'ofoo^zo', 'oo^zoof', 'oofoo^z', 'zoofoo^']

Hey look, one of those entries starts with 'foo'! We can find it via binary search, and reconstruct the original word from 'foo^zoo' by flipping around the ^.
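A small sketch of that lookup (the function names are mine; bisect does the binary search, and the character immediately after the needle in a matching entry is the "next character" the question asks for, with '^' doubling as end-of-word):

import bisect

def build_rotation_index(words):
    # One entry per rotation of every word: suffix + '^' + prefix, sorted.
    return sorted(w[i:] + '^' + w[:i] for w in words for i in range(len(w)))

def find(index, needle):
    # Binary search to the first entry >= needle, then walk the contiguous
    # block of entries that start with it.
    i = bisect.bisect_left(index, needle)
    matches = []
    while i < len(index) and index[i].startswith(needle):
        suffix, _, prefix = index[i].partition('^')
        matches.append((prefix + suffix, index[i][len(needle)]))
        i += 1
    return matches                    # list of (original word, next char or '^')

For the dictionary ['foobar', 'zoofoo'], find(build_rotation_index(['foobar', 'zoofoo']), 'foo') gives [('zoofoo', '^'), ('foobar', 'b')].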

If you can sort your input as well, you can do a linear intersection of the input list and the rotation dictionary.
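A rough sketch of that merge, assuming the needles are sorted and contain no '^' (the index pointer only ever moves forward):

def intersect(sorted_needles, index):
    # Walk both sorted lists together; `j` never moves backwards.
    results = {}
    j = 0
    for needle in sorted_needles:
        while j < len(index) and index[j] < needle:
            j += 1
        k = j
        matches = []
        while k < len(index) and index[k].startswith(needle):
            suffix, _, prefix = index[k].partition('^')
            matches.append(prefix + suffix)
            k += 1
        results[needle] = matches
    return results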