Mysql – Inverted search: Phrases per document

full-text-searchindexingluceneMySQLsearch

I have a database full of phrases (80-100 characters), and some longish documents (50-100Kb), and I would like a ranked list of phrases for a given document; rather than the usual output of a search engine, list of documents for a given phrase.

I've used MYSQL fulltext indexing before, and looked into lucene, but never used it.
They both seem geared to compare the short (search term), with the long (document).

How would you get the inverse of this?

Best Answer

I did something similar with a database of Wikipedia titles and managed to get down to a few hundred milliseconds for each ~50KB document. That was still not fast enough for my needs, but maybe it can work for you.

Basically the idea was to work with hashes as much as possible and only do string comparisons on possible matches, which are pretty rare.

First, you take your database and convert it into an array of hashes. If you have billions of phrases, this may not be for you. When you calculate the hash, be sure to pass the phrases through a tokenizer that will remove punctuation and whitespace. This part needs to be done only once.

Then, you go though the document with the same tokenizer, keeping a running list of the last 1,2,..,n tokens, hashed. At every iteration, you do a binary search of the hashes you have against the hashes database.

When you find a match, you do the actual string comparison to see if you found a match.

Here's some code, to give you a taste of whet I mean, tough this example doesn't actually do the string comparison:

            HashSet<Long> foundHashes = new HashSet<Long>();

            LinkedList<String> words = new LinkedList<String>();
            for(int i=0; i<params.maxPhrase; i++) words.addLast("");

            StandardTokenizer st = new StandardTokenizer(new StringReader(docText));
            Token t = new Token();
            while(st.next(t) != null) {
                String token = new String(t.termBuffer(), 0, t.termLength());
                words.addLast(token);
                words.removeFirst();

                for(int len=params.minPhrase; len<params.maxPhrase; len++) {
                    String term = Utils.join(new ArrayList<String>(words.subList(params.maxPhrase-len,params.maxPhrase)), " ");

                    long hash = Utils.longHash(term);

                    if(params.lexicon.isTermHash(hash)) {
                        foundHashes.add(hash);
                    }
                }
            }

            for(long hash : foundHashes) {
                if(count.containsKey(hash)) {
                    count.put(hash, count.get(hash) + 1);
                } else {
                    count.put(hash, 1);
                }
            }
Related Topic