R – Search Lucene with precise edit distances

fuzzy-searchlucene

I would like to search a Lucene index with edit distances. For example, say, there is a document with a field FIRST_NAME; I want all documents with first names that are 1 edit distance away from, say, 'john'.

I know that Lucene supports fuzzy searches (FIRST_NAME:john~) and takes a number between 0 and 1 to control the fuzziness. The problem (for me) is this number does not directly translate to an edit distance. And when the values in the documents are short strings (less than 3 characters) the fuzzy search has difficulty finding them. For example if there is a document with FIRST_NAME 'J' and I search for FIRST_NAME:I~0.0 I don't get anything back.

Best Answer

In Lucene's FuzzyQuery, you cannot specify the extact distance. You can specify the value of "fuzziness" between 0 and 1 where values closer to 0 indicate broad match and values closer to 1 indicate narrow match. The formula for "fuzziness" is as follows. (From Lucene in Action)

From this formula, you can work back to an approximate fuzziness for given value of distance. So, StackOverflow is to be matched with StackUnderflow, which is at a distance of 3, the fuzziness required will be approximately 0.77.

Related Topic