C# – Trouble searching for acronyms in Lucene.NET

c#, lucene.net, search

I'm currently working on a Lucene.NET full-text search implementation. For the most part it's going quite well but I'm having a few issues revolving around acronyms in the data…

For example, if the indexed field contains "N.A.S.A.", I can match it with n.a.s.a. or nasa, but n.a.s.a (without the trailing period) doesn't match, not even as a fuzzy search (n.a.s.a~).

My first thought is to strip out all the periods before indexing and searching, but that feels more like a workaround than a solution, and I was hoping for something cleaner.

Can anyone suggest any changes or a different analyzer (using StandardAnalyzer currently) that may be more suited to matching this kind of data?

Best Answer

StandardAnalyzer uses StandardTokenizer, which tokenizes 'N.A.S.A.' (with a trailing period) as 'nasa' but leaves 'N.A.S.A' (without one) as 'n.a.s.a'. That's why your original queries match: the input 'N.A.S.A.' is processed into 'nasa', and the input 'nasa' already equals the indexed token. It also explains why 'N.A.S.A' won't match anything: it is tokenized as 'n.a.s.a', while the index only contains the token 'nasa'.

This can be seen when outputting the value from the token stream directly.

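using System;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.Tokenattributes;
using Version = Lucene.Net.Util.Version; // avoids a clash with System.Version

public static class Program {
    public static void Main(string[] args) {
        var analyzer = new StandardAnalyzer(Version.LUCENE_30);
        // Tokenize one acronym with a trailing period and one without.
        var stream = analyzer.TokenStream("f", new StringReader("N.A.S.A. N.A.S.A"));

        var termAttr = stream.GetAttribute<ITermAttribute>();
        while (stream.IncrementToken()) {
            Console.WriteLine(termAttr.Term);
        }

        Console.ReadLine();
    }
}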

Outputs:

nasa
n.a.s.a

You would probably need to write a custom analyzer to handle this scenario. One solution would be to keep the original token alongside the normalized one so that a query such as n.a* still works, but you would also need to build better acronym detection.
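As a rough illustration of what acronym detection might look like, here is a minimal sketch that collapses dotted acronyms with or without a trailing period. It is plain C# with no Lucene.NET calls, and the idea is to run it on the text before indexing and on the query text before parsing. The AcronymNormalizer class and its regular expression are my own assumptions, not part of Lucene.NET, and the pattern is deliberately simple: it will also collapse abbreviations such as "e.g.", and it does not keep the original token, so n.a* would still not match.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of Lucene.NET.
public static class AcronymNormalizer
{
    // Assumed pattern: two or more single letters separated by periods,
    // with an optional trailing period, not embedded in a longer word.
    private static readonly Regex Dotted = new Regex(
        @"(?<![A-Za-z0-9.])(?:[A-Za-z]\.)+[A-Za-z]\.?(?![A-Za-z0-9]|\.[A-Za-z0-9])",
        RegexOptions.Compiled);

    // Removes the periods from every dotted acronym; other text is unchanged.
    public static string Normalize(string text)
    {
        return Dotted.Replace(text, m => m.Value.Replace(".", ""));
    }
}

public static class Demo
{
    public static void Main()
    {
        // Prints: NASA NASA nasa
        Console.WriteLine(AcronymNormalizer.Normalize("N.A.S.A. N.A.S.A nasa"));
    }
}
```

Since both forms end up as the same string, StandardAnalyzer should then lowercase them to the same 'nasa' token at index time and at query time.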