C# – Trouble searching for acronyms in Lucene.NET

c# lucene.net search

I'm currently working on a Lucene.NET full-text search implementation. For the most part it's going quite well but I'm having a few issues revolving around acronyms in the data…

For example, with "N.A.S.A." in the indexed field, I can match it with n.a.s.a. or nasa, but n.a.s.a (no trailing period) doesn't match, not even as a fuzzy search (n.a.s.a~).

My first thought is to strip out all the periods before indexing and searching, but that feels more like a workaround than a solution, and I was hoping for something cleaner.

Can anyone suggest any changes or a different analyzer (using StandardAnalyzer currently) that may be more suited to matching this kind of data?

Best Answer

The StandardAnalyzer uses the StandardTokenizer, which tokenizes 'N.A.S.A.' (with the trailing period) as 'nasa', but won't do this to 'N.A.S.A' (without it). That's why your query matches for both the input 'n.a.s.a.', which is processed into 'nasa', and the input 'nasa', which already equals the indexed token. This also explains why 'n.a.s.a' won't match anything: it stays 'n.a.s.a', and the index only contains the token 'nasa'.

This can be seen when outputting the value from the token stream directly.

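// Assumes the Lucene.NET 3.0.x namespaces: System, System.IO,
// Lucene.Net.Analysis.Standard, Lucene.Net.Analysis.Tokenattributes,
// and the alias "using Version = Lucene.Net.Util.Version;".
public static void Main(string[] args) {
    var analyzer = new StandardAnalyzer(Version.LUCENE_30);
    // Analyze both spellings: with and without the trailing period.
    var stream = analyzer.TokenStream("f", new StringReader("N.A.S.A. N.A.S.A"));

    // Print each term exactly as the analyzer would put it in the index.
    var termAttr = stream.GetAttribute<ITermAttribute>();
    while (stream.IncrementToken()) {
        Console.WriteLine(termAttr.Term);
    }

    Console.ReadLine();
}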

Outputs:

nasa
n.a.s.a

You would probably need to write a custom analyzer to handle this scenario. One solution would be to keep the original token so that a query such as n.a* would work, but you would also need better acronym detection.
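
A lighter alternative to a custom analyzer, and essentially the pre-processing the question describes, is to collapse dotted acronyms in the raw text before it reaches the analyzer, at both index time and query time. The following is a minimal sketch under that assumption and uses no Lucene.NET API. The names AcronymNormalizer and Normalize are hypothetical, and the regular expression is only a heuristic that can misfire on other dotted text such as some hostnames.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AcronymNormalizer
{
    // Single letters separated by periods, with an optional trailing period:
    // matches "N.A.S.A" and "N.A.S.A." but not "nasa" or "example.com".
    private static readonly Regex DottedAcronym =
        new Regex(@"\b\p{L}(?:\.\p{L})+\b\.?", RegexOptions.Compiled);

    // Removes the periods from every dotted acronym found in the input.
    public static string Normalize(string text)
    {
        return DottedAcronym.Replace(text, m => m.Value.Replace(".", ""));
    }

    public static void Main()
    {
        Console.WriteLine(Normalize("N.A.S.A. and N.A.S.A launched it")); // NASA and NASA launched it
        Console.WriteLine(Normalize("n.a.s.a"));                          // nasa
    }
}
```

If the same function is applied to the text being indexed and to the user's query string, both spellings should reach StandardAnalyzer as a plain word and end up as the same token.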

Related Solutions

C# – What are the correct version numbers for C#

C# language version history:

These are the versions of C# known about at the time of this writing:

C# 1.0 released with .NET 1.0 and VS2002 (January 2002)
C# 1.2 (bizarrely enough); released with .NET 1.1 and VS2003 (April 2003). First version to call Dispose on IEnumerators which implemented IDisposable. A few other small features.
C# 2.0 released with .NET 2.0 and VS2005 (November 2005). Major new features: generics, anonymous methods, nullable types, and iterator blocks
C# 3.0 released with .NET 3.5 and VS2008 (November 2007). Major new features: lambda expressions, extension methods, expression trees, anonymous types, implicit typing (var), and query expressions
C# 4.0 released with .NET 4 and VS2010 (April 2010). Major new features: late binding (dynamic), delegate and interface generic variance, more COM support, named arguments, tuple data type and optional parameters
C# 5.0 released with .NET 4.5 and VS2012 (August 2012). Major features: async programming, and caller info attributes. Breaking change: loop variable closure.
C# 6.0 released with .NET 4.6 and VS2015 (July 2015). Implemented by Roslyn. Features: initializers for automatically implemented properties, using directives to import static members, exception filters, element initializers, await in catch and finally, extension Add methods in collection initializers.
C# 7.0 released with .NET 4.7 and VS2017 (March 2017). Major new features: tuples, ref locals and ref return, pattern matching (including pattern-based switch statements), inline out parameter declarations, local functions, binary literals, digit separators, and arbitrary async returns.
C# 7.1 released with VS2017 v15.3 (August 2017). New features: async main, tuple member name inference, default expression, and pattern matching with generics.
C# 7.2 released with VS2017 v15.5 (November 2017). New features: private protected access modifier, support for Span<T> (aka interior pointer, aka stack-only struct), and a number of smaller features.
C# 7.3 released with VS2017 v15.7 (May 2018). New features: enum, delegate and unmanaged generic type constraints. ref reassignment. Unsafe improvements: stackalloc initialization, unpinned indexed fixed buffers, custom fixed statements. Improved overloading resolution. Expression variables in initializers and queries. == and != defined for tuples. Auto-properties' backing fields can now be targeted by attributes.
C# 8.0 released with .NET Core 3.0 and VS2019 v16.3 (September 2019). Major new features: nullable reference types, asynchronous streams, indices and ranges, readonly members, using declarations, default interface methods, static local functions, and enhancement of interpolated verbatim strings.
C# 9.0 released with .NET 5.0 and VS2019 v16.8 (November 2020). Major new features: init-only properties, records (including positional records and with-expressions), top-level programs, improved pattern matching (simple type patterns, relational patterns, logical patterns), improved target typing (target-typed new expressions, target-typed ?? and ?:), and covariant returns. Minor features: relax ordering of ref and partial modifiers, lambda discard parameters, native-sized integers, attributes on local functions, function pointers, static lambdas, extension GetEnumerator, module initializers, and extended partial methods.

In response to the OP's question:

What are the correct version numbers for C#? What came out when? Why can't I find any answers about C# 3.5?

There is no such thing as C# 3.5. The confusion arises because C# 3.0 shipped with .NET 3.5. The language and the framework are versioned independently, as is the CLR, which is at version 2.0 for .NET 2.0 through 3.5; .NET 4 introduced CLR 4.0 (service packs notwithstanding). The CLR in .NET 4.5 has various improvements, but the versioning is unclear: in some places it may be referred to as CLR 4.5 (this MSDN page used to refer to it that way, for example), but the Environment.Version property still reports 4.0.xxx.

As of May 3, 2017, the C# Language Team created a history of C# versions and features on their GitHub repository: Features Added in C# Language Versions. There is also a page that tracks upcoming and recently implemented language features.

C# – Storing relational data in a Lucene.NET index

I've had my share of problems with storing relational data in Lucene, but the one you have should be easy to fix.

I guess you tokenize the group fields, which makes each individual word in the field value searchable on its own. Add the field untokenized instead and it should work as expected.

Please check the following small piece of code:

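// Targets the older Lucene.NET 2.x API (Hits, Field.Index.UN_TOKENIZED).
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;

internal class Program {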
    private static void Main(string[] args) {
        var directory = new RAMDirectory();
        var writer = new IndexWriter(directory, new StandardAnalyzer());
        AddDocument(writer, "group", "stuff", Field.Index.UN_TOKENIZED);
        AddDocument(writer, "group", "other stuff", Field.Index.UN_TOKENIZED);
        writer.Close(true);

        var searcher = new IndexSearcher(directory);
        Hits hits = searcher.Search(new TermQuery(new Term("group", "stuff")));

        for (int i = 0; i < hits.Length(); i++) {
            Console.WriteLine(hits.Doc(i).GetField("group").StringValue());
        }
    }

    private static void AddDocument(IndexWriter writer, string name, string value, Field.Index index) {
        var document = new Document();
        document.Add(new Field(name, value, Field.Store.YES, index));
        writer.AddDocument(document);
    }
}

The sample adds two untokenized documents to the index, searches for stuff, and gets one hit. If you change the code to add them tokenized, you get two hits, which is what you are seeing now.
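
The difference between the two indexing modes can be illustrated with a toy model. This is not Lucene.NET code, and ToyIndex and Search are hypothetical names. It only assumes that an untokenized field contributes its whole value as a single term, while a tokenized field contributes one term per word.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ToyIndex
{
    // Returns the ids of the documents whose field value yields the given term.
    public static List<int> Search(string[] values, string term, bool tokenized)
    {
        var hits = new List<int>();
        for (int docId = 0; docId < values.Length; docId++)
        {
            // Untokenized: the whole value is one term.
            // Tokenized: the value is split into one term per word.
            IEnumerable<string> terms = tokenized
                ? values[docId].Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                : new[] { values[docId] };
            if (terms.Contains(term)) hits.Add(docId);
        }
        return hits;
    }

    public static void Main()
    {
        var groups = new[] { "stuff", "other stuff" };
        Console.WriteLine(Search(groups, "stuff", tokenized: false).Count); // 1
        Console.WriteLine(Search(groups, "stuff", tokenized: true).Count);  // 2
    }
}
```

In the tokenized case the word stuff inside "other stuff" becomes a term of its own, which is why the second document also matches.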

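One issue with using Lucene for relational data is that people may expect wildcard and range searches to always work. That is not really the case if the index is big, because of the way Lucene resolves those queries: in older versions such as the one used here, they are expanded into a clause for every matching term, and the expansion can exceed the limit on clauses in a boolean query.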

Another sample to illustrate the behavior:

    private static void Main(string[] args) {
        var directory = new RAMDirectory();
        var writer = new IndexWriter(directory, new StandardAnalyzer());

        var documentA = new Document();
        documentA.Add(new Field("name", "A", Field.Store.YES, Field.Index.UN_TOKENIZED));
        documentA.Add(new Field("group", "stuff", Field.Store.YES, Field.Index.UN_TOKENIZED));
        documentA.Add(new Field("group", "other stuff", Field.Store.YES, Field.Index.UN_TOKENIZED));
        writer.AddDocument(documentA);
        var documentB = new Document();
        documentB.Add(new Field("name", "B", Field.Store.YES, Field.Index.UN_TOKENIZED));
        documentB.Add(new Field("group", "stuff", Field.Store.YES, Field.Index.UN_TOKENIZED));
        writer.AddDocument(documentB);
        var documentC = new Document();
        documentC.Add(new Field("name", "C", Field.Store.YES, Field.Index.UN_TOKENIZED));
        documentC.Add(new Field("group", "other stuff", Field.Store.YES, Field.Index.UN_TOKENIZED));
        writer.AddDocument(documentC);

        writer.Close(true);

        var query1 = new TermQuery(new Term("group", "stuff"));
        SearchAndDisplay("First sample", directory, query1);

        var query2 = new TermQuery(new Term("group", "other stuff"));
        SearchAndDisplay("Second sample", directory, query2);

        var query3 = new BooleanQuery();
        query3.Add(new TermQuery(new Term("group", "stuff")), BooleanClause.Occur.MUST);
        query3.Add(new TermQuery(new Term("group", "other stuff")), BooleanClause.Occur.MUST);
        SearchAndDisplay("Third sample", directory, query3);
    }

    private static void SearchAndDisplay(string title, Directory directory, Query query) {
        var searcher = new IndexSearcher(directory);
        Hits hits = searcher.Search(query);
        Console.WriteLine(title);
        for (int i = 0; i < hits.Length(); i++) {
            Console.WriteLine(hits.Doc(i).GetField("name").StringValue());
        }
    }

Because each group value is indexed as a single term, the first query matches documents A and B, the second matches A and C, and the third, which requires both terms, matches only A.
