Lucene: What is the difference between Query and Filter

lucene, lucene.net

Lucene query vs filter?

Both seem to do similar things: a TermQuery, for example, selects documents by term value, and I guess a filter is there for a similar purpose.

When would you use filter and when query?

I just started with Lucene today, so I'm trying to get the concepts clear.

Best Answer

A filter restricts which documents can match, but it doesn't affect the score computed for the documents that pass it.

For instance imagine the following docs:

1.
loc: "uk", "london"
text: "i live in london", "london is the best"

2.
loc: "london avenue", "london street", "london"
text: "I like the shop in london st."

Now let's say you run the following query:

q=+loc:"london" +text:"london"

In this query the score of doc 2 is higher than that of doc 1, because the loc clause contributes to the document score and doc 2 matches "london" in loc more often.

Using a filter:

q=+text:"london" f=+loc:"london"

In this query the score of doc 1 is higher than that of doc 2, because only the text clause is scored and doc 1 mentions "london" more often in text.

Excuse the Solr-style formatting, but the overall notion should be clear.
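The ranking flip can be reproduced with a toy model. The sketch below is not Lucene code: it assumes a score that is simply the number of times the term occurs in the scored fields (Lucene's real similarity also weighs things like field length and term rarity), and the class and method names are made up for illustration.

```java
import java.util.List;
import java.util.Map;

/** Toy model of scored clauses vs. a filter; NOT Lucene's real scoring. */
final class QueryVsFilterSketch {

    // The two example documents from the answer, as field -> values.
    static final Map<String, List<String>> DOC1 = Map.of(
            "loc", List.of("uk", "london"),
            "text", List.of("i live in london", "london is the best"));
    static final Map<String, List<String>> DOC2 = Map.of(
            "loc", List.of("london avenue", "london street", "london"),
            "text", List.of("I like the shop in london st."));

    /** Counts how often term occurs as a whole word across a field's values. */
    static int termCount(List<String> values, String term) {
        int count = 0;
        for (String value : values) {
            for (String token : value.toLowerCase().split("[^a-z0-9]+")) {
                if (token.equals(term)) {
                    count++;
                }
            }
        }
        return count;
    }

    /** Models q=+loc:term +text:term: both required clauses add to the score. */
    static int scoreAsQuery(Map<String, List<String>> doc, String term) {
        int loc = termCount(doc.get("loc"), term);
        int text = termCount(doc.get("text"), term);
        return (loc > 0 && text > 0) ? loc + text : 0;
    }

    /** Models q=+text:term f=+loc:term: loc only decides whether the doc is kept. */
    static int scoreWithFilter(Map<String, List<String>> doc, String term) {
        boolean passesFilter = termCount(doc.get("loc"), term) > 0;
        return passesFilter ? termCount(doc.get("text"), term) : 0;
    }

    public static void main(String[] args) {
        System.out.println("query:  doc1=" + scoreAsQuery(DOC1, "london")
                + " doc2=" + scoreAsQuery(DOC2, "london"));
        System.out.println("filter: doc1=" + scoreWithFilter(DOC1, "london")
                + " doc2=" + scoreWithFilter(DOC2, "london"));
    }
}
```

Under this simplified scoring, doc 2 wins when loc is a scored clause and doc 1 wins when loc is only a filter, which is the same ordering the answer describes.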

Another reason to use filters is caching. Filters are cached separately from queries, so if you have a dynamic query with a static part, it makes sense to express the static part as a filter. That way the index traversal is limited to the subset of documents that pass the filter.
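The caching idea can also be sketched without Lucene. The example below is a rough illustration only: it assumes a filter result can be represented as a bit set of document ids keyed by the filter value, and every name in it (CachedFilterSketch, locFilter, filterEvaluations) is hypothetical. Lucene's own filter caching works per index reader and is more involved.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of caching a static filter as a bit set; NOT Lucene's actual cache. */
final class CachedFilterSketch {

    private final List<String[]> docs; // each doc is {loc, text}
    private final Map<String, BitSet> filterCache = new HashMap<>();
    int filterEvaluations = 0; // how many times a filter was actually computed

    CachedFilterSketch(List<String[]> docs) {
        this.docs = docs;
    }

    /** Returns the ids of docs whose loc equals the value, computed at most once per value. */
    BitSet locFilter(String loc) {
        return filterCache.computeIfAbsent(loc, key -> {
            filterEvaluations++;
            BitSet bits = new BitSet(docs.size());
            for (int id = 0; id < docs.size(); id++) {
                if (docs.get(id)[0].equals(key)) {
                    bits.set(id);
                }
            }
            return bits;
        });
    }

    /** Runs the dynamic text match only over docs that pass the cached filter. */
    List<Integer> search(String textTerm, String loc) {
        BitSet allowed = locFilter(loc);
        List<Integer> hits = new ArrayList<>();
        for (int id = allowed.nextSetBit(0); id >= 0; id = allowed.nextSetBit(id + 1)) {
            if (docs.get(id)[1].contains(textTerm)) {
                hits.add(id);
            }
        }
        return hits;
    }
}
```

Two searches with different text terms but the same loc value reuse the same bit set, so the static part is evaluated once while the dynamic part runs only over the documents it allows.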

Related Solutions

Java – How to do query auto-completion/suggestions in Lucene

Based on @Alexandre Victoor's answer, I wrote a little class based on the Lucene Spellchecker in the contrib package (and using the LuceneDictionary included in it) that does exactly what I want.

This allows re-indexing from a single source index with a single field, and provides suggestions for terms. Results are sorted by the number of matching documents with that term in the original index, so more popular terms appear first. Seems to work pretty well :)

import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.ISOLatin1AccentFilter;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter;
import org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter.Side;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.spell.LuceneDictionary;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

/**
 * Search term auto-completer, works for single terms (so use on the last term
 * of the query).
 * <p>
 * Returns more popular terms first.
 * 
 * @author Mat Mannion, M.Mannion@warwick.ac.uk
 */
public final class Autocompleter {

    private static final String GRAMMED_WORDS_FIELD = "words";

    private static final String SOURCE_WORD_FIELD = "sourceWord";

    private static final String COUNT_FIELD = "count";

    private static final String[] ENGLISH_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by",
    "for", "i", "if", "in", "into", "is",
    "no", "not", "of", "on", "or", "s", "such",
    "t", "that", "the", "their", "then", "there", "these",
    "they", "this", "to", "was", "will", "with"
    };

    private final Directory autoCompleteDirectory;

    private IndexReader autoCompleteReader;

    private IndexSearcher autoCompleteSearcher;

    public Autocompleter(String autoCompleteDir) throws IOException {
        this.autoCompleteDirectory = FSDirectory.getDirectory(autoCompleteDir,
                null);

        reOpenReader();
    }

    public List<String> suggestTermsFor(String term) throws IOException {
        // get the top 5 terms for query
        Query query = new TermQuery(new Term(GRAMMED_WORDS_FIELD, term));
        Sort sort = new Sort(COUNT_FIELD, true);

        TopDocs docs = autoCompleteSearcher.search(query, null, 5, sort);
        List<String> suggestions = new ArrayList<String>();
        for (ScoreDoc doc : docs.scoreDocs) {
            suggestions.add(autoCompleteReader.document(doc.doc).get(
                    SOURCE_WORD_FIELD));
        }

        return suggestions;
    }

    @SuppressWarnings("unchecked")
    public void reIndex(Directory sourceDirectory, String fieldToAutocomplete)
            throws CorruptIndexException, IOException {
        // build a dictionary (from the spell package)
        IndexReader sourceReader = IndexReader.open(sourceDirectory);

        LuceneDictionary dict = new LuceneDictionary(sourceReader,
                fieldToAutocomplete);

        // code from
        // org.apache.lucene.search.spell.SpellChecker.indexDictionary(
        // Dictionary)
        IndexReader.unlock(autoCompleteDirectory);

        // use a custom analyzer so we can do EdgeNGramFiltering
        IndexWriter writer = new IndexWriter(autoCompleteDirectory,
        new Analyzer() {
            public TokenStream tokenStream(String fieldName,
                    Reader reader) {
                TokenStream result = new StandardTokenizer(reader);

                result = new StandardFilter(result);
                result = new LowerCaseFilter(result);
                result = new ISOLatin1AccentFilter(result);
                result = new StopFilter(result,
                    ENGLISH_STOP_WORDS);
                result = new EdgeNGramTokenFilter(
                    result, Side.FRONT,1, 20);

                return result;
            }
        }, true);

        writer.setMergeFactor(300);
        writer.setMaxBufferedDocs(150);

        // go through every word, storing the original word (incl. n-grams) 
        // and the number of times it occurs
        Map<String, Integer> wordsMap = new HashMap<String, Integer>();

        Iterator<String> iter = (Iterator<String>) dict.getWordsIterator();
        while (iter.hasNext()) {
            String word = iter.next();

            int len = word.length();
            if (len < 3) {
                continue; // too short we bail but "too long" is fine...
            }

            if (wordsMap.containsKey(word)) {
                throw new IllegalStateException(
                        "This should never happen in Lucene 2.3.2");
                // wordsMap.put(word, wordsMap.get(word) + 1);
            } else {
                // use the number of documents this word appears in
                wordsMap.put(word, sourceReader.docFreq(new Term(
                        fieldToAutocomplete, word)));
            }
        }

        for (String word : wordsMap.keySet()) {
            // ok index the word
            Document doc = new Document();
            doc.add(new Field(SOURCE_WORD_FIELD, word, Field.Store.YES,
                    Field.Index.UN_TOKENIZED)); // orig term
            doc.add(new Field(GRAMMED_WORDS_FIELD, word, Field.Store.YES,
                    Field.Index.TOKENIZED)); // grammed
            doc.add(new Field(COUNT_FIELD,
                    Integer.toString(wordsMap.get(word)), Field.Store.NO,
                    Field.Index.UN_TOKENIZED)); // count

            writer.addDocument(doc);
        }

        sourceReader.close();

        // close writer
        writer.optimize();
        writer.close();

        // re-open our reader
        reOpenReader();
    }

    private void reOpenReader() throws CorruptIndexException, IOException {
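        if (autoCompleteReader == null) {
            autoCompleteReader = IndexReader.open(autoCompleteDirectory);
        } else {
            // reopen() returns a new reader when the index has changed;
            // keep that instance and close the stale one
            IndexReader newReader = autoCompleteReader.reopen();
            if (newReader != autoCompleteReader) {
                autoCompleteReader.close();
                autoCompleteReader = newReader;
            }
        }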

        autoCompleteSearcher = new IndexSearcher(autoCompleteReader);
    }

    public static void main(String[] args) throws Exception {
        Autocompleter autocomplete = new Autocompleter("/index/autocomplete");

        // run this to re-index from the current index, shouldn't need to do
        // this very often
        // autocomplete.reIndex(FSDirectory.getDirectory("/index/live", null),
        // "content");

        String term = "steve";

        System.out.println(autocomplete.suggestTermsFor(term));
        // prints [steve, steven, stevens, stevenson, stevenage]
    }

}
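To see why indexing front edge n-grams makes prefix lookup a plain term match, here is a small stand-alone sketch that uses only the JDK. It is a simplification under stated assumptions: no lowercasing, accent folding, or stop-word removal, an in-memory map instead of a Lucene index, and made-up names (EdgeNGramSketch, suggest) and counts.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy prefix suggester built on front edge n-grams; NOT the Lucene-backed class above. */
final class EdgeNGramSketch {

    /** Front edge n-grams of a word, e.g. "steve" -> s, st, ste, stev, steve. */
    static List<String> frontEdgeNGrams(String word, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int len = minGram; len <= Math.min(maxGram, word.length()); len++) {
            grams.add(word.substring(0, len));
        }
        return grams;
    }

    /** Indexes every word under its n-grams, then returns the matches, most frequent first. */
    static List<String> suggest(Map<String, Integer> wordCounts, String prefix, int limit) {
        Map<String, List<String>> gramIndex = new HashMap<>();
        for (String word : wordCounts.keySet()) {
            for (String gram : frontEdgeNGrams(word, 1, 20)) {
                gramIndex.computeIfAbsent(gram, g -> new ArrayList<>()).add(word);
            }
        }
        List<String> matches = new ArrayList<>(gramIndex.getOrDefault(prefix, List.of()));
        // Higher count first; ties broken alphabetically for a stable order.
        matches.sort((a, b) -> {
            int byCount = Integer.compare(wordCounts.get(b), wordCounts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(matches.subList(0, Math.min(limit, matches.size())));
    }
}
```

Because each word is stored under all of its prefixes, looking up what the user has typed so far is an exact key match, and the count decides the order, much like the count field sort in the class above.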

Java – Difference between solr and lucene

@darkheir: Lucene and Solr are 2 different Apache projects that are made to work together; I don't understand what the aim of each project is.

Solr uses Lucene under the hood. Lucene has no clue about the Solr API.
Lucene is a powerful search engine framework that lets us add search capability to our application. It exposes an easy-to-use API while hiding all the complex search-related operations. Any application can use this library, not just Solr.
Solr is built around Lucene. It is not just an HTTP wrapper around Lucene; it is also known to add capabilities of its own on top of Lucene. Solr is ready to use out of the box. It is a web application that offers related infrastructure and a lot more features in addition to what Lucene offers.

@darkheir: Lucene is used to create a search index and Solr uses this index to perform searches. Am I right, or is this a totally different approach?

Lucene doesn't just create the index for Solr to consume. Lucene handles all the search-related operations. Any application can use the Lucene framework.

Examples are Solr, Elasticsearch, LinkedIn (yes, under the hood), etc.

Check out this article: Lucene vs Solr

UPDATE (6/18/14)

When to use Lucene?

You are a search engineer AND
You are a programmer AND
You want full control over almost all the internals of Lucene AND
Your requirements demand all sorts of geeky customization of Lucene AND
You are willing to take care of infrastructure elements of your search like scaling, distribution, etc.

When to use Solr?

At least one of the above didn't make sense. OR
You want something that is ready to use out-of-the-box (even without knowledge of Java) OR
Your infrastructure requirements outweigh search customization requirements.

NOTE: I don't mean that Solr is hard to customize. Solr is very flexible and provides a lot of pluggable API points, allowing you to plug in your own code.

There are people who fall into the 'have to use Lucene' camp but still prefer Solr to plain Lucene because it's easy to use. However, they never hold back from customizing Solr to the maximum extent.

BTW, I see that there are more resources on Solr (4.x) than Lucene (4.x).
