I am working on a store search API using Lucene.
I need to show store search results grouped by each City,State combination, with the number of matching stores in brackets. For example:
Los Angeles,CA (450)
Atlanta,GA (212)
Boston,MA (78)
.
.
.
As of now, my search results return around 7000 Lucene documents, on average, if the user says "Show me all the stores".
In this use case, I end up showing around 800 unique City,State records as shown above.
I am overriding the HitCollector class's Collect method and retrieving term vectors as follows:
var vectors = _reader.GetTermFreqVectors(doc);
Then I iterate through this collection and calculate the frequency for each unique City,State combination.
But this is turning out to be very slow. Is there a better way of grouping search results and calculating frequencies in Lucene?
A code snippet would be very helpful.
Also, please suggest any other techniques or tips I could use to optimize my Lucene search code.
Thanks for reading!
Best Answer
I don't believe you can do this out of the box in Lucene currently. Searching for this functionality turns up this open issue:
Jira Lucene Feature Request
The functionality is available out of the box in Solr, however, which provides a faceting feature. A query such as the following:
http://localhost:8983/solr/select?q=ipod&rows=0&facet=true&facet.limit=-1&facet.field=cat&facet.field=inStock
would return no documents (rows=0), only a count of matching documents for every value of the cat and inStock fields.
More information on faceting can be found on the Solr website:
http://wiki.apache.org/solr/SimpleFacetParameters
EDIT: If you definitely don't want to go down the Solr approach to faceting, you may be able to leverage the functionality described in this post for Lucene:
http://sujitpal.blogspot.com/2007/01/faceted-searching-with-lucene.html
which provides an implementation of the faceting feature on top of Lucene 2.0 via a patch.
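Whichever route you take, the counting step itself can probably be made much cheaper than reading term vectors for every hit. Here is a minimal sketch of the idea, with no Lucene calls in it: it assumes you have already built, once per index reader, an array that maps each document id to the index of its City,State value (for instance from Lucene's field cache; that is an assumption about your setup, not something I have tested against your index). CityStateCounter, ordinalByDoc and labels are hypothetical names of mine, not part of any Lucene API.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not a Lucene class): tallies hits per "City,State" label.
public sealed class CityStateCounter
{
    private readonly int[] _ordinalByDoc; // doc id -> index into _labels (assumed precomputed)
    private readonly string[] _labels;    // distinct "City,State" values
    private readonly int[] _counts;       // one counter per distinct label

    public CityStateCounter(int[] ordinalByDoc, string[] labels)
    {
        _ordinalByDoc = ordinalByDoc;
        _labels = labels;
        _counts = new int[labels.Length];
    }

    // Call once per matching document, e.g. from your collector's Collect method.
    // This is a single array lookup and increment, with no per-hit index access.
    public void Collect(int doc)
    {
        _counts[_ordinalByDoc[doc]]++;
    }

    // Returns the labels that had at least one hit, most frequent first.
    public List<KeyValuePair<string, int>> Results()
    {
        var results = new List<KeyValuePair<string, int>>();
        for (int i = 0; i < _counts.Length; i++)
        {
            if (_counts[i] > 0)
            {
                results.Add(new KeyValuePair<string, int>(_labels[i], _counts[i]));
            }
        }
        results.Sort((a, b) => b.Value.CompareTo(a.Value));
        return results;
    }
}

public static class Demo
{
    public static void Main()
    {
        // Made-up data: five documents, three distinct labels.
        var labels = new[] { "Los Angeles,CA", "Atlanta,GA", "Boston,MA" };
        var ordinalByDoc = new[] { 0, 1, 0, 2, 0 };
        var counter = new CityStateCounter(ordinalByDoc, labels);

        // Pretend the query matched documents 0, 1, 2 and 4.
        foreach (int doc in new[] { 0, 1, 2, 4 })
        {
            counter.Collect(doc);
        }

        foreach (var entry in counter.Results())
        {
            Console.WriteLine(entry.Key + " (" + entry.Value + ")");
        }
        // Prints:
        // Los Angeles,CA (3)
        // Atlanta,GA (1)
    }
}
```

The point of the design is that all the expensive work (finding out which City,State each document belongs to) happens once up front, so 7000 hits cost 7000 array increments rather than 7000 term vector reads.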