Grouping Lucene search results and calculating frequency by category

lucene lucene.net performance

I am working on a store search API using Lucene.

I need to show store search results for each City, State combination with its frequency in brackets, for example:

Los Angeles, CA (450)
Atlanta, GA (212)
Boston, MA (78)
.
.
.

As of now, my search results return around 7000 Lucene documents, on average, if the user says "Show me all the stores".
In this use case, I end up showing around 800 unique City,State records as shown above.

I am overriding the HitCollector class's Collect method and retrieving vectors as follows:

var vectors = _reader.GetTermFreqVectors(doc);

Then I iterate through this collection and calculate the frequency for each unique City,State combination.

But this is turning out to be very slow. Is there a better way of grouping search results and calculating frequency in Lucene?
A code snippet would be very helpful.

Also, please suggest any other techniques or tips I could use to optimize my Lucene search code.

Thanks for reading!

Best Answer

I don't believe you can do this out of the box (OOTB) in Lucene currently. Searching for this functionality turns up this open issue:

Jira Lucene Feature Request

The functionality is available OOTB in Solr, however, which provides a faceting feature. A query such as the following:

http://localhost:8983/solr/select?q=ipod&rows=0&facet=true&facet.limit=-1&facet.field=cat&facet.field=inStock

would return the following result:

<response>
<responseHeader><status>0</status><QTime>2</QTime></responseHeader>
<result numFound="4" start="0"/>
<lst name="facet_counts">
 <lst name="facet_queries"/>
 <lst name="facet_fields">
  <lst name="cat">
        <int name="search">0</int>
        <int name="memory">0</int>
        <int name="graphics">0</int>
        <int name="card">0</int>
        <int name="music">1</int>
        <int name="software">0</int>
        <int name="electronics">3</int>
        <int name="copier">0</int>
        <int name="multifunction">0</int>
        <int name="camera">0</int>
        <int name="connector">2</int>
        <int name="hard">0</int>
        <int name="scanner">0</int>
        <int name="monitor">0</int>
        <int name="drive">0</int>
        <int name="printer">0</int>
  </lst>
  <lst name="inStock">
        <int name="false">3</int>
        <int name="true">1</int>
  </lst>
 </lst>
</lst>
</response>
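
To turn a response like the one above into "value (count)" lines, something along these lines should work. This is only a sketch using Python's standard library; the function names `parse_facet_fields` and `format_facets` are my own, and `SAMPLE` is a trimmed copy of the response above rather than live Solr output.

```python
import xml.etree.ElementTree as ET


def parse_facet_fields(xml_text):
    """Return {field_name: {value: count}} from a Solr XML response."""
    root = ET.fromstring(xml_text)
    facets = {}
    # Each <lst> under facet_fields is one faceted field.
    path = "./lst[@name='facet_counts']/lst[@name='facet_fields']/lst"
    for field in root.findall(path):
        facets[field.get("name")] = {
            entry.get("name"): int(entry.text) for entry in field.findall("int")
        }
    return facets


def format_facets(counts):
    """Format non-zero counts as 'value (count)', most frequent first."""
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ["%s (%d)" % (value, n) for value, n in ranked if n > 0]


# Trimmed copy of the response shown above (not live output).
SAMPLE = """<response>
<responseHeader><status>0</status><QTime>2</QTime></responseHeader>
<result numFound="4" start="0"/>
<lst name="facet_counts">
 <lst name="facet_queries"/>
 <lst name="facet_fields">
  <lst name="cat">
        <int name="search">0</int>
        <int name="music">1</int>
        <int name="electronics">3</int>
        <int name="connector">2</int>
  </lst>
  <lst name="inStock">
        <int name="false">3</int>
        <int name="true">1</int>
  </lst>
 </lst>
</lst>
</response>"""

facets = parse_facet_fields(SAMPLE)
for line in format_facets(facets["cat"]):
    print(line)
```

Zero-count values are dropped at formatting time here, so the parsed dictionary still contains everything Solr returned.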

More information on faceting can be found on the Solr website:

http://wiki.apache.org/solr/SimpleFacetParameters
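
For the store search in the question, the request could be built roughly as follows. This is a sketch, not a tested setup: the field name `city_state` is hypothetical (it assumes each store is indexed with one untokenized field holding the combined city and state), the endpoint is the one from the example query earlier, and `facet_url` is a helper name I made up. `facet.mincount` is an optional extra parameter that hides zero-count values.

```python
from urllib.parse import urlencode

# Same endpoint as the example query earlier in this answer.
SOLR_SELECT = "http://localhost:8983/solr/select"


def facet_url(query, facet_fields, mincount=None, base=SOLR_SELECT):
    """Build a facet-only Solr request (rows=0 means no documents returned)."""
    params = [
        ("q", query),
        ("rows", 0),
        ("facet", "true"),
        ("facet.limit", -1),  # -1 asks for all facet values
    ]
    if mincount is not None:
        params.append(("facet.mincount", mincount))
    # One facet.field parameter per field to count.
    params += [("facet.field", name) for name in facet_fields]
    return base + "?" + urlencode(params)


# "Show me all the stores": match everything, count by the hypothetical field.
print(facet_url("*:*", ["city_state"], mincount=1))
```

Calling `facet_url("ipod", ["cat", "inStock"])` reproduces the example URL shown earlier.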

EDIT: If you definitely don't want to take the Solr approach to faceting, you may be able to use the patch for Lucene described here:

http://sujitpal.blogspot.com/2007/01/faceted-searching-with-lucene.html

which provides an implementation of the faceting feature on top of Lucene 2.0 via a patch.
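
If you end up counting in-process, the general idea is to avoid fetching term vectors per hit: load the City, State value for every document once, then just increment a counter inside the collector. The sketch below shows only that idea in plain Python with made-up data. It is not the code from the linked patch, and `count_by_group` and `group_by_doc` are hypothetical names. In Lucene itself the per-document array would have to come from something like the FieldCache, and the exact call depends on your version, so treat that part as an assumption to verify.

```python
from collections import Counter


def count_by_group(matching_doc_ids, group_by_doc):
    """Count matching documents per group value.

    group_by_doc[doc_id] holds the group value for each document and is
    loaded once up front, so each hit costs one array lookup.
    """
    counts = Counter()
    for doc_id in matching_doc_ids:
        value = group_by_doc[doc_id]
        if value is not None:  # documents without a value are skipped
            counts[value] += 1
    return counts


# Hypothetical data: index position = document id.
group_by_doc = ["Los Angeles, CA", "Atlanta, GA", "Los Angeles, CA", None, "Boston, MA"]
hits = [0, 2, 3, 4]  # document ids the search matched

counts = count_by_group(hits, group_by_doc)
for value, n in counts.most_common():
    print("%s (%d)" % (value, n))
```

With around 7000 hits and 800 distinct values, this is a few thousand lookups and increments per search, which should be cheap compared with reading term vectors for every hit.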