.net – Need Lucene query optimization advice

lucene, lucene.net

I am working on a web-based job search application using Lucene. Users of my site can search for jobs within a 100-mile radius of, say, "Boston, MA" or any other location.
I also need to show the search results sorted by relevance (i.e. the score returned by Lucene) in descending order.

I'm using a third-party API to fetch all the cities within a given radius of a city. This API returns around 864 cities within a 100-mile radius of "Boston, MA".

I build the city/state Lucene query with the following logic, which is part of my "BuildNearestCitiesQuery" method.
Here nearestCities is a Hashtable returned by the above API. It contains 864 cities, with CityName as the key and StateCode as the value.
finalQuery is a Lucene BooleanQuery object that contains the other search criteria entered by the user, such as skills, keywords, etc.

foreach (string city in nearestCities.Keys)
{
    BooleanQuery tempFinalQuery = finalQuery;

    cityStateQuery = new BooleanQuery();
    queryCity = queryParserCity.Parse(city);
    queryState = queryParserState.Parse(((string[])nearestCities[city])[1]);

    cityStateQuery.Add(queryCity, BooleanClause.Occur.MUST); // MUST is like an AND
    cityStateQuery.Add(queryState, BooleanClause.Occur.MUST);

    // Add every city/state pair, not just the last one built by the loop
    nearestCityQuery.Add(cityStateQuery, BooleanClause.Occur.SHOULD); // SHOULD is like an OR
}

finalQuery.Add(nearestCityQuery, BooleanClause.Occur.MUST);

I then pass the finalQuery object to Lucene's Search method to get all the jobs within the 100-mile radius:

searcher.Search(finalQuery, collector);

I found that this BuildNearestCitiesQuery method takes a whopping 29 seconds on average to execute, which is obviously unacceptable for any website. I also found that the statements involving "Parse" take considerably longer to execute than the other statements.

The jobs for a given location are dynamic, in the sense that a city could have 2 jobs (meeting particular search criteria) today but zero jobs for the same search criteria 3 days later. So I cannot use any caching here.

Is there any way I can optimize this logic, or for that matter my whole approach/algorithm for finding all jobs within 100 miles using Lucene?

FYI, here is what my indexing in Lucene looks like:

doc.Add(new Field("jobId", job.JobID.ToString().Trim(), Field.Store.YES, Field.Index.UN_TOKENIZED));
doc.Add(new Field("title", job.JobTitle.Trim(), Field.Store.YES, Field.Index.TOKENIZED));
doc.Add(new Field("description", job.JobDescription.Trim(), Field.Store.NO, Field.Index.TOKENIZED));
doc.Add(new Field("city", job.City.Trim(), Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.YES));
doc.Add(new Field("state", job.StateCode.Trim(), Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.YES));
doc.Add(new Field("citystate", job.City.Trim() + ", " + job.StateCode.Trim(), Field.Store.YES, Field.Index.UN_TOKENIZED, Field.TermVector.YES));
doc.Add(new Field("datePosted", jobPostedDateTime, Field.Store.YES, Field.Index.UN_TOKENIZED));
doc.Add(new Field("company", job.HiringCoName.Trim(), Field.Store.YES, Field.Index.TOKENIZED));
doc.Add(new Field("jobType", job.JobTypeID.ToString(), Field.Store.NO, Field.Index.UN_TOKENIZED, Field.TermVector.YES));
doc.Add(new Field("sector", job.SectorID.ToString(), Field.Store.NO, Field.Index.UN_TOKENIZED, Field.TermVector.YES));
doc.Add(new Field("showAllJobs", "yy", Field.Store.NO, Field.Index.UN_TOKENIZED));

Thanks a ton for reading! I would really appreciate your help on this.

Janis

Best Answer

I'm not quite sure I completely understand your code, but when it comes to geospatial search, a filter approach might be more appropriate. Maybe this link can give you some ideas: http://sujitpal.blogspot.com/2008/02/spatial-search-with-lucene.html
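One common way to turn a radius search into a filter is a bounding box: compute the latitude/longitude range that encloses the circle, then restrict results to documents inside that range. The sketch below covers only the arithmetic. It assumes each job document would carry indexed latitude and longitude fields, which the index in the question does not have, and the names GeoBox, Box and Around are made up for illustration. It is an approximation that ignores the poles and the 180° meridian.

```csharp
using System;

public static class GeoBox
{
    // Mean Earth radius in miles (approximate).
    private const double EarthRadiusMiles = 3958.8;

    public struct Box
    {
        public double MinLat, MaxLat, MinLon, MaxLon;
    }

    // Returns the lat/lon rectangle (in degrees) that encloses a circle of
    // radiusMiles around the given center point.
    public static Box Around(double latDeg, double lonDeg, double radiusMiles)
    {
        // Angular radius of the circle, converted from radians to degrees.
        double dLat = radiusMiles / EarthRadiusMiles * 180.0 / Math.PI;

        // Meridians converge away from the equator, so the same distance
        // spans more degrees of longitude than of latitude.
        double dLon = dLat / Math.Cos(latDeg * Math.PI / 180.0);

        return new Box
        {
            MinLat = latDeg - dLat,
            MaxLat = latDeg + dLat,
            MinLon = lonDeg - dLon,
            MaxLon = lonDeg + dLon
        };
    }
}
```

For example, GeoBox.Around(42.3601, -71.0589, 100.0) would give the box for a 100-mile radius around Boston. The box includes corners that lie outside the circle, so an exact distance check on the surviving hits may still be wanted.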

Maybe you can use Filters for other parts of your query as well. To be honest, your query looks quite complex.
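Applied to the city list from the question, the filter idea might look like the sketch below. It rests on two assumptions: that the Lucene.NET 2.x release in use ships QueryWrapperFilter, and that the city list is available as a city-to-state-code dictionary (the question uses a Hashtable). NearestCitiesFilter, CityStateKey and Build are hypothetical names. Each city becomes an exact-match TermQuery on the untokenized "citystate" field, so there is no QueryParser call per city, and because the location lives in a filter it does not influence the relevance score.

```csharp
using System.Collections.Generic;
using Lucene.Net.Index;
using Lucene.Net.Search;

public static class NearestCitiesFilter
{
    // Builds "City, ST" in the same shape the question's indexing code
    // writes into the untokenized "citystate" field. The match is exact,
    // so case and spacing must agree with what was indexed.
    public static string CityStateKey(string city, string stateCode)
    {
        return city.Trim() + ", " + stateCode.Trim();
    }

    // Hypothetical helper: ORs one TermQuery per city and wraps the result
    // in a filter. Assumes nearestCities maps city name to state code.
    public static Filter Build(IDictionary<string, string> nearestCities)
    {
        BooleanQuery anyCity = new BooleanQuery();
        foreach (KeyValuePair<string, string> entry in nearestCities)
        {
            Term term = new Term("citystate", CityStateKey(entry.Key, entry.Value));
            anyCity.Add(new TermQuery(term), BooleanClause.Occur.SHOULD);
        }
        return new QueryWrapperFilter(anyCity);
    }
}
```

The filter could then be passed alongside the keyword query, for example searcher.Search(finalQuery, filter, collector), assuming the collector in the question is a HitCollector. BooleanQuery allows at most 1024 clauses by default, so 864 cities fit, but a larger radius could exceed that limit.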

--Hardy