How would one use Lucene.NET to help implement search on a site like Stack Overflow

lucenelucene.netsearchsql-server-2008

I've asked a simlar question on Meta Stack Overflow, but that deals specifically with whether or not Lucene.NET is used on Stack Overflow.

The purpose of the question here is more of a hypotetical, as to what approaches one would make if they were to use Lucene.NET as a basis for in-site search and other factors in a site like Stack Overflow [SO].

As per the entry on the Stack Overflow blog titled "SQL 2008 Full-Text Search Problems" there was a strong indication that Lucene.NET was being considered at some point, but it appears that is definitely not the case, as per the comment by Geoff Dalgas on February 19th 2010:

Lucene.NET is not being used for Stack
Overflow – we are using SQL Server
Full Text indexing. Search is an area
where we continue to make minor
tweaks.

So my question is, how would one utilize Lucene.NET into a site which has the same semantics of Stack Overflow?

Here is some background and what I've done/thought about so far (yes, I've been implementing most of this and search is the last aspect I have to complete):

Technologies:

And of course, the star of the show, Lucene.NET.

The intention is also to move to .NET/C# 4.0 ASAP. While I don't think it's a game-changer, it should be noted.

Before getting into aspects of Lucene.NET, it's important to point out the SQL Server 2008 aspects of it, as well as the models involved.

Models

This system has more than one primary model type in comparison to Stack Overflow. Some examples of these models are:

Questions: These are questions that people can ask. People can reply to questions, just like on Stack Overflow.
Notes: These are one-way projections, so as opposed to a question, you are making a statement about content. People can't post replies to this.
Events: This is data about a real-time event. It has location information, date/time information.

The important thing to note about these models:

They all have a Name/Title (text) property and a Body (HTML) property (the formats are irrelevant, as the content will be parsed appropriately for analysis).
Every instance of a model has a unique URL on the site

Then there are the things that Stack Overflow provides which IMO, are decorators to the models. These decorators can have different cardinalities, either being one-to-one or one-to-many:

Votes: Keyed on the user
Replies: Optional, as an example, see the Notes case above
Favorited: Is the model listed as a favorite of a user?
Comments: (optional)
Tag Associations: Tags are in a separate table, so as not to replicate the tag for each model. There is a link between the model and the tag associations table, and then from the tag associations table to the tags table.

And there are supporting tallies which in themselves are one-to-one decorators to the models that are keyed to them in the same way (usually by a model id type and the model id):

Vote tallies: Total postive, negative votes, Wilson Score interval (this is important, it's going to determine the confidence level based on votes for an entry, for the most part, assume the lower bound of the Wilson interval).

Replies (answers) are models that have most of the decorators that most models have, they just don't have a title or url, and whether or not a model has a reply is optional. If replies are allowed, it is of course a one-to-many relationship.

SQL Server 2008

The tables pretty much follow the layout of the models above, with separate tables for the decorators, as well as some supporting tables and views, stored procedures, etc.

It should be noted that the decision to not use full-text search is based primarily on the fact that it doesn't normalize scores like Lucene.NET. I'm open to suggestions on how to utilize text-based search, but I will have to perform searches across multiple model types, so keep in mind I'm going to need to normalize the score somehow.

Lucene.NET

This is where the big question mark is. Here are my thoughts so far on Stack Overflow functionality as well as how and what I've already done.

Indexing

Questions/Models

I believe each model should have an index of its own containing a unique id to quickly look it up based on a Term instance of that id (indexed, not analyzed).

In this area, I've considered having Lucene.NET analyze each question/model and each reply individually. So if there was one question and five answers, the question and each of the answers would be indexed as one unit separately.

The idea here is that the relevance score that Lucene.NET returns would be easier to compare between models that project in different ways (say, something without replies).

As an example, a question sets the subject, and then the answer elaborates on the subject.

For a note, which doesn't have replies, it handles the matter of presenting the subject and then elaborating on it.

I believe that this will help with making the relevance scores more relevant to each other.

Tags

Initially, I thought that these should be kept in a separate index with multiple fields which have the ids to the documents in the appropriate model index. Or, if that's too large, there is an index with just the tags and another index which maintains the relationship between the tags index and the questions they are applied to. This way, when you click on a tag (or use the URL structure), it's easy to see in a progressive manner that you only have to "buy into" if you succeed:

If the tag exists
Which questions the tags are associated with
The questions themselves

However, in practice, doing a query of all items based on tags (like clicking on a tag in Stack Overflow) is extremely easy with SQL Server 2008. Based on the model above, it simply requires a query such as:

select
     m.Name, m.Body
from
    Models as m
        left outer join TagAssociations as ta on
            ta.ModelTypeId = <fixed model type id> and
            ta.ModelId = m.Id
        left outer join Tags as t on t.Id = ta.TagId
where
    t.Name = <tag>

And since certain properties are shared across all models, it's easy enough to do a UNION between different model types/tables and produce a consistent set of results.

This would be analagous to a TermQuery in Lucene.NET (I'm referencing the Java documentation since it's comprehensive, and Lucene.NET is meant to be a line-by-line translation of Lucene, so all the documentation is the same).

The issue that comes up with using Lucene.NET here is that of sort order. The relevance score for a TermQuery when it comes to tags is irrelevant. It's either 1 or 0 (it either has it or it doesn't).

At this point, the confidence score (Wilson score interval) comes into play for ordering the results.

This score could be stored in Lucene.NET, but in order to sort the results on this field, it would rely on the values being stored in the field cache, which is something I really, really want to avoid. For a large number of documents, the field cache can grow very large (the Wilson score is a double, and you would need one double for every document, that can be one large array).

Given that I can change the SQL statement to order based on the Wilson score interval like this:

select
     m.Name, m.Body
from
    Models as m
        left outer join TagAssociations as ta on
            ta.ModelTypeId = <fixed model type id> and
            ta.ModelId = m.Id
        left outer join Tags as t on t.Id = ta.TagId
        left outer join VoteTallyStatistics as s on
            s.ModelTypeId = ta.ModelTypeId and
            s.ModelId = ta.ModelId
where
    t.Name = <tag>
order by
    --- Use Id to break ties.
    s.WilsonIntervalLowerBound desc, m.Id

It seems like an easy choice to use this to handle the piece of Stack Overflow functionality "get all items tagged with <tag>".

Replies

Originally, I thought this is in a separate index of its own, with a key back into the Questions index.

I think that there should be a combination of each model and each reply (if there is one) so that relevance scores across different models are more "equal" when compared to each other.

This would of course bloat the index. I'm somewhat comfortable with that right now.

Or, is there a way to store say, the models and replies as individual documents in Lucene.NET and then take both and be able to get the relevance score for a query treating both documents as one? If so, then this would be ideal.

There is of course the question of what fields would be stored, indexed, analyzed (all operations can be separate operations, or mix-and-matched)? Just how much would one index?

What about using special stemmers/porters for spelling mistakes (using Metaphone) as well as synonyms (there is terminology in the community I will service which has it's own slang/terminology for certain things which has multiple representations)?

Boost

This is related to indexing of course, but I think it merits it's own section.

Are you boosting fields and/or documents? If so, how do you boost them? Is the boost constant for certain fields? Or is it recalculated for fields where vote/view/favorite/external data is applicable.

For example, in the document, does the title get a boost over the body? If so, what boost factors do you think work well? What about tags?

The thinking here is the same as it is along the lines of Stack Overflow. Terms in the document have relevance, but if a document is tagged with the term, or it is in the title, then it should be boosted.

Shashikant Kore suggests a document structure like this:

Title
Question
Accepted Answer (Or highly voted answer if there is no accepted answer)
All answers combined

And then using boost but not based on the raw vote value. I believe I have that covered with the Wilson Score interval.

The question is, should the boost be applied to the entire document? I'm leaning towards no on this one, because it would mean I'd have to reindex the document each time a user voted on the model.

Search for Items Tagged

I originally thought that when querying for a tag (by specifically clicking on one or using the URL structure for looking up tagged content), that's a simple TermQuery against the tag index for the tag, then in the associations index (if necessary) then back to questions, Lucene.NET handles this really quickly.

However, given the notes above regarding how easy it is to do this in SQL Server, I've opted for that route when it comes to searching tagged items.

General Search

So now, the most outstanding question is when doing a general phrase or term search against content, what and how do you integrate other information (such as votes) in order to determine the results in the proper order? For example, when performing this search on ASP.NET MVC on Stack Overflow, these are the tallies for the top five results (when using the relevance tab):

    q votes answers accepted answer votes asp.net highlights mvc highlights
    ------- ------- --------------------- ------------------ --------------
         21      26                    51                  2              2
         58      23                    70                  2              5
         29      24                    40                  3              4
         37      15                    25                  1              2
         59      23                    47                  2              2

Note that the highlights are only in the title and abstract on the results page and are only minor indicators as to what the true term frequency is in the document, title, tag, reply (however they are applied, which is another good question).

How is all of this brought together?

At this point, I know that Lucene.NET will return a normalized relevance score, and the vote data will give me a Wilson score interval which I can use to determine the confidence score.

How should I look at combining tese two scores to indicate the sort order of the result set based on relevance and confidence?

It is obvious to me that there should be some relationship between the two, but what that relationship should be evades me at this point. I know I have to refine it as time goes on, but I'm really lost on this part.

My initial thoughts are if the relevance score is beween 0 and 1 and the confidence score is between 0 and 1, then I could do something like this:

1 / ((e ^ cs) * (e ^ rs))

This way, one gets a normalized value that approaches 0 the more relevant and confident the result is, and it can be sorted on that.

The main issue with that is that if boosting is performed on the tag and or title field, then the relevance score is outside the bounds of 0 to 1 (the upper end becomes unbounded then, and I don't know how to deal with that).

Also, I believe I will have to adjust the confidence score to account for vote tallies that are completely negative. Since vote tallies that are completely negative result in a Wilson score interval with a lower bound of 0, something with -500 votes has the same confidence score as something with -1 vote, or 0 votes.

Fortunately, the upper bound decreases from 1 to 0 as negative vote tallies go up. I could change the confidence score to be a range from -1 to 1, like so:

confidence score = votetally < 0 ? 
    -(1 - wilson score interval upper bound) :
    wilson score interval lower bound

The problem with this is that plugging in 0 into the equation will rank all of the items with zero votes below those with negative vote tallies.

To that end, I'm thinking if the confidence score is going to be used in a reciprocal equation like above (I'm concerned about overflow obviously), then it needs to be reworked to always be positive. One way of achieving this is:

confidence score = 0.5 + 
    (votetally < 0 ? 
        -(1 - wilson score interval upper bound) :
        wilson score interval lower bound) / 2

My other concerns are how to actually perform the calculation given Lucene.NET and SQL Server. I'm hesitant to put the confidence score in the Lucene index because it requires use of the field cache, which can have a huge impact on memory consumption (as mentioned before).

An idea I had was to get the relevance score from Lucene.NET and then using a table-valued parameter to stream the score to SQL Server (along with the ids of the items to select), at which point I'd perform the calculation with the confidence score and then return the data properly ordred.

As stated before, there are a lot of other questions I have about this, and the answers have started to frame things, and will continue to expand upon things as the question and answers evovled.

Best Answer

The answers you are looking for really can not be found using lucene alone. You need ranking and grouping algorithms to filter and understand the data and how it relates. Lucene can help you get normalized data, but you need the right algorithm after that.

I would recommend you check out one or all of the following books, they will help you with the math and get you pointed in the right direction:

Algorithms of the Intelligent Web

Collective Intelligence in Action

Programming Collective Intelligence

Related Solutions

Vb.net – How to incorporate multiple fields in QueryParser

There are 3 ways to do this.

The first way is to construct a query manually, this is what QueryParser is doing internally. This is the most powerful way to do it, and means that you don't have to parse the user input if you want to prevent access to some of the more exotic features of QueryParser:

IndexReader reader = IndexReader.Open("<lucene dir>");
Searcher searcher = new IndexSearcher(reader);

BooleanQuery booleanQuery = new BooleanQuery();
Query query1 = new TermQuery(new Term("filename", "<text>"));
Query query2 = new TermQuery(new Term("filetext", "<text>"));
booleanQuery.add(query1, BooleanClause.Occur.SHOULD);
booleanQuery.add(query2, BooleanClause.Occur.SHOULD);
// Use BooleanClause.Occur.MUST instead of BooleanClause.Occur.SHOULD
// for AND queries
Hits hits = searcher.Search(booleanQuery);

The second way is to use MultiFieldQueryParser, this behaves like QueryParser, allowing access to all the power that it has, except that it will search over multiple fields.

IndexReader reader = IndexReader.Open("<lucene dir>");
Searcher searcher = new IndexSearcher(reader);

Analyzer analyzer = new StandardAnalyzer();
MultiFieldQueryParser queryParser = new MultiFieldQueryParser(
                                        new string[] {"filename", "filetext"},
                                        analyzer);

Hits hits = searcher.Search(queryParser.parse("<text>"));

The final way is to use the special syntax of QueryParser see here.

IndexReader reader = IndexReader.Open("<lucene dir>");
Searcher searcher = new IndexSearcher(reader);    

Analyzer analyzer = new StandardAnalyzer();
QueryParser queryParser = new QueryParser("<default field>", analyzer);
// <default field> is the field that QueryParser will search if you don't 
// prefix it with a field.
string special = "filename:" + text + " OR filetext:" + text;

Hits hits = searcher.Search(queryParser.parse(special));

Your other option is to create new field when you index your content called filenameandtext, into which you can place the contents of both filename and filetext, then you only have to search one field.

.net – Concurrency in Lucene.NET.

First of all we have to define a "write" operation. A write operation will object a lock once you start a write operation and will continue until you close the object that is performing the work. Such as creating an IndexWriter and indexing a document will cause the write to object a lock and it will keep this lock until you close the IndexWriter.

Now we can talk about the lock a little bit. This lock that is object is a file based lock. Like mythz mentioned earlier, there is a file called 'write.lock' that is created. Once a write lock is objected it is exclusive! This lock causes all index modifying operations (IndexWriter, and some methods from IndexReader) to wait until the lock is removed.

Overall you and have multiple reads on an index. You can even read and write at the same time, no problem. But there is a problem when having multiple writers. If one thread is waiting for the lock too long it will time out.

1) Possible Solution #1 Direct Operations

If you are sure that your indexing operations are short and quick, you may be able to just use the same index at the same time. Otherwise you will have to think about how you want to organize the indexing operations of the applications.

2) Possible Solution #2 Web Service

Since you are working with a web solution it might be possible to create a web service. When implementing this web service I would dedicate a worker thread for indexing. I would create a work queue to contain the work and if the queue contained multiple jobs to do, it should grab them all and do them into batch. This will solve all of the problems.

3) create another index, then merge

If the console application does heavy work on the index you may be able to look into having the console application you could create a seperate index in the console application and then merge the indexes at some safe scheduled time using IndexWriter.AddIndexes.

from here you can do this in two ways, you can merge with the direct index. Or you can merge to create a 3rd index, and then when this index is ready replace the original index. You have to be careful in what your doing here as well to make sure that your not going to lock something in heavy use and cause a timeout for other write operations.

4) Index & Search multiple indexes

Personally I think people need to separate their indexes out. This helps separates responsibilities of the programs and minimizes down time and maintained of having a single point for all indexes. For example, if your console application is responsible for only adding in certain fields or your are kind of extending an index you could look separate the indexes out, but maintain identity by using an ID field in each document. Now with this you can take advantage of the built in support for searching multiple indexes using the MultiSercher class. Or if your wanting there is also a nice ParallelMultiSearch class that can search both indexes at once.

5) Look into SOLR

Something else that can help your issue of maintaining a single place for you index, you could change your program to work with a SOLR server. http://lucene.apache.org/solr/ there is also a nice SOLRNET http://code.google.com/p/solrnet/ library that can be helpful in this situation. Although I'm not experienced with solr but i am under the impression that it will help you manage situation such as this. Also it has other benefits such as hit highlighting and searching for related items by finding items "MoreLikeThis", or provide spell checking.

I'm sure there are other methods but these are all the ones that I can think of. Overall it your solution depends upon how many people are writing and how up to date the search index you need it to be. Overall if you can defer some operations for a latter time and do some batch operations in any situation will give you the most performance. My suggestion is to understand what your able to work with and go from there. good luck

Best Answer

Related Solutions

Vb.net – How to incorporate multiple fields in QueryParser

.net – Concurrency in Lucene.NET.

Related Topic