.net – Boosting Multi-Value Fields

lucene, lucene.net

I have a set of documents containing scored items that I'd like to index. Our data structure looks like:

Document
  ID
  Text
  List<RelatedScore>

RelatedScore
  ID
  Score

My first thought was to add each RelatedScore as a multi-value field using the Boost property of the Field to modify the value of the particular score when searching.

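// Add each RelatedScore as one value of the multi-value "RelatedScore" field,
// using the index-time field boost to carry its score.
foreach (var relatedScore in document.RelatedScores) {
  var field = new Field("RelatedScore", relatedScore.ID,
                        Field.Store.YES, Field.Index.UN_TOKENIZED);
  field.SetBoost(relatedScore.Score);
  luceneDoc.Add(field);
}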

However, it appears that the "Norm" that is calculated applies to the entire multi-value field, so all the "RelatedScore" values for a document end up with the same score.

Is there a mechanism in Lucene to allow for this functionality? I would rather not create another index just to account for this; it feels like there should be a way to do it with a single index. If there isn't a means to accomplish this, a few ideas that we have to compensate are:

  1. Insert the multi-value field items in order of descending value. Then somehow add a positional-aware analysis to assign higher boost/score to the first items in the field.
  2. Add a high value score multiple times to the field. So, a RelatedScore with Score==1 might be added three times, while a RelatedScore with Score==.3 would only be added once.
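A minimal sketch of the second idea, assuming scores fall in the range 0 to 1 and that a linear mapping to at most three repetitions is acceptable. The class name, the helper names, and the maxRepeats parameter are hypothetical and only illustrate the approach; nothing here calls into Lucene itself.

```csharp
using System;
using System.Collections.Generic;

public static class ScoreRepetition
{
    // Maps a score in [0, 1] to a repeat count in [1, maxRepeats].
    // Assumption: a linear mapping, rounded up, so 1.0 -> 3 and 0.3 -> 1.
    public static int RepeatCount(double score, int maxRepeats = 3)
    {
        double clamped = Math.Max(0.0, Math.Min(1.0, score));
        return Math.Max(1, (int)Math.Ceiling(clamped * maxRepeats));
    }

    // Expands (ID, score) pairs into the list of values to add to the
    // multi-value field, repeating each ID according to its score.
    public static List<string> ExpandValues(
        IEnumerable<KeyValuePair<string, double>> scores, int maxRepeats = 3)
    {
        var values = new List<string>();
        foreach (var pair in scores)
        {
            int count = RepeatCount(pair.Value, maxRepeats);
            for (int i = 0; i < count; i++)
            {
                values.Add(pair.Key);
            }
        }
        return values;
    }
}
```

Each value returned by ExpandValues would then be added as its own "RelatedScore" field, in the same way as the loop above but without calling SetBoost.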

Both of these will result in a loss of search fidelity on these fields, yes, but they may be good enough. Any thoughts on this?

Best Answer

This appears to be a use case for Payloads. I'm not sure if this is available in Lucene.NET, as I've only used the Java version.

Another hacky way to do this, if the absolute values of the scores aren't that important, is to discretize them (place them in buckets based on value) and create a field for each bucket. So if you have scores that range from 1 to 100, create, say, 10 buckets called RelatedScore0_10, RelatedScore10_20, etc., and for any document that has a RelatedScore in a given bucket, add a "true" value to that bucket's field. Then, for every search that gets executed, tack on an OR query like:

(RelatedScore0_10:true^1 RelatedScore10_20:true^2 ...)

The nice thing about this is that you can tweak the boost values for each one of your buckets on the fly. Otherwise you'd need to reindex to change the field norm (boost) values for each field.
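Here is a rough sketch of the bucketing in C#, assuming scores from 0 to 100 and a bucket width of 10. The ScoreBuckets class and its method names are made up for illustration, and the code only builds field names and the query text; it does not touch the Lucene API.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

public static class ScoreBuckets
{
    // Returns the bucket field name for a score, e.g. 37 -> "RelatedScore30_40".
    // Assumption: scores lie in [0, max]; a score equal to max goes in the last bucket.
    public static string BucketField(double score, int width = 10, int max = 100)
    {
        int lower = (int)Math.Floor(score / width) * width;
        lower = Math.Max(0, Math.Min(max - width, lower));
        return "RelatedScore" + lower + "_" + (lower + width);
    }

    // Builds the optional clause to tack onto a search, one boosted term per bucket.
    public static string BoostClause(IEnumerable<KeyValuePair<string, float>> bucketBoosts)
    {
        var parts = bucketBoosts.Select(b =>
            b.Key + ":true^" + b.Value.ToString(CultureInfo.InvariantCulture));
        return "(" + string.Join(" ", parts) + ")";
    }
}
```

At index time you would add a "true" value to the field named by BucketField for each RelatedScore. At search time the string from BoostClause gets appended to the user's query, and changing the boosts passed in takes effect on the next search without reindexing.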
