I use the following pattern for text that should be indexed in all the languages:
{
  "id": "sdsd",
  "title": {
    "languages": {"en": 0, "fr": 1},
    "texts": ["this is the title in English", "Celui-ci, c'est le titre en francais"]
  }
}
object = collection.findOne({id: "xxxxx"});
// now if we want the text in English
print(object.title.texts[object.title.languages["en"]]);
// the object.title.languages index map lets the client look up a given
// translation quickly, and it also lets us index the translated texts in MongoDB:
collection.ensureIndex({"title.texts": 1});
We can also wrap the code that obtains the text in a specific language in a class.
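As a rough sketch of such a wrapper, here is a minimal Python class working on an already-fetched document; the `LocalizedTitle` name and the fallback-to-default-language behaviour are my own additions, not part of the pattern above:

```python
class LocalizedTitle:
    """Resolves a translated text from the {languages, texts} pattern."""

    def __init__(self, field):
        self.languages = field["languages"]  # language code -> index into texts
        self.texts = field["texts"]

    def get(self, lang, default_lang="en"):
        # Fall back to the default language when the requested one is missing
        idx = self.languages.get(lang, self.languages.get(default_lang))
        return self.texts[idx] if idx is not None else None


doc = {
    "id": "sdsd",
    "title": {
        "languages": {"en": 0, "fr": 1},
        "texts": ["this is the title in English",
                  "Celui-ci, c'est le titre en francais"],
    },
}
title = LocalizedTitle(doc["title"])
print(title.get("fr"))  # Celui-ci, c'est le titre en francais
```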
(1) What all features should I extract?
First, realize that you're not classifying documents. You're classifying (document, query) pairs, so you should extract features that express how well they match.
The standard approach in learning to rank is to run the query against various search engine setups (e.g. tf-idf, BM-25, etc.) and then train a model on the similarity scores, but for a small, domain-specific search engine, you could have features such as:
- For each term, a boolean that indicates whether the term occurs in both the query and the document. Or maybe not a boolean, but the tf-idf weights of those query terms that actually occur in the document.
- Various overlap metrics such as Jaccard or Tanimoto.
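Both feature types above can be sketched in a few lines of Python, assuming naive whitespace tokenization (a real search engine would use its analyzer's tokens instead):

```python
def jaccard(query, document):
    """Jaccard similarity between the term sets of a query and a document."""
    q = set(query.lower().split())
    d = set(document.lower().split())
    if not q and not d:
        return 0.0
    return len(q & d) / len(q | d)


def term_overlap_features(query, document):
    """One boolean per query term: does the term occur in the document?"""
    doc_terms = set(document.lower().split())
    return [term in doc_terms for term in query.lower().split()]
```

For the tf-idf variant, you would replace the booleans with the tf-idf weights of the matching query terms.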
(2) Is there a better way to integrate the machine learning component into the search engine? My final goal is to "learn" the ranking function based on both business logic as well as user feedback.
This is a very broad question, and the answer depends on how much effort you want to put in. The first improvement that comes to mind is that you should not use the binary relevance judgements from the classifier, but its real-valued decision function, so that you can actually do ranking instead of just filtering. For an SVM, the decision function is the signed distance to the hyperplane. Good machine learning packages have an interface for getting the value of that.
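As a toy illustration, scikit-learn exposes exactly that signed distance through `decision_function`; the features and labels below are made up for the example:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy (document, query) pair features with binary relevance labels
X = np.array([[0.9, 0.8], [0.8, 0.7], [0.2, 0.1], [0.1, 0.3]])
y = np.array([1, 1, 0, 0])

clf = LinearSVC(C=1.0).fit(X, y)

# Rank candidate documents by signed distance to the hyperplane,
# instead of filtering on the hard 0/1 predictions
candidates = np.array([[0.7, 0.6], [0.3, 0.2], [0.5, 0.9]])
scores = clf.decision_function(candidates)
ranking = np.argsort(-scores)  # indices of candidates, best first
```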
Beyond that, look into pairwise and listwise learning to rank; what you're suggesting is the so-called pointwise approach. IIRC, pairwise works a lot better in practice. The reason is that with pairwise ranking, you need far fewer clicks: instead of having users label documents as relevant/irrelevant, you only give them a "relevant" button. Then you learn a binary classifier on triples (document1, document2, query) that tells whether document1 is more relevant to the query than document2, or vice versa. When a user labels, say, document 4 in the ranking as relevant, that gives you six samples to learn from:
- document4 > document3
- document4 > document2
- document4 > document1
- document1 < document4
- document2 < document4
- document3 < document4
so you get the negatives for free.
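The click-to-pairs construction above can be sketched in Python; `pairwise_samples` is a hypothetical helper name, and the `(doc_a, doc_b, query, label)` tuple layout is just one possible encoding of the triples:

```python
def pairwise_samples(ranking, clicked_index, query):
    """Turn one 'relevant' click into pairwise training samples:
    the clicked document beats every document ranked above it,
    emitted in both orders, so the negatives come for free."""
    samples = []
    clicked = ranking[clicked_index]
    for other in ranking[:clicked_index]:
        samples.append((clicked, other, query, 1))  # clicked > other
        samples.append((other, clicked, query, 0))  # other < clicked
    return samples


# A click on document 4 (index 3) in a four-document ranking
samples = pairwise_samples(["doc1", "doc2", "doc3", "doc4"], 3, "some query")
print(len(samples))  # 6
```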
(These are all just suggestions, I haven't tried any of this. I just happen to have worked in a research group where people investigated learning to rank. I did do a presentation of someone else's paper for a reading group once, maybe the slides can be of help.)
Best Answer
I would use an existing platform designed for search. You mentioned Lucene and there are others around based on the language you are using.
If you want to create a standalone search server that is language agnostic, look at Solr. It is based on Lucene, so there's lots of support.
I personally like Sphinx, but it may not work in your situation; it all depends on the language you are using.