I use the following pattern for text that should be indexed in all the languages:
{
  "id": "sdsd",
  "title": {
    "languages": {"en": 0, "fr": 1},
    "texts": ["this is the title in English", "Celui-ci, c'est le titre en francais"]
  }
}
object = collection.findOne({id: "xxxxx"});
// now if we want the text in English
print(object.title.texts[object.title.languages["en"]]);
// the object.title.languages index map lets the client look up a given
// translation quickly, and it also lets us index the translated texts in MongoDB:
collection.ensureIndex({"title.texts": 1});
We can also wrap the code that obtains the text in a specific language in a class.
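As a rough sketch of such a wrapper, here is a minimal Python class working on an already-fetched document; the `LocalizedTitle` name and the fallback-to-default-language behaviour are my own additions, not part of the pattern above:

```python
class LocalizedTitle:
    """Resolves a translated text from the {languages, texts} pattern."""

    def __init__(self, field):
        self.languages = field["languages"]  # language code -> index into texts
        self.texts = field["texts"]

    def get(self, lang, default_lang="en"):
        # Fall back to the default language when the requested one is missing
        idx = self.languages.get(lang, self.languages.get(default_lang))
        return self.texts[idx] if idx is not None else None


doc = {
    "id": "sdsd",
    "title": {
        "languages": {"en": 0, "fr": 1},
        "texts": ["this is the title in English",
                  "Celui-ci, c'est le titre en francais"],
    },
}
title = LocalizedTitle(doc["title"])
print(title.get("fr"))  # Celui-ci, c'est le titre en francais
```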
(1) What all features should I extract?
First, realize that you're not classifying documents. You're classifying (document, query) pairs, so you should extract features that express how well they match.
The standard approach in learning to rank is to run the query against various search engine setups (e.g. tf-idf, BM-25, etc.) and then train a model on the similarity scores, but for a small, domain-specific search engine, you could have features such as:
- For each term, a boolean that indicates whether the term occurs in both the query and the document. Or maybe not a boolean, but the tf-idf weights of those query terms that actually occur in the document.
- Various overlap metrics such as Jaccard or Tanimoto.
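Both feature types above can be sketched in a few lines of Python, assuming naive whitespace tokenization (a real search engine would use its analyzer's tokens instead):

```python
def jaccard(query, document):
    """Jaccard similarity between the term sets of a query and a document."""
    q = set(query.lower().split())
    d = set(document.lower().split())
    if not q and not d:
        return 0.0
    return len(q & d) / len(q | d)


def term_overlap_features(query, document):
    """One boolean per query term: does the term occur in the document?"""
    doc_terms = set(document.lower().split())
    return [term in doc_terms for term in query.lower().split()]
```

For the tf-idf variant, you would replace the booleans with the tf-idf weights of the matching query terms.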
(2) Is there a better way to integrate the machine learning component into the search engine? My final goal is to "learn" the ranking function based on both business logic as well as user feedback.
This is a very broad question, and the answer depends on how much effort you want to put in. The first improvement that comes to mind is that you should not use the binary relevance judgements from the classifier, but its real-valued decision function, so that you can actually do ranking instead of just filtering. For an SVM, the decision function is the signed distance to the hyperplane. Good machine learning packages have an interface for getting the value of that.
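As a toy illustration, scikit-learn exposes exactly that signed distance through `decision_function`; the features and labels below are made up for the example:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy (document, query) pair features with binary relevance labels
X = np.array([[0.9, 0.8], [0.8, 0.7], [0.2, 0.1], [0.1, 0.3]])
y = np.array([1, 1, 0, 0])

clf = LinearSVC(C=1.0).fit(X, y)

# Rank candidate documents by signed distance to the hyperplane,
# instead of filtering on the hard 0/1 predictions
candidates = np.array([[0.7, 0.6], [0.3, 0.2], [0.5, 0.9]])
scores = clf.decision_function(candidates)
ranking = np.argsort(-scores)  # indices of candidates, best first
```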
Beyond that, look into pairwise and listwise learning to rank; what you're suggesting is the so-called pointwise approach. IIRC, pairwise works a lot better in practice. The reason is that with pairwise ranking, you need far fewer clicks: instead of having users label documents as relevant/irrelevant, you only give them a "relevant" button. Then you learn a binary classifier on triples (document1, document2, query) that tells whether document1 is more relevant to the query than document2, or vice versa. When a user labels, say, document 4 in the ranking as relevant, that gives you six samples to learn from:
- document4 > document3
- document4 > document2
- document4 > document1
- document1 < document4
- document2 < document4
- document3 < document4
so you get the negatives for free.
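The click-to-pairs construction above can be sketched in Python; `pairwise_samples` is a hypothetical helper name, and the `(doc_a, doc_b, query, label)` tuple layout is just one possible encoding of the triples:

```python
def pairwise_samples(ranking, clicked_index, query):
    """Turn one 'relevant' click into pairwise training samples:
    the clicked document beats every document ranked above it,
    emitted in both orders, so the negatives come for free."""
    samples = []
    clicked = ranking[clicked_index]
    for other in ranking[:clicked_index]:
        samples.append((clicked, other, query, 1))  # clicked > other
        samples.append((other, clicked, query, 0))  # other < clicked
    return samples


# A click on document 4 (index 3) in a four-document ranking
samples = pairwise_samples(["doc1", "doc2", "doc3", "doc4"], 3, "some query")
print(len(samples))  # 6
```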
(These are all just suggestions, I haven't tried any of this. I just happen to have worked in a research group where people investigated learning to rank. I did do a presentation of someone else's paper for a reading group once, maybe the slides can be of help.)
Best Answer
I would use an existing platform designed for search. You mentioned Lucene and there are others around based on the language you are using.
If you want to create a standalone search server that is language agnostic, look at Solr. It is based on Lucene, so there's lots of support.
I personally like Sphinx, but it may not work in your situation; it all depends on the language you are using.