Algorithm for matching text items with similar content

algorithms

I am working on a website (C#, ASP.NET MVC 3) that reads RSS feeds from multiple sources and stores each feed item's title and summary in a database table (SQL Server).

What I want to do is:
Put an algorithm in place that can relate items across feeds. For example, if each feed item is a news story, I would like to relate all the stories that say, in differently worded English, "Someone has won some election".

Is there a standard algorithm for this kind of content matching?
If not, what kind of custom algorithm should be used?

If this logic can be written on the database side (e.g. as a stored procedure), that would be better.

Best Answer

As @Cosmin-Prund said, there's no trivial or good pre-existing way to do this. Off the top of my head, I'd suggest using a search engine like Lucene to tokenize and store the feed titles. Use a stemming tokenizer so that you can match words even when they appear in different forms (such as "wins" vs "winning"). Then, when you process a new feed item, you can search for its title as keywords and see what you get back. You'll have to experiment to tune the results to do what you want (try dropping the two most common tokens?), but it ought to be in the right ballpark of what you're looking for.
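
To make the idea concrete, here is a minimal sketch in Python of the tokenize, stem, and match steps described above. It does not use Lucene. The stop-word list, the `crude_stem` suffix stripper, the Jaccard score, and the 0.4 threshold are all assumptions chosen for illustration. A real setup would more likely rely on a search engine's analyzer and its relevance scoring.

```python
import re

# Tiny illustrative stop-word list (an assumption; real analyzers ship larger ones).
STOP_WORDS = {"a", "an", "the", "in", "of", "to", "by", "for", "on", "is", "has", "and"}


def crude_stem(word):
    """Very rough stand-in for a real stemmer (e.g. Porter); strips a few suffixes."""
    for suffix in ("ning", "ing", "es", "s", "ed"):
        # Only strip when at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokens(title):
    """Lowercase the title, split it into words, drop stop words, stem the rest."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return {crude_stem(w) for w in words if w not in STOP_WORDS}


def jaccard(a, b):
    """Share of stemmed tokens two titles have in common (0.0 to 1.0)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_related(new_title, stored_titles, threshold=0.4):
    """Return stored titles whose token overlap with new_title meets the threshold,
    best match first. The threshold is a guess and needs tuning on real data."""
    new_tokens = tokens(new_title)
    scored = [(jaccard(new_tokens, tokens(t)), t) for t in stored_titles]
    return [t for score, t in sorted(scored, reverse=True) if score >= threshold]
```

Calling `find_related("Smith winning mayoral election", stored_titles)` would compare the new title against every stored one. Note that a suffix stripper like this cannot connect irregular forms such as "won" and "wins", which is one reason a proper stemmer or a search engine's analyzer is the better tool.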
