Algorithm for matching similar content text items

algorithms

I am working on a website (C#, ASP.NET MVC 3) that reads RSS feeds from multiple sources and stores each feed item's title and summary in a database table (SQL Server).

What I want to do is:
Put an algorithm in place that can relate multiple feed items. For example, if each feed item is a news item, I would like to relate all the news items that say, in differently worded English, "Someone has won some election".

Is there a standard algorithm for this kind of content matching?
If not, what kind of custom algorithm should be used?

If this logic can be written on the database side (e.g. as a stored procedure), that would be better.

Best Answer

As @Cosmin-Prund said, there's no trivial or good pre-existing way to do this. My off-the-top-of-my-head suggestion would be to use a search engine like Lucene to tokenize and store the feed title. Use a stemming tokenizer, so that you can match words even if they're in different forms (such as wins vs winning). Then, when you process a new feed, you can search for the title as keywords, and see what you get back. You'll have to play with it some to find out how to tune the results to do what you want (try dropping the two most common tokens?), but it ought to be in the right ballpark of what you're looking for.
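As a rough illustration of the stemming-and-keyword idea, here is a minimal Python sketch. It is not Lucene and does not use Lucene's API: the `crude_stem` suffix stripper, the stopword list, the `find_related` helper and the `min_shared` threshold are all hypothetical stand-ins, and a real setup would rely on the search engine's own analyzer and scoring.

```python
import re

# Hypothetical stopword list; a search engine ships with a fuller one.
STOPWORDS = {"a", "an", "the", "in", "of", "to", "has", "is", "for", "on"}


def crude_stem(word):
    """Very rough suffix stripping, standing in for a real stemmer."""
    for suffix in ("ning", "ing", "es", "s", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def title_tokens(title):
    """Lowercase the title, drop stopwords, and stem what is left."""
    words = re.findall(r"[a-z]+", title.lower())
    return {crude_stem(w) for w in words if w not in STOPWORDS}


def find_related(new_title, stored_titles, min_shared=2):
    """Return stored titles sharing at least min_shared stems, best first."""
    new_tokens = title_tokens(new_title)
    scored = []
    for title in stored_titles:
        shared = len(new_tokens & title_tokens(title))
        if shared >= min_shared:
            scored.append((shared, title))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for shared, title in scored]
```

With this sketch, "Smith winning election for mayor" matches a stored title "Smith wins the mayoral election" because both reduce to the stems smith, win and election. Tuning `min_shared` and the stopword list is the "play with it" part.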

Defining a Metric

First, you need to decide what makes articles similar. There are two main approaches: looking for similarities in article topics or looking for similarities in article text. Topics will give better results but text is easier to implement.

Similarity by Topic

As mentioned several times, the easiest way to implement this system is to let authors specify topics through tags. You would then search for articles with large overlaps in tags. If the tags are numerous and fine-grained enough, this should give the best results.
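One simple way to measure "large overlaps in tags" is the Jaccard similarity of the two tag sets. The sketch below assumes each article's tags are available as a Python set; the function names and the `top_n` parameter are hypothetical and only serve the illustration.

```python
def tag_overlap(tags_a, tags_b):
    """Jaccard similarity: shared tags divided by all distinct tags."""
    a, b = set(tags_a), set(tags_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def most_similar(target_tags, tagged_articles, top_n=3):
    """Rank articles by tag overlap with the target, best first.

    tagged_articles maps an article title to its set of tags.
    Articles with no shared tag are left out.
    """
    scored = [(tag_overlap(target_tags, tags), title)
              for title, tags in tagged_articles.items()]
    scored.sort(reverse=True)
    return [title for score, title in scored[:top_n] if score > 0]
```

Two articles tagged {election, politics, uk} and {election, politics, france} share 2 of 4 distinct tags, so they score 0.5.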

The disadvantage is that you need to put a lot of thought into what the tags are to ensure you have coverage, clarity, and a lack of redundancy. If you take the Stack Exchange approach of letting users create their own tags then you can increase coverage but you need to moderate the tags to maintain the clarity/lack of redundancy. However, the greatest drawback of this approach is that you are trusting users to appropriately tag their posts. SE gets around this problem by letting other users edit and make suggestions for the tags.

You can get even better results if you tag topics at the sentence or paragraph level. This gives a better picture of which topics matter most in an article, but it's more work: the smaller the tagging scope, the more units there are to tag and the harder the task becomes.

What about an automated solution to take the workload off the users? Automatic topic identification has been studied a lot. I'm not an expert in it, but I suggest you read a few papers and decide whether these solutions are mature enough to give reliable results. My concern with this approach is that, since you admit your domain is niche, you might have a hard time finding an out-of-the-box solution and will need to implement the topic identifier yourself. At that point you might as well do text-based similarity, because it will be much easier and out-of-the-box solutions exist.

Similarity by Text

In this approach instead of comparing topic tags, you compare the actual words in the article. The advantage is that the preprocessing is much easier to accomplish. The disadvantage is that it assumes that similar text means a similar topic, which is not always the case.

Making it Work

In general, whichever metric you choose you will end up with a vector representing your articles. Maybe the vector is of word frequencies or of topic tags. You now need to compare the vectors for your articles to see which are similar.
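A common way to compare such vectors is cosine similarity. The sketch below assumes raw word counts as the vector and a simple regular-expression tokenizer; both are illustrative choices, not something prescribed above.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Turn a text into a sparse vector of word frequencies."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors (0.0 to 1.0)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Texts with no words in common score 0.0 and identical texts score 1.0 (up to floating-point rounding). The same function works for topic tags if each tag is given a count of 1.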

The Stanford Natural Language Processing course offered on Coursera is a good introduction to Information Retrieval (specifically the Week 7 lectures). Keep in mind that the solutions presented in those lectures are relatively basic, but it's a good start.

I would strongly suggest trying to find an out-of-the-box implementation here. Failing that, using a toolkit like Apache Lucene will greatly simplify your development.

Now you need to test out a bunch of term weighting algorithms and see which one gives the best results for your data. TREC (the Text REtrieval Conference) is an ongoing evaluation series in which participants compete to find better and better retrieval and weighting methods. Check the proceedings on their website to find discussions of newer, more accurate weighting algorithms.
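As one example of a term weighting scheme to try first, here is a minimal sketch of a basic TF-IDF variant (raw term count times log(N / document frequency)). This is only one of many weightings, and the tokenizer and function name are assumptions made for the illustration.

```python
import math
import re
from collections import Counter


def tf_idf_vectors(docs):
    """Return one {word: weight} dict per document.

    weight = term count in the document * log(N / number of documents
    containing the word), so words that occur everywhere get weight 0.
    """
    tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append({w: term_freq[w] * math.log(n_docs / doc_freq[w])
                        for w in term_freq})
    return vectors
```

Words that appear in every document, such as "the" in a news corpus, end up with weight 0, while rarer words dominate the comparison. Swapping in a different formula inside the dict comprehension is enough to test another weighting against your data.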

Database – An algorithm for finding subset matching criteria

Following your clarifications to the question, I would use this logic to list the recipeids, excluding any recipe that has a missing or insufficient ingredient.

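-- List each recipe whose ingredients are all available in sufficient quantity.
-- DISTINCT avoids repeating a recipeid once per ingredient row.
SELECT DISTINCT RI.recipeid
FROM recipeingredients RI
-- The subquery collects recipes with at least one missing or insufficient ingredient.
WHERE RI.recipeid NOT IN (SELECT RI1.recipeid
                          FROM recipeingredients RI1
                          LEFT OUTER JOIN Ingredients I ON RI1.ingredientsid = I.id
                                                       AND RI1.Requiredquantity <= I.availablequantity
                          WHERE I.id IS NULL
                         )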
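To see how the query behaves, here is a small Python sketch that runs it against an in-memory SQLite database. The table layout and sample rows are assumptions inferred from the column names in the query, SQLite stands in for whatever database is actually used, and DISTINCT and ORDER BY are added so that each recipe is listed once in a stable order.

```python
import sqlite3

QUERY = """
SELECT DISTINCT RI.recipeid
FROM recipeingredients RI
WHERE RI.recipeid NOT IN (SELECT RI1.recipeid
                          FROM recipeingredients RI1
                          LEFT OUTER JOIN ingredients I
                            ON RI1.ingredientsid = I.id
                           AND RI1.requiredquantity <= I.availablequantity
                          WHERE I.id IS NULL)
ORDER BY RI.recipeid
"""


def build_demo_db():
    """Create an in-memory database with hypothetical sample data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ingredients (id INTEGER PRIMARY KEY, availablequantity INTEGER)")
    conn.execute("CREATE TABLE recipeingredients "
                 "(recipeid INTEGER, ingredientsid INTEGER, requiredquantity INTEGER)")
    # Ingredient 1 has 500 units in stock, ingredient 2 has 2.
    conn.executemany("INSERT INTO ingredients VALUES (?, ?)", [(1, 500), (2, 2)])
    conn.executemany("INSERT INTO recipeingredients VALUES (?, ?, ?)", [
        (1, 1, 200), (1, 2, 2),  # recipe 1: everything is in stock
        (2, 1, 600),             # recipe 2: needs more of ingredient 1 than available
        (3, 3, 1),               # recipe 3: ingredient 3 is not stocked at all
    ])
    return conn


def makeable_recipes(conn):
    """Return the ids of recipes with no missing or insufficient ingredient."""
    return [row[0] for row in conn.execute(QUERY)]
```

With this sample data only recipe 1 is returned: recipe 2 fails the quantity check and recipe 3 has no matching ingredient row, so both land in the excluded subquery.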
