Data Mining – How to Cluster Strings Based on Relation

data mining

If you don't know WEKA, you can give a theoretical answer. I don't need literal code/examples…

I have a huge data set of strings, and I want to cluster them to find the most related ones; these could also be seen as duplicates. I already have a set of pairs of strings that I know are duplicates of each other, so now I want to do some data mining on those two sets.

The result I'm looking for is a system that would return the most relevant candidate pairs of strings that are not yet known to be duplicates. I believe I need clustering for this; which type?

Note that I want to base this on word-occurrence comparison, not on interpretation or meaning.


Here is an example of two strings that we know are duplicates (in our view of them):

  • The weather is really cold and it is raining.

  • It is raining and the weather is really cold.

Now, the following strings also exist (from most to least relevant, ignoring stop words):

  • Is the weather really that cold today?

  • Rainy days are awful.

  • I see the sunshine outside.

The software would return the following two strings as the most relevant pair that isn't yet known to be a duplicate:

  • The weather is really cold and it is raining.

  • Is the weather really that cold today?

Then I would mark that pair as duplicate or not duplicate, and it would present me with another pair.


How do I implement this in a way that is efficient enough to apply to a large data set?

Best Answer

This is obviously non-trivial, but there are algorithms that at least attempt to do things like this. I hasten to add, however, that they're statistical, so trying to use only two sentences as a basis is going to be extremely iffy at best.

The usual approach runs something like this (a short code sketch follows the list):

  1. filter out stop words
  2. use a thesaurus to substitute one canonical word for each word
  3. count occurrences of words in each document/sentence
  4. compute the cosine distance between the base document(s) and each candidate similar document
  5. pick the N closest to the base documents
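
For concreteness, here's a minimal sketch of steps 1, 3, 4 and 5 in Python; the thesaurus step is omitted, and the tiny stop-word list is only an illustration, not a recommendation:

    import math
    from collections import Counter

    # Toy stop-word list for illustration only; use a real list in practice.
    STOP_WORDS = {"the", "is", "and", "it", "that", "a", "i", "are"}

    def to_vector(sentence):
        # Lower-case, strip trailing punctuation, drop stop words, count terms.
        tokens = [t.strip(".,?!").lower() for t in sentence.split()]
        return Counter(t for t in tokens if t and t not in STOP_WORDS)

    def cosine_similarity(a, b):
        dot = sum(a[t] * b[t] for t in set(a) & set(b))
        norm_a = math.sqrt(sum(v * v for v in a.values()))
        norm_b = math.sqrt(sum(v * v for v in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    def closest(base, candidates, n=2):
        base_vec = to_vector(base)
        scored = [(cosine_similarity(base_vec, to_vector(c)), c) for c in candidates]
        return sorted(scored, reverse=True)[:n]

    base = "The weather is really cold and it is raining."
    candidates = [
        "Is the weather really that cold today?",
        "Rainy days are awful.",
        "I see the sunshine outside.",
    ]
    print(closest(base, candidates))
    # The weather question scores 0.75; the other two score 0 without stemming.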

Note that there's room for a lot of variation here, though. For example, the thesaurus can give considerably better results if it's context sensitive, and to maintain the context you often want to retain the stop words, at least until that step is complete. Consider, for instance, your base documents about the weather being compared to "I have a cold" and "It is cold". If you follow the steps above, both will reduce to just "cold" after the first two steps, and both will seem equally close to the base documents.

With a context-sensitive thesaurus step (an ontology, really), you'd use the extra words to disambiguate the two uses of "cold", so when you compute distances, one would refer to the disease named "the cold", and the other to "cold weather". The base documents would both refer to cold weather, so your result would show "It is cold" as similar, but "I have a cold" as different.
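
If you want to experiment with that idea without building an ontology yourself, one crude stand-in is the Lesk word-sense-disambiguation algorithm over WordNet, which ships with NLTK. This is just an illustration of the concept, not the approach described above, and Lesk is notoriously unreliable on contexts this short, so treat its output with suspicion:

    from nltk.tokenize import word_tokenize
    from nltk.wsd import lesk

    # Requires the NLTK tokenizer and WordNet data packages to be downloaded.
    weather_sense = lesk(word_tokenize("It is cold"), "cold")
    illness_sense = lesk(word_tokenize("I have a cold"), "cold")

    print(weather_sense, weather_sense.definition() if weather_sense else None)
    print(illness_sense, illness_sense.definition() if illness_sense else None)
    # When the two senses differ, "It is cold" and "I have a cold" can be told
    # apart even though both reduce to the single token "cold" after step 1.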

If you're trying to keep things simpler, however, you might skip the thesaurus completely and just stem the words instead. This turns "raining" and (depending on the stemmer) "rainy" into "rain", so when you do comparisons they'll show up as the same term.
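
A quick sketch of that simplification, assuming NLTK's Porter stemmer is installed; exact stems vary from stemmer to stemmer, so check the output on your own vocabulary before settling on one:

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()

    print([stemmer.stem(w) for w in ["raining", "rained", "rains"]])
    # ['rain', 'rain', 'rain'] -- inflected forms collapse to a single stem.
    print(stemmer.stem("rainy"))
    # Some stemmers keep this one distinct; not every related form ends up
    # with exactly the same stem.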

As far as details go, stop-word lists are easy to find, and at least in my testing the exact choice isn't particularly critical.

For a thesaurus, I've used the Moby Thesaurus, with some (substantial) processing to basically invert it -- i.e., rather than finding multiple synonyms for one word, find one canonical word for a given input.
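
A sketch of that inversion is below; it assumes the usual mthesaur.txt layout (one entry per line, the root word first and its synonyms after, all comma-separated), so check the format of your copy before relying on it:

    def build_canonical_map(path):
        # Map every synonym back to the first root word it appears under.
        canonical = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                words = [w.strip().lower() for w in line.split(",") if w.strip()]
                if not words:
                    continue
                root, synonyms = words[0], words[1:]
                canonical.setdefault(root, root)
                for syn in synonyms:
                    # "First root wins" is arbitrary; a real system needs a
                    # smarter policy for words listed under several roots.
                    canonical.setdefault(syn, root)
        return canonical

    # canon = build_canonical_map("mthesaur.txt")
    # canon.get("chilly", "chilly")  # ideally the same canonical word as "cold"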

There aren't as many papers on context-sensitive synonym/definition searching -- but some are still quite good. A lot of work on the "semantic web" and related ontologies is along this line as well (though little of it is likely to be of much help in your case).

For stemming, the Porter Stemmer is well known. There's a newer, slightly modified version (Porter2, also known as the Snowball English stemmer) that should be covered somewhere on the same page(s). Another well-known algorithm is the Lancaster Stemmer. There's also the Lovins stemmer, but I wouldn't really recommend it [1] -- though it's still widely known because it was the first (well known) stemming algorithm published. Note that most (all?) of these strip only suffixes, not prefixes.

Quite a few papers discuss cosine distance. It's well enough known that even the Wikipedia entry for it is pretty decent.
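
To make that concrete with the question's own sentences (and assuming a stop list that removes "the", "is", "and", "it", and "that"): the base sentence reduces to the terms {weather, really, cold, raining} and "Is the weather really that cold today?" reduces to {weather, really, cold, today}. Each vector has four terms with count 1 and they share three of them, so the cosine similarity is 3 / (√4 × √4) = 0.75 (a cosine distance of 0.25), while "I see the sunshine outside." shares nothing with the base sentence and scores 0.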

Quite a few people have already assembled these pieces into coherent (or at least generally coherent) tool-kits, complete programs, etc. A few reasonably well-known examples include WordNet, NLTK, Apache OpenNLP, and FreeLing.


[1] In particular, Lovins only ever removes one suffix from a word. If you had, for example, "loverly" and "lovingly", Porter would reduce both to "lov" and they'd show up as synonyms, but Lovins would reduce them to "lover" and "loving", respectively, and they'd show up as different. It is possible to repeat the Lovins algorithm until it removes no more suffixes, but the result isn't very good -- Porter has quite a bit of context sensitivity, so (for example) it only removes one suffix if it didn't remove another; repeated applications of Lovins wouldn't take that into account.
