C# – Clustering Strings on the basis of Common Substrings

c, cluster, data mining, sql, strings

I have more than 10,000 strings and need to identify and group those that look similar. I base similarity on the number of common words between any two given strings: the more words two strings share, the more similar they are. For instance:

  1. How to make another layer from an existing layer
  2. Unable to edit data on the network drive
  3. Existing layers in the desktop
  4. Assistance with network drive

In this case, strings 1 and 3 are similar because they share the words "existing" and "layer", and strings 2 and 4 are similar because they share "network" and "drive" (after eliminating stop words).
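To make the comparison concrete, here is a minimal sketch of counting common words between two strings. The stop-word list and the trailing-"s" stripping (so that "layers" matches "layer") are illustrative assumptions, not a proper stop list or stemmer, and `WordOverlap`, `Keywords` and `CommonWordCount` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class WordOverlap
{
    // Hypothetical, deliberately tiny stop-word list; a real one would be much longer.
    private static readonly HashSet<string> StopWords = new HashSet<string>
    {
        "a", "an", "the", "to", "in", "on", "from", "with", "how"
    };

    private static readonly char[] Separators = { ' ', '\t', ',', '.', ';', ':', '?', '!' };

    // Lower-cases the text, splits it into words, drops stop words, and strips a
    // trailing "s" from words longer than three characters (a crude stand-in for stemming).
    public static HashSet<string> Keywords(string text)
    {
        var words = new HashSet<string>();
        foreach (var raw in text.ToLowerInvariant()
                                .Split(Separators, StringSplitOptions.RemoveEmptyEntries))
        {
            if (StopWords.Contains(raw)) continue;
            bool plural = raw.Length > 3 && raw[raw.Length - 1] == 's';
            words.Add(plural ? raw.Substring(0, raw.Length - 1) : raw);
        }
        return words;
    }

    // Number of distinct keywords that both strings contain.
    public static int CommonWordCount(string a, string b)
    {
        var shared = Keywords(a);
        shared.IntersectWith(Keywords(b));
        return shared.Count;
    }
}
```

With the four sample strings, this gives 2 for the pair (1, 3), 2 for the pair (2, 4), and 0 for the pair (1, 2).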

The steps I'm following are:

  1. Iterate through the data set
  2. Do a row by row comparison
  3. Find the common words between the strings
  4. Form a cluster where the number of common words is greater than or equal to 2 (after eliminating stop words)
  5. If the number of common words is less than 2, put the string in a new cluster
  6. Assign the rows either to the existing clusters or form a new one depending upon the common words
  7. Continue until all the strings are processed

I am implementing the project in C# and have got as far as step 3. However, I'm not sure how to proceed with the clustering. I have researched string clustering a lot but could not find a solution that fits my problem. Your input would be highly appreciated.
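One possible reading of steps 4 to 7 is a greedy single pass over the rows, sketched below. It assumes each string has already been reduced to a set of keywords (steps 1 to 3). The steps do not say which cluster member a new row is compared against, so the rule used here (join the first cluster in which any member shares enough keywords) is an assumption, and `GreedyClusterer` is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class GreedyClusterer
{
    // Returns clusters as lists of row indices into keywordSets.
    public static List<List<int>> Cluster(
        IReadOnlyList<HashSet<string>> keywordSets, int minCommon = 2)
    {
        var clusters = new List<List<int>>();
        for (int row = 0; row < keywordSets.Count; row++)
        {
            var current = keywordSets[row];

            // Assumption: a row joins the first cluster in which ANY member
            // shares at least minCommon keywords with it.
            var home = clusters.FirstOrDefault(cluster =>
                cluster.Any(member =>
                    keywordSets[member].Intersect(current).Count() >= minCommon));

            if (home == null)
            {
                // No existing cluster is similar enough, so start a new one.
                home = new List<int>();
                clusters.Add(home);
            }
            home.Add(row);
        }
        return clusters;
    }
}
```

Because a row can join through any member, clusters may chain: A matches B and B matches C, even when A and C share nothing. Comparing only against the first member of each cluster would avoid that, at the cost of making the result depend more heavily on row order. The worst case is a quadratic number of set intersections, which should still be workable for roughly 10,000 short strings.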

Best Answer

One technique for clustering multi-dimensional numeric data is the Kohonen self-organising feature map. It's a little too involved to describe fully here, but it should be covered in any beginner-level text on machine learning.
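To give a rough idea of the shape of the algorithm, here is a heavily simplified sketch, not a full treatment. It assumes a one-dimensional line of nodes, a linearly decaying learning rate and neighbourhood radius, and nodes initialised from evenly spaced input vectors so that runs are reproducible; textbook versions usually use a two-dimensional grid and random initialisation. `TinySom` and its members are hypothetical names.

```csharp
using System;
using System.Linq;

public sealed class TinySom
{
    private readonly double[][] _weights; // one weight vector per node on a 1-D line

    // Assumption: each node starts as a copy of an evenly spaced input vector.
    public TinySom(int nodeCount, double[][] data)
    {
        _weights = Enumerable.Range(0, nodeCount)
            .Select(i => nodeCount == 1 ? 0 : i * (data.Length - 1) / (nodeCount - 1))
            .Select(index => (double[])data[index].Clone())
            .ToArray();
    }

    // Index of the node whose weights are closest (squared Euclidean) to the input.
    public int BestMatchingUnit(double[] input)
    {
        int best = 0;
        double bestDistance = double.MaxValue;
        for (int node = 0; node < _weights.Length; node++)
        {
            double distance = 0;
            for (int d = 0; d < input.Length; d++)
            {
                double diff = input[d] - _weights[node][d];
                distance += diff * diff;
            }
            if (distance < bestDistance) { bestDistance = distance; best = node; }
        }
        return best;
    }

    public void Train(double[][] data, int epochs = 50, double startRate = 0.5)
    {
        double startRadius = Math.Max(1.0, _weights.Length / 2.0);
        for (int epoch = 0; epoch < epochs; epoch++)
        {
            // Learning rate and neighbourhood radius both shrink linearly over time.
            double decay = 1.0 - (double)epoch / epochs;
            double rate = startRate * decay;
            double radius = Math.Max(0.01, startRadius * decay);

            foreach (var input in data)
            {
                int bmu = BestMatchingUnit(input);
                for (int node = 0; node < _weights.Length; node++)
                {
                    // Nodes near the winner on the line are pulled more strongly.
                    double gridDistance = node - bmu;
                    double influence = Math.Exp(-(gridDistance * gridDistance) / (2 * radius * radius));
                    for (int d = 0; d < input.Length; d++)
                        _weights[node][d] += rate * influence * (input[d] - _weights[node][d]);
                }
            }
        }
    }
}
```

After training, the cluster of an input is simply the index returned by `BestMatchingUnit`; inputs that land on the same or neighbouring nodes are the similar ones.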

This just leaves the problem of converting your data to numeric form. To do this, I'd first run an analysis to find a reasonable number (say 100) of words that appear in many of your strings, but not too many. You're looking for words in the middle of the frequency distribution, as these carry the most useful information. You can then use the presence or absence of these words as inputs to your feature map.
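A minimal sketch of that conversion follows, assuming each string has already been split into a set of words. The frequency bounds `minDocs` and `maxDocs` are hypothetical tuning parameters you would pick by looking at your own data, and `FeatureBuilder` is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FeatureBuilder
{
    // Picks up to maxWords words that occur in at least minDocs and at most
    // maxDocs strings, most frequent first (ties broken alphabetically).
    public static List<string> SelectVocabulary(
        IEnumerable<HashSet<string>> docs, int minDocs, int maxDocs, int maxWords)
    {
        var docFrequency = new Dictionary<string, int>();
        foreach (var doc in docs)
            foreach (var word in doc)
                docFrequency[word] = docFrequency.TryGetValue(word, out var n) ? n + 1 : 1;

        return docFrequency
            .Where(kv => kv.Value >= minDocs && kv.Value <= maxDocs)
            .OrderByDescending(kv => kv.Value)
            .ThenBy(kv => kv.Key, StringComparer.Ordinal)
            .Take(maxWords)
            .Select(kv => kv.Key)
            .ToList();
    }

    // One input per vocabulary word: 1.0 if the string contains it, else 0.0.
    public static double[] ToVector(HashSet<string> doc, IReadOnlyList<string> vocabulary)
    {
        return vocabulary.Select(word => doc.Contains(word) ? 1.0 : 0.0).ToArray();
    }
}
```

The resulting `double[]` arrays, one per string, are what you would feed to the feature map.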
