C# – Clustering Strings on the basis of Common Substrings

c, cluster, data mining, sql, strings

I have more than 10,000 strings and need to identify and group those that look similar. I base similarity on the number of words two given strings have in common: the more common words, the more similar the strings. For instance:

1. How to make another layer from an existing layer
2. Unable to edit data on the network drive
3. Existing layers in the desktop
4. Assistance with network drive

In this case, strings 1 and 3 are similar, sharing the words "existing" and "layer", and strings 2 and 4 are similar, sharing the words "network" and "drive" (after eliminating stop words).

The steps I'm following are:

1. Iterate through the data set.
2. Do a row-by-row comparison.
3. Find the common words between the strings.
4. Form a cluster where the number of common words is greater than or equal to 2 (after eliminating stop words).
5. If the number of common words is less than 2, put the string in a new cluster.
6. Assign each row either to an existing cluster or to a new one, depending on the common words.
7. Continue until all the strings are processed.
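The steps above could be sketched roughly as follows. This is only a minimal illustration, not a definitive implementation: the class name `CommonWordClustering`, the stop-word list, the "strip a trailing s" rule (a crude stand-in for stemming, so that "layers" matches "layer"), and the choice to join the first cluster containing any sufficiently similar member are all assumptions that the question does not specify.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CommonWordClustering
{
    // Illustrative stop-word list (an assumption; extend it for real data).
    static readonly HashSet<string> StopWords = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
    {
        "a", "an", "the", "to", "in", "on", "of", "with", "from", "how", "and", "for", "is"
    };

    static readonly char[] Separators = { ' ', '\t', ',', '.', ';', ':', '!', '?', '(', ')' };

    // Lower-case the text, split it into words, drop stop words, and strip a
    // trailing "s" from longer words (crude stemming, assumed for illustration).
    public static HashSet<string> Tokenize(string text)
    {
        var words = new HashSet<string>();
        foreach (var raw in text.ToLowerInvariant().Split(Separators, StringSplitOptions.RemoveEmptyEntries))
        {
            if (StopWords.Contains(raw)) continue;
            var word = raw.Length > 3 && raw.EndsWith("s", StringComparison.Ordinal)
                ? raw.Substring(0, raw.Length - 1)
                : raw;
            words.Add(word);
        }
        return words;
    }

    // Put each row into the first cluster that has a member sharing at least
    // minCommon words with it; if there is none, start a new cluster.
    public static List<List<string>> Cluster(IEnumerable<string> rows, int minCommon = 2)
    {
        var clusters = new List<List<string>>();
        var clusterTokens = new List<List<HashSet<string>>>();

        foreach (var row in rows)
        {
            var tokens = Tokenize(row);
            int target = -1;
            for (int i = 0; i < clusters.Count && target < 0; i++)
            {
                if (clusterTokens[i].Any(member => member.Count(w => tokens.Contains(w)) >= minCommon))
                    target = i;
            }
            if (target < 0)
            {
                clusters.Add(new List<string>());
                clusterTokens.Add(new List<HashSet<string>>());
                target = clusters.Count - 1;
            }
            clusters[target].Add(row);
            clusterTokens[target].Add(tokens);
        }
        return clusters;
    }
}
```

Calling `CommonWordClustering.Cluster(rows)` on the four sample strings groups strings 1 and 3 together and strings 2 and 4 together. Note that the result depends on the order of the rows, and that every row may be compared against every earlier row, so the cost grows quadratically with the number of strings.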

I am implementing the project in C# and have got as far as step 3. However, I'm not sure how to proceed with the clustering. I have researched string clustering extensively but could not find a solution that fits my problem. Any input would be highly appreciated.

Best Answer

One technique that can be used to cluster multi-dimensional numeric data is the Kohonen self-organising feature map. It's a little too involved to describe here, but it should be covered in any beginner-level text on machine learning.
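To give a rough idea of what such a map involves, here is a heavily simplified sketch. It is not a reference implementation: the class name `SelfOrganisingMap`, the rectangular grid, the linear decay of learning rate and radius, and the Gaussian neighbourhood function are all choices assumed here for illustration, and a textbook treatment will discuss better ones.

```csharp
using System;

public sealed class SelfOrganisingMap
{
    readonly int width, height, dimensions;
    readonly double[][] weights; // one weight vector per grid node, stored row by row

    public SelfOrganisingMap(int width, int height, int dimensions, int seed = 0)
    {
        this.width = width;
        this.height = height;
        this.dimensions = dimensions;
        var random = new Random(seed);
        weights = new double[width * height][];
        for (int n = 0; n < weights.Length; n++)
        {
            weights[n] = new double[dimensions];
            for (int d = 0; d < dimensions; d++) weights[n][d] = random.NextDouble();
        }
    }

    // Index of the node whose weights are closest to the input
    // (squared Euclidean distance): the "best matching unit".
    public int BestMatchingUnit(double[] input)
    {
        int best = 0;
        double bestDistance = double.MaxValue;
        for (int n = 0; n < weights.Length; n++)
        {
            double distance = 0;
            for (int d = 0; d < dimensions; d++)
            {
                double diff = input[d] - weights[n][d];
                distance += diff * diff;
            }
            if (distance < bestDistance) { bestDistance = distance; best = n; }
        }
        return best;
    }

    // For every input, pull the best matching unit and its grid neighbours
    // towards the input; learning rate and radius shrink linearly per epoch.
    public void Train(double[][] data, int epochs, double initialRate = 0.5)
    {
        double initialRadius = Math.Max(width, height) / 2.0;
        for (int epoch = 0; epoch < epochs; epoch++)
        {
            double decay = 1.0 - (double)epoch / epochs;
            double rate = initialRate * decay;
            double radius = Math.Max(initialRadius * decay, 0.5);
            foreach (var input in data)
            {
                int bmu = BestMatchingUnit(input);
                int bx = bmu % width, by = bmu / width;
                for (int n = 0; n < weights.Length; n++)
                {
                    int dx = n % width - bx, dy = n / width - by;
                    double influence = Math.Exp(-(dx * dx + dy * dy) / (2 * radius * radius));
                    for (int d = 0; d < dimensions; d++)
                        weights[n][d] += rate * influence * (input[d] - weights[n][d]);
                }
            }
        }
    }
}
```

After training, each input can be assigned to the node returned by `BestMatchingUnit`, and inputs that land on the same node can be treated as one cluster. The grid size therefore caps the number of clusters, and would need tuning for a real data set.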

This just leaves the problem of how to convert your data to numeric form. To do this, I'd first run an analysis to find a reasonable number (say 100) of words that appear in many of your strings, but not in too many of them. You're looking for words in the middle of the frequency distribution, as these carry the most useful information. You can then use the presence or absence of these words as inputs to your feature map.
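One possible way to do that conversion is sketched below. The names `WordFeatures`, `SelectVocabulary` and `ToVector`, and the idea of expressing "the middle of the frequency distribution" as a minimum and maximum fraction of strings containing a word, are assumptions made for illustration; the right thresholds would have to be found by looking at the actual data.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordFeatures
{
    static readonly char[] Separators = { ' ', '\t', ',', '.', ';', ':', '!', '?', '(', ')' };

    // Distinct lower-cased words of one string.
    static HashSet<string> Words(string text) =>
        new HashSet<string>(text.ToLowerInvariant().Split(Separators, StringSplitOptions.RemoveEmptyEntries));

    // Count in how many strings each word occurs, keep the words whose count lies
    // between minFraction and maxFraction of all strings, and return at most
    // maxWords of them (most frequent first, ties broken alphabetically).
    public static List<string> SelectVocabulary(
        IReadOnlyList<string> rows, double minFraction, double maxFraction, int maxWords)
    {
        var documentFrequency = new Dictionary<string, int>();
        foreach (var row in rows)
            foreach (var word in Words(row))
                documentFrequency[word] = documentFrequency.TryGetValue(word, out var n) ? n + 1 : 1;

        return documentFrequency
            .Where(kv => kv.Value >= minFraction * rows.Count && kv.Value <= maxFraction * rows.Count)
            .OrderByDescending(kv => kv.Value)
            .ThenBy(kv => kv.Key, StringComparer.Ordinal)
            .Take(maxWords)
            .Select(kv => kv.Key)
            .ToList();
    }

    // One input per vocabulary word: 1.0 if the string contains it, else 0.0.
    public static double[] ToVector(string row, IReadOnlyList<string> vocabulary)
    {
        var words = Words(row);
        return vocabulary.Select(w => words.Contains(w) ? 1.0 : 0.0).ToArray();
    }
}
```

With a vocabulary of, say, 100 words, each string becomes a 100-element vector of zeros and ones, which is the kind of numeric input a feature map expects. Removing stop words before counting, as the question already does, would keep them out of the vocabulary on small data sets.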
