One technique that can be used to perform clustering on multi-dimensional numeric data is the Kohonen self-organising feature map. It's a little too involved to describe here, but should be included in any beginner's level text on machine learning.
This just leaves the problem of how to convert your data to numeric form. To do this, I'd first run an analysis to find a reasonable number (say 100) of words that appear in many of your strings, but not too many. You're looking for words in the middle of the frequency distribution, as these carry the most useful information. You can then use the presence or absence of each of these words as inputs to your feature map.
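As a sketch of the word-selection idea (the frequency thresholds and the toy strings below are made up for illustration; tune them for your corpus):

```python
from collections import Counter

def build_features(strings, min_frac=0.3, max_frac=0.8):
    """Binary presence/absence features for mid-frequency words."""
    n = len(strings)
    # Document frequency: in how many strings does each word appear?
    df = Counter()
    for s in strings:
        df.update(set(s.lower().split()))
    # Keep words in the middle of the frequency distribution.
    # (min_frac/max_frac are made-up thresholds; adjust for your data.)
    vocab = sorted(w for w, c in df.items() if min_frac <= c / n <= max_frac)
    # One 0/1 input per vocabulary word, per string.
    vectors = [[1 if w in set(s.lower().split()) else 0 for w in vocab]
               for s in strings]
    return vocab, vectors

strings = ["red apple pie", "green apple tart", "red berry tart", "blue sky"]
vocab, vectors = build_features(strings)
```

Each row of vectors can then be fed directly to the feature map as one training example.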
Firstly, please post some images and their corresponding plots from your implementation of the Hough transform. Without images and their plots, it is difficult to tell what is going on - especially since there is no source code to critique.
My suspicion is that your understanding of the Hough transform may not be correct. When the input is a single point in the (x, y) (spatial) space, the resulting Hough accumulator should be a sinusoidal curve.
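A minimal numpy sketch of this (the point (3, 4) is an arbitrary choice): a single point (x0, y0) votes, at every theta, for the one rho given by rho = x0·cos(theta) + y0·sin(theta), which traces a sinusoid whose amplitude is the point's distance from the origin.

```python
import numpy as np

# A single spatial point (x0, y0) votes, for every theta, for the line
# at signed distance rho = x0*cos(theta) + y0*sin(theta) from the origin.
# Plotted over theta, that is a sinusoid with amplitude sqrt(x0^2 + y0^2).
x0, y0 = 3.0, 4.0
thetas = np.linspace(0.0, np.pi, 180, endpoint=False)
rhos = x0 * np.cos(thetas) + y0 * np.sin(thetas)

amplitude = np.hypot(x0, y0)   # 5.0 for this point
```

Plotting rhos against thetas reproduces the sinusoid described above.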
Remember that the Hough transform is not merely a coordinate-space transform. It is an integral transform (note that the Hough and Radon transforms are related), therefore the value at a single point in the transformed space depends on an integral taken over every point in the input (x, y) space.
https://en.wikipedia.org/wiki/Hough_transform
A positive detection of a single line always shows up as a signature "bowtie" shape in the Hough transform space, when rendered as an image where the gray level intensity is proportional to the number of votes received (i.e. the integral). Please refer to the sample plot in the Wikipedia article linked above.
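To see the vote pile-up numerically, here is a toy accumulator (the grid resolution and the test line y = 10 are arbitrary choices) for a set of collinear points; each point contributes one sinusoid, and the bin where all the sinusoids intersect - the centre of the "bowtie" - collects the maximum vote count:

```python
import numpy as np

# Accumulate votes for 81 collinear points on the line y = 10,
# i.e. theta = pi/2, rho = 10 in normal form.
thetas = np.linspace(0.0, np.pi, 180, endpoint=False)
rho_max = 50                                   # covers all |rho| here
acc = np.zeros((2 * rho_max + 1, len(thetas)), dtype=np.int32)

for x in range(-40, 41):
    y = 10
    rhos = x * np.cos(thetas) + y * np.sin(thetas)
    bins = np.round(rhos).astype(int) + rho_max  # 1-pixel rho bins
    acc[bins, np.arange(len(thetas))] += 1

peak_rho_bin, peak_theta_bin = np.unravel_index(np.argmax(acc), acc.shape)
# Expect the peak at rho bin 10 + rho_max = 60 and theta bin 90 (pi/2),
# with all 81 points voting there.
```

Rendering acc as a grayscale image would show the bowtie shape converging on that peak.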
Your attempt to smooth the Hough transform space is on the right track, but the amount of smoothing depends on many factors. Render the accumulator as an image plot in order to fine-tune the amount of smoothing needed. Very likely the smoothing needs to be performed on a larger window: a Gaussian smoothing function with a sigma of 5 - 10 pixels is sometimes needed, depending on input characteristics.
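As a sketch of that smoothing step (the toy accumulator and sigma = 5 are illustrative choices from the range mentioned above, using scipy.ndimage):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# Toy Hough accumulator: a sharp vote spike on top of Poisson noise.
# Sigma = 5 px is one plausible value; in practice you would tune it
# against an image plot of the real accumulator.
rng = np.random.default_rng(0)
acc = rng.poisson(2.0, size=(200, 200)).astype(np.float64)
acc[100, 100] += 1000.0                      # the "true" peak

smoothed = gaussian_filter(acc, sigma=5.0)
peak = np.unravel_index(np.argmax(smoothed), smoothed.shape)
```

Smoothing spreads the spike over its neighbourhood, so peak detection by argmax becomes robust to isolated noisy votes.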
You can also compare your implementation's output with MATLAB's or OpenCV's.
Make sure the bins used for Hough transform accumulation have enough precision to avoid overflow. Put simply, if the votes are stored as 8-bit unsigned integers, the accumulator is very likely to overflow; use a wider type, such as a 32-bit integer.
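A quick demonstration of the wraparound (numpy is used here purely for illustration; the same applies to uint8 accumulators in any language):

```python
import numpy as np

# 300 votes in one bin: a uint8 accumulator silently wraps past 255,
# while a 32-bit accumulator records the true count.
votes = 300
bad = np.zeros(1, dtype=np.uint8)    # 8-bit unsigned: overflows
good = np.zeros(1, dtype=np.int32)   # 32-bit: plenty of headroom
for _ in range(votes):
    bad += 1
    good += 1
# bad[0] is now 300 % 256 == 44, not 300.
```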
The bandwidth is the distance/size scale of the kernel function, i.e. what the size of the “window” is across which you calculate the mean.
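To make "the size of the window across which you calculate the mean" concrete, here is one mean-shift iteration with a flat kernel (a simplified sketch with made-up points, not a full implementation):

```python
import numpy as np

# One mean-shift iteration with a flat kernel: each point is replaced
# by the mean of all points lying within `bandwidth` of it. The
# bandwidth is literally the radius of the averaging window.
def mean_shift_step(points, bandwidth):
    shifted = np.empty_like(points)
    for i, p in enumerate(points):
        in_window = np.linalg.norm(points - p, axis=1) <= bandwidth
        shifted[i] = points[in_window].mean(axis=0)
    return shifted

pts = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]])
out = mean_shift_step(pts, bandwidth=1.0)
# The two left points pull together, as do the two right points;
# with bandwidth=10.0 everything would collapse to the global mean.
```

Iterating this step until the points stop moving yields the cluster modes, which is why the choice of bandwidth dominates the result.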
There is no bandwidth that works well for all purposes and all instances of the data. Instead, you will need to either
manually select an appropriate bandwidth for your algorithm; or
use an algorithm that automatically adapts or estimates the bandwidth (though this implies some computational overhead).
The sklearn module offers an estimate_bandwidth() function based on a nearest-neighbor analysis.

Any discussion of bandwidth or kernels first requires that you have already defined a distance metric. For image processing you will have to select a suitable colour space. RGB might not produce the best results, depending on your goals.
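As a usage sketch with scikit-learn (the quantile value and the toy two-blob data are arbitrary illustrative choices, not recommendations):

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

# Two well-separated blobs; estimate_bandwidth derives a scale from
# nearest-neighbour distances, with quantile as its tuning knob.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, size=(50, 2)),
               rng.normal(5.0, 0.1, size=(50, 2))])

bandwidth = estimate_bandwidth(X, quantile=0.3)
labels = MeanShift(bandwidth=bandwidth).fit(X).labels_
n_clusters = len(set(labels))                # expect 2 here
```

For image data, X would hold one row per pixel in your chosen colour space (optionally with the pixel coordinates appended), and the distance metric implied by that representation is what the bandwidth is measured in.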