Is it possible to specify your own distance function using scikit-learn K-Means Clustering?
Tags: cluster-analysis, k-means, machine learning, python, scikit-learn
Best Answer
Here's a small kmeans that uses any of the 20-odd distances in scipy.spatial.distance, or a user function.
Comments would be welcome (this has had only one user so far, not enough); in particular, what are your N, dim, k, and metric?
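As a rough illustration of the idea (not the answer's original code), k-means with a pluggable metric might be sketched as below. The name `kmeans_any_metric` and its parameters are hypothetical; the only library call assumed is `scipy.spatial.distance.cdist`, which accepts either a metric name or a callable `f(u, v)`. Note that the update step still uses the plain mean, which is only the true minimizer for squared Euclidean distance, so with other metrics this is a heuristic.

```python
import numpy as np
from scipy.spatial.distance import cdist


def kmeans_any_metric(X, k, metric="euclidean", max_iter=100, seed=0):
    """Lloyd-style k-means sketch; `metric` is a cdist name or a callable f(u, v)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centres = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Distance from every point to every centre, in the chosen metric.
        D = cdist(X, centres, metric=metric)
        new_labels = D.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = X[labels == j]
            if len(members):  # an empty cluster keeps its old centre
                centres[j] = members.mean(axis=0)
    return centres, labels


if __name__ == "__main__":
    pts = np.array([[0, 0], [0, 1], [10, 10], [10, 11]], dtype=float)
    # A built-in metric by name, then a user-supplied function.
    print(kmeans_any_metric(pts, 2, metric="cityblock")[1])
    print(kmeans_any_metric(pts, 2, metric=lambda u, v: np.abs(u - v).sum())[1])
```

Passing a Python callable makes `cdist` call it once per pair of rows, so it is much slower than the built-in metric names; for large N it is worth checking whether one of the named metrics already fits.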
Some notes added 26 Mar 2012:
1) For cosine distance, first normalize all the data vectors to |X| = 1; then
    cosinedistance( X, Y ) = 1 - X . Y = Euclideandistance( X, Y )^2 / 2
is fast. For bit vectors, keep the norms separately from the vectors instead of expanding out to floats (although some programs may expand for you). For sparse vectors with, say, 1 % of the N entries nonzero, X . Y should take time O( 2 % N ), space O(N); but I don't know which programs do that.
2) Scikit-learn clustering gives an excellent overview of k-means, mini-batch-k-means ... with code that works on scipy.sparse matrices.
3) Always check cluster sizes after k-means. If you're expecting roughly equal-sized clusters, but they come out
[44 37 9 5 5] %
... (sound of head-scratching).
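
Notes 1 and 3 are easy to check numerically. The sketch below assumes NumPy and SciPy; the helper names `normalize_rows` and `cluster_size_percent` are made up for illustration. It verifies that, for unit-length rows, 1 - X . Y equals half the squared Euclidean distance, and it reports cluster sizes as percentages from a label array such as the one a k-means run returns.

```python
import numpy as np
from scipy.spatial.distance import cdist


def normalize_rows(X):
    """Scale each row to unit length (assumes no all-zero rows)."""
    X = np.asarray(X, dtype=float)
    return X / np.linalg.norm(X, axis=1, keepdims=True)


def cluster_size_percent(labels, k):
    """Percentage of points in each of the k clusters, largest first."""
    counts = np.bincount(labels, minlength=k)
    return np.sort(100.0 * counts / counts.sum())[::-1]


# Note 1: on unit vectors, cosine distance is a plain dot product away.
rng = np.random.default_rng(0)
X = normalize_rows(rng.normal(size=(6, 4)))
Y = normalize_rows(rng.normal(size=(3, 4)))
cos_dist = 1.0 - X @ Y.T
half_sq_euclid = cdist(X, Y, "sqeuclidean") / 2.0
identity_holds = np.allclose(cos_dist, half_sq_euclid)

# Note 3: look at the size distribution before trusting the clusters.
labels = np.array([0, 0, 0, 1, 1, 2, 0, 1, 0, 0])  # toy labels, not real output
sizes = cluster_size_percent(labels, k=3)  # 60, 30 and 10 percent here
```

Because of the identity, ordinary Euclidean k-means on normalized data ranks nearest centres the same way cosine distance would, as long as the centres are renormalized too.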