Python – Understanding “score” returned by scikit-learn KMeans

k-means python scikit-learn

I applied clustering to a set of text documents (about 100). I converted them to TF-IDF vectors using TfidfVectorizer and supplied the vectors as input to sklearn.cluster.KMeans(n_clusters=2, init='k-means++', max_iter=100, n_init=10). Now when I run

model.fit(X)           # X: the TF-IDF matrix of the documents
print(model.score(X))

on my vectors, I get a very small value if all the text documents are very similar, and I get a very large negative value if the documents are very different.

It serves my basic purpose of finding which sets of documents are similar, but can someone help me understand what exactly this model.score() value signifies for a fit? How can I use this value to justify my findings?

Best Answer

In the documentation it says:

Returns:    
score : float
Opposite of the value of X on the K-means objective.

To understand what that means, you need to look at the k-means algorithm. What k-means essentially does is find cluster centers that minimize the sum of squared distances between the data samples and their associated cluster centers (scikit-learn exposes this sum as inertia_).

It is a two-step process, where (a) each data sample is assigned to its closest cluster center, and (b) each cluster center is moved to the mean of all samples assigned to it. These steps are repeated until a stopping criterion (maximum number of iterations, or a minimal change between the last two iterations) is met.
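To make the two steps concrete, here is a minimal NumPy sketch of that loop. It is not scikit-learn's actual implementation; the function name lloyd_kmeans, the random initialisation, the tolerance and the toy data are all assumptions made for illustration.

```python
import numpy as np

def lloyd_kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    """Plain two-step k-means sketch; returns (centers, labels, objective)."""
    rng = np.random.default_rng(seed)
    # Illustrative initialisation: use k distinct samples as starting centers.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # (a) assign every sample to its closest center (squared Euclidean distance)
        sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dist.argmin(axis=1)
        # (b) move each center to the mean of its samples (keep it if its cluster is empty)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:  # centers barely moved: stop
            break
    # Final assignment and objective against the final centers
    sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = sq_dist.argmin(axis=1)
    objective = sq_dist.min(axis=1).sum()
    return centers, labels, objective

# Toy data: two tight, well-separated blobs (made up for this example)
rng = np.random.default_rng(42)
X_demo = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(5.0, 0.1, (20, 2))])
centers, labels, objective = lloyd_kmeans(X_demo, k=2)
print(labels)
print(objective)  # the quantity being minimised; its negative plays the role of the score
```

The value returned as objective is what is left over after the loop stops: the summed squared distance of each sample to its own center.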

As you can see, some distance remains between the data samples and their associated cluster centers, and the objective of the minimization is exactly that quantity: the sum of squared distances over all samples. The score is the opposite (the negative) of it.
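A quick way to check this reading of the score is to compare it with the objective computed by hand. The sketch below uses random made-up data (X_demo) in place of your TF-IDF vectors; the three printed numbers should agree up to floating-point rounding.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))  # made-up data standing in for the TF-IDF vectors

model = KMeans(n_clusters=2, init="k-means++", max_iter=100, n_init=10, random_state=0)
model.fit(X_demo)

# Squared Euclidean distance from every sample to every fitted center
sq_dist = ((X_demo[:, None, :] - model.cluster_centers_[None, :, :]) ** 2).sum(axis=2)
# Each sample contributes the distance to its closest center
manual_objective = sq_dist.min(axis=1).sum()

score = model.score(X_demo)
print(score)              # negated objective on X_demo
print(-model.inertia_)    # negated objective stored at fit time
print(-manual_objective)  # negated objective computed by hand
```

Because it is a negated sum of squares, the score can be zero at best; values closer to zero mean the samples sit closer to their cluster centers.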

You naturally get large distances if there is a lot of variety in the data samples and the number of data samples is significantly higher than the number of clusters, which in your case is only two. Conversely, if all data samples were the same, you would always get a zero distance regardless of the number of clusters.
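The effect of variety on the score can be seen on synthetic data. In this sketch the helper kmeans_score and the three arrays are invented for illustration; the warning filter is there because fitting two clusters to identical rows makes scikit-learn warn that it found fewer distinct clusters than requested.

```python
import warnings
import numpy as np
from sklearn.cluster import KMeans

def kmeans_score(X, k=2):
    """Fit KMeans with k clusters and return its score on the same data."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")  # silence the warning raised for identical rows
        model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    return model.score(X)

rng = np.random.default_rng(0)
identical = np.ones((50, 3))                           # every sample is the same
similar = 1.0 + rng.normal(scale=0.01, size=(50, 3))   # small variety
varied = 1.0 + rng.normal(scale=10.0, size=(50, 3))    # large variety

score_identical = kmeans_score(identical)
score_similar = kmeans_score(similar)
score_varied = kmeans_score(varied)

print(score_identical)  # zero: every sample coincides with a center
print(score_similar)    # slightly below zero
print(score_varied)     # far below zero
```

Note that the raw score also grows in magnitude with the number of samples and with the scale of the features, so it is mainly useful for comparing fits on the same data.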

From the documentation I would expect all values to be negative (or zero), though, since the score is the negated objective. If you observe both negative and positive values, maybe there is more to the score than that.

I wonder how you got the idea of clustering into two clusters though.