Assuming a module foo with a method bar:
import foo

# Look up the attribute named 'bar' on the module, then call it
method_to_call = getattr(foo, 'bar')
result = method_to_call()
You could combine the last two lines into one:
result = getattr(foo, 'bar')()
if that makes more sense for your use case.
You can use getattr in this fashion on bound methods of class instances, module-level functions, class methods... the list goes on.
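For example, here is a quick sketch of the instance-method case (the Greeter class is made up for illustration):
class Greeter:
    def hello(self, name):
        return 'hello, %s' % name

g = Greeter()
bound = getattr(g, 'hello')          # bound method looked up by name
print(bound('world'))                # prints: hello, world
print(getattr(g, 'hello')('again'))  # same thing in one step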
What you want is called multi-label classification. Scikit-learn can do that. See here: http://scikit-learn.org/dev/modules/multiclass.html.
I'm not sure what's going wrong in your example; my version of sklearn apparently doesn't have WordNGramAnalyzer. Perhaps it's a question of using more training examples or trying a different classifier? Note, though, that OneVsRestClassifier expects the multi-label target as a binary indicator matrix; MultiLabelBinarizer converts lists of labels into that form.
The following works for me:
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

X_train = np.array(["new york is a hell of a town",
                    "new york was originally dutch",
                    "the big apple is great",
                    "new york is also called the big apple",
                    "nyc is nice",
                    "people abbreviate new york city as nyc",
                    "the capital of great britain is london",
                    "london is in the uk",
                    "london is in england",
                    "london is in great britain",
                    "it rains a lot in london",
                    "london hosts the british museum",
                    "new york is great and so is london",
                    "i like london better than new york"])
y_train_labels = [[0], [0], [0], [0], [0], [0],
                  [1], [1], [1], [1], [1], [1],
                  [0, 1], [0, 1]]
X_test = np.array(['nice day in nyc',
                   'welcome to london',
                   'hello welcome to new york. enjoy it here and london too'])
target_names = ['New York', 'London']

# Newer scikit-learn wants the multi-label target as a binary
# indicator matrix rather than a list of label lists.
mlb = MultiLabelBinarizer()
Y_train = mlb.fit_transform(y_train_labels)

classifier = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1, 2))),  # unigrams and bigrams
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])
classifier.fit(X_train, Y_train)

predicted = classifier.predict(X_test)
for item, labels in zip(X_test, mlb.inverse_transform(predicted)):
    print('%s => %s' % (item, ', '.join(target_names[x] for x in labels)))
For me, this produces the output:
nice day in nyc => New York
welcome to london => London
hello welcome to new york. enjoy it here and london too => New York, London
Hope this helps.
The cosine similarity is generally defined as x^T y / (||x|| * ||y||); it equals 1 when the two vectors point the same way and approaches -1 when they are completely opposite. This quantity is not technically a distance metric, so you can't use accelerating structures like ball trees or k-d trees with it. If you force scikit-learn to use the brute-force approach, you should be able to use it as a distance by passing your own custom distance metric object. There are methods of transforming the cosine similarity into a valid distance metric if you would like to use ball trees (you can find one in the JSAT library).
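As a minimal sketch, recent scikit-learn versions also accept metric='cosine' directly when algorithm='brute' is used (the toy data below is made up):
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
y = np.array([0, 0, 1, 1])

# Brute force is required: cosine distance violates the triangle
# inequality, so ball trees and k-d trees cannot index it.
knn = KNeighborsClassifier(n_neighbors=3, algorithm='brute', metric='cosine')
knn.fit(X, y)
print(knn.predict([[0.8, 0.2]]))  # -> [0]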
Notice, though, that x^T y / (||x|| * ||y||) = (x/||x||)^T (y/||y||). The euclidean distance can be equivalently written as sqrt(x^T x + y^T y - 2 x^T y). If we normalize every datapoint before giving it to the KNeighborsClassifier, then x^T x = 1 for all x, so the euclidean distance degrades to sqrt(2 - 2 x^T y). For completely identical inputs we would get sqrt(2 - 2*1) = 0, and for complete opposites sqrt(2 - 2*-1) = 2. Since sqrt(2 - 2 x^T y) is a monotonically decreasing function of x^T y, you get the same neighbor ordering as the cosine distance by normalizing your data and then using the euclidean distance. So long as you use the uniform weights option, the results will be identical to having used a correct cosine distance.
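A small sketch of that equivalence (random toy data; the variable names are mine):
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = rng.randint(0, 2, 100)
X_new = rng.randn(10, 5)

# L2-normalize every row so that x^T x = 1 for each sample,
# then use the default euclidean distance.
euclid = KNeighborsClassifier(n_neighbors=5, weights='uniform')
euclid.fit(normalize(X), y)

cosine = KNeighborsClassifier(n_neighbors=5, weights='uniform',
                              algorithm='brute', metric='cosine')
cosine.fit(X, y)

# The neighbor orderings agree, so the predictions should too.
print(np.array_equal(euclid.predict(normalize(X_new)), cosine.predict(X_new)))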