Python – Scikit-learn cross validation scoring for regression

python, regression, scikit-learn

How can one use cross_val_score for regression? The default scoring seems to be accuracy, which is not very meaningful for regression. Specifically, I would like to use mean squared error; is it possible to specify that in cross_val_score?

I tried the following two approaches, but neither works:

scores = cross_validation.cross_val_score(svr, diabetes.data, diabetes.target, cv=5, scoring='mean_squared_error') 

and

scores = cross_validation.cross_val_score(svr, diabetes.data, diabetes.target, cv=5, scoring=metrics.mean_squared_error)

The first one generates a list of negative numbers, even though mean squared error should always be non-negative. The second one complains that:

mean_squared_error() takes exactly 2 arguments (3 given)

Best Answer

I don't have the reputation to comment, but I want to provide this link for you and/or any passerby, where the negative output of MSE in scikit-learn is discussed: https://github.com/scikit-learn/scikit-learn/issues/2439
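The short version is that scikit-learn negates the error so that every scorer follows a "greater is better" convention, so the values themselves are fine; you just flip the sign. A minimal sketch, assuming the same scikit-learn era as the question (where the cross_validation module and the 'mean_squared_error' scorer name still exist):

from sklearn import cross_validation, svm, datasets
import numpy as np

# Re-running the call from the question; the per-fold values come back
# negated so that "greater is better" holds for every scorer
diabetes = datasets.load_diabetes()
svr = svm.SVR()
scores = cross_validation.cross_val_score(svr, diabetes.data, diabetes.target,
                                          cv=5, scoring='mean_squared_error')
print(np.mean(-scores))  # flip the sign to recover the usual positive MSE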

In addition (to make this a real answer): your first option is correct. Not only is MSE the metric you want to use to compare regression models, but R^2 cannot even be calculated with some types of cross-validation; with leave-one-out, I believe, each test fold holds only a single sample, so R^2 is undefined (see the sketch below).
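To see why R^2 breaks down with leave-one-out in particular, here is a tiny sketch of its denominator (the target value is just an illustrative number):

import numpy as np

# R^2 = 1 - SS_res / SS_tot, where SS_tot = sum((y_true - mean(y_true))**2).
# With leave-one-out CV each test fold holds a single observation, so
# SS_tot is zero and the ratio is undefined; MSE has no such problem.
y_true = np.array([150.0])                      # a single held-out target (illustrative)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
print(ss_tot)                                   # 0.0 -> the R^2 denominator vanishes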

If you choose MSE as the scorer, cross_val_score outputs an array of per-fold errors, which you can then take the mean of, like so:

# Doing linear regression with leave-one-out cross-validation

from sklearn import cross_validation, linear_model
import numpy as np

# Including this to remind you that it is necessary to use numpy arrays rather
# than plain lists, otherwise you will get an error (x and y here are your own
# feature and target data)
X_digits = np.array(x)
Y_digits = np.array(y)

loo = cross_validation.LeaveOneOut(len(Y_digits))

regr = linear_model.LinearRegression()

scores = cross_validation.cross_val_score(regr, X_digits, Y_digits, scoring='mean_squared_error', cv=loo)

# This will print the mean of the per-fold errors and provide your metric
# for evaluation (negate it if you want the usual positive MSE, since the
# scorer returns negated values)
print(scores.mean())
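For anyone on a newer scikit-learn release (0.18 or later): the cross_validation module has since been replaced by model_selection and the scorer was renamed to 'neg_mean_squared_error', so a roughly equivalent version looks like this (using the diabetes data from the question):

from sklearn.model_selection import cross_val_score, LeaveOneOut
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_diabetes

# Same idea with the newer API; the scorer still returns negated errors
X, y = load_diabetes(return_X_y=True)
scores = cross_val_score(LinearRegression(), X, y,
                         scoring='neg_mean_squared_error', cv=LeaveOneOut())
print(-scores.mean())   # flip the sign to report a positive mean squared error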