Python – Scikit-learn – feature reduction using RFECV and GridSearch. Where are the coefficients stored?

python scikit-learn

I am using scikit-learn's RFECV to select the most significant features for a logistic regression, using cross-validation. Assume X is an [n, x] dataframe of features, and y is the response variable:

import numpy as np
from sklearn.pipeline import make_pipeline
# In current scikit-learn, GridSearchCV and StratifiedKFold live in
# sklearn.model_selection (formerly sklearn.grid_search / sklearn.cross_validation)
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn import preprocessing
from sklearn.feature_selection import RFECV
import sklearn.linear_model as lm

# Create a logistic regression estimator
logreg = lm.LogisticRegression()

# Use RFECV to pick best features, using stratified k-fold
rfecv = RFECV(estimator=logreg, cv=StratifiedKFold(n_splits=3), scoring='roc_auc')

# Fit the features to the response variable
rfecv.fit(X, y)

# Put the best features into new df X_new
X_new = rfecv.transform(X)

# Build a pipeline that standardizes the features, then fits a logistic regression
# (the liblinear solver accepts both the l1 and l2 penalties searched below)
pipe = make_pipeline(preprocessing.StandardScaler(), lm.LogisticRegression(solver='liblinear'))

# Define a range of hyper-parameters for grid search
C_range = 10.**np.arange(-5, 1)
penalty_options = ['l1', 'l2']

skf = StratifiedKFold(n_splits=3)
param_grid = dict(logisticregression__C=C_range, logisticregression__penalty=penalty_options)

grid = GridSearchCV(pipe, param_grid, cv=skf, scoring='roc_auc')

grid.fit(X_new, y)

Two questions:

a) Is this the correct process for feature selection, hyper-parameter selection, and fitting?

b) Where can I find the fitted coefficients for the selected features?

Best Answer

Is this the correct process for feature selection?

This is one of many ways to select features. Recursive feature elimination is an automated approach; others are listed in the scikit-learn documentation. They have different pros and cons, and feature selection usually works best when you also apply common sense and try models with different feature sets. RFE is a quick way to select a good set of features, but it does not necessarily give you the best possible set.

By the way, you don't need to build your StratifiedKFold separately. If you set cv=3, both RFECV and GridSearchCV automatically use StratifiedKFold when the estimator is a classifier and the y values are binary or multiclass, which I assume is the case since you are using LogisticRegression. You can also combine

# Fit the features to the response variable
rfecv.fit(X, y)

# Put the best features into new df X_new
X_new = rfecv.transform(X)

into

X_new = rfecv.fit_transform(X, y)
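
As a rough illustration of both simplifications (an integer cv and a single fit_transform call), here is a minimal sketch. The data comes from make_classification and only stands in for the asker's X and y, and max_iter=1000 is an assumed setting to keep the solver from stopping early.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the asker's X and y (an assumption for illustration)
X, y = make_classification(n_samples=200, n_features=10, n_informative=3, random_state=0)

# With a classifier and a binary y, cv=3 means stratified 3-fold splitting
rfecv = RFECV(estimator=LogisticRegression(max_iter=1000), cv=3, scoring="roc_auc")

# Fit the selector and reduce X in a single call
X_new = rfecv.fit_transform(X, y)

print(X_new.shape)         # (n_samples, number of selected features)
print(rfecv.n_features_)   # how many features were kept
print(rfecv.support_)      # boolean mask over the original columns
```

The support_ mask is what lets you map the reduced columns back to the original feature names later.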

Is this the correct process for hyper-parameter selection?

Yes. GridSearchCV systematically tries every combination in a grid of model parameters and picks the best one according to a performance metric, so it is a good way to find well-suited parameters.
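
To see what "trying every combination" looks like in practice, the sketch below prints the mean cross-validated score of each combination from cv_results_. The data is synthetic and the grid is smaller than the one in the question; both are assumptions for illustration. The liblinear solver is chosen because it accepts both penalties.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption for illustration)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# 5 values of C x 2 penalties = 10 combinations
param_grid = {"C": 10.0 ** np.arange(-3, 2), "penalty": ["l1", "l2"]}

grid = GridSearchCV(
    LogisticRegression(solver="liblinear"),  # liblinear supports l1 and l2
    param_grid,
    cv=3,
    scoring="roc_auc",
)
grid.fit(X, y)

# One entry per parameter combination, with its mean score across the folds
results = grid.cv_results_
for params, score in zip(results["params"], results["mean_test_score"]):
    print(params, round(float(score), 3))

print("best:", grid.best_params_, grid.best_score_)
```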

Is this the correct process for fitting?

Yes, this is a valid way of fitting the model. When you call grid.fit(X_new, y), it clones the pipeline once for each parameter combination in the grid, then fits and scores each clone by cross-validation. It then refits the best combination on all of the data and stores it in grid.best_estimator_, with its parameters in grid.best_params_ and its mean cross-validated score in grid.best_score_. Note that grid.fit returns the GridSearchCV object itself, not the best estimator.

Remember that any new X values you want predictions for must first go through the transform of the fitted RFECV model. To avoid doing that by hand, you can add the RFECV step to the pipeline as well.
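
One possible way to put the RFECV step inside the pipeline is sketched below, so that new data is reduced to the selected features automatically at prediction time. The synthetic data, the train/test split, the short C grid, and placing the scaler before RFECV are all assumptions made for the example, not something the question prescribes.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, split so that X_test plays the role of "new" data
X, y = make_classification(n_samples=300, n_features=10, n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# make_pipeline names the steps "standardscaler", "rfecv", "logisticregression"
pipe = make_pipeline(
    StandardScaler(),
    RFECV(estimator=LogisticRegression(max_iter=1000), cv=3, scoring="roc_auc"),
    LogisticRegression(max_iter=1000),
)

param_grid = {"logisticregression__C": [0.01, 0.1, 1.0]}
grid = GridSearchCV(pipe, param_grid, cv=3, scoring="roc_auc")

# fit returns the GridSearchCV object itself
fitted = grid.fit(X_train, y_train)
print(fitted is grid)

# Scaling and feature selection are applied to X_test by the pipeline
proba = grid.predict_proba(X_test)[:, 1]
print(proba[:5])
```

With this layout the feature selection is also redone inside every cross-validation fold of the grid search, so the selection step never sees the held-out fold.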

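Where can I find the fitted coefficients for the selected features?

Because the grid search was run on a pipeline, grid.best_estimator_ is the fitted Pipeline rather than the LogisticRegression itself. The fitted model is the pipeline's last step, so the coefficients are in grid.best_estimator_.named_steps['logisticregression'].coef_ and the intercept is in grid.best_estimator_.named_steps['logisticregression'].intercept_. The coefficients follow the column order of X_new, and rfecv.support_ is the boolean mask that tells you which columns of X those are. Since the pipeline standardizes the features first, the coefficients are on the standardized scale. Note that grid.best_estimator_ is only available when the refit parameter of GridSearchCV is True, which is the default.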
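
A minimal sketch of reading the coefficients and matching them to feature names, following the same two-stage flow as the question. The synthetic data and the column names f0 to f5 are hypothetical stand-ins, and the C grid is shortened for the example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the asker's dataframe; column names are hypothetical
X_arr, y = make_classification(n_samples=200, n_features=6, n_informative=3, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

# Stage 1: feature selection
rfecv = RFECV(estimator=LogisticRegression(max_iter=1000), cv=3, scoring="roc_auc")
X_new = rfecv.fit_transform(X, y)
selected = X.columns[rfecv.support_]  # names of the columns that survived

# Stage 2: grid search over a scaler + logistic regression pipeline
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
grid = GridSearchCV(pipe, {"logisticregression__C": [0.1, 1.0, 10.0]}, cv=3, scoring="roc_auc")
grid.fit(X_new, y)

# best_estimator_ is the refitted Pipeline; the model is its last step
best_logreg = grid.best_estimator_.named_steps["logisticregression"]

# coef_ has shape (1, n_selected) for a binary target, so flatten it
coefs = pd.Series(best_logreg.coef_.ravel(), index=selected)
print(coefs)
print(best_logreg.intercept_)
```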
