Python – Scikit-learn – feature reduction using RFECV and GridSearch. Where are the coefficients stored

pythonscikit-learn

I am using Scikit-learn RFECV to select most significant features for a logistic regression using a Cross Validation. Assume X is a [n,x] dataframe of features, and y represents the response variable:

from sklearn.pipeline import make_pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import StratifiedKFold
from sklearn import preprocessing
from sklearn.feature_selection import RFECV
import sklearn
import sklearn.linear_model as lm
import sklearn.grid_search as gs

#  Create a logistic regression estimator 
logreg = lm.LogisticRegression()

# Use RFECV to pick best features, using Stratified Kfold
rfecv =   RFECV(estimator=logreg, cv=StratifiedKFold(y, 3), scoring='roc_auc')

# Fit the features to the response variable
rfecv.fit(X, y)

# Put the best features into new df X_new
X_new = rfecv.transform(X)

# 
pipe = make_pipeline(preprocessing.StandardScaler(), lm.LogisticRegression())

# Define a range of hyper parameters for grid search
C_range = 10.**np.arange(-5, 1)
penalty_options = ['l1', 'l2']

skf = StratifiedKFold(y, 3)
param_grid = dict(logisticregression__C=C_range,  logisticregression__penalty=penalty_options)

grid = GridSearchCV(pipe, param_grid, cv=skf, scoring='roc_auc')

grid.fit(X_new, y)

Two questions:

a) Is this the correct process for feature, hyper-parameter selection and fitting?

b) Where can I find the fitted coefficients for the selected features?

Best Answer

Is this the correct process for feature selection? This is ONE of the many ways of feature selection. Recursive feature elimination is an automated approach to this, others are listed in scikit.learn documentation. They have different pros and cons, and usually feature selection is best achieved by also involving common sense and trying models with different features. RFE is a quick way of selecting a good set of features, but does not necessarily give you the ultimately best. By the way, you don't need to build your StratifiedKFold separately. If you just set the cv parameter to cv=3, both RFECV and GridSearchCV will automatically use StratifiedKFold if the y values are binary or multiclass, which I'm assuming is most likely the case since you are using LogisticRegression. You can also combine

# Fit the features to the response variable
rfecv.fit(X, y)

# Put the best features into new df X_new
X_new = rfecv.transform(X)

into

X_new = rfecv.fit_transform(X, y)

Is this the correct process for hyper-parameter selection? GridSearchCV is basically an automated way of systematically trying a whole set of combinations of model parameters and picking the best among these according to some performance metric. It's a good way of finding well-suited parameters, yes.

Is this the correct process for fitting? Yes, this is a valid way of fitting the model. When you call grid.fit(X_new, y), it makes a grid of LogisticRegression estimators (each with a set of parameters that are tried) and fits each of them. It will keep the one with the best performance under grid.best_estimator_, the parameters of this estimator in grid.best_params_ and the performance score for this estimator under grid.best_score_. It will return itself, and not the best estimator. Remember that with incoming new X values that you will use the model to predict on, you have to apply the transform with the fitted RFECV model. So, you can actually add this step to the pipeline as well.

Where can I find the fitted coefficients for the selected features? The grid.best_estimator_ attribute is a LogisticRegression object with all this information, so grid.best_estimator_.coef_ has all the coefficients (and grid.best_estimator_.intercept_ is the intercept). Note that to be able to get this grid.best_estimator_, the refit parameter on GridSearchCV needs to be set to True, but this is the default anyway.

Related Solutions

Python – What are the differences between type() and isinstance()

To summarize the contents of other (already good!) answers, isinstance caters for inheritance (an instance of a derived class is an instance of a base class, too), while checking for equality of type does not (it demands identity of types and rejects instances of subtypes, AKA subclasses).

Normally, in Python, you want your code to support inheritance, of course (since inheritance is so handy, it would be bad to stop code using yours from using it!), so isinstance is less bad than checking identity of types because it seamlessly supports inheritance.

It's not that isinstance is good, mind you—it's just less bad than checking equality of types. The normal, Pythonic, preferred solution is almost invariably "duck typing": try using the argument as if it was of a certain desired type, do it in a try/except statement catching all exceptions that could arise if the argument was not in fact of that type (or any other type nicely duck-mimicking it;-), and in the except clause, try something else (using the argument "as if" it was of some other type).

basestring is, however, quite a special case—a builtin type that exists only to let you use isinstance (both str and unicode subclass basestring). Strings are sequences (you could loop over them, index them, slice them, ...), but you generally want to treat them as "scalar" types—it's somewhat incovenient (but a reasonably frequent use case) to treat all kinds of strings (and maybe other scalar types, i.e., ones you can't loop on) one way, all containers (lists, sets, dicts, ...) in another way, and basestring plus isinstance helps you do that—the overall structure of this idiom is something like:

if isinstance(x, basestring)
  return treatasscalar(x)
try:
  return treatasiter(iter(x))
except TypeError:
  return treatasscalar(x)

You could say that basestring is an Abstract Base Class ("ABC")—it offers no concrete functionality to subclasses, but rather exists as a "marker", mainly for use with isinstance. The concept is obviously a growing one in Python, since PEP 3119, which introduces a generalization of it, was accepted and has been implemented starting with Python 2.6 and 3.0.

The PEP makes it clear that, while ABCs can often substitute for duck typing, there is generally no big pressure to do that (see here). ABCs as implemented in recent Python versions do however offer extra goodies: isinstance (and issubclass) can now mean more than just "[an instance of] a derived class" (in particular, any class can be "registered" with an ABC so that it will show as a subclass, and its instances as instances of the ABC); and ABCs can also offer extra convenience to actual subclasses in a very natural way via Template Method design pattern applications (see here and here [[part II]] for more on the TM DP, in general and specifically in Python, independent of ABCs).

For the underlying mechanics of ABC support as offered in Python 2.6, see here; for their 3.1 version, very similar, see here. In both versions, standard library module collections (that's the 3.1 version—for the very similar 2.6 version, see here) offers several useful ABCs.

For the purpose of this answer, the key thing to retain about ABCs (beyond an arguably more natural placement for TM DP functionality, compared to the classic Python alternative of mixin classes such as UserDict.DictMixin) is that they make isinstance (and issubclass) much more attractive and pervasive (in Python 2.6 and going forward) than they used to be (in 2.5 and before), and therefore, by contrast, make checking type equality an even worse practice in recent Python versions than it already used to be.

Python – What are the differences between the urllib, urllib2, urllib3 and requests module

I know it's been said already, but I'd highly recommend the requests Python package.

If you've used languages other than python, you're probably thinking urllib and urllib2 are easy to use, not much code, and highly capable, that's how I used to think. But the requests package is so unbelievably useful and short that everyone should be using it.

First, it supports a fully restful API, and is as easy as:

import requests

resp = requests.get('http://www.mywebsite.com/user')
resp = requests.post('http://www.mywebsite.com/user')
resp = requests.put('http://www.mywebsite.com/user/put')
resp = requests.delete('http://www.mywebsite.com/user/delete')

Regardless of whether GET / POST, you never have to encode parameters again, it simply takes a dictionary as an argument and is good to go:

userdata = {"firstname": "John", "lastname": "Doe", "password": "jdoe123"}
resp = requests.post('http://www.mywebsite.com/user', data=userdata)

Plus it even has a built in JSON decoder (again, I know json.loads() isn't a lot more to write, but this sure is convenient):

resp.json()

Or if your response data is just text, use:

resp.text

This is just the tip of the iceberg. This is the list of features from the requests site:

International Domains and URLs
Keep-Alive & Connection Pooling
Sessions with Cookie Persistence
Browser-style SSL Verification
Basic/Digest Authentication
Elegant Key/Value Cookies
Automatic Decompression
Unicode Response Bodies
Multipart File Uploads
Connection Timeouts
.netrc support
List item
Python 2.7, 3.6—3.9
Thread-safe.

Best Answer

Related Solutions

Python – What are the differences between type() and isinstance()

Python – What are the differences between the urllib, urllib2, urllib3 and requests module

Related Topic