Python – Scikit-learn feature selection for regression data

pythonscikit-learn

I am trying to apply a univariate feature selection method using the Python module scikit-learn to a regression (i.e. continuous valued response values) dataset in svmlight format.

I am working with scikit-learn version 0.11.

I have tried two approaches – the first of which failed and the second of which worked for my toy dataset but I believe would give meaningless results for a real dataset.

I would like advice regarding an appropriate univariate feature selection approach I could apply to select the top N features for a regression dataset. I would either like (a) to work out how to make the f_regression function work or (b) to hear alternative suggestions.

The two approaches mentioned above:

I tried using sklearn.feature_selection.f_regression(X,Y).

This failed with the following error message:
"TypeError: copy() takes exactly 1 argument (2 given)"

I tried using chi2(X,Y). This "worked" but I suspect this is because the two response values 0.1 and 1.8 in my toy dataset were being treated as class labels? Presumably, this would not yield a meaningful chi-squared statistic for a real dataset for which there would be a large number of possible response values and the number in each cell [with a particular response value and value for the attribute being tested] would be low?

Please find my toy dataset pasted into the end of this message.

The following code snippet should give the results I describe above.

from sklearn.datasets import load_svmlight_file

X_train_data, Y_train_data = load_svmlight_file(svmlight_format_train_file) #i.e. change this to the name of my toy dataset file

from sklearn.feature_selection import SelectKBest
featureSelector = SelectKBest(score_func="one of the two functions I refer to above",k=2) #sorry, I hope this message is clear
featureSelector.fit(X_train_data,Y_train_data)
print [1+zero_based_index for zero_based_index in list(featureSelector.get_support(indices=True))] #This should print the indices of the top 2 features

Thanks in advance.

Richard

Contents of my contrived svmlight file – with additional blank lines inserted for clarity:

1.8 1:1.000000 2:1.000000 4:1.000000 6:1.000000#mA

1.8 1:1.000000 2:1.000000#mB

0.1 5:1.000000#mC

1.8 1:1.000000 2:1.000000#mD

0.1 3:1.000000 4:1.000000#mE

0.1 3:1.000000#mF

1.8 2:1.000000 4:1.000000 5:1.000000 6:1.000000#mG

1.8 2:1.000000#mH

Best Answer

As larsmans noted, chi2 cannot be used for feature selection with regression data.

Upon updating to scikit-learn version 0.13, the following code selected the top two features (according to the f_regression test) for the toy dataset described above.

def f_regression(X,Y):
   import sklearn
   return sklearn.feature_selection.f_regression(X,Y,center=False) #center=True (the default) would not work ("ValueError: center=True only allowed for dense data") but should presumably work in general

from sklearn.datasets import load_svmlight_file

X_train_data, Y_train_data = load_svmlight_file(svmlight_format_train_file) #i.e. change this to  the name of my toy dataset file

from sklearn.feature_selection import SelectKBest
featureSelector = SelectKBest(score_func=f_regression,k=2)
featureSelector.fit(X_train_data,Y_train_data)
print [1+zero_based_index for zero_based_index in list(featureSelector.get_support(indices=True))]

Solution

Ensure you have pandas at least 0.10.1 installed.

Read iterating files chunk-by-chunk and multiple table queries.

Since pytables is optimized to operate on row-wise (which is what you query on), we will create a table for each group of fields. This way it's easy to select a small group of fields (which will work with a big table, but it's more efficient to do it this way... I think I may be able to fix this limitation in the future... this is more intuitive anyhow):
(The following is pseudocode.)

import numpy as np
import pandas as pd

# create a store
store = pd.HDFStore('mystore.h5')

# this is the key to your storage:
#    this maps your fields to a specific group, and defines 
#    what you want to have as data_columns.
#    you might want to create a nice class wrapping this
#    (as you will want to have this map and its inversion)  
group_map = dict(
    A = dict(fields = ['field_1','field_2',.....], dc = ['field_1',....,'field_5']),
    B = dict(fields = ['field_10',......        ], dc = ['field_10']),
    .....
    REPORTING_ONLY = dict(fields = ['field_1000','field_1001',...], dc = []),

)

group_map_inverted = dict()
for g, v in group_map.items():
    group_map_inverted.update(dict([ (f,g) for f in v['fields'] ]))

Reading in the files and creating the storage (essentially doing what append_to_multiple does):

for f in files:
   # read in the file, additional options may be necessary here
   # the chunksize is not strictly necessary, you may be able to slurp each 
   # file into memory in which case just eliminate this part of the loop 
   # (you can also change chunksize if necessary)
   for chunk in pd.read_table(f, chunksize=50000):
       # we are going to append to each table by group
       # we are not going to create indexes at this time
       # but we *ARE* going to create (some) data_columns

       # figure out the field groupings
       for g, v in group_map.items():
             # create the frame for this group
             frame = chunk.reindex(columns = v['fields'], copy = False)    

             # append it
             store.append(g, frame, index=False, data_columns = v['dc'])

Now you have all of the tables in the file (actually you could store them in separate files if you wish, you would prob have to add the filename to the group_map, but probably this isn't necessary).

This is how you get columns and create new ones:

frame = store.select(group_that_I_want)
# you can optionally specify:
# columns = a list of the columns IN THAT GROUP (if you wanted to
#     select only say 3 out of the 20 columns in this sub-table)
# and a where clause if you want a subset of the rows

# do calculations on this frame
new_frame = cool_function_on_frame(frame)

# to 'add columns', create a new group (you probably want to
# limit the columns in this new_group to be only NEW ones
# (e.g. so you don't overlap from the other tables)
# add this info to the group_map
store.append(new_group, new_frame.reindex(columns = new_columns_created, copy = False), data_columns = new_columns_created)

When you are ready for post_processing:

# This may be a bit tricky; and depends what you are actually doing.
# I may need to modify this function to be a bit more general:
report_data = store.select_as_multiple([groups_1,groups_2,.....], where =['field_1>0', 'field_1000=foo'], selector = group_1)

About data_columns, you don't actually need to define ANY data_columns; they allow you to sub-select rows based on the column. E.g. something like:

store.select(group, where = ['field_1000=foo', 'field_1001>0'])

They may be most interesting to you in the final report generation stage (essentially a data column is segregated from other columns, which might impact efficiency somewhat if you define a lot).

You also might want to:

create a function which takes a list of fields, looks up the groups in the groups_map, then selects these and concatenates the results so you get the resulting frame (this is essentially what select_as_multiple does). This way the structure would be pretty transparent to you.
indexes on certain data columns (makes row-subsetting much faster).
enable compression.

Let me know when you have questions!

Python – Scikit-learn – feature reduction using RFECV and GridSearch. Where are the coefficients stored

Is this the correct process for feature selection? This is ONE of the many ways of feature selection. Recursive feature elimination is an automated approach to this, others are listed in scikit.learn documentation. They have different pros and cons, and usually feature selection is best achieved by also involving common sense and trying models with different features. RFE is a quick way of selecting a good set of features, but does not necessarily give you the ultimately best. By the way, you don't need to build your StratifiedKFold separately. If you just set the cv parameter to cv=3, both RFECV and GridSearchCV will automatically use StratifiedKFold if the y values are binary or multiclass, which I'm assuming is most likely the case since you are using LogisticRegression. You can also combine

# Fit the features to the response variable
rfecv.fit(X, y)

# Put the best features into new df X_new
X_new = rfecv.transform(X)

into

X_new = rfecv.fit_transform(X, y)

Is this the correct process for hyper-parameter selection? GridSearchCV is basically an automated way of systematically trying a whole set of combinations of model parameters and picking the best among these according to some performance metric. It's a good way of finding well-suited parameters, yes.

Is this the correct process for fitting? Yes, this is a valid way of fitting the model. When you call grid.fit(X_new, y), it makes a grid of LogisticRegression estimators (each with a set of parameters that are tried) and fits each of them. It will keep the one with the best performance under grid.best_estimator_, the parameters of this estimator in grid.best_params_ and the performance score for this estimator under grid.best_score_. It will return itself, and not the best estimator. Remember that with incoming new X values that you will use the model to predict on, you have to apply the transform with the fitted RFECV model. So, you can actually add this step to the pipeline as well.

Where can I find the fitted coefficients for the selected features? The grid.best_estimator_ attribute is a LogisticRegression object with all this information, so grid.best_estimator_.coef_ has all the coefficients (and grid.best_estimator_.intercept_ is the intercept). Note that to be able to get this grid.best_estimator_, the refit parameter on GridSearchCV needs to be set to True, but this is the default anyway.

Best Answer

Related Solutions

Python – “Large data” workflows using pandas

Solution

Python – Scikit-learn – feature reduction using RFECV and GridSearch. Where are the coefficients stored

Related Topic