Python – Feature selection using scikit-learn

chi-squared, feature-selection, machine-learning, python, scikit-learn

I'm new to machine learning. I'm preparing my data for classification with a scikit-learn SVM. In order to select the best features I have used the following method:

from sklearn.feature_selection import SelectKBest, chi2
SelectKBest(chi2, k=10).fit_transform(A1, A2)

Since my dataset consists of negative values, I get the following error:

ValueError                                Traceback (most recent call last)

/media/5804B87404B856AA/TFM_UC3M/test2_v.py in <module>()
----> 1 
      2 
      3 
      4 
      5 

/usr/local/lib/python2.6/dist-packages/sklearn/base.pyc in fit_transform(self, X, y, **fit_params)
    427         else:
    428             # fit method of arity 2 (supervised transformation)

--> 429             return self.fit(X, y, **fit_params).transform(X)
    430 
    431 

/usr/local/lib/python2.6/dist-packages/sklearn/feature_selection/univariate_selection.pyc in fit(self, X, y)
    300         self._check_params(X, y)
    301 
--> 302         self.scores_, self.pvalues_ = self.score_func(X, y)
    303         self.scores_ = np.asarray(self.scores_)
    304         self.pvalues_ = np.asarray(self.pvalues_)

/usr/local/lib/python2.6/dist-packages/sklearn/feature_selection/univariate_selection.pyc in chi2(X, y)
    190     X = atleast2d_or_csr(X)
    191     if np.any((X.data if issparse(X) else X) < 0):
--> 192         raise ValueError("Input X must be non-negative.")
    193 
    194     Y = LabelBinarizer().fit_transform(y)

ValueError: Input X must be non-negative.

Can someone tell me how I can transform my data?

Best Answer

The error message Input X must be non-negative says it all: Pearson's chi-squared test (goodness of fit) does not apply to negative values. This is logical, because the chi-squared test assumes a frequency distribution, and a frequency cannot be negative. Consequently, sklearn.feature_selection.chi2 asserts that the input is non-negative.
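For illustration, here is a tiny self-contained reproduction of that check (the array values below are made up, not taken from your data):

import numpy as np
from sklearn.feature_selection import chi2

X = np.array([[1.0, -0.5],   # the negative entry is what trips the check
              [2.0,  0.3]])
y = np.array([0, 1])

chi2(X, y)  # raises ValueError: Input X must be non-negative.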

You say that your features are the "min, max, mean, median and FFT of accelerometer signal". In many cases it may be quite safe to simply shift each feature so that it becomes entirely positive, or even to scale it to the [0, 1] interval, as suggested by EdChum (see the sketch below).
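One possible way to do that rescaling, assuming A1 holds your feature matrix and A2 your labels as in the question (the synthetic arrays below are only placeholders):

import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, chi2

A1 = np.random.randn(100, 20)       # stand-in feature matrix, may contain negatives
A2 = np.random.randint(0, 2, 100)   # stand-in class labels

A1_scaled = MinMaxScaler().fit_transform(A1)   # maps each column onto [0, 1]
A1_new = SelectKBest(chi2, k=10).fit_transform(A1_scaled, A2)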

If data transformation is for some reason not possible (e.g. a negative value is an important factor), you should pick another statistic to score your features, such as sklearn.feature_selection.f_classif (ANOVA F-value) or sklearn.feature_selection.mutual_info_classif (mutual information); a sketch follows below.

Since the whole point of this procedure is to prepare the features for another method, it's not a big deal which one you pick; the end result is usually the same or very close.
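As a sketch of that alternative, here is the same selection step using f_classif (ANOVA F-value), which accepts negative inputs; again A1 and A2 are placeholder arrays standing in for your data:

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

A1 = np.random.randn(100, 20)       # features, negatives allowed
A2 = np.random.randint(0, 2, 100)   # class labels

A1_new = SelectKBest(f_classif, k=10).fit_transform(A1, A2)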
