Perform Chi-2 feature selection on TF and TF*IDF vectors

feature-selection, machine-learning, scikit-learn

I'm experimenting with Chi-2 feature selection for some text classification tasks.
I understand that the Chi-2 test checks the dependency between two categorical variables, so if we perform Chi-2 feature selection for a binary text classification problem with a binary BOW vector representation, each Chi-2 test on each (feature, class) pair would be a very straightforward Chi-2 test with 1 degree of freedom.

Quoting from the documentation (http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html#sklearn.feature_selection.chi2):

This score can be used to select the n_features features with the highest values for the χ² (chi-square) statistic from X, which must contain booleans or frequencies (e.g., term counts in document classification), relative to the classes.

It seems to me that we can also perform Chi-2 feature selection on a term-count (TF) vector representation. My first question is: how does sklearn discretize the integer-valued features into categorical ones?

My second question is similar to the first. From the demo code here: http://scikit-learn.sourceforge.net/dev/auto_examples/document_classification_20newsgroups.html

It seems to me that we can also perform Chi-2 feature selection on a TF*IDF vector representation. How does sklearn perform Chi-2 feature selection on real-valued features?

Thank you in advance for your kind advice!

Best Answer

The χ² feature selection code builds a contingency table from its inputs X (feature values) and y (class labels). Each entry i, j corresponds to some feature i and some class j, and holds the sum of the i'th feature's values across all samples belonging to class j. It then computes the χ² test statistic for each feature against the expected frequencies under independence, i.e. each feature's total distributed across the classes in proportion to their empirical frequencies in y.
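As a rough illustration, here is a minimal NumPy/SciPy sketch of that computation (not the library's actual implementation; the function name `chi2_sketch` and the use of `scipy.stats.chisquare` are my own choices, and X is assumed to be a dense array of non-negative values):

```python
import numpy as np
from scipy.stats import chisquare
from sklearn.preprocessing import LabelBinarizer

def chi2_sketch(X, y):
    # X: (n_samples, n_features) non-negative feature values, y: class labels
    X = np.asarray(X, dtype=float)
    Y = LabelBinarizer().fit_transform(y)          # (n_samples, n_classes) indicator matrix
    if Y.shape[1] == 1:                            # binary case: add the complement column
        Y = np.hstack([1 - Y, Y])

    observed = Y.T @ X                             # (n_classes, n_features): per-class feature sums
    class_prob = Y.mean(axis=0).reshape(-1, 1)     # empirical class frequencies
    feature_count = X.sum(axis=0).reshape(1, -1)   # total value of each feature
    expected = class_prob @ feature_count          # expected sums under independence

    # chi2 statistic and p-value per feature (one test per column)
    return chisquare(observed, f_exp=expected)
```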

This works when the feature values are frequencies (of terms, for example) because the sum will be the total frequency of a feature (term) in that class. There's no discretization going on.

It also works quite well in practice when the values are tf-idf values, since those are just weighted/scaled frequencies.
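For completeness, here is a small usage sketch that selects the top-k features by χ² from tf-idf vectors; the toy corpus, labels, and k=3 are illustrative placeholders:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["the cat sat on the mat", "dogs chase cats",
        "stock prices fell today", "markets rallied on earnings"]
labels = ["pets", "pets", "finance", "finance"]

X = TfidfVectorizer().fit_transform(docs)        # real-valued, non-negative tf-idf features
selector = SelectKBest(chi2, k=3).fit(X, labels)
X_selected = selector.transform(X)               # keep only the 3 highest-scoring features
print(selector.get_support(indices=True))        # indices of the selected features
```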
