Perform Chi-2 feature selection on TF and TF*IDF vectors

feature-selection, machine-learning, scikit-learn

I'm experimenting with Chi-2 feature selection for some text classification tasks.
I understand that the Chi-2 test checks the dependency between two categorical variables, so if we perform Chi-2 feature selection for a binary text classification problem with a binary BOW vector representation, each Chi-2 test on each (feature, class) pair would be a very straightforward Chi-2 test with 1 degree of freedom.

Quoting from the documentation (http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html#sklearn.feature_selection.chi2):

This score can be used to select the n_features features with the highest values for the χ² (chi-square) statistic from X, which must contain booleans or frequencies (e.g., term counts in document classification), relative to the classes.

It seems to me that we can also perform Chi-2 feature selection on a term-count (TF) vector representation. My first question is: how does sklearn discretize the integer-valued features into categorical ones?

My second question is similar to the first. From the demo code here: http://scikit-learn.sourceforge.net/dev/auto_examples/document_classification_20newsgroups.html

It seems to me that we can also perform Chi-2 feature selection on a TF*IDF vector representation. How does sklearn perform Chi-2 feature selection on real-valued features?

Thank you in advance for your kind advice!

Best Answer

The χ² feature selection code builds a contingency table from its inputs X (feature values) and y (class labels). Each entry i, j corresponds to some feature i and some class j, and holds the sum of the i'th feature's values across all samples belonging to class j. It then computes the χ² test statistic for each feature against the expected frequencies under independence, i.e. each feature's total distributed across the classes in proportion to their empirical frequencies in y.
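As a rough illustration, here is a minimal NumPy/SciPy sketch of that computation (not the library's actual implementation; the function name `chi2_sketch` and the use of `scipy.stats.chisquare` are my own choices, and X is assumed to be a dense array of non-negative values):

```python
import numpy as np
from scipy.stats import chisquare
from sklearn.preprocessing import LabelBinarizer

def chi2_sketch(X, y):
    # X: (n_samples, n_features) non-negative feature values, y: class labels
    X = np.asarray(X, dtype=float)
    Y = LabelBinarizer().fit_transform(y)          # (n_samples, n_classes) indicator matrix
    if Y.shape[1] == 1:                            # binary case: add the complement column
        Y = np.hstack([1 - Y, Y])

    observed = Y.T @ X                             # (n_classes, n_features): per-class feature sums
    class_prob = Y.mean(axis=0).reshape(-1, 1)     # empirical class frequencies
    feature_count = X.sum(axis=0).reshape(1, -1)   # total value of each feature
    expected = class_prob @ feature_count          # expected sums under independence

    # chi2 statistic and p-value per feature (one test per column)
    return chisquare(observed, f_exp=expected)
```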

This works when the feature values are frequencies (of terms, for example) because the sum will be the total frequency of a feature (term) in that class. There's no discretization going on.

It also works quite well in practice when the values are tf-idf values, since those are just weighted/scaled frequencies.
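For completeness, here is a small usage sketch that selects the top-k features by χ² from tf-idf vectors; the toy corpus, labels, and k=3 are illustrative placeholders:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["the cat sat on the mat", "dogs chase cats",
        "stock prices fell today", "markets rallied on earnings"]
labels = ["pets", "pets", "finance", "finance"]

X = TfidfVectorizer().fit_transform(docs)        # real-valued, non-negative tf-idf features
selector = SelectKBest(chi2, k=3).fit(X, labels)
X_selected = selector.transform(X)               # keep only the 3 highest-scoring features
print(selector.get_support(indices=True))        # indices of the selected features
```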
