Python – 10*10 fold cross validation in scikit-learn

machine learningpythonscikit-learnscikits

class sklearn.cross_validation.ShuffleSplit(
    n, 
    n_iterations=10, 
    test_fraction=0.10000000000000001, 
    indices=True, 
    random_state=None
)

the right way for 10*10fold CV in scikit-learn? (By changing the random_state to 10 different numbers)

Because I didn't find any random_state parameter in Stratified K-Fold or K-Fold and the separate from K-Fold are always identical for the same data.

If ShuffleSplit is the right, one concern is that it is mentioned

Note: contrary to other cross-validation strategies, random splits do not
guarantee that all folds will be different, although this is still
very likely for sizeable datasets

Is this always the case for 10*10 fold CV?

Best Answer

I am not sure what you mean by 10*10 cross validation. The ShuffleSplit configuration you give will make you call the fit method of the estimator 10 times. If you call this 10 times by explicitly using an outer loop or directly call it 100 times with 10% of the data reserved for testing in a single loop if you use instead:

>>> ss = ShuffleSplit(X.shape[0], n_iterations=100, test_fraction=0.1,
...     random_state=42)

If you want to do 10 runs of StratifiedKFold with k=10 you can shuffle the dataset between the runs (that would lead to a total 100 calls to the fit method with a 90% train / 10% test split for each call to fit):

>>> from sklearn.utils import shuffle
>>> from sklearn.cross_validation import StratifiedKFold, cross_val_score
>>> for i in range(10):
...    X, y = shuffle(X_orig, y_orig, random_state=i)
...    skf = StratifiedKFold(y, 10)
...    print cross_val_score(clf, X, y, cv=skf)

Related Solutions

Python – Randomized stratified k-fold cross-validation in scikit-learn

The shuffling flag for cross_validation.StratifiedKFold has been introduced in the current version 0.15:

http://scikit-learn.org/0.15/modules/generated/sklearn.cross_validation.StratifiedKFold.html

This can be found in the Changelog:

http://scikit-learn.org/stable/whats_new.html#new-features

Shuffle option for cross_validation.StratifiedKFold. By Jeffrey Blackburne.

Python – Plotting Precision-Recall curve when using cross-validation in scikit-learn

Instead of recording the precision and recall values after each fold, store the predictions on the test samples after each fold. Next, collect all the test (i.e. out-of-bag) predictions and compute precision and recall.

 ## let test_samples[k] = test samples for the kth fold (list of list)
 ## let train_samples[k] = test samples for the kth fold (list of list)

 for k in range(0, k):
      model = train(parameters, train_samples[k])
      predictions_fold[k] = predict(model, test_samples[k])

 # collect predictions
 predictions_combined = [p for preds in predictions_fold for p in preds]

 ## let predictions = rearranged predictions s.t. they are in the original order

 ## use predictions and labels to compute lists of TP, FP, FN
 ## use TP, FP, FN to compute precisions and recalls for one run of k-fold cross-validation

Under a single, complete run of k-fold cross-validation, the predictor makes one and only one prediction for each sample. Given n samples, you should have n test predictions.

(Note: These predictions are different from training predictions, because the predictor makes the prediction for each sample without having been previously seen it.)

Unless you are using leave-one-out cross-validation, k-fold cross validation generally requires a random partitioning of the data. Ideally, you would do repeated (and stratified) k-fold cross validation. Combining precision-recall curves from different rounds, however, is not straight forward, since you cannot use simple linear interpolation between precision-recall points, unlike ROC (See Davis and Goadrich 2006).

I personally calculated AUC-PR using the Davis-Goadrich method for interpolation in PR space (followed by numerical integration) and compared the classifiers using the AUC-PR estimates from repeated stratified 10-fold cross validation.

For a nice plot, I showed a representative PR curve from one of the cross-validation rounds.

There are, of course, many other ways of assessing classifier performance, depending on the nature of your dataset.

For instance, if the proportion of (binary) labels in your dataset is not skewed (i.e. it is roughly 50-50), you could use the simpler ROC analysis with cross-validation:

Collect predictions from each fold and construct ROC curves (as before), collect all the TPR-FPR points (i.e. take the union of all TPR-FPR tuples), then plot the combined set of points with possible smoothing. Optionally, compute AUC-ROC using simple linear interpolation and the composite trapezoid method for numerical integration.

Best Answer

Related Solutions

Python – Randomized stratified k-fold cross-validation in scikit-learn

Python – Plotting Precision-Recall curve when using cross-validation in scikit-learn

Related Topic