Python – 10*10 fold cross validation in scikit-learn

machine learningpythonscikit-learnscikits

Is

class sklearn.cross_validation.ShuffleSplit(
    n, 
    n_iterations=10, 
    test_fraction=0.10000000000000001, 
    indices=True, 
    random_state=None
)

the right way for 10*10fold CV in scikit-learn? (By changing the random_state to 10 different numbers)

Because I didn't find any random_state parameter in Stratified K-Fold or K-Fold and the separate from K-Fold are always identical for the same data.

If ShuffleSplit is the right, one concern is that it is mentioned

Note: contrary to other cross-validation strategies, random splits do not
guarantee that all folds will be different, although this is still
very likely for sizeable datasets

Is this always the case for 10*10 fold CV?

Best Answer

I am not sure what you mean by 10*10 cross validation. The ShuffleSplit configuration you give will make you call the fit method of the estimator 10 times. If you call this 10 times by explicitly using an outer loop or directly call it 100 times with 10% of the data reserved for testing in a single loop if you use instead:

>>> ss = ShuffleSplit(X.shape[0], n_iterations=100, test_fraction=0.1,
...     random_state=42)

If you want to do 10 runs of StratifiedKFold with k=10 you can shuffle the dataset between the runs (that would lead to a total 100 calls to the fit method with a 90% train / 10% test split for each call to fit):

>>> from sklearn.utils import shuffle
>>> from sklearn.cross_validation import StratifiedKFold, cross_val_score
>>> for i in range(10):
...    X, y = shuffle(X_orig, y_orig, random_state=i)
...    skf = StratifiedKFold(y, 10)
...    print cross_val_score(clf, X, y, cv=skf)
Related Topic