I am a novice in statistical methods, so please excuse any naivety. I am having trouble understanding the behaviour of cross-validation when using decision-tree regression from sklearn (e.g. DecisionTreeRegressor and RandomForestRegressor). My dataset varies from having multiple predictors (y = single dependent variable; X = multiple independent variables) to having a single predictor, and it contains enough cases (> 10k). The following description applies to all cases.
When fitting and scoring the regressors with the standard methods:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

dt = DecisionTreeRegressor()
rf = RandomForestRegressor()
dt.fit(X, y)
rf.fit(X, y)
dt_score = dt.score(X, y)  # R squared on the same data used for fitting
rf_score = rf.score(X, y)
The dt_score and rf_score return promising R-squared values (> 0.7); however, I am aware of the over-fitting tendencies of decision trees and, to a lesser extent, random forests. Therefore I tried to score the regressors with 10-fold cross-validation to get a truer estimate of the accuracy:
from sklearn.model_selection import cross_val_score

dt = DecisionTreeRegressor()
rf = RandomForestRegressor()
dt_scores = cross_val_score(dt, X, y, cv=10)  # fits a fresh clone on each training split, so no prior fit() is needed
rf_scores = cross_val_score(rf, X, y, cv=10)
dt_score = round(sum(dt_scores) / len(dt_scores), 3)
rf_score = round(sum(rf_scores) / len(rf_scores), 3)
The results of this cross-validation are consistently negative. I assume they are R-squared values, according to the sklearn documentation: "By default, the score computed at each CV iteration is the score method of the estimator" (the score method of both regressors is R squared). The explanation the documentation gives for basic KFold cross-validation is: "Each fold is then used once as a validation while the k - 1 remaining folds form the training set."
How I understand this, for 10-fold CV, is: my dataset is split into 10 equal parts; for each part, the remaining 9 parts are used for training (I am not sure whether this is a fit operation or a score operation) and the held-out part is used for validation (I am not sure what is done during validation). These regressors are a complete "black box" to me, so I have no idea how a tree is used for regression and where the cross-validation gets its R-squared values from.
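Piecing the documentation together, I think the following sketch is roughly equivalent to what cross_val_score does for cv=10 (this is my own illustration, assuming X and y are numpy arrays, not sklearn's actual code):

from sklearn.base import clone
from sklearn.model_selection import KFold

kf = KFold(n_splits=10)
scores = []
for train_idx, test_idx in kf.split(X):
    model = clone(dt)                          # fresh, unfitted copy of the estimator
    model.fit(X[train_idx], y[train_idx])      # "training": fit on the 9 remaining folds
    scores.append(model.score(X[test_idx], y[test_idx]))  # "validation": R squared on the held-out fold

If this is right, each of the 10 scores measures performance on data the tree never saw during fitting.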
So to summarize, I am struggling to understand how cross-validation can decrease the accuracy (R squared) so dramatically. Am I using cross-validation correctly for a regressor? Does it make sense to use cross-validation for a decision-tree regressor? Should I be using another cross-validation method?
Thank you
Best Answer
I have put together a small code snippet illustrating how to use DecisionTreeRegressor with cross-validation.
A. In the first snippet, cross_val_score is used. Note that the R-squared score can be negative: R squared compares the model's errors to those of a constant predictor of the mean, so a model whose predictions on the validation fold are worse than simply predicting the mean of y scores below zero. Negative cross-validated scores therefore give insight into poor learning by the model.
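For example, a minimal sketch (the synthetic make_regression data here is an illustrative placeholder; substitute your own X and y):

from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=10000, n_features=5, noise=10.0, random_state=0)
dt = DecisionTreeRegressor(random_state=0)
scores = cross_val_score(dt, X, y, cv=10)  # default scoring for a regressor: R squared
print(scores.mean(), scores.std())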
B. In the next snippet, cross-validation is used to perform a grid search over the parameter min_samples_split with GridSearchCV, and the best estimator is then used for scoring on the validation/holdout set (see the sketch after this paragraph).
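A sketch of this workflow, built around the imports listed above (the parameter grid values and the 80/20 holdout split are illustrative choices, not prescriptions):

from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import make_scorer
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=10000, n_features=5, noise=10.0, random_state=0)

# Hold out a validation set that the grid search never sees
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# 10-fold cross-validated grid search over min_samples_split,
# scored with (negated) mean absolute error via make_scorer
scorer = make_scorer(mean_absolute_error, greater_is_better=False)
grid = GridSearchCV(DecisionTreeRegressor(random_state=0),
                    param_grid={'min_samples_split': [2, 10, 50, 100]},
                    scoring=scorer, cv=10)
grid.fit(X_train, y_train)

# Score the best estimator (refit on the full training set) on the holdout set
best = grid.best_estimator_
print(r2_score(y_val, best.predict(X_val)))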
I hope this was useful.
Reference:
https://www.programcreek.com/python/example/75177/sklearn.cross_validation.cross_val_score