I am a novice in statistical methods, so please excuse any naivety. I am having trouble understanding the behaviour of cross-validation when using decision-tree regression from sklearn (e.g. DecisionTreeRegressor and RandomForestRegressor). My dataset varies from having multiple predictors (y = single dependent variable; X = multiple independent variables) to having a single predictor, and it contains enough cases (> 10k). The following description applies to all cases.
When fitting and scoring the regressors with the standard methods:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

dt = DecisionTreeRegressor()
rf = RandomForestRegressor()
dt.fit(X, y)
rf.fit(X, y)
dt_score = dt.score(X, y)  # R squared on the same data used for fitting
rf_score = rf.score(X, y)
The dt_score and rf_score return promising R-squared values (> 0.7); however, I am aware of the over-fitting tendencies of decision trees and, to a lesser extent, random forests. Therefore I tried to score the regressors with 10-fold cross-validation to get a truer estimate of the accuracy:
from sklearn.model_selection import cross_val_score

dt = DecisionTreeRegressor()
rf = RandomForestRegressor()
dt_scores = cross_val_score(dt, X, y, cv=10)  # fits a fresh clone on each training split, so no prior fit() is needed
rf_scores = cross_val_score(rf, X, y, cv=10)
dt_score = round(sum(dt_scores) / len(dt_scores), 3)
rf_score = round(sum(rf_scores) / len(rf_scores), 3)
The results of this cross-validation are consistently negative. I assume they are R-squared values, according to the sklearn documentation: "By default, the score computed at each CV iteration is the score method of the estimator" (the score method of both regressors is R squared). The explanation the documentation gives for basic KFold cross-validation is: "Each fold is then used once as a validation while the k - 1 remaining folds form the training set."
How I understand this, for 10-fold CV, is: my dataset is split into 10 equal parts; for each part, the remaining 9 parts are used for training (I am not sure whether this is a fit operation or a score operation) and the held-out part is used for validation (I am not sure what is done during validation). These regressors are a complete "black box" to me, so I have no idea how a tree is used for regression and where the cross-validation gets its R-squared values from.
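Piecing the documentation together, I think the following sketch is roughly equivalent to what cross_val_score does for cv=10 (this is my own illustration, assuming X and y are numpy arrays, not sklearn's actual code):

from sklearn.base import clone
from sklearn.model_selection import KFold

kf = KFold(n_splits=10)
scores = []
for train_idx, test_idx in kf.split(X):
    model = clone(dt)                          # fresh, unfitted copy of the estimator
    model.fit(X[train_idx], y[train_idx])      # "training": fit on the 9 remaining folds
    scores.append(model.score(X[test_idx], y[test_idx]))  # "validation": R squared on the held-out fold

If this is right, each of the 10 scores measures performance on data the tree never saw during fitting.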
So to summarize, I am struggling to understand how cross-validation can decrease the accuracy (R squared) so dramatically. Am I using cross-validation correctly for a regressor? Does it make sense to use cross-validation for a decision-tree regressor? Should I be using another cross-validation method?
Thank you
Best Answer
I have put together a small code snippet illustrating how to use DecisionTreeRegressor with cross-validation.
A. In the first snippet, cross_val_score is used. Note that the R-squared score can be negative: R squared compares the model's errors to those of a constant predictor of the mean, so a model whose predictions on the validation fold are worse than simply predicting the mean of y scores below zero. Negative cross-validated scores therefore give insight into poor learning by the model.
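For example, a minimal sketch (the synthetic make_regression data here is an illustrative placeholder; substitute your own X and y):

from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=10000, n_features=5, noise=10.0, random_state=0)
dt = DecisionTreeRegressor(random_state=0)
scores = cross_val_score(dt, X, y, cv=10)  # default scoring for a regressor: R squared
print(scores.mean(), scores.std())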
B. In the next snippet, cross-validation is used to perform a grid search over the parameter min_samples_split with GridSearchCV, and the best estimator is then used for scoring on the validation/holdout set (see the sketch after this paragraph).
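A sketch of this workflow, built around the imports listed above (the parameter grid values and the 80/20 holdout split are illustrative choices, not prescriptions):

from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import make_scorer
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=10000, n_features=5, noise=10.0, random_state=0)

# Hold out a validation set that the grid search never sees
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# 10-fold cross-validated grid search over min_samples_split,
# scored with (negated) mean absolute error via make_scorer
scorer = make_scorer(mean_absolute_error, greater_is_better=False)
grid = GridSearchCV(DecisionTreeRegressor(random_state=0),
                    param_grid={'min_samples_split': [2, 10, 50, 100]},
                    scoring=scorer, cv=10)
grid.fit(X_train, y_train)

# Score the best estimator (refit on the full training set) on the holdout set
best = grid.best_estimator_
print(r2_score(y_val, best.predict(X_val)))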
I hope this was useful.
Reference:
https://www.programcreek.com/python/example/75177/sklearn.cross_validation.cross_val_score