Nested cross validation

Posted by Rocu on Sun, 12 Sep 2021 00:10:02 +0200

0. Introduction

In most cases, machine learning experiments simply split the data into a train set and a test set. Generally speaking, this has little impact when the dataset is relatively large, but the resulting estimate is biased when the dataset is relatively small (I remember this being pointed out in a book, and it is also mentioned in deep learning courses). For datasets with less data, K-fold cross validation is used instead. Of course, the two approaches are essentially the same idea, and implementing either is basically a matter of a few lines of code.

Moreover, a validation set is usually split off from the training set, and the hyperparameters are selected according to performance on that validation set.
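As a sketch of that split (assuming X and y are already loaded; the split ratios here are arbitrary):

from sklearn.model_selection import train_test_split

# first split off a held-out test set, then carve a validation set out of
# the remaining training data
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=1)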

Writing all of this by hand is rather tedious, though, so in practice it is generally done by calling library functions directly.

There is also the question of preprocessing such as normalization, which needs to be handled carefully; this, too, can be automated.

1. Cross validation

For ordinary cross validation, once the data has been processed, the splitting and scoring come down to a single line of code:

cross_val_score(model, X, y)

and that is all. However, as mentioned earlier, the hyperparameters still have to be selected, so the code needs some adjustment.
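For reference, a minimal runnable version of that one-liner (the synthetic dataset and the RandomForestClassifier here are placeholders for illustration, not from the original article):

# plain (non-nested) k-fold cross validation
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
model = RandomForestClassifier(random_state=1)
cv = KFold(n_splits=10, shuffle=True, random_state=1)

# one call performs the whole split/fit/score loop
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv)
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))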

Article [1] lays out the simple steps; the specific code is as follows:

# automatic nested cross-validation for random forest on a classification dataset
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
# create dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=1, n_informative=10, n_redundant=10)
# configure the cross-validation procedure
cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)
# define the model
model = RandomForestClassifier(random_state=1)
# define search space
space = dict()
space['n_estimators'] = [10, 100, 500]
space['max_features'] = [2, 4, 6]
# define search
search = GridSearchCV(model, space, scoring='accuracy', n_jobs=1, cv=cv_inner, refit=True)
# configure the cross-validation procedure
cv_outer = KFold(n_splits=10, shuffle=True, random_state=1)
# execute the nested cross-validation
scores = cross_val_score(search, X, y, scoring='accuracy', cv=cv_outer, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Article [1] also implements the same functionality by hand, with the outer loop over the data written directly with KFold (a sketch follows below; see the article for the full version). In the code above, GridSearchCV is where confusion most easily arises, but it can simply be regarded as a model in its own right. Moreover, according to the official documentation, the refit parameter defaults to True, so after the search the best parameters are used to retrain the model on the whole (inner) training set. Seen that way, the construction is easy to understand.
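A sketch of that manual version (variable names follow the snippet above; this is an illustration under the same setup, not the exact code from article [1]):

# manual nested cross-validation: outer loop by hand, GridSearchCV inside
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, KFold

X, y = make_classification(n_samples=1000, n_features=20, random_state=1, n_informative=10, n_redundant=10)
space = {'n_estimators': [10, 100, 500], 'max_features': [2, 4, 6]}

cv_outer = KFold(n_splits=10, shuffle=True, random_state=1)
outer_scores = []
for train_ix, test_ix in cv_outer.split(X):
    X_train, X_test = X[train_ix], X[test_ix]
    y_train, y_test = y[train_ix], y[test_ix]
    # inner loop: select hyperparameters on the training fold only
    cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)
    search = GridSearchCV(RandomForestClassifier(random_state=1), space, scoring='accuracy', cv=cv_inner, refit=True)
    search.fit(X_train, y_train)
    # refit=True: best_estimator_ has been retrained on the whole training fold
    yhat = search.best_estimator_.predict(X_test)
    outer_scores.append(accuracy_score(y_test, yhat))

print('Accuracy: %.3f (%.3f)' % (mean(outer_scores), std(outer_scores)))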

2. Pipeline

First, the code:

# X, y: the entire dataset (assumed to be loaded already)
import numpy as np
from numpy import mean, std
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# grid keys are prefixed with the pipeline step name plus a double underscore
# (note: LDA caps n_components at min(n_classes - 1, n_features))
parameters = {
    "lda__n_components": list(range(1, 35)),
    "rf__min_samples_leaf": [1, 2, 4],
    "rf__min_samples_split": [2, 5, 10],
    "rf__max_depth": [int(x) for x in np.linspace(10, 110, num=10)],
}

steps = [
    ("min", StandardScaler()),
    ("lda", LinearDiscriminantAnalysis()),
    ("rf", RandomForestClassifier(n_jobs=-1)),
]

model = Pipeline(steps=steps)

# inner loop: hyperparameter selection
inner_cv = StratifiedKFold(n_splits=10, shuffle=True)
grid_model = GridSearchCV(
    model,
    parameters,
    scoring='accuracy', n_jobs=-1, cv=inner_cv,
)

# outer loop: performance estimation
outer_cv = StratifiedKFold(n_splits=10, shuffle=True)  # , random_state=1)

n_scores = cross_val_score(
    grid_model, X, y,
    scoring='accuracy', cv=outer_cv,
    n_jobs=-1, error_score='raise',
)

print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

The earlier code already performs nested cross validation, but in general the dataset also needs some preprocessing, and that is what Pipeline solves. At the same time the hyperparameters still have to be searched, and since every Pipeline step already carries a name, each key of the parameter dictionary must be prefixed with that step's name (followed by a double underscore). The code in article [2] is similar.
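Since the grid keys must match the step names exactly, a quick way to check them, using the model pipeline defined above (the listed keys are illustrative):

# list every tunable parameter key of the pipeline
print(sorted(model.get_params().keys()))
# includes e.g. 'lda__n_components', 'min__with_mean', 'rf__max_depth'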
Article [3] implements code with the same functionality.

Article [4] gives a brief introduction to nested cross validation and notes that the score obtained with nesting is usually slightly worse than without it; but this estimate is also the more honest one, since without nesting the same data both selects the hyperparameters and scores them.
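A rough sketch of that comparison, reusing grid_model, outer_cv, X, and y from the Pipeline code above (an illustration of the effect, not code from article [4]):

from numpy import mean
from sklearn.model_selection import cross_val_score

# non-nested: the inner CV both tunes and scores, which is optimistic
grid_model.fit(X, y)
print('non-nested accuracy: %.3f' % grid_model.best_score_)

# nested: the outer CV scores on folds the search never saw
nested = cross_val_score(grid_model, X, y, scoring='accuracy', cv=outer_cv)
print('nested accuracy:     %.3f' % mean(nested))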

References

[1] Nested Cross-Validation for Machine Learning with Python
[2] Putting together sklearn pipeline + nested cross-validation for KNN regression
[3] Python – Nested Cross Validation for Algorithm Selection
[4] Nested cross-validation

Topics: Python, Machine Learning