Datawhale - (scikit-learn tutorial) - task07 (ensemble learning) - 202112

Posted by richardandrewle on Sat, 01 Jan 2022 21:22:15 +0100

1, Basic principles

Ensemble learning completes a learning task by constructing and combining multiple learners, achieving better generalization and stability than any single learner. To obtain a good ensemble, the individual learners should be "good and different", i.e. accurate yet diverse. According to how the individual learners are generated, ensemble learning falls into two categories: sequential methods, in which the individual learners have strong dependencies and must be generated serially, such as Boosting; and parallel methods, in which the individual learners have no strong dependencies and can be generated in parallel, such as Bagging and random forest.

2, Boosting

Boosting refers to a family of algorithms that convert weak learners into a strong learner. The main idea is to train a sequence of weak learners, where a weak learner is a model only slightly better than random guessing, such as a shallow decision tree. Training uses weighted data: in each round, larger weights are assigned to the samples misclassified in earlier rounds. Its working mechanism is as follows:

  1. First, a base learner is trained on the initial training set;
  2. Then, the distribution of the training samples is adjusted according to the performance of this base learner, so that the samples it misclassified receive more attention in subsequent rounds;
  3. The next base learner is trained on the adjusted sample distribution;
  4. The above steps are repeated until the number of base learners reaches a predetermined value T, and the T base learners are finally combined with weights (see the sketch after this list).
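
To make the weight-update idea concrete, here is a minimal from-scratch sketch of an AdaBoost-style loop for binary labels encoded as -1/+1. This is an illustrative sketch, not the implementation scikit-learn uses, and the helper names adaboost_fit and adaboost_predict are made up for this example:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    # y is assumed to be a NumPy array of binary labels -1 and +1
    n = len(y)
    w = np.full(n, 1.0 / n)                      # start with uniform sample weights
    learners, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = w[pred != y].sum() / w.sum()       # weighted training error
        if err == 0 or err >= 0.5:               # perfect, or no better than random: stop
            break
        alpha = 0.5 * np.log((1 - err) / err)    # weight of this base learner
        w *= np.exp(-alpha * y * pred)           # increase weights of misclassified samples
        w /= w.sum()                             # renormalize the distribution
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(X, learners, alphas):
    # weighted vote of the base learners
    scores = sum(a * m.predict(X) for a, m in zip(alphas, learners))
    return np.sign(scores)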

A typical Boosting method, AdaBoost, is demonstrated below with scikit-learn on the wine dataset.

import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt

wine = load_wine()
print(f"All features:{wine.feature_names}")
X = pd.DataFrame(wine.data, columns=wine.feature_names)
y = pd.Series(wine.target)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)

base_model = DecisionTreeClassifier(max_depth=1, criterion='gini',random_state=1).fit(X_train, y_train)
y_pred = base_model.predict(X_test)
print(f"Accuracy of decision tree:{accuracy_score(y_test,y_pred):.3f}")

model = AdaBoostClassifier(base_estimator=base_model,
                           n_estimators=50,
                           learning_rate=0.5,
                           algorithm='SAMME.R',
                           random_state=1)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f"Accuracy of AdaBoost: {accuracy_score(y_test, y_pred):.3f}")

# Test the influence of the number of base estimators
x = list(range(2, 102, 2))
y = []

for i in x:
  model = AdaBoostClassifier(base_estimator=base_model,
                              n_estimators=i,
                              learning_rate=0.5,
                              algorithm='SAMME.R',
                              random_state=1)
  
  model.fit(X_train, y_train)
  model_test_sc = accuracy_score(y_test, model.predict(X_test))
  y.append(model_test_sc)

plt.style.use('ggplot')
plt.title("Effect of n_estimators", pad=20)
plt.xlabel("Number of base estimators")
plt.ylabel("Test accuracy of AdaBoost")
plt.plot(x, y)
plt.show()


# Test the impact of learning rate
x = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
y = []

for i in x:
  model = AdaBoostClassifier(base_estimator=base_model,
                              n_estimators=50,
                              learning_rate=i,
                              algorithm='SAMME.R',
                              random_state=1)
  
  model.fit(X_train, y_train)
  model_test_sc = accuracy_score(y_test, model.predict(X_test))
  y.append(model_test_sc)

plt.title("Effect of learning_rate", pad=20)
plt.xlabel("Learning rate")
plt.ylabel("Test accuracy of AdaBoost")
plt.plot(x, y)
plt.show()

# Auto tuning using GridSearchCV
hyperparameter_space = {'n_estimators':list(range(2, 102, 2)), 
                        'learning_rate':[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]}



gs = GridSearchCV(AdaBoostClassifier(base_estimator=base_model,
                                     algorithm='SAMME.R',
                                     random_state=1),
                  param_grid=hyperparameter_space, 
                  scoring="accuracy", n_jobs=-1, cv=5)

gs.fit(X_train, y_train)
print("Optimal hyperparameter:", gs.best_params_)


3, Bagging
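
Bagging trains the individual learners in parallel, each on a bootstrap sample of the training set, and combines them by voting (classification) or averaging (regression); random forest additionally randomizes the features considered at each split. Below is a minimal sketch on the same wine data, reusing the data split and imports from the AdaBoost example above (parameter names follow the same scikit-learn version as that example):

from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

# Bagging: base learners trained in parallel on bootstrap samples, combined by majority vote
bagging = BaggingClassifier(base_estimator=DecisionTreeClassifier(random_state=1),
                            n_estimators=50,
                            random_state=1)
bagging.fit(X_train, y_train)
print(f"Accuracy of Bagging: {accuracy_score(y_test, bagging.predict(X_test)):.3f}")

# Random forest: bagged decision trees with a random feature subset considered at each split
rf = RandomForestClassifier(n_estimators=50, random_state=1)
rf.fit(X_train, y_train)
print(f"Accuracy of random forest: {accuracy_score(y_test, rf.predict(X_test)):.3f}")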



4, Stacking

1. Basic idea of stacking

The method used to combine the individual learners is called the combination strategy. For classification problems we can use voting and output the class that receives the most votes; for regression problems we can average the outputs of the individual regressors. Voting and averaging are simple and effective combination strategies.

Another combination strategy is to use a further machine learning model to combine the outputs of the individual learners; this is stacking. In stacking, the individual learners are called primary learners, the learner used to combine them is called the secondary learner or meta-learner, and the data used to train the secondary learner is called the secondary training set. The secondary training set is obtained by applying the primary learners to the training set (typically through cross-validated predictions, so that the meta-learner is not trained on outputs the primary learners produced for their own training samples).
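
Below is a minimal sketch of stacking on the same wine data with scikit-learn's StackingClassifier, reusing the data split and imports from the examples above. The choice of primary learners and the logistic-regression meta-learner here is an illustrative assumption, not prescribed by the text:

from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Primary learners: their cross-validated predictions form the secondary training set
primary_learners = [('dt', DecisionTreeClassifier(max_depth=3, random_state=1)),
                    ('knn', KNeighborsClassifier()),
                    ('svc', SVC(random_state=1))]

# Secondary (meta) learner: learns how to combine the primary learners' outputs
stacking = StackingClassifier(estimators=primary_learners,
                              final_estimator=LogisticRegression(max_iter=1000),
                              cv=5)
stacking.fit(X_train, y_train)
print(f"Accuracy of Stacking: {accuracy_score(y_test, stacking.predict(X_test)):.3f}")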


Topics: Machine Learning scikit-learn