Ensemble learning task 7

Posted by ckehrer on Tue, 08 Mar 2022 00:55:03 +0100

The idea behind the voting method

Voting is a combination strategy for classification problems in ensemble learning.
The basic idea is to select the class that receives the most votes among the outputs of all the base learners.

For classification, a machine learning algorithm can produce two kinds of output: one directly outputs class labels, the other outputs class probabilities. Voting with the former is called majority/hard voting, and voting with the latter is called soft voting. VotingClassifier in sklearn implements the voting method.
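A minimal numpy sketch of the difference (the prediction arrays below are made up purely for illustration): hard voting takes the most frequent label per sample, while soft voting averages the predicted probabilities and takes the argmax.

import numpy as np

# Hypothetical class labels from three classifiers for four samples
labels = np.array([[0, 1, 1, 0],   # classifier 1
                   [1, 1, 0, 0],   # classifier 2
                   [1, 1, 1, 0]])  # classifier 3

# Hard voting: the most frequent label per sample (column-wise mode)
hard = np.array([np.bincount(col).argmax() for col in labels.T])
print(hard)  # [1 1 1 0]

# Hypothetical class probabilities, shape (classifiers, samples, classes)
probas = np.array([[[0.6, 0.4], [0.3, 0.7]],
                   [[0.4, 0.6], [0.2, 0.8]],
                   [[0.7, 0.3], [0.4, 0.6]]])

# Soft voting: average the probabilities, then take the argmax per sample
soft = probas.mean(axis=0).argmax(axis=1)
print(soft)  # [0 1]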

Implementation of hard voting and soft voting

Hard voting

  • The predicted class is the one that appears most frequently among the base learners' votes.
from sklearn import datasets, linear_model, svm, neighbors
from sklearn.metrics import accuracy_score
# Using breast cancer data
breast_cancer = datasets.load_breast_cancer()
x, y = breast_cancer.data, breast_cancer.target

Loading the voting module

from sklearn.ensemble import VotingClassifier
from sklearn import datasets, naive_bayes, svm, neighbors

Initialize the base learners

learner_1 = neighbors.KNeighborsClassifier(n_neighbors=5)  
learner_2 = linear_model.Perceptron(tol=1e-2, random_state=0)
learner_3 = svm.SVC(gamma=0.001)

Split the data into training and test sets

test_samples = 150
x_train, y_train = x[:-test_samples], y[:-test_samples]
x_test, y_test = x[-test_samples:], y[-test_samples:]

Build the voting ensemble

learner_1 = neighbors.KNeighborsClassifier(n_neighbors=5)
learner_2 = linear_model.Perceptron(tol=1e-2, random_state=0)
learner_3 = svm.SVC(gamma=0.001)
voting = VotingClassifier([('KNN', learner_1),
                           ('Prc', learner_2),
                           ('SVM', learner_3)])

Fitting and prediction

# Fit classifier with the training data
voting.fit(x_train, y_train)

# Predict the most voted class
hard_predictions = voting.predict(x_test)
print('-'*30)
print('Hard Voting:', accuracy_score(y_test, hard_predictions))
------------------------------
Hard Voting: 0.9333333333333333

Soft voting

  • The predicted class is the one with the largest summed (averaged) probability across the base learners.
# Instantiate the learners (classifiers)
learner_1 = neighbors.KNeighborsClassifier(n_neighbors=5)
learner_2 = naive_bayes.GaussianNB()
learner_3 = svm.SVC(gamma=0.001, probability=True)

# Instantiate the voting classifier
voting = VotingClassifier([('KNN', learner_1),
                           ('NB', learner_2),
                           ('SVM', learner_3)],
                           voting='soft')  # change the voting method to soft
voting.fit(x_train, y_train)
learner_1.fit(x_train, y_train)
learner_2.fit(x_train, y_train)
learner_3.fit(x_train, y_train)
# Predict the most probable class
soft_predictions = voting.predict(x_test)

# Get the base learner predictions
predictions_1 = learner_1.predict(x_test)
predictions_2 = learner_2.predict(x_test)
predictions_3 = learner_3.predict(x_test)
# Print results
# Accuracies of base learners
print('L1:', accuracy_score(y_test, predictions_1))
print('L2:', accuracy_score(y_test, predictions_2))
print('L3:', accuracy_score(y_test, predictions_3))
# Accuracy of soft voting
print('-'*30)
print('Soft Voting:', accuracy_score(y_test, soft_predictions))
L1: 0.94
L2: 0.96
L3: 0.8933333333333333
------------------------------
Soft Voting: 0.9333333333333333

Hard voting is appropriate when the models in the voting ensemble predict clear class labels; soft voting is appropriate when they can predict class probabilities.
Soft voting can also be used with models that do not natively predict class-membership probabilities, as long as they can output probability-like prediction scores (such as support vector machines, k-nearest neighbors and decision trees); see the sketch below.
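As a minimal sketch (reusing the x_train/x_test split from above), an SVC without probability=True can still join a soft-voting ensemble if its decision scores are turned into probability-like values with CalibratedClassifierCV:

from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import VotingClassifier
from sklearn import svm, neighbors, naive_bayes

# Wrap the SVC so that calibrated, probability-like scores are available
calibrated_svm = CalibratedClassifierCV(svm.SVC(gamma=0.001))

soft_voting = VotingClassifier([('KNN', neighbors.KNeighborsClassifier(n_neighbors=5)),
                                ('NB', naive_bayes.GaussianNB()),
                                ('SVM', calibrated_svm)],
                               voting='soft')
soft_voting.fit(x_train, y_train)
print('Calibrated soft voting:', soft_voting.score(x_test, y_test))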

Case study of the voting method (based on sklearn, introducing Pipeline and voting)

Pipeline

  • Pipelines have come up several times already, so they are explained here.

A Pipeline chains multiple algorithm models in series, cascading several estimators into a single estimator; for example, feature extraction, normalization and classification can be organized together into a typical machine learning workflow.
All estimators in a pipeline except the last one must be transformers. The last estimator can be of any type (transformer, classifier, regressor). If the last estimator is a classifier, the whole pipeline can be used as a classifier; if the last estimator is a clusterer, the whole pipeline can be used as a clusterer.

In effect, a pipeline forms a workflow: it simplifies calling multiple models on the same data and avoids preprocessing the data repeatedly.
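A brief sketch (reusing the breast cancer split from above): a pipeline chains a scaler and a classifier into a single estimator, so the same preprocessing is applied automatically during both fit and predict.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Scaling and classification as one estimator: fit() scales then trains,
# predict()/score() apply the same scaling to new data before classifying
pipe = make_pipeline(StandardScaler(), SVC())
pipe.fit(x_train, y_train)
print('Pipeline accuracy:', pipe.score(x_test, y_test))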

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RepeatedStratifiedKFold
import numpy as np
models = [('lr', LogisticRegression()),
          ('svm', make_pipeline(StandardScaler(), SVC(probability=True)))]  # probability=True is needed for soft voting
ensemble_hard = VotingClassifier(estimators=models, voting='hard')
ensemble_soft = VotingClassifier(estimators=models, voting='soft')

Create the dataset

from sklearn.datasets import make_classification
def get_dataset():
    X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                               n_redundant=5, random_state=2)
    return X, y

We use several KNN models as base learners to demonstrate the voting method, where each model uses a different value of the neighbor parameter k.

def get_voting():
    # define the base models
    models = list()
    models.append(('knn1', KNeighborsClassifier(n_neighbors=1)))
    models.append(('knn3', KNeighborsClassifier(n_neighbors=3)))
    models.append(('knn5', KNeighborsClassifier(n_neighbors=5)))
    models.append(('knn7', KNeighborsClassifier(n_neighbors=7)))
    models.append(('knn9', KNeighborsClassifier(n_neighbors=9)))
    # define the voting ensemble (hard voting over the KNN models)
    ensemble = VotingClassifier(estimators=models, voting='hard')
    return ensemble

Evaluate the improvement from voting

def get_models():
    models=dict()
    models['knn1'] = KNeighborsClassifier(n_neighbors=1)
    models['knn3'] = KNeighborsClassifier(n_neighbors=3)
    models['knn5'] = KNeighborsClassifier(n_neighbors=5)
    models['knn7'] = KNeighborsClassifier(n_neighbors=7)
    models['knn9'] = KNeighborsClassifier(n_neighbors=9)
    models['voting']= get_voting()
    return models

The evaluate_model() function receives a model instance and returns a list of scores from 3 repeats of stratified 10-fold cross-validation.

from sklearn.model_selection import cross_val_score
def evaluate_model(model,X,y):
    cv=RepeatedStratifiedKFold(n_splits=10,n_repeats=3,random_state=1)
    scores=cross_val_score(model,X,y,scoring='accuracy',cv=cv,n_jobs=-1,error_score='raise')
    return scores

Compare algorithms and visualize

from sklearn.neighbors import KNeighborsClassifier 
from matplotlib import pyplot
X,y=get_dataset()
models=get_models()
results,names=list(),list()
for name,model in models.items():
    scores=evaluate_model(model,X,y)
    results.append(scores)
    names.append(name)
    print('>%s %.3f (%.3f)'% (name, np.mean(scores), np.std(scores)))
# plot model performance for comparison
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()
>knn1 0.873 (0.030)
>knn3 0.889 (0.038)
>knn5 0.895 (0.031)
>knn7 0.899 (0.035)
>knn9 0.900 (0.033)
>voting 0.910 (0.031)

  • The voting ensemble outperforms every individual base model, although the improvement is not dramatic.

bagging

  • Bagging algorithm: the learning algorithm is trained for multiple rounds. Each round's training set consists of n training samples drawn at random from the initial training set; an initial training sample may appear several times in a given round or not at all. After training, a sequence of prediction functions h_1, ..., h_m is obtained. The final prediction function H uses voting for classification problems and simple averaging for regression problems to judge new examples. In essence, bagging is sampling the population with replacement.
  • For regression problems, the prediction is the average of the individual predictions; for classification problems, the prediction is the majority vote of the individual predictions.
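As a tiny illustration of the two combination rules in plain numpy (the prediction values below are made up):

import numpy as np

# Hypothetical predictions from three base models for the same input
reg_preds = np.array([2.0, 3.0, 4.0])  # regression: three predicted values
cls_preds = np.array([1, 0, 1])        # classification: three predicted labels

print('Regression ensemble prediction    :', reg_preds.mean())                 # simple average -> 3.0
print('Classification ensemble prediction:', np.bincount(cls_preds).argmax())  # majority vote -> 1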

bagging principle

  1. Draw n samples from the original sample set using Bootstrap sampling (sampling with replacement).

  2. Train a classifier on these n samples.

  3. Repeat steps 1-2 to obtain m classifiers.

  4. Run the data to be classified through all m classifiers.

  5. Take a vote over the results of the m classifiers; the most frequent class is the final prediction (see the sketch below).
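A minimal from-scratch sketch of these steps (plain numpy plus sklearn decision trees; the dataset and the values of n and m are placeholders chosen for illustration):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
n, m = len(X_tr), 25                    # n samples per bootstrap round, m classifiers
rng = np.random.default_rng(0)

classifiers = []
for _ in range(m):
    idx = rng.integers(0, n, size=n)    # step 1: bootstrap sample (with replacement)
    clf = DecisionTreeClassifier().fit(X_tr[idx], y_tr[idx])  # step 2: train a classifier
    classifiers.append(clf)             # step 3: repeat to obtain m classifiers

# Steps 4-5: each classifier predicts on the new data, then a majority vote decides
all_preds = np.array([clf.predict(X_te) for clf in classifiers])
majority = np.array([np.bincount(col).argmax() for col in all_preds.T])
print('Bagged trees test accuracy:', (majority == y_te).mean())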

Bootstrap principle

The name Bootstrap sampling comes from the idiom "pull yourself up by your own bootstraps", meaning to rely on your own resources; it is therefore also called the self-help method. It is a sampling-with-replacement method and an important statistical technique in nonparametric statistics for estimating the variance of a statistic and then constructing interval estimates.
Its core idea and basic steps are as follows:
1) Use resampling to draw a certain number of samples (chosen by yourself) from the original sample; repeated sampling is allowed in this process.

2) Compute the given statistic T from the drawn samples.

3) Repeat the above N times (generally more than 1000) to obtain N values of the statistic T.

4) Compute the sample variance of these N statistics T to obtain the variance of the statistic (see the numerical sketch below).
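A small numerical sketch of these steps, estimating the variance of the sample mean (the sample and the value of N below are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=5.0, scale=2.0, size=100)   # original sample
N = 2000                                            # number of bootstrap repetitions

# Steps 1-3: resample with replacement and compute the statistic T (here, the mean)
boot_means = np.array([rng.choice(sample, size=sample.size, replace=True).mean()
                       for _ in range(N)])

# Step 4: the sample variance of the N statistics estimates the variance of the mean
print('Bootstrap variance of the mean:', boot_means.var(ddof=1))
print('Theoretical approximation     :', sample.var(ddof=1) / sample.size)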

Comparison before and after using bagging

# Import algorithm package and dataset
from sklearn import neighbors
from sklearn import datasets
from sklearn.ensemble import BaggingClassifier
from sklearn import tree
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')  # Use matplotlib's built-in ggplot style
# The following two lines of code are used to display Chinese
plt.rcParams['font.sans-serif'] = ['SimHei']  # Used to display Chinese labels normally
plt.rcParams['axes.unicode_minus'] = False  # Used to display negative signs normally

iris = datasets.load_iris()
x_data = iris.data[:,:2]  # use only the first two features so the decision regions can be plotted
y_data = iris.target

x_train,x_test,y_train,y_test = train_test_split(x_data, y_data)
# print(iris)
print(y_data)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]

Without bagging

def plot(model):
    # Gets the range of data values
    x_min, x_max = x_data[:, 0].min() - 1, x_data[:, 0].max() + 1
    y_min, y_max = x_data[:, 1].min() - 1, x_data[:, 1].max() + 1

    # Generate grid matrix
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                         np.arange(y_min, y_max, 0.02))

    z = model.predict(np.c_[xx.ravel(), yy.ravel()])  # ravel, like flatten, turns a multidimensional array into 1-D; flatten returns a copy, while ravel returns a view of the original data where possible
    z = z.reshape(xx.shape)
    # Contour map
    cs = plt.contourf(xx, yy, z)

# KNN
knn = neighbors.KNeighborsClassifier()
knn.fit(x_train, y_train)

# Draw a picture
plot(knn)
# Sample scatter diagram
plt.scatter(x_data[:, 0], x_data[:, 1], c=y_data)
plt.show()
# Accuracy
print(knn.score(x_test, y_test))

# Decision tree
dtree = tree.DecisionTreeClassifier()
dtree.fit(x_train, y_train)
# Draw a picture
plot(dtree)
# Sample scatter diagram
plt.scatter(x_data[:, 0], x_data[:, 1], c=y_data)
plt.show()
# Accuracy
dtree.score(x_test, y_test)

0.6578947368421053

0.6052631578947368

Using bagging

bagging_knn = BaggingClassifier(knn, n_estimators=100)
# Input data to build model
bagging_knn.fit(x_train, y_train)
plot(bagging_knn)
# Sample scatter diagram
plt.scatter(x_data[:, 0], x_data[:, 1], c=y_data)
plt.show()
print(bagging_knn.score(x_test, y_test))

0.7105263157894737
bagging_tree = BaggingClassifier(dtree, n_estimators=100)
# Input data to build model
bagging_tree.fit(x_train, y_train)
plot(bagging_tree)
# Sample scatter diagram
plt.scatter(x_data[:, 0], x_data[:, 1], c=y_data)
plt.show()
print(bagging_tree.score(x_test, y_test))

0.6578947368421053
  • It can be seen that both KNN and the decision tree improve after bagging is applied.

summary

This is task 7 of ensemble learning. Only now have I really figured out what ensemble learning is: in essence, it is the combined use of multiple models, including the bagging, stacking and voting methods learned before.
The general structure of ensemble learning is to produce a group of "individual learners" and then combine them with some strategy. An ensemble that contains only individual learners of the same type is called homogeneous; its individual learners are also called "base learners", and the corresponding algorithm is called the "base learning algorithm". An ensemble that contains different types of individual learners is called heterogeneous, and its individual learners are called "component learners".
Multiple individual learners are combined into a strong learner to improve accuracy.
Here are some references on ensemble learning that may be useful for later study.
Relevant references:
Detailed explanation of the principles of ensemble learning
Ensemble learning – bagging, boosting, stacking

Topics: Python Machine Learning