Idea of the voting method
Voting is a combination strategy for classification problems in ensemble learning.
The basic idea is to choose the class that receives the most votes among the outputs of all the machine learning algorithms.
For classification, a machine learning algorithm can produce two kinds of output: one directly outputs class labels, the other outputs class probabilities. Voting with the former is called majority / hard voting, and voting with the latter is called soft voting. The VotingClassifier in sklearn is the implementation of the voting method.
Implementation of hard voting and soft voting
Hard voting
- The prediction is the category that appears most frequently among all the base learners' votes.
from sklearn import datasets, linear_model, svm, neighbors
from sklearn.metrics import accuracy_score
from numpy import argmax

# Use the breast cancer data
breast_cancer = datasets.load_breast_cancer()
x, y = breast_cancer.data, breast_cancer.target
Loading the voting module
from sklearn.ensemble import VotingClassifier
from sklearn import datasets, naive_bayes, svm, neighbors
Initialize the base learners
learner_1 = neighbors.KNeighborsClassifier(n_neighbors=5)
learner_2 = linear_model.Perceptron(tol=1e-2, random_state=0)
learner_3 = svm.SVC(gamma=0.001)
Generate the training and test sets
test_samples = 150
x_train, y_train = x[:-test_samples], y[:-test_samples]
x_test, y_test = x[-test_samples:], y[-test_samples:]
Build the voting classifier
learner_1 = neighbors.KNeighborsClassifier(n_neighbors=5)
learner_2 = linear_model.Perceptron(tol=1e-2, random_state=0)
learner_3 = svm.SVC(gamma=0.001)

voting = VotingClassifier([('KNN', learner_1),
                           ('Prc', learner_2),
                           ('SVM', learner_3)])
Fitting and prediction
# Fit the voting classifier with the training data
voting.fit(x_train, y_train)

# Predict the most voted class
hard_predictions = voting.predict(x_test)

print('-' * 30)
print('Hard Voting:', accuracy_score(y_test, hard_predictions))
------------------------------
Hard Voting: 0.9333333333333333
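For intuition, hard voting simply takes the most frequent label among the base learners' predictions for each test point. Below is a minimal sketch of that equivalence, assuming the learners and split defined above (ties may be broken slightly differently than in sklearn's implementation):

import numpy as np

# Fit the three base learners separately
for learner in (learner_1, learner_2, learner_3):
    learner.fit(x_train, y_train)

# Stack the per-learner predictions: shape (n_learners, n_test_samples)
all_preds = np.array([learner.predict(x_test)
                      for learner in (learner_1, learner_2, learner_3)])

# Majority vote per test sample (labels here are 0/1, so bincount + argmax works)
manual_hard = np.array([np.argmax(np.bincount(col)) for col in all_preds.T])
print('Manual hard voting:', accuracy_score(y_test, manual_hard))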
Soft voting
- The prediction is the class with the largest sum of predicted probabilities across all the base learners.
# Instantiate the learners (classifiers)
learner_1 = neighbors.KNeighborsClassifier(n_neighbors=5)
learner_2 = naive_bayes.GaussianNB()
learner_3 = svm.SVC(gamma=0.001, probability=True)

# Instantiate the voting classifier, changing the voting method to 'soft'
voting = VotingClassifier([('KNN', learner_1),
                           ('NB', learner_2),
                           ('SVM', learner_3)],
                          voting='soft')

voting.fit(x_train, y_train)
learner_1.fit(x_train, y_train)
learner_2.fit(x_train, y_train)
learner_3.fit(x_train, y_train)
SVC(gamma=0.001, probability=True)
# Predict the most probable class
soft_predictions = voting.predict(x_test)

# Get the base learner predictions
predictions_1 = learner_1.predict(x_test)
predictions_2 = learner_2.predict(x_test)
predictions_3 = learner_3.predict(x_test)

# Print results: accuracies of the base learners
print('L1:', accuracy_score(y_test, predictions_1))
print('L2:', accuracy_score(y_test, predictions_2))
print('L3:', accuracy_score(y_test, predictions_3))

# Accuracy of soft voting
print('-' * 30)
print('Soft Voting:', accuracy_score(y_test, soft_predictions))
L1: 0.94
L2: 0.96
L3: 0.8933333333333333
------------------------------
Soft Voting: 0.9333333333333333
Hard voting is appropriate when the models in the voting ensemble predict crisp class labels; soft voting is appropriate when they can predict class probabilities.
Soft voting can also be used with models that do not natively predict class-membership probabilities, as long as they can output probability-like prediction scores (such as support vector machines, k-nearest neighbors, and decision trees).
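As a sanity check of what soft voting computes, the sketch below (reusing the three fitted soft-voting learners above, and assuming classes are labeled 0..k-1) averages their predict_proba outputs and takes the class with the highest average probability; up to ties, this matches voting.predict:

import numpy as np

# Average the class-probability estimates of the fitted base learners
avg_proba = (learner_1.predict_proba(x_test)
             + learner_2.predict_proba(x_test)
             + learner_3.predict_proba(x_test)) / 3

# The predicted class is the one with the largest average probability
manual_soft = np.argmax(avg_proba, axis=1)
print('Manual soft voting:', accuracy_score(y_test, manual_soft))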
Case study of the voting method (based on sklearn, introducing the use of Pipeline and voting)
Pipeline
- This has appeared several times before and is explained here
A Pipeline connects multiple algorithm models in series, cascading several estimators into a single estimator; for example, feature extraction, normalization, and classification can be organized together into a typical machine learning workflow.
All estimators in a pipeline except the last one must be transformers. The last estimator can be of any type (transformer, classifier, regressor). If the last estimator is a classifier, the whole pipeline can be used as a classifier; if the last estimator is a clusterer, the whole pipeline can be used as a clusterer.
In effect, a pipeline forms a workflow that simplifies calling multiple models in sequence and avoids preprocessing the data repeatedly.
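As a minimal sketch of the idea (reusing the breast cancer split from earlier; the parameters are illustrative), a pipeline that standardizes the features and then fits an SVM behaves like a single classifier:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# fit() runs the scaler and then the SVM; predict()/score() apply the same chain
pipe = make_pipeline(StandardScaler(), SVC(gamma=0.001))
pipe.fit(x_train, y_train)
print('Pipeline accuracy:', pipe.score(x_test, y_test))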
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RepeatedStratifiedKFold
import numpy as np

models = [('lr', LogisticRegression()),
          ('svm', make_pipeline(StandardScaler(), SVC()))]
ensemble_hard = VotingClassifier(estimators=models, voting='hard')
ensemble_soft = VotingClassifier(estimators=models, voting='soft')
Create the sample dataset
from sklearn.datasets import make_classification

def get_dataset():
    X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                               n_redundant=5, random_state=2)
    return X, y
We use several KNN models as base models to demonstrate the voting method, each with a different number of neighbors K.
A version of get_voting() that returns both the hard-voting and soft-voting ensembles:

def get_voting():
    # Define the base models
    models = list()
    models.append(('knn1', KNeighborsClassifier(n_neighbors=1)))
    models.append(('knn3', KNeighborsClassifier(n_neighbors=3)))
    models.append(('knn5', KNeighborsClassifier(n_neighbors=5)))
    models.append(('knn7', KNeighborsClassifier(n_neighbors=7)))
    models.append(('knn9', KNeighborsClassifier(n_neighbors=9)))
    # Define the voting ensembles
    ensemble_hard = VotingClassifier(estimators=models, voting='hard')
    ensemble_soft = VotingClassifier(estimators=models, voting='soft')
    return ensemble_hard, ensemble_soft

The version actually used below returns only the hard-voting ensemble:

def get_voting():
    # Define the base models
    models = list()
    models.append(('knn1', KNeighborsClassifier(n_neighbors=1)))
    models.append(('knn3', KNeighborsClassifier(n_neighbors=3)))
    models.append(('knn5', KNeighborsClassifier(n_neighbors=5)))
    models.append(('knn7', KNeighborsClassifier(n_neighbors=7)))
    models.append(('knn9', KNeighborsClassifier(n_neighbors=9)))
    # Define the voting ensemble
    ensemble = VotingClassifier(estimators=models, voting='hard')
    return ensemble
Check the improvement from voting
def get_models():
    models = dict()
    models['knn1'] = KNeighborsClassifier(n_neighbors=1)
    models['knn3'] = KNeighborsClassifier(n_neighbors=3)
    models['knn5'] = KNeighborsClassifier(n_neighbors=5)
    models['knn7'] = KNeighborsClassifier(n_neighbors=7)
    models['knn9'] = KNeighborsClassifier(n_neighbors=9)
    models['voting'] = get_voting()
    return models
The evaluate_model() function receives a model instance and returns a list of scores from stratified 10-fold cross-validation repeated three times.
from sklearn.model_selection import cross_val_score

def evaluate_model(model, X, y):
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv,
                             n_jobs=-1, error_score='raise')
    return scores
Compare algorithms and visualize
from sklearn.neighbors import KNeighborsClassifier
from matplotlib import pyplot

X, y = get_dataset()
models = get_models()
results, names = list(), list()
for name, model in models.items():
    scores = evaluate_model(model, X, y)
    results.append(scores)
    names.append(name)
    print('>%s %.3f (%.3f)' % (name, np.mean(scores), np.std(scores)))

# Plot model performance for comparison
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()
>knn1 0.873 (0.030)
>knn3 0.889 (0.038)
>knn5 0.895 (0.031)
>knn7 0.899 (0.035)
>knn9 0.900 (0.033)
>voting 0.910 (0.031)
- The voting ensemble outperforms each of the base models, though the improvement is not dramatic.
bagging
- Bagging algorithm: train the learning algorithm for multiple rounds. Each round's training set consists of N samples drawn at random from the initial training set, so a given initial sample may appear several times or not at all in a round. Training yields a sequence of prediction functions h_1, ..., h_n. The final prediction function H uses voting for classification problems and simple averaging for regression problems when judging a new example. In essence, it is sampling the population with replacement!
- For regression problems, the prediction is the average of the base predictions; for classification problems, it is the majority vote of the base predictions.
bagging principle
- Select n samples from the original sample set by bootstrap sampling (with replacement)
- Build a classifier on these n samples
- Repeat steps 1-2 to build m classifiers (m bootstrap sample sets in total)
- Classify the new data with each of the m classifiers
- Take a vote over the results of the m classifiers; the most frequent class is the final category (a from-scratch sketch follows this list).
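To make these steps concrete, here is a minimal from-scratch sketch (not sklearn's BaggingClassifier; the function name and the choice of decision trees as base classifiers are illustrative, and it assumes integer class labels 0, 1, 2, ...):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_predict(X_train, y_train, X_new, m=25, random_state=0):
    """Train m base classifiers on bootstrap samples and predict X_new by majority vote."""
    rng = np.random.RandomState(random_state)
    n = len(X_train)
    votes = []
    for _ in range(m):
        # Steps 1-2: draw a bootstrap sample of size n (with replacement) and fit a classifier on it
        idx = rng.randint(0, n, size=n)
        clf = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
        # Step 4: classify the new data with this classifier
        votes.append(clf.predict(X_new))
    # Step 5: majority vote over the m classifiers
    votes = np.array(votes)
    return np.array([np.argmax(np.bincount(col)) for col in votes.T])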
Bootstrap principle
The name Bootstrap comes from the idiom "pull yourself up by your own bootstraps", meaning to rely on your own resources; hence it is also called the self-help method. It is a sampling-with-replacement method and an important nonparametric statistical technique for estimating the variance of a statistic and then constructing interval estimates.
Its core idea and basic steps are as follows:
1) Use resampling to draw a certain number of samples (chosen by yourself) from the original sample; repeated selection is allowed.
2) Compute the given statistic T from the drawn samples.
3) Repeat the above N times (generally N > 1000) to obtain N values of the statistic T.
4) Compute the sample variance of these N values of T to obtain the variance of the statistic.
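A small illustrative sketch of these four steps, estimating the variance of the sample mean (the generated sample data, the choice of the mean as the statistic T, and N are assumptions for demonstration):

import numpy as np

rng = np.random.RandomState(0)
sample = rng.normal(loc=5.0, scale=2.0, size=200)   # the "original sample"

N = 2000                                             # number of bootstrap repetitions
boot_means = np.empty(N)
for i in range(N):
    # Step 1: resample with replacement from the original sample
    resample = rng.choice(sample, size=len(sample), replace=True)
    # Step 2: compute the statistic T (here, the mean) on the resample
    boot_means[i] = resample.mean()

# Step 4: the sample variance of the N statistics estimates the variance of T
print('Bootstrap estimate of Var(sample mean):', boot_means.var(ddof=1))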
Comparison before and after using bagging
# Import algorithm packages and dataset
from sklearn import neighbors
from sklearn import datasets
from sklearn.ensemble import BaggingClassifier
from sklearn import tree
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt

plt.style.use('ggplot')  # Use the built-in ggplot style

# The following two lines are used to display Chinese characters in plots
plt.rcParams['font.sans-serif'] = ['SimHei']   # Display Chinese labels normally
plt.rcParams['axes.unicode_minus'] = False     # Display minus signs normally

iris = datasets.load_iris()
x_data = iris.data[:, :2]
y_data = iris.target
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data)
# print(iris)
print(y_data)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
Without bagging
def plot(model):
    # Get the range of data values
    x_min, x_max = x_data[:, 0].min() - 1, x_data[:, 0].max() + 1
    y_min, y_max = x_data[:, 1].min() - 1, x_data[:, 1].max() + 1
    # Generate the grid matrix
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                         np.arange(y_min, y_max, 0.02))
    # ravel, like flatten, converts multidimensional data to one dimension;
    # flatten returns a copy, while ravel returns a view of the original data
    z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    z = z.reshape(xx.shape)
    # Contour map
    cs = plt.contourf(xx, yy, z)

# KNN
knn = neighbors.KNeighborsClassifier()
knn.fit(x_train, y_train)
# Draw the decision regions
plot(knn)
# Sample scatter diagram
plt.scatter(x_data[:, 0], x_data[:, 1], c=y_data)
plt.show()
# Accuracy
print(knn.score(x_test, y_test))

# Decision tree
dtree = tree.DecisionTreeClassifier()
dtree.fit(x_train, y_train)
# Draw the decision regions
plot(dtree)
# Sample scatter diagram
plt.scatter(x_data[:, 0], x_data[:, 1], c=y_data)
plt.show()
# Accuracy
dtree.score(x_test, y_test)
0.6578947368421053
0.6052631578947368
Using bagging
bagging_knn = BaggingClassifier(knn, n_estimators=100)
# Fit the bagging model on the training data
bagging_knn.fit(x_train, y_train)
plot(bagging_knn)
# Sample scatter diagram
plt.scatter(x_data[:, 0], x_data[:, 1], c=y_data)
plt.show()
print(bagging_knn.score(x_test, y_test))
0.7105263157894737
bagging_tree = BaggingClassifier(dtree, n_estimators=100)
# Fit the bagging model on the training data
bagging_tree.fit(x_train, y_train)
plot(bagging_tree)
# Sample scatter diagram
plt.scatter(x_data[:, 0], x_data[:, 1], c=y_data)
plt.show()
print(bagging_tree.score(x_test, y_test))
0.6578947368421053
- It can be seen that after applying bagging, both KNN and the decision tree improve.
summary
This is task 7 of ensemble learning. Only here did I really figure out what ensemble learning is: in essence, it is the combined use of multiple models, including the bagging, stacking, and voting methods learned earlier.
The general structure of ensemble learning is to produce a group of "individual learners" and then combine them with some strategy. An ensemble containing only individual learners of the same type is called homogeneous; its individual learners are also called "base learners", and the corresponding algorithm is called the "base learning algorithm". An ensemble containing different types of individual learners is called "heterogeneous", and its individual learners are called "component learners".
Multiple individual learners are combined into a strong learner to improve accuracy.
Below are some references on ensemble learning that can be consulted in later study.
Relevant references:
Detailed explanation of the principle of ensemble learning
Ensemble learning – bagging, boosting, stacking