sklearn Machine Learning

Posted by spiceydog on Mon, 03 Jan 2022 18:05:19 +0100

Task07
This study follows the Datawhale open-source learning project: https://github.com/datawhalechina/machine-learning-toy-code/tree/main/ml-with-sklearn
The content below mainly covers code implementations together with some of the underlying principles.

7. Ensemble Learning

In the previous chapter we discussed how the curse of dimensionality degrades a model's performance. Besides dimensionality reduction, subspace methods are another common way to handle such high-dimensional problems. Ensemble learning is one of the most common embodiments of the subspace idea: it combines the outputs of several algorithms or base learners that each perform well on a subspace. More generally, ensemble learning accomplishes a learning task by building and combining multiple learners, and is sometimes referred to as a multi-classifier system. Its overall strategy is to produce a set of "individual learners" and then combine them according to some strategy.

The simplest ensemble method is voting. Voting follows the majority-rule principle: by combining multiple models it reduces variance and thus improves the robustness of the ensemble. Ideally, a voting ensemble should perform better than any of its base models.

Voting can be used for both regression and classification models:

  • Regression voting: the prediction is the average of all model predictions.
  • Classification voting: the prediction is the class that appears most frequently among all model predictions.

Classification voting can further be divided into hard and soft voting (a short code sketch follows the list):

  • Hard voting: the predicted class is the one that receives the most votes from the base models.
  • Soft voting: the predicted class is the one with the highest summed predicted probability across the base models.
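
Sklearn exposes voting directly through VotingClassifier (and VotingRegressor for the averaging case). The following is a minimal sketch rather than part of the original tutorial; the base models and synthetic data are illustrative assumptions, and switching the voting parameter selects between the two strategies above:

'''VotingClassifier (hard vs. soft voting)'''
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
X, y = make_classification(n_samples=100, n_features=4, n_informative=2, n_redundant=0, random_state=0)  # Generate data
estimators = [('lr', LogisticRegression()), ('knn', KNeighborsClassifier()), ('dt', DecisionTreeClassifier(random_state=0))]
hard_clf = VotingClassifier(estimators=estimators, voting='hard').fit(X, y)  # Majority vote over predicted classes
soft_clf = VotingClassifier(estimators=estimators, voting='soft').fit(X, y)  # Vote over summed predicted probabilities
hard_clf.predict([[0, 0, 0, 0]])
soft_clf.predict([[0, 0, 0, 0]])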

7.1. Bagging

Unlike voting, Bagging does not only combine the final predictions of the models; it also uses a sampling strategy that influences how each base model is trained.

At the heart of Bagging is the bootstrap: sampling from the dataset with replacement, so the same sample may be drawn multiple times. First we randomly draw one sample and add it to a sample set, then we put it back into the initial dataset and repeat the draw K times, which yields a sample set of size K. In the same way we can draw T such sample sets of K samples each, train one base learner on each of them, and obtain T learners in total. Combining these base learners is the basic Bagging procedure.
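
As a minimal illustration of the bootstrap step itself (a NumPy sketch on an assumed toy dataset; sklearn's Bagging classes perform this sampling internally):

import numpy as np

rng = np.random.default_rng(0)
X = np.arange(10).reshape(10, 1)  # Toy dataset with 10 samples (one feature)
y = np.arange(10)

T, K = 3, 10  # T bootstrap sample sets, each of size K
for t in range(T):
    idx = rng.choice(len(X), size=K, replace=True)  # Sample K indices with replacement
    X_boot, y_boot = X[idx], y[idx]                 # One bootstrap sample set; a base learner would be trained on it
    print(f"sample set {t}: drawn indices {idx}")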

Sklearn provides APIs for both Bagging flavours: BaggingRegressor and BaggingClassifier. The two official examples below illustrate their use:

'''BaggingRegressor'''
from sklearn.svm import SVR
from sklearn.ensemble import BaggingRegressor
from sklearn.datasets import make_regression
X, y = make_regression(n_samples=100, n_features=4, n_informative=2, n_targets=1, random_state=0, shuffle=False)  # Generate data
# Use ten SVR models as base estimators (in scikit-learn >= 1.2 this parameter is called "estimator")
regr = BaggingRegressor(base_estimator=SVR(), n_estimators=10, random_state=0).fit(X, y)
regr.predict([[0, 0, 0, 0]])
# array([-2.87202411])

'''BaggingClassifier'''
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=100, n_features=4, n_informative=2, n_redundant=0, random_state=0, shuffle=False)  # Generate data
# Use ten SVC models as base estimators
clf = BaggingClassifier(base_estimator=SVC(), n_estimators=10, random_state=0).fit(X, y)
clf.predict([[0, 0, 0, 0]])
# array([1])

7.2. Boosting

Bagging reduces the prediction error mainly by reducing variance, whereas Boosting improves the final prediction by continually reducing bias. In other words, Bagging averages many strong learners, while Boosting combines many weak learners into one strong learner.

Boosting first trains a base learner on the initial training set, then adjusts the distribution of the training samples according to that learner's performance, so that the samples the previous base learner got wrong receive more attention later. The next base learner is then trained on the adjusted sample distribution. This is repeated until the number of base learners reaches a pre-specified value T, and finally the T base learners are combined with weights.

The two most common Boosting styles are Adaptive Boosting (AdaBoost) and Gradient Boosting, together with the latter's popular variants XGBoost, LightGBM, and CatBoost.

'''AdaBoostRegressor'''
from sklearn.ensemble import AdaBoostRegressor
from sklearn.datasets import make_regression
X, y = make_regression(n_features=4, n_informative=2, random_state=0, shuffle=False)  # Generate data
regr = AdaBoostRegressor(random_state=0, n_estimators=100)  # Use 100 boosting base models
regr.fit(X, y)
# AdaBoostRegressor(n_estimators=100, random_state=0)
regr.predict([[0, 0, 0, 0]])
# array([4.79722349])

'''AdaBoostClassifier'''
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=4, n_informative=2, n_redundant=0, random_state=0, shuffle=False)  # Generate data
clf = AdaBoostClassifier(n_estimators=100, random_state=0)  # Use 100 boosting base models
clf.fit(X, y)
clf.predict([[0, 0, 0, 0]])
# array([1])
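
Since Gradient Boosting is mentioned above, here is a minimal sketch using sklearn's GradientBoostingClassifier on the same kind of synthetic data; the parameters are illustrative assumptions rather than part of the original tutorial:

'''GradientBoostingClassifier'''
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=4, n_informative=2, n_redundant=0, random_state=0, shuffle=False)  # Generate data
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0)  # 100 boosting stages of shallow trees
clf.fit(X, y)
clf.predict([[0, 0, 0, 0]])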

7.3. Blending and Stacking

Stacking is not strictly an algorithm but a model-ensembling strategy. A Stacking ensemble can be understood as having two tiers: the first tier contains several base classifiers, which pass their predictions on to the second tier; the second-tier classifier, usually a logistic regression, uses the first-tier outputs as features to fit the final prediction. Blending is essentially the same as Stacking, but it replaces the cross-validation step with a single hold-out validation split, so it can be regarded as a simplified version of Stacking.

'''Blending'''
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
data, target = make_blobs(n_samples=10000, centers=2, random_state=1, cluster_std=1.0)  # Generate data
## Create training and test sets
X_train1, X_test, y_train1, y_test = train_test_split(data, target, test_size=0.2, random_state=1)
## Create training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train1, y_train1, test_size=0.3, random_state=1)

from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# Setting the first-layer classifiers
clfs = [SVC(probability=True), RandomForestClassifier(n_estimators=5, n_jobs=-1, criterion='gini'), KNeighborsClassifier()]

# Setting the second-layer (meta) classifier: a logistic regression, as described above
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()

# First layer: collect validation-set and test-set predictions
val_features = np.zeros((X_val.shape[0], len(clfs)))    # Validation prediction matrix, one column per base model
test_features = np.zeros((X_test.shape[0], len(clfs)))  # Test prediction matrix, one column per base model

for i, clf in enumerate(clfs):
    clf.fit(X_train, y_train)                               # Train the base model on the training set
    val_features[:, i] = clf.predict_proba(X_val)[:, 1]     # Predicted probabilities on the validation set
    test_features[:, i] = clf.predict_proba(X_test)[:, 1]   # Predicted probabilities on the test set

# Second layer: fit the meta classifier on the first-layer validation-set predictions
lr.fit(val_features, y_val)
# Evaluate the blended model on the test set
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, lr.predict(test_features)))
# Expected output close to 1.0 (the two blobs are well separated)

The Blending process can also be summarized as a diagram; reading it alongside the code above helps clarify the workflow.

'''Stacking'''
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
data, target = make_blobs(n_samples=10000, centers=2, random_state=1, cluster_std=1.0)  # Generate data
## Create training and test sets
X_train1, X_test, y_train1, y_test = train_test_split(data, target, test_size=0.2, random_state=1)
# Unlike Blending, no separate validation split is needed; cross-validation plays that role

from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# Setting the first-layer classifiers
clfs = [SVC(probability=True), RandomForestClassifier(n_estimators=5, n_jobs=-1, criterion='gini'), KNeighborsClassifier()]

# Setting the second-layer (meta) classifier: a logistic regression, as described above
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()

# First layer: out-of-fold predictions on the training data and averaged predictions on the test data
val_features = np.zeros((X_train1.shape[0], len(clfs)))  # Out-of-fold prediction matrix, one column per base model
test_features = np.zeros((X_test.shape[0], len(clfs)))   # Test prediction matrix, one column per base model

from sklearn.model_selection import StratifiedKFold
n_splits = 5
skf = StratifiedKFold(n_splits)
folds = list(skf.split(X_train1, y_train1))  # Materialize the 5 folds so they can be reused for every base model

for j, clf in enumerate(clfs):  # j-th base model
    test_features_j = np.zeros((X_test.shape[0], n_splits))  # Test predictions of the j-th model, one column per fold
    for i, (train, val) in enumerate(folds):  # i-th fold out of 5
        # Train on 4 folds, predict the held-out fold; its predictions become new features for that part of the training data
        X_train, y_train, X_val, y_val = X_train1[train], y_train1[train], X_train1[val], y_train1[val]
        clf.fit(X_train, y_train)                    # Train the base model
        val_features[val, j] = clf.predict(X_val)    # Out-of-fold predictions fill the rows of the held-out fold
        test_features_j[:, i] = clf.predict(X_test)  # Test-set predictions of this fold's model
    test_features[:, j] = test_features_j.mean(axis=1)  # Average the per-fold test predictions for the j-th model

# Second layer: fit the meta classifier on the first-layer out-of-fold predictions
lr.fit(val_features, y_train1)
# Evaluate the stacked model on the test set
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, lr.predict(test_features)))
# Expected output close to 1.0 (the two blobs are well separated)

The Stacking process can also be summarized as a diagram; reading it alongside the code above helps clarify the workflow.

In addition, we can implement a stacking model directly with the mlxtend toolkit (pip install mlxtend).

'''Stacking (implemented with the mlxtend toolkit)'''
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
data, target = make_blobs(n_samples=10000, centers=2, random_state=1, cluster_std=1.0)  # Generate data
## Create training and test sets
X_train1, X_test, y_train1, y_test = train_test_split(data, target, test_size=0.2, random_state=1)

from mlxtend.classifier import StackingCVClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# Setting the first-layer classifiers
clfs = [SVC(probability=True), RandomForestClassifier(n_estimators=5, n_jobs=-1, criterion='gini'), KNeighborsClassifier()]

# Setting the second-layer (meta) classifier: a logistic regression, as described above
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()

sclf = StackingCVClassifier(classifiers=clfs,    # Layer-1 classifiers
                            meta_classifier=lr,  # Layer-2 classifier
                            random_state=42)

# Fit the stacked classifier and evaluate it on the test set
sclf.fit(X_train1, y_train1)
print(sclf.score(X_test, y_test))
# Expected output close to 1.0 (the two blobs are well separated)
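
For completeness, sklearn (version 0.22 and later) also ships a built-in StackingClassifier, so the same idea can be expressed without mlxtend. The following is a minimal sketch, reusing the data splits and base models created above:

'''Stacking (sklearn's built-in StackingClassifier)'''
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

estimators = [('svc', SVC(probability=True)),
              ('rf', RandomForestClassifier(n_estimators=5, n_jobs=-1)),
              ('knn', KNeighborsClassifier())]
stack = StackingClassifier(estimators=estimators,                # Layer-1 classifiers
                           final_estimator=LogisticRegression(), # Layer-2 (meta) classifier
                           cv=5)                                 # 5-fold CV builds the meta features
stack.fit(X_train1, y_train1)  # Reuses the training split created above
print(stack.score(X_test, y_test))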

Topics: Machine Learning sklearn