Classification case: sample imbalance in XGB

Posted by jakebrewer1 on Thu, 03 Feb 2022 10:42:13 +0100

Parameter setting

Class imbalance is a common problem in XGBoost classification. XGBoost provides the parameter scale_pos_weight to compensate for it; it is usually set to the ratio of the number of negative samples to the number of positive samples.
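As a quick sketch, the ratio can be computed directly from the labels (assuming a 0/1 label array y with the positive class labeled 1, like the one built below):

import numpy as np

y = np.array([0] * 500 + [1] * 50)     # hypothetical 10:1 label vector
spw = (y == 0).sum() / (y == 1).sum()  # negatives / positives
print(spw)                             # 10.0 -> pass as scale_pos_weight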

Classification case

Create an imbalanced dataset
import numpy as np
import xgboost as xgb
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split as TTS
from sklearn.metrics import accuracy_score as accuracy, recall_score as recall, roc_auc_score as auc

class_1 = 500  # Class 1 has 500 samples
class_2 = 50   # Class 2 has only 50
centers = [[0.0, 0.0], [2.0, 2.0]]  # Centers of the two classes
clusters_std = [1.5, 0.5]  # Standard deviations; the class with more samples is usually more spread out
X, y = make_blobs(n_samples=[class_1, class_2],
                  centers=centers,
                  cluster_std=clusters_std,
                  random_state=0, shuffle=False)

Xtrain, Xtest, Ytrain, Ytest = TTS(X, y, test_size=0.3, random_state=420)

(y == 1).sum() / y.shape[0]  # Proportion of positive samples: 50/550, about 0.09

XGBoost modeling
dtrain = xgb.DMatrix(Xtrain,Ytrain)
dtest = xgb.DMatrix(Xtest,Ytest)

# 'silent' is deprecated in recent xgboost; 'verbosity': 0 replaces it
param = {'verbosity': 0, 'objective': 'binary:logistic', "eta": 0.1, "scale_pos_weight": 1}
num_round = 100

bst = xgb.train(param, dtrain, num_round)

preds = bst.predict(dtest)

The preds above are predicted probabilities of the positive class, not labels. Predicted classes can be generated by applying a threshold, as below.

# Set a custom threshold: probabilities above it become class 1, the rest class 0
ypred = preds.copy()
ypred[preds > 0.5] = 1
ypred[ypred != 1] = 0
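A quick way to see where the default model errs is a confusion matrix (a small sketch; confusion_matrix is imported here and was not part of the original imports):

from sklearn.metrics import confusion_matrix

# Rows are true classes, columns are predicted classes; with 10:1
# imbalance, most errors should sit in the positive (minority) row
print(confusion_matrix(Ytest, ypred))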

Parameter adjustment evaluation

# Parameter setting
scale_pos_weight = [1,5,10]
names = ["negative vs positive: 1",
         "negative vs positive: 5",
         "negative vs positive: 10"]


for name, i in zip(names, scale_pos_weight):
    param = {'verbosity': 0, 'objective': 'binary:logistic', "eta": 0.1, "scale_pos_weight": i}
    clf = xgb.train(param, dtrain, num_round)
    preds = clf.predict(dtest)
    ypred = preds.copy()
    ypred[preds > 0.5] = 1
    ypred[ypred != 1] = 0
    print(name)
    print("\tAccuracy: {}".format(accuracy(Ytest, ypred)))
    print("\tRecall: {}".format(recall(Ytest, ypred)))
    print("\tAUC: {}".format(auc(Ytest, preds)))
    print()


# Different thresholds can also be tried; the model only needs to be trained
# once per scale_pos_weight value, since the threshold is applied afterwards
for name, i in zip(names, scale_pos_weight):
    param = {'verbosity': 0, 'objective': 'binary:logistic', "eta": 0.1, "scale_pos_weight": i}
    clf = xgb.train(param, dtrain, num_round)
    preds = clf.predict(dtest)
    for thres in [0.3, 0.5, 0.7, 0.9]:
        ypred = preds.copy()
        ypred[preds > thres] = 1
        ypred[ypred != 1] = 0
        print("{}, threshold: {}".format(name, thres))
        print("\tAccuracy: {}".format(accuracy(Ytest, ypred)))
        print("\tRecall: {}".format(recall(Ytest, ypred)))
        print("\tAUC: {}".format(auc(Ytest, preds)))
        print()


Result analysis and suggestions

  1. The ratio of negative to positive samples in the dataset is 10:1. Comparing scale_pos_weight values of 1, 5, and 10 in turn, the accuracy, AUC, and recall of the model on the test set all improve, showing that this parameter does give XGBoost a good ability to handle imbalanced data.
  2. In essence, scale_pos_weight works by adjusting the predicted probability values. You can observe how the probabilities are affected by inspecting the results returned by bst.predict(dtest). Therefore, when we only care about whether the predicted classes are accurate and whether the AUC or recall is good enough, scale_pos_weight can help us.
  3. However, XGBoost is used for more than classification and regression. In applications that need accurate probabilities (such as ranking), we want to keep the probabilities in their original form while still improving the model, and scale_pos_weight cannot help us there.
  4. If we only care about the overall performance of the model, we use AUC as the evaluation metric and scale_pos_weight to deal with class imbalance. If we care about predicting the correct probabilities, we cannot use scale_pos_weight to reduce the impact of imbalance. In that case we need another parameter: max_delta_step, described as "the maximum delta step we allow each tree's weight estimation to be", which can be thought of as a parameter that constrains the estimates. If we are handling class imbalance and care about getting accurate predicted probabilities, we can set max_delta_step to a finite number (such as 1) to help convergence (see the sketch after this list). max_delta_step is rarely used otherwise; class imbalance in binary classification is essentially its only use case.
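A minimal sketch of point 4, reusing dtrain/dtest from above; per the xgboost documentation, a value around 1 can help in imbalanced logistic regression:

# Keep scale_pos_weight at 1 so probabilities are not rescaled,
# and cap each tree's weight update with max_delta_step instead
param = {'verbosity': 0, 'objective': 'binary:logistic',
         "eta": 0.1, "scale_pos_weight": 1, "max_delta_step": 1}
bst_mds = xgb.train(param, dtrain, num_round)
probs = bst_mds.predict(dtest)  # probabilities stay on their original scale
print("AUC: {}".format(auc(Ytest, probs)))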

Topics: Python Big Data Machine Learning AI