[Machine Learning Notes] Implementing GBDT by Hand

Posted by izy on Wed, 09 Feb 2022 01:30:10 +0100

Source: https://iyinst.github.io/2021/05/19/%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0%E7%AC%94%E8%AE%B0-%E6%89%8B%E6%92%95GBDT/

The principles of the gradient boosting decision tree (GBDT) will not be repeated here; readers unfamiliar with them can refer to Li Hang's Statistical Learning Methods. This post walks through an implementation of GBDT from scratch.

Before writing any code, we should think through three questions:

1. What target does each tree fit?

2. How does each node of a tree split?

3. How is the predicted value of a leaf node calculated?

Thinking these three questions through turns GBDT into the problem of building several CARTs, reducing a complex problem to simple ones. Consider the two tasks, regression and classification.

The regression task is the simpler one. When the loss function is the mean squared error (MSE), each tree fits the residual of the current model, each node splits the way a CART node splits, and the predicted value of a leaf node is the mean of the samples assigned to it.

The classification task is more involved. Consider binary classification with the log-likelihood loss. The model predicts log-odds, which makes it very similar to logistic regression. Each tree fits the difference between the true label and the predicted probability, each node again splits the way a CART node splits, and the predicted value of a leaf node is $\frac{\sum (y - P)}{\sum P(1 - P)}$, where $y$ is the true label, $P$ is the predicted probability, and the sums run over the samples in the leaf.

To summarize the answers to the three questions. For question 1, each tree fits the negative gradient of the loss function; for MSE regression and for log-loss binary classification, this negative gradient is exactly the true value minus the predicted value. For question 2, every tree in GBDT is a CART regression tree, regardless of whether the task is classification or regression, so nodes split the way CART regression nodes split. For question 3, the predicted value of a leaf node should be the value that minimizes the loss function: for MSE regression it is the mean of the residuals, and for log-loss binary classification it is $\frac{\sum (y - P)}{\sum P(1 - P)}$. One more point to note: the trees in GBDT are grown on $(X, r)$, where $r$ is the response, but the leaf values are computed from the true labels and the current model's predictions. That is, the splits of the decision tree are driven by the response values, while the leaf weights use the original labels.
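To make the classification formulas concrete, here is a short derivation (a sketch in the usual GBDT notation; $F$ denotes the model's raw log-odds output):

$$L(y, F) = -\,y \log P - (1 - y)\log(1 - P), \qquad P = \sigma(F) = \frac{1}{1 + e^{-F}}$$

$$-\frac{\partial L}{\partial F} = y - \sigma(F) = y - P, \qquad \frac{\partial^2 L}{\partial F^2} = P(1 - P)$$

A single Newton step on a leaf therefore gives the leaf weight

$$\gamma_{\text{leaf}} = \frac{\sum_{i \in \text{leaf}} (y_i - P_i)}{\sum_{i \in \text{leaf}} P_i (1 - P_i)},$$

which is exactly the formula implemented below.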

With these three questions settled, we can start writing the GBDT code. To reuse the CART code from the previous article, we only need to change how the predicted value of a CART leaf node is computed.
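The code below also relies on a few helpers from the previous article that are not shown in this post: Node, se_loss, and (later, in GBDT) gini_loss and sigmoid. Here is a minimal sketch of definitions consistent with how they are used; the exact versions from the previous article may differ.

import numpy as np

class Node:
    """A binary tree node; value stays None for internal nodes."""
    def __init__(self):
        self.instances_index = None  # indices of the training samples routed here
        self.split_feature = None    # feature index used for the split
        self.split_point = None      # threshold: go left if x <= split_point
        self.left = None
        self.right = None
        self.value = None            # leaf weight (None means internal node)

def se_loss(y):
    # squared-error impurity: sum of squared deviations from the node mean
    return np.sum((y - np.mean(y)) ** 2) if len(y) > 0 else 0.

def gini_loss(y):
    # Gini impurity; assigned in GBDT below but never actually called there
    p = np.mean(y)
    return len(y) * p * (1 - p)

def sigmoid(x):
    # logistic function, mapping log-odds to probabilities
    return 1. / (1. + np.exp(-x))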

The modified CART tree is as follows.

class CART:
    def __init__(self, objective='regression', max_depth=10, min_samples_leaf=1, min_impurity_decrease=0., real_label=None):
        self.objective = objective
        if self.objective == 'regression':
            self.loss = se_loss
            # the response passed in is pred - y, so the leaf stores the mean residual y - pred
            self.leaf_weight = lambda res, label: -np.mean(res)
        elif self.objective == 'classification':
            # splits still use squared error: every tree in GBDT is a regression tree
            self.loss = se_loss
            # res = y - P, so (label - res) * (1 - label + res) = P * (1 - P)
            self.leaf_weight = lambda res, label: np.sum(res) / np.sum((label - res) * (1 - label + res))


        self.min_impurity_decrease = min_impurity_decrease
        self.max_depth = max_depth
        self.root = Node()
        self.min_samples_leaf = min_samples_leaf
        self.depth = 1
        self.real_label = real_label

    # @time_count
    def fit(self, X, y):
        # when no separate label array is given (standalone regression), fall back to y
        if self.real_label is None:
            self.real_label = y
        self.root.instances_index = list(range(X.shape[0]))
        self._generate_node(self.root, X, y, self.depth)

    def _generate_node(self, root: Node, X: np.array, y: np.array, depth: int):

        # Stop at the maximum depth and turn the node into a leaf
        self.depth = max(depth, self.depth)
        if depth >= self.max_depth:
            root.value = self.leaf_weight(y[root.instances_index], self.real_label[root.instances_index])
            return

        split_feature, split_point = -1, -1
        node_loss = self.loss(y[root.instances_index])
        min_loss = node_loss

        # Search every feature and every observed value for the best split point
        for feature_index in range(X.shape[1]):
            split_candidate = np.unique(X[root.instances_index, feature_index])
            for candidate in split_candidate:
                left = [i for i in root.instances_index if X[i, feature_index] <= candidate]
                right = [i for i in root.instances_index if X[i, feature_index] > candidate]

                # Skip splits that violate the minimum leaf size
                if len(left) < self.min_samples_leaf or len(right) < self.min_samples_leaf:
                    continue

                # The loss after splitting is the sum of the two children's losses
                split_loss = self.loss(y[left]) + self.loss(y[right])

                # Keep the best split that improves the loss by more than min_impurity_decrease
                if split_loss < min_loss and node_loss - split_loss > self.min_impurity_decrease:
                    min_loss = split_loss
                    split_feature = feature_index
                    split_point = candidate

        if split_point == -1:
            # No valid split was found: turn this node into a leaf
            root.value = self.leaf_weight(y[root.instances_index], self.real_label[root.instances_index])
        else:
            # Split the node into two children
            root.split_point = split_point
            root.split_feature = split_feature
            root.left = Node()
            root.right = Node()

            root.left.instances_index = [i for i in root.instances_index if X[i][split_feature] <= split_point]
            root.right.instances_index = [i for i in root.instances_index if X[i][split_feature] > split_point]
            root.instances_index = None

            self._generate_node(root.left, X, y, depth + 1)
            self._generate_node(root.right, X, y, depth + 1)

    def predict(self, X):
        result = np.zeros([len(X)])
        for item, x in enumerate(X):
            # Walk the sample down the tree until a leaf node is reached
            root = self.root
            while root.value is None:
                if x[root.split_feature] <= root.split_point:
                    root = root.left
                else:
                    root = root.right
            result[item] = root.value
        return result
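As a quick sanity check, the modified CART can still be used on its own for plain regression. A throwaway example on synthetic data (remember that in our convention the regression response is pred − y, so starting from a prediction of 0 we pass the negated targets):

import numpy as np

X_toy = np.array([[1.], [2.], [3.], [4.]])
r_toy = np.array([1.0, 1.2, 3.0, 3.2])

cart = CART(objective='regression', max_depth=3)
cart.fit(X_toy, -r_toy)       # response = pred - y = 0 - r_toy
print(cart.predict(X_toy))    # ≈ [1.0, 1.2, 3.0, 3.2]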

Compared with the CART in the previous article, the only change is self.leaf_weight, i.e., the way the leaf node weight is generated. Only the regression tree is used here; for the classification objective the leaf weight is $\frac{\sum r}{\sum (y - r)(1 - y + r)}$, where $r = y - P$ is the fitted response, so the denominator is simply $\sum P(1 - P)$.
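A quick numeric check of that identity (a throwaway snippet, not part of the model code):

import numpy as np

y = np.array([1., 0., 1.])       # true labels
P = np.array([0.7, 0.2, 0.9])    # current predicted probabilities
r = y - P                        # response fitted by the tree

# (y - r) * (1 - y + r) recovers P * (1 - P) exactly
print(np.allclose((y - r) * (1 - y + r), P * (1 - P)))  # True

# the leaf weight computed both ways agrees
print(np.sum(r) / np.sum((y - r) * (1 - y + r)))
print(np.sum(y - P) / np.sum(P * (1 - P)))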

With the modified CART in hand, we can build the GBDT itself.

class GBDT:
    def __init__(self, objective='regression', max_tree=3, max_depth=5, min_samples_leaf=2, min_impurity_decrease=0.):
        self.objective = objective
        # note: self.loss and self.leaf_weight are stored for reference only;
        # the splitting loss and the leaf weights are actually computed inside CART
        if self.objective == 'regression':
            self.loss = se_loss
            self.leaf_weight = np.mean
            # initialize the model with the mean of the targets
            self.model_init_func = np.mean
            # response handed to each tree (CART's leaf_weight negates its mean,
            # so each tree effectively fits the residual y - pred)
            self.response_gene = lambda pred, y: pred - y
            # raw scores are already the predictions
            self.pred_func = lambda x: x
        elif self.objective == 'classification':
            self.loss = gini_loss
            self.leaf_weight = lambda y, pred: (y - pred) / (pred * (1 - pred))
            # initialize with the log-odds of the positive class: log(p / (1 - p))
            self.model_init_func = lambda y: -np.log(len(y) / np.sum(y) - 1)
            # negative gradient of the log loss: true label minus predicted probability
            self.response_gene = lambda pred, y: y - sigmoid(pred)
            # raw scores are log-odds, so threshold at 0 (i.e. probability 0.5)
            self.pred_func = lambda x: np.where(x > 0, 1, 0)

        self.model_init = None
        self.min_impurity_decrease = min_impurity_decrease
        self.max_depth = max_depth
        self.min_samples_leaf = min_samples_leaf
        self.depth = 1
        self.max_tree = max_tree
        self.tree_list = []
        self.tree_num = 0
        self.model = None

    def fit(self, X, y):
        if self.model is None:
            self.model_init = self.model_init_func(y)
            self.model = np.repeat(self.model_init, len(y))
        for tree_num in range(self.max_tree):
            # Compute the response (negative gradient) for the current model
            response = self.response_gene(self.model, y)

            # Fit a CART regression tree to (X, response); leaf weights use the true labels
            new_cart = CART(objective=self.objective, max_depth=self.max_depth,
                            min_samples_leaf=self.min_samples_leaf,
                            min_impurity_decrease=self.min_impurity_decrease,
                            real_label=y)
            new_cart.fit(X, response)
            f = new_cart.predict(X)

            # Update the current model and store the tree
            self.model += f
            self.tree_list.append(new_cart)

    def predict(self, X):
        # start from the constant initial score and add every tree's contribution
        predict = np.full(len(X), self.model_init)
        for tree in self.tree_list:
            predict += tree.predict(X)
        return self.pred_func(predict)

The GBDT code turns out to be much simpler than the CART code: GBDT only needs to compute a few key values (the initial prediction, the response fitted by each tree, and so on) and then build several CARTs. Note that our GBDT has no learning rate, which is equivalent to a learning rate of 1. A small learning rate usually improves accuracy, but it slows convergence, which would make our already slow GBDT even slower.
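If you do want shrinkage, the change is small. A minimal sketch, reusing the classes above (the learning_rate parameter and the ShrinkageGBDT name are additions of mine, not part of the original code):

class ShrinkageGBDT(GBDT):
    """GBDT with a learning rate (shrinkage) applied to every tree."""
    def __init__(self, learning_rate=0.1, **kwargs):
        super().__init__(**kwargs)
        self.learning_rate = learning_rate

    def fit(self, X, y):
        if self.model is None:
            self.model_init = self.model_init_func(y)
            self.model = np.repeat(self.model_init, len(y))
        for _ in range(self.max_tree):
            response = self.response_gene(self.model, y)
            new_cart = CART(objective=self.objective, max_depth=self.max_depth,
                            min_samples_leaf=self.min_samples_leaf,
                            min_impurity_decrease=self.min_impurity_decrease,
                            real_label=y)
            new_cart.fit(X, response)
            # scale each tree's contribution by the learning rate
            self.model += self.learning_rate * new_cart.predict(X)
            self.tree_list.append(new_cart)

    def predict(self, X):
        predict = np.full(len(X), self.model_init)
        for tree in self.tree_list:
            predict += self.learning_rate * tree.predict(X)
        return self.pred_func(predict)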

Next, we test the performance of our GBDT against sklearn's plain decision trees. The test code is as follows.

from sklearn import datasets  # Import library
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import f1_score, precision_score, recall_score
from GBDT import GBDT
from sklearn import tree
import time

# Regression
boston = datasets.load_boston()  # Boston house prices (note: removed in scikit-learn 1.2)
X, y = boston.data, boston.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.1, random_state=32)
max_depth = 3

clf = tree.DecisionTreeRegressor(max_depth=max_depth)
t = time.time()
clf = clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(mse(y_test, y_pred), time.time() - t)

gbdt = GBDT(objective='regression', max_depth=max_depth)
t = time.time()
gbdt.fit(X_train, y_train)
y_pred = gbdt.predict(X_test)
print(mse(y_test, y_pred), time.time() - t)

# Classification
cancer = datasets.load_breast_cancer()  # breast cancer dataset
X, y = cancer.data, cancer.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=32)
max_depth = 3

clf = tree.DecisionTreeClassifier(max_depth=max_depth)
t = time.time()
clf = clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(precision_score(y_test, y_pred), recall_score(y_test, y_pred),
      f1_score(y_test, y_pred), time.time() - t)

gbdt = GBDT(objective='classification', max_depth=max_depth)
t = time.time()
gbdt.fit(X_train, y_train)
y_pred = gbdt.predict(X_test)
print(precision_score(y_test, y_pred), recall_score(y_test, y_pred),
      f1_score(y_test, y_pred), time.time() - t)

Test results.

29.96338709672509 0.0009992122650146484
26.27684558465425 4.171807050704956
0.9295774647887324 0.9428571428571428 0.9361702127659575 0.006042003631591797
0.96875 0.8857142857142857 0.9253731343283582 20.620769739151

It can be seen that on the regression task the mean squared error of our hand-written GBDT beats the sklearn decision tree. On the classification task, precision is higher than sklearn's, while recall and F1 are slightly lower. In terms of speed, however, both the regression and the classification task are more than 3000 times slower than sklearn.

Looking back at the code, one point is worth clarifying.

Aren't all the trees in GBDT supposed to be regression trees? Why does the implementation distinguish between regression and classification at all? The reason is that different tasks produce their final result in different ways, so the loss function and the leaf-node values of the CART are computed differently. The regression/classification switch in CART therefore only selects how the loss and the leaf values are computed; the tree structure itself is always a regression tree.

References

  1. Friedman, J. H. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 2001: 1189-1232.
  2. An in-depth understanding of the GBDT binary classification algorithm, https://zhuanlan.zhihu.com/p/89549390
