Source: https://iyinst.github.io/2021/05/19/%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0%E7%AC%94%E8%AE%B0-%E6%89%8B%E6%92%95GBDT/

The principles of the gradient boosting decision tree (GBDT) will not be repeated here; readers unfamiliar with them can refer to Li Hang's book Statistical Learning Methods. This post walks through an implementation of GBDT.

Before coding, we should first consider three questions:

1. What target does each tree fit?

2. How does each node of the tree split?

3. How is the predicted value of a leaf node computed?

Once these three questions are answered clearly, GBDT reduces to building a sequence of CART trees, which simplifies the problem considerably. GBDT handles both classification and regression.

The regression task is the simpler one. When the loss function is the mean squared error (MSE), each tree fits the residual of the current model, each node splits the way a CART node splits, and the predicted value of a leaf node is the mean of the residuals of the samples assigned to it.

The classification task is more involved. Consider binary classification with the log-likelihood loss: the model outputs log-odds, much as in logistic regression. Each tree fits the difference between the true label and the predicted probability, each node again splits the way a CART node splits, and the predicted value of a leaf node is $\frac{\sum (y - P)}{\sum P (1 - P)}$, where $y$ is the true label, $P$ is the predicted probability, and the sums run over the samples in the leaf.

To summarize the three questions:

1. The fitting target of each tree is the negative gradient of the loss function. For regression with the MSE loss and for binary classification with the log-likelihood loss, the negative gradient is the difference between the true value and the predicted value (the predicted probability in the classification case).
2. Every tree in GBDT is a CART regression tree, regardless of whether the task is classification or regression, so nodes split the way CART regression tree nodes split.
3. The predicted value of a leaf node should be the value that minimizes the loss function. For regression with the MSE loss it is the mean of the residuals in the leaf; for binary classification with the log-likelihood loss it is $\frac{\sum (y - P)}{\sum P (1 - P)}$, an approximate minimizer obtained from a single Newton step.

One more point to note: a tree in GBDT is grown on $(X, r)$, where $r$ is the response value (the negative gradient), so the splits are based on the response value, whereas the predicted value of a leaf node is computed from the true labels and the current predictions of the model.
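As a minimal sketch of the first point, the negative gradient for the two losses can be written in plain Python. The function names are illustrative and not part of the implementation in this post, and the MSE loss is assumed to carry the usual factor of 1/2 so that its negative gradient is exactly the residual.

```python
import math


def sigmoid(z):
    """Map a log-odds score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def negative_gradient_mse(y, pred):
    """Negative gradient of 0.5 * (y - pred)**2 with respect to pred."""
    return [yi - pi for yi, pi in zip(y, pred)]


def negative_gradient_logloss(y, score):
    """Negative gradient of the log-likelihood loss w.r.t. the log-odds score.

    The loss for one sample is -(y * log(P) + (1 - y) * log(1 - P)) with
    P = sigmoid(score); its negative gradient is y - P.
    """
    return [yi - sigmoid(si) for yi, si in zip(y, score)]


# Regression: the target of the next tree is the plain residual.
print(negative_gradient_mse([3.0, 5.0], [2.5, 5.5]))  # [0.5, -0.5]

# Classification: a score of 0 means P = 0.5, so the targets are +/-0.5.
print(negative_gradient_logloss([1, 0], [0.0, 0.0]))  # [0.5, -0.5]
```

In both cases the next tree is fitted to "true value minus current prediction"; the only difference is that the classifier first maps its score to a probability.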

With these three questions answered, we can start writing the GBDT code. To reuse the CART code from the previous article, we need to change how CART computes the predicted value of a leaf node.
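The CART code relies on two helpers from the previous article, `Node` and `se_loss`, which are not reproduced in this post. A minimal sketch of what they might look like is given here; both definitions are assumptions inferred from how the CART code uses them (the attributes it reads and writes, and the fact that the losses of the two children are added together), not the original code.

```python
class Node:
    """A tree node; a node is a leaf once `value` is set (assumed layout)."""

    def __init__(self):
        self.value = None            # leaf prediction; None for internal nodes
        self.split_feature = None    # index of the feature used to split
        self.split_point = None      # threshold: go left if x <= split_point
        self.left = None
        self.right = None
        self.instances_index = None  # training-row indices held during fitting


def se_loss(y):
    """Sum of squared deviations from the mean (assumed definition).

    A sum rather than a mean is assumed because the CART code adds the
    losses of the two children and compares the total with the parent's.
    """
    values = [float(v) for v in y]
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


print(se_loss([1.0, 2.0, 3.0]))  # 2.0
```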

The modified CART tree is as follows.

```python
import numpy as np

# Node and se_loss are the helpers from the previous article's CART code.


class CART:
    def __init__(self, objective='regression', max_depth=10, min_samples_leaf=1,
                 min_impurity_decrease=0., real_label=None):
        self.objective = objective
        # Both objectives split on squared error; only the leaf value differs.
        if self.objective == 'regression':
            self.loss = se_loss
            # GBDT passes res = pred - y, so the leaf value is mean(y - pred).
            self.leaf_weight = lambda res, label: -np.mean(res)
        elif self.objective == 'classification':
            self.loss = se_loss
            # res = y - P, so label - res = P and 1 - label + res = 1 - P.
            self.leaf_weight = lambda res, label: np.sum(res) / np.sum((label - res) * (1 - label + res))
        self.min_impurity_decrease = min_impurity_decrease
        self.max_depth = max_depth
        self.root = Node()
        self.min_samples_leaf = min_samples_leaf
        self.depth = 1
        self.real_label = real_label  # true labels, used only for leaf values

    def fit(self, X, y):
        self.root.instances_index = list(range(X.shape[0]))
        self._generate_node(self.root, X, y, self.depth)

    def _generate_node(self, root: Node, X: np.ndarray, y: np.ndarray, depth: int):
        self.depth = max(depth, self.depth)
        # Stop splitting once the maximum depth is reached
        if depth >= self.max_depth:
            root.value = self.leaf_weight(y[root.instances_index], self.real_label[root.instances_index])
            return

        split_feature, split_point = None, None
        parent_loss = self.loss(y[root.instances_index])
        min_loss = parent_loss
        # Search every feature and every candidate threshold for the best split
        for feature_index in range(X.shape[1]):
            split_candidate = np.unique(X[root.instances_index, feature_index])
            for candidate in split_candidate:
                left = [i for i in root.instances_index if X[i, feature_index] <= candidate]
                right = [i for i in root.instances_index if X[i, feature_index] > candidate]
                # Skip splits that leave a child with too few samples
                if len(left) < self.min_samples_leaf or len(right) < self.min_samples_leaf:
                    continue
                # Loss after splitting
                split_loss = self.loss(y[left]) + self.loss(y[right])
                # Keep the best split whose gain exceeds the minimum decrease
                if split_loss < min_loss and parent_loss - split_loss > self.min_impurity_decrease:
                    min_loss = split_loss
                    split_feature = feature_index
                    split_point = candidate

        if split_feature is None:
            # No valid split: this node becomes a leaf
            root.value = self.leaf_weight(y[root.instances_index], self.real_label[root.instances_index])
        else:
            # Split the node and recurse into both children
            root.split_point = split_point
            root.split_feature = split_feature
            root.left = Node()
            root.right = Node()
            root.left.instances_index = [i for i in root.instances_index if X[i, split_feature] <= split_point]
            root.right.instances_index = [i for i in root.instances_index if X[i, split_feature] > split_point]
            root.instances_index = None
            self._generate_node(root.left, X, y, depth + 1)
            self._generate_node(root.right, X, y, depth + 1)

    def predict(self, X):
        result = np.zeros([len(X)])
        for item, x in enumerate(X):
            # Walk down from the root until a leaf is reached
            root = self.root
            while root.value is None:
                if x[root.split_feature] <= root.split_point:
                    root = root.left
                else:
                    root = root.right
            result[item] = root.value
        return result
```

Compared with the CART in the previous article, the change is `self.leaf_weight`, that is, how the value of a leaf node is computed. Only the regression tree is used here. For classification, the value of a leaf node is $\frac{\sum r}{\sum (y - r)(1 - y + r)}$, where $r = y - P$ is the response value the tree was fitted on, so that $y - r = P$ and $1 - y + r = 1 - P$.
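To see that this is the same quantity as the leaf value written in terms of $P$, a small sketch can evaluate both forms on a made-up leaf. The labels and probabilities are arbitrary illustrative numbers, and the function names are hypothetical.

```python
def leaf_value_from_residuals(r, y):
    """Leaf value written in terms of the residual r = y - P."""
    numerator = sum(r)
    denominator = sum((yi - ri) * (1 - yi + ri) for yi, ri in zip(y, r))
    return numerator / denominator


def leaf_value_from_probabilities(y, p):
    """The same leaf value written in terms of the probability P."""
    numerator = sum(yi - pi for yi, pi in zip(y, p))
    denominator = sum(pi * (1 - pi) for pi in p)
    return numerator / denominator


# Three samples that fell into one leaf (illustrative values).
y = [1, 1, 0]
p = [0.5, 0.25, 0.5]
r = [yi - pi for yi, pi in zip(y, p)]  # [0.5, 0.75, -0.5]

a = leaf_value_from_residuals(r, y)
b = leaf_value_from_probabilities(y, p)
print(a, b)  # both equal 0.75 / 0.6875 = 12/11
```

Writing the formula in terms of $r$ is convenient in code because the tree already holds the residuals it was fitted on, and only the true labels have to be passed in separately.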

With the modified CART, we build the code of GBDT.

```python
import numpy as np

# CART is the modified tree class defined above.


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


class GBDT:
    def __init__(self, objective='regression', max_tree=3, max_depth=5,
                 min_samples_leaf=2, min_impurity_decrease=0.):
        self.objective = objective
        if self.objective == 'regression':
            self.model_init_func = np.mean
            # CART negates the leaf mean, so the response is pred - y
            self.response_gene = lambda pred, y: pred - y
            self.pred_func = lambda x: x
        elif self.objective == 'classification':
            # Initial log-odds log(p / (1 - p)), with p the positive rate
            self.model_init_func = lambda y: -np.log(len(y) / np.sum(y) - 1)
            self.response_gene = lambda pred, y: y - sigmoid(pred)
            # The model outputs log-odds, so threshold the probability at 0.5
            self.pred_func = lambda x: np.where(sigmoid(x) > .5, 1, 0)
        self.model_init = None
        self.min_impurity_decrease = min_impurity_decrease
        self.max_depth = max_depth
        self.min_samples_leaf = min_samples_leaf
        self.max_tree = max_tree
        self.tree_list = []
        self.model = None  # running prediction on the training set

    def fit(self, X, y):
        if self.model is None:
            self.model_init = self.model_init_func(y)
            self.model = np.repeat(self.model_init, len(y))
        for _ in range(self.max_tree):
            # Calculate the response value (the target of the next tree)
            response = self.response_gene(self.model, y)
            # Build a CART tree on (X, response)
            new_cart = CART(objective=self.objective, max_depth=self.max_depth,
                            min_samples_leaf=self.min_samples_leaf,
                            min_impurity_decrease=self.min_impurity_decrease,
                            real_label=y)
            new_cart.fit(X, response)
            f = new_cart.predict(X)
            # Add the new tree to the model
            self.model += f
            self.tree_list.append(new_cart)

    def predict(self, X):
        # Start from the initial prediction and add every tree's output
        predict = np.full(len(X), self.model_init, dtype=float)
        for tree in self.tree_list:
            predict += tree.predict(X)
        return self.pred_func(predict)
```
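The classification initializer is easier to read once it is rewritten as the log-odds of the positive rate. The sketch below checks, on made-up labels, that the expression used in `model_init_func` equals $\log\frac{p}{1 - p}$ and that passing it through the sigmoid recovers $p$. The helper names are illustrative.

```python
import math


def initial_log_odds(y):
    """Same expression as the classification `model_init_func`."""
    return -math.log(len(y) / sum(y) - 1)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


y = [1, 1, 1, 0]  # illustrative labels, positive rate p = 0.75
p = sum(y) / len(y)

f0 = initial_log_odds(y)
print(f0, math.log(p / (1 - p)))  # both are approximately log(3)
print(sigmoid(f0))                # approximately 0.75
```

Note that the expression is undefined when the labels are all 0 or all 1, since the log-odds are then infinite.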

The GBDT code turns out to be much simpler than the CART code, because GBDT only has to compute a few key quantities, such as the initial prediction and the response value for each tree, and then build several CART trees. Our GBDT has no learning rate, which is equivalent to a learning rate of 1. A small learning rate can improve the accuracy of the model, but it slows convergence, so more trees are needed, which would make our already slow GBDT even slower.
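A learning rate would enter as a factor on each tree's contribution, so that the update becomes `model += learning_rate * f`. The sketch below illustrates the effect on a toy one-dimensional regression problem, using depth-one trees (stumps) written in plain Python instead of the CART class from this post. The data, the `fit_stump` and `boost` helpers, and the `learning_rate` parameter are all illustrative assumptions, not part of the implementation above.

```python
def fit_stump(x, r):
    """Fit a depth-one regression tree to residuals r by least squares."""
    best = None
    # Excluding the largest value keeps the right child non-empty.
    for threshold in sorted(set(x))[:-1]:
        left = [ri for xi, ri in zip(x, r) if xi <= threshold]
        right = [ri for xi, ri in zip(x, r) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((ri - left_mean) ** 2 for ri in left)
               + sum((ri - right_mean) ** 2 for ri in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def boost(x, y, n_trees, learning_rate):
    """Return the training MSE after each boosting round."""
    model = [sum(y) / len(y)] * len(y)  # initial prediction: the mean
    history = []
    for _ in range(n_trees):
        residual = [yi - mi for yi, mi in zip(y, model)]
        stump = fit_stump(x, residual)
        # Shrink each tree's contribution by the learning rate.
        model = [mi + learning_rate * stump(xi) for xi, mi in zip(x, model)]
        history.append(sum((yi - mi) ** 2 for yi, mi in zip(y, model)) / len(y))
    return history


x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 1.2, 0.9, 3.1, 3.0, 2.8, 5.2, 5.0]

for lr in (1.0, 0.1):
    print(lr, [round(e, 4) for e in boost(x, y, n_trees=5, learning_rate=lr)])
```

The training error never increases from one round to the next for either setting, but the first round with `learning_rate=1.0` removes more of it than the first round with `0.1`, which is the slower convergence mentioned above.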

Next, we test the performance of GBDT. The test code is as follows.

```python
import time

from sklearn import datasets, tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import f1_score, precision_score, recall_score

from GBDT import GBDT

# Regression
# load_boston was removed in scikit-learn 1.2, so this part needs an older version.
boston = datasets.load_boston()  # Boston house price data
X, y = boston.data, boston.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.1, random_state=32)
max_depth = 3

clf = tree.DecisionTreeRegressor(max_depth=max_depth)
clf = clf.fit(X_train, y_train)
t = time.time()  # started after fit, so this timing covers prediction only
y_pred = clf.predict(X_test)
print(mse(y_pred, y_test), time.time() - t)

gbdt = GBDT(objective='regression', max_depth=max_depth)
t = time.time()
gbdt.fit(X_train, y_train)
y_pred = gbdt.predict(X_test)
print(mse(y_pred, y_test), time.time() - t)

# Classification
cancer = datasets.load_breast_cancer()  # breast cancer data
X, y = cancer.data, cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=32)
max_depth = 3

# sklearn metrics expect (y_true, y_pred); with the arguments in the order
# used here, the printed precision and recall are swapped (F1 is unaffected).
clf = tree.DecisionTreeClassifier(max_depth=max_depth)
t = time.time()
clf = clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(precision_score(y_pred, y_test), recall_score(y_pred, y_test), f1_score(y_pred, y_test), time.time() - t)

gbdt = GBDT(objective='classification', max_depth=max_depth)
t = time.time()
gbdt.fit(X_train, y_train)
y_pred = gbdt.predict(X_test)
print(precision_score(y_pred, y_test), recall_score(y_pred, y_test), f1_score(y_pred, y_test), time.time() - t)
```

Test results.

```
29.96338709672509 0.0009992122650146484
26.27684558465425 4.171807050704956
0.9428571428571428 0.9295774647887324 0.9361702127659575 0.006042003631591797
0.8857142857142857 0.96875 0.9253731343283582 20.620769739151
```

In the regression task, the mean squared error of our GBDT (26.28) is lower than that of the sklearn decision tree it is compared against (29.96). In the classification task, the precision and F1 are slightly worse than those of the sklearn decision tree, while the recall is higher. In terms of speed, however, our implementation is more than 3000 times slower than sklearn on both the regression and the classification task.

By looking at the code, we found several interesting things.

Aren't all the trees in GBDT regression trees? Why, then, does the implementation distinguish between regression and classification? The reason is that the final result is obtained differently for the two tasks, so the leaf node prediction value is computed differently as well. We therefore give CART a regression mode and a classification mode, each with its own loss function and leaf node prediction value; in the CART code above both modes split on the squared error, and only the leaf value differs.
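As a small illustration of how the final result differs between the two tasks: the outputs of the trees are summed into a raw score in both cases, but a regression model returns that score directly, while a binary classifier treats it as log-odds and converts it to a label. The function names below are illustrative, and the 0.5 probability threshold is an assumption.

```python
import math


def predict_regression(init, tree_outputs):
    """Regression: the summed score is the prediction."""
    return init + sum(tree_outputs)


def predict_binary(init, tree_outputs, threshold=0.5):
    """Binary classification: the summed score is a log-odds value."""
    score = init + sum(tree_outputs)
    probability = 1.0 / (1.0 + math.exp(-score))
    label = 1 if probability > threshold else 0
    return label, probability


print(predict_regression(20.0, [1.5, -0.5, 0.25]))  # 21.25

label, prob = predict_binary(0.0, [0.4, 0.3])
print(label, round(prob, 3))  # 1 0.668
```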

## References

- Friedman J H. Greedy function approximation: a gradient boosting machine[J]. Annals of Statistics, 2001: 1189-1232.
- An in-depth understanding of the GBDT binary classification algorithm, https://zhuanlan.zhihu.com/p/89549390