# Machine learning algorithm series - Decision Tree Learning Algorithm

Posted by tomfmason on Wed, 16 Feb 2022 08:43:29 +0100

# 1, Introduction

In daily life, whenever mealtime comes around, we silently ask ourselves, "What am I going to eat?" Perhaps we are tired after a full day of work and do not want to walk far, so we decide that the restaurant must be within 200 meters. We then look at the 20 yuan in our wallet, decide to spend no more than 20 yuan, and finally order Lanzhou ramen. This example shows that the Lanzhou ramen we eat today is the outcome of a series of earlier decisions.

<center>Figure 1-1</center>

As shown in Figure 1-1, the decision process above can be represented as a binary tree, which is called a decision tree. In machine learning, a decision tree model like the one in Figure 1-1 can also be trained from a data set. The algorithm that does this is called the Decision Tree Learning algorithm [1].

# 2, Model introduction

### Model

A decision tree is first of all a tree structure made up of internal nodes and leaf nodes. Each internal node represents a dimension (feature), and each leaf node represents a class. A decision tree can therefore be regarded as a collection of if...else... rules.

<center>Figure 2-1</center>

Figure 2-1 shows a basic decision tree data structure and how it makes decisions.

### Feature selection

To make a decision, we first have to choose which dimension (feature) to decide on, such as the restaurant distance or the amount of money in the wallet in the earlier example. In machine learning we need a quantitative index to judge which feature is more suitable, that is, which feature yields subsets of higher "purity" after the split. Three indicators are introduced for this purpose: Information Gain, Gini Index, and mean squared error (MSE).

#### Information Gain

Equation 2-1 measures the purity of a sample set and is called Information Entropy, where D is the sample set, K is the number of classes in the sample set, and p_k is the proportion of samples in D that belong to the k-th class. The smaller the value of Ent(D), the higher the purity of the sample set.

$$\operatorname{Ent}(D)=-\sum_{k=1}^{K} p_{k} \log _{2} p_{k}$$

<center>Formula 2-1</center>
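To make the entropy formula concrete, here is a minimal sketch in plain Python. The function name `entropy` and the toy labels are illustrative assumptions, not part of the implementation later in this post.

```python
import math
from collections import Counter

def entropy(labels):
    """Information entropy Ent(D) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0  # assumption: treat an empty set as perfectly pure
    counts = Counter(labels)
    # p_k = c / n is the share of samples in class k
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Hypothetical toy labels
pure = entropy(["ramen", "ramen", "ramen", "ramen"])
mixed = entropy(["ramen", "rice", "ramen", "rice"])
```

A set with a single class has entropy 0, while an even two-class mix has entropy 1, the largest value possible for two classes.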

Equation 2-2 measures the effect of splitting the sample set on a discrete attribute and is called Information Gain, where D is the sample set, a is the discrete attribute, V is the number of possible values of a, and D^v is the subset of samples in D that take the v-th value of a.

$$\operatorname{Gain}(D, a)=\operatorname{Ent}(D)-\sum_{v=1}^{V} \frac{\left|D^{v}\right|}{|D|} \operatorname{Ent}\left(D^{v}\right)$$

<center>Formula 2-2</center>
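The information gain of a discrete attribute can be sketched as follows. This is only an illustration: the function names and the toy data are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Information entropy of a non-empty list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Gain(D, a) for one discrete attribute given as a list of values."""
    # Group the labels by attribute value: one subset D^v per value
    subsets = defaultdict(list)
    for v, label in zip(values, labels):
        subsets[v].append(label)
    n = len(labels)
    # Weighted entropy of the subsets after the split
    weighted = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted

# Hypothetical toy data: "near" separates the classes perfectly
near = ["yes", "yes", "no", "no"]
ate = ["eat", "eat", "skip", "skip"]
gain_perfect = information_gain(near, ate)
# An attribute unrelated to the class gives no gain
gain_useless = information_gain(["a", "b", "a", "b"], ate)
```

An attribute that separates the classes perfectly recovers the full entropy of the set as gain, while an unrelated attribute yields a gain of 0.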

When the attribute is continuous, its possible values are not finite as they are for a discrete attribute. In this case, the midpoints of adjacent values of the attribute in the sorted sample set can be used as candidate split points, and equation 2-2 can be rewritten as equation 2-3, where a^i is the i-th smallest value of attribute a among the n samples, T_a is the set of midpoints, and D_t^v is a subset of D. When v = -, it is the subset of samples whose value is smaller than the split point t; when v = +, it is the subset of samples whose value is larger than t. The largest information gain over all split points is taken as the information gain of the attribute.

$$\begin{aligned} T_{a} &=\left\{\frac{a^{i}+a^{i+1}}{2} \mid 1 \leq i \leq n-1\right\} \\ \operatorname{Gain}(D, a) &=\max _{t \in T_{a}} \operatorname{Gain}(D, a, t) \\ &=\max _{t \in T_{a}}\left[\operatorname{Ent}(D)-\sum_{v \in\{-,+\}} \frac{\left|D_{t}^{v}\right|}{|D|} \operatorname{Ent}\left(D_{t}^{v}\right)\right] \end{aligned}$$

<center>Formula 2-3</center>
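The search over candidate split points might look like the sketch below. It assumes, as the implementation later in this post also does, that duplicate attribute values are skipped so every midpoint lies strictly between two distinct values; the function names and toy data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Information entropy of a non-empty list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (gain, t) with the largest Gain(D, a, t) over the midpoints T_a."""
    distinct = sorted(set(values))
    # T_a: midpoints of adjacent distinct values
    candidates = [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])]
    base = entropy(labels)
    n = len(labels)
    best_gain, best_t = None, None
    for t in candidates:
        left = [y for x, y in zip(values, labels) if x < t]
        right = [y for x, y in zip(values, labels) if x > t]
        gain = base - (len(left) / n * entropy(left)
                       + len(right) / n * entropy(right))
        if best_gain is None or gain > best_gain:
            best_gain, best_t = gain, t
    return best_gain, best_t

# Hypothetical toy data: restaurant distance in meters and the decision made
distance = [50, 120, 180, 300, 450]
ate = ["eat", "eat", "eat", "skip", "skip"]
gain, t = best_threshold(distance, ate)
```

For this toy data the best split point is the midpoint between 180 and 300, because both resulting subsets are pure.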

The larger the value of Gain(D, a), the greater the purity improvement obtained by splitting the sample set on this attribute. The most suitable split attribute can thus be found, as shown in equation 2-4:

$$a_{\text {best }}=\underset{a}{\operatorname{argmax}} \operatorname{Gain}(D, a)$$

<center>Formula 2-4</center>

#### Gini Index

Equation 2-5 is another measure of the purity of a sample set, called the Gini value, where D is the sample set, K is the number of classes in the sample set, and p_k is the proportion of samples in D that belong to the k-th class. The smaller the value of Gini(D), the higher the purity of the sample set.

$$\operatorname{Gini}(D)=1-\sum_{k=1}^{K} p_{k}^{2}$$

<center>Formula 2-5</center>
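The Gini value is even simpler to compute than the entropy. The following is a minimal sketch; the function name and toy labels are assumptions for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini value Gini(D) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0  # assumption: treat an empty set as perfectly pure
    # 1 minus the sum of squared class proportions
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

# Hypothetical toy labels
gini_pure = gini(["ramen", "ramen", "ramen"])
gini_mixed = gini(["ramen", "rice", "ramen", "rice"])
```

A single-class set has a Gini value of 0, and an even two-class mix has a Gini value of 0.5.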

Equation 2-6 measures the effect of splitting the sample set on a discrete attribute and is called the Gini Index, where D is the sample set, a is the discrete attribute, V is the number of possible values of a, and D^v is the subset of samples in D that take the v-th value of a.

$$\operatorname{Gini\_index}(D, a)=\sum_{v=1}^{V} \frac{\left|D^{v}\right|}{|D|} \operatorname{Gini}\left(D^{v}\right)$$

<center>Formula 2-6</center>

As in equation 2-3, the midpoints of adjacent values of a continuous attribute are used as candidate split points, and equation 2-6 is rewritten as equation 2-7, where T_a is the set of midpoints and D_t^v is a subset of D. When v = -, it is the subset of samples whose value is smaller than the split point t; when v = +, it is the subset of samples whose value is larger than t. The smallest Gini index over all split points is taken as the Gini index of the attribute.

$$\operatorname{Gini\_index}(D, a)=\min _{t \in T_{a}} \sum_{v \in\{-,+\}} \frac{\left|D_{t}^{v}\right|}{|D|} \operatorname{Gini}\left(D_{t}^{v}\right)$$

<center>Formula 2-7</center>
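The same midpoint search, this time minimizing the Gini index, can be sketched as follows. The function names and the toy prices are hypothetical, and duplicate attribute values are assumed to be skipped.

```python
from collections import Counter

def gini(labels):
    """Gini value of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_gini_split(values, labels):
    """Return (gini_index, t) with the smallest weighted Gini value."""
    distinct = sorted(set(values))
    n = len(labels)
    best_index, best_t = None, None
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2  # midpoint of two adjacent distinct values
        left = [y for x, y in zip(values, labels) if x < t]
        right = [y for x, y in zip(values, labels) if x > t]
        index = len(left) / n * gini(left) + len(right) / n * gini(right)
        if best_index is None or index < best_index:
            best_index, best_t = index, t
    return best_index, best_t

# Hypothetical toy data: meal price in yuan and the decision made
price = [8, 12, 18, 25, 40]
ate = ["eat", "eat", "eat", "skip", "skip"]
index, t = best_gini_split(price, ate)
```

Here the split at 21.5 yuan produces two pure subsets, so the Gini index reaches its minimum of 0.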

The smaller the value of Gini_index(D, a), the higher the purity of the sample set after splitting on the attribute. The most suitable split attribute can thus be found, as shown in equation 2-8:

$$a_{\text {best }}=\underset{a}{\operatorname{argmin}} \operatorname{Gini\_index}(D, a)$$

<center>Formula 2-8</center>

#### Mean squared error (MSE)

The first two indicators let a decision tree handle classification problems. For regression problems, a different indicator is needed to choose the split feature: the mean squared error (MSE) shown in equation 2-9, where T_a is the set of midpoints and y_t^v is the set of labels of a subset. When v = -, it is the labels of the samples smaller than the split point t; when v = +, it is the labels of the samples larger than t. The subtracted term is the mean of the labels in the corresponding subset, and the squared differences are summed over every label in the subset.

$$\operatorname{MSE}(D, a)=\min _{t \in T_{a}} \sum_{v \in\{-,+\}}\left(y_{t}^{v}-\hat{y}_{t}^{v}\right)^{2}$$

<center>Formula 2-9</center>
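For regression, the split search can be sketched as below. Note that, like the implementation later in this post, the sketch sums the squared errors of both subsets without dividing by the sample count; the function names and toy data are assumptions.

```python
def squared_error(targets):
    """Sum of squared deviations of the targets from their mean."""
    if not targets:
        return 0.0
    mean = sum(targets) / len(targets)
    return sum((y - mean) ** 2 for y in targets)

def best_mse_split(values, targets):
    """Return (error, t) with the smallest total squared error."""
    distinct = sorted(set(values))
    best_err, best_t = None, None
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2  # midpoint of two adjacent distinct values
        left = [y for x, y in zip(values, targets) if x < t]
        right = [y for x, y in zip(values, targets) if x > t]
        err = squared_error(left) + squared_error(right)
        if best_err is None or err < best_err:
            best_err, best_t = err, t
    return best_err, best_t

# Hypothetical toy data: a step function with a jump between x = 3 and x = 10
x = [1, 2, 3, 10, 11, 12]
y = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
err, t = best_mse_split(x, y)
```

The split at 6.5 puts each constant segment in its own subset, so the total squared error drops to 0.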

The smaller the value of MSE(D, a), the better the decision tree fits the sample set. The most suitable split attribute can thus be found, as shown in equation 2-10:

$$a_{\text {best }}=\underset{a}{\operatorname{argmin}} \operatorname{MSE}(D, a)$$

<center>Formula 2-10</center>

Now that we know the data structure of the decision tree model and how to find the best split of a data set, let's learn how to generate a decision tree.

# 3, Algorithm steps

Because a decision tree is a tree, each of its child nodes is also a tree, so the decision tree can be generated recursively. The steps are as follows:

1. Create a new node, node.
2. If the samples contain only one class C:
   - mark node as a leaf node of class C and return node.
3. Traverse all features:
   - compute the information gain, Gini index, or mean squared error of the current feature.
4. Record the best split feature in node.
5. Split on the best feature, then call this method recursively on the left part and use the result as the left child of node.
6. Split on the best feature, then call this method recursively on the right part and use the result as the right child of node.
7. Return node.

# 4, Regularization

When a decision tree is generated recursively, the model classifies the training data very accurately but performs poorly on unseen data. This is known as overfitting. As with the remedy for overfitting introduced for linear regression, the model can be regularized.

### Depth of decision tree

Limiting the maximum depth of the decision tree has a regularizing effect and prevents overfitting. Only one extra parameter is needed in the algorithm steps to record the depth of the current recursion. When the preset maximum depth is reached, no new child nodes are generated; the current node is marked with the class that has the largest share of the samples, and the current recursion exits.

### Leaf node size of decision tree

Another way to regularize a decision tree is to limit the minimum number of samples a leaf node may contain, which also prevents overfitting. When the number of samples in a node falls to or below this limit, the current node is marked with the class that has the largest share of the samples, and the current recursion exits.
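The depth limit and the leaf-size limit can both be expressed as a single stopping check at the start of each recursive call. The sketch below is only an illustration: the helper names are hypothetical, and the parameter names mirror max_depth and min_samples_leaf used in the code later in this post.

```python
from collections import Counter

def majority_label(labels):
    """Class with the largest share of the samples."""
    return Counter(labels).most_common(1)[0][0]

def leaf_label_if_stopping(labels, depth, max_depth=None, min_samples_leaf=None):
    """Return a leaf label when a stopping rule fires, otherwise None."""
    # Only one class left: nothing more to split
    if len(set(labels)) == 1:
        return labels[0]
    # Depth limit reached
    if max_depth is not None and depth >= max_depth:
        return majority_label(labels)
    # Too few samples to split further
    if min_samples_leaf is not None and len(labels) <= min_samples_leaf:
        return majority_label(labels)
    return None

# Hypothetical toy labels
labels = ["eat", "eat", "skip"]
at_limit = leaf_label_if_stopping(labels, depth=3, max_depth=3)
keep_going = leaf_label_if_stopping(labels, depth=1, max_depth=3)
too_small = leaf_label_if_stopping(labels, depth=1, min_samples_leaf=5)
```

When the function returns None, the recursion would continue with the feature search; otherwise the node becomes a leaf with the returned label.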

### Pruning of decision tree

A decision tree can also be pruned to prevent overfitting by cutting off redundant subtrees. There are two kinds of pruning: pre-pruning and post-pruning.

#### Pre-pruning

As the name suggests, pre-pruning decides whether to generate child nodes while the decision tree is being built. The decision is made by using a validation data set to compare the accuracy with and without the child nodes. If generating the child nodes improves accuracy, they are generated; otherwise they are not.

<center>Figure 4-1, from Zhou Zhihua's Machine Learning</center>
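The pre-pruning decision for a single node could be sketched roughly as follows. The function names, the (features, label) pair format, and the toy data are assumptions for illustration; a sample equal to the threshold is sent to the right side here.

```python
from collections import Counter

def majority_label(labels):
    """Class with the largest share of the samples."""
    return Counter(labels).most_common(1)[0][0]

def accuracy(predicted, actual):
    """Fraction of predictions that match the actual labels."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

def should_split(train, valid, feature, threshold):
    """Pre-pruning test: split only if validation accuracy improves.

    train and valid are lists of (x, label) pairs; x is a feature vector.
    """
    valid_labels = [label for _, label in valid]
    # Option 1: keep the node as a leaf predicting the training majority class
    leaf = majority_label([label for _, label in train])
    acc_leaf = accuracy([leaf] * len(valid), valid_labels)
    # Option 2: split once; each side predicts its own majority class
    left = [label for x, label in train if x[feature] < threshold]
    right = [label for x, label in train if x[feature] >= threshold]
    if not left or not right:
        return False
    left_label, right_label = majority_label(left), majority_label(right)
    split_pred = [left_label if x[feature] < threshold else right_label
                  for x, _ in valid]
    return accuracy(split_pred, valid_labels) > acc_leaf

# Hypothetical toy data: one feature, the restaurant distance in meters
train = [([50], "eat"), ([100], "eat"), ([120], "eat"),
         ([300], "skip"), ([450], "skip")]
helps = should_split(train, [([80], "eat"), ([400], "skip")], 0, 200)
no_help = should_split(train, [([80], "eat"), ([90], "eat")], 0, 200)
```

In the first call the split raises validation accuracy from 0.5 to 1.0, so the child nodes would be generated; in the second the leaf is already perfect on the validation set, so the split is rejected.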

#### Post-pruning

Post-pruning first builds the complete decision tree and then works upward from the leaf nodes, using the same criterion as pre-pruning. If keeping the child nodes improves accuracy, they are retained; otherwise they are cut off.

<center>Figure 4-2, from Zhou Zhihua's Machine Learning</center>
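A bottom-up post-pruning pass might be sketched as below. The nested-dict tree format is a hypothetical one chosen for brevity (it is not the node class used later in this post), and the "majority" field is assumed to hold the majority training class at each internal node.

```python
def predict_one(node, x):
    """Walk from node down to a leaf and return its label."""
    while "label" not in node:
        side = "left" if x[node["feature"]] < node["threshold"] else "right"
        node = node[side]
    return node["label"]

def post_prune(node, valid):
    """Prune bottom-up using the validation samples that reach each node."""
    if "label" in node:
        return node
    f, t = node["feature"], node["threshold"]
    # Prune the children first, each with its own share of the validation set
    node["left"] = post_prune(node["left"], [(x, y) for x, y in valid if x[f] < t])
    node["right"] = post_prune(node["right"], [(x, y) for x, y in valid if x[f] >= t])
    if not valid:
        return node  # assumption: without validation evidence, keep the subtree
    kept = sum(predict_one(node, x) == y for x, y in valid)
    as_leaf = sum(node["majority"] == y for _, y in valid)
    # Keep the children only if they are strictly more accurate than a leaf
    return node if kept > as_leaf else {"label": node["majority"]}

# Hypothetical tree: feature 0 is distance in meters, feature 1 is price in yuan
tree = {
    "feature": 0, "threshold": 200, "majority": "eat",
    "left": {"label": "eat"},
    "right": {
        "feature": 1, "threshold": 20, "majority": "skip",
        "left": {"label": "eat"},
        "right": {"label": "skip"},
    },
}
valid = [([100, 15], "eat"), ([300, 10], "skip"), ([400, 30], "skip")]
pruned = post_prune(tree, valid)
```

In this example the price split under the right branch misclassifies one validation sample that a plain "skip" leaf gets right, so that subtree is collapsed, while the root split is kept because it is more accurate than a single leaf.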

# 5, Code implementation

Implement decision tree classification based on information gain using Python:

```python
import numpy as np

class GainNode:
    """
    Node in a classification decision tree
    Based on Information Gain
    """

    def __init__(self, feature=None, threshold=None, gain=None, left=None, right=None):
        # Index of the feature this node splits on
        self.feature = feature
        # Split threshold; for a leaf node this holds the predicted class instead
        self.threshold = threshold
        # Information gain of the split
        self.gain = gain
        # Left child
        self.left = left
        # Right child
        self.right = right

class GainTree:
    """
    Classification decision tree
    Based on Information Gain
    """

    def __init__(self, max_depth = None, min_samples_leaf = None):
        # Maximum depth of the decision tree
        self.max_depth = max_depth
        # Minimum number of samples in a node
        self.min_samples_leaf = min_samples_leaf

    def fit(self, X, y):
        """
        Fit the classification decision tree
        """
        X = np.asarray(X)
        y = np.asarray(y)
        self.root = self.buildNode(X, y, 0)
        return self

    def buildNode(self, X, y, depth):
        """
        Build a node of the classification decision tree
        """
        node = GainNode()
        # Return an empty node when there are no samples
        if len(y) == 0:
            return node
        y_classes = np.unique(y)
        # Only one class in the samples: return it as a leaf
        if len(y_classes) == 1:
            node.threshold = y_classes[0]
            return node
        # Maximum depth reached: return the majority class as a leaf
        if self.max_depth is not None and depth >= self.max_depth:
            node.threshold = max(y_classes, key=y.tolist().count)
            return node
        # Sample count at or below the minimum: return the majority class as a leaf
        if self.min_samples_leaf is not None and len(y) <= self.min_samples_leaf:
            node.threshold = max(y_classes, key=y.tolist().count)
            return node
        max_gain = -np.inf
        max_middle = None
        max_feature = None
        # Traverse all features to find the one with the largest information gain
        for i in range(X.shape[1]):
            # Information gain of feature i and its best split point
            gain, middle = self.calcGain(X[:,i], y, y_classes)
            if max_gain < gain:
                max_gain = gain
                max_middle = middle
                max_feature = i
        # No feature has two distinct values, so no split is possible
        if max_feature is None:
            node.threshold = max(y_classes, key=y.tolist().count)
            return node
        # Feature with the largest information gain
        node.feature = max_feature
        # Split threshold
        node.threshold = max_middle
        # Information gain
        node.gain = max_gain
        X_lt = X[:,max_feature] < max_middle
        X_gt = X[:,max_feature] > max_middle
        # Recurse on the left subset
        node.left = self.buildNode(X[X_lt,:], y[X_lt], depth + 1)
        # Recurse on the right subset
        node.right = self.buildNode(X[X_gt,:], y[X_gt], depth + 1)
        return node

    def calcMiddle(self, x):
        """
        Compute the midpoints of adjacent distinct values of a sorted feature
        """
        middle = []
        if len(x) == 0:
            return np.array(middle)
        start = x[0]
        for i in range(len(x) - 1):
            # Skip duplicates so every midpoint separates two distinct values
            if x[i] == x[i + 1]:
                continue
            middle.append((start + x[i + 1]) / 2)
            start = x[i + 1]
        return np.array(middle)

    def calcEnt(self, y, y_classes):
        """
        Compute the information entropy
        """
        ent = 0
        for j in range(len(y_classes)):
            p = len(y[y == y_classes[j]])/ len(y)
            # Skip absent classes: log2(0) is undefined
            if p != 0:
                ent = ent + p * np.log2(p)
        return -ent

    def calcGain(self, x, y, y_classes):
        """
        Compute the information gain of one feature and its best split point
        """
        x_sort = np.sort(x)
        middle = self.calcMiddle(x_sort)
        max_middle = -np.inf
        max_gain = -np.inf
        ent = self.calcEnt(y, y_classes)
        # Traverse every candidate midpoint
        for i in range(len(middle)):
            y_gt = y[x > middle[i]]
            y_lt = y[x < middle[i]]
            ent_gt = self.calcEnt(y_gt, y_classes)
            ent_lt = self.calcEnt(y_lt, y_classes)
            # Information gain of this split
            gain = ent - (ent_gt * len(y_gt) / len(x) + ent_lt * len(y_lt) / len(x))
            if max_gain < gain:
                max_gain = gain
                max_middle = middle[i]
        return max_gain, max_middle

    def predict(self, X):
        """
        Predict with the classification decision tree
        """
        X = np.asarray(X)
        y = np.zeros(X.shape[0])
        leaf = self.checkNode(X, y, self.root)
        # When the root itself is a leaf, every sample gets its class
        if leaf is not None:
            y[:] = leaf
        return y

    def checkNode(self, X, y, node, cond = None):
        """
        Assign classes by walking the nodes of the decision tree
        """
        # A leaf returns its class; an internal node returns None
        if node.left is None and node.right is None:
            return node.threshold
        X_lt = X[:,node.feature] < node.threshold
        if cond is not None:
            X_lt = X_lt & cond
        # Recurse into the left child
        lt = self.checkNode(X, y, node.left, X_lt)
        if lt is not None:
            y[X_lt] = lt
        # >= so that a sample equal to the threshold is still routed to a child
        X_gt = X[:,node.feature] >= node.threshold
        if cond is not None:
            X_gt = X_gt & cond
        # Recurse into the right child
        gt = self.checkNode(X, y, node.right, X_gt)
        if gt is not None:
            y[X_gt] = gt
```

Using Python to implement decision tree classification based on Gini index:

```python
import numpy as np

class GiniNode:
    """
    Node in a classification decision tree
    Based on Gini Index
    """

    def __init__(self, feature=None, threshold=None, gini_index=None, left=None, right=None):
        # Index of the feature this node splits on
        self.feature = feature
        # Split threshold; for a leaf node this holds the predicted class instead
        self.threshold = threshold
        # Gini index of the split
        self.gini_index = gini_index
        # Left child
        self.left = left
        # Right child
        self.right = right

class GiniTree:
    """
    Classification decision tree
    Based on Gini Index
    """

    def __init__(self, max_depth = None, min_samples_leaf = None):
        # Maximum depth of the decision tree
        self.max_depth = max_depth
        # Minimum number of samples in a node
        self.min_samples_leaf = min_samples_leaf

    def fit(self, X, y):
        """
        Fit the classification decision tree
        """
        X = np.asarray(X)
        y = np.asarray(y)
        self.root = self.buildNode(X, y, 0)
        return self

    def buildNode(self, X, y, depth):
        """
        Build a node of the classification decision tree
        """
        node = GiniNode()
        # Return an empty node when there are no samples
        if len(y) == 0:
            return node
        y_classes = np.unique(y)
        # Only one class in the samples: return it as a leaf
        if len(y_classes) == 1:
            node.threshold = y_classes[0]
            return node
        # Maximum depth reached: return the majority class as a leaf
        if self.max_depth is not None and depth >= self.max_depth:
            node.threshold = max(y_classes, key=y.tolist().count)
            return node
        # Sample count at or below the minimum: return the majority class as a leaf
        if self.min_samples_leaf is not None and len(y) <= self.min_samples_leaf:
            node.threshold = max(y_classes, key=y.tolist().count)
            return node
        min_gini_index = np.inf
        min_middle = None
        min_feature = None
        # Traverse all features to find the one with the smallest Gini index
        for i in range(X.shape[1]):
            # Gini index of feature i and its best split point
            gini_index, middle = self.calcGiniIndex(X[:,i], y, y_classes)
            if min_gini_index > gini_index:
                min_gini_index = gini_index
                min_middle = middle
                min_feature = i
        # No feature has two distinct values, so no split is possible
        if min_feature is None:
            node.threshold = max(y_classes, key=y.tolist().count)
            return node
        # Feature with the smallest Gini index
        node.feature = min_feature
        # Split threshold
        node.threshold = min_middle
        # Gini index
        node.gini_index = min_gini_index
        X_lt = X[:,min_feature] < min_middle
        X_gt = X[:,min_feature] > min_middle
        # Recurse on the left subset
        node.left = self.buildNode(X[X_lt,:], y[X_lt], depth + 1)
        # Recurse on the right subset
        node.right = self.buildNode(X[X_gt,:], y[X_gt], depth + 1)
        return node

    def calcMiddle(self, x):
        """
        Compute the midpoints of adjacent distinct values of a sorted feature
        """
        middle = []
        if len(x) == 0:
            return np.array(middle)
        start = x[0]
        for i in range(len(x) - 1):
            # Skip duplicates so every midpoint separates two distinct values
            if x[i] == x[i + 1]:
                continue
            middle.append((start + x[i + 1]) / 2)
            start = x[i + 1]
        return np.array(middle)

    def calcGiniIndex(self, x, y, y_classes):
        """
        Compute the Gini index of one feature and its best split point
        """
        x_sort = np.sort(x)
        middle = self.calcMiddle(x_sort)
        min_middle = np.inf
        min_gini_index = np.inf
        # Traverse every candidate midpoint
        for i in range(len(middle)):
            y_gt = y[x > middle[i]]
            y_lt = y[x < middle[i]]
            gini_gt = self.calcGini(y_gt, y_classes)
            gini_lt = self.calcGini(y_lt, y_classes)
            # Weighted Gini value of the two subsets
            gini_index = gini_gt * len(y_gt) / len(x) + gini_lt * len(y_lt) / len(x)
            if min_gini_index > gini_index:
                min_gini_index = gini_index
                min_middle = middle[i]
        return min_gini_index, min_middle

    def calcGini(self, y, y_classes):
        """
        Compute the Gini value
        """
        gini = 1
        for j in range(len(y_classes)):
            p = len(y[y == y_classes[j]])/ len(y)
            gini = gini - p * p
        return gini

    def predict(self, X):
        """
        Predict with the classification decision tree
        """
        X = np.asarray(X)
        y = np.zeros(X.shape[0])
        leaf = self.checkNode(X, y, self.root)
        # When the root itself is a leaf, every sample gets its class
        if leaf is not None:
            y[:] = leaf
        return y

    def checkNode(self, X, y, node, cond = None):
        """
        Assign classes by walking the nodes of the decision tree
        """
        # A leaf returns its class; an internal node returns None
        if node.left is None and node.right is None:
            return node.threshold
        X_lt = X[:,node.feature] < node.threshold
        if cond is not None:
            X_lt = X_lt & cond
        # Recurse into the left child
        lt = self.checkNode(X, y, node.left, X_lt)
        if lt is not None:
            y[X_lt] = lt
        # >= so that a sample equal to the threshold is still routed to a child
        X_gt = X[:,node.feature] >= node.threshold
        if cond is not None:
            X_gt = X_gt & cond
        # Recurse into the right child
        gt = self.checkNode(X, y, node.right, X_gt)
        if gt is not None:
            y[X_gt] = gt
```

Using Python to realize decision tree regression based on mean square error:

```python
import numpy as np

class RegressorNode:
    """
    Node in a regression decision tree
    """

    def __init__(self, feature=None, threshold=None, mse=None, left=None, right=None):
        # Index of the feature this node splits on
        self.feature = feature
        # Split threshold; for a leaf node this holds the predicted value instead
        self.threshold = threshold
        # Squared error of the split
        self.mse = mse
        # Left child
        self.left = left
        # Right child
        self.right = right

class RegressorTree:
    """
    Regression decision tree
    """

    def __init__(self, max_depth = None, min_samples_leaf = None):
        # Maximum depth of the decision tree
        self.max_depth = max_depth
        # Minimum number of samples in a node
        self.min_samples_leaf = min_samples_leaf

    def fit(self, X, y):
        """
        Fit the regression decision tree
        """
        X = np.asarray(X)
        y = np.asarray(y)
        self.root = self.buildNode(X, y, 0)
        return self

    def buildNode(self, X, y, depth):
        """
        Build a node of the regression decision tree
        """
        node = RegressorNode()
        # Return an empty node when there are no samples
        if len(y) == 0:
            return node
        y_classes = np.unique(y)
        # Only one distinct target value: return it as a leaf
        if len(y_classes) == 1:
            node.threshold = y_classes[0]
            return node
        # Maximum depth reached: return the mean of the targets as a leaf
        if self.max_depth is not None and depth >= self.max_depth:
            node.threshold = np.average(y)
            return node
        # Sample count at or below the minimum: return the mean of the targets as a leaf
        if self.min_samples_leaf is not None and len(y) <= self.min_samples_leaf:
            node.threshold = np.average(y)
            return node
        min_mse = np.inf
        min_middle = None
        min_feature = None
        # Traverse all features to find the one with the smallest squared error
        for i in range(X.shape[1]):
            # Squared error of feature i and its best split point
            mse, middle = self.calcMse(X[:,i], y)
            if min_mse > mse:
                min_mse = mse
                min_middle = middle
                min_feature = i
        # No feature has two distinct values, so no split is possible
        if min_feature is None:
            node.threshold = np.average(y)
            return node
        # Feature with the smallest squared error
        node.feature = min_feature
        # Split threshold
        node.threshold = min_middle
        # Squared error
        node.mse = min_mse
        X_lt = X[:,min_feature] < min_middle
        X_gt = X[:,min_feature] > min_middle
        # Recurse on the left subset
        node.left = self.buildNode(X[X_lt,:], y[X_lt], depth + 1)
        # Recurse on the right subset
        node.right = self.buildNode(X[X_gt,:], y[X_gt], depth + 1)
        return node

    def calcMiddle(self, x):
        """
        Compute the midpoints of adjacent distinct values of a sorted feature
        """
        middle = []
        if len(x) == 0:
            return np.array(middle)
        start = x[0]
        for i in range(len(x) - 1):
            # Skip duplicates so every midpoint separates two distinct values
            if x[i] == x[i + 1]:
                continue
            middle.append((start + x[i + 1]) / 2)
            start = x[i + 1]
        return np.array(middle)

    def calcMse(self, x, y):
        """
        Compute the squared error of one feature and its best split point
        """
        x_sort = np.sort(x)
        middle = self.calcMiddle(x_sort)
        min_middle = np.inf
        min_mse = np.inf
        # Traverse every candidate midpoint
        for i in range(len(middle)):
            y_gt = y[x > middle[i]]
            y_lt = y[x < middle[i]]
            avg_gt = np.average(y_gt)
            avg_lt = np.average(y_lt)
            # Sum of squared errors of both subsets around their own means
            mse = np.sum((y_lt - avg_lt) ** 2) + np.sum((y_gt - avg_gt) ** 2)
            if min_mse > mse:
                min_mse = mse
                min_middle = middle[i]
        return min_mse, min_middle

    def predict(self, X):
        """
        Predict with the regression decision tree
        """
        X = np.asarray(X)
        y = np.zeros(X.shape[0])
        leaf = self.checkNode(X, y, self.root)
        # When the root itself is a leaf, every sample gets its value
        if leaf is not None:
            y[:] = leaf
        return y

    def checkNode(self, X, y, node, cond = None):
        """
        Assign predicted values by walking the nodes of the regression tree
        """
        # A leaf returns its value; an internal node returns None
        if node.left is None and node.right is None:
            return node.threshold
        X_lt = X[:,node.feature] < node.threshold
        if cond is not None:
            X_lt = X_lt & cond
        # Recurse into the left child
        lt = self.checkNode(X, y, node.left, X_lt)
        if lt is not None:
            y[X_lt] = lt
        # >= so that a sample equal to the threshold is still routed to a child
        X_gt = X[:,node.feature] >= node.threshold
        if cond is not None:
            X_gt = X_gt & cond
        # Recurse into the right child
        gt = self.checkNode(X, y, node.right, X_gt)
        if gt is not None:
            y[X_gt] = gt
```

# 6, Third party library implementation

Decision tree classification with scikit-learn [2]:

```python
from sklearn import tree

# X: training samples of shape (n_samples, n_features); y: class labels
# Decision tree classification
clf = tree.DecisionTreeClassifier()
# Fit the data
clf = clf.fit(X, y)
```

Decision tree regression with scikit-learn [3]:

```python
from sklearn import tree

# X: training samples of shape (n_samples, n_features); y: target values
# Decision tree regression
clf = tree.DecisionTreeRegressor()
# Fit the data
clf = clf.fit(X, y)
```

# 7, Animation demonstration

Figure 7-1 shows the classification result of a decision tree without regularization, and Figure 7-2 shows the classification result of a regularized decision tree (max_depth = 3, min_samples_leaf = 5).

<center>Figure 7-1</center>

<center>Figure 7-2</center>

Figure 7-3 shows the regression result of a decision tree without regularization, and Figure 7-4 shows the regression result of a regularized decision tree (max_depth = 3, min_samples_leaf = 5).

<center>Figure 7-3</center>

<center>Figure 7-4</center>

It can be seen that the decision tree without regularization clearly overfits the training data set, while the regularized decision tree fares comparatively better.

# 8, Mind map

<center>Figure 8-1</center>