A decision tree is a basic method for classification and regression.
1, Basic concepts of decision tree
- The nodes and directed edges of a decision tree represent the following:
  - An internal node represents a feature or attribute
  - A leaf node represents a class
  - A directed edge represents a partition rule
- The directed edges followed from the root node down to a leaf node form a path
- The paths of a decision tree are mutually exclusive and collectively exhaustive: every sample falls on exactly one path
- When classifying with a decision tree, a feature of the sample is tested at each internal node and the sample is routed to a child node according to the test result; each child node corresponds to one value (or value range) of that feature
- Advantages of decision trees: strong readability and fast classification
A decision tree follows the divide-and-conquer idea and can be viewed as a set of if-then rules, as the small example below illustrates.
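As a concrete illustration, here is a minimal sketch of a tree stored as nested dictionaries, the same representation the implementation in section 5 uses, read as a set of if-then rules. The feature names, thresholds and classes here are invented for this example only.

# A toy tree in the nested-dict form also used by the implementation in section 5.
toy_tree = {'A': {'<=0.5': 0, '>0.5': {'B': {'<=0.2': 1, '>0.2': 0}}}}

def classify(tree, sample):
    # Follow one root-to-leaf path: test a feature, branch on the result.
    if not isinstance(tree, dict):        # reached a leaf: return its class
        return tree
    feature = next(iter(tree))            # feature tested at this node
    for key, subtree in tree[feature].items():
        threshold = float(key.lstrip('<=>'))
        if key.startswith('<=') and sample[feature] <= threshold:
            return classify(subtree, sample)
        if key.startswith('>') and sample[feature] > threshold:
            return classify(subtree, sample)

print(classify(toy_tree, {'A': 0.8, 'B': 0.1}))   # -> 1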
2, Decision tree splitting criteria
1. Information entropy / purity
Information entropy measures uncertainty: the greater the entropy, the greater the uncertainty of the information.
For a classification problem, the greater the entropy of the current set, the greater its uncertainty, the worse the classification, and the less pure the set.
H\left( D \right) =Ent\left( D \right) =\sum_{i=1}^n{-p_i\log _2p_i}
(where n is the number of classes and p_i is the probability of the i-th class)
An information entropy of 0 means the outcome is fully determined: the set is pure and the classification is complete.
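A minimal sketch of the entropy computation (the function name and toy label lists are mine, not from the implementation in section 5):

import math
from collections import Counter

def entropy(labels):
    # H(D) = -sum_i p_i * log2(p_i), computed from the class frequencies
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

print(entropy([1, 1, 0, 0]))   # 1.0: a half-and-half split is maximally impure
print(entropy([1, 1, 1, 1]))   # -0.0, i.e. 0: the set is pure, classification is done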
2. Information gain
If a discrete attribute A has V possible values and is used to split D, the branch node corresponding to the v-th value receives the subset D^v and carries the weight |D^v| / |D|. The reduction in entropy achieved by the split is the information gain:
Gain\left( D,A \right) =Ent\left( D \right) -\sum_{v=1}^V{\frac{|D^v|}{|D|}Ent\left( D^v \right)}
The greater the information gain, the greater the improvement in purity obtained by splitting on attribute A.
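A sketch of Gain(D, A) for a discrete attribute, following the formula above (the helper names and toy data are illustrative only):

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    # Gain(D, A) = Ent(D) - sum_v |D^v|/|D| * Ent(D^v);
    # values[i] is sample i's value of attribute A, labels[i] its class
    gain = entropy(labels)
    n = len(labels)
    for v in set(values):
        subset = [labels[i] for i in range(n) if values[i] == v]
        gain -= len(subset) / n * entropy(subset)
    return gain

# the attribute separates the two classes perfectly, so the gain equals Ent(D) = 1.0
print(information_gain(['a', 'a', 'b', 'b'], [1, 1, 0, 0]))   # 1.0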
3. ID3
ID3 (Iterative Dichotomiser 3) selects splitting features with the information gain criterion and builds the decision tree by repeatedly splitting the data.
Algorithm idea: using the labeled data, information gain and a traversal over candidate features and split points, select at each node the feature and threshold with the maximum information gain; recursing on the resulting subsets yields a classification decision tree. A simplified recursive sketch follows.
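This sketch illustrates the recursive idea of ID3 for discrete attributes, reusing the information_gain helper from the previous sketch; it is an illustration of the principle, not the continuous-valued implementation given in section 5.

def id3(samples, labels, attributes):
    # samples: list of dicts {attribute name: value}; labels: class of each sample
    if len(set(labels)) == 1:                 # all samples share one class
        return labels[0]
    if not attributes:                        # no attribute left: majority vote
        return max(set(labels), key=labels.count)
    # choose the attribute with the largest information gain
    best = max(attributes, key=lambda a: information_gain([s[a] for s in samples], labels))
    tree = {best: {}}
    for v in set(s[best] for s in samples):   # one branch per value of the best attribute
        idx = [i for i, s in enumerate(samples) if s[best] == v]
        tree[best][v] = id3([samples[i] for i in idx],
                            [labels[i] for i in idx],
                            [a for a in attributes if a != best])
    return tree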
4. Information gain ratio, C4.5 and the Gini index
The information gain criterion prefers features with many possible values. To reduce the adverse effect of this preference, the C4.5 decision tree algorithm uses the gain ratio instead:
Gain\_ratio\left( D,a \right) =\frac{Gain\left( D,a \right)}{IV\left( a \right)}
IV\left( a \right) =-\sum_{v=1}^V{\frac{|D^v|}{|D|}\log _2\frac{|D^v|}{|D|}}
IV(a) is called the intrinsic value of a: the more possible values attribute a can take, the larger IV(a) tends to be.
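A sketch of the C4.5 gain ratio, reusing the information_gain helper sketched in the information gain section above (function names are illustrative; a full C4.5 implementation also guards against IV(a) = 0):

import math
from collections import Counter

def intrinsic_value(values):
    # IV(a) = -sum_v |D^v|/|D| * log2(|D^v|/|D|), computed from the attribute values only
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def gain_ratio(values, labels):
    # Gain_ratio(D, a) = Gain(D, a) / IV(a)
    return information_gain(values, labels) / intrinsic_value(values)

print(intrinsic_value(['a', 'b', 'c', 'd']))            # 2.0: many distinct values -> large IV
print(gain_ratio(['a', 'a', 'b', 'b'], [1, 1, 0, 0]))   # 1.0 / 1.0 = 1.0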
The purity of data set D can also be measured by the Gini index:
Gini\left( D \right) =\sum_{k=1}^{|y|}{\sum_{k'\ne k}{p_kp_{k'}}}=1-\sum_{k=1}^{|y|}{p_{k}^{2}}
The smaller Gini(D), the higher the purity of D
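A corresponding sketch for the Gini index (the function name and toy labels are illustrative):

from collections import Counter

def gini(labels):
    # Gini(D) = 1 - sum_k p_k^2 over the class probabilities
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini([1, 1, 0, 0]))   # 0.5: the most impure two-class split
print(gini([1, 1, 1, 1]))   # 0.0: a pure set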
3, Decision tree algorithm
Decision tree learning usually traverses the candidate features, selects the optimal feature and split value, and partitions the training data by that feature so that each resulting subset is classified as well as possible. This process corresponds to partitioning the feature space and constructing the decision tree.
Optimal feature selection:
1. Discrete features: directly compute, from the information gain formula, the feature with the maximum information gain
2. Continuous features: discretize them with the bi-partition (dichotomy) method, taking the midpoints of consecutive sorted values as candidate split points (see the sketch after this list)
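A sketch of the bi-partition treatment of one continuous feature: candidate thresholds are the midpoints between consecutive sorted values, and the midpoint with the largest information gain is kept. This mirrors the Gain function in the implementation of section 5, but the names and toy data here are illustrative.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    # Return the (threshold, gain) pair maximising the information gain of a binary split.
    ordered = sorted(set(values))
    base = entropy(labels)
    best_t, best_gain = None, 0.0
    for lo, hi in zip(ordered, ordered[1:]):
        t = (lo + hi) / 2                                  # candidate midpoint
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        gain = (base
                - len(left) / len(labels) * entropy(left)
                - len(right) / len(labels) * entropy(right))
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

print(best_threshold([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1]))   # (0.5, 1.0)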
The tree building process is recursive. There are three recursive return conditions:
1. Samples in D belong to the same category
2. D is an empty set and cannot be divided
3. The samples in D have the same value on all attributes, or the attribute set is empty
4, ROC curve of decision tree
The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR). In the implementation below, the positive and negative sample counts collected at each leaf node are used to trace the curve and to compute the AUC (see roc_draw).
5, Implementation of continuous valued decision tree
import numpy as np
import math
from matplotlib import pyplot as plt


def loadDataSet(trainfile):
    # Load the data: column 0 is the class label, columns 1-8 are the attributes A-H
    dataMat = []
    lable = []
    lablename = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
    fr = open(trainfile)
    for line in fr.readlines():
        lineArr = line.strip().split(',')
        dataMat.append([float(lineArr[1]), float(lineArr[2]), float(lineArr[3]), float(lineArr[4]),
                        float(lineArr[5]), float(lineArr[6]), float(lineArr[7]), float(lineArr[8])])
        lable.append(float(lineArr[0]))
    fr.close()
    return dataMat, lable, lablename


def splitdata0(dataset, lable, lable_position, best_t):
    # Subset with attribute value <= best_t; the column at lable_position is removed
    splitDataSet = []
    splitlable = []
    for i in range(len(dataset)):
        if dataset[i][lable_position] <= best_t:
            DataSet = dataset[i][:lable_position]          # exclude the split attribute itself
            DataSet.extend(dataset[i][lable_position + 1:])
            splitDataSet.append(DataSet)
            splitlable.append(lable[i])
    return splitDataSet, splitlable


def splitdata1(dataset, lable, lable_position, best_t):
    # Subset with attribute value > best_t; the column at lable_position is removed
    splitDataSet = []
    splitlable = []
    for i in range(len(dataset)):
        if dataset[i][lable_position] > best_t:
            DataSet = dataset[i][:lable_position]          # exclude the split attribute itself
            DataSet.extend(dataset[i][lable_position + 1:])
            splitDataSet.append(DataSet)
            splitlable.append(lable[i])
    return splitDataSet, splitlable


def Ent(pt, pf):
    # Information entropy of a set with pt positive and pf negative samples
    if pt == 0 or pf == 0:
        return 0
    return -pt / (pt + pf) * math.log(pt / (pt + pf), 2) - pf / (pt + pf) * math.log(pf / (pt + pf), 2)


def Gain(A, lable):
    # Information gain of one continuous attribute: candidate thresholds are the
    # midpoints of consecutive sorted values; return the best threshold and its gain
    A = np.asarray(A)
    lable = np.asarray(lable)
    T = []
    t_star, max_gain = 0.0, 0.0
    pt = np.shape(np.nonzero(lable))[1]        # number of positive samples
    pf = np.shape(lable)[0] - pt               # number of negative samples
    B = np.sort(A)                             # sorted copy of A, used only for the midpoints
    for i in range(np.shape(B)[0] - 1):
        T.append(float((B[i] + B[i + 1]) / 2))
    m = len(T)
    for i in range(m):
        # counts of positive/negative samples on each side of the candidate threshold T[i]
        p0_t = np.shape(np.nonzero(lable[np.nonzero(A <= T[i])[0]]))[1]
        p0_f = np.shape(np.nonzero(lable[np.nonzero(A <= T[i])[0]] == 0))[1]
        p1_t = np.shape(np.nonzero(lable[np.nonzero(A > T[i])[0]]))[1]
        p1_f = np.shape(np.nonzero(lable[np.nonzero(A > T[i])[0]] == 0))[1]
        gain = Ent(pt, pf) - ((i + 1) / m * Ent(p0_t, p0_f)) - ((m - i - 1) / m * Ent(p1_t, p1_f))
        if max_gain < gain:
            max_gain = gain
            t_star = T[i]
    return t_star, max_gain


def bestAttributes(dataSet, lable, lablename):
    # Select the optimal attribute, its optimal partition point and its column index
    dataSet = np.asarray(dataSet)
    lable = np.asarray(lable)
    best_gain, best_t, best_lable, lable_position = 0.0, 0.0, 'A', 0
    m = np.shape(dataSet)[1]
    if len(lablename) < m:                     # not enough attribute names left to choose from
        return best_lable, best_t, lable_position
    for i in range(m):
        t, max_gain = Gain(dataSet[:, i], lable)
        if best_gain < max_gain:
            best_gain = max_gain
            best_t = t
            best_lable = lablename[i]
            lable_position = i
    return best_lable, best_t, int(lable_position)


def MostLable(lable):
    # Majority vote over the labels
    lable = np.asarray(lable)
    pt = np.count_nonzero(lable)               # number of samples labelled 1
    pf = np.shape(lable)[0] - pt
    if pt > pf:
        return 1
    return 0


def TreeGenerate(dataSet, lable, lablename):
    # Recursively build the tree as nested dictionaries
    if len(set(lable)) == 1:                   # all samples belong to the same class
        return lable[0]
    if len(dataSet[0]) == 0:                   # all attributes have been used (no columns left)
        return MostLable(lable)
    best_lable, best_t, lable_position = bestAttributes(dataSet, lable, lablename)
    if best_t == 0:                            # no informative split (samples share the same value)
        return MostLable(lable)
    tree = {best_lable: {}}
    del lablename[lable_position]
    dataSet0, lable0 = splitdata0(dataSet, lable, lable_position, best_t)
    dataSet1, lable1 = splitdata1(dataSet, lable, lable_position, best_t)
    tree[best_lable]['<=' + str(round(best_t, 3))] = TreeGenerate(dataSet0, lable0, lablename)
    tree[best_lable]['>' + str(round(best_t, 3))] = TreeGenerate(dataSet1, lable1, lablename)
    return tree


def predict(tree, feat, data, T, leaf, y, inde):
    firstFeat = list(tree.keys())[0]           # attribute tested at the current node
    secondDict = tree[firstFeat]               # subtrees of the current node
    featIndex = feat.index(firstFeat)          # which attribute (column) firstFeat corresponds to
    for key in secondDict.keys():              # the keys are '<=t' and '>t'
        if data[featIndex] <= T[featIndex]:
            # data is ordered A..H, so T must use the same order, otherwise the lookup fails
            ture_key = '<=' + str(T[featIndex])
            if ture_key == key:
                if type(secondDict[key]).__name__ == "dict":
                    # an internal node: recurse until a leaf value is reached
                    classlable = predict(secondDict[key], feat, data, T, leaf, y, inde)
                else:
                    classlable = secondDict[key]
                    # count the true labels that end up in this leaf (threshold 0.151 occurs
                    # on both branches, so its key also encodes the branch)
                    if str(key[2:7]) == '0.151':
                        if y[inde] == 0:
                            leaf[str(key[2:7]) + '0' + '0'] += 1
                        else:
                            leaf[str(key[2:7]) + '0' + '1'] += 1
                    else:
                        if y[inde] == 0:
                            leaf[str(key[2:7]) + '0'] += 1
                        else:
                            leaf[str(key[2:7]) + '1'] += 1
        else:
            ture_key = '>' + str(T[featIndex])
            if ture_key == key:
                if type(secondDict[key]).__name__ == "dict":
                    classlable = predict(secondDict[key], feat, data, T, leaf, y, inde)
                else:
                    classlable = secondDict[key]
                    if str(key[1:6]) == '0.151':
                        if y[inde] == 0:
                            leaf[str(key[1:6]) + '1' + '0'] += 1
                        else:
                            leaf[str(key[1:6]) + '1' + '1'] += 1
                    else:
                        if y[inde] == 0:
                            leaf[str(key[1:6]) + '0'] += 1
                        else:
                            leaf[str(key[1:6]) + '1'] += 1
    return classlable


def getKey(x):
    return float(x[0])


def roc_draw(leaf):
    # Build a (positives, negatives) matrix from the per-leaf counts and trace the ROC curve
    mat = np.zeros((9, 2))
    i = j = 0
    sum1 = sum0 = 0
    for key, value in leaf.items():
        if key[-1] == '0':                     # samples whose true class is negative
            mat[i][1] = value
            sum0 += value
            i += 1
        else:                                  # samples whose true class is positive
            mat[j][0] = value
            sum1 += value
            j += 1
    col_one = np.sort(mat[:, 0])
    col_two = np.sort(mat[:, 1])
    mat = np.vstack((col_one, col_two)).T
    fpr = [0]
    tpr = [0]
    s0 = s1 = 0
    s = 0                                      # accumulated area under the curve (trapezoid rule)
    for i, j in mat:
        print(j, i)
        temp1 = s1
        s1 += i
        s0 += j
        s += ((temp1 + s1) / sum1) * (j / sum0) / 2
        fpr.append(s0 / sum0)
        tpr.append(s1 / sum1)
    plt.plot(fpr, tpr, color='red')
    plt.xlabel("False Positive Rate (FPR)")
    plt.ylabel("True Positive Rate (TPR)")
    plt.grid(alpha=0.4)
    plt.show()
    return s


def con_mat(true_lable, pre_lable):
    # Confusion matrix plus accuracy, precision, recall and F1
    tp, fp, fn, tn = 0, 0, 0, 0
    for i in range(len(true_lable)):
        if true_lable[i] == pre_lable[i]:
            if true_lable[i] == 1:
                tp += 1
            else:
                tn += 1
        else:
            if true_lable[i] == 1:
                fn += 1
            else:
                fp += 1
    print("confusion matrix:", tp, fp)
    print("                 ", fn, tn)
    print("accuracy: ", (tp + tn) / (tp + tn + fn + fp))
    print("precisionScore: ", tp / (tp + fp))
    print("recallScore: ", tp / (tp + fn))
    print("F1: ", tp / (tp + (fn + fp) / 2))
    return


traindata, trainlable, lablename = loadDataSet('classification_train.txt')
tree = TreeGenerate(traindata, trainlable, lablename)
print(tree)

# verification on the test set
testdata, testlable, testlablename = loadDataSet('classification_test.txt')
T = [0.235, 0.617, 0.607, 0.165, 0.151, 0.446, 0.036, 0.008]
pre_lable = []
feat = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
leaf = {}
leaf['0.2350'] = leaf['0.6170'] = leaf['0.6070'] = leaf['0.1650'] = leaf['0.4460'] = leaf['0.0360'] = \
    leaf['0.0080'] = 0
leaf['0.2351'] = leaf['0.6171'] = leaf['0.6071'] = leaf['0.1651'] = leaf['0.4461'] = leaf['0.0361'] = \
    leaf['0.0081'] = 0
leaf['0.15100'] = leaf['0.15101'] = leaf['0.15110'] = leaf['0.15111'] = 0
inde = 0
for data in testdata:
    pre_lable.append(predict(tree, feat, data, T, leaf, testlable, inde))
    inde += 1
con_mat(testlable, pre_lable)
print("ROC AUC: ", roc_draw(leaf))