Supervised algorithm decision tree
1: Algorithm overview
Decision trees include classification trees and regression trees. Regression trees are used less often, so this article mainly introduces classification trees
Algorithm flow: feature selection -> decision tree generation -> decision tree pruning
2: Feature selection
2.1 Shannon entropy (information entropy)
Entropy measures the degree of disorder in information: the more chaotic the information, the higher the entropy
Calculation formula of entropy:
$$Entropy(m) = -\sum_{i=1}^{k} p_i \log_2 p_i$$
When p = 0 or p = 1 for a class, that class contributes nothing to the entropy; if one class has probability 1, the entropy is 0, because the outcome is completely certain and there is no disorder in the information
```python
# Calculation of information entropy
import numpy as np

p1 = 0.1
p2 = 0.9
s = -p1 * np.log2(p1) - p2 * np.log2(p2)
print(s)
```
Gini index:

$$\Phi(p, 1-p) = 1 - \sum_{i=1}^{m} p_i^2$$

Classification error:

$$\Phi(p, 1-p) = 1 - \max(p, 1-p)$$
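To compare the three measures numerically, here is a minimal sketch (the helper names entropy, gini and class_error are illustrative, not from the original text) that evaluates each one for a binary node with class probability p; all three reach their maximum at p = 0.5 and drop to 0 when one class is certain.

```python
# Illustrative comparison of the three impurity measures for a binary node
import numpy as np

def entropy(p):
    probs = np.array([p, 1 - p])
    probs = probs[probs > 0]               # treat 0 * log2(0) as 0
    return -(probs * np.log2(probs)).sum()

def gini(p):
    return 1 - (p ** 2 + (1 - p) ** 2)     # 1 - sum of squared class probabilities

def class_error(p):
    return 1 - max(p, 1 - p)               # 1 - probability of the majority class

for p in [0.1, 0.5, 0.9]:
    print(p, entropy(p), gini(p), class_error(p))
```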
Code implementation of information entropy:
Here the entropy of the whole data set is calculated from the label column (Gender); the resulting entropy value is relatively high
```python
# Code implementation of information entropy
# 1. Data preparation
import pandas as pd
import numpy as np

data = pd.DataFrame(
    data={
        'Basketball': [0, 1, 1, 1, 0],
        'game':       [1, 0, 0, 1, 1],
        'Gender':     [0, 1, 1, 0, 0],
    }
)
print(data)

# Calculate the entropy of all data through the label column
def entropy(data):
    # Get number of rows
    rows = data.shape[0]
    # Get the gender column and find the number of each category
    sex_count = data.iloc[:, -1].value_counts()
    # Find the probability value
    p = sex_count / rows
    # Compute the entropy
    score = (-(p * np.log2(p))).sum()
    return score

entropy(data)   # 0.9709505944546686
```
2.2 information gain
The greater the difference between the impurity of the parent node (before the split) and the impurity of the branch nodes (after the split), the greater the "purity improvement" obtained by splitting on attribute a, and the better the test condition.
Formula:

$$\text{Information gain} = I(\text{parent node}) - I(\text{child nodes})$$

The parent node represents the overall information entropy; the child nodes represent the information entropy after splitting on the specified feature.
Case: calculate the information gain of the Basketball feature
Data:
Basketball | game | Gender
---|---|---
0 | 0 | 0
1 | 0 | 1
1 | 0 | 1
1 | 0 | 0
0 | 1 | 0
Split: since we are computing the information gain of the Basketball feature, the data set is partitioned by the Basketball feature
Conclusion:
Information gain of basketball characteristics = 0.97 - 0.55 = 0.42
Similarly, calculate the information gain of the game = 0.97 - 0.8 = 0.17
Comparing the information gain of Basketball and game shows that the Basketball feature brings the larger gain. Therefore, with only these two features available, Basketball should be chosen as the root node to split into child nodes; a quick numerical check follows.
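The sketch below (illustrative, not part of the original text; the helper name ent is an assumption) recomputes both gains from the same data.

```python
# Quick check of the Basketball / game information gains
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'Basketball': [0, 1, 1, 1, 0],
    'game':       [0, 0, 0, 0, 1],
    'Gender':     [0, 1, 1, 0, 0],
})

def ent(labels):
    p = labels.value_counts() / len(labels)
    return float(-(p * np.log2(p)).sum())

total = ent(data['Gender'])                            # ~0.971
for col in ['Basketball', 'game']:
    child = sum(len(g) / len(data) * ent(g['Gender'])  # weighted entropy of the child nodes
                for _, g in data.groupby(col))
    print(col, 'gain =', round(total - child, 2))      # Basketball ~0.42, game ~0.17
```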
2.3 dividing data sets
The main criterion for dividing a data set is to select the split with the maximum information gain, i.e. the direction in which the entropy (uncertainty) decreases the fastest
Code implementation of the maximum information gain calculation:
```python
# Maximum information gain calculation
# 1. Dataset
import pandas as pd
import numpy as np

data = pd.DataFrame(
    data={
        'Basketball': [0, 1, 1, 1, 0],
        'game':       [0, 0, 0, 0, 1],
        'Gender':     [0, 1, 1, 0, 0],
    }
)
print(data)

# 2. Return the information entropy of the whole data set
def allEnt(data):
    # Number of rows
    n = data.shape[0]
    # Categories of the label column and their probabilities
    cnt = data.iloc[:, -1].value_counts()
    p = cnt / n
    # Information entropy
    ent = (-p * np.log2(p)).sum()
    return ent

# 3. Loop over each feature column (against the label column) to calculate information entropy
def preEnt(data):
    # First calculate the total entropy
    allent = allEnt(data)
    print('Total information entropy =', allent)
    # Total number of rows
    rows = data.shape[0]
    print('Total number of rows =', rows)
    # Variable used to track the highest information gain
    best = -1
    # Variable that receives the column index, because it changes every loop
    axis = 0
    for i in range(0, data.shape[1] - 1):
        # Take one column at a time and get the distinct values of that column
        flags = data.iloc[:, i].value_counts().index
        print(flags)
        ent = 0
        for j in flags:
            # Traverse the values of the column; each value defines a child data set
            # (the child data set still contains redundant columns, but that does not affect allEnt())
            childSet = data[data.iloc[:, i] == j]          # each child data set
            # print(childSet)
            childEnt = allEnt(childSet)                    # information entropy of each subset
            # print(childEnt)
            ent += childEnt * (childSet.shape[0] / rows)   # weighted child entropy, accumulated
        # Calculate information gain
        zy = allent - ent
        print('Information entropy of column {} ='.format(i), ent)
        print('Information gain of column {} ='.format(i), zy)
        # Keep track of the feature with the highest information gain
        if zy >= best:
            best = zy     # highest information gain so far
            axis = i      # column index with the highest information gain
    print('Highest information gain =', best)
    print('Column with the highest information gain =', axis)
    return best, axis

preEnt(data)
# Result: (0.4199730940219749, 0) -> the highest information gain is 0.42, produced by column 0
```
2.4 data set division by given column
The previous step found the column with the highest information gain and returned its column index; the data set is now divided by that column
Objective: to prepare for further division
```python
# Data is divided by the specified column and a given value of that column
# 1. Dataset
import pandas as pd
import numpy as np

data = pd.DataFrame(
    data={
        'Basketball': [0, 1, 1, 1, 0],
        'game':       [0, 0, 0, 0, 1],
        'Gender':     [0, 1, 1, 0, 0],
    }
)
print(data)

# 2. Divide the data set. As computed above, column 0 has the highest information gain,
#    so column 0 is used for the split
def split(data, column, value):
    col = data.columns[column]                       # name of the specified column
    childSet = data[data.iloc[:, column] == value]   # rows where the specified column equals value
    childSet = childSet.drop(col, axis=1)            # drop the column that was split on
    return childSet

split(data, 0, 1)
```
3: Decision tree algorithm
Algorithm principle: recursion
Algorithm variants: ID3 / C4.5 / C5.0
The stopping conditions for the recursion are:
- The program traverses all the attributes that divide the dataset
- All instances under each branch have the same classification
- The sample set contained in the current node is empty and cannot be divided
3.1 ID3
The core of ID3 algorithm is to apply information gain criterion to select features at each node of decision tree and construct decision tree recursively. The specific methods are:
- Starting from the root node, the information gain of all possible features is calculated for the node.
- The feature with the largest information gain is selected as the feature of the node, and the child nodes are established according to the different values of the feature.
- Then call the above methods on the child nodes to build a decision tree.
- Until the information gain of every feature is very small or no features remain to select; the result is a decision tree (a recursive sketch follows this list).
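The sketch below is an illustrative ID3 implementation (not the author's code; the names all_ent, best_feature and create_tree are assumptions) that builds a nested-dict tree recursively, reusing the ideas of preEnt() and split() from section 2.

```python
# Illustrative recursive ID3 sketch producing a nested-dict tree
import numpy as np
import pandas as pd

def all_ent(data):
    p = data.iloc[:, -1].value_counts() / data.shape[0]
    return float(-(p * np.log2(p)).sum())

def best_feature(data):
    base, best_gain, best_col = all_ent(data), -1, 0
    for i in range(data.shape[1] - 1):
        child_ent = sum(len(g) / len(data) * all_ent(g)
                        for _, g in data.groupby(data.columns[i]))
        gain = base - child_ent               # information gain of splitting on column i
        if gain > best_gain:
            best_gain, best_col = gain, i
    return best_col

def create_tree(data):
    labels = data.iloc[:, -1]
    if labels.nunique() == 1:                 # all samples share one class
        return labels.iloc[0]
    if data.shape[1] == 1:                    # no features left: majority vote
        return labels.value_counts().idxmax()
    col = data.columns[best_feature(data)]    # feature with the largest information gain
    tree = {col: {}}
    for value, subset in data.groupby(col):
        tree[col][value] = create_tree(subset.drop(col, axis=1))
    return tree

data = pd.DataFrame({'Basketball': [0, 1, 1, 1, 0],
                     'game':       [0, 0, 0, 0, 1],
                     'Gender':     [0, 1, 1, 0, 0]})
print(create_tree(data))   # nested dict, e.g. {'Basketball': {0: 0, 1: {...}}}
```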
Disadvantages of ID3:
The limitation of ID3 algorithm mainly comes from the local optimization condition, that is, the calculation method of information gain. Its limitations mainly include the following points:
- Information gain is biased toward features with many values: the more branches a split produces (the more levels the feature has), the smaller the total information entropy of the child nodes. Since ID3 splits on a whole column, some columns may not explain the result well; in the extreme case of splitting on an ID column, every branch reaches 100% purity, yet such a split is useless.
- Continuous variables cannot be processed directly. To use ID3 on continuous variables, they must first be discretized.
- It is sensitive to missing values; missing values must be handled before using ID3.
- There is no pruning, which easily leads to overfitting: the model performs well on the training set but poorly on the test set.
3.2 C4.5
C4.5 improves on ID3 by modifying the local optimization criterion
- ID3 uses information gain to choose the splitting feature, which is biased toward features with many values.
- C4.5 corrects this problem by using the information gain ratio instead.
- However, the gain-ratio criterion has a preference for attributes with few values, so C4.5 does not simply choose the candidate attribute with the largest gain ratio. Instead it uses a heuristic: first pick the candidate attributes whose information gain is above average, and then select the one with the highest gain ratio among them.
Formula:

$$Gain\_ratio(D, a) = \frac{Gain(D, a)}{IV(a)}, \qquad IV(a) = -\sum_{v=1}^{V} \frac{|D^v|}{|D|} \log_2 \frac{|D^v|}{|D|}$$

IV(a) is called the intrinsic value of attribute/feature a, and V is the number of levels (distinct values) of feature a, so the larger V is, the higher IV(a) tends to be.
Example: calculating the information gain ratio:
 | feature a | feature b | label
---|---|---|---
0 | 0 | 1 | yes
1 | 0 | 1 | yes
2 | 0 | 0 | no
3 | 1 | 1 | no
4 | 1 | 1 | no
Using the data above, calculate the information gain ratio of feature a:
IV(a) = -3/5 * log2(3/5) - 2/5 * log2(2/5) = 0.971
Total information entropy: ent_all = -2/5 * log2(2/5) - 3/5 * log2(3/5) = 0.97
Information entropy after splitting on feature a: ent_a = 3/5 * (-(2/3 * log2(2/3) + 1/3 * log2(1/3))) + 2/5 * 0 = 0.55
Information gain of feature a: Gain(D,a) = 0.97 - 0.55 = 0.42
Information gain ratio of feature a: Gain_ratio(D,a) = Gain(D,a) / IV(a) = 0.42 / 0.971 ≈ 0.432
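These numbers can be reproduced with a short sketch (illustrative; the helper name ent is an assumption, not from the original).

```python
# Reproduce the gain ratio of feature a from the table above
import numpy as np
import pandas as pd

a     = pd.Series([0, 0, 0, 1, 1])   # feature a
label = pd.Series([1, 1, 0, 0, 0])   # yes = 1, no = 0

def ent(s):
    p = s.value_counts() / len(s)
    return float(-(p * np.log2(p)).sum())

ent_all = ent(label)                                            # ~0.971
ent_a   = sum(len(label[a == v]) / len(a) * ent(label[a == v])  # weighted child entropy
              for v in a.unique())                              # ~0.55
gain    = ent_all - ent_a                                       # ~0.42
iv_a    = ent(a)                                                # intrinsic value IV(a), ~0.971
print(gain / iv_a)                                              # gain ratio ~0.432
```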
How C4.5 handles continuous variables:

C4.5 also adds handling for continuous variables. If an input feature is continuous, the algorithm first sorts the values of that column from smallest to largest, then takes the midpoint of each pair of adjacent values as a candidate cut point for splitting the data set. If a continuous variable has N distinct values, C4.5 generates N-1 candidate cut points, and each cut point represents one binary-tree partition scheme. A small sketch follows.
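A minimal sketch of the cut-point generation (the values array is made up for illustration):

```python
# Generate C4.5-style candidate cut points for a continuous feature
import numpy as np

values = np.array([2.5, 1.0, 3.0, 2.0, 4.5])             # made-up continuous feature values
sorted_vals = np.sort(values)                            # sort ascending
cut_points = (sorted_vals[:-1] + sorted_vals[1:]) / 2    # midpoint of each adjacent pair
print(cut_points)   # N values give N-1 candidates, here 1.5, 2.25, 2.75, 3.75
```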
4: Goodness of fit optimization (CART algorithm)
First judge whether the model is overfitting or underfitting.
Pruning
There is essentially only one way to prevent a decision tree from overfitting: pruning.
 | pre-pruning | post-pruning
---|---|---
Number of branches | Many branches are not expanded | More branches are retained
Fitting risk | The risk of overfitting is reduced, but the greedy nature of forbidding subsequent branches brings a risk of underfitting | The full decision tree is generated first and then examined node by node from the bottom up; the risk of underfitting is small and the generalization ability is strong
Time cost | Low training and testing cost | Training cost is considerably higher than with pre-pruning
Therefore, post pruning is commonly used
Pruning algorithm – CART algorithm:
- The splitting process is a binary recursive partitioning process
- The predictor variable x in CART can be either a continuous variable or a categorical variable (a categorical variable may need to be converted into dummy variables)
- The data is processed in its original form and does not need to be discretized
- When used for numerical prediction, no regression equation is fitted; the prediction is the average value of the cases that reach the leaf node
Splitting criterion of CART algorithm: binary splitting
Binary recursive partitioning: if the condition holds, the sample goes to the left branch; otherwise it goes to the right
- For continuous variables: the condition is that the attribute is less than or equal to the optimal splitting point
- For category variables: the condition is that the attribute belongs to several classes
Advantages of binary splitting:
Compared with multiway splitting, which fragments the data too quickly, binary splitting allows repeated splits on the same attribute, i.e. enough partitions can be generated from a single attribute. The improvement in prediction performance brought by binary splitting is enough to make up for the corresponding loss in tree readability.
The splitting criterion differs depending on the type of the target variable y (a small sketch of both criteria follows this list):
- Classification tree: the Gini criterion. Similar to the information gain above, the Gini coefficient measures the impurity of a node.
- Regression tree: a common splitting criterion is standard deviation reduction (SDR), which is similar to the least-squares (LS) criterion.
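A small sketch of both criteria (the helpers gini, weighted_gini and sdr are illustrative names, and the arrays are made-up examples):

```python
# Illustrative split scores: weighted Gini for a classification split, SDR for a regression split
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - (p ** 2).sum()

def weighted_gini(left, right):
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def sdr(y, left, right):
    # standard deviation reduction: parent std minus the weighted std of the two children
    n = len(y)
    return np.std(y) - (len(left) / n * np.std(left) + len(right) / n * np.std(right))

# classification: split the labels [1,1,0,0,0] into [1,1,0] / [0,0]
print(weighted_gini(np.array([1, 1, 0]), np.array([0, 0])))   # ~0.267

# regression: split y with the same partition
y = np.array([3.0, 2.8, 1.1, 1.0, 0.9])
print(sdr(y, y[:3], y[3:]))
```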
How to prune CART algorithm:
5: Generating a decision tree with sklearn
5.1 parameter criterion
criterion (two options) | entropy (information entropy) | gini (Gini coefficient)
---|---|---
Sensitivity | More sensitive to impurity; the penalty on impurity is the strongest |
Speed | Slower to compute because it involves a logarithm | No logarithm, faster
Use case | High-dimensional or noisy data easily overfits; choose information entropy when the model underfits | Use for noisy data

A quick comparison sketch follows.
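The sketch below (not the author's code) fits one tree per criterion on the wine data set also used in the next section; the scores are usually close on such a small data set.

```python
# Fit one tree per criterion on load_wine and compare test accuracy
from sklearn import tree
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, y, test_size=0.3, random_state=420)

for crit in ['entropy', 'gini']:
    clf = tree.DecisionTreeClassifier(criterion=crit, random_state=420)
    clf.fit(Xtrain, Ytrain)
    print(crit, clf.score(Xtest, Ytest))
```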
5.2 code implementation
```python
# Implementing a decision tree with sklearn
# 1. Import the required algorithm libraries and modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import tree
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

plt.rcParams['font.sans-serif'] = ['Simhei']
plt.rcParams['axes.unicode_minus'] = False

# 2. Prepare data
wine = load_wine()
wine.data.shape
wine.target
wine_pd = pd.concat([pd.DataFrame(wine.data), pd.DataFrame(wine.target)], axis=1)
# wine.feature_name = ['alcohol', 'malic acid', 'ash', 'ash alkalinity', 'magnesium', 'total phenols', 'flavonoids', 'non flavanophenols', 'anthocyanins', 'color intensity', 'hue', 'od280/od315 diluted wine', 'proline']
wine.feature_names.append("result")
wine_pd.columns = wine.feature_names
wine_pd = wine_pd.rename(columns={
    'alcohol': 'alcohol', 'malic_acid': 'malic acid',
    'ash': 'ash', 'alcalinity_of_ash': 'Alkalinity of ash',
    'magnesium': 'magnesium', 'total_phenols': 'Total phenol',
    'flavanoids': 'flavonoid', 'nonflavanoid_phenols': 'Non flavane phenols',
    'proanthocyanins': 'anthocyanin', 'color_intensity': 'Color intensity',
    'hue': 'tone', 'od280/od315_of_diluted_wines': 'od280/od315 Diluted wine',
    'proline': 'proline', 'result': 'label'
})
print(wine_pd.head())

# 3. Divide training set and test set
Xtrain, Xtest, Ytrain, Ytest = train_test_split(
    wine_pd.iloc[:, :-1], wine_pd.iloc[:, -1], test_size=0.3, random_state=420)

# 4. Establish model
clf = tree.DecisionTreeClassifier(criterion='gini')   # gini is also the default

# 5. Train model
clf = clf.fit(Xtrain, Ytrain)

# 6. Check accuracy
score = clf.score(Xtest, Ytest)
print(score)   # 0.9444444444444444

# 7. Draw the tree
# Requires: pip install graphviz, plus the Graphviz binaries and the system PATH
# (including the bin directory and dot.exe under bin)
import graphviz
import matplotlib.pyplot as plt

feature_name = ['alcohol', 'malic acid', 'ash', 'Alkalinity of ash', 'magnesium', 'Total phenol',
                'flavonoid', 'Non flavane phenols', 'anthocyanin', 'Color intensity', 'tone',
                'od280/od315 Diluted wine', 'proline']
dot_data = tree.export_graphviz(clf,                                        # model
                                feature_names=feature_name,                 # feature column names
                                class_names=["Gin", "Shirley", "Belmord"],  # correspond to labels 0/1/2
                                filled=True,                                # render colors
                                )
graph = graphviz.Source(dot_data, filename='Decision tree PDF')
graph
```
Parameters of export_graphviz, which exports a decision tree in DOT format:
- feature_names: the name of each attribute
- class_names: the name of each dependent variable category
- label: whether to show informative labels such as impurity. The default is "all"; it can also be "root" or "none"
- filled: whether to draw different colors for the main classification of each node. The default is False
- out_file: the name of the output dot file. The default is None, which means that the file is not output. It can be a user-defined name, such as "tree.dot"
- rounded: when True, node boxes are drawn with rounded corners and Helvetica fonts are used; the default is False
- More parameters: https://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html#sklearn.tree.export_graphviz
Preliminary appearance of decision tree:
6: Decision tree attribute
1.clf.feature_importances_
Show the importance of each feature
```python
clf.feature_importances_
[*zip(feature_name, clf.feature_importances_)]
```
2.clf.apply
Returns the leaf index for each prediction sample
3.clf.tree_.node_count
Returns the number of nodes in the tree
4.clf.tree_.feature
Returns the feature index used at each node; -2 indicates a leaf node (see the sketch below):
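A small sketch (assuming clf and Xtest from the wine case in section 5.2 are still in scope) that prints the attributes just described:

```python
# Inspect the fitted tree from the wine case above
print(clf.apply(Xtest))        # leaf index reached by every test sample
print(clf.tree_.node_count)    # total number of nodes in the fitted tree
print(clf.tree_.feature)       # feature index used at each node; -2 marks a leaf
```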
7: Prevent overfitting (pruning parameters)
7.1 parameter random_state random seed
Each time a different random seed is set, the resulting tree is different, so the seed is usually fixed for reproducibility:
tree.DecisionTreeClassifier( criterion="entropy" ,random_state=30 #This random seed can be changed at will ,splitter="random" )
7.2 parameter splitter
splitter has two values:
1.splitter ='best'
Although the decision tree branches somewhat randomly, it still gives priority to the more important features when branching (importance can be viewed through the attribute feature_importances_);
2.splitter ='random'
Branching becomes more random, which can reduce the chance of overfitting; overfitting is further controlled with the pruning parameters below
tree.DecisionTreeClassifier( criterion="entropy" ,random_state=30 ,splitter="random" #best / random )
7.3 pruning parameters:
7.3.1 max_depth limits the maximum depth of the tree
It is recommended to start from max_depth=3, check how well the model fits, and then decide whether to increase the depth
7.3.2 min_samples_leaf: the minimum number of samples each child node must contain after a split
If the value is too small the tree overfits; if it is too large the tree underfits
It is recommended to start with min_samples_leaf=5. If the sample sizes of the leaf nodes vary greatly, it is recommended to pass a floating-point number, which is interpreted as a fraction of the sample size
7.3.3 min_samples_split: a node must contain at least this many samples before it can be split
7.3.4 max_features limits the number of features considered when branching. Features exceeding the limit will be discarded
This pruning parameter is used to limit overfitting on high-dimensional data, but it is a rather blunt method: it directly limits the number of features that can be used and forcibly stops the growth of the decision tree. Without knowing the importance of each feature to the tree, forcing this parameter may cause the model to learn too little
If you want to prevent overfitting by dimensionality reduction, it is recommended to use PCA, ICA or dimensionality reduction algorithm in feature selection module
7.3.5 min_impurity_decrease limits the amount of information gain
A branch will not be made if its information gain is below the set value. This parameter was introduced in version 0.19; before 0.19, min_impurity_split was used instead
Case:
```python
# The data preparation is omitted; this continues the red-wine classification case above
# 4. Establish model
clf = tree.DecisionTreeClassifier(criterion='gini',      # gini is also the default
                                  random_state=420,      # random seed
                                  splitter='random',     # helps prevent overfitting
                                  # 1. Maximum depth: number of levels of child nodes below the root = max_depth
                                  max_depth=5,
                                  # 2. Minimum number of samples per leaf node; a float is treated as a fraction of the sample size
                                  min_samples_leaf=0.1,
                                  # 3. Minimum number of samples a node must contain before it can be split
                                  min_samples_split=20,
                                  # 4. Maximum number of features considered when splitting; extra features are discarded
                                  max_features=3,
                                  # 5. Minimum information gain for a split; below this value the node is not branched
                                  #    (min_impurity_split before sklearn 0.19, min_impurity_decrease afterwards)
                                  min_impurity_decrease=0.1
                                  )

# 5. Train model
clf = clf.fit(Xtrain, Ytrain)

# 6. Check accuracy
score = clf.score(Xtest, Ytest)
print(score)

# 7. Draw the tree (requires: pip install graphviz)
import graphviz
import matplotlib.pyplot as plt

feature_name = ['alcohol', 'malic acid', 'ash', 'Alkalinity of ash', 'magnesium', 'Total phenol',
                'flavonoid', 'Non flavane phenols', 'anthocyanin', 'Color intensity', 'tone',
                'od280/od315 Diluted wine', 'proline']
dot_data = tree.export_graphviz(clf,                                        # model
                                feature_names=feature_name,                 # feature column names
                                class_names=["Gin", "Shirley", "Belmord"],  # correspond to labels 0/1/2
                                filled=True,                                # render colors
                                )
graph = graphviz.Source(dot_data, filename='Decision tree PDF')
graph
```
7.4 how to determine the optimal pruning parameters
Here, the learning curve is used to find the optimal parameter with the maximum accuracy / score
Disadvantage: if you need to find the optimal combination of all parameters, you have to nest multiple loops, which is cumbersome
Solution: grid search (covered in a later update); a minimal sketch is shown below
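A minimal grid-search sketch (assuming Xtrain and Ytrain from the wine case in section 5.2; the parameter grid is only an example):

```python
# Grid search over a few pruning parameters instead of nested manual loops
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {
    'max_depth': list(range(2, 8)),
    'min_samples_leaf': [1, 5, 10],
    'criterion': ['gini', 'entropy'],
}
gs = GridSearchCV(DecisionTreeClassifier(random_state=420), param_grid, cv=5)
gs.fit(Xtrain, Ytrain)
print(gs.best_params_, gs.best_score_)   # best parameter combination and its CV score
```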
Case: use the learning curve to find the optimal solution of the maximum depth
```python
# The data preparation is omitted; this continues the red-wine classification case above
y = []
for i in range(1, 21):
    # 4. Establish model
    clf = tree.DecisionTreeClassifier(criterion='gini',
                                      random_state=420,    # random seed
                                      splitter='random',   # helps prevent overfitting
                                      max_depth=i          # 1. maximum depth
                                      )
    # 5. Train model
    clf = clf.fit(Xtrain, Ytrain)
    # 6. Check accuracy
    score = clf.score(Xtest, Ytest)
    y.append(score)
    print(score)

# Draw the learning curve
import matplotlib.pyplot as plt
plt.plot(range(1, 21), y)
plt.xticks(range(1, 21))
```
7.5 summary
Attributes are the various properties of the model that can be inspected after training. For the decision tree, the most important one is feature_importances_, which shows the importance of each feature to the model.
- The interfaces of many sklearn algorithms are similar. For example, fit and score, which we have used before, work for almost every algorithm. Besides these two, the most commonly used decision-tree interfaces are apply and predict.
- apply takes the test set and returns the index of the leaf node reached by each test sample.
- predict takes the test set and returns the predicted label of each test sample. The returned content is self-explanatory and easy to use; try it if you are interested.
- It must be mentioned that every interface that takes Xtrain or Xtest requires the feature matrix to be at least two-dimensional; sklearn does not accept a one-dimensional array as a feature matrix. If your data really has only one feature, use reshape(-1,1) to add a dimension.
- The classification tree has eight parameters, one attribute, four interfaces, plus the code used for drawing:
- Eight parameters: Criterion;
- Two randomness related parameters (random_state, splitter);
- Five pruning parameters (max_depth, min_samples_split, min_samples_leaf, max_features, min_impurity_decrease);
- One attribute: feature_importances_ ;
- Four interfaces: fit, score, apply, predict
8: Classification model evaluation index (class_weight)
8.1 A first method for handling imbalanced samples
Create an imbalanced sample set
```python
# Case of imbalanced samples
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_blobs   # generates clustered data sets

class_1 = 1000                  # sample size of class 1
class_2 = 100                   # sample size of class 2
centers = [[0, 0], [2, 2]]      # center points of the two classes
clusters_std = [2.5, 0.5]       # standard deviations of the two classes
x, y = make_blobs(n_samples=[class_1, class_2],
                  centers=centers,
                  cluster_std=clusters_std,
                  random_state=420,
                  shuffle=False)
plt.scatter(x[:, 0], x[:, 1], c=y, cmap='rainbow', s=10)
```
Compare the score with and without the class_weight parameter:
```python
# Partition dataset
from sklearn.model_selection import train_test_split
Xtrain, Xtest, Ytrain, Ytest = train_test_split(x, y, test_size=0.2, random_state=420)

# Create models
# 1. Without class_weight
clf_01 = DecisionTreeClassifier()
clf_01 = clf_01.fit(Xtrain, Ytrain)
score_01 = clf_01.score(Xtest, Ytest)
print('Score without setting the parameter =', score_01)

# 2. With class_weight
clf_02 = DecisionTreeClassifier(class_weight='balanced')
clf_02 = clf_02.fit(Xtrain, Ytrain)
score_02 = clf_02.score(Xtest, Ytest)
print('Score with the parameter set =', score_02)
```
Result: for imbalanced samples, setting class_weight gives a higher score
Score without setting the parameter = 0.8954545454545455
Score with the parameter set = 0.9045454545454545
9: Evaluation metrics for classification models (confusion matrix, for imbalanced binary classification)
9.1 concept of confusion matrix
Confusion matrix is a multi-dimensional measurement index system for binary classification problems, which is very useful when the samples are unbalanced
In the confusion matrix, we consider a few classes as positive cases and most classes as negative cases
In the classification algorithms of decision tree and random forest, that is, the minority class is 1 and the majority class is 0
In SVM, that is to say, a few classes are 1 and most classes are - 1
 | Predicted 1 | Predicted 0
---|---|---
True value 1 | 11 (TP) | 10 (FN)
True value 0 | 01 (FP) | 00 (TN)
Of which:
The rows represent the true values and the columns represent the predictions; in each two-digit cell label, the first digit is the true value and the second digit is the predicted value (a small sklearn sketch follows the list of definitions below).
- The predicted value is 1, which is recorded as P (Positive)
- The predicted value is 0 and recorded as N (Negative)
- The predicted value is the same as the real value, which is recorded as T (True)
- The predicted value is opposite to the real value and is recorded as F (False)
- Therefore, the four elements in the matrix represent:
◼ TP (True Positive): the true value is 1, and the predicted value is 1
◼ FN (False Negative): the true value is 1, and the predicted value is 0
◼ FP (False Positive): the true value is 0, and the predicted value is 1
◼ TN (True Negative): the true value is 0, and the predicted value is 0
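A small sklearn sketch (the y_true / y_pred arrays are made up for illustration) showing the same layout; labels=[1, 0] puts the minority class 1 first so that the matrix matches the table above (rows = true value, columns = predicted value):

```python
# Confusion matrix with the minority class (1) listed first
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
print(confusion_matrix(y_true, y_pred, labels=[1, 0]))
# [[2 1]    -> TP FN
#  [1 4]]   -> FP TN
```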
9.2 model effect evaluation
9.2.1 Accuracy
Accuracy = (11+00) / (11+00+10+01)
Accuracy is all samples with correct prediction divided by the total samples. Generally speaking, the closer it is to 1, the better.
Remember that a few classes are 1 and most classes are 0
9.2.2 Precision, also known as positive predictive value
Precision = (11) / (11+01)
It is the proportion of samples whose true value is 1 among all samples predicted as 1.
The lower the precision, the larger the share of 01 (FP) cells, which means the model misjudges the majority class 0 more often and "wrongly injures" too many majority-class samples (why the majority class? because the majority class is recorded as 0 and the minority class as 1)
9.2.3 Recall, also known as sensitivity or true positive rate
Recall = (11) / (11+10)
Represents the proportion of samples that are predicted to be correct among all samples with a true value of 1
The higher the recall rate, the more minority categories we try to capture
The lower the recall, the more minority-class samples we have failed to capture
9.2.4 F1 measure
To take both precision and recall into account, we use their harmonic mean as a combined metric that balances the two, called the F1 measure:
F-measure = (2 × Precision × Recall) / (Precision + Recall)
F1 measure is distributed between [0,1]. The closer it is to 1, the better
9.2.5 False Negative Rate
It equals 1 - Recall and measures the proportion of samples with a true value of 1 that are wrongly predicted as 0. It is not used very often.
FNR = (10) / (11 + 10)
9.2.6 ROC curve (receiver operating characteristic curve)
9.3 Confusion matrix in sklearn
class | meaning |
---|---|
sklearn.metrics.confusion_matrix | Confusion matrix |
sklearn.metrics.accuracy_score | Accuracy |
sklearn.metrics.precision_score | Precision |
sklearn.metrics.recall_score | recall |
sklearn.metrics.precision_recall_curve | Accuracy recall balance curve |
sklearn.metrics.f1_score | F1 measure |
Using the two models from the case above, compare some metrics between the unweighted clf_01 and the balanced clf_02:
```python
# Import the metrics module (contains the confusion-matrix related functions)
from sklearn import metrics

# 1. Compare the precision of clf_01 and clf_02
metrics.precision_score(Ytest, clf_01.predict(Xtest))   # 0.6363636363636364
metrics.precision_score(Ytest, clf_02.predict(Xtest))   # 0.6538461538461539

# 2. Compare recall
metrics.recall_score(Ytest, clf_01.predict(Xtest))      # 0.4827586206896552
metrics.recall_score(Ytest, clf_02.predict(Xtest))      # 0.5862068965517241

# 3. Compare the F1 score
metrics.f1_score(Ytest, clf_01.predict(Xtest))          # 0.5490196078431373
metrics.f1_score(Ytest, clf_02.predict(Xtest))          # 0.6181818181818182
```
10: Advantages and disadvantages of decision tree algorithm
Advantages:
- Easy to understand and explain, because the tree can be drawn and inspected.
- Little data preparation is required. Many other algorithms need data normalization, dummy variables, and removal of null values. Note, however, that the decision tree module in sklearn does not support missing values.
- The cost of using the tree (for example, when predicting) is logarithmic in the number of data points used to train it, which is very low compared with other algorithms.
- It can handle both numerical and categorical data at the same time, and can do both regression and classification. Other techniques are usually specialized for data sets with only one type of variable.
- Even if its assumptions are somewhat violated by the true data-generating model, it can still perform well.
Disadvantages:
- Decision trees may create overly complex trees that do not generalize well; this is called overfitting. Mechanisms such as pruning, setting the minimum number of samples per leaf node, or setting the maximum depth of the tree are necessary to avoid this problem, and tuning these parameters can be obscure for beginners.
- Decision trees can be unstable: small changes in the data may produce a completely different tree. This problem needs to be addressed with ensemble algorithms.
- Decision tree learning is based on a greedy algorithm that tries to reach the global optimum by optimizing each local decision (each node), but this cannot guarantee a globally optimal tree. This problem can also be mitigated by ensemble algorithms such as random forests, where features and samples are randomly sampled during splitting.
- If some classes dominate the labels, the decision tree learner will create a tree biased toward those classes. It is therefore recommended to balance the data set before fitting the decision tree.