# Decision tree of machine combat

Posted by csudhoff on Wed, 27 Oct 2021 17:59:07 +0200

1, Introduction

1.1 concept

As shown in the figure, it is a decision tree. The square represents the decision block and the ellipse represents the terminating block, indicating that the conclusion has been reached and the operation can be terminated. The left and right arrow branch es are led out from the judgment module, which can reach the next judgment module or terminate the module. A hypothetical e-mail classification system is constructed. 1.2 construction of decision tree

1.2.1 general process:

1. Collect data

2. Prepare data

3. Analysis data

4. Training algorithm

5. Test algorithm

6. Use algorithm

The pseudo code function createBranch to create a branch is as follows:
Detect whether each sub item in the dataset belongs to the same classification

```If so return Class label
Else
Find the best features to divide the dataset
Partition dataset
Create branch node
for Subset of each partition
Call function createBranch And add the returned result to the branch node
return Branch node```

1.3 information gain

The change of information before and after dividing the data set is called information gain. Knowing how to calculate the information gain, we can calculate the learning gain obtained by dividing the data set by each eigenvalue. The feature with the highest information gain is the best choice. Where n is the number of classifications

Code part for calculating Shannon entropy for a given dataset:

```from math import log
import operator
def calcshannonEnt(dataSet):
nuMEntries = len(dataSet)
labelCounts = {}
for featVec in dataSet:
currentLabel = featVec[-1]
if currentLabel not in labelCounts.keys():
labelCounts[currentLabel] = 0
labelCounts[currentLabel] += 1
shannonEnt = 0.0
for key in labelCounts:
prob = float(labelCounts[key])/nuMEntries
shannonEnt -= prob * log(prob,2)
return shannonEnt
```

First, we need to calculate the total number of instances in the dataset. Here, we explicitly declare a variable to save the total number of instances. Then create a data dictionary whose key value is the value of the last column. If the current key value does not exist, expand the dictionary and add the current key value to the dictionary. Here, each key value records the occurrence times of the current category. Finally, the occurrence probability of all class labels is used to calculate the occurrence probability of the category. We will use this probability to calculate Shannon entropy and count the times of all class labels. Finally, the occurrence frequency of all class labels is used to calculate the probability of class occurrence.

You can use createDataSet() to get a simple authentication dataset

```def createDataset():
dataSet = [[1,1,'yes'],[1,1,'yes'],[1,0,'no'],[0,1,'no'],[0,1,'no']]
labels = ['no surfacing','flippers']
return dataSet,labels
```

The higher the entropy, the more mixed data. We can add more categories in the data set to observe how the entropy changes. We add a third category called may to test the change of entropy:

```myData[-1]='maybe'
print(myData)
print(calcshannonEnt(myData))```

1.4 data division

In addition to measuring the information entropy, the classification algorithm also needs to divide the data set and measure the entropy of the divided data set, so as to judge whether the data set is divided correctly. The information entropy is calculated once for the result of dividing the data set according to each feature, and then it is judged which feature is the best way to divide the data set.

Given feature Division:

```def splitDataSet(dataSet, axis, value):
# Create a list of returned datasets
retDataSet = []
# Traversal dataset
for featVec in dataSet:
if featVec[axis] == value:
# Remove axis feature
reducedFeatVec = featVec[:axis]
# Add eligible to the returned dataset
reducedFeatVec.extend(featVec[axis + 1:])
retDataSet.append(reducedFeatVec)
# Returns the partitioned dataset
return retDataSet

def createDataSet():
# data set
dataSet = [[1, 1, 'yes'],
[1, 1, 'yes'],
[1, 0, 'no'],
[0, 1, 'no'],
[0, 1, 'no']]
# Classification properties
labels = ['no surfacing', 'flippers']
# Return dataset and classification properties
return dataSet, labels

if __name__ == '__main__':
dataSet, labels = createDataSet()
print(dataSet)
print(splitDataSet(dataSet, 0, 1))
print(splitDataSet(dataSet, 0, 0))
```

Select the best way to divide the data:

```from math import log

def calcShannonEnt(dataSet):
# Returns the number of rows in the dataset
numEntries = len(dataSet)
# Save a dictionary of the number of occurrences of each label
labelCounts = {}
# Each group of eigenvectors is counted
for featVec in dataSet:
# Extract label information
currentLabel = featVec[-1]
# If the label is not put into the dictionary of statistical times, add it
if currentLabel not in labelCounts.keys():
labelCounts[currentLabel] = 0
# Label count
labelCounts[currentLabel] += 1
# Shannon entropy
shannonEnt = 0.0
# Calculate Shannon entropy
for key in labelCounts:
# The probability of selecting the label
prob = float(labelCounts[key])/numEntries
# Calculation of Shannon entropy by formula
shannonEnt -= prob * log(prob, 2)
# Return Shannon entropy
return shannonEnt
def splitDataSet(dataSet, axis, value):
# Create a list of returned datasets
retDataSet = []
# Traversal dataset
for featVec in dataSet:
if featVec[axis] == value:
# Remove axis feature
reducedFeatVec = featVec[:axis]
# Add eligible to the returned dataset
reducedFeatVec.extend(featVec[axis + 1:])
retDataSet.append(reducedFeatVec)
# Returns the partitioned dataset
return retDataSet

def chooseBestFeatureToSplit(dataSet):
# Number of features
numFeatures = len(dataSet) - 1
# Calculate Shannon entropy of data set
baseEntropy = calcShannonEnt(dataSet)
# information gain
bestInfoGain = 0.0
# Index value of optimal feature
bestFeature = -1
# Traverse all features
for i in range(numFeatures):
# Get the i th all features of dataSet
featList = [example[i] for example in dataSet]
# Create set set {}, elements cannot be repeated
uniqueVals = set(featList)
# Empirical conditional entropy
newEntropy = 0.0
# Calculate information gain
for value in uniqueVals:
# The subset of the subDataSet after partition
subDataSet = splitDataSet(dataSet, i, value)
# Calculate the probability of subsets
prob = len(subDataSet) / float(len(dataSet))
# The empirical conditional entropy is calculated according to the formula
newEntropy += prob * calcShannonEnt(subDataSet)
# information gain
infoGain = baseEntropy - newEntropy
# Print information gain for each feature
print("The first%d The gain of each feature is%.3f" % (i, infoGain))
# Calculate information gain
if (infoGain > bestInfoGain):
# Update the information gain to find the maximum information gain
bestInfoGain = infoGain
# Record the index value of the feature with the largest information gain
bestFeature = i
# Returns the index value of the feature with the largest information gain
return bestFeature
```

Operation screenshot: 1.5 recursive construction of decision tree

Working principle: get the original data set, and then divide the data set based on the best attribute value. Since there may be more than two eigenvalues, there may be data set division greater than two branches. After the first division, the data will be passed down to the next node of the tree branch. On this node, we can divide the data again, Therefore, we can use the principle of recursion to deal with data sets.
Conditions for the end of recursion: the program traverses the attributes of all partitioned data sets, or all instances under each branch have the same classification. If all instances have the same classification, a leaf node or termination block is obtained, and any data reaching the leaf node must belong to the classification of the leaf node.
If the class label is still not unique after the dataset has processed all attributes, we need to decide how to define the leaf node. We usually use the majority voting method to determine the classification of the leaf node. The code is as follows:

```def majorityCnt(classList):
classCount={}
for vote in classList: #Count the number of occurrences of each element in the classList
if vote not in classCount.keys():classCount[vote]=0
classCount[vote]+=1
sortedClassCount=sorted(classCount.iteritems(),key=operator.itemgetter(1),reverse=True) #Sort dictionary
return sortedClassCount
```

The code for creating the tree is as follows:

```def createTree(dataSet,labels):
classList = [example[-1] for example in dataSet]
if classList.count(classList) == len(classList):#If the categories are exactly the same, continue to divide
return classList
if len(dataSet) == 1: #When all features are traversed, the category with the most occurrences is returned
return majorityCnt(classList)
bestFeat = chooseBestFeatureTOSplit(dataSet) #Select the optimal feature
bestFeatLabel = labels[bestFeat] #Optimal feature label
myTree = {bestFeatLabel:{}}
del (labels[bestFeat]) #Get all attribute values contained in the list
featValues = [example[bestFeat] for example in dataSet]
uniqueVals = set(featValues)
for value in uniqueVals: #Create decision tree
subLabels = labels[:]
myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet,bestFeat,value),subLabels)
return myTree
```

The variable myTree contains many nested dictionaries representing tree structure information. Starting from the left, the first keyword no surfacing is the feature name of the first partitioned dataset, and the value of this keyword is another data dictionary. The second keyword is the data set divided by the no surfacing feature. The values of these keywords are the child nodes of the no surfacing node. These values may be class labels or another data dictionary. If the value is a class label, the child node is a leaf node; If the value is another data dictionary, the child node is a judgment node. This format structure repeats continuously to form the whole tree. This tree contains three leaf nodes and two judgment nodes.

2, Drawing a tree in python using the Matplotlib annotation

Draw tree nodes with text annotations:

```import matplotlib.pyplot as plt
from pylab import mpl
decisionNode = dict(boxstyle='sawtooth',fc="0.8")#Define text box and arrow formats
leafNode = dict(boxstyle="round4",fc="0.8")
arrow_args = dict(arrowstyle="<-")
def plotNode(nodeTxt,centerPt,parentPt,nodeType):#Draw annotation with arrow
createPlot.ax1.annotate(nodeTxt,xy=parentPt,xycoords='axes fraction',xytext=centerPt,textcoords='axes fraction',va="center",ha="center",bbox=nodeType,arrowprops=arrow_args)
def createPlot():
fig = plt.figure(1,facecolor='white')
fig.clf()
createPlot.ax1 = plt.subplot(111,frameon=False)
mpl.rcParams['font.sans-serif'] = ['SimHei']  # Blackbody
plotNode('Decision node',(0.5,0.1),(0.1,0.5),decisionNode)
plotNode('Leaf node',(0.8,0.1),(0.3,0.8),leafNode)
plt.show()```

Operation screenshot: I haven't finished it yet. I'll write it tomorrow. 55555!!!!