Decision Trees (Machine Learning in Action)

Posted by csudhoff on Wed, 27 Oct 2021 17:59:07 +0200

  1, Introduction

1.1 concept

As shown in the figure, this is a decision tree. A square represents a decision block and an ellipse represents a terminating block, indicating that a conclusion has been reached and the process can stop. Left and right branches lead out of each decision block, reaching either the next decision block or a terminating block. The figure sketches a hypothetical e-mail classification system.

1.2 construction of decision tree

1.2.1 general process:

        1. Collect data

        2. Prepare data

        3. Analyze data

        4. Training algorithm

        5. Test algorithm

        6. Use algorithm

The pseudocode for the branch-creating function createBranch is as follows:

Check whether every sub-item in the dataset belongs to the same class
If so, return the class label
Else
	Find the best feature to split the dataset
	Split the dataset
	Create a branch node
		for each split subset
			call createBranch and add the returned result to the branch node
	return the branch node

1.3 information gain

The change in information before and after splitting the dataset is called the information gain. Once we know how to calculate it, we can compute the information gain obtained by splitting the dataset on each feature; the feature with the highest information gain is the best choice.


The Shannon entropy H is defined as

H = -∑(i=1..n) p(xᵢ) · log₂ p(xᵢ)

where n is the number of classes and p(xᵢ) is the probability of class xᵢ.

Code for calculating the Shannon entropy of a given dataset:

from math import log

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)          # total number of instances
    labelCounts = {}                   # occurrence count of each class label
    for featVec in dataSet:
        currentLabel = featVec[-1]     # the class label is the last column
        if currentLabel not in labelCounts.keys():
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries  # probability of the label
        shannonEnt -= prob * log(prob, 2)            # Shannon entropy formula
    return shannonEnt

First, we count the total number of instances in the dataset and save it in an explicit variable. We then create a dictionary whose keys are the values of the last column; if a key is not yet present, we add it to the dictionary. Each key thus records how many times the corresponding class occurs. Finally, the frequency of each class label is converted into a probability, which is used to compute the Shannon entropy.

You can use createDataSet() to get a simple sample dataset:

def createDataSet():
    dataSet = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']
    return dataSet, labels
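As a sanity check, the entropy of this dataset (two 'yes' and three 'no' labels out of five instances) can be worked out by hand:

```python
from math import log

# Two 'yes' and three 'no' labels out of five instances:
# H = -(2/5)*log2(2/5) - (3/5)*log2(3/5)
entropy = -(2/5) * log(2/5, 2) - (3/5) * log(3/5, 2)
print(round(entropy, 4))  # 0.971
```

calcShannonEnt(dataSet) should return the same value.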

The higher the entropy, the more mixed the data. We can add more classes to the dataset and observe how the entropy changes; for example, we add a third class called maybe to test the change in entropy:
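A minimal sketch of that experiment, with the entropy function repeated so the snippet stands alone (the exact row changed to 'maybe' is an arbitrary choice):

```python
from math import log

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]
        labelCounts[currentLabel] = labelCounts.get(currentLabel, 0) + 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt

# Change the first instance's label from 'yes' to a third class, 'maybe'
dataSet = [[1, 1, 'maybe'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
print(round(calcShannonEnt(dataSet), 4))  # 1.371 -- up from 0.971 with two classes
```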


  1.4 data division

Besides measuring the information entropy, the classification algorithm also needs to split the dataset and measure the entropy of the resulting subsets, in order to judge whether the split is a good one. The information entropy is computed once for each feature's split of the dataset, and the feature that yields the best split is chosen.

Splitting the dataset on a given feature:

def splitDataSet(dataSet, axis, value):
    # Create the list of returned entries
    retDataSet = []
    # Traverse the dataset
    for featVec in dataSet:
        if featVec[axis] == value:
            # Remove the axis feature
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis + 1:])
            # Add the matching entry to the returned dataset
            retDataSet.append(reducedFeatVec)
    # Return the split dataset
    return retDataSet

def createDataSet():
    # data set
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    # Classification properties
    labels = ['no surfacing', 'flippers']
    # Return dataset and classification properties
    return dataSet, labels

if __name__ == '__main__':
    dataSet, labels = createDataSet()
    print(splitDataSet(dataSet, 0, 1))
    print(splitDataSet(dataSet, 0, 0))

  Selecting the best feature to split the data on:

from math import log

def calcShannonEnt(dataSet):
    # Returns the number of rows in the dataset
    numEntries = len(dataSet)
    # Save a dictionary of the number of occurrences of each label
    labelCounts = {}
    # Each group of eigenvectors is counted
    for featVec in dataSet:
        # Extract label information
        currentLabel = featVec[-1]
        # If the label is not put into the dictionary of statistical times, add it
        if currentLabel not in labelCounts.keys():
            labelCounts[currentLabel] = 0
        # Label count
        labelCounts[currentLabel] += 1
    # Shannon entropy
    shannonEnt = 0.0
    # Calculate Shannon entropy
    for key in labelCounts:
        # The probability of selecting the label
        prob = float(labelCounts[key])/numEntries
        # Calculation of Shannon entropy by formula
        shannonEnt -= prob * log(prob, 2)
    # Return Shannon entropy
    return shannonEnt
def splitDataSet(dataSet, axis, value):
    # Create the list of returned entries
    retDataSet = []
    # Traverse the dataset
    for featVec in dataSet:
        if featVec[axis] == value:
            # Remove the axis feature
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis + 1:])
            # Add the matching entry to the returned dataset
            retDataSet.append(reducedFeatVec)
    # Returns the partitioned dataset
    return retDataSet

def chooseBestFeatureToSplit(dataSet):
    # Number of features
    numFeatures = len(dataSet[0]) - 1
    # Calculate Shannon entropy of data set
    baseEntropy = calcShannonEnt(dataSet)
    # information gain 
    bestInfoGain = 0.0
    # Index value of optimal feature
    bestFeature = -1
    # Traverse all features
    for i in range(numFeatures):
        # Get the i th all features of dataSet
        featList = [example[i] for example in dataSet]
        # Create set set {}, elements cannot be repeated
        uniqueVals = set(featList)
        # Empirical conditional entropy
        newEntropy = 0.0
        # Calculate information gain
        for value in uniqueVals:
            # The subset of the subDataSet after partition
            subDataSet = splitDataSet(dataSet, i, value)
            # Calculate the probability of subsets
            prob = len(subDataSet) / float(len(dataSet))
            # The empirical conditional entropy is calculated according to the formula
            newEntropy += prob * calcShannonEnt(subDataSet)
        # Information gain for feature i
        infoGain = baseEntropy - newEntropy
        # Print the information gain of each feature
        print("The information gain of feature %d is %.3f" % (i, infoGain))
        # Keep track of the maximum information gain
        if (infoGain > bestInfoGain):
            bestInfoGain = infoGain
            # Record the index value of the feature with the largest information gain
            bestFeature = i
    # Returns the index value of the feature with the largest information gain
    return bestFeature

  Operation screenshot:
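The gains the run should print can also be checked by hand; a small sketch, where H is a helper (not part of the original code) computing the entropy of a list of probabilities:

```python
from math import log

def H(probs):
    # Entropy of a probability distribution
    return -sum(p * log(p, 2) for p in probs if p > 0)

base = H([2/5, 3/5])  # entropy of the full dataset, about 0.971
# Feature 0: value 1 -> {2 yes, 1 no}, value 0 -> {2 no}
gain0 = base - (3/5) * H([2/3, 1/3]) - (2/5) * H([1.0])
# Feature 1: value 1 -> {2 yes, 2 no}, value 0 -> {1 no}
gain1 = base - (4/5) * H([1/2, 1/2]) - (1/5) * H([1.0])
print(round(gain0, 3), round(gain1, 3))  # 0.42 0.171
```

Feature 0 has the larger gain, so chooseBestFeatureToSplit(dataSet) returns 0 for the sample dataset.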

1.5 recursive construction of decision tree

Working principle: take the original dataset and split it on the best feature. Since a feature may have more than two values, the split may produce more than two branches. After the first split, the data is passed down to the next node of the tree, where it can be split again; we can therefore process the dataset recursively.
The recursion ends when the program has used up all the features for splitting, or when all instances under a branch share the same class. In the latter case we obtain a leaf node (terminating block), and any data reaching that leaf must belong to its class.
If the class labels are still not unique after all features have been used, we must decide how to define the leaf node. We usually take a majority vote to determine its class. The code is as follows:

import operator

def majorityCnt(classList):
    classCount = {}
    # Count the number of occurrences of each class in classList
    for vote in classList:
        if vote not in classCount.keys():
            classCount[vote] = 0
        classCount[vote] += 1
    # Sort the dictionary by count, in descending order
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
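A quick check of the majority vote, with the fixed function repeated so the snippet stands alone:

```python
import operator

def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        classCount[vote] = classCount.get(vote, 0) + 1
    # Sort labels by count, descending, and return the most frequent one
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

print(majorityCnt(['yes', 'no', 'no', 'yes', 'no']))  # no
```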

The code for creating the tree is as follows:

def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):  # If all classes are identical, stop splitting
        return classList[0]
    if len(dataSet[0]) == 1:  # When all features have been used, return the most frequent class
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)  # Select the optimal feature
    bestFeatLabel = labels[bestFeat]  # Label of the optimal feature
    myTree = {bestFeatLabel: {}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]  # Get all values of the chosen feature
    uniqueVals = set(featValues)
    for value in uniqueVals:  # Build a subtree for each feature value
        subLabels = labels[:]
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree
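For the sample dataset above, calling createTree(dataSet, labels) should produce the following nested dictionary (assuming the helper functions behave as described):

```python
myTree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
```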

The variable myTree contains nested dictionaries representing the tree structure. Reading from the left, the first key, no surfacing, is the feature on which the dataset is first split, and its value is another dictionary. The keys of that second dictionary are the values of the no surfacing feature, and their values are the children of the no surfacing node. Each child value is either a class label or another dictionary: a class label means the child is a leaf node; another dictionary means the child is a decision node. This structure repeats to form the whole tree, which here contains three leaf nodes and two decision nodes.

2, Drawing a tree in Python using Matplotlib annotations

Draw tree nodes with text annotations:

import matplotlib.pyplot as plt

# Define the text-box and arrow formats
decisionNode = dict(boxstyle="sawtooth", fc="0.8")
leafNode = dict(boxstyle="round4", fc="0.8")
arrow_args = dict(arrowstyle="<-")

def plotNode(nodeTxt, centerPt, parentPt, nodeType):
    # Draw an annotation with an arrow from parentPt to centerPt
    createPlot.ax1.annotate(nodeTxt, xy=parentPt, xycoords='axes fraction',
                            xytext=centerPt, textcoords='axes fraction',
                            va="center", ha="center", bbox=nodeType, arrowprops=arrow_args)

def createPlot():
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    createPlot.ax1 = plt.subplot(111, frameon=False)
    plotNode('Decision node', (0.5, 0.1), (0.1, 0.5), decisionNode)
    plotNode('Leaf node', (0.8, 0.1), (0.3, 0.8), leafNode)
    plt.show()

  Operation screenshot:

I haven't finished it yet; I'll continue tomorrow!

Topics: Algorithm Machine Learning Decision Tree