Iris data classification using decision tree algorithm

Posted by Cheap Commercial on Wed, 08 Dec 2021 00:54:45 +0100

Iris data classification using decision tree algorithm (learning notes)

Introduction to decision tree algorithm

  • The process of building a tree

    1. Starting from the root node, compute the information gain (or information gain ratio, or Gini index) for every feature, and select the feature with the best score as the split at that node. (information gain -> ID3, information gain ratio -> C4.5, Gini index -> CART)
    2. Create child nodes according to the chosen feature, then repeat step 1 on each child node until the information gain (or information gain ratio) of every remaining feature is very small or no features are left to select.
  • Building the tree directly by the steps above is prone to overfitting (the trained model fits the training data too closely and therefore classifies test data poorly).

  • Preventing overfitting: reduce the complexity of the model and simplify the decision tree -> pruning

    • Pre-pruning: prune while constructing the tree
    • Post-pruning: prune after the decision tree has been fully constructed
  • Important parameters of the decision tree algorithm in sklearn

    • max_depth: the maximum depth of the tree (how many levels of splits are allowed). It is the most commonly used parameter for reducing model complexity and preventing overfitting
    • min_samples_leaf: the minimum number of samples each leaf must contain
    • max_leaf_nodes: the maximum number of leaf nodes in the tree
    • In practical applications, tuning max_depth alone is usually enough to prevent the decision tree model from overfitting (a brief sketch combining all three parameters appears in the model-building section below)
  • Entropy: in information theory, let the probability distribution of a discrete random variable X be $P(X=x_i)=p_i,\ i=1,2,\dots,n$; the entropy of this distribution is then defined as
    $$Entropy(p)=-\sum_{i=1}^{n}{p_i \log_2{p_i}}$$

  • Information Gain: describes the difference between coding with distribution Q and coding with distribution P. In the decision tree algorithm, the information gain is computed for a feature A: compare how much information (entropy) the system carries with and without that feature; the difference between the two is the amount of information the feature brings to the system, that is, the gain
    $$Gain(S,A)=Entropy(S)-\sum_{v \in Values(A)}{\frac{|S_v|}{|S|}Entropy(S_v)}$$

  • When the probabilities in the entropy are estimated from data (in particular by maximum likelihood estimation), the corresponding entropy is called empirical entropy. For example, suppose there are 10 samples in two classes, A and B. If seven samples belong to class A, the probability of class A is 7/10; if three samples belong to class B, the probability of class B is 3/10. Simply put, the probabilities are computed from the data. Define the training data set as D, with empirical entropy H(D), where |D| denotes its sample size. Suppose there are K classes $C_k$, $k=1,2,\dots,K$, and $|C_k|$ is the number of samples belonging to class $C_k$. The empirical entropy formula can then be written as follows (a worked example in code follows this list):
    $$Entropy(D)=-\sum_{k=1}^{K}{\frac{|C_k|}{|D|} \log_2{\frac{|C_k|}{|D|}}}$$
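
To make the formulas concrete, here is a minimal sketch in Python that computes the empirical entropy of the 10-sample A/B example above and the information gain of a hypothetical binary feature (the feature values are invented purely for illustration):

code:

import numpy as np

def entropy(labels):
    # Empirical entropy: estimate class probabilities from frequencies,
    # then apply Entropy(D) = -sum_k (|C_k|/|D|) * log2(|C_k|/|D|)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, feature_values):
    # Gain(S, A) = Entropy(S) - sum_v (|S_v|/|S|) * Entropy(S_v)
    weighted = 0.0
    for v in np.unique(feature_values):
        subset = labels[feature_values == v]
        weighted += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - weighted

labels = np.array(['A'] * 7 + ['B'] * 3)    # 7 samples of class A, 3 of class B
feature = np.array([1] * 5 + [0] * 5)       # hypothetical binary feature values
print(entropy(labels))                      # about 0.881
print(information_gain(labels, feature))    # about 0.396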

Prepare the data and split it into training and test sets

Load the data and import the required packages
code:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the built-in iris data set and inspect its features and classes
iris = load_iris()
print('Feature Name:', iris.feature_names)
print('Category:', iris.target_names)

Output results:

Feature Name: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Category: ['setosa' 'versicolor' 'virginica']

Data processing
code:

X = iris.data
y = iris.target
# Hold out a quarter of the samples as the test set; fix random_state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/4, random_state=0)
print('Number of dataset samples: {}, number of training samples: {}, number of test samples: {}'.format(len(X), len(X_train), len(X_test)))

Output results:

Number of dataset samples: 150, number of training samples: 112, number of test samples: 38

Build the model, train the model, test the model

Model building
code:

# Limit the tree to a maximum depth of 3 (pre-pruning)
dt_model = DecisionTreeClassifier(max_depth=3)
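
Besides max_depth, the other pre-pruning parameters listed earlier can be passed in the same way. A minimal sketch with illustrative values (the specific numbers are arbitrary and only demonstrate the signature):

code:

# min_samples_leaf and max_leaf_nodes further restrict how far the tree can grow;
# random_state makes the result reproducible
pruned_model = DecisionTreeClassifier(max_depth=3,
                                      min_samples_leaf=5,
                                      max_leaf_nodes=8,
                                      random_state=0)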

Model training
code:

dt_model.fit(X_train, y_train)
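
Optionally, the learned split rules can be inspected with sklearn's export_text, which prints the fitted tree as nested if/else conditions:

code:

from sklearn.tree import export_text

# Print the decision rules using the original feature names
print(export_text(dt_model, feature_names=iris.feature_names))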

Model testing
code:

y_pred = dt_model.predict(X_test)

# Compare the predictions with the true test labels
acc = accuracy_score(y_test, y_pred)

print('Accuracy:', acc)

Output results:

Accuracy: 0.9736842105263158

The maximum depth we set when building the model was 3. Let's look at the effect of this hyperparameter on the model.

We set the parameter to 2, 3 and 4 respectively and compare the accuracy on the training set and on the test set.

code:

max_depth_values = [2, 3, 4]

for max_depth_val in max_depth_values:
    # Retrain the model for each candidate depth
    dt_model = DecisionTreeClassifier(max_depth=max_depth_val)
    dt_model.fit(X_train, y_train)

    print('max_depth =', max_depth_val)
    print('Accuracy on training set: {:.3f}'.format(dt_model.score(X_train, y_train)))
    print('Accuracy on test set: {:.3f}'.format(dt_model.score(X_test, y_test)))

Results obtained:

max_depth = 2
Accuracy on training set: 0.964
Accuracy on test set: 0.895
max_depth = 3
Accuracy on training set: 0.982
Accuracy on test set: 0.974
max_depth = 4
Accuracy on training set: 1.000
Accuracy on test set: 0.974

It can be seen that at max_depth = 4 the training accuracy reaches 1.000 while the test accuracy stays at 0.974, the same as at depth 3: the extra depth fits the training data perfectly without improving generalization, an early sign of overfitting. We can also try increasing the parameter further to see what changes occur; a sketch of the extended sweep follows.
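
A minimal sketch of the extended sweep (output omitted; on a data set this small, test accuracy typically plateaus or drops once the tree is deep enough to fit the training data perfectly):

code:

for max_depth_val in range(2, 8):
    # random_state is fixed because tie-breaking between equally good splits can vary
    dt_model = DecisionTreeClassifier(max_depth=max_depth_val, random_state=0)
    dt_model.fit(X_train, y_train)
    print('max_depth =', max_depth_val)
    print('Accuracy on training set: {:.3f}'.format(dt_model.score(X_train, y_train)))
    print('Accuracy on test set: {:.3f}'.format(dt_model.score(X_test, y_test)))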

Topics: Algorithm, Decision Tree