[Python] multi classification algorithm Random Forest

Posted by dtyson2000 on Thu, 30 Dec 2021 15:45:05 +0100

[Python] multi classification algorithm Random Forest

This paper will mainly describe the multi classification application of Random Forest, which can also be used in secondary classification. This paper uses scikit learn framework.

1, Import base library

Import the relevant libraries of data processing and drawing. In order to make the hierarchy more clear, we will import the scikit learn framework later.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

2, Data reading and processing

1. Data reading

Read the device data from the csv file. The code is as follows:

# Read data
dataset = pd.read_csv('test_data.csv')
print(dataset)

The data form is roughly as follows. x is equipment data and class is data characteristics. Here, it is divided into three categories:
(it's inconvenient to display the real data. The screenshot is just for display. Here you can download iris iris data set https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data )

2. Data processing

Next, perform data preprocessing, taking x as the independent variable x and class as the independent variable y.

# Data preprocessing select independent variable x dependent variable y
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 3].values

3, Random forest based on scikit learn

1. Training set and test set

Using train in scikit learn_ test_ Split module divides the data into training set and test set.

# Split the data set into training set and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

2. Standardization

In order to eliminate the impact of too large data difference, the StandardScaler is used for standardization to highlight the characteristics of the data set.

# Data standardization feature scale
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

3. Training

In scikit learn, random forest module RandomForestClassifier is used to train the training data.
n_estimators indicates the number of trees in the random forest, criterion='entropy 'indicates that entropy information entropy is selected to find nodes and branches, random_state indicates random mode parameters.

from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=50, criterion='entropy', random_state=42)
classifier.fit(X_train, y_train)
print("X_train:", X_train)
print("y_train:", y_train)

3. Forecast

After the model is trained, the test data set is used for prediction. The classification of test sets is essentially to predict the dependent variable y, that is, the class in the table, through the test set data.

y_pred = classifier.predict(X_test)
print("X_test:", X_test)
print("y_pred:", y_pred)

4. Output results

Finally, the self-contained classifier in scikit learn is used to output the prediction results, including configuration matrix, Classification Report and Accuracy.

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
result = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(result)
result1 = classification_report(y_test, y_pred)
print("Classification Report:", )
print(result1)
result2 = accuracy_score(y_test, y_pred)
print("Accuracy:", result2)

It can be seen from the results of the conflict matrix that the random number of classification 0 is 62 and the accurate number of prediction is 62; The number of extracted from category 1 is 240, and the accurate number of prediction is 232; The extracted quantity of classification 2 is 33, and the predicted accurate quantity is 33.

Precision represents accuracy, recall represents recall, and F1 score represents the harmonic average of precision and recall.

Accuracy is the accuracy of random forest classification. The results show that the accuracy is 97.6%.

Summary.

This paper mainly introduces the general steps of multi classification of random forest, with good accuracy. For the specific parameter settings in RandomForestClassifier, please refer to the tutorial on the official website sklearn.ensemble.RandomForestClassifier.

Topics: Python Algorithm