Decision tree and random forest case

Posted by elnino on Sat, 18 Apr 2020 08:39:13 +0200

Many people have heard of decision trees and random forests. They are mathematical models for prediction that can be implemented quickly in Python. Keep the code below handy: once you understand what it does, you can reuse the model to make predictions of your own. The part I find most interesting, though, is identifying the important factors at the end of the model.

The first step, after loading the data set, is to clean the data:

import pandas as pd
import numpy as np

titanic=pd.read_csv(r'/Users/titanic_train.csv')

#Drop columns that are not useful for prediction
titanic.drop(['PassengerId','Ticket','Cabin','Name'],axis=1,inplace=True)

#Fill missing Age values with the mean age for each sex
fillna_titanic=[]
for i in titanic.Sex.unique():
    update=titanic.loc[titanic.Sex==i].fillna(value={'Age':titanic.Age[titanic.Sex==i].mean()})
    fillna_titanic.append(update)
titanic=pd.concat(fillna_titanic)

#Fill missing Embarked values with the most frequent port
titanic.fillna(value={'Embarked':titanic.Embarked.mode()[0]},inplace=True)
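A quick optional check confirms that the Age and Embarked columns no longer contain missing values:

#Count remaining missing values per column (should be 0 everywhere)
print(titanic.isnull().sum())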

The second step is to convert the categorical variables into dummy variables:

#Convert the numeric Pclass column to a categorical type
titanic.Pclass=titanic.Pclass.astype('category')

#Dummy variable processing
dummy=pd.get_dummies(titanic[['Sex','Embarked','Pclass']])

#Horizontal merging of titanic and dummy data sets
titanic=pd.concat([titanic,dummy],axis=1)
titanic.drop(['Sex','Embarked','Pclass'],axis=1,inplace=True)
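If you want to see what the dummy encoding produced, an optional quick check is to list the resulting columns:

#Inspect the columns after dummy encoding
print(titanic.columns.tolist())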

With the data ready, we can start building the model. You don't need to memorize the code below; just change the parameters when you use it. That is the charm of Python: it makes some very complex things easy enough for anyone to get started with.

#Construction of training set and test set
from sklearn import model_selection

#Get the names of all predictor columns (everything except Survived)
predictors=titanic.columns[1:]

#Split the data into training and test sets, holding out 25% for testing
X_train,X_test,y_train,y_test=model_selection.train_test_split(titanic[predictors],titanic.Survived,
                                                              test_size=0.25,random_state=1234)
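An optional check confirms the 75/25 split:

#Verify the sizes of the training and test sets
print(X_train.shape,X_test.shape)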

#Use grid search to find the best model parameters
from sklearn.model_selection import GridSearchCV
from sklearn import tree

#Candidate values for each hyperparameter
max_depth=[2,3,4,5,6]
min_samples_split=[2,4,6,8]
min_samples_leaf=[2,4,8,10,12]

#Organize the values of each parameter in the form of a dictionary
parameters={'max_depth':max_depth,'min_samples_split':min_samples_split,'min_samples_leaf':min_samples_leaf}

#Grid search over the parameter grid with 10-fold cross-validation
grid_dtcateg=GridSearchCV(estimator=tree.DecisionTreeClassifier(),param_grid=parameters,cv=10)

#Model fitting
grid_dtcateg.fit(X_train,y_train)

#Print the parameters of the best combination
print(grid_dtcateg.best_params_)
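Besides the best parameters, GridSearchCV also stores the cross-validated score of the winning combination, which is worth printing as well:

#Mean 10-fold cross-validation accuracy of the best combination
print(grid_dtcateg.best_score_)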

The best parameter combination is max_depth=3, min_samples_leaf=4, min_samples_split=2.
With the best parameters found, build a single decision tree:

#Single decision tree modeling
from sklearn import metrics

#Building classification decision tree
CART_Class=tree.DecisionTreeClassifier(max_depth=3,min_samples_leaf=4,min_samples_split=2)

#Model fitting
decision_tree=CART_Class.fit(X_train,y_train)

#Predict on the test set
pred=CART_Class.predict(X_test)

#Accuracy of the model
print('Prediction accuracy on the test set:',metrics.accuracy_score(y_test,pred))

The accuracy is about 0.83, which is not bad.
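Accuracy alone can hide where the model goes wrong; an optional extra step is to break the errors down by class with a confusion matrix and a per-class report, using the same sklearn.metrics module:

#Confusion matrix: rows are true classes, columns are predicted classes
print(metrics.confusion_matrix(y_test,pred))

#Precision, recall and F1 score for each class
print(metrics.classification_report(y_test,pred))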
Next, plot the ROC curve to evaluate the model from another angle:

import matplotlib.pyplot as plt

#Compute the data for the ROC curve
y_score=CART_Class.predict_proba(X_test)[:,1]
fpr,tpr,threshold=metrics.roc_curve(y_test,y_score)

#Calculate AUC value
roc_auc=metrics.auc(fpr,tpr)

#Draw the shaded area under the ROC curve
plt.stackplot(fpr,tpr,color='steelblue',alpha=0.5,edgecolor='black')

#Add the ROC curve line
plt.plot(fpr,tpr,color='black',lw=1)

#Add the diagonal reference line
plt.plot([0,1],[0,1],color='red',linestyle='-')

#Add the AUC annotation
plt.text(0.5,0.3,'ROC curve(area=%0.2f)'%roc_auc)

#Add x and y labels
plt.xlabel('1-Specificity')
plt.ylabel('Sensitivity')

#Display the figure
plt.show()
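The same ROC plot is drawn again below for the random forest. If you prefer not to repeat yourself, the plotting code can be wrapped in a small helper; plot_roc is just an illustrative name introduced here, not part of the original code:

#Reusable ROC plot helper (illustrative sketch)
from sklearn import metrics
import matplotlib.pyplot as plt

def plot_roc(model,X_test,y_test):
    #Predicted probability of the positive class
    y_score=model.predict_proba(X_test)[:,1]
    fpr,tpr,threshold=metrics.roc_curve(y_test,y_score)
    roc_auc=metrics.auc(fpr,tpr)
    #Shaded area, ROC line, diagonal reference and AUC label
    plt.stackplot(fpr,tpr,color='steelblue',alpha=0.5,edgecolor='black')
    plt.plot(fpr,tpr,color='black',lw=1)
    plt.plot([0,1],[0,1],color='red',linestyle='-')
    plt.text(0.5,0.3,'ROC curve(area=%0.2f)'%roc_auc)
    plt.xlabel('1-Specificity')
    plt.ylabel('Sensitivity')
    plt.show()

With this helper, the random forest plot below reduces to plot_roc(RF_class,X_test,y_test).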

To build a random forest model:

#A random forest can improve on the prediction accuracy of a single decision tree
from sklearn import ensemble

RF_class=ensemble.RandomForestClassifier(n_estimators=200,random_state=1234)

#Fitting of random forest
RF_class.fit(X_train,y_train)

#Predict on the test set
RFclass_pred=RF_class.predict(X_test)

#Accuracy of the model
print('Prediction accuracy on the test set:',metrics.accuracy_score(y_test,RFclass_pred))

The random forest's accuracy is about 0.85, higher than that of the single decision tree.
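Random forests also offer an out-of-bag (OOB) estimate, which scores the model on the samples each tree never saw during training, without a separate validation set. A minimal sketch, refitting with oob_score enabled (RF_oob is just an illustrative name):

#Refit with out-of-bag scoring enabled (optional variant)
RF_oob=ensemble.RandomForestClassifier(n_estimators=200,oob_score=True,random_state=1234)
RF_oob.fit(X_train,y_train)
print('OOB score:',RF_oob.oob_score_)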
Again, draw the ROC curve:

#Compute the data for the ROC curve
y_score=RF_class.predict_proba(X_test)[:,1]
fpr,tpr,threshold=metrics.roc_curve(y_test,y_score)

#Calculate AUC value
roc_auc=metrics.auc(fpr,tpr)

#Draw the shaded area under the ROC curve
plt.stackplot(fpr,tpr,color='steelblue',alpha=0.5,edgecolor='black')

#Add the ROC curve line
plt.plot(fpr,tpr,color='black',lw=1)

#Add the diagonal reference line
plt.plot([0,1],[0,1],color='red',linestyle='-')

#Add the AUC annotation
plt.text(0.5,0.3,'ROC curve(area=%0.2f)'%roc_auc)

#Add x and y labels
plt.xlabel('1-Specificity')
plt.ylabel('Sensitivity')

#Display the figure
plt.show()

Finally, which factors determined the survival rate on the Titanic?

#Feature importances from the fitted random forest
importance=RF_class.feature_importances_

#Build a Series for plotting
Impt_Series=pd.Series(importance,index=X_train.columns)

#Sort the importances and draw a horizontal bar chart
Impt_Series.sort_values(ascending=True).plot(kind='barh')
plt.show()
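To read the ranking numerically rather than from the chart, simply print the sorted Series:

#Print features ordered from most to least important
print(Impt_Series.sort_values(ascending=False))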

The conclusion stands out clearly in the chart. Does it give you any ideas of your own?

Topics: Python