Chapter III model establishment and evaluation

Posted by sidsel on Mon, 24 Jan 2022 20:30:06 +0100

Objective: to complete the Titanic survival prediction based on the Titanic data set.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Image
plt.rcParams['font.sans-serif'] = ['SimHei']  
# Used to display Chinese labels normally
plt.rcParams['axes.unicode_minus'] = False  
# Used to display negative signs normally
plt.rcParams['figure.figsize'] = (10, 6)  
# Set output picture size
train = pd.read_csv('train.csv')#Original data
data = pd.read_csv('clear_data.csv')
data.head(10)#Load cleaned data

Screenshot of some data after cleaning:

Compared with the original data, the cleared data lacks unimportant character variables such as name, ticket and cabin,
The information of gender and boarding port is transformed into 0-1 variables, which is more intuitive as a whole.

Model building

1. Cut training set and test set

from sklearn.model_selection import train_test_split
# Generally, x and y are taken out before cutting. In some cases, uncut ones will be used. At this time, x and y can be used. x is the cleaned data and Y is the survival data we want to predict, 'Survived'
X = data
y = train['Survived']
# Cut the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
# View data shapes
X_train.shape, X_test.shape

The function of stratify=y is to keep the data classification proportion of Y in the test set consistent with that in the whole data set. And random_state=0 ensures that random results can be repeated.

 2. Model creation

The model is divided into linear model based classification model (logistic regression) and tree based classification model (decision tree, random forest).

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
lr = LogisticRegression()
lr.fit(X_train, y_train)
#Default parameter logistic regression model
# View training set and test set score values
print("Training set score: {:.2f}".format(lr.score(X_train, y_train)))
print("Testing set score: {:.2f}".format(lr.score(X_test, y_test)))

train set is used to train the model and estimate parameters.

test set is used to test and evaluate the quality of the trained model, not for the training model.

# Logistic regression model after adjusting parameters
lr2 = LogisticRegression(C=100)
lr2.fit(X_train, y_train)
print("Training set score: {:.2f}".format(lr2.score(X_train, y_train)))
print("Testing set score: {:.2f}".format(lr2.score(X_test, y_test)))

 

# Random forest classification model with default parameters
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
print("Training set score: {:.2f}".format(rfc.score(X_train, y_train)))
print("Testing set score: {:.2f}".format(rfc.score(X_test, y_test)))

 

# Stochastic forest classification model with adjusted parameters
rfc2 = RandomForestClassifier(n_estimators=100, max_depth=5)
rfc2.fit(X_train, y_train)
print("Training set score: {:.2f}".format(rfc2.score(X_train, y_train)))
print("Testing set score: {:.2f}".format(rfc2.score(X_test, y_test)))

3. Output model prediction results

The general supervision model has a predict in sklearn, which can output the prediction tag_ Proba can output label probability.

# Forecast label
pred = lr.predict(X_train)
pred[:10]

# Predicted tag probability
pred_proba = lr.predict_proba(X_train)
pred_proba[:10]

 

Survival can be divided into 0 and 1. So the prediction tag is 0 or 1. In the prediction tag probability, each row corresponds to a piece of prediction data, and the two columns correspond to the prediction probability for 0 and 1 respectively. The sum of probability is 1. Take the category with the highest probability as the prediction result of the sample. For example, in the first line, 0.60877 > 0.39122, so the prediction result is 0.

Model evaluation

  • Model evaluation is to know the generalization ability of the model.
  • Cross validation is a statistical method to evaluate generalization performance. It is more stable and comprehensive than the method of dividing training set and test set.
  • In cross validation, the data is divided many times and multiple models need to be trained.
  • The most commonly used cross validation is k-fold cross validation, where k is the number specified by the user, usually 5 or 10.
  • Accuracy measures how many samples are predicted to be positive examples
  • recall measures how many positive samples are predicted to be positive
  • f-score is the harmonic average of accuracy and recall

1. Cross validation

from sklearn.model_selection import cross_val_score
lr = LogisticRegression(C=100)
scores = cross_val_score(lr, X_train, y_train, cv=10)
scores# k-fold cross validation score
# Average cross validation score
print("Average cross-validation score: {:.2f}".format(scores.mean()))

 

 

The larger the value of k, the greater the amount of calculation, which may be unbearable, so it needs to be selected carefully.

2. Confusion matrix

*TP(True Positive): predict the positive class as the number of positive classes. The true value is 1 and the predicted value is 1

*FN(False Negative): predict the number of positive classes as the number of negative classes. The true value is 1 and the predicted value is 0

*FP(False Positive): the negative class is predicted to be the number of positive classes. The true value is 0 and the predicted value is 1

*TN(True Negative): predict the negative class as the number of negative classes. The true value is 0 and the predicted value is 0

 

from sklearn.metrics import confusion_matrix
# Training model
lr = LogisticRegression(C=100)
lr.fit(X_train, y_train)
# Model prediction results
pred = lr.predict(X_train)
# Confusion matrix
confusion_matrix(y_train, pred)

 

from sklearn.metrics import classification_report
# Accuracy, recall and F1 score
print(classification_report(y_train, pred))

It can be seen that the accuracy of this model is 79%, in which the accurate probability of predicting death (i.e. 0) is 81%, and the accurate probability of predicting survival (i.e. 1) is 75%, indicating that the prediction effect of this model is good.

 3.ROC curve

The full name of ROC is Receiver Operating Characteristic Curve, that is, the working characteristic curve of the subject. The abscissa of the curve is false positive rate (FPR) and the ordinate is true positive rate (TPR). TPR=TP/(TP+FN),FPR=FP/(FP+TN). The larger the area surrounded under the ROC curve, the better the TPR, and the smaller the FPR, the better. The two indicators restrict each other.

from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_test, lr.decision_function(X_test))
plt.plot(fpr, tpr, label="ROC Curve")
plt.xlabel("FPR")
plt.ylabel("TPR (recall)")
# The threshold closest to 0 was found
close_zero = np.argmin(np.abs(thresholds))
plt.plot(fpr[close_zero], tpr[close_zero], 'o', markersize=10, label="threshold zero", fillstyle="none", c='k', mew=2)
plt.legend(loc=4)

When a threshold is specified, instances greater than this threshold are classified as positive, and instances less than this value are classified as negative. Each point on the ROC curve corresponds to a threshold. For a classifier, there will be a TPR and FPR under each threshold. The closer the ROC is to the upper left corner, the better the classifier effect. Finding the threshold closest to 0 can be used to judge the effect of 0-1 classification, and the prediction model is good.

Summary: in the third chapter, I learned more about machine learning about model establishment and evaluation, and came into contact with classification model, prediction label, cross validation, confusion matrix, ROC curve and other knowledge. With more confusion, I also realized that I have a lot to learn and gained a lot.

Topics: Python Machine Learning Deep Learning