Hands-on data analysis: 3 Model building and evaluation

Posted by pkSML on Tue, 04 Jan 2022 19:12:34 +0100

This chapter is about putting the data to use. The purpose of data analysis is to combine the data with the business question at hand and obtain the results we need. The first step is modeling: building a prediction model or some other kind of model. Once the model produces results, we need to check whether it is reliable enough, so the second step is to evaluate it.

We have the Titanic data set, so the goal this time is to predict which passengers survived.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Image
%matplotlib inline
plt.rcParams['font.sans-serif'] = ['SimHei']  # Used to display Chinese labels normally
plt.rcParams['axes.unicode_minus'] = False  # Used to display negative signs normally
plt.rcParams['figure.figsize'] = (10, 6)  # Set output picture size
# Read the original data set
train = pd.read_csv('train.csv')

# Read the cleaned data set
data = pd.read_csv('clear_data.csv')

Model building

  • After the processing in the previous chapters, we have data that is ready for modeling. The next step is to select an appropriate model
  • Before selecting a model, we need to know whether the problem is a supervised or an unsupervised learning task
  • The choice of model is determined first of all by the task
  • Beyond the task, the choice also depends on the sample size and on how sparse the features are
  • We usually start with a basic model as the baseline, then train other models for comparison, and finally choose the model with the better generalization ability or performance
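
As a rough illustration of the baseline idea in the last point, here is a minimal sketch. It makes several assumptions that are not in the original: the data is synthetic (generated with make_classification, not the Titanic set), the baseline is sklearn's DummyClassifier that always predicts the most frequent class, and the names X_demo, y_demo, baseline and candidate are made up for the example.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; this is not the Titanic set)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0)

# Baseline: always predict the most frequent class seen in training
baseline = DummyClassifier(strategy="most_frequent").fit(Xb_train, yb_train)
# Candidate model to compare against the baseline
candidate = LogisticRegression(max_iter=1000).fit(Xb_train, yb_train)

baseline_acc = baseline.score(Xb_test, yb_test)
candidate_acc = candidate.score(Xb_test, yb_test)
print("Baseline test accuracy:  {:.2f}".format(baseline_acc))
print("Candidate test accuracy: {:.2f}".format(candidate_acc))
```

A model is only worth keeping if it clearly beats the baseline on data it has not seen during training.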

Here the models are built with sklearn, the library most commonly used for machine learning.

The sklearn algorithm selection flowchart is given below for your reference

[Figure: sklearn model algorithm selection path]

Task 1: split training set and test set

The data set is split using the hold-out method

  • Separate the data set into independent variables (features) and the dependent variable (label)
  • Split it into a training set and a test set in proportion (common test set proportions are 30%, 25%, 20%, 15% and 10%)
  • Use stratified sampling
  • Set a random seed so that the results can be reproduced

Tip 1

  • The purpose of splitting the data set is to evaluate the generalization ability of the model
  • The function for splitting data sets in sklearn is train_test_split
  • To view the function documentation, type train_test_split? in a Jupyter notebook and press Enter
  • Stratification and the random seed are both set through the function's parameters (stratify and random_state)
from sklearn.model_selection import train_test_split
# X and y are usually taken out before splitting, because the unsplit versions are sometimes needed later.
# X is the cleaned data; y is the value we want to predict, the 'Survived' column
X = data
y = train['Survived']
# Split the data set (test_size is not given, so sklearn's default of 25% is used)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
# View data shapes
X_train.shape, X_test.shape
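
To see what stratify does, here is a minimal sketch on toy data. The arrays X_toy and y_toy are made up for illustration (70 negatives, 30 positives) and have nothing to do with the Titanic files; test_size=0.2 is also just an example value.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels (an assumption): 70 zeros and 30 ones
y_toy = np.array([0] * 70 + [1] * 30)
X_toy = np.arange(100).reshape(-1, 1)

# Hold out 20% as the test set, keeping the class ratio in both parts
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0)

print(Xs_train.shape, Xs_test.shape)
print("positive rate in train:", ys_train.mean())
print("positive rate in test: ", ys_test.mean())
```

With stratification, both parts keep the 30% positive rate of the full toy set. Without it, the rates in a small test set can drift away from the overall rate just by chance.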


Task 2: model creation

  • Create a classification model based on a linear model (logistic regression)
  • Create a tree-based classification model (decision tree, random forest)
  • Train each of these models and get its score on the training set and on the test set
  • View the parameters of the model, change the parameter values, and observe how the model changes

Tip 2

  • Logistic regression is not a regression model but a classification model, and should not be confused with linear regression
  • A random forest is an ensemble of decision trees, built to reduce the overfitting of a single decision tree
  • The module for linear models is sklearn.linear_model
  • The module for the random forest is sklearn.ensemble
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
# Logistic regression model with default parameters
lr = LogisticRegression()
lr.fit(X_train, y_train)
# View training set and test set score values
print("Training set score: {:.2f}".format(lr.score(X_train, y_train)))
print("Testing set score: {:.2f}".format(lr.score(X_test, y_test)))
# Logistic regression model after adjusting parameters
lr2 = LogisticRegression(C=100)
lr2.fit(X_train, y_train)
print("Training set score: {:.2f}".format(lr2.score(X_train, y_train)))
print("Testing set score: {:.2f}".format(lr2.score(X_test, y_test)))
# Random forest classification model with default parameters
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
print("Training set score: {:.2f}".format(rfc.score(X_train, y_train)))
print("Testing set score: {:.2f}".format(rfc.score(X_test, y_test)))

# Random forest classification model with adjusted parameters
rfc2 = RandomForestClassifier(n_estimators=100, max_depth=5)
rfc2.fit(X_train, y_train)
print("Training set score: {:.2f}".format(rfc2.score(X_train, y_train)))
print("Testing set score: {:.2f}".format(rfc2.score(X_test, y_test)))
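
For the "view the parameters, change the parameter values" part of the task, one possible approach is sklearn's get_params and set_params methods, sketched below. The variable name forest and the chosen values are only for illustration.

```python
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)

# get_params() returns a dict with every constructor parameter and its value
params = forest.get_params()
print(params["n_estimators"], params["max_depth"])

# set_params() changes parameters on the existing object;
# the model has to be fitted again before the change has any effect
forest.set_params(max_depth=3)
print(forest.get_params()["max_depth"])
```

After changing a parameter, refit the model and compare the training and test scores again to observe the effect.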


Task 3: output model prediction results

  • Output the class labels predicted by the model
  • Output the predicted probability of each class label

Tip 3

  • Supervised models in sklearn generally have a predict method that outputs the predicted labels; predict_proba outputs the label probabilities
# Predicted labels
pred = lr.predict(X_train)
# Look at the first 10 predictions: an array of 0s and 1s
pred[:10]
array([0, 1, 1, 1, 0, 0, 1, 0, 1, 1])
# Predicted label probabilities
pred_proba = lr.predict_proba(X_train)
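
The relationship between predict and predict_proba can be checked with a small sketch. It runs on synthetic data from make_classification (an assumption, not the Titanic set), and the names clf, labels and proba are made up for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption; this is not the Titanic set)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

labels = clf.predict(X_demo)        # shape (n_samples,)
proba = clf.predict_proba(X_demo)   # shape (n_samples, n_classes)

# The columns of proba follow the order of clf.classes_,
# so the most probable column gives the predicted label
labels_from_proba = clf.classes_[np.argmax(proba, axis=1)]

print(proba[:3])
print("rows sum to 1:", np.allclose(proba.sum(axis=1), 1.0))
print("argmax matches predict:", np.array_equal(labels, labels_from_proba))
```

Each row of the probability array holds one probability per class, and the predicted label is the class with the largest probability.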


Model evaluation

  • The purpose of model evaluation is to find out how well the model generalizes
  • Cross validation is a statistical method for evaluating generalization performance. It is more stable and comprehensive than a single split into training set and test set
  • In cross validation, the data is split several times and several models are trained
  • The most commonly used form is k-fold cross validation, where k is a number chosen by the user, usually 5 or 10
  • Precision measures how many of the samples predicted to be positive are actually positive
  • Recall measures how many of the actual positive samples are predicted to be positive
  • The f-score is the harmonic mean of precision and recall

Task 1: cross validation

  • Use 10-fold cross validation to evaluate the logistic regression model
  • Calculate the average cross validation accuracy

Tip 4

  • The cross validation module in sklearn is sklearn.model_selection
  • What is the effect of using a larger k?
from sklearn.model_selection import cross_val_score
lr = LogisticRegression(C=100)
scores = cross_val_score(lr, X_train, y_train, cv=10)
# k-fold cross validation scores
scores
array([0.82352941, 0.79411765, 0.80597015, 0.80597015, 0.8358209 ,
       0.88059701, 0.72727273, 0.86363636, 0.75757576, 0.71212121])
# Average cross validation score
print("Average cross-validation score: {:.2f}".format(scores.mean()))
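
To make it clearer what cross_val_score does, here is a sketch that writes the k-fold loop out by hand and compares it with the library call. It uses synthetic data (an assumption, not the Titanic set) and k=5; the names model, folds and manual_scores are made up for the example.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (an assumption; this is not the Titanic set)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
model = LogisticRegression(max_iter=1000)

# For a classifier and an integer cv, cross_val_score uses stratified folds
# without shuffling; the explicit loop below mirrors that behaviour
folds = StratifiedKFold(n_splits=5)
manual_scores = []
for train_idx, val_idx in folds.split(X_demo, y_demo):
    fold_model = clone(model)  # fresh, unfitted copy for each fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))
manual_scores = np.array(manual_scores)

auto_scores = cross_val_score(model, X_demo, y_demo, cv=5)
print(manual_scores)
print(auto_scores)
```

The loop also hints at the answer to the question about k: with k folds, k models are trained and each one sees (k-1)/k of the data, so a larger k means more training data per model but also more models to fit and smaller validation folds.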


Task 2: confusion matrix

  • Calculate the confusion matrix for the binary classification problem
  • Calculate precision, recall and f-score

Tip 5

  • The confusion matrix function in sklearn is in the sklearn.metrics module
  • The confusion matrix takes the true labels and the predicted labels as input
from sklearn.metrics import confusion_matrix
# Training model
lr = LogisticRegression(C=100)
lr.fit(X_train, y_train)
LogisticRegression(C=100, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
# Model prediction results
pred = lr.predict(X_train)
# Confusion matrix
confusion_matrix(y_train, pred)
array([[350,  62],
       [ 71, 185]], dtype=int64)
from sklearn.metrics import classification_report
# Precision, recall and F1 score
print(classification_report(y_train, pred))
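
The numbers in the report can also be worked out by hand from the confusion matrix printed above. The sketch below copies that matrix in as a literal and assumes sklearn's layout, in which rows are true labels and columns are predicted labels, so the four cells are TN, FP, FN and TP in that order.

```python
import numpy as np

# Confusion matrix from the output above:
# rows are true labels, columns are predicted labels
cm = np.array([[350, 62],
               [71, 185]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()     # share of all samples predicted correctly
precision = tp / (tp + fp)          # of predicted positives, how many are right
recall = tp / (tp + fn)             # of actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print("accuracy:  {:.3f}".format(accuracy))
print("precision: {:.3f}".format(precision))
print("recall:    {:.3f}".format(recall))
print("f1:        {:.3f}".format(f1))
```

These are the values for the positive class (Survived = 1) on the training set, since the matrix was computed from y_train.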


Task 3: ROC curve

  • Draw ROC curve

Tip 6

  • The module for the ROC curve in sklearn is sklearn.metrics
  • The larger the area under the ROC curve (AUC), the better
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_test, lr.decision_function(X_test))
plt.plot(fpr, tpr, label="ROC Curve")
plt.xlabel("FPR")
plt.ylabel("TPR (recall)")
# Find the threshold closest to 0
close_zero = np.argmin(np.abs(thresholds))
plt.plot(fpr[close_zero], tpr[close_zero], 'o', markersize=10, label="threshold zero", fillstyle="none", c='k', mew=2)
plt.legend(loc=4)
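
The area under the curve can be put into a single number. The sketch below shows two ways to get it in sklearn.metrics: integrating the curve points with auc, or calling roc_auc_score directly. It uses synthetic data (an assumption, not the Titanic set), and names such as clf and test_scores are made up for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; this is not the Titanic set)
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=0)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(Xr_train, yr_train)
test_scores = clf.decision_function(Xr_test)  # continuous scores, not labels

# Way 1: build the curve, then integrate it with the trapezoidal rule
fpr_demo, tpr_demo, _ = roc_curve(yr_test, test_scores)
area_from_curve = auc(fpr_demo, tpr_demo)

# Way 2: compute the area directly from labels and scores
area_direct = roc_auc_score(yr_test, test_scores)

print("AUC from curve: {:.3f}".format(area_from_curve))
print("AUC direct:     {:.3f}".format(area_direct))
```

An AUC of 0.5 corresponds to random guessing and 1.0 to a perfect ranking, which is why a larger area is better.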
