Hands on data analysis: 3 Model building and evaluation

Posted by pkSML on Tue, 04 Jan 2022 19:12:34 +0100

In this chapter we put the data to use. The purpose of data analysis is to combine the data with our business understanding to obtain the results we need. The first step of the analysis is modeling: building a prediction model or some other model. Once the model produces results, we need to check whether it is reliable enough, so we also need to evaluate it.

We are working with the Titanic data set, so the task this time is to predict which passengers survived.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Image

%matplotlib inline

plt.rcParams['font.sans-serif'] = ['SimHei']  # Used to display Chinese labels normally
plt.rcParams['axes.unicode_minus'] = False  # Used to display negative signs normally
plt.rcParams['figure.figsize'] = (10, 6)  # Set output picture size

# Read original data set
train = pd.read_csv('train.csv')
train.shape

train.head()

# Read cleaned data set
data = pd.read_csv('clear_data.csv')
data.head()

Model building

After the data processing in the previous chapters, we have data ready for modeling. The next step is to select an appropriate model.
Before selecting a model, we need to know whether the task is supervised or unsupervised learning.
The choice of model is determined first of all by our task.
Beyond the task, the choice can also depend on the sample size of the data and the sparsity of the features.
We usually start with a basic model as a baseline, then train other models for comparison, and finally choose the model with better generalization ability or performance.

Here we build the models with sklearn (scikit-learn), one of the most commonly used machine learning libraries.

The sklearn algorithm selection flowchart is shown below for reference.

# sklearn algorithm selection flowchart
Image('sklearn.png')
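
The baseline-first workflow described above can be sketched as follows. This is only an illustration under stated assumptions: it runs on synthetic data from make_classification rather than the Titanic data, and it uses sklearn's DummyClassifier (which always predicts the most frequent class) as the baseline, a choice the original text does not prescribe.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the Titanic data set)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_clusters_per_class=1,
    class_sep=2.0, random_state=0)
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0)

# Baseline: always predict the most frequent class seen in training
baseline = DummyClassifier(strategy="most_frequent").fit(Xb_train, yb_train)
# Candidate model to compare against the baseline
candidate = LogisticRegression(max_iter=1000).fit(Xb_train, yb_train)

baseline_score = baseline.score(Xb_test, yb_test)
candidate_score = candidate.score(Xb_test, yb_test)
print("Baseline test accuracy: {:.2f}".format(baseline_score))
print("Logistic regression test accuracy: {:.2f}".format(candidate_score))
```

A model that cannot beat this kind of trivial baseline on the test set is probably not learning anything useful from the features.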

Task 1: split the training set and test set

Split the data set using the hold-out method

Separate the data set into independent variables (features) and the dependent variable (target)
Split the training set and test set in proportion (typical test set proportions are 30%, 25%, 20%, 15% and 10%)
Use stratified sampling
Set a random seed so that the results can be reproduced

Tip 1

The purpose of splitting the data set is to evaluate the generalization ability of the model
The function for splitting data sets in sklearn is train_test_split
To view the function documentation in Jupyter Notebook, run train_test_split? in a cell
Stratification and the random seed are set through the parameters stratify and random_state

from sklearn.model_selection import train_test_split

# X and y are usually separated before splitting, because the unsplit X and y are sometimes needed later.
# X is the cleaned data; y is the target we want to predict, 'Survived'
X = data
y = train['Survived']

# Split the data set (test_size is not given, so sklearn's default of 25% is used)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# View data shapes
X_train.shape, X_test.shape
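
To see what stratified sampling does, here is a small check on made-up labels. The 70/30 class balance, the dummy feature and the test_size of 0.2 are all assumptions chosen for illustration; none of them come from the Titanic data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70 samples of class 0, 30 of class 1
y_demo = np.array([0] * 70 + [1] * 30)
X_demo = np.arange(100).reshape(-1, 1)  # dummy single feature

# test_size=0.2 holds out 20% of the rows; stratify keeps the class ratio
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0)

train_ratio = ys_train.mean()  # share of class 1 in the training part
test_ratio = ys_test.mean()    # share of class 1 in the test part
print("Positive ratio (all / train / test): {:.2f} / {:.2f} / {:.2f}".format(
    y_demo.mean(), train_ratio, test_ratio))
```

With stratify set, both parts keep the 30% share of class 1 found in the full data. Without it, the share in a small test set can drift noticeably from one random seed to another.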

Task 2: model creation

Create a classification model based on a linear model (logistic regression)
Create a tree based classification model (decision tree, random forest)
These models are used for training respectively, and the scores of training set and test set are obtained respectively
View the parameters of the model, change the parameter values, and observe the changes of the model

Tip 2

Logistic regression is a classification model, not a regression model; do not confuse it with linear regression
A random forest is an ensemble of decision trees, designed to reduce the overfitting of a single decision tree
Linear models are in the sklearn.linear_model module
Random forests are in the sklearn.ensemble module (single decision trees are in sklearn.tree)

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

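# Logistic regression classification model with default parameters
lr = LogisticRegression()
lr.fit(X_train, y_train)

# View training set and test set scores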
print("Training set score: {:.2f}".format(lr.score(X_train, y_train)))
print("Testing set score: {:.2f}".format(lr.score(X_test, y_test)))

# Logistic regression model after adjusting parameters
lr2 = LogisticRegression(C=100)
lr2.fit(X_train, y_train)

print("Training set score: {:.2f}".format(lr2.score(X_train, y_train)))
print("Testing set score: {:.2f}".format(lr2.score(X_test, y_test)))

# Random forest classification model with default parameters
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)

print("Training set score: {:.2f}".format(rfc.score(X_train, y_train)))
print("Testing set score: {:.2f}".format(rfc.score(X_test, y_test)))

# Random forest classification model with adjusted parameters
rfc2 = RandomForestClassifier(n_estimators=100, max_depth=5)
rfc2.fit(X_train, y_train)

print("Training set score: {:.2f}".format(rfc2.score(X_train, y_train)))
print("Testing set score: {:.2f}".format(rfc2.score(X_test, y_test)))
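
The task also mentions a single decision tree, and Tip 2 says a random forest reduces its overfitting. The sketch below compares the two and shows one way to view a model's parameters with get_params. It uses synthetic data with deliberately noisy labels (an assumption, not the Titanic data), so the exact scores will differ from what you would get on the real data set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise (assumption: not the Titanic data)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=4, flip_y=0.1, random_state=0)
Xt_train, Xt_test, yt_train, yt_test = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0)

# A fully grown single tree can memorize the training set
tree = DecisionTreeClassifier(random_state=0).fit(Xt_train, yt_train)
# A forest averages many randomized trees
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(Xt_train, yt_train)

for name, model in [("Decision tree", tree), ("Random forest", forest)]:
    print("{}: train {:.2f}, test {:.2f}".format(
        name, model.score(Xt_train, yt_train), model.score(Xt_test, yt_test)))

# View the parameters of a model
forest_params = forest.get_params()
print(forest_params["n_estimators"], forest_params["max_depth"])
```

A large gap between the training score and the test score is the usual sign of overfitting. Limiting max_depth, as rfc2 does above, is one way to narrow that gap.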

Task 3: output model prediction results

Output model prediction classification label
Output the prediction probability of different classification labels

Tip 3

Supervised models in sklearn generally have a predict method, which outputs the predicted labels, and a predict_proba method, which outputs the label probabilities

# Predicted labels
pred = lr.predict(X_train)

# At this point, we can see an array of 0s and 1s
pred[:10]

array([0, 1, 1, 1, 0, 0, 1, 0, 1, 1])

# Predicted label probabilities
pred_proba = lr.predict_proba(X_train)

pred_proba[:10]
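
The relationship between the two outputs can be checked directly. The sketch below uses synthetic data (an assumption, not the Titanic data) and shows that predict_proba returns one column per class, in the order given by classes_, and that predict corresponds to the class with the highest probability.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption: not the Titanic data set)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

labels = clf.predict(X_demo)        # shape (n_samples,)
proba = clf.predict_proba(X_demo)   # shape (n_samples, n_classes)

# Columns follow clf.classes_, and each row sums to 1
print(clf.classes_)
print(proba[:3])

# Rebuild the labels by picking the most probable class in each row
labels_from_proba = clf.classes_[np.argmax(proba, axis=1)]
print((labels == labels_from_proba).all())
```

For a binary problem this is the same as applying a 0.5 threshold to the probability of class 1. Using predict_proba directly lets you choose a different threshold when the costs of the two error types differ.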

Model evaluation

The purpose of model evaluation is to estimate the generalization ability of the model.
Cross validation is a statistical method for evaluating generalization performance. It is more stable and thorough than a single split into training set and test set.
In cross validation, the data is split several times and multiple models are trained.
The most commonly used form is k-fold cross validation, where k is a number specified by the user, usually 5 or 10.
Precision measures how many of the samples predicted as positive are actually positive
Recall measures how many of the actual positive samples are predicted as positive
The f-score is the harmonic mean of precision and recall

Task 1: cross validation

Use 10-fold cross validation to evaluate the logistic regression model
Calculate the average of cross validation accuracy

Image('Snipaste_2020-01-05_16-37-56.png')

Tip 4

The cross validation module in sklearn is sklearn.model_selection

What is the effect of using a larger k?

from sklearn.model_selection import cross_val_score

lr = LogisticRegression(C=100)
scores = cross_val_score(lr, X_train, y_train, cv=10)

# k-fold cross validation score
scores

array([0.82352941, 0.79411765, 0.80597015, 0.80597015, 0.8358209 ,
       0.88059701, 0.72727273, 0.86363636, 0.75757576, 0.71212121])

# Average cross validation score
print("Average cross-validation score: {:.2f}".format(scores.mean()))
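
To make the "split several times, train several models" idea concrete, the sketch below writes the loop out by hand and compares it with cross_val_score. It uses synthetic data and 5 folds (both assumptions for illustration). For a classifier with an integer cv, cross_val_score uses stratified k-fold splitting, so StratifiedKFold is used here as well.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption: not the Titanic data set)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
model = LogisticRegression(max_iter=1000)

# One call: 5 splits, 5 fitted models, 5 scores
auto_scores = cross_val_score(model, X_demo, y_demo, cv=5)

# The same procedure written out by hand: one fresh model per fold
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X_demo, y_demo):
    fold_model = LogisticRegression(max_iter=1000)
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[test_idx], y_demo[test_idx]))
manual_scores = np.array(manual_scores)

print("cross_val_score:", auto_scores)
print("manual loop:    ", manual_scores)
```

Each sample is used for testing exactly once. With a larger k, every model trains on more of the data, but more models have to be trained and each test fold is smaller, so the per-fold scores vary more.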

Task 2: confusion matrix

Calculate the confusion matrix of the binary classification problem
Calculate the precision, recall and f-score

Image('Snipaste_2020-01-05_16-38-26.png')

Image('Snipaste_2020-01-05_16-39-27.png')

Tip 5

The confusion matrix function is in the sklearn.metrics module
The confusion matrix requires the input of real labels and prediction labels

from sklearn.metrics import confusion_matrix

# Train the model
lr = LogisticRegression(C=100)
lr.fit(X_train, y_train)

LogisticRegression(C=100, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

# Model prediction results
pred = lr.predict(X_train)

# Confusion matrix
confusion_matrix(y_train, pred)

array([[350,  62],
       [ 71, 185]], dtype=int64)

from sklearn.metrics import classification_report

# Precision, recall and F1 score
print(classification_report(y_train, pred))
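
The figures in the report can be reproduced by hand from the confusion matrix printed above. In sklearn's layout the rows are the true labels and the columns are the predicted labels, so for the positive class (1) the counts are as follows. This is a worked calculation only, using the standard definitions of the metrics.

```python
# Counts taken from the confusion matrix output shown above
tn, fp = 350, 62   # true class 0: predicted 0, predicted 1
fn, tp = 71, 185   # true class 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)   # correct predictions / all samples
precision = tp / (tp + fp)                   # true positives / predicted positives
recall = tp / (tp + fn)                      # true positives / actual positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print("accuracy:  {:.3f}".format(accuracy))   # 0.801
print("precision: {:.3f}".format(precision))  # 0.749
print("recall:    {:.3f}".format(recall))     # 0.723
print("f1:        {:.3f}".format(f1))         # 0.736
```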

Task 3: ROC curve

Draw ROC curve

Tip 6

The ROC curve function is in the sklearn.metrics module
The larger the area under the ROC curve (AUC), the better

from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_test, lr.decision_function(X_test))
plt.plot(fpr, tpr, label="ROC Curve")
plt.xlabel("FPR")
plt.ylabel("TPR (recall)")
# Find the threshold closest to 0
close_zero = np.argmin(np.abs(thresholds))
plt.plot(fpr[close_zero], tpr[close_zero], 'o', markersize=10, label="threshold zero", fillstyle="none", c='k', mew=2)
plt.legend(loc=4)
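
The area mentioned in Tip 6 can be computed as a single number. The sketch below does this in two ways on synthetic data (an assumption, not the Titanic data): by integrating the curve returned by roc_curve with auc, and by calling roc_auc_score directly. Plotting is left out to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the Titanic data set)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_clusters_per_class=1, random_state=0)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(Xr_train, yr_train)

# Continuous scores are needed for a ROC curve, not hard 0/1 labels
demo_scores = clf.decision_function(Xr_test)

fpr_demo, tpr_demo, _ = roc_curve(yr_test, demo_scores)
area_from_curve = auc(fpr_demo, tpr_demo)           # trapezoidal area under the curve
area_direct = roc_auc_score(yr_test, demo_scores)   # same quantity in one call

print("AUC from curve:  {:.3f}".format(area_from_curve))
print("AUC direct call: {:.3f}".format(area_direct))
```

An AUC of 0.5 corresponds to random guessing and 1.0 to a perfect ranking of positives above negatives.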

Topics: Data Analysis Data Mining
