In this chapter we put our data to use. The purpose of data analysis is to combine the data with the business problem and obtain the results we need. The first step of the analysis is modeling: building a prediction model or some other model. Once we have the model's results, we need to judge whether the model is reliable enough, so we also have to evaluate it.
We have the Titanic data set, so our task this time is to predict the survival of the Titanic passengers.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Image
%matplotlib inline
plt.rcParams['font.sans-serif'] = ['SimHei']  # Used to display Chinese labels normally
plt.rcParams['axes.unicode_minus'] = False    # Used to display negative signs normally
plt.rcParams['figure.figsize'] = (10, 6)      # Set output figure size
# Read the original data set
train = pd.read_csv('train.csv')
train.shape
train.head()

# Read the cleaned data set
data = pd.read_csv('clear_data.csv')
data.head()
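A quick way to see what the cleaning step produced is to compare the columns of the two files; as the splitting cell below suggests, the target column 'Survived' stays in train.csv while data holds the cleaned features. A small check:

# Compare the columns of the raw and cleaned data sets
print(train.columns.tolist())
print(data.columns.tolist())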
Model building
- After processing the data in the previous chapters, we obtain the data for modeling. The next step is to select an appropriate model
- Before selecting a model, we need to know whether the task is supervised or unsupervised learning
- On the one hand, the choice of model is determined by our task
- Besides the task, the model can also be chosen according to the sample size of the data and the sparsity of the features
- At the beginning, we always try a basic model as the baseline, then train other models for comparison, and finally choose the model with better generalization ability or performance
Here we use sklearn, the most commonly used machine learning library, to build the model.
The following sklearn algorithm selection path is given for reference.
# sklearn model algorithm selection path graph
Image('sklearn.png')
Task 1: split the training set and test set
Here the data set is divided using the hold-out method
- The data set is divided into independent variables and dependent variables
- Split the training set and test set in proportion (the test set proportion is typically 30%, 25%, 20%, 15%, or 10%)
- Use stratified sampling
- Set random seeds so that the results can be reproduced
Task tip 1
- The purpose of cutting the data set is to evaluate the generalization ability of the model
- The sklearn function for splitting data sets is train_test_split
- To view its documentation, type train_test_split? in a Jupyter notebook and press Enter
- Stratification and the random seed are set through its parameters
from sklearn.model_selection import train_test_split
# Generally, X and y are taken out before splitting; in some cases the unsplit data is used directly, so we keep X and y.
# X is the cleaned data and y is the target we want to predict, 'Survived'
X = data
y = train['Survived']
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
# View data shapes
X_train.shape, X_test.shape
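Because stratify=y was passed, the proportion of survivors should be almost identical in the two splits. A quick sanity check (since 'Survived' is a 0/1 label, its mean is the positive rate):

# Stratified sampling should keep the class proportions nearly equal
y_train.mean(), y_test.mean()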
Task 2: model creation
- Create a classification model based on a linear model (logistic regression)
- Create tree-based classification models (decision tree, random forest)
- Train each of these models and obtain their scores on the training set and test set
- View the parameters of the models, change the parameter values, and observe how the models change
Tip 2
- Logistic regression is not a regression model but a classification model; it should not be confused with linear regression
- A random forest is an ensemble of decision trees, built to reduce the overfitting of a single decision tree
- Linear models live in the sklearn.linear_model module
- Tree models live in the sklearn.ensemble module
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
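The scoring cell below uses a fitted model named lr, so the model first has to be created and fitted; a minimal sketch with default parameters, using the variable name that the scoring code expects:

# Logistic regression model with default parameters
lr = LogisticRegression()
lr.fit(X_train, y_train)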
# View training set and test set score values
print("Training set score: {:.2f}".format(lr.score(X_train, y_train)))
print("Testing set score: {:.2f}".format(lr.score(X_test, y_test)))
# Logistic regression model after adjusting parameters
lr2 = LogisticRegression(C=100)
lr2.fit(X_train, y_train)
print("Training set score: {:.2f}".format(lr2.score(X_train, y_train))) print("Testing set score: {:.2f}".format(lr2.score(X_test, y_test)))
# Random forest classification model with default parameters
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
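Mirroring the logistic regression step, we can print the default random forest's scores so the tuned version below has something to be compared against (a small sketch):

print("Training set score: {:.2f}".format(rfc.score(X_train, y_train)))
print("Testing set score: {:.2f}".format(rfc.score(X_test, y_test)))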
# Random forest classification model with adjusted parameters
rfc2 = RandomForestClassifier(n_estimators=100, max_depth=5)
rfc2.fit(X_train, y_train)
print("Training set score: {:.2f}".format(rfc2.score(X_train, y_train))) print("Testing set score: {:.2f}".format(rfc2.score(X_test, y_test)))
Task 3: output model prediction results
- Output the model's predicted classification labels
- Output the prediction probability of each classification label
Tip 3
- In sklearn, most supervised models have a predict method that outputs the predicted labels, and a predict_proba method that outputs the probability of each label
# Predict labels
pred = lr.predict(X_train)
# At this point, we can see an array of 0s and 1s
pred[:10]
array([0, 1, 1, 1, 0, 0, 1, 0, 1, 1])
# Predicted label probabilities
pred_proba = lr.predict_proba(X_train)
pred_proba[:10]
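Each row of pred_proba contains the predicted probability of class 0 and of class 1, so the two columns of every row sum to 1; a quick check:

# The per-row probabilities sum to 1
pred_proba[:10].sum(axis=1)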
Model evaluation
- Model evaluation is done to know the generalization ability of the model
- Cross validation is a statistical method for evaluating generalization performance. It is more stable and thorough than a single split into a training set and test set
- In cross validation, the data is divided several times and multiple models need to be trained
- The most commonly used form is k-fold cross validation, where k is a number specified by the user, usually 5 or 10
- Precision measures how many of the samples predicted as positive are actually positive
- Recall measures how many of the actually positive samples are predicted as positive
- The f-score is the harmonic mean of precision and recall (all three are sketched in code after this list)
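For concreteness, here is how the three metrics are computed from the entries of a binary confusion matrix. A minimal sketch in which the TP/FP/FN counts are made-up placeholders, not values from our model:

# Placeholder counts from a hypothetical binary confusion matrix
TP, FP, FN = 80, 20, 40

precision = TP / (TP + FP)  # of the samples predicted positive, the fraction that is truly positive
recall = TP / (TP + FN)     # of the truly positive samples, the fraction predicted positive
f_score = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print("precision: {:.2f}, recall: {:.2f}, f-score: {:.2f}".format(precision, recall, f_score))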
Task 1: cross validation
- Evaluate the logistic regression model with 10-fold cross validation
- Calculate the average cross-validation accuracy
Image('Snipaste_2020-01-05_16-37-56.png')
Tip 4
- The cross validation module in sklearn is sklearn.model_selection
- What impact does a larger k bring? (one way to explore this is sketched after the cross-validation code below)
from sklearn.model_selection import cross_val_score
lr = LogisticRegression(C=100)
scores = cross_val_score(lr, X_train, y_train, cv=10)
# k-fold cross validation scores
scores
array([0.82352941, 0.79411765, 0.80597015, 0.80597015, 0.8358209 ,
       0.88059701, 0.72727273, 0.86363636, 0.75757576, 0.71212121])
# Average cross validation score
print("Average cross-validation score: {:.2f}".format(scores.mean()))
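To explore the question from Tip 4, we can rerun the validation with several values of k and compare the mean scores; a hypothetical experiment sketch:

# Compare the mean cross-validation score for different numbers of folds
for k in [3, 5, 10, 20]:
    scores_k = cross_val_score(LogisticRegression(C=100), X_train, y_train, cv=k)
    print("k = {:2d}, mean score: {:.2f}".format(k, scores_k.mean()))

A larger k means more models to train (hence more computation), but each model is trained on a larger fraction of the data; at the same time each test fold gets smaller, so individual fold scores become noisier.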
Task 2: confusion matrix
- Calculate the confusion matrix of a binary classification problem
- Calculate the precision, recall and f-score
Image('Snipaste_2020-01-05_16-38-26.png')
Image('Snipaste_2020-01-05_16-39-27.png')
Tip 5
- The confusion matrix method is in sklearn's sklearn.metrics module
- The confusion matrix requires the true labels and the predicted labels as input
from sklearn.metrics import confusion_matrix
# Train the model
lr = LogisticRegression(C=100)
lr.fit(X_train, y_train)
LogisticRegression(C=100, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
# Model prediction results
pred = lr.predict(X_train)
# Confusion matrix
confusion_matrix(y_train, pred)
array([[350,  62],
       [ 71, 185]], dtype=int64)
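In sklearn's convention the rows of the matrix are the true labels and the columns are the predicted labels, so for the positive class ('Survived' = 1) we can read off TP = 185, FP = 62, FN = 71 and recompute precision and recall by hand; they should match the classification_report below:

# Row 0 = true label 0, row 1 = true label 1; columns in the same order
TP, FP, FN = 185, 62, 71
print("precision: {:.2f}".format(TP / (TP + FP)))  # 185 / 247 ≈ 0.75
print("recall:    {:.2f}".format(TP / (TP + FN)))  # 185 / 256 ≈ 0.72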
from sklearn.metrics import classification_report
# Precision, recall and F1 score
print(classification_report(y_train, pred))
Task 3: ROC curve
- Draw ROC curve
Tip 6
- The ROC curve module in sklearn is sklearn.metrics
- The larger the area under the ROC curve, the better
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_test, lr.decision_function(X_test))
plt.plot(fpr, tpr, label="ROC Curve")
plt.xlabel("FPR")
plt.ylabel("TPR (recall)")
# Find the threshold closest to 0
close_zero = np.argmin(np.abs(thresholds))
plt.plot(fpr[close_zero], tpr[close_zero], 'o', markersize=10,
         label="threshold zero", fillstyle="none", c='k', mew=2)
plt.legend(loc=4)
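A single-number summary of the curve is the area under it (AUC), which sklearn can compute directly from the same decision scores; a minimal sketch:

from sklearn.metrics import roc_auc_score

# Area under the ROC curve; 1.0 is a perfect ranking, 0.5 is random guessing
print("AUC: {:.2f}".format(roc_auc_score(y_test, lr.decision_function(X_test))))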