Hands-on Learning Data Analysis: Task 05

Posted by swallace on Sun, 19 Dec 2021 23:12:57 +0100

reference material: https://gitee.com/datawhalechina/hands-on-data-analysis

Chapter 3: Model Building and Evaluation - Modeling

With the knowledge from the previous two chapters, we can already work with the data itself (adding, deleting, querying and filling values) and do the necessary cleaning. In this chapter we start to actually use that data. The purpose of data analysis is to combine our data with the business problem and obtain the answers we need, and the first step of that analysis is modeling: building a prediction model or some other model. Once we have the model's results, we need to judge whether the model is reliable enough, so we also have to evaluate it. In this section we study modeling; in the next, evaluation.

We have the Titanic data set, so our goal this time is to complete the task of predicting survival on the Titanic.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Image
%matplotlib inline
plt.rcParams['font.sans-serif'] = ['SimHei']  # Used to display Chinese labels normally
plt.rcParams['axes.unicode_minus'] = False  # Used to display negative signs normally
plt.rcParams['figure.figsize'] = (10, 6)  # Set output picture size

Load these libraries; if any are missing, install them first.

[Thinking] What is each of these libraries for? Look them up if you are not sure.

%matplotlib inline

Load the cleaned data we provide (clear_data.csv) as well as the original data (train.csv), and describe how they differ.

#Write code
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
plt.rcParams['figure.figsize'] = (10,6)
#Write code
train=pd.read_csv('train.csv')
train.shape
(891, 12)
train.head()
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S
#Write code
data = pd.read_csv('clear_data.csv')
data.head()

   PassengerId  Pclass   Age  SibSp  Parch     Fare  Sex_female  Sex_male  Embarked_C  Embarked_Q  Embarked_S
0            0       3  22.0      1      0   7.2500           0         1           0           0           1
1            1       1  38.0      1      0  71.2833           1         0           1           0           0
2            2       3  26.0      0      0   7.9250           1         0           0           0           1
3            3       1  35.0      1      0  53.1000           1         0           0           0           1
4            4       3  35.0      0      0   8.0500           0         1           0           0           1

Model building

  • After the processing done earlier, we have data ready for modeling; the next step is to choose an appropriate model
  • Before choosing a model, we need to know whether the task is supervised or unsupervised learning
  • The choice of model is determined first of all by our task
  • Besides the task, the choice can also depend on the sample size and the sparsity of the features
  • We usually start with a simple model as a baseline, then train other models for comparison, and finally choose the one with better generalization ability or performance

We do not build the model from scratch by writing all the code ourselves; instead we use sklearn, the most commonly used machine learning library, to build it.

The sklearn algorithm selection path (cheat sheet) is shown below for reference.

# sklearn model algorithm selection path graph
Image('sklearn.png')

[Figure: sklearn model algorithm selection path diagram (image not rendered)]

[Thinking] What differences between data sets will change how a model fits the data?

Task 1: split the training set and test set

The data set is divided using the hold-out method:

  • Separate the data set into independent variables (features) and the dependent variable (label)
  • Split the training set and test set by proportion (common test set proportions are 30%, 25%, 20%, 15% and 10%)
  • Use stratified sampling
  • Set a random seed so that the results can be reproduced

[thinking]

  • What are the methods of dividing data sets?
  • Why use stratified sampling? What are the benefits?
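A minimal sketch to see the benefit of stratified sampling, assuming clear_data.csv and train.csv have already been loaded as data and train (as in the cells above); it compares the positive-class proportion produced by a plain random split and by a stratified split. This is not part of the original notebook.

#Sketch: plain split vs stratified split
from sklearn.model_selection import train_test_split

X = data
y = train['Survived']

# Plain random split: the class proportions can drift between train and test
_, _, y_tr_plain, y_te_plain = train_test_split(X, y, test_size=0.25, random_state=0)

# Stratified split: the class proportions are preserved in both subsets
_, _, y_tr_strat, y_te_strat = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)

print("overall positive rate :", round(y.mean(), 3))
print("plain split train/test:", round(y_tr_plain.mean(), 3), round(y_te_plain.mean(), 3))
print("stratified  train/test:", round(y_tr_strat.mean(), 3), round(y_te_strat.mean(), 3))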

Task tip 1

  • The purpose of splitting the data set is to evaluate the generalization ability of the model
  • The sklearn function for splitting data sets is train_test_split
  • To view the function documentation, run train_test_split? in a Jupyter notebook and press Enter
  • Stratification and the random seed are controlled through its parameters

Extract the parameters needed by train_test_split() from clear_data.csv and train.csv.

#Write code
from sklearn.model_selection import train_test_split
#Write code
X = data
y = train['Survived']
#Write code
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
#Write code
X_train.shape, X_test.shape


((668, 11), (223, 11))

Task 2: model creation

  • Create a classification model based on a linear model (logistic regression)
  • Create a tree-based classification model (decision tree, random forest)
  • Train each of these models and obtain their scores on the training set and test set
  • View the model's parameters, change their values, and observe how the model changes

Tips

  • Logistic regression is not a regression model but a classification model; it should not be confused with linear regression
  • A random forest is an ensemble of decision trees, built to reduce the overfitting of a single decision tree
  • The linear models live in the sklearn.linear_model module
  • The tree-ensemble models live in the sklearn.ensemble module
#Write code
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
#Write code
lr = LogisticRegression()
lr.fit(X_train, y_train)
E:\program\python\program data\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(





LogisticRegression()
#Write code
print("Training set score: {:.2f}".format(lr.score(X_train,y_train)))
print("Testing set score: {:.2f}".format(lr.score(X_test, y_test)))

Training set score: 0.80
Testing set score: 0.79
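The ConvergenceWarning printed above only means that the lbfgs solver stopped before fully converging. As the warning message itself suggests, you can either raise max_iter or scale the features; a small sketch of both options follows (the max_iter value and the use of a Pipeline are illustrative choices, not part of the original notebook).

#Sketch: two ways to avoid the ConvergenceWarning
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Option 1: allow more iterations
lr_more_iter = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Option 2: standardize the features so lbfgs converges faster
lr_scaled = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)

print("more iterations :", round(lr_more_iter.score(X_test, y_test), 2))
print("scaled features :", round(lr_scaled.score(X_test, y_test), 2))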
#Write code
lr2 = LogisticRegression(C=100)
lr2.fit(X_train,y_train)

E:\program\python\program data\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(





LogisticRegression(C=100)
print("Training set score: {:.2f}".format(lr2.score(X_train,y_train)))
print("Testing set score: {:.2f}".format(lr2.score(X_test, y_test)))
Training set score: 0.79
Testing set score: 0.78
#Random forest classification model with default parameters
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
RandomForestClassifier()
print("Training set score: {:.2f}".format(rfc.score(X_train,y_train)))
print("Testing set score: {:.2f}".format(rfc.score(X_test, y_test)))
Training set score: 1.00
Testing set score: 0.81
#Random forest classification model after adjusting parameters
rfc2 = RandomForestClassifier(n_estimators=100,max_depth=5)
rfc2.fit(X_train, y_train)
RandomForestClassifier(max_depth=5)
print("Training set score: {:.2f}".format(rfc2.score(X_train,y_train)))
print("Testing set score: {:.2f}".format(rfc2.score(X_test, y_test)))
Training set score: 0.87
Testing set score: 0.81
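For the step "view the parameters, change their values, and observe how the model changes", a quick sketch is to print the current parameters and sweep one of them; the depth values below are arbitrary examples, not part of the original notebook.

#Sketch: inspect parameters and sweep max_depth
print(rfc2.get_params())

for depth in [3, 5, 7, None]:
    model = RandomForestClassifier(n_estimators=100, max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    print("max_depth={}: train={:.2f}, test={:.2f}".format(
        depth, model.score(X_train, y_train), model.score(X_test, y_test)))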

Task 3: output model prediction results

  • Output the model's predicted class labels
  • Output the predicted probability of each class label

Tip 3

  • In sklearn, most supervised models have a predict method that outputs the predicted labels and a predict_proba method that outputs the label probabilities
#Write code
pred = lr.predict(X_train)
pred[:10]

array([0, 1, 1, 1, 0, 0, 1, 0, 1, 1], dtype=int64)
#Write code
pred_proba = lr.predict_proba(X_train)
pred_proba[:10]

array([[0.60890433, 0.39109567],
       [0.17653843, 0.82346157],
       [0.40595794, 0.59404206],
       [0.18889606, 0.81110394],
       [0.87987837, 0.12012163],
       [0.91389097, 0.08610903],
       [0.13279757, 0.86720243],
       [0.90556748, 0.09443252],
       [0.05280108, 0.94719892],
       [0.10934326, 0.89065674]])
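A quick sanity check ties the two outputs together: the columns of predict_proba follow lr.classes_, each row sums to 1, and predict() simply picks the class with the larger probability. This check is a sketch added here, not part of the original notebook.

#Sketch: relationship between predict and predict_proba
print(lr.classes_)                                   # column order of predict_proba
print(np.allclose(pred_proba.sum(axis=1), 1))        # each row sums to 1
print((pred_proba.argmax(axis=1) == pred).all())     # predict() = argmax of predict_proba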

Chapter 3: Model Building and Evaluation - Evaluation

From the modeling part above we know how to use the sklearn library to build a model, and how to split the data set. But how do we know whether a model actually works, so that we can use its results with confidence? That is what model evaluation, studied in this section, is for.

Load the following libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Image
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
%matplotlib inline
plt.rcParams['font.sans-serif'] = ['SimHei']  # Used to display Chinese labels normally
plt.rcParams['axes.unicode_minus'] = False  # Used to display negative signs normally
plt.rcParams['figure.figsize'] = (10, 6)  # Set output picture size

Task: load data and split test set and training set

#Write code
from sklearn.model_selection import train_test_split
#Write code
data = pd.read_csv('clear_data.csv')
train = pd.read_csv('train.csv')
X = data
y = train['Survived']
#Write code
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

#Write code
lr = LogisticRegression()
lr.fit(X_train,y_train)
E:\program\python\program data\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(





LogisticRegression()

Model evaluation

  • Model evaluation is done to understand the generalization ability of the model.
  • Cross-validation is a statistical method for evaluating generalization performance; it is more stable and thorough than a single split into a training set and test set.
  • In cross-validation the data is split multiple times and multiple models are trained (see the toy sketch after this list).
  • The most commonly used form is k-fold cross-validation, where k is a number specified by the user, usually 5 or 10.
  • Precision measures how many of the samples predicted as positive are actually positive.
  • Recall measures how many of the truly positive samples are predicted as positive.
  • The F-score is the harmonic mean of precision and recall.
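To make the "split multiple times" idea concrete, here is a toy sketch of 5-fold splitting on ten samples (the numbers are only for illustration): each sample lands in the test fold exactly once across the five rounds.

#Sketch: how k-fold splits the indices
import numpy as np
from sklearn.model_selection import KFold

toy = np.arange(10)
for i, (train_idx, test_idx) in enumerate(KFold(n_splits=5).split(toy)):
    print("fold {}: train={}, test={}".format(i, train_idx, test_idx))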

Task 1: cross validation

  • Evaluate the previous logistic regression model with 10-fold cross-validation
  • Compute the mean cross-validation accuracy
#Tip: cross validation
Image('Snipaste_2020-01-05_16-37-56.png')

[Figure: cross-validation illustration (image not rendered)]

Tip 4

  • The cross-validation utilities in sklearn are in the sklearn.model_selection module
#Write code
from sklearn.model_selection import cross_val_score

#Write code
lr = LogisticRegression(C=100)
scores = cross_val_score(lr, X_train, y_train, cv=10)
E:\program\python\program data\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
(The same ConvergenceWarning is printed once for each of the 10 cross-validation folds.)
#Write code
scores

array([0.82089552, 0.74626866, 0.74626866, 0.80597015, 0.88059701,
       0.8358209 , 0.76119403, 0.82089552, 0.74242424, 0.72727273])
#Write code
print("Average cross-validation score: {:2f}".format(scores.mean()))

Average cross-validation score: 0.788761

Thinking 4

  • What is the impact of using a larger k?
#Think and answer
#A larger k means more models to train, so the evaluation takes longer; each fold's training set is also larger, which generally gives a more reliable estimate (see the timing sketch below).
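A rough way to see the time cost is to time cross-validation at a few different fold counts; the fold values and max_iter below are illustrative choices, not part of the original notebook.

#Sketch: larger k takes longer
import time
from sklearn.model_selection import cross_val_score

for k in (5, 10, 20):
    start = time.time()
    k_scores = cross_val_score(LogisticRegression(max_iter=1000), X_train, y_train, cv=k)
    print("k={:2d}: mean score={:.3f}, time={:.2f}s".format(k, k_scores.mean(), time.time() - start))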

Task 2: confusion matrix

  • Compute the confusion matrix for a binary classification problem
  • Compute precision, recall and the F-score

[Thinking] What is the confusion matrix of a binary classification problem? Understand the concept and know what it is mainly used to evaluate.

#Think and answer
#In a binary classification problem (0, 1), the model generally outputs a probability value, i.e. the probability that the result is 1. A threshold must then be chosen: if the model's output probability exceeds the threshold, the sample is classified as 1; if it is below the threshold, it is classified as 0.
#Different thresholds lead to different classification results, i.e. different confusion matrices and therefore different FPR and TPR values.
#As the threshold moves gradually from 0 to 1, many (FPR, TPR) pairs are produced; plotting them on a coordinate system gives the ROC curve.

#Tip: confusion matrix
Image('Snipaste_2020-01-05_16-38-26.png')

[Figure: confusion matrix (image not rendered)]

#Tips: accuracy, Precision, Recall,f-score calculation method
Image('Snipaste_2020-01-05_16-39-27.png')

[Figure: accuracy, precision, recall and F-score calculation formulas (image not rendered)]
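Since the image is not available, the standard formulas are reproduced here for reference (TP, FP, TN, FN are the counts of true/false positives and negatives):

$$
\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN},\qquad
\text{precision} = \frac{TP}{TP + FP},\qquad
\text{recall} = \frac{TP}{TP + FN},\qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
$$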

Tip 5

  • The confusion matrix function is in the sklearn.metrics module
  • The confusion matrix requires the true labels and the predicted labels as input
  • Precision, recall and the F-score can be obtained with the classification_report function
#Write code
from sklearn.metrics import confusion_matrix
#Write code
lr = LogisticRegression(C=100)
lr.fit(X_train, y_train)
E:\program\python\program data\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(





LogisticRegression(C=100)
#Write code
pred = lr.predict(X_train)

#Write code
confusion_matrix(y_train,pred)

array([[355,  57],
       [ 82, 174]], dtype=int64)
from sklearn.metrics import classification_report
print(classification_report(y_train,pred))
              precision    recall  f1-score   support

           0       0.81      0.86      0.84       412
           1       0.75      0.68      0.71       256

    accuracy                           0.79       668
   macro avg       0.78      0.77      0.78       668
weighted avg       0.79      0.79      0.79       668

[thinking]

  • What should we pay attention to when computing the confusion matrix?
#Think and answer
#According to Baidu Baike's definition, the confusion matrix is also called an error matrix; it is a standard format for accuracy evaluation and is usually written as an n x n matrix. In practice the argument order also matters: the true labels come first and the predictions second (see the sketch below).
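A small sketch of the points to watch: the argument order is confusion_matrix(y_true, y_pred), rows are the true labels and columns the predictions, and the label order can be pinned down with the labels argument. The use of ravel() below is only valid for the binary case; the sketch is added here for illustration.

#Sketch: reading the confusion matrix safely
cm = confusion_matrix(y_train, pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()     # binary case: TN, FP, FN, TP in row-major order
print(cm)
print("TN, FP, FN, TP =", tn, fp, fn, tp)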

Task 3: ROC curve

  • Draw the ROC curve

[Thinking] What is the ROC curve, and what problem does it exist to solve?

#reflection
#(1) The ROC curve makes it easy to see a classifier's ability to separate the samples at any given threshold.
#(2) The ROC curve can be used to choose the best cut-off value for a diagnostic method. The closer the curve is to the upper-left corner, the higher the TPR and the lower the FPR, i.e. the higher the sensitivity and the lower the false-alarm rate, so the better the method performs. The point on the curve closest to the upper-left corner has the largest sum of sensitivity and specificity, and this point (or a point near it) is often used as the diagnostic reference value; such points are called optimal cut-off points and the values at them optimal cut-off values.
#(3) In a binary classification problem (0, 1), the model generally outputs a probability value, i.e. the probability that the result is 1. A threshold must then be chosen: if the model's output probability exceeds the threshold, the sample is classified as 1; if it is below the threshold, it is classified as 0.

Tip 6

  • The ROC curve utilities in sklearn are in the sklearn.metrics module
  • The larger the area under the ROC curve (the AUC), the better
#Write code
from sklearn.metrics import roc_curve
#Write code
fpr, tpr, thresholds = roc_curve(y_test, lr.decision_function(X_test))
plt.plot(fpr, tpr, label="ROC Curve")
plt.xlabel("FPR")
plt.ylabel("TPR(recall)")
close_zero = np.argmin(np.abs(thresholds))
plt.plot(fpr[close_zero], tpr[close_zero], 'o', markersize=10, label="threshold zero", fillstyle="none", c='k', mew=2)
plt.legend(loc=4)
<matplotlib.legend.Legend at 0x16b201b60d0>

[Figure: ROC curve with the threshold-zero point marked (image not rendered)]
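The area under the curve mentioned in Tip 6 can be computed directly from the same scores used for the plot; this one-line sketch is added here for illustration.

#Sketch: area under the ROC curve
from sklearn.metrics import roc_auc_score
print("AUC: {:.3f}".format(roc_auc_score(y_test, lr.decision_function(X_test))))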

Topics: Python Machine Learning Data Analysis