reference material: https://gitee.com/datawhalechina/hands-on-data-analysis
Chapter 3 model building and evaluation - Modeling
After working through the previous two chapters, we can handle the data itself: adding, deleting, querying and filling in values, and doing the necessary cleaning. In this chapter we put that processed data to use. The purpose of data analysis is to combine the data with the business context to obtain the results we need. The first step is modeling: building a prediction model or some other kind of model. Once the model produces results, we need to check whether it is reliable enough, which means evaluating it. This section covers modeling; the next section covers evaluation.
We have the Titanic data set, so our goal this time is to predict which Titanic passengers survived.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Image
%matplotlib inline
plt.rcParams['font.sans-serif'] = ['SimHei']  # Display Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False  # Display minus signs correctly
plt.rcParams['figure.figsize'] = (10, 6)  # Set the output figure size
Load these libraries; if any are missing, install them first.
[thinking] What does each of these libraries do? Look them up.
Load the cleaned data we provide (clear_data.csv) and also the original data (train.csv), then describe how they differ.
#Write code
train = pd.read_csv('train.csv')
train.shape
(891, 12)
train.head()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
#Write code
data = pd.read_csv('clear_data.csv')
data.head()
| | PassengerId | Pclass | Age | SibSp | Parch | Fare | Sex_female | Sex_male | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 22.0 | 1 | 0 | 7.2500 | 0 | 1 | 0 | 0 | 1 |
| 1 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 1 | 0 | 1 | 0 | 0 |
| 2 | 2 | 3 | 26.0 | 0 | 0 | 7.9250 | 1 | 0 | 0 | 0 | 1 |
| 3 | 3 | 1 | 35.0 | 1 | 0 | 53.1000 | 1 | 0 | 0 | 0 | 1 |
| 4 | 4 | 3 | 35.0 | 0 | 0 | 8.0500 | 0 | 1 | 0 | 0 | 1 |
Model building
- After the earlier processing steps we have data that is ready for modeling. The next step is to choose a suitable model.
- Before choosing a model, we need to know whether the task is supervised or unsupervised learning.
- The choice of model is determined first of all by the task.
- Beyond the task, the choice can also depend on the sample size and on how sparse the features are.
- We usually start with a basic model as a baseline, then train other models for comparison, and finally choose the one with the better generalization ability or performance.
Here we do not build models from scratch or write all the code ourselves. Instead we use sklearn, one of the most commonly used machine learning libraries, to build our models.
The sklearn algorithm-selection flowchart is given below for reference.
# sklearn algorithm-selection flowchart
Image('sklearn.png')
[thinking] Which differences between data sets change the model that should be used to fit the data?
Task 1: split the training set and test set
Split the data set using the hold-out method
- Separate the data set into independent variables and the dependent variable
- Split off the training and test sets by proportion (typical test set proportions are 30%, 25%, 20%, 15% and 10%)
- Use stratified sampling
- Set a random seed so that the results can be reproduced
[thinking]
- What are the methods of dividing data sets?
- Why use stratified sampling? What are the benefits?
Task tip 1
- The purpose of splitting the data set is to evaluate the generalization ability of the model
- The function for splitting data sets in sklearn is train_test_split
- To view the function documentation in Jupyter Notebook, type train_test_split? and run the cell
- Stratification and the random seed are both set through parameters (stratify and random_state)
Extract the inputs that train_test_split() requires from clear_data.csv and train.csv
#Write code
from sklearn.model_selection import train_test_split
#Write code
X = data
y = train['Survived']
#Write code
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
#Write code
X_train.shape, X_test.shape
((668, 11), (223, 11))
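To see what stratify does without needing the CSV files, here is a minimal sketch on synthetic data. The feature matrix is random noise and the class counts (342 positive, 549 negative) are an assumption chosen to resemble the Titanic training set; only train_test_split itself comes from the text above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption): 891 rows, 3 noise features.
rng = np.random.default_rng(0)
X = rng.normal(size=(891, 3))
y = np.array([1] * 342 + [0] * 549)

# test_size defaults to 0.25; it is spelled out here for clarity.
# stratify=y keeps the share of each class roughly equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

print(X_train.shape, X_test.shape)
print("positive rate (all / train / test):",
      round(y.mean(), 3), round(y_train.mean(), 3), round(y_test.mean(), 3))
```

Because random_state is fixed, running the cell again gives the same split, which is what makes the later scores reproducible.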
Task 2: model creation
- Create a classification model based on a linear model (logistic regression)
- Create tree-based classification models (decision tree, random forest)
- Train each model and obtain its score on the training set and on the test set
- View the model's parameters, change their values, and observe how the model changes
Tips
- Logistic regression is a classification model, not a regression model; do not confuse it with linear regression
- A random forest is an ensemble of decision trees, designed to reduce the overfitting of a single decision tree
- Linear models are in the sklearn.linear_model module
- Random forests are in the sklearn.ensemble module (single decision trees are in sklearn.tree)
#Write code
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
#Write code
lr = LogisticRegression()
lr.fit(X_train, y_train)
E:\program\python\program data\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1): STOP: TOTAL NO. of ITERATIONS REACHED LIMIT. Increase the number of iterations (max_iter) or scale the data as shown in: https://scikit-learn.org/stable/modules/preprocessing.html Please also refer to the documentation for alternative solver options: https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression n_iter_i = _check_optimize_result( LogisticRegression()
#Write code
print("Training set score: {:.2f}".format(lr.score(X_train, y_train)))
print("Testing set score: {:.2f}".format(lr.score(X_test, y_test)))
Training set score: 0.80
Testing set score: 0.79
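The ConvergenceWarning printed during fitting suggests two remedies: raise max_iter or scale the data. The sketch below tries both on synthetic data; the data set, the inflated first feature and the value max_iter=5000 are illustrative assumptions, not settings taken from this tutorial.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption); one feature is blown up to mimic
# unscaled columns, which tend to slow the lbfgs solver down.
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X[:, 0] *= 1000

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Option 1: allow the solver more iterations (the default is 100).
lr_more_iter = LogisticRegression(max_iter=5000).fit(X_train, y_train)

# Option 2: standardize the features first, then fit with default settings.
lr_scaled = make_pipeline(StandardScaler(), LogisticRegression())
lr_scaled.fit(X_train, y_train)

print("more iterations:", round(lr_more_iter.score(X_test, y_test), 2))
print("scaled features:", round(lr_scaled.score(X_test, y_test), 2))
# n_iter_ reports how many iterations the solver actually used.
print("iterations after scaling:",
      lr_scaled.named_steps["logisticregression"].n_iter_[0])
```

Putting the scaler inside a pipeline means it is fitted on the training data only, so no information from the test set leaks into the model.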
#Write code
lr2 = LogisticRegression(C=100)
lr2.fit(X_train, y_train)
E:\program\python\program data\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1): STOP: TOTAL NO. of ITERATIONS REACHED LIMIT. Increase the number of iterations (max_iter) or scale the data as shown in: https://scikit-learn.org/stable/modules/preprocessing.html Please also refer to the documentation for alternative solver options: https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression n_iter_i = _check_optimize_result( LogisticRegression(C=100)
print("Training set score: {:.2f}".format(lr2.score(X_train, y_train)))
print("Testing set score: {:.2f}".format(lr2.score(X_test, y_test)))
Training set score: 0.79
Testing set score: 0.78
#Random forest classification model with default parameters
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
RandomForestClassifier()
print("Training set score: {:.2f}".format(rfc.score(X_train, y_train)))
print("Testing set score: {:.2f}".format(rfc.score(X_test, y_test)))
Training set score: 1.00
Testing set score: 0.81
#Random forest classification model after adjusting parameters
rfc2 = RandomForestClassifier(n_estimators=100, max_depth=5)
rfc2.fit(X_train, y_train)
RandomForestClassifier(max_depth=5)
print("Training set score: {:.2f}".format(rfc2.score(X_train, y_train)))
print("Testing set score: {:.2f}".format(rfc2.score(X_test, y_test)))
Training set score: 0.87
Testing set score: 0.81
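The scores above show the usual pattern: limiting max_depth lowers the training score while the test score barely moves. A small sketch of that comparison on synthetic data follows; the data set and the depth values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, slightly noisy data (assumption) so that overfitting is visible.
X, y = make_classification(
    n_samples=800, n_features=10, n_informative=4, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

results = {}
for depth in (2, 5, None):  # None lets every tree grow until its leaves are pure
    rfc = RandomForestClassifier(
        n_estimators=100, max_depth=depth, random_state=0
    )
    rfc.fit(X_train, y_train)
    results[depth] = (rfc.score(X_train, y_train), rfc.score(X_test, y_test))
    print(f"max_depth={str(depth):>4}  "
          f"train={results[depth][0]:.2f}  test={results[depth][1]:.2f}")
```

A large gap between the training and test scores is the sign of overfitting; a shallower forest trades some training accuracy for a smaller gap.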
Task 3: output model prediction results
- Output the predicted class labels
- Output the predicted probability of each class label
Tip 3
- Supervised models in sklearn generally have a predict method, which outputs predicted labels, and a predict_proba method, which outputs label probabilities
#Write code
pred = lr.predict(X_train)
pred[:10]
array([0, 1, 1, 1, 0, 0, 1, 0, 1, 1], dtype=int64)
#Write code
pred_proba = lr.predict_proba(X_train)
pred_proba[:10]
array([[0.60890433, 0.39109567],
       [0.17653843, 0.82346157],
       [0.40595794, 0.59404206],
       [0.18889606, 0.81110394],
       [0.87987837, 0.12012163],
       [0.91389097, 0.08610903],
       [0.13279757, 0.86720243],
       [0.90556748, 0.09443252],
       [0.05280108, 0.94719892],
       [0.10934326, 0.89065674]])
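The two outputs are linked: each row of predict_proba sums to 1, and predict returns the class with the larger probability. The sketch below checks this on synthetic data and also applies a stricter cut-off; the data set and the 0.7 threshold are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data (assumption).
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
clf = LogisticRegression().fit(X, y)

proba = clf.predict_proba(X)   # shape (n_samples, 2); columns follow clf.classes_
labels = clf.predict(X)

# Picking the most probable class reproduces predict().
from_proba = clf.classes_[np.argmax(proba, axis=1)]

# A stricter, hand-picked threshold on the probability of class 1.
strict = (proba[:, 1] >= 0.7).astype(int)

print("rows sum to 1:", np.allclose(proba.sum(axis=1), 1.0))
print("argmax matches predict:", np.array_equal(labels, from_proba))
print("positives at 0.5 vs 0.7:", labels.sum(), strict.sum())
```

Raising the threshold can only reduce the number of samples labelled 1, which is why the second count is never larger than the first.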
Chapter 3 model building and evaluation - Evaluation
In the previous section on modeling we learned how to use the sklearn library to build models and how to split the data set. But how do we know whether a model works well enough that we can trust the results it gives us? That is what this section on evaluation addresses.
Load the following libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Image
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
%matplotlib inline
plt.rcParams['font.sans-serif'] = ['SimHei']  # Display Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False  # Display minus signs correctly
plt.rcParams['figure.figsize'] = (10, 6)  # Set the output figure size
Task: load data and split test set and training set
#Write code
from sklearn.model_selection import train_test_split
#Write code
data = pd.read_csv('clear_data.csv')
train = pd.read_csv('train.csv')
X = data
y = train['Survived']
#Write code
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
#Write code
lr = LogisticRegression()
lr.fit(X_train, y_train)
E:\program\python\program data\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1): STOP: TOTAL NO. of ITERATIONS REACHED LIMIT. Increase the number of iterations (max_iter) or scale the data as shown in: https://scikit-learn.org/stable/modules/preprocessing.html Please also refer to the documentation for alternative solver options: https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression n_iter_i = _check_optimize_result( LogisticRegression()
Model evaluation
- Model evaluation tells us how well the model generalizes.
- Cross-validation is a statistical method for evaluating generalization performance. It is more stable and thorough than a single split into training and test sets.
- In cross-validation, the data is split multiple times and multiple models are trained.
- The most commonly used form is k-fold cross-validation, where k is a number chosen by the user, usually 5 or 10.
- Precision measures how many of the samples predicted as positive are truly positive
- Recall measures how many of the positive samples are predicted as positive
- The f-score is the harmonic mean of precision and recall
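The last three definitions can be written compactly in terms of confusion-matrix counts. The symbols TP, FP and FN (true positives, false positives, false negatives) are notation introduced here for convenience; they do not appear in the text above.

```latex
\begin{align}
\text{precision} &= \frac{TP}{TP + FP} \\
\text{recall} &= \frac{TP}{TP + FN} \\
F_1 &= \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
\end{align}
```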
Task 1: cross validation
- Use 10-fold cross-validation to evaluate the earlier logistic regression model
- Calculate the average cross-validation accuracy
#Tip: cross validation
Image('Snipaste_2020-01-05_16-37-56.png')
Tip 4
- Cross-validation in sklearn is in the sklearn.model_selection module
#Write code
from sklearn.model_selection import cross_val_score
#Write code
lr = LogisticRegression(C=100)
scores = cross_val_score(lr, X_train, y_train, cv=10)
E:\program\python\program data\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1): STOP: TOTAL NO. of ITERATIONS REACHED LIMIT. Increase the number of iterations (max_iter) or scale the data as shown in: https://scikit-learn.org/stable/modules/preprocessing.html Please also refer to the documentation for alternative solver options: https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression n_iter_i = _check_optimize_result(
#Write code
scores
array([0.82089552, 0.74626866, 0.74626866, 0.80597015, 0.88059701,
       0.8358209 , 0.76119403, 0.82089552, 0.74242424, 0.72727273])
#Write code
print("Average cross-validation score: {:2f}".format(scores.mean()))
Average cross-validation score: 0.788761
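To make clear what cross_val_score does internally, the sketch below repeats the 10-fold procedure by hand and compares the two results. It runs on synthetic data, and the explicit StratifiedKFold loop is an illustration rather than part of the original exercise; it relies on the fact that, for a classifier and an integer cv, sklearn uses stratified folds without shuffling.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary data (assumption).
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
model = LogisticRegression(max_iter=1000)

# One call: 10 folds, one accuracy score per fold.
auto_scores = cross_val_score(model, X, y, cv=10)

# The same procedure written out: train on 9 folds, score on the held-out fold.
manual_scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=10).split(X, y):
    fold_model = clone(model).fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_model.score(X[val_idx], y[val_idx]))
manual_scores = np.array(manual_scores)

print("scores match:", np.allclose(auto_scores, manual_scores))
print("mean accuracy: {:.3f}".format(auto_scores.mean()))
```

The loop also shows where the cost comes from: k folds mean k separate model fits.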
Thinking 4
- What is the effect of using a larger k?
#Think and answer
#A larger k means more models have to be trained, so the evaluation takes longer
Task 2: confusion matrix
- Calculate the confusion matrix for the binary classification problem
- Calculate precision, recall and f-score
[thinking] What is the confusion matrix of a binary classification problem? Understand the concept and know what it is mainly used to calculate
#Think and answer
#In a binary classification problem (0, 1), the model's final output is usually a probability: the probability that the result is 1. A threshold then has to be chosen. If the output probability exceeds the threshold, the sample is classified as 1; if it is below the threshold, the sample is classified as 0. Different thresholds lead to different classification results, so the confusion matrix differs, and with it the FPR and TPR. As the threshold moves gradually from 0 to 1, it produces many (FPR, TPR) pairs; plotting these pairs on a coordinate system gives the ROC curve.
#Tip: confusion matrix
Image('Snipaste_2020-01-05_16-38-26.png')
#Tips: accuracy, precision, recall and f-score calculation methods
Image('Snipaste_2020-01-05_16-39-27.png')
Tip 5
- The confusion matrix function is in the sklearn.metrics module
- The confusion matrix requires the true labels and the predicted labels as input
- For precision, recall and f-score you can use classification_report
#Write code
from sklearn.metrics import confusion_matrix
#Write code
lr = LogisticRegression(C=100)
lr.fit(X_train, y_train)
E:\program\python\program data\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1): STOP: TOTAL NO. of ITERATIONS REACHED LIMIT. Increase the number of iterations (max_iter) or scale the data as shown in: https://scikit-learn.org/stable/modules/preprocessing.html Please also refer to the documentation for alternative solver options: https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression n_iter_i = _check_optimize_result( LogisticRegression(C=100)
#Write code
pred = lr.predict(X_train)
#Write code
confusion_matrix(y_train, pred)
array([[355,  57],
       [ 82, 174]], dtype=int64)
from sklearn.metrics import classification_report
print(classification_report(y_train, pred))
              precision    recall  f1-score   support

           0       0.81      0.86      0.84       412
           1       0.75      0.68      0.71       256

    accuracy                           0.79       668
   macro avg       0.78      0.77      0.78       668
weighted avg       0.79      0.79      0.79       668
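The numbers in the report can be recomputed by hand from the confusion matrix printed above. The sketch below does this for class 1; the variable names tn, fp, fn and tp are introduced here, and the layout (rows are true labels, columns are predicted labels) follows sklearn's confusion_matrix convention.

```python
import numpy as np

# Confusion matrix copied from the output above.
# Row = true label, column = predicted label.
cm = np.array([[355, 57],
               [82, 174]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()      # all correct predictions over all samples
precision = tp / (tp + fp)           # of those predicted 1, how many are truly 1
recall = tp / (tp + fn)              # of those truly 1, how many were predicted 1
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
# prints: accuracy=0.79 precision=0.75 recall=0.68 f1=0.71
```

These values agree with the row for class 1 and the accuracy line in the classification report.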
[thinking]
- What should I pay attention to when implementing the confusion matrix
#Think and answer
#According to the Baidu Encyclopedia definition, the confusion matrix is also called an error matrix. It is a standard format for accuracy evaluation and is usually expressed as a matrix with n rows and n columns.
Task 3: ROC curve
- Draw the ROC curve
[thinking] What is the ROC curve? What problem does the ROC curve exist to solve?
#reflection
#(1) The ROC curve makes it easy to see how well a classifier recognizes samples at a given threshold.
#(2) The ROC curve can be used to choose the best diagnostic cut-off value for a diagnostic method. The closer the ROC curve is to the upper left corner, the higher the TPR and the lower the FPR of the test, that is, the higher the sensitivity and the lower the misjudgment rate, and the better the performance of the diagnostic method. The point on the ROC curve closest to the upper left corner has the largest sum of sensitivity and specificity. This point or a neighbouring point is often used as the diagnostic reference value; such points are called optimal critical points, and the values at them are called optimal critical values.
#(3) In a binary classification problem (0, 1), the model's final output is usually a probability: the probability that the result is 1. A threshold then has to be chosen. If the output probability exceeds the threshold, the sample is classified as 1; if it is below the threshold, the sample is classified as 0.
Tip 6
- The ROC curve function in sklearn is in the sklearn.metrics module
- The larger the area under the ROC curve, the better
#Write code
from sklearn.metrics import roc_curve
#Write code
fpr, tpr, thresholds = roc_curve(y_test, lr.decision_function(X_test))
plt.plot(fpr, tpr, label="ROC Curve")
plt.xlabel("FPR")
plt.ylabel("TPR(recall)")
# Mark the point whose threshold is closest to zero
close_zero = np.argmin(np.abs(thresholds))
plt.plot(fpr[close_zero], tpr[close_zero], 'o', markersize=10, label="threshold zero", fillstyle="none", c='k', mew=2)
plt.legend(loc=4)
<matplotlib.legend.Legend at 0x16b201b60d0>
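The area under the ROC curve and the point with the largest sum of sensitivity and specificity can both be computed without plotting. The following is a sketch on synthetic data; the data set is an assumption, and choosing the point by maximising TPR minus FPR is one common way to express "largest sensitivity plus specificity", not a step from the original exercise.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary data (assumption).
X, y = make_classification(n_samples=600, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Continuous scores are needed for a ROC curve, not hard 0/1 labels.
scores = clf.decision_function(X_test)
fpr, tpr, thresholds = roc_curve(y_test, scores)

# Two equivalent ways to get the area under the curve.
auc_from_curve = auc(fpr, tpr)
auc_direct = roc_auc_score(y_test, scores)

# Sensitivity + specificity = TPR + (1 - FPR), so maximise TPR - FPR.
best = np.argmax(tpr - fpr)

print("AUC: {:.3f}".format(auc_direct))
print("best point: FPR={:.2f}, TPR={:.2f}, threshold={:.2f}".format(
    fpr[best], tpr[best], thresholds[best]))
```

An AUC of 0.5 corresponds to random guessing and 1.0 to a perfect ranking, so values closer to 1 indicate a better classifier.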