Hands-on Learning Data Analysis: Task 05

Posted by swallace on Sun, 19 Dec 2021 23:12:57 +0100

reference material: https://gitee.com/datawhalechina/hands-on-data-analysis

Chapter 3: Model Building and Evaluation - Modeling

With the knowledge from the previous two chapters, we can already work with the data itself (adding, deleting, querying and filling values) and do the necessary cleaning. In this chapter we start to actually use that data. The purpose of data analysis is to combine our data with the business problem and obtain the answers we need, and the first step of that analysis is modeling: building a prediction model or some other model. Once we have the model's results, we need to judge whether the model is reliable enough, so we also have to evaluate it. In this section we study modeling; in the next, evaluation.

We have the Titanic data set, so our goal this time is to complete the task of predicting survival on the Titanic.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Image
%matplotlib inline
plt.rcParams['font.sans-serif'] = ['SimHei']  # Used to display Chinese labels normally
plt.rcParams['axes.unicode_minus'] = False  # Used to display negative signs normally
plt.rcParams['figure.figsize'] = (10, 6)  # Set output picture size

Load these libraries; if any are missing, install them first.

[Thinking] What is each of these libraries for? Look them up if you are not sure.

%matplotlib inline

Load the cleaned data we provide (clear_data.csv) as well as the original data (train.csv), and describe how they differ.

#Write code
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
plt.rcParams['figure.figsize'] = (10,6)
#Write code
train=pd.read_csv('train.csv')
train.shape
(891, 12)
train.head()
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S
#Write code
data = pd.read_csv('clear_data.csv')
data.head()

   PassengerId  Pclass   Age  SibSp  Parch     Fare  Sex_female  Sex_male  Embarked_C  Embarked_Q  Embarked_S
0            0       3  22.0      1      0   7.2500           0         1           0           0           1
1            1       1  38.0      1      0  71.2833           1         0           1           0           0
2            2       3  26.0      0      0   7.9250           1         0           0           0           1
3            3       1  35.0      1      0  53.1000           1         0           0           0           1
4            4       3  35.0      0      0   8.0500           0         1           0           0           1

Model building

  • After the processing done earlier, we have data ready for modeling; the next step is to choose an appropriate model
  • Before choosing a model, we need to know whether the task is supervised or unsupervised learning
  • The choice of model is determined first of all by our task
  • Besides the task, the choice can also depend on the sample size and the sparsity of the features
  • We usually start with a simple model as a baseline, then train other models for comparison, and finally choose the one with better generalization ability or performance

We do not build the model from scratch by writing all the code ourselves; instead we use sklearn, the most commonly used machine learning library, to build it.

The sklearn algorithm selection path (cheat sheet) is shown below for reference.

# sklearn model algorithm selection path graph
Image('sklearn.png')

[Figure: sklearn model algorithm selection path diagram (image not rendered)]

[Thinking] What differences between data sets will change how a model fits the data?

Task 1: split the training set and test set

The data set is divided using the hold-out method:

  • Separate the data set into independent variables (features) and the dependent variable (label)
  • Split the training set and test set by proportion (common test set proportions are 30%, 25%, 20%, 15% and 10%)
  • Use stratified sampling
  • Set a random seed so that the results can be reproduced

[thinking]

  • What are the methods of dividing data sets?
  • Why use stratified sampling? What are the benefits?
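A minimal sketch to see the benefit of stratified sampling, assuming clear_data.csv and train.csv have already been loaded as data and train (as in the cells above); it compares the positive-class proportion produced by a plain random split and by a stratified split. This is not part of the original notebook.

#Sketch: plain split vs stratified split
from sklearn.model_selection import train_test_split

X = data
y = train['Survived']

# Plain random split: the class proportions can drift between train and test
_, _, y_tr_plain, y_te_plain = train_test_split(X, y, test_size=0.25, random_state=0)

# Stratified split: the class proportions are preserved in both subsets
_, _, y_tr_strat, y_te_strat = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)

print("overall positive rate :", round(y.mean(), 3))
print("plain split train/test:", round(y_tr_plain.mean(), 3), round(y_te_plain.mean(), 3))
print("stratified  train/test:", round(y_tr_strat.mean(), 3), round(y_te_strat.mean(), 3))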

Task tip 1

  • The purpose of splitting the data set is to evaluate the generalization ability of the model
  • The sklearn function for splitting data sets is train_test_split
  • To view the function documentation, run train_test_split? in a Jupyter notebook and press Enter
  • Stratification and the random seed are controlled through its parameters

Extract the parameters needed by train_test_split() from clear_data.csv and train.csv.

#Write code
from sklearn.model_selection import train_test_split
#Write code
X = data
y = train['Survived']
#Write code
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
#Write code
X_train.shape, X_test.shape


((668, 11), (223, 11))

Task 2: model creation

  • Create a classification model based on a linear model (logistic regression)
  • Create a tree-based classification model (decision tree, random forest)
  • Train each of these models and obtain their scores on the training set and test set
  • View the model's parameters, change their values, and observe how the model changes

Tips

  • Logistic regression is not a regression model but a classification model; it should not be confused with linear regression
  • A random forest is an ensemble of decision trees, built to reduce the overfitting of a single decision tree
  • The linear models live in the sklearn.linear_model module
  • The tree-ensemble models live in the sklearn.ensemble module
#Write code
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
#Write code
lr = LogisticRegression()
lr.fit(X_train, y_train)
E:\program\python\program data\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(





LogisticRegression()
#Write code
print("Training set score: {:.2f}".format(lr.score(X_train,y_train)))
print("Testing set score: {:.2f}".format(lr.score(X_test, y_test)))

Training set score: 0.80
Testing set score: 0.79
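The ConvergenceWarning printed above only means that the lbfgs solver stopped before fully converging. As the warning message itself suggests, you can either raise max_iter or scale the features; a small sketch of both options follows (the max_iter value and the use of a Pipeline are illustrative choices, not part of the original notebook).

#Sketch: two ways to avoid the ConvergenceWarning
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Option 1: allow more iterations
lr_more_iter = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Option 2: standardize the features so lbfgs converges faster
lr_scaled = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)

print("more iterations :", round(lr_more_iter.score(X_test, y_test), 2))
print("scaled features :", round(lr_scaled.score(X_test, y_test), 2))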
#Write code
lr2 = LogisticRegression(C=100)
lr2.fit(X_train,y_train)

E:\program\python\program data\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(





LogisticRegression(C=100)
print("Training set score: {:.2f}".format(lr2.score(X_train,y_train)))
print("Testing set score: {:.2f}".format(lr2.score(X_test, y_test)))
Training set score: 0.79
Testing set score: 0.78
#Random forest classification model with default parameters
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
RandomForestClassifier()
print("Training set score: {:.2f}".format(rfc.score(X_train,y_train)))
print("Testing set score: {:.2f}".format(rfc.score(X_test, y_test)))
Training set score: 1.00
Testing set score: 0.81
#Random forest classification model after adjusting parameters
rfc2 = RandomForestClassifier(n_estimators=100,max_depth=5)
rfc2.fit(X_train, y_train)
RandomForestClassifier(max_depth=5)
print("Training set score: {:.2f}".format(rfc2.score(X_train,y_train)))
print("Testing set score: {:.2f}".format(rfc2.score(X_test, y_test)))
Training set score: 0.87
Testing set score: 0.81
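For the step "view the parameters, change their values, and observe how the model changes", a quick sketch is to print the current parameters and sweep one of them; the depth values below are arbitrary examples, not part of the original notebook.

#Sketch: inspect parameters and sweep max_depth
print(rfc2.get_params())

for depth in [3, 5, 7, None]:
    model = RandomForestClassifier(n_estimators=100, max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    print("max_depth={}: train={:.2f}, test={:.2f}".format(
        depth, model.score(X_train, y_train), model.score(X_test, y_test)))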

Task 3: output model prediction results

  • Output the model's predicted class labels
  • Output the predicted probability of each class label

Tip 3

  • In sklearn, most supervised models have a predict method that outputs the predicted labels and a predict_proba method that outputs the label probabilities
#Write code
pred = lr.predict(X_train)
pred[:10]

array([0, 1, 1, 1, 0, 0, 1, 0, 1, 1], dtype=int64)
#Write code
pred_proba = lr.predict_proba(X_train)
pred_proba[:10]

array([[0.60890433, 0.39109567],
       [0.17653843, 0.82346157],
       [0.40595794, 0.59404206],
       [0.18889606, 0.81110394],
       [0.87987837, 0.12012163],
       [0.91389097, 0.08610903],
       [0.13279757, 0.86720243],
       [0.90556748, 0.09443252],
       [0.05280108, 0.94719892],
       [0.10934326, 0.89065674]])
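A quick sanity check ties the two outputs together: the columns of predict_proba follow lr.classes_, each row sums to 1, and predict() simply picks the class with the larger probability. This check is a sketch added here, not part of the original notebook.

#Sketch: relationship between predict and predict_proba
print(lr.classes_)                                   # column order of predict_proba
print(np.allclose(pred_proba.sum(axis=1), 1))        # each row sums to 1
print((pred_proba.argmax(axis=1) == pred).all())     # predict() = argmax of predict_proba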

Chapter 3: Model Building and Evaluation - Evaluation

From the modeling part above we know how to use the sklearn library to build a model, and how to split the data set. But how do we know whether a model actually works, so that we can use its results with confidence? That is what model evaluation, studied in this section, is for.

Load the following libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Image
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
%matplotlib inline
plt.rcParams['font.sans-serif'] = ['SimHei']  # Used to display Chinese labels normally
plt.rcParams['axes.unicode_minus'] = False  # Used to display negative signs normally
plt.rcParams['figure.figsize'] = (10, 6)  # Set output picture size

Task: load data and split test set and training set

#Write code
from sklearn.model_selection import train_test_split
#Write code
data = pd.read_csv('clear_data.csv')
train = pd.read_csv('train.csv')
X = data
y = train['Survived']
#Write code
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

#Write code
lr = LogisticRegression()
lr.fit(X_train,y_train)
E:\program\python\program data\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(





LogisticRegression()

Model evaluation

  • Model evaluation is done to understand the generalization ability of the model.
  • Cross-validation is a statistical method for evaluating generalization performance; it is more stable and thorough than a single split into a training set and test set.
  • In cross-validation the data is split multiple times and multiple models are trained (see the toy sketch after this list).
  • The most commonly used form is k-fold cross-validation, where k is a number specified by the user, usually 5 or 10.
  • Precision measures how many of the samples predicted as positive are actually positive.
  • Recall measures how many of the truly positive samples are predicted as positive.
  • The F-score is the harmonic mean of precision and recall.
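To make the "split multiple times" idea concrete, here is a toy sketch of 5-fold splitting on ten samples (the numbers are only for illustration): each sample lands in the test fold exactly once across the five rounds.

#Sketch: how k-fold splits the indices
import numpy as np
from sklearn.model_selection import KFold

toy = np.arange(10)
for i, (train_idx, test_idx) in enumerate(KFold(n_splits=5).split(toy)):
    print("fold {}: train={}, test={}".format(i, train_idx, test_idx))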

Task 1: cross validation

  • Evaluate the previous logistic regression model with 10-fold cross-validation
  • Compute the mean cross-validation accuracy
#Tip: cross validation
Image('Snipaste_2020-01-05_16-37-56.png')

[Figure: cross-validation illustration (image not rendered)]

Tip 4

  • The cross-validation utilities in sklearn are in the sklearn.model_selection module
#Write code
from sklearn.model_selection import cross_val_score

#Write code
lr = LogisticRegression(C=100)
scores = cross_val_score(lr, X_train, y_train, cv=10)
E:\program\python\program data\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
(The same ConvergenceWarning is printed once for each of the 10 cross-validation folds.)
#Write code
scores

array([0.82089552, 0.74626866, 0.74626866, 0.80597015, 0.88059701,
       0.8358209 , 0.76119403, 0.82089552, 0.74242424, 0.72727273])
#Write code
print("Average cross-validation score: {:2f}".format(scores.mean()))

Average cross-validation score: 0.788761

Thinking 4

  • What is the impact of using a larger k?
#Think and answer
#A larger k means more models to train, so the evaluation takes longer; each fold's training set is also larger, which generally gives a more reliable estimate (see the timing sketch below).
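A rough way to see the time cost is to time cross-validation at a few different fold counts; the fold values and max_iter below are illustrative choices, not part of the original notebook.

#Sketch: larger k takes longer
import time
from sklearn.model_selection import cross_val_score

for k in (5, 10, 20):
    start = time.time()
    k_scores = cross_val_score(LogisticRegression(max_iter=1000), X_train, y_train, cv=k)
    print("k={:2d}: mean score={:.3f}, time={:.2f}s".format(k, k_scores.mean(), time.time() - start))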

Task 2: confusion matrix

  • Compute the confusion matrix for a binary classification problem
  • Compute precision, recall and the F-score

[Thinking] What is the confusion matrix of a binary classification problem? Understand the concept and know what it is mainly used to evaluate.

#Think and answer
#In a binary classification problem (0, 1), the model generally outputs a probability value, i.e. the probability that the result is 1. A threshold must then be chosen: if the model's output probability exceeds the threshold, the sample is classified as 1; if it is below the threshold, it is classified as 0.
#Different thresholds lead to different classification results, i.e. different confusion matrices and therefore different FPR and TPR values.
#As the threshold moves gradually from 0 to 1, many (FPR, TPR) pairs are produced; plotting them on a coordinate system gives the ROC curve.

#Tip: confusion matrix
Image('Snipaste_2020-01-05_16-38-26.png')

[Figure: confusion matrix (image not rendered)]

#Tips: accuracy, Precision, Recall,f-score calculation method
Image('Snipaste_2020-01-05_16-39-27.png')

[Figure: accuracy, precision, recall and F-score calculation formulas (image not rendered)]
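Since the image is not available, the standard formulas are reproduced here for reference (TP, FP, TN, FN are the counts of true/false positives and negatives):

$$
\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN},\qquad
\text{precision} = \frac{TP}{TP + FP},\qquad
\text{recall} = \frac{TP}{TP + FN},\qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
$$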

Tip 5

  • The confusion matrix function is in the sklearn.metrics module
  • The confusion matrix requires the true labels and the predicted labels as input
  • Precision, recall and the F-score can be obtained with the classification_report function
#Write code
from sklearn.metrics import confusion_matrix
#Write code
lr = LogisticRegression(C=100)
lr.fit(X_train, y_train)
E:\program\python\program data\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(





LogisticRegression(C=100)
#Write code
pred = lr.predict(X_train)

#Write code
confusion_matrix(y_train,pred)

array([[355,  57],
       [ 82, 174]], dtype=int64)
from sklearn.metrics import classification_report
print(classification_report(y_train,pred))
              precision    recall  f1-score   support

           0       0.81      0.86      0.84       412
           1       0.75      0.68      0.71       256

    accuracy                           0.79       668
   macro avg       0.78      0.77      0.78       668
weighted avg       0.79      0.79      0.79       668

[thinking]

  • What should we pay attention to when computing the confusion matrix?
#Think and answer
#According to Baidu Baike's definition, the confusion matrix is also called an error matrix; it is a standard format for accuracy evaluation and is usually written as an n x n matrix. In practice the argument order also matters: the true labels come first and the predictions second (see the sketch below).
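A small sketch of the points to watch: the argument order is confusion_matrix(y_true, y_pred), rows are the true labels and columns the predictions, and the label order can be pinned down with the labels argument. The use of ravel() below is only valid for the binary case; the sketch is added here for illustration.

#Sketch: reading the confusion matrix safely
cm = confusion_matrix(y_train, pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()     # binary case: TN, FP, FN, TP in row-major order
print(cm)
print("TN, FP, FN, TP =", tn, fp, fn, tp)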

Task 3: ROC curve

  • Draw the ROC curve

[Thinking] What is the ROC curve, and what problem does it exist to solve?

#reflection
#(1) The ROC curve makes it easy to see a classifier's ability to separate the samples at any given threshold.
#(2) The ROC curve can be used to choose the best cut-off value for a diagnostic method. The closer the curve is to the upper-left corner, the higher the TPR and the lower the FPR, i.e. the higher the sensitivity and the lower the false-alarm rate, so the better the method performs. The point on the curve closest to the upper-left corner has the largest sum of sensitivity and specificity, and this point (or a point near it) is often used as the diagnostic reference value; such points are called optimal cut-off points and the values at them optimal cut-off values.
#(3) In a binary classification problem (0, 1), the model generally outputs a probability value, i.e. the probability that the result is 1. A threshold must then be chosen: if the model's output probability exceeds the threshold, the sample is classified as 1; if it is below the threshold, it is classified as 0.

Tip 6

  • The ROC curve utilities in sklearn are in the sklearn.metrics module
  • The larger the area under the ROC curve (the AUC), the better
#Write code
from sklearn.metrics import roc_curve
#Write code
fpr, tpr, thresholds = roc_curve(y_test, lr.decision_function(X_test))
plt.plot(fpr, tpr, label="ROC Curve")
plt.xlabel("FPR")
plt.ylabel("TPR(recall)")
close_zero = np.argmin(np.abs(thresholds))
plt.plot(fpr[close_zero], tpr[close_zero], 'o', markersize=10, label="threshold zero", fillstyle="none", c='k', mew=2)
plt.legend(loc=4)
<matplotlib.legend.Legend at 0x16b201b60d0>

[Figure: ROC curve with the threshold-zero point marked (image not rendered)]
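The area under the curve mentioned in Tip 6 can be computed directly from the same scores used for the plot; this one-line sketch is added here for illustration.

#Sketch: area under the ROC curve
from sklearn.metrics import roc_auc_score
print("AUC: {:.3f}".format(roc_auc_score(y_test, lr.decision_function(X_test))))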

Topics: Python Machine Learning Data Analysis