Alibaba Cloud Tianchi financial risk control training camp of Guangzhou University of Technology Data Association - Task 1

Posted by reddevil on Wed, 02 Mar 2022 01:31:53 +0100

Task 1: Understand the competition problem

I Learning syllabus

● Overview of the competition problem
● Data overview
● Prediction metrics
● Analysis of the competition problem

II Learning content

1. Overview of the competition problem

The competition requires contestants to build models from the given data set and predict financial risk.
(1) The data comes from the loan records of a credit platform, with more than 1.2 million records in total, including 47 columns of variable information, one of which is anonymous.
(2) 800,000 records are drawn from the data set as the training set, 200,000 as test set A and 200,000 as test set B. Fields such as employmentTitle, purpose, postCode and title are desensitized.

2. Data overview

Understand each data column and the nature of the feature it represents.

3. Prediction metrics

The competition adopts AUC (Area Under the Curve) as the evaluation metric.

(1) Common evaluation metrics for classification algorithms:

(1) Confusion matrix

  1. True positive TP (True Positive): the instance is a positive class and is predicted as positive.
  2. False negative FN (False Negative): the instance is a positive class and is predicted as negative.
  3. False positive FP (False Positive): the instance is a negative class and is predicted as positive.
  4. True negative TN (True Negative): the instance is a negative class and is predicted as negative.

(2) Accuracy [note: not suitable for imbalanced samples]
(3) Precision
(4) Recall
(5) F1 Score [balances precision and recall; formulas in the sketch after this list]
(6) P-R curve
(7) ROC curve (ROC space plots the false positive rate on the X axis and the true positive rate on the Y axis)
(8) AUC (value range 0.5 ~ 1.0; the closer to 1.0, the more reliable the classifier; at 0.5 the classifier has the lowest reliability and no practical value)
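
Accuracy, precision, recall and F1 all follow directly from the four confusion-matrix counts. Below is a minimal sketch using the counts from the same toy example as the code in Section III (y_true = [0, 1, 1, 0], y_pred = [0, 1, 1, 1], so TP=2, FN=0, FP=1, TN=1):

# Metrics written directly in terms of the confusion-matrix counts.
# Counts taken from the toy example in Section III.
TP, FN, FP, TN = 2, 0, 1, 1

accuracy = (TP + TN) / (TP + TN + FP + FN)           # fraction of correct predictions
precision = TP / (TP + FP)                           # of predicted positives, how many are truly positive
recall = TP / (TP + FN)                              # of true positives, how many are found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of precision and recall

print(accuracy, precision, recall, f1)  # 0.75 0.6666666666666666 1.0 0.8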

(2) Evaluation metrics commonly used in financial risk control:

KS (Kolmogorov-Smirnov, commonly used to evaluate model discrimination)

  • In risk control, KS is often used to evaluate a model's discriminating power: the greater the discrimination, the stronger the model's ability to rank risk.

  • Difference between the K-S curve and the ROC curve:
    The ROC curve plots the false positive rate on the horizontal axis and the true positive rate on the vertical axis;
    The K-S curve plots both the true positive rate and the false positive rate on the vertical axis, with the chosen threshold on the horizontal axis.

  • In general, the larger the KS value, the stronger the model's distinguishing ability. However, a larger KS does not always mean a better model: if the KS value is too large, the model may be abnormal, so a very high KS is a signal to check whether the model is overfitted. The ranges below map KS values to model quality; the mapping is not unique and only reflects the general trend (a sketch of how KS is computed follows the list).

  • KS < 0.2: it is generally considered that the model has no distinguishing ability
    0.2 ≤ KS < 0.3: the model has some distinguishing ability
    0.3 ≤ KS ≤ 0.5: the model has strong distinguishing ability
    KS > 0.75: the model is likely abnormal
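
scikit-learn has no dedicated KS helper, but the KS statistic equals the largest gap between TPR and FPR across all thresholds, so it can be read off the ROC curve. A minimal sketch, with made-up labels and scores purely for illustration:

import numpy as np
from sklearn.metrics import roc_curve

# Made-up labels and predicted default probabilities, for illustration only
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9])

# KS statistic = max over thresholds of (TPR - FPR), i.e. the largest
# vertical gap between the two curves on a K-S plot
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print('KS:', np.max(tpr - fpr))  # KS: 0.5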

III Code examples

1. Obtain data

import pandas as pd

# Load the training set and test set A directly from the competition's public OSS bucket
train = pd.read_csv('http://tianchi-media.oss-cn-beijing.aliyuncs.com/dragonball/FRC/data_set/train.csv')
testA = pd.read_csv('http://tianchi-media.oss-cn-beijing.aliyuncs.com/dragonball/FRC/data_set/testA.csv')

print('Train data shape:', train.shape)
print('TestA data shape:', testA.shape)
train.head()

2. Evaluation metrics for classification algorithms

1) Confusion matrix

from sklearn.metrics import confusion_matrix

y_pred = [0, 1, 1, 1]
y_true = [0, 1, 1, 0]
# scikit-learn lays the matrix out as [[TN, FP], [FN, TP]]
print('Confusion matrix:\n', confusion_matrix(y_true, y_pred))

Confusion matrix:
[[1 1]
[0 2]]

2) Accuracy

from sklearn.metrics import accuracy_score

y_pred = [0, 1, 1, 1]
y_true = [0, 1, 1, 0]
print('ACC:', accuracy_score(y_true, y_pred))

ACC: 0.75

3) Precision

from sklearn import metrics
y_pred = [0, 1, 1, 1]
y_true = [0, 1, 1, 0]
print('Precision', metrics.precision_score(y_true, y_pred))

Precision 0.6666666666666666

4) Recall rate

from sklearn import metrics
y_pred = [0, 1, 1, 1]
y_true = [0, 1, 1, 0]
print('Recall', metrics.recall_score(y_true, y_pred))

Recall 1.0

5)F1 Score

from sklearn import metrics

y_pred = [0, 1, 1, 1]
y_true = [0, 1, 1, 0]
print('F1-score:', metrics.f1_score(y_true, y_pred))

F1-score: 0.8

6) P-R curve

import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

# In practice y_pred would be predicted probabilities; hard 0/1 labels give only a few points on the curve
y_pred = [0, 0, 1, 0, 0, 0, 1, 1, 1, 1]
y_true = [0, 1, 1, 0, 1, 0, 1, 1, 0, 1]
precision, recall, thresholds = precision_recall_curve(y_true, y_pred)
# By convention the P-R curve puts recall on the x axis and precision on the y axis
plt.plot(recall, precision)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.show()


7)ROC

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

y_pred = [0, 0, 1, 0, 0, 0, 1, 1, 1, 1]
y_true = [0, 1, 1, 0, 1, 0, 1, 1, 0, 1]
FPR, TPR, thresholds = roc_curve(y_true, y_pred)
plt.title('ROC')
plt.plot(FPR, TPR, 'b')          # ROC curve
plt.plot([0, 1], [0, 1], 'r--')  # diagonal reference line (random classifier)
plt.ylabel('TPR')
plt.xlabel('FPR')
plt.show()


8) AUC

import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 1, 1, 0])
y_scores = np.array([0.2, 0.3, 0.4, 0.7])
print('AUC score:', roc_auc_score(y_true, y_scores))

AUC score: 0.5

1. Tianchi financial risk control training camp

Topics: Python, Algorithm, Machine Learning