API imports:
import pandas as pd
import numpy as np
import scipy.stats
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression as LR
from sklearn.model_selection import train_test_split
Data acquisition:
df = pd.read_csv('./Data/rankingcard.csv',index_col=0)
Remove duplicates:
Duplicate rows easily appear when data is entered by hand or the same record is logged more than once, so remove the duplicates first.
# Remove duplicate rows
df.drop_duplicates(inplace=True)
After dropping duplicates the rows are gone, but the original index is kept, so the index has to be reset:
df.index = range(df.shape[0])
Missing value handling:
df.isnull().mean()
# About 2% of NumberOfDependents and about 20% of MonthlyIncome are missing.
# Monthly income weighs heavily in credit evaluation and would not be missing on this
# scale in practice, so it is filled with a random forest rather than a simple statistic.
Fill the feature with few missing values using the mean:
# Fill NumberOfDependents with its mean
df['NumberOfDependents'] = df['NumberOfDependents'].fillna(df['NumberOfDependents'].mean())
MonthlyIncome has a large impact on the result and many missing values, so a random forest is used to impute it:
x = df.loc[:, df.columns != 'MonthlyIncome']
y = df.loc[:, 'MonthlyIncome']

# Split into rows where income is known (used for training) and rows where it is missing (to predict)
y_train = y[y.notnull()]
y_test = y[y.isnull()]
x_train = x.iloc[y_train.index, :]
x_test = x.iloc[y_test.index, :]

from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor()
rf.fit(x_train, y_train)
y_pre = rf.predict(x_test)
df.loc[df['MonthlyIncome'].isnull(), 'MonthlyIncome'] = y_pre
Missing values handled!
Outlier handling:
# Outliers can be detected with box plots, the 3-sigma rule, or descriptive statistics; descriptive statistics are used here
Observe the data:
# The descriptive statistics reveal two kinds of outliers:
# 1. age has a minimum of 0, which is clearly unreasonable
# 2. the counts of being 30-59, 60-89 and 90+ days past due within two years have a
#    maximum of 98, which is impossible
# In real work you would ask the people who produced the data how these values were
# calculated; here we simply treat them as outliers and remove them
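One quick way to produce these descriptive statistics (the exact call is not shown here, so this is a sketch) is to ask describe() for extra percentiles:

# Inspect the distribution of every column, including the tails
df.describe([0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 0.99]).T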
# Delete the age outliers
df = df[df['age'] != 0]
# Delete the past-due outliers
df = df[df['NumberOfTime30-59DaysPastDueNotWorse'] < 90]
# After deleting rows, the index must be reset again
df.index = range(df.shape[0])
The descriptive statistics also show that the features are on very different scales, yet no standardization is applied here. Standardized, roughly normal data would let gradient descent converge faster, but the ultimate goal is to serve the business: a scorecard is used by business staff to score customers from the information new customers fill in. Once the data is rescaled, its magnitude and range change and business staff can no longer interpret it, so we keep the data as close to its original form as possible.
Use imblearn to handle the sample imbalance:
# imblearn is a library dedicated to imbalanced data sets; for resampling it is far more
# convenient than sklearn. Like sklearn, its samplers are classes that are instantiated
# and then fitted.
import imblearn
from imblearn.over_sampling import SMOTE

# Rebuild the feature matrix and label (SeriousDlqin2yrs, the label, is the first column of df)
x = df.iloc[:, 1:]
y = df.iloc[:, 0]

sm = SMOTE(random_state=22)
x, y = sm.fit_resample(x, y)
y.value_counts()
Data preprocessing is now complete: 1) duplicates removed, 2) missing values filled, 3) outliers removed, 4) sample imbalance handled.
Split into training and test sets:
# For binning (building scores stage by stage) we need a feature-matrix-plus-label structure,
# so after splitting the data set we recombine the training set and the test set separately
y = pd.DataFrame(y)
x = pd.DataFrame(x)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=22)

model_data = pd.concat([y_train, x_train], axis=1)
test_data = pd.concat([y_test, x_test], axis=1)
model_data.index = range(x_train.shape[0])
test_data.index = range(x_test.shape[0])

model_data.to_csv('./Data/model_data.csv')
test_data.to_csv('./Data/test_data.csv')
# The core of the scorecard: binning
# Each feature is split into bins so that staff can score each bin separately.
# The essence is discretizing continuous data, so the number of bins should not be too large.
# Keep in mind that discretizing a continuous variable always loses information:
# the fewer the bins, the greater the loss.
# IV (information value) measures how much information a feature carries, i.e. its
# contribution to the prediction.
# Binning idea, based on WOE (weight of evidence) and bad% (share of bad customers):
# 1) First split the continuous variable into a large number of groups, e.g. split
#    tens of thousands of samples into 100 or 50 groups
# 2) Make sure every group contains samples of both classes, otherwise WOE/IV cannot be computed
# 3) Run a chi-square test on adjacent groups and merge the pair with the largest
#    p-value (weakest evidence of difference), repeating until the number of groups
#    drops to the preset N (a minimal sketch of one such test follows this list)
# 4) For each feature, try [2, 3, 4, ..., 20] bins, watch how IV changes with the number
#    of bins, and pick the most suitable number (let N run from 2 to 20 and draw a
#    learning curve over the bins)
# 5) After binning, compute the WOE and bad% of each bin and inspect the result
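Step 3 above boils down to a chi-square test of independence on the 2x2 table of good/bad counts from two adjacent bins. A minimal sketch with made-up counts (the numbers here are illustrative, not from the data):

import scipy.stats

# good (label 0) and bad (label 1) counts of two hypothetical adjacent bins
bin_a = [500, 40]
bin_b = [480, 55]
p_value = scipy.stats.chi2_contingency([bin_a, bin_b])[1]
print(p_value)  # a large p-value means the bins look alike and are candidates for merging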
model_data['qcut'], updown = pd.qcut(model_data['age'], retbins=True, q=20, duplicates='drop')
"""
pd.qcut is quantile-based binning: it discretizes a continuous variable and can only
handle one-dimensional data.
Parameter q: the number of bins to create.
Parameter retbins=True: also return the bin edges.
It therefore returns two things: a Series telling which bin each sample falls into,
and the array of lower/upper bounds of all bins.
"""
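To see what pd.qcut produced, the bin sizes and edges can be inspected directly (a quick check, not part of the original flow):

model_data['qcut'].value_counts().sort_index()  # number of samples in each age bin
updown                                          # bin edges; len(updown) = number of bins + 1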
# Group by bin and count the number of label 0 and label 1 samples in each bin
count_0 = model_data[model_data['SeriousDlqin2yrs'] == 0].groupby(by='qcut').count()['SeriousDlqin2yrs']
count_1 = model_data[model_data['SeriousDlqin2yrs'] == 1].groupby(by='qcut').count()['SeriousDlqin2yrs']
# Pack each bin's lower bound, upper bound, count of 0s and count of 1s together
num_bins = [*zip(updown, updown[1:], count_0, count_1)]
# Compute WOE and BAD RATE
# BAD RATE and bad% are not the same thing:
# BAD RATE is the share of bad samples inside one bin (bad / total of that bin)
# bad% is the share of the feature's bad samples that fall into that bin
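For example, suppose one bin holds 900 good and 100 bad customers while the whole feature contains 10,000 bad customers in total: that bin's bad rate is 100 / 1000 = 10%, but its bad% is 100 / 10,000 = 1%.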
def get_woe(num_bins):
    columns = ['min', 'max', 'count_0', 'count_1']
    num_bin = pd.DataFrame(num_bins, columns=columns)
    num_bin['total'] = num_bin.count_0 + num_bin.count_1
    num_bin['bad_rate'] = num_bin.count_1 / num_bin.total        # share of bad inside the bin
    num_bin['percentage'] = num_bin.total / num_bin.total.sum()  # share of all samples in the bin
    num_bin['good'] = num_bin.count_0 / num_bin.count_0.sum()    # good% (label 0 is good)
    num_bin['bad'] = num_bin.count_1 / num_bin.count_1.sum()     # bad%  (label 1 is bad)
    num_bin['woe'] = np.log(num_bin['good'] / num_bin['bad'])
    return num_bin
def get_iv(num_bin):
    rate = num_bin['good'] - num_bin['bad']
    iv = np.sum(rate * num_bin.woe)
    return iv
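The helpers are chained like this to get an IV number (a usage sketch; the original does not show the exact call that produced the value quoted below):

get_iv(get_woe(num_bins))  # information value of age with the current bins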
The calculated IV value is:
0.36095534222757264
With these helper functions written, run the chi-square tests, merge the bins, and draw the IV curve.
# Merge adjacent bins by chi-square p-value and record the IV at each step
num_bins_ = num_bins.copy()
iv_nums = []
axisx = []

while len(num_bins_) > 2:
    pvs = []
    # Chi-square test on every pair of adjacent bins
    for j in range(len(num_bins_) - 1):
        pv = scipy.stats.chi2_contingency([num_bins_[j][2:], num_bins_[j + 1][2:]])[1]
        pvs.append(pv)
    # The pair with the largest p-value is the most similar, so merge bins i and i+1:
    # the merged bin starts where bin i starts, ends where bin i+1 ends, and its
    # label-0 and label-1 counts are the sums of the two bins
    i = pvs.index(max(pvs))
    num_bins_[i:i + 2] = [(
        num_bins_[i][0],
        num_bins_[i + 1][1],
        num_bins_[i][2] + num_bins_[i + 1][2],
        num_bins_[i][3] + num_bins_[i + 1][3])]
    axisx.append(len(num_bins_))
    iv_nums.append(get_iv(get_woe(num_bins_)))

plt.figure(figsize=(20, 8), dpi=100)
plt.plot(axisx, iv_nums)
plt.xticks(axisx)
plt.show()
Since fewer bins always mean more information loss, we look at the curve and choose the number of bins just before the IV starts to drop fastest, which here is 6.
Verify the result with the chosen number of bins:
# Wrap the merging into a binning function
def get_bin(num_bins_, n):
    # n is the target number of bins
    while len(num_bins_) > n:
        pvs = []
        for j in range(len(num_bins_) - 1):
            pv = scipy.stats.chi2_contingency([num_bins_[j][2:], num_bins_[j + 1][2:]])[1]
            pvs.append(pv)
        # Merge the adjacent pair with the largest p-value, exactly as above
        i = pvs.index(max(pvs))
        num_bins_[i:i + 2] = [(
            num_bins_[i][0],
            num_bins_[i + 1][1],
            num_bins_[i][2] + num_bins_[i + 1][2],
            num_bins_[i][3] + num_bins_[i + 1][3])]
    return num_bins_

get_woe(get_bin(num_bins, 5))
# After binning, the WOE values across the bins should be monotonic, with at most one turning point
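The monotonicity requirement in the comment above can also be checked programmatically; a small sketch (these helper lines are illustrative, not from the original):

bin_result = get_woe(get_bin(num_bins, 5))
signs = np.sign(np.diff(bin_result['woe']))            # direction of change between adjacent bins
turning_points = int((signs[:-1] != signs[1:]).sum())  # number of direction changes
print(turning_points)  # should be 0 or 1 for a usable binning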
Package the whole bin-selection process into a single function:
def best_bin(DF, x, y, n=2, q=20, isshow=True):
    """
    DF:      data to bin
    x:       name of the column to bin
    y:       name of the label column
    n:       number of bins to keep
    q:       number of bins to start from
    isshow:  whether to draw the IV curve, default True
    """
    DF = DF[[x, y]].copy()  # keep only the x and y columns

    # Initial quantile binning
    DF['qcut'], updown = pd.qcut(DF[x], retbins=True, q=q, duplicates='drop')

    # Count label 0 and label 1 per bin
    count_0 = DF[DF[y] == 0].groupby(by='qcut').count()[y]
    count_1 = DF[DF[y] == 1].groupby(by='qcut').count()[y]

    # Pack lower bound, upper bound, count of 0s and count of 1s per bin
    num_bins = [*zip(updown, updown[1:], count_0, count_1)]

    # Make sure every bin contains both classes
    for i in range(q):
        # If the first bin is missing a class, merge it with the next bin. The merged
        # bin may again be missing a class, so come back to this check (continue)
        # until the first bin is guaranteed to contain both classes.
        if 0 in num_bins[0][2:]:
            num_bins[0:2] = [(
                num_bins[0][0],
                num_bins[1][1],
                num_bins[0][2] + num_bins[1][2],
                num_bins[0][3] + num_bins[1][3])]
            continue
        # Every other bin missing a class is merged with the previous bin (the last bin
        # has no following bin, and the first bin was handled above). The break fires
        # only when a merge happens: len(num_bins) has changed, so the inner loop must
        # be restarted rather than continued.
        for i in range(len(num_bins)):
            if 0 in num_bins[i][2:]:
                num_bins[i - 1:i + 1] = [(
                    num_bins[i - 1][0],
                    num_bins[i][1],
                    num_bins[i - 1][2] + num_bins[i][2],
                    num_bins[i - 1][3] + num_bins[i][3])]
                break
        else:
            break

    # get_woe and get_iv defined above are reused here

    # Merge bins by chi-square p-value and record the IV at each step
    iv_nums = []
    axisx = []
    while len(num_bins) > n:
        pvs = []
        for j in range(len(num_bins) - 1):
            pv = scipy.stats.chi2_contingency([num_bins[j][2:], num_bins[j + 1][2:]])[1]
            pvs.append(pv)
        # Merge the adjacent pair with the largest p-value: the merged bin starts where
        # bin i starts, ends where bin i+1 ends, and its class counts are the sums
        i = pvs.index(max(pvs))
        num_bins[i:i + 2] = [(
            num_bins[i][0],
            num_bins[i + 1][1],
            num_bins[i][2] + num_bins[i + 1][2],
            num_bins[i][3] + num_bins[i + 1][3])]
        bins_df = pd.DataFrame(get_woe(num_bins))
        axisx.append(len(num_bins))
        iv_nums.append(get_iv(bins_df))

    if isshow:
        plt.figure(figsize=(20, 8), dpi=100)
        plt.plot(axisx, iv_nums)
        plt.xticks(axisx)
        plt.show()
    return None
n_columns = model_data.columns
# Skip the label (first column) and the 'qcut' helper column (last column)
for i in n_columns[1:-1]:
    print(i)
    best_bin(model_data, i, 'SeriousDlqin2yrs', n=2, q=20, isshow=True)
# Some of the curves above look abnormal. Remember that the goal of binning is to turn
# continuous variables into discrete ones, but features such as the number of
# dependents or the past-due counts are not really continuous to begin with, so they
# misbehave when binned with the same procedure as the continuous features.
# These features have to be binned by hand.
# Features that can be binned automatically, with the chosen number of bins
auto_col_bins = {
    'RevolvingUtilizationOfUnsecuredLines': 6,
    'age': 5,
    'DebtRatio': 4,
    'MonthlyIncome': 4,
    'NumberOfOpenCreditLinesAndLoans': 5
}

# Features that need manual binning
hand_bins = {
    'NumberOfTime30-59DaysPastDueNotWorse': [0, 1, 2, 13],
    'NumberOfTimes90DaysLate': [0, 1, 2, 17],
    'NumberRealEstateLoansOrLines': [0, 1, 2, 3, 54],
    'NumberOfTime60-89DaysPastDueNotWorse': [0, 1, 2, 11],
    'NumberOfDependents': [0, 1, 2, 3, 4, 20]
}

# To guarantee the intervals cover every possible value, replace the upper bound with
# np.inf and add -np.inf as the lower bound
hand_bins = {k: [-np.inf, *i[:-1], np.inf] for k, i in hand_bins.items()}
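As a quick sanity check of the replacement above: hand_bins['NumberOfTimes90DaysLate'] becomes [-np.inf, 0, 1, 2, np.inf], so the intervals produced later by pd.cut with these edges cover any value the feature can take.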
# Here I define a variant of the optimal-binning function that returns the detailed binning
# result bins_df. (For features whose binning does not meet the requirements, bins_df cannot
# be returned; features that bin normally return it without problems.)
def best_bin_s(DF, x, y, n=2, q=20, isshow=True):
    """
    Same as best_bin above, but returns bins_df (the detailed binning result) instead of None.
    DF: data to bin; x: column to bin; y: label column;
    n: number of bins to keep; q: number of bins to start from;
    isshow: whether to draw the IV curve, default True
    """
    DF = DF[[x, y]].copy()
    DF['qcut'], updown = pd.qcut(DF[x], retbins=True, q=q, duplicates='drop')
    count_0 = DF[DF[y] == 0].groupby(by='qcut').count()[y]
    count_1 = DF[DF[y] == 1].groupby(by='qcut').count()[y]
    num_bins = [*zip(updown, updown[1:], count_0, count_1)]

    # Make sure every bin contains both classes (same logic as in best_bin)
    for i in range(q):
        if 0 in num_bins[0][2:]:
            num_bins[0:2] = [(
                num_bins[0][0],
                num_bins[1][1],
                num_bins[0][2] + num_bins[1][2],
                num_bins[0][3] + num_bins[1][3])]
            continue
        for i in range(len(num_bins)):
            if 0 in num_bins[i][2:]:
                num_bins[i - 1:i + 1] = [(
                    num_bins[i - 1][0],
                    num_bins[i][1],
                    num_bins[i - 1][2] + num_bins[i][2],
                    num_bins[i - 1][3] + num_bins[i][3])]
                break
        else:
            break

    # Merge bins by chi-square p-value, reusing get_woe and get_iv defined above
    iv_nums = []
    axisx = []
    while len(num_bins) > n:
        pvs = []
        for j in range(len(num_bins) - 1):
            pv = scipy.stats.chi2_contingency([num_bins[j][2:], num_bins[j + 1][2:]])[1]
            pvs.append(pv)
        i = pvs.index(max(pvs))
        num_bins[i:i + 2] = [(
            num_bins[i][0],
            num_bins[i + 1][1],
            num_bins[i][2] + num_bins[i + 1][2],
            num_bins[i][3] + num_bins[i + 1][3])]
        bins_df = pd.DataFrame(get_woe(num_bins))
        axisx.append(len(num_bins))
        iv_nums.append(get_iv(bins_df))

    if isshow:
        plt.figure(figsize=(20, 8), dpi=100)
        plt.plot(axisx, iv_nums)
        plt.xticks(axisx)
        plt.show()
    return bins_df
# Call this function on every feature that can be binned automatically, let it return the
# detailed bins_df, and replace the smallest and largest bin edges with -np.inf and np.inf
bins_col = {}
for col in auto_col_bins:
    bins_df = best_bin_s(model_data, col, 'SeriousDlqin2yrs',
                         n=auto_col_bins[col], q=20, isshow=False)
    # Take the union of the lower and upper bounds to get the list of bin edges
    bins_df = sorted(set(bins_df['min']).union(bins_df['max']))
    # Widen the outermost edges so the intervals cover any possible value
    bins_df[0], bins_df[-1] = -np.inf, np.inf
    bins_col[col] = bins_df

# Add the manually binned features
bins_col.update(hand_bins)
Calculate the WOE value of each bin and map it onto the data:
# The function above computes WOE automatically for the features it can bin, but the
# manually binned features have to be handled separately.
# The WOE of each bin then replaces the original values in model_data: instead of the raw
# values we want WOE to represent the difference between bins, because what the model should
# learn is the effect of which bin a sample falls into, i.e. the effect of each scoring item
# on the scorecard.
def new_get_woe(df, x, y, bins):
    df = df[[x, y]].copy()
    df['cut'] = pd.cut(df[x], bins)
    # Rows: bins; columns: counts of label 0 (good) and label 1 (bad)
    bins_df = df.groupby('cut')[y].value_counts().unstack()
    # WOE = ln(good% / bad%); note the whole ratio sits inside the log
    woe = bins_df['WOE'] = np.log((bins_df[0] / bins_df[0].sum()) / (bins_df[1] / bins_df[1].sum()))
    return woe
woeall = {}
for col in bins_col:
    woeall[col] = new_get_woe(model_data, col, 'SeriousDlqin2yrs', bins_col[col])
# Next, map all the WOE values onto the original data
model_woe = pd.DataFrame(index=model_data.index)
for col in bins_col:
    model_woe[col] = pd.cut(model_data[col], bins_col[col]).map(woeall[col])
model_woe['SeriousDlqin2yrs'] = model_data['SeriousDlqin2yrs']
# This is the modelling data: each value is the WOE of the bin the sample falls into
model_woe
Modeling and model validation:
# Apply the same mapping to the test set
test_woe = pd.DataFrame(index=test_data.index)
for col in bins_col:
    test_woe[col] = pd.cut(test_data[col], bins_col[col]).map(woeall[col])
test_woe['SeriousDlqin2yrs'] = test_data['SeriousDlqin2yrs']

X = model_woe.iloc[:, :-1]
y = model_woe.iloc[:, -1]
test_x = test_woe.iloc[:, :-1]
test_y = test_woe.iloc[:, -1]
Machine learning:
lr = LR().fit(X, y)
lr.score(test_x, test_y)
Tuning:
# Learning curve over C: a coarse range first, then a finer one
c_1 = np.linspace(0.01, 1, 20)
c_2 = np.linspace(0.01, 0.2, 20)

score = []
for i in c_2:
    lr = LR(solver='liblinear', C=i).fit(X, y)
    score.append(lr.score(test_x, test_y))
print(max(score), score.index(max(score)))
plt.figure(figsize=(20, 8), dpi=80)
plt.plot(c_2, score)
plt.show()

lr.n_iter_  # number of iterations used

# Learning curve over max_iter
score = []
for i in [1, 2, 3, 4, 5, 6]:
    lr = LR(solver='liblinear', C=0.01, max_iter=i).fit(X, y)
    score.append(lr.score(test_x, test_y))
plt.figure()
plt.plot([1, 2, 3, 4, 5, 6], score)
plt.show()

# Get the final model (max_iter takes the last value of i from the loop above)
lr = LR(solver='liblinear', C=0.01, max_iter=i).fit(X, y)
lr.score(test_x, test_y)
# ROC curve
# If needed: pip install scikit-plot
import scikitplot as skplt

vali_proba_df = pd.DataFrame(lr.predict_proba(test_x))
skplt.metrics.plot_roc(test_y, vali_proba_df,
                       plot_micro=False, plot_macro=False, figsize=(6, 6))
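If scikit-plot is not available, the AUC behind the same curve can be computed directly with sklearn (a sketch):

from sklearn.metrics import roc_auc_score

# probability of the positive class (SeriousDlqin2yrs == 1)
roc_auc_score(test_y, lr.predict_proba(test_x)[:, 1])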
Make score card:
# With the model built, accuracy and the ROC curve have verified its predictive power.
# Next, convert the logistic regression into a standard scorecard.
#The score in the scorecard is calculated by the following formula:
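Reconstructing it from the constants in the code below: Score = A - B * ln(odds), where odds are the odds of default predicted by the model. The constants encode two business choices: a base score of 600 at odds of 1:60, and 20 points to double the odds (PDO = 20). Solving 600 = A - B * ln(1/60) together with B * ln(2) = 20 gives B = 20 / ln(2) and A = 600 + B * ln(1/60). Because logistic regression models ln(odds) as intercept + Σ coef_i * WOE_i, the base score is A - B * intercept and each bin of feature i contributes -B * coef_i * WOE, which is exactly what the code writes into the scorecard file.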
B = 20 / np.log(2)
A = 600 + B * np.log(1 / 60)
B, A

# Base score from the intercept
base_score = A - B * lr.intercept_
base_score

# Write the scorecard: the base score plus the score of every bin of every feature
with open('./Data/ScoreData.csv', 'w') as f:
    f.write('base_score,{}\n'.format(base_score))

for i, col in enumerate(X.columns):
    score = woeall[col] * (-B * lr.coef_[0][i])
    score.name = 'Score'
    score.index.name = col
    score.to_csv('./Data/ScoreData.csv', header=True, mode='a')
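To illustrate how the finished card would be used (a hypothetical sketch that reuses the mapping logic from earlier; the applicant chosen here is arbitrary):

# Score the first applicant of the test set: base score plus the score of the bin
# each of their feature values falls into
applicant_score = float(base_score[0])
for i, col in enumerate(X.columns):
    woe_val = pd.cut(test_data[col], bins_col[col]).map(woeall[col]).iloc[0]
    applicant_score += woe_val * (-B * lr.coef_[0][i])
print(applicant_score)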
Done!