Principle and code sharing of characteristic stability index PSI

Posted by ravensshade on Fri, 25 Feb 2022 11:59:46 +0100

PSI is an index that we need to pay close attention to before and after risk control modeling. This index directly reflects the stability of the model and has direct reference significance for us to evaluate whether the model needs iteration. Today, I will introduce this indicator from the following aspects.

Index

01 PSI concept 02 PSI generation logic 03 business application of PSI Python implementation of 04 PSI

01 PSI concept

The full name of PSI is called "Population Stability Index", which is a group stability index in Chinese translation. From the perspective of risk control application, it is a grouping test and cross time stability index. When we model, the grouping proportion distribution of data (variables or model scores) is our expectation, that is, we hope to show a similar grouping distribution in the test data set and future data sets, which we call stable. There is no specified value range for PSI value. What we need to know is that the smaller the value, the more stable it is. Generally, 0.25 will be taken as the screening threshold in risk control, that is, if PSI > 0.25, we will determine that this variable or model is unstable. Well, how to calculate the specific PSI? No hurry, please go on to the next section.

02 PSI generation logic

According to the Convention, we first put the calculation formula of PSI:

PSI = \sum\limits_{i=i}^n(A_i-E_i)* ln(\frac{A_i}{E_i})

Where, A_i represents the actual proportion of group i (in the total quantity), E_i represents the expected proportion of group i (that is, the proportion of groups during training or online). Let's take the data in the previous article "risk control ml [5] | must the distribution box be monotonous before woe" as an example. See the following table for details:

The formula is relatively simple and can be realized in Excel. The calculated PSI is 0.018, so it is stable.

03 business application of PSI

So with this stability index, how can it be applied in specific risk control scenarios? I usually apply it in the following scenarios: 1. Filter variables before modeling 2. Monitor the model after the model goes online

Filter variables before modeling

When making score cards, we usually choose variables with strong stability, because after the model is generally online, the next iteration will be one year later, so we prefer variables with strong stability. How to screen? Let's operate from the following steps: 1) Select the training data and determine the optimal bin of variables (for details, please refer to the previous article on optimal bin)

[1] What are the methods of automatic box division in risk control modeling [2] Code sharing of three continuous variable box dividing methods

2) Expected proportion distribution of initialization variables 3) Calculate the variable PSI of each month 4) Observe whether there are variables with PSI exceeding 0.25 and eliminate them.

Monitor the model after the model goes online

After the model goes online, we usually group the model points, such as A - F, so we will monitor the grouping stability of the model points. Similarly, the input parameters of the model will also be monitored.

Python implementation of 04 PSI

In the previous article( Code sharing of three continuous variable box dividing methods Based on the automatic box sorting algorithm introduced, the PSI is calculated based on numpy, and the test set can be obtained from the cut keyword input in the background of official account SamShare.

import pandas as pd
import numpy as np
import random
import math
from scipy.stats import chi2
import scipy

# Test data structure, where target is Y, 1 represents bad people and 0 represents good people.  
df = pd.read_csv('./data/autocut_testdata.csv')
print(len(df))
print(df.target.value_counts()/len(df))
print(df.head())

def get_maxks_split_point(data, var, target, min_sample=0.05):
    """ calculation KS value
    Args:
        data: DataFrame,Data set of the optimal cut-off point list of chi square sub box to be calculated
        var: Name of continuous variable to be calculated
        target: Target column to be calculated Y Name of
        min_sample: int,The minimum data sample of the sub box, that is, the amount of data at least reaches, which is generally used at the sub box point at the beginning or end
    Returns:
        ks_v: KS Value, float
        BestSplit_Point: Return the optimal division point of this iteration, float
        BestSplit_Position: Returns the position of the optimal dividing point, with 0 on the leftmost and 1 on the rightmost, float
    """
    if len(data) < min_sample:
        ks_v, BestSplit_Point, BestSplit_Position = 0, -9999, 0.0
    else:
        freq_df = pd.crosstab(index=data[var], columns=data[target])
        freq_array = freq_df.values
        if freq_array.shape[1] == 1: # If a group has only one enumeration value, such as 0 or 1, the array shape will have problems and jump out of this calculation
            # tt = np.zeros(freq_array.shape).T
            # freq_array = np.insert(freq_array, 0, values=tt, axis=1)
            ks_v, BestSplit_Point, BestSplit_Position = 0, -99999, 0.0
        else:
            bincut = freq_df.index.values
            tmp = freq_array.cumsum(axis=0)/(np.ones(freq_array.shape) * freq_array.sum(axis=0).T)
            tmp_abs = abs(tmp.T[0] - tmp.T[1])
            ks_v = tmp_abs.max()
            BestSplit_Point = bincut[tmp_abs.tolist().index(ks_v)]
            BestSplit_Position = tmp_abs.tolist().index(ks_v)/max(len(bincut) - 1, 1)
        
    return ks_v, BestSplit_Point, BestSplit_Position


def get_bestks_bincut(data, var, target, leaf_stop_percent=0.05):
    """ Calculate the optimal dividing point
    Args:
        data: DataFrame,Data set to be operated
        var: String,Name of continuous variable to be divided into boxes
        target: String,Y Column name
        leaf_stop_percent: The proportion of leaf nodes, as the stop condition, is 5 by default%
    
    Returns:
        best_bincut: The best list of syncopation points, List
    """
    min_sample = len(data) * leaf_stop_percent
    best_bincut = []
    
    def cutting_data(data, var, target, min_sample, best_bincut):
        ks, split_point, position = get_maxks_split_point(data, var, target, min_sample)
        
        if split_point != -99999:
            best_bincut.append(split_point)
        
        # The data set is segmented according to the optimal segmentation point, and the segmentation point is calculated recursively for the segmented data set until the stop condition is met
        # print("the value range of this sub box is {0} ~ {1}". Format (data [var]. Min(), data [var] max()))
        left = data[data[var] < split_point]
        right = data[data[var] > split_point]
        
        # When the data set after segmentation is still larger than the minimum data sample requirements, continue segmentation
        if len(left) >= min_sample and position not in [0.0, 1.0]:
            cutting_data(left, var, target, min_sample, best_bincut)
        else:
            pass
        if len(right) >= min_sample and position not in [0.0, 1.0]:
            cutting_data(right, var, target, min_sample, best_bincut)
        else:
            pass
        return best_bincut
    best_bincut = cutting_data(data, var, target, min_sample, best_bincut)
    
    # Fill in the head and tail of the syncopation point
    best_bincut.append(data[var].min())
    best_bincut.append(data[var].max())
    best_bincut_set = set(best_bincut)
    best_bincut = list(best_bincut_set)
    
    best_bincut.remove(data[var].min())
    best_bincut.append(data[var].min()-1)
    # Sort cut point
    best_bincut.sort()
    
    return best_bincut
    

age_bins=get_bestks_bincut(df, 'age', 'target')
df['age_bins'] = pd.cut(df['age'], bins=age_bins)
print("age The optimal dividing point of the box:", age_bins)
print("age The optimal bin division result:\n", df['age_bins'].value_counts())
df.head()

Next, import the test set and calculate the PSI.

# Import test dataset
df_test = pd.read_csv('./data/psi_testdata.csv')

def cal_psi(df_train, df_test, var, target):
    train_bins = get_bestks_bincut(df_train, var, target)
    train_cut_nums = pd.cut(df_train[var], bins=train_bins).value_counts().sort_index().values
    
    # The test set is segmented according to the segmentation point of the training set box
    def cut_test_data(data, var, bincut):
        # Expand the boundary line at both ends
        if bincut[0] > data[var].min()-1:
            bincut.remove(bincut[0])
            bincut.append(data[var].min()-1)
        if bincut[-1] < data[var].max():
            bincut.remove(bincut[-1])
            bincut.append(data[var].max())
    
        # Sort cut point
        bincut.sort()
        return bincut
    
    test_cut_nums = pd.cut(df_test[var], 
                            bins=cut_test_data(df_test, var, train_bins)).value_counts().sort_index().values
    
    tt = pd.DataFrame(np.vstack((train_cut_nums, test_cut_nums)).T, columns=['train', 'test'])
    
    # Calculate PSI
    E = tt['train'].values/tt['train'].values.sum(axis=0)
    A = tt['test'].values/tt['test'].values.sum(axis=0)
    A_sub_E = A-E
    A_divide_E = A/E
    ln_A_divide_E = np.log(A_divide_E) # log in numpy actually refers to ln
    PSI_i = A_sub_E * ln_A_divide_E
    psi = PSI_i.sum()
    return tt, psi
    
tt, psi = cal_psi(df, df_test, 'age', 'target')
print("PSI: ", psi)
tt

Reference

[1] In depth understanding and application of risk control model group stability index (PSI) https://zhuanlan.zhihu.com/p/79682292