Alibaba Cloud Tianchi Financial Risk Control, Task 3 - Feature Engineering

Posted by exec1 on Sat, 19 Feb 2022 22:49:28 +0100

Competition: Beginner-Level Data Mining - Loan Default Prediction in Financial Risk Control
Competition address: https://tianchi.aliyun.com/competition/entrance/531830/introduction
Purpose:
1. Learn feature preprocessing methods such as missing-value handling, outlier handling, and data binning
2. Learn the corresponding methods for feature interaction, encoding, and selection

Overview of learning points

1. Converting time formats
2. Outlier handling
3. Feature selection

Learning content

Converting time formats

Source code:

import datetime

import pandas as pd

# Convert issueDate to datetime format
for data in [data_train, data_test_a]:
    data['issueDate'] = pd.to_datetime(data['issueDate'], format='%Y-%m-%d')
    startdate = datetime.datetime.strptime('2007-06-01', '%Y-%m-%d')
    # Construct a time feature: days elapsed since 2007-06-01
    data['issueDateDT'] = data['issueDate'].apply(lambda x: x - startdate).dt.days

pandas.to_datetime

pandas.to_datetime(arg, errors='raise', utc=None, format=None, unit=None)

errors: {'ignore', 'raise', 'coerce'}, default 'raise'
If 'raise', invalid parsing will raise an exception.

If 'coerce', invalid parsing will be set to NaT.

If 'ignore', invalid parsing will return the input unchanged.


utc: bool, default None
If True, return UTC (Coordinated Universal Time) timestamps.

format: str, default None. The strftime format string used to parse the time, e.g. '%d/%m/%Y'. Note that '%f' will parse all the way down to nanoseconds.

unit: str, default 'ns'
The unit of arg (D, s, ms, us, ns), where arg is an integer or float. The value is interpreted relative to the origin; for example, with unit='ms' and origin='unix' (the default), arg is the number of milliseconds since the start of the Unix epoch.

-------------------From the official documentation: pandas.to_datetime
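The effect of the errors and unit parameters can be seen in a small sketch (the input values here are illustrative):

```python
import pandas as pd

# errors='coerce' turns strings that cannot be parsed into NaT instead of raising
s = pd.to_datetime(pd.Series(['2007-06-01', 'not a date']),
                   format='%Y-%m-%d', errors='coerce')
print(s.isna().tolist())   # [False, True]

# unit='D' interprets numbers as days since the Unix epoch (origin='unix')
print(pd.to_datetime(1, unit='D'))   # 1970-01-02 00:00:00
```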

[1] datetime.strptime
The function that converts a string into a date is datetime.datetime.strptime().
The function that converts a date into a string is datetime.datetime.strftime().
Both functions use the date/time format codes listed below:

%a Abbreviated weekday name; e.g. Wednesday is Wed
%A Full weekday name; e.g. Wednesday is Wednesday
%b Abbreviated month name; e.g. April is Apr
%B Full month name; e.g. April is April
%c Standard date and time string (e.g. 04/07/10 10:43:39)
%C Century number (the year divided by 100)
%d Day of the month as a decimal number
%D Equivalent to month/day/year
%e Day of the month as a decimal number, in a two-character space-padded field
%F Equivalent to year-month-day
%g Last two digits of the ISO 8601 week-based year
%G ISO 8601 week-based year with century
%h Abbreviated month name (same as %b)
%H Hour (24-hour clock) as a decimal number
%I Hour (12-hour clock) as a decimal number
%j Day of the year as a decimal number
%m Month as a decimal number
%M Minute as a decimal number
%n Newline character
%p Locale's equivalent of AM or PM
%r 12-hour clock time
%R Hours and minutes: hh:mm
%S Second as a decimal number
%t Horizontal tab character
%T Hours, minutes and seconds: hh:mm:ss
%u Weekday as a decimal number (1 to 7, Monday is 1)
%U Week number of the year (Sunday as the first day of the week, 00 to 53)
%V ISO 8601 week number of the year
%w Weekday as a decimal number (0 to 6, Sunday is 0)
%W Week number of the year (Monday as the first day of the week, 00 to 53)
%x Standard date string
%X Standard time string
%y Year without century (00 to 99)
%Y Year with century
%z UTC offset; %Z time zone name (an empty string if it cannot be determined)
%% A literal percent sign
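A quick round trip between the two functions, using a few of the codes above (weekday and month names depend on the locale):

```python
import datetime

# String -> datetime with strptime
d = datetime.datetime.strptime('2007-06-01', '%Y-%m-%d')
print(d.year, d.month, d.day)      # 2007 6 1

# datetime -> string with strftime
print(d.strftime('%Y-%m-%d'))      # 2007-06-01
print(d.strftime('%A, %B %d'))     # e.g. Friday, June 01
```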

Outlier handling

pandas.groupby
The groupby function groups the data by the values of a chosen column, after which aggregation operations can be applied to each group separately.
Example:

import pandas as pd
import numpy as np
df = pd.DataFrame({'key1':list('aabba'),
                  'key2': ['one','two','one','two','one'],
                  'data1': np.random.randn(5),
                  'data2': np.random.randn(5)})
print(df)
print('*'*30)
print(df.groupby('data1').min())

Output:

  key1 key2     data1     data2
0    a  one -0.986048 -0.852297
1    a  two -1.644016  1.083959
2    b  one -0.428630  0.997801
3    b  two -0.146261 -0.156321
4    a  one -0.806370 -0.848416
******************************
          key1 key2     data2
data1                        
-1.644016    a  two  1.083959
-0.986048    a  one -0.852297
-0.806370    a  one -0.848416
-0.428630    b  one  0.997801
-0.146261    b  two -0.156321
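In the context of outlier handling, groupby is often combined with a per-group rule such as the 3-sigma criterion: within each group, values more than three standard deviations from the group mean are flagged. A minimal sketch with made-up data (the grade/loanAmnt names and the 3-sigma rule here are illustrative, not taken from the tutorial's code):

```python
import numpy as np
import pandas as pd

# Two groups; group A contains one extreme loan amount
df = pd.DataFrame({
    'grade': ['A'] * 11 + ['B'] * 4,
    'loanAmnt': [100] * 10 + [1000] + [50, 55, 60, 52],
})

# Flag values outside mean +/- 3 standard deviations within each group
def is_outlier(s):
    lo, hi = s.mean() - 3 * s.std(), s.mean() + 3 * s.std()
    return (s < lo) | (s > hi)

df['outlier'] = df.groupby('grade')['loanAmnt'].transform(is_outlier)
print(df[df['outlier']])   # only the 1000 in grade A is flagged
```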

Feature selection

[2]pandas.DataFrame.corrwith
This function computes the correlation between the rows or columns of a DataFrame and the rows or columns of another DataFrame or Series.

DataFrame.corrwith(other, axis=0, drop=False)

other: DataFrame, Series. Object with which to compute correlations.
axis: {0 or 'index', 1 or 'columns'}, default 0. 0 or 'index' to compute column-wise, 1 or 'columns' for row-wise.
drop: bool, default False. Drop missing indices from the result; by default the union of all indices is returned.

axis=0 (or axis='index') computes correlations between columns, and axis=1 (or axis='columns') computes correlations between rows.
Example:

import pandas as pd
import numpy as np


df1 = pd.DataFrame(np.random.randn(4, 4))
print(df1)
df2 = pd.DataFrame(np.random.randn(4, 4))
print(df2)
print(df1.corrwith(df2, axis=0)) # Column-wise correlations
print(df1.corrwith(df2, axis=1)) # Row-wise correlations

Output:

          0         1         2         3
0 -2.117707  1.077943  0.612401  0.581080
1  0.245959 -0.401353  0.333307 -0.589932
2 -0.886114 -0.165022  0.019672 -0.917109
3  1.041763  1.171818 -0.350419  2.252435
          0         1         2         3
0  0.235729 -1.033179 -1.470501  0.194247
1 -0.821702 -1.017748  1.337973 -0.242012
2  0.809055  2.193382  1.408613 -1.317768
3  0.533227 -0.940242  1.722331 -0.201507
0   -0.182795
1   -0.467261
2   -0.805739
3    0.549257
dtype: float64
0   -0.614108
1    0.486213
2    0.760603
3   -0.756266
dtype: float64

The difference between corrwith() and corr():
corrwith() matches rows or columns by label, so only identically named rows or columns are compared, while corr() computes pairwise correlations among the columns of a single DataFrame.
For example:

df3 = pd.DataFrame(np.random.randn(3, 2), columns=list('ab'))
df4 = pd.DataFrame(np.random.randn(3, 2), columns=list('ac'))
print(df4)
print(df3.corrwith(df4, axis=0))

Output:

          a         c
0 -1.354717  0.343711
1 -0.321272 -0.348672
2  0.499325  1.685045
a    0.980936
b         NaN
c         NaN
dtype: float64

If you want pandas to ignore the column names and, say, compare the first column of df1 with the first column of df2, you can rename the columns of df2 to match those of df1: df1.corrwith(df2.set_axis(df1.columns, axis='columns', inplace=False))

a   -0.510442
b    0.955783
dtype: float64

Note that in this case, df1 and df2 need to have the same number of columns.
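For feature selection, one way to apply corrwith is to correlate every feature column with the target Series and keep only the strongly correlated features. A minimal sketch with synthetic data (the f1/f2/f3 names and the 0.5 threshold are illustrative, not from the tutorial):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({'f1': rng.normal(size=n),
                  'f2': rng.normal(size=n),
                  'f3': rng.normal(size=n)})
y = X['f1'] * 2 + rng.normal(scale=0.1, size=n)   # target driven mostly by f1

# Correlation of every feature column with the target Series
corr = X.corrwith(y)
print(corr.abs().sort_values(ascending=False))

# Keep features whose absolute correlation exceeds a threshold
selected = corr[corr.abs() > 0.5].index.tolist()
print(selected)   # ['f1']
```

Note that correlation only captures linear relationships with the target, so in practice this is usually one signal among several rather than the sole selection criterion.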

Summary:

Feature engineering is one of the most important parts of machine learning, even of deep learning, and it often takes the most time in practical applications. For a beginner like me, the code and models in Task 3 are hard to read and understand; they contain many knowledge points that need to be digested and summarized step by step.

[1] strftime/strptime functions in the Python datetime module
[2] pandas corr() versus corrwith()

Topics: Python