Competition title: Data Mining for Beginners - Financial Risk Control for Beginners: Loan Default Prediction
Competition address: https://tianchi.aliyun.com/competition/entrance/531830/introduction
Purpose:
1. Learn feature preprocessing, missing value handling, outlier handling, data binning and other feature processing methods
2. Learn the corresponding methods for feature interaction, encoding and selection
Overview of learning points
1. Conversion time format
2. Outlier handling
3. Feature selection
Learning content
Conversion time format
Source code:
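import datetime
import pandas as pd

# Convert issueDate to datetime format (data_train and data_test_a are the competition DataFrames)
for data in [data_train, data_test_a]:
    data['issueDate'] = pd.to_datetime(data['issueDate'], format='%Y-%m-%d')
    startdate = datetime.datetime.strptime('2007-06-01', '%Y-%m-%d')
    # Construct a time feature: number of days between issueDate and startdate
    data['issueDateDT'] = data['issueDate'].apply(lambda x: x - startdate).dt.days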
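The same day-count feature can also be built without apply, by subtracting the start date from the whole column at once. The sketch below is only an illustration: the three-row DataFrame is a hypothetical stand-in for data_train, and only the issueDate column name comes from the source code above.

```python
import pandas as pd

# Hypothetical stand-in for data_train; only the issueDate column matters here.
data = pd.DataFrame({"issueDate": ["2007-06-01", "2007-07-01", "2014-01-01"]})

data["issueDate"] = pd.to_datetime(data["issueDate"], format="%Y-%m-%d")
startdate = pd.Timestamp("2007-06-01")

# Subtracting a Timestamp from a datetime column yields a timedelta column;
# .dt.days extracts the whole number of days from each timedelta.
data["issueDateDT"] = (data["issueDate"] - startdate).dt.days

issue_days = data["issueDateDT"].tolist()
print(issue_days)  # [0, 30, 2406]
```

The column-wise subtraction gives the same values as the apply-based version and is usually faster on large frames.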
pandas.to_datetime
pandas.to_datetime(arg, errors='raise', utc=None, format=None, unit=None)
errors: {'ignore', 'raise', 'coerce'}, default 'raise'. If 'raise', invalid parsing raises an exception. If 'coerce', invalid parsing is set to NaT. If 'ignore', invalid parsing returns the input.
utc: bool, default None. If True, return UTC (Coordinated Universal Time).
format: str, default None. The strftime format used to parse the time, for example "%d/%m/%Y". Note that "%f" parses all the way down to nanoseconds.
unit: str, default 'ns'. The unit of arg (D, s, ms, us, ns) when arg is an integer or floating-point number. The value is interpreted relative to the origin. For example, with unit='ms' and origin='unix' (the default), arg is taken as the number of milliseconds since the start of the Unix epoch.
Source: official documentation, pandas.to_datetime
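To make the errors and unit parameters concrete, here is a small sketch on made-up input. The sample strings and variable names are assumptions chosen for illustration, not data from the competition.

```python
import pandas as pd

# Made-up input: one valid date, one non-date string, one impossible date.
raw = pd.Series(["2020-01-15", "not a date", "2020-02-30"])

# errors='coerce' turns values that cannot be parsed into NaT instead of raising.
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")
n_invalid = int(parsed.isna().sum())

# errors='raise' (the default) raises an exception on invalid input.
try:
    pd.to_datetime(raw, format="%Y-%m-%d", errors="raise")
    raised = False
except ValueError:
    raised = True

# unit='ms' interprets a number as milliseconds since the Unix epoch.
from_ms = pd.to_datetime(86_400_000, unit="ms")

print(parsed)
print(n_invalid, raised, from_ms)
```

Using 'coerce' is convenient when dirty rows should become missing values that later steps can count or fill.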
datetime.strptime [1]
The function that converts a string to a date is datetime.datetime.strptime()
The function that converts a date to a string is datetime.datetime.strftime()
Both functions involve formatted strings of date and time, which are listed as follows:
%a  Abbreviated weekday name (Wednesday is Wed)
%A  Full weekday name (Wednesday is Wednesday)
%b  Abbreviated month name (April is Apr)
%B  Full month name (April is April)
%c  Standard date and time string (for example: 04/07/10 10:43:39)
%C  Century (the year divided by 100, as two digits)
%d  Day of the month as a decimal number
%D  month/day/year
%e  Day of the month as a decimal number, in a two-character field
%F  year-month-day
%g  Last two digits of the year, using the week-based year
%G  Year, using the week-based year
%h  Abbreviated month name
%H  Hour on the 24-hour clock
%I  Hour on the 12-hour clock
%j  Day of the year as a decimal number
%m  Month as a decimal number
%M  Minute as a decimal number
%n  Newline
%p  Locale's equivalent of AM or PM
%r  Time on the 12-hour clock
%R  Hours and minutes: hh:mm
%S  Second as a decimal number
%t  Horizontal tab
%T  Hours, minutes and seconds: hh:mm:ss
%u  Day of the week, with Monday as the first day (values from 1 to 7, Monday is 1)
%U  Week number of the year, with Sunday as the first day of the week (values from 0 to 53)
%V  Week number of the year, using the week-based year
%w  Day of the week as a decimal number (values from 0 to 6, Sunday is 0)
%W  Week number of the year, with Monday as the first day of the week (values from 0 to 53)
%x  Standard date string
%X  Standard time string
%y  Year without century as a decimal number (values from 00 to 99)
%Y  Year with century as a decimal number
%z  UTC offset
%Z  Time zone name; if the time zone name cannot be obtained, an empty string is returned
%%  Percent sign
Several of these codes come from the C library's strftime and depend on the platform, so not every code is guaranteed to work in Python on every system.
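As a quick check of how a few of these codes behave, the following sketch parses a string with strptime and formats it back with strftime. The timestamp is an arbitrary example value, and only locale-independent codes are used so the results do not depend on system settings.

```python
import datetime

# Parse a string into a datetime object (string -> date).
parsed = datetime.datetime.strptime("2007-06-01 14:05:09", "%Y-%m-%d %H:%M:%S")

# Format the datetime back into strings (date -> string).
date_only = parsed.strftime("%Y-%m-%d")    # '2007-06-01'
clock_12h = parsed.strftime("%I:%M:%S")    # '02:05:09'
day_of_year = parsed.strftime("%j")        # '152'
weekday_sunday0 = parsed.strftime("%w")    # '5' (Friday, with Sunday as 0)

print(date_only, clock_12h, day_of_year, weekday_sunday0)
```

Note that strftime always returns strings, so numeric fields such as the day of the year need an int() conversion before they are used as numbers.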
Outlier handling
pandas.DataFrame.groupby
The groupby function splits the data into groups according to the distinct values of a selected column and then applies an operation to each group.
Example:
import pandas as pd
import numpy as np

df = pd.DataFrame({'key1': list('aabba'),
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})
print(df)
print('*' * 30)
# data1 holds distinct random floats, so each group here contains exactly one row
print(df.groupby('data1').min())
Output:
  key1 key2     data1     data2
0    a  one -0.986048 -0.852297
1    a  two -1.644016  1.083959
2    b  one -0.428630  0.997801
3    b  two -0.146261 -0.156321
4    a  one -0.806370 -0.848416
******************************
          key1 key2     data2
data1
-1.644016    a  two  1.083959
-0.986048    a  one -0.852297
-0.806370    a  one -0.848416
-0.428630    b  one  0.997801
-0.146261    b  two -0.156321
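Grouping on a random float column produces one group per row, so the effect of groupby is easier to see when grouping on a categorical column. The sketch below reuses the key1 and key2 columns from the example above but replaces the random numbers with fixed values (an assumption, made so that the results are reproducible).

```python
import pandas as pd

df = pd.DataFrame({
    "key1": list("aabba"),
    "key2": ["one", "two", "one", "two", "one"],
    "data1": [1.0, 2.0, 3.0, 4.0, 5.0],       # fixed values instead of random ones
    "data2": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# One row per distinct key1 value, holding the minimum of each numeric column.
group_min = df.groupby("key1")[["data1", "data2"]].min()

# Grouping by two columns gives one entry per (key1, key2) combination.
group_mean = df.groupby(["key1", "key2"])["data1"].mean()

print(group_min)
print(group_mean)
```

Here group "a" collects rows 0, 1 and 4, and group "b" collects rows 2 and 3, so each aggregate summarizes several rows rather than a single one.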
Feature selection
pandas.DataFrame.corrwith [2]
This function computes the pairwise correlation between the rows or columns of a DataFrame and the rows or columns of another DataFrame or Series.
DataFrame.corrwith(other, axis=0, drop=False)
other: DataFrame, Series. Object with which to compute correlations.
axis: {0 or 'index', 1 or 'columns'}, default 0. 0 or 'index' to compute column-wise, 1 or 'columns' for row-wise.
drop: bool, default False. Drop missing indices from the result; by default the union of all indices is returned.
In other words, axis=0 or axis='index' computes the correlation between columns, and axis=1 or axis='columns' computes the correlation between rows.
Example:
import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.random.randn(4, 4))
print(df1)
df2 = pd.DataFrame(np.random.randn(4, 4))
print(df2)
print(df1.corrwith(df2, axis=0))  # column-wise correlation
print(df1.corrwith(df2, axis=1))  # row-wise correlation
Output:
          0         1         2         3
0 -2.117707  1.077943  0.612401  0.581080
1  0.245959 -0.401353  0.333307 -0.589932
2 -0.886114 -0.165022  0.019672 -0.917109
3  1.041763  1.171818 -0.350419  2.252435
          0         1         2         3
0  0.235729 -1.033179 -1.470501  0.194247
1 -0.821702 -1.017748  1.337973 -0.242012
2  0.809055  2.193382  1.408613 -1.317768
3  0.533227 -0.940242  1.722331 -0.201507
0   -0.182795
1   -0.467261
2   -0.805739
3    0.549257
dtype: float64
0   -0.614108
1    0.486213
2    0.760603
3   -0.756266
dtype: float64
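Since other may also be a Series, one way corrwith can support feature selection is to correlate every feature column with a target column and rank the features by absolute correlation. The following is a minimal sketch under that assumption; the column names and values are hypothetical and are not taken from the competition data.

```python
import pandas as pd

# Hypothetical features: one rises with the target, one falls, one is unrelated.
features = pd.DataFrame({
    "x_up": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x_down": [10.0, 8.0, 6.0, 4.0, 2.0],
    "x_mixed": [1.0, -1.0, 0.0, -1.0, 1.0],
})
target = pd.Series([2.0, 4.0, 6.0, 8.0, 10.0], name="target")

# Pearson correlation of each feature column with the target Series.
corr_with_target = features.corrwith(target)

# Rank features by the strength of the linear relationship, ignoring its sign.
ranked = corr_with_target.abs().sort_values(ascending=False)

print(corr_with_target)
print(ranked)
```

A ranking like this only captures linear relationships, so it is better treated as a first filter than as a final selection rule.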
The difference between corrwith() and corr():
corrwith() correlates the rows or columns of one DataFrame with the rows or columns of the same name in another object, while corr() computes the pairwise correlations among the columns of a single DataFrame.
For example:
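df3 = pd.DataFrame(np.random.randn(3, 2), columns=list('ab'))
df4 = pd.DataFrame(np.random.randn(3, 2), columns=list('ac'))
df3.corr()  # pairwise correlations within df3 (the result is not printed)
print(df4)
print(df3.corrwith(df4, axis=0))  # only column 'a' exists in both frames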
Output:
          a         c
0 -1.354717  0.343711
1 -0.321272 -0.348672
2  0.499325  1.685045
a    0.980936
b         NaN
c         NaN
dtype: float64
If you want pandas to ignore column names and compare the first column of df1 with the first column of df2, you can rename the columns of df2 to match the columns of df1 as follows:
df1.corrwith(df2.set_axis(df1.columns, axis='columns', inplace=False))
(In pandas 2.0 and later, set_axis no longer accepts the inplace argument, so omit it there.)
Output:
a   -0.510442
b    0.955783
dtype: float64
Note that in this case, df1 and df2 need to have the same number of columns.
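The contrast between matching by label and matching by position can be sketched with two small frames. The frame names and values below are made up for illustration, and set_axis is called without inplace so that the call also works on recent pandas versions.

```python
import pandas as pd

df_ab = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [1.0, 0.0, 1.0]})
df_ac = pd.DataFrame({"a": [2.0, 4.0, 6.0], "c": [3.0, 2.0, 3.0]})

# Matching by label: only column 'a' exists in both frames,
# so 'b' and 'c' come back as NaN.
by_label = df_ab.corrwith(df_ac)

# Matching by position: relabel df_ac's columns with df_ab's labels first,
# so the first columns are paired and the second columns are paired.
by_position = df_ab.corrwith(df_ac.set_axis(df_ab.columns, axis="columns"))

print(by_label)
print(by_position)
```

The relabeled copy leaves df_ac itself unchanged, because set_axis returns a new object here.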
Summary:
Feature engineering is the most important part of machine learning, and even of deep learning, and it often takes the most time in practical applications. For a beginner like me, the code and models in Task 3 are hard to read and understand; they cover many knowledge points that need to be worked through and summarized step by step.
[1] strftime/strptime functions in the datetime module in Python
[2] pandas corr() versus corrwith()