Used car transaction price: Task01 & Task02

Posted by jola on Thu, 07 Oct 2021 06:38:49 +0200

I recently joined the Coggle Data Science "30 Days of ML" learning activity and am recording my notes here. The activity is completely free and feels like a good opportunity to improve my skills.

[Image: promotional poster for the event]


All right, let's get to the point

Import packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

Task01: Read in the data

Method 1: read with the default separator, then split manually

Train_Data = pd.read_csv("D:/competition/Heaven Pool/Used car transaction price forecast/used_car_train_20200313/used_car_train_20200313.csv")
Test_Data = pd.read_csv("D:/competition/Heaven Pool/Used car transaction price forecast/used_car_testB_20200421/used_car_testB_20200421.csv")
Train_Data.head()
   SaleID name regDate model brand bodyType fuelType gearbox power kilometer notRepairedDamage regionCode seller offerType creatDate price v_0 v_1 v_2 v_3 v_4 v_5 v_6 v_7 v_8 v_9 v_10 v_11 v_12 v_13 v_14
0  0 736 20040402 30.0 6 1.0 0.0 0.0 60 12.5 0.0 ...
1  1 2262 20030301 40.0 1 2.0 0.0 0.0 0 15.0 - 43...
2  2 14874 20040403 115.0 15 1.0 0.0 0.0 163 12.5...
3  3 71865 19960908 109.0 10 0.0 0.0 1.0 193 15.0...
4  4 111080 20120103 110.0 5 1.0 0.0 0.0 68 5.0 0...
Test_Data.head()
   SaleID name regDate model brand bodyType fuelType gearbox power kilometer notRepairedDamage regionCode seller offerType creatDate v_0 v_1 v_2 v_3 v_4 v_5 v_6 v_7 v_8 v_9 v_10 v_11 v_12 v_13 v_14
0  200000 133777 20000501 67.0 0 1.0 0.0 0.0 101 ...
1  200001 61206 19950211 19.0 6 2.0 0.0 0.0 73 6....
2  200002 67829 20090606 5.0 5 4.0 0.0 0.0 120 5....
3  200003 8892 20020601 22.0 9 1.0 0.0 0.0 58 15....
4  200004 76998 20030301 46.0 6 0.0 0.0 116 15.0...
Train_Data.shape
(150000, 1)
Test_Data.shape
(50000, 1)
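# The file is space-separated, so the default read puts everything into one column.
# Split that column on spaces so each value lines up with its column name.
# Note: every resulting column holds strings, and an empty field becomes '' rather than NaN.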
columns_name = Train_Data.columns.str.split()[0]
Train_Data = Train_Data.iloc[:,0].str.split(' ',expand=True)
Train_Data.columns = columns_name

columns_name = Test_Data.columns.str.split()[0]
Test_Data = Test_Data.iloc[:,0].str.split(' ',expand=True)
Test_Data.columns = columns_name
Train_Data.head()
   SaleID    name   regDate  model  brand  bodyType  fuelType  gearbox  power  kilometer  ...                   v_5                   v_6                   v_7                   v_8                   v_9                  v_10                  v_11                  v_12                  v_13                  v_14
0       0     736  20040402   30.0      6       1.0       0.0      0.0     60       12.5  ...   0.23567590669911015   0.10198824077953883     0.129548661418789   0.02281636740006269   0.09746182870576199   -2.8818032385553165    2.8040967707208506   -2.4208207926122784    0.7952919433118377    0.9147624995703408
1       1    2262  20030301   40.0      1       2.0       0.0      0.0      0       15.0  ...    0.2647772555037097   0.12100359404116512    0.1357307068829055  0.026597448118262774  0.020581662632484482   -4.9004818817666775    2.0963376444273414   -1.0304828371563102   -1.7226737753851349    0.2455224109670493
2       2   14874  20040403  115.0     15       1.0       0.0      0.0    163       12.5  ...   0.25141014780875875   0.11491227654046415   0.16514749334496415   0.06217283730726245   0.02707482416830506    -4.846749260269903     1.803558941229932    1.5653296250457633   -0.8326873267265079  -0.22996285613259074
3       3   71865  19960908  109.0     10       0.0       0.0      1.0    193       15.0  ...    0.2742931709082824   0.11030008468643802   0.12196374573186793  0.033394547122199615                   0.0   -4.5095988235247955    1.2859397444845837   -0.5018679084368517   -2.4383527366881763   -0.4786993792688288
4       4  111080  20120103  110.0      5       1.0       0.0      0.0     68        5.0  ...    0.2280356217997828    0.0732050535564685   0.09188047928262777   0.07881938473498606   0.12153424142524565   -1.8962402786050725    0.9107831337379366    0.9311095588151709    2.8345178203938377    1.9234819632780635

5 rows × 31 columns

Test_Data.head()
   SaleID    name   regDate  model  brand  bodyType  fuelType  gearbox  power  kilometer  ...                     v_5                     v_6                     v_7                     v_8                     v_9                    v_10                    v_11                    v_12                    v_13                    v_14
0  200000  133777  20000501   67.0      0       1.0       0.0      0.0    101       15.0  ...     0.23651953807463394  0.00024077965516603838     0.10531899027928028     0.04623333858358547     0.09452231271103513       3.619512125855613     -0.2806066537539741      -2.019761143295854      0.9788277260800712      0.8033215020878566
1  200001   61206  19950211   19.0      6       2.0       0.0      0.0     73        6.0  ...     0.26151841976421497                     0.0     0.12032345361861815      0.0467842378305852    0.035385262671275085      2.9973763596922285     -1.4067050523440334     -1.0208835817916766     -1.3499898633435856    -0.20054163936348302
2  200002   67829  20090606    5.0      5       4.0       0.0      0.0    120        5.0  ...     0.26169061811955624     0.09083648656092408                     0.0     0.07965532570737711      0.0735862207476284      -3.951083771010004     -0.4334673285213749      0.9189638428560336      1.6346039890078308       1.027172758680927
3  200003    8892  20020601   22.0      9       1.0       0.0      0.0     58       15.0  ...      0.2360495075063573     0.10177689524069478     0.09894989511106032    0.026829627502826414     0.09661365556957097     -2.8467877718832733         2.8002670817288     -2.5246103235495783      1.0768192298469703      0.4616102367935517
4  200004   76998  20030301   46.0      6       0.0                0.0    116       15.0  ...      0.2569995358233025                     0.0     0.06673175734217886     0.05777117201467578     0.06885246978026283       2.839010006118193     -1.6598006754576482     -0.9241417494176124     0.19942261240833112      0.4510139980592859

5 rows × 30 columns
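
One caveat with Method 1: after `str.split(' ', expand=True)` every column still holds strings, and a missing field shows up as an empty string instead of NaN. The following is a minimal sketch of one way to restore numeric dtypes. It runs on a tiny made-up frame; `raw`, `split`, `to_numeric_frame` and `clean` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Hypothetical miniature of the single-column frame that Method 1 starts from.
raw = pd.DataFrame(
    {"SaleID fuelType notRepairedDamage": ["0 0.0 0.0", "1  -", "2 1.0 1.0"]}
)

# Same split as in Method 1.
columns_name = raw.columns.str.split()[0]
split = raw.iloc[:, 0].str.split(" ", expand=True)
split.columns = columns_name


def to_numeric_frame(df):
    """Convert every column to a numeric dtype; unparseable cells become NaN."""
    # errors="coerce" turns '' (empty field) and '-' into NaN instead of raising.
    return df.apply(pd.to_numeric, errors="coerce")


clean = to_numeric_frame(split)
print(clean.dtypes)
print(clean)
```

Coercing everything is a blunt choice: it silently hides any unexpected text, so it is worth checking the NaN counts afterwards.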

Method 2: pass sep=' ' to read_csv

Train_Data = pd.read_csv("D:/competition/Heaven Pool/Used car transaction price forecast/used_car_train_20200313/used_car_train_20200313.csv", sep=' ')
Test_Data = pd.read_csv("D:/competition/Heaven Pool/Used car transaction price forecast/used_car_testB_20200421/used_car_testB_20200421.csv",sep=' ')
Train_Data.head()
   SaleID    name   regDate  model  brand  bodyType  fuelType  gearbox  power  kilometer  ...        v_5        v_6        v_7        v_8        v_9       v_10       v_11       v_12       v_13       v_14
0       0     736  20040402   30.0      6       1.0       0.0      0.0     60       12.5  ...   0.235676   0.101988   0.129549   0.022816   0.097462  -2.881803   2.804097  -2.420821   0.795292   0.914762
1       1    2262  20030301   40.0      1       2.0       0.0      0.0      0       15.0  ...   0.264777   0.121004   0.135731   0.026597   0.020582  -4.900482   2.096338  -1.030483  -1.722674   0.245522
2       2   14874  20040403  115.0     15       1.0       0.0      0.0    163       12.5  ...   0.251410   0.114912   0.165147   0.062173   0.027075  -4.846749   1.803559   1.565330  -0.832687  -0.229963
3       3   71865  19960908  109.0     10       0.0       0.0      1.0    193       15.0  ...   0.274293   0.110300   0.121964   0.033395   0.000000  -4.509599   1.285940  -0.501868  -2.438353  -0.478699
4       4  111080  20120103  110.0      5       1.0       0.0      0.0     68        5.0  ...   0.228036   0.073205   0.091880   0.078819   0.121534  -1.896240   0.910783   0.931110   2.834518   1.923482

5 rows × 31 columns

Test_Data.head()
   SaleID    name   regDate  model  brand  bodyType  fuelType  gearbox  power  kilometer  ...        v_5        v_6        v_7        v_8        v_9       v_10       v_11       v_12       v_13       v_14
0  200000  133777  20000501   67.0      0       1.0       0.0      0.0    101       15.0  ...   0.236520   0.000241   0.105319   0.046233   0.094522   3.619512  -0.280607  -2.019761   0.978828   0.803322
1  200001   61206  19950211   19.0      6       2.0       0.0      0.0     73        6.0  ...   0.261518   0.000000   0.120323   0.046784   0.035385   2.997376  -1.406705  -1.020884  -1.349990  -0.200542
2  200002   67829  20090606    5.0      5       4.0       0.0      0.0    120        5.0  ...   0.261691   0.090836   0.000000   0.079655   0.073586  -3.951084  -0.433467   0.918964   1.634604   1.027173
3  200003    8892  20020601   22.0      9       1.0       0.0      0.0     58       15.0  ...   0.236050   0.101777   0.098950   0.026830   0.096614  -2.846788   2.800267  -2.524610   1.076819   0.461610
4  200004   76998  20030301   46.0      6       0.0       NaN      0.0    116       15.0  ...   0.257000   0.000000   0.066732   0.057771   0.068852   2.839010  -1.659801  -0.924142   0.199423   0.451014

5 rows × 30 columns
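
The difference between the two methods can be reproduced without the competition files. This sketch feeds a made-up three-row sample (an assumption, not real data) through `pd.read_csv` with and without `sep=' '`:

```python
import io

import pandas as pd

# Hypothetical sample in the same space-separated layout; row 1 has an empty fuelType.
sample = "SaleID name fuelType\n0 736 0.0\n1 2262 \n2 14874 1.0\n"

one_col = pd.read_csv(io.StringIO(sample))          # default sep=',' (start of Method 1)
typed = pd.read_csv(io.StringIO(sample), sep=" ")   # Method 2

print(one_col.shape)  # a single wide column
print(typed.shape)    # one column per field
print(typed.dtypes)   # numeric dtypes, with the empty field parsed as NaN
```

With `sep=' '` pandas infers the dtypes and missing values itself, which is why Method 2 needs no manual clean-up.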

Task02: Data analysis

Analyze the value, range and type of each field

Train_Data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 31 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   SaleID             150000 non-null  int64  
 1   name               150000 non-null  int64  
 2   regDate            150000 non-null  int64  
 3   model              149999 non-null  float64
 4   brand              150000 non-null  int64  
 5   bodyType           145494 non-null  float64
 6   fuelType           141320 non-null  float64
 7   gearbox            144019 non-null  float64
 8   power              150000 non-null  int64  
 9   kilometer          150000 non-null  float64
 10  notRepairedDamage  150000 non-null  object 
 11  regionCode         150000 non-null  int64  
 12  seller             150000 non-null  int64  
 13  offerType          150000 non-null  int64  
 14  creatDate          150000 non-null  int64  
 15  price              150000 non-null  int64  
 16  v_0                150000 non-null  float64
 17  v_1                150000 non-null  float64
 18  v_2                150000 non-null  float64
 19  v_3                150000 non-null  float64
 20  v_4                150000 non-null  float64
 21  v_5                150000 non-null  float64
 22  v_6                150000 non-null  float64
 23  v_7                150000 non-null  float64
 24  v_8                150000 non-null  float64
 25  v_9                150000 non-null  float64
 26  v_10               150000 non-null  float64
 27  v_11               150000 non-null  float64
 28  v_12               150000 non-null  float64
 29  v_13               150000 non-null  float64
 30  v_14               150000 non-null  float64
dtypes: float64(20), int64(10), object(1)
memory usage: 35.5+ MB
Test_Data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 30 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   SaleID             50000 non-null  int64  
 1   name               50000 non-null  int64  
 2   regDate            50000 non-null  int64  
 3   model              50000 non-null  float64
 4   brand              50000 non-null  int64  
 5   bodyType           48496 non-null  float64
 6   fuelType           47076 non-null  float64
 7   gearbox            48032 non-null  float64
 8   power              50000 non-null  int64  
 9   kilometer          50000 non-null  float64
 10  notRepairedDamage  50000 non-null  object 
 11  regionCode         50000 non-null  int64  
 12  seller             50000 non-null  int64  
 13  offerType          50000 non-null  int64  
 14  creatDate          50000 non-null  int64  
 15  v_0                50000 non-null  float64
 16  v_1                50000 non-null  float64
 17  v_2                50000 non-null  float64
 18  v_3                50000 non-null  float64
 19  v_4                50000 non-null  float64
 20  v_5                50000 non-null  float64
 21  v_6                50000 non-null  float64
 22  v_7                50000 non-null  float64
 23  v_8                50000 non-null  float64
 24  v_9                50000 non-null  float64
 25  v_10               50000 non-null  float64
 26  v_11               50000 non-null  float64
 27  v_12               50000 non-null  float64
 28  v_13               50000 non-null  float64
 29  v_14               50000 non-null  float64
dtypes: float64(20), int64(9), object(1)
memory usage: 11.4+ MB
Train_Data.describe().iloc[:,:15]
              SaleID           name        regDate          model          brand       bodyType       fuelType        gearbox          power      kilometer     regionCode         seller      offerType      creatDate          price
count  150000.000000  150000.000000   1.500000e+05  149999.000000  150000.000000  145494.000000  141320.000000  144019.000000  150000.000000  150000.000000  150000.000000  150000.000000       150000.0   1.500000e+05  150000.000000
mean    74999.500000   68349.172873   2.003417e+07      47.129021       8.052733       1.792369       0.375842       0.224943     119.316547      12.597160    2583.077267       0.000007            0.0   2.016033e+07    5923.327333
std     43301.414527   61103.875095   5.364988e+04      49.536040       7.864956       1.760640       0.548677       0.417546     177.168419       3.919576    1885.363218       0.002582            0.0   1.067328e+02    7501.998477
min         0.000000       0.000000   1.991000e+07       0.000000       0.000000       0.000000       0.000000       0.000000       0.000000       0.500000       0.000000       0.000000            0.0   2.015062e+07      11.000000
25%     37499.750000   11156.000000   1.999091e+07      10.000000       1.000000       0.000000       0.000000       0.000000      75.000000      12.500000    1018.000000       0.000000            0.0   2.016031e+07    1300.000000
50%     74999.500000   51638.000000   2.003091e+07      30.000000       6.000000       1.000000       0.000000       0.000000     110.000000      15.000000    2196.000000       0.000000            0.0   2.016032e+07    3250.000000
75%    112499.250000  118841.250000   2.007111e+07      66.000000      13.000000       3.000000       1.000000       0.000000     150.000000      15.000000    3843.000000       0.000000            0.0   2.016033e+07    7700.000000
max    149999.000000  196812.000000   2.015121e+07     247.000000      39.000000       7.000000       6.000000       1.000000   19312.000000      15.000000    8120.000000       1.000000            0.0   2.016041e+07   99999.000000
Train_Data.describe().iloc[:,15:]
                 v_0            v_1            v_2            v_3            v_4            v_5            v_6            v_7            v_8            v_9           v_10           v_11           v_12           v_13           v_14
count  150000.000000  150000.000000  150000.000000  150000.000000  150000.000000  150000.000000  150000.000000  150000.000000  150000.000000  150000.000000  150000.000000  150000.000000  150000.000000  150000.000000  150000.000000
mean       44.406268      -0.044809       0.080765       0.078833       0.017875       0.248204       0.044923       0.124692       0.058144       0.061996      -0.001000       0.009035       0.004813       0.000313      -0.000688
std         2.457548       3.641893       2.929618       2.026514       1.193661       0.045804       0.051743       0.201410       0.029186       0.035692       3.772386       3.286071       2.517478       1.288988       1.038685
min        30.451976      -4.295589      -4.470671      -7.275037      -4.364565       0.000000       0.000000       0.000000       0.000000       0.000000      -9.168192      -5.558207      -9.639552      -4.153899      -6.546556
25%        43.135799      -3.192349      -0.970671      -1.462580      -0.921191       0.243615       0.000038       0.062474       0.035334       0.033930      -3.722303      -1.951543      -1.871846      -1.057789      -0.437034
50%        44.610266      -3.052671      -0.382947       0.099722      -0.075910       0.257798       0.000812       0.095866       0.057014       0.058484       1.624076      -0.358053      -0.130753      -0.036245       0.141246
75%        46.004721       4.000670       0.241335       1.565838       0.868758       0.265297       0.102009       0.125243       0.079382       0.087491       2.844357       1.255022       1.776933       0.942813       0.680378
max        52.304178       7.320308      19.035496       9.854702       6.829352       0.291838       0.151420       1.404936       0.160791       0.222787      12.357011      18.819042      13.847792      11.147669       8.658418
Test_Data.describe().iloc[:,:15]
              SaleID           name        regDate          model          brand       bodyType       fuelType        gearbox          power      kilometer     regionCode         seller      offerType      creatDate            v_0
count   50000.000000   50000.000000   5.000000e+04    50000.00000   50000.000000   48496.000000   47076.000000   48032.000000   50000.000000   50000.000000   50000.000000        50000.0        50000.0   5.000000e+04   50000.000000
mean   224999.500000   68505.606100   2.003401e+07       47.64948       8.087140       1.793736       0.376498       0.226953     119.766960      12.598260    2581.080680            0.0            0.0   2.016033e+07      44.400023
std     14433.901067   61032.124271   5.351615e+04       49.90741       7.899648       1.764970       0.549281       0.418866     206.313348       3.912519    1889.248559            0.0            0.0   1.113395e+02       2.459920
min    200000.000000       1.000000   1.991000e+07        0.00000       0.000000       0.000000       0.000000       0.000000       0.000000       0.500000       0.000000            0.0            0.0   2.014031e+07      31.122325
25%    212499.750000   11315.000000   1.999100e+07       11.00000       1.000000       0.000000       0.000000       0.000000      75.000000      12.500000    1006.000000            0.0            0.0   2.016031e+07      43.120935
50%    224999.500000   52215.000000   2.003091e+07       30.00000       6.000000       1.000000       0.000000       0.000000     110.000000      15.000000    2204.500000            0.0            0.0   2.016032e+07      44.601493
75%    237499.250000  118710.750000   2.007110e+07       66.00000      13.000000       3.000000       1.000000       0.000000     150.000000      15.000000    3842.000000            0.0            0.0   2.016033e+07      45.987018
max    249999.000000  196808.000000   2.015121e+07      246.00000      39.000000       7.000000       6.000000       1.000000   19211.000000      15.000000    8120.000000            0.0            0.0   2.016041e+07      51.676686
Test_Data.describe().iloc[:,15:]
                v_1           v_2           v_3           v_4           v_5           v_6           v_7           v_8           v_9          v_10          v_11          v_12          v_13          v_14
count  50000.000000  50000.000000  50000.000000  50000.000000  50000.000000  50000.000000  50000.000000  50000.000000  50000.000000  50000.000000  50000.000000  50000.000000  50000.000000  50000.000000
mean      -0.065525      0.079706      0.078381      0.022361      0.248147      0.044624      0.124693      0.058198      0.062113      0.019633      0.002759      0.004342      0.004570     -0.007209
std        3.636631      2.930829      2.019136      1.194215      0.045836      0.051664      0.201440      0.029171      0.035723      3.764095      3.289523      2.515912      1.287194      1.044718
min       -4.231855     -4.032142     -5.801254     -4.233626      0.000000      0.000000      0.000000      0.000000      0.000000     -9.119719     -5.662163     -8.291868     -4.157649     -6.098192
25%       -3.193169     -0.967832     -1.456793     -0.922153      0.243436      0.000035      0.062519      0.035413      0.033880     -3.675196     -1.963928     -1.865406     -1.048722     -0.440706
50%       -3.053506     -0.384910      0.118448     -0.068187      0.257818      0.000801      0.095880      0.056804      0.058749      1.632134     -0.375537     -0.138943     -0.036352      0.136849
75%        3.978703      0.239689      1.563490      0.871565      0.265263      0.101654      0.125470      0.079387      0.087624      2.846205      1.263451      1.775632      0.945239      0.685555
max        7.190759     18.865988      9.386558      4.959106      0.291176      0.153403      1.411559      0.157458      0.211304     12.177864     18.789496     13.384828      5.635374      2.649768

Every column is numeric except notRepairedDamage, which has object dtype. Let's look at its distinct values:

Train_Data['notRepairedDamage'].value_counts()
0.0    111361
-       24324
1.0     14315
Name: notRepairedDamage, dtype: int64
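# '-' marks a missing value; assign the result back instead of using inplace on a column selection
Train_Data['notRepairedDamage'] = Train_Data['notRepairedDamage'].replace('-', np.nan)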
Test_Data['notRepairedDamage'].value_counts()
0.0    37224
-       8069
1.0     4707
Name: notRepairedDamage, dtype: int64
# Same treatment for the test set
Test_Data['notRepairedDamage'] = Test_Data['notRepairedDamage'].replace('-', np.nan)
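
Replacing '-' with NaN leaves notRepairedDamage as an object column, because the remaining values are the strings '0.0' and '1.0'. A minimal sketch of also casting it to float, shown on a made-up four-row frame (`demo` is a hypothetical stand-in for Train_Data or Test_Data). Assigning the result back, rather than calling `replace(..., inplace=True)` on a column selection, is the safer pattern: as far as I know, newer pandas versions with copy-on-write may not propagate the in-place change to the frame.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the notRepairedDamage column as read from the CSV.
demo = pd.DataFrame({"notRepairedDamage": ["0.0", "-", "1.0", "0.0"]})

# Replace the placeholder, then cast the remaining numeric strings to float.
demo["notRepairedDamage"] = (
    demo["notRepairedDamage"].replace("-", np.nan).astype(float)
)

print(demo["notRepairedDamage"].dtype)
print(demo["notRepairedDamage"].value_counts(dropna=False))
```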

View missing values

missing = Train_Data.isnull().sum()
missing = missing[missing>0]
missing.sort_values(inplace=True)
missing.plot.bar()
<AxesSubplot:>

missing = Test_Data.isnull().sum()
missing = missing[missing>0]
missing.sort_values(inplace=True)
missing.plot.bar()
<AxesSubplot:>
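
Since the bar charts are not reproduced here, the same information can be read from a small table. This is a sketch only: `missing_summary` is a hypothetical helper and `demo` is made-up data standing in for Train_Data or Test_Data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame; the real Train_Data / Test_Data would be passed instead.
demo = pd.DataFrame({
    "bodyType": [1.0, np.nan, 2.0, 0.0],
    "fuelType": [np.nan, np.nan, 0.0, 1.0],
    "power": [60, 0, 163, 193],
})


def missing_summary(df):
    """Count and share of missing values, for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({"missing": counts, "ratio": counts / len(df)})
    return summary[summary["missing"] > 0].sort_values("missing")


print(missing_summary(demo))
```

The ratio column makes it easier to judge whether a column should be imputed or dropped than the raw counts alone.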

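Delete the near-constant columns. In the describe() output, seller is 0 for all but about one training row (mean 0.000007) and offerType is always 0, so neither column carries useful information.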

del Train_Data["seller"]
del Train_Data["offerType"]
del Test_Data["seller"]
del Test_Data["offerType"]
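
Near-constant columns can also be found programmatically rather than by reading describe(). A minimal sketch on made-up data: `dominant_share` is a hypothetical helper, and the 0.75 cut-off is only chosen to suit the five-row demo (a real data set would call for something much closer to 1).

```python
import pandas as pd

# Made-up frame imitating one near-constant, one constant and one balanced column.
demo = pd.DataFrame({
    "seller": [0, 0, 0, 0, 1],
    "offerType": [0, 0, 0, 0, 0],
    "gearbox": [0, 1, 0, 1, 0],
})


def dominant_share(df):
    """Share of rows taken by the most frequent value in each column."""
    return df.apply(lambda s: s.value_counts(normalize=True, dropna=False).iloc[0])


shares = dominant_share(demo)
threshold = 0.75  # illustrative cut-off for this tiny example
near_constant = shares[shares > threshold].index.tolist()
print(shares)
print(near_constant)
```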

Observe the skewness and kurtosis of the data

# numeric_only=True skips the object column notRepairedDamage
sns.distplot(Train_Data.skew(numeric_only=True),color='blue',axlabel ='Skewness')

sns.distplot(Test_Data.skew(numeric_only=True),color='blue',axlabel ='Skewness')

sns.distplot(Train_Data.kurt(numeric_only=True),color='orange',axlabel ='Kurtosis')

sns.distplot(Test_Data.kurt(numeric_only=True),color='orange',axlabel ='Kurtosis')

Some columns have very large skewness and kurtosis.

Look at the frequency distribution of the target value (price)

plt.hist(Train_Data['price'], orientation = 'vertical',histtype = 'bar', color ='red')
plt.show()

The histogram shows that very few prices exceed 20000. These can be treated as outliers and either capped or removed in a preprocessing step before modelling.
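
Both options (removing or capping the large prices) are one-liners in pandas. The sketch below uses made-up prices, and the 20000 threshold is simply the visual cut-off mentioned above, not a tuned value.

```python
import pandas as pd

# Made-up prices; the last one plays the role of an outlier.
prices = pd.Series([500, 1300, 3250, 7700, 99999], name="price")

threshold = 20000  # visual cut-off from the histogram; an assumption to tune

# Option 1: drop the rows above the threshold.
removed = prices[prices <= threshold]

# Option 2: keep every row but cap the value at the threshold.
capped = prices.clip(upper=threshold)

print(removed.tolist())
print(capped.tolist())
```

Capping keeps the row count unchanged, which matters if the other columns of those rows are still informative.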

After a log transformation the price distribution is much less skewed, so the model can be trained on log(price) instead of the raw price. This is a commonly used trick in regression problems.

plt.hist(np.log(Train_Data['price']), orientation = 'vertical',histtype = 'bar', color ='red') 
plt.show()
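
If the model is trained on the log of the price, its predictions have to be transformed back. A minimal round-trip sketch on made-up prices; note that it uses `np.log1p` / `np.expm1` rather than the plain `np.log` shown above, a substitution on my part that also tolerates zeros.

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed prices.
price = pd.Series([11, 1300, 3250, 7700, 99999], dtype=float)

log_price = np.log1p(price)     # what a model would be trained on
restored = np.expm1(log_price)  # how predictions are mapped back to prices

print("skew before:", price.skew())
print("skew after: ", log_price.skew())
```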

Split the features into categorical features and numeric features

numeric_features = ['power', 'kilometer', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13','v_14' ]

categorical_features = ['name', 'model', 'brand', 'bodyType', 'fuelType', 'gearbox', 'notRepairedDamage', 'regionCode',]

Categorical features

  1. Distribution of unique values (nunique)

for cat_fea in categorical_features:
    print("Distribution of feature {}:".format(cat_fea))
    print("{} has {} distinct values".format(cat_fea, Train_Data[cat_fea].nunique()))
    print(Train_Data[cat_fea].value_counts())
Distribution of feature name:
name has 99662 distinct values
708       282
387       282
55        280
1541      263
203       233
         ... 
119983      1
63443       1
104410      1
154956      1
177672      1
Name: name, Length: 99662, dtype: int64
Distribution of feature model:
model has 248 distinct values
0.0      11762
19.0      9573
4.0       8445
1.0       6038
29.0      5186
         ...  
240.0        2
209.0        2
245.0        2
242.0        2
247.0        1
Name: model, Length: 248, dtype: int64
Distribution of feature brand:
brand has 40 distinct values
0     31480
4     16737
14    16089
10    14249
1     13794
6     10217
9      7306
5      4665
13     3817
11     2945
3      2461
7      2361
16     2223
8      2077
25     2064
27     2053
21     1547
15     1458
19     1388
20     1236
12     1109
22     1085
26      966
30      940
17      913
24      772
28      649
32      592
29      406
37      333
2       321
31      318
18      316
36      228
34      227
33      218
23      186
35      180
38       65
39        9
Name: brand, dtype: int64
Distribution of feature bodyType:
bodyType has 8 distinct values
0.0    41420
1.0    35272
2.0    30324
3.0    13491
4.0     9609
5.0     7607
6.0     6482
7.0     1289
Name: bodyType, dtype: int64
Distribution of feature fuelType:
fuelType has 7 distinct values
0.0    91656
1.0    46991
2.0     2212
3.0      262
4.0      118
5.0       45
6.0       36
Name: fuelType, dtype: int64
Distribution of feature gearbox:
gearbox has 2 distinct values
0.0    111623
1.0     32396
Name: gearbox, dtype: int64
Distribution of feature notRepairedDamage:
notRepairedDamage has 3 distinct values
0.0    111361
-       24324
1.0     14315
Name: notRepairedDamage, dtype: int64
Distribution of feature regionCode:
regionCode has 7905 distinct values
419     369
764     258
125     137
176     136
462     134
       ... 
6377      1
7994      1
7973      1
7975      1
8117      1
Name: regionCode, Length: 7905, dtype: int64

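Visualization

def count_plot(x,  **kwargs):
    # Count plot for one facet; rotate the tick labels so they stay readable
    sns.countplot(x=x)
    plt.xticks(rotation=90)

f = pd.melt(Train_Data,  value_vars=categorical_features)
# FacetGrid's `size` argument was renamed to `height` in seaborn 0.9
g = sns.FacetGrid(f, col="variable",  col_wrap=2, sharex=False, sharey=False, height=5)
g = g.map(count_plot, "value")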

for cat_fea in categorical_features:
    print("Distribution of feature {}:".format(cat_fea))
    print("{} has {} distinct values".format(cat_fea, Test_Data[cat_fea].nunique()))
    print(Test_Data[cat_fea].value_counts())
Distribution of feature name:
name has 37536 distinct values
387       94
55        93
1541      86
708       85
203       78
          ..
69206      1
125326     1
82297      1
168470     1
78202      1
Name: name, Length: 37536, dtype: int64
Distribution of feature model:
model has 245 distinct values
0.0      3772
19.0     3226
4.0      2790
1.0      1981
29.0     1778
         ... 
209.0       2
229.0       2
241.0       1
242.0       1
244.0       1
Name: model, Length: 245, dtype: int64
Distribution of feature brand:
brand has 40 distinct values
0     10473
4      5532
14     5345
10     4713
1      4627
6      3500
9      2360
5      1485
13     1386
11      942
3       820
16      770
25      728
7       727
8       708
27      623
21      543
15      476
19      473
20      411
12      399
22      358
26      328
30      321
17      312
24      248
28      216
32      183
29      139
37      117
2       115
31      113
18      107
33       84
35       75
34       75
36       72
23       60
38       31
39        5
Name: brand, dtype: int64
Distribution of feature bodyType:
bodyType has 8 distinct values
0.0    13765
1.0    11960
2.0     9886
3.0     4491
4.0     3258
5.0     2494
6.0     2212
7.0      430
Name: bodyType, dtype: int64
Distribution of feature fuelType:
fuelType has 7 distinct values
0.0    30489
1.0    15708
2.0      736
3.0       78
4.0       31
5.0       18
6.0       16
Name: fuelType, dtype: int64
Distribution of feature gearbox:
gearbox has 2 distinct values
0.0    37131
1.0    10901
Name: gearbox, dtype: int64
Distribution of feature notRepairedDamage:
notRepairedDamage has 2 distinct values
0.0    37224
1.0     4707
Name: notRepairedDamage, dtype: int64
Distribution of feature regionCode:
regionCode has 6998 distinct values
419     120
764      98
176      48
85       45
3304     45
       ... 
5365      1
6353      1
7077      1
1317      1
2214      1
Name: regionCode, Length: 6998, dtype: int64
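def count_plot(x,  **kwargs):
    # Count plot for one facet; rotate the tick labels so they stay readable
    sns.countplot(x=x)
    plt.xticks(rotation=90)

f = pd.melt(Test_Data,  value_vars=categorical_features)
# FacetGrid's `size` argument was renamed to `height` in seaborn 0.9
g = sns.FacetGrid(f, col="variable",  col_wrap=2, sharex=False, sharey=False, height=5)
g = g.map(count_plot, "value")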
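
The printed counts show that train and test do not always contain the same categories (model has 248 distinct values in train but 245 in test, for example). A sketch of how the two sets could be compared directly; `category_gap` is a hypothetical helper and the two small frames are made up.

```python
import pandas as pd

# Made-up stand-ins for Train_Data and Test_Data.
train = pd.DataFrame({"model": [0.0, 19.0, 247.0], "brand": [0, 4, 14]})
test = pd.DataFrame({"model": [0.0, 19.0, 4.0], "brand": [0, 4, 4]})


def category_gap(train_df, test_df, columns):
    """Per feature: distinct values in each set and values seen only in test."""
    rows = []
    for col in columns:
        train_vals = set(train_df[col].dropna().unique())
        test_vals = set(test_df[col].dropna().unique())
        rows.append({
            "feature": col,
            "train_nunique": len(train_vals),
            "test_nunique": len(test_vals),
            "only_in_test": len(test_vals - train_vals),
        })
    return pd.DataFrame(rows).set_index("feature")


gap = category_gap(train, test, ["model", "brand"])
print(gap)
```

Values that appear only in the test set are the ones to watch, since an encoder fitted on the training data has never seen them.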

  2. Box plots of the categorical features
pd.melt(Train_Data, id_vars=['price'], value_vars=categorical_features)
         price    variable   value
0         1850        name     736
1         3600        name    2262
2         6222        name   14874
3         2400        name   71865
4         5200        name  111080
...        ...         ...     ...
1199995   5900  regionCode    4576
1199996   9500  regionCode    2826
1199997   7500  regionCode    3302
1199998   4999  regionCode    1877
1199999   4700  regionCode     235

1200000 rows × 3 columns

# name and regionCode have too many sparse categories to plot, so only the denser categorical features are drawn here
categorical_features = ['model',
 'brand',
 'bodyType',
 'fuelType',
 'gearbox',
 'notRepairedDamage']
for c in categorical_features:
    Train_Data[c] = Train_Data[c].astype('category')
    if Train_Data[c].isnull().any():
        Train_Data[c] = Train_Data[c].cat.add_categories(['MISSING'])
        Train_Data[c] = Train_Data[c].fillna('MISSING')

def boxplot(x, y, **kwargs):
    sns.boxplot(x=x, y=y)
    x=plt.xticks(rotation=90)

f = pd.melt(Train_Data, id_vars=['price'], value_vars=categorical_features)
# FacetGrid's `size` argument was renamed to `height` in seaborn 0.9
g = sns.FacetGrid(f, col="variable",  col_wrap=2, sharex=False, sharey=False, height=5)
g = g.map(boxplot, "value", "price")

  3. Violin plots of the categorical features
## 3) Violin plots of the categorical features
catg_list = categorical_features
target = 'price'
for catg in catg_list :
    sns.violinplot(x=catg, y=target, data=Train_Data)
    plt.show()

  4. Bar charts of the categorical features
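
The original notes give no code for this step. As a rough sketch of what a bar chart per categorical feature could look like, the snippet below plots the mean price per category. The helpers `mean_price_by` and `plot_mean_price` are hypothetical, the data is made up, and using the mean as the bar height is my assumption.

```python
import pandas as pd

# Made-up stand-in for Train_Data with one categorical feature and the target.
demo = pd.DataFrame({
    "bodyType": [0.0, 0.0, 1.0, 1.0, 2.0],
    "price": [1000, 3000, 4000, 6000, 9000],
})


def mean_price_by(df, column, target="price"):
    """Mean target value per category, ready to be drawn as bars."""
    return df.groupby(column)[target].mean().sort_index()


def plot_mean_price(df, column, target="price"):
    """Draw the per-category means as a bar chart (needs matplotlib)."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for drawing

    ax = mean_price_by(df, column, target).plot.bar()
    ax.set_ylabel("mean " + target)
    plt.show()


means = mean_price_by(demo, "bodyType")
print(means)
```

Calling `plot_mean_price(demo, "bodyType")` draws the chart; on the real data it could be called once per entry of categorical_features.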

Numeric features

  1. Correlation analysis (correlation between each feature and the target, price)
numeric_features.append("price")
price_numeric = Train_Data[numeric_features]
correlation = price_numeric.corr()
print(correlation['price'].sort_values(ascending = False),'\n')
price        1.000000
v_12         0.692823
v_8          0.685798
v_0          0.628397
power        0.219834
v_5          0.164317
v_2          0.085322
v_6          0.068970
v_1          0.060914
v_14         0.035911
v_13        -0.013993
v_7         -0.053024
v_4         -0.147085
v_9         -0.206205
v_10        -0.246175
v_11        -0.275320
kilometer   -0.440519
v_3         -0.730946
Name: price, dtype: float64 

Next, view the correlations among the variables themselves as a heat map:

plt.title('Correlation of Numeric Features with Price',y=1,size=16)

sns.heatmap(correlation,square = True,  vmax=0.8)

del price_numeric['price']
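
A heat map is hard to read precisely, so it can help to list the strongly correlated pairs explicitly. A minimal sketch on made-up data: `strong_pairs` is a hypothetical helper and the 0.8 threshold is arbitrary.

```python
import numpy as np
import pandas as pd

# Made-up frame: a and b are almost collinear, c is unrelated.
demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.5],
    "c": [2.0, 5.0, 1.0, 4.0, 3.0],
})


def strong_pairs(df, threshold=0.8):
    """Feature pairs whose absolute Pearson correlation reaches the threshold."""
    corr = df.corr()
    # Keep the upper triangle only so each pair appears once (no self-pairs).
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs.abs() >= threshold].sort_values(key=abs, ascending=False)


pairs = strong_pairs(demo)
print(pairs)
```

On the real data the same function could be applied to `Train_Data[numeric_features]` to spot redundant v_* features.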
  2. Skewness and kurtosis of each feature


    Several variables have large skewness and kurtosis; power is the most extreme.
for col in numeric_features:
    print('{:15}'.format(col), 
          'Skewness: {:05.2f}'.format(Train_Data[col].skew()) , 
          '   ' ,
          'Kurtosis: {:06.2f}'.format(Train_Data[col].kurt())  
         )
power           Skewness: 65.86     Kurtosis: 5733.45
kilometer       Skewness: -1.53     Kurtosis: 001.14
v_0             Skewness: -1.32     Kurtosis: 003.99
v_1             Skewness: 00.36     Kurtosis: -01.75
v_2             Skewness: 04.84     Kurtosis: 023.86
v_3             Skewness: 00.11     Kurtosis: -00.42
v_4             Skewness: 00.37     Kurtosis: -00.20
v_5             Skewness: -4.74     Kurtosis: 022.93
v_6             Skewness: 00.37     Kurtosis: -01.74
v_7             Skewness: 05.13     Kurtosis: 025.85
v_8             Skewness: 00.20     Kurtosis: -00.64
v_9             Skewness: 00.42     Kurtosis: -00.32
v_10            Skewness: 00.03     Kurtosis: -00.58
v_11            Skewness: 03.03     Kurtosis: 012.57
v_12            Skewness: 00.37     Kurtosis: 000.27
v_13            Skewness: 00.27     Kurtosis: -00.44
v_14            Skewness: -1.19     Kurtosis: 002.39
price           Skewness: 03.35     Kurtosis: 019.00
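for col in numeric_features:
    # Test_Data has no 'price' column (it was appended to numeric_features above), so skip it
    if col == 'price':
        continue
    print('{:15}'.format(col), 
          'Skewness: {:05.2f}'.format(Test_Data[col].skew()) , 
          '   ' ,
          'Kurtosis: {:06.2f}'.format(Test_Data[col].kurt())  
         )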
power           Skewness: 60.02     Kurtosis: 4533.77
kilometer       Skewness: -1.52     Kurtosis: 001.13
v_0             Skewness: -1.31     Kurtosis: 003.98
v_1             Skewness: 00.37     Kurtosis: -01.74
v_2             Skewness: 04.84     Kurtosis: 023.85
v_3             Skewness: 00.09     Kurtosis: -00.44
v_4             Skewness: 00.38     Kurtosis: -00.22
v_5             Skewness: -4.73     Kurtosis: 022.87
v_6             Skewness: 00.38     Kurtosis: -01.73
v_7             Skewness: 05.13     Kurtosis: 025.83
v_8             Skewness: 00.22     Kurtosis: -00.62
v_9             Skewness: 00.42     Kurtosis: -00.33
v_10            Skewness: 00.02     Kurtosis: -00.56
v_11            Skewness: 03.02     Kurtosis: 012.48
v_12            Skewness: 00.38     Kurtosis: 000.32
v_13            Skewness: 00.26     Kurtosis: -00.49
v_14            Skewness: -1.21     Kurtosis: 002.40
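
The two printouts above are easier to compare side by side. This sketch builds one table from both data sets; `skew_kurt_table` is a hypothetical helper, the frames are made up, and columns missing from either set (such as price in the test set) are skipped.

```python
import pandas as pd

# Made-up stand-ins for Train_Data and Test_Data.
train = pd.DataFrame({
    "power": [60, 0, 163, 193, 68, 19312],
    "kilometer": [12.5, 15.0, 12.5, 15.0, 5.0, 15.0],
    "price": [1850, 3600, 6222, 2400, 5200, 4999],
})
test = pd.DataFrame({
    "power": [101, 73, 120, 58, 116, 150],
    "kilometer": [15.0, 6.0, 5.0, 15.0, 15.0, 12.5],
})


def skew_kurt_table(train_df, test_df, columns):
    """Skewness and kurtosis of the shared columns, train next to test."""
    cols = [c for c in columns if c in train_df.columns and c in test_df.columns]
    return pd.DataFrame({
        "train_skew": train_df[cols].skew(),
        "test_skew": test_df[cols].skew(),
        "train_kurt": train_df[cols].kurt(),
        "test_kurt": test_df[cols].kurt(),
    })


table = skew_kurt_table(train, test, ["power", "kilometer", "price"])
print(table)
```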
  3. Distribution of each numeric feature
    a) Look for abnormal distributions
    b) Check whether any variable is distributed differently in train and test
f = pd.melt(Train_Data, value_vars=numeric_features)
g = sns.FacetGrid(f, col="variable",  col_wrap=2, sharex=False, sharey=False)
g = g.map(sns.distplot, "value")

numeric_features.remove("price")
f = pd.melt(Test_Data, value_vars=numeric_features)
g = sns.FacetGrid(f, col="variable",  col_wrap=2, sharex=False, sharey=False)
g = g.map(sns.distplot, "value")
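
Point b) can also be checked numerically instead of by eye. One simple option, sketched here under the assumption that comparing a handful of quantiles is informative enough, is to put the train and test quantiles of a feature next to each other. `quantile_gap` is a hypothetical helper and the data is randomly generated.

```python
import numpy as np
import pandas as pd


def quantile_gap(train_s, test_s, qs=(0.05, 0.25, 0.5, 0.75, 0.95)):
    """Quantiles of one feature in train and test, plus their absolute difference."""
    q = list(qs)
    gap = pd.DataFrame({"train": train_s.quantile(q), "test": test_s.quantile(q)})
    gap["abs_diff"] = (gap["train"] - gap["test"]).abs()
    return gap


# Synthetic example: one test column matches the train distribution, one is shifted.
rng = np.random.default_rng(0)
train_col = pd.Series(rng.normal(0.0, 1.0, 1000))
same_col = pd.Series(rng.normal(0.0, 1.0, 1000))
shifted_col = pd.Series(rng.normal(2.0, 1.0, 1000))

print(quantile_gap(train_col, same_col))
print(quantile_gap(train_col, shifted_col))
```

Large differences across all quantiles, as in the shifted case, suggest the feature behaves differently in the two sets and deserves a closer look.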

Topics: Python Machine Learning