DataWhale - (hands on data analysis) - task02 (Section 1 of data cleaning and feature processing) - 202201

Posted by ss-mike on Thu, 13 Jan 2022 11:55:55 +0100

Hands on data analysis

Chapter 2: data cleaning and feature processing

import numpy as np
import pandas as pd
#Load the training data from train.csv
df = pd.read_csv('train.csv')
df.head(10)

Brief description of data cleaning

The data we get is usually not clean: it contains missing values and outliers that must be handled before any further analysis or modeling. Cleaning is therefore the first step after obtaining the data. In this chapter we cover missing values, duplicate values, string handling, and data type conversion, so that the data ends up in a form that can be analyzed or modeled.

2.1 observation and treatment of missing values

The data we get often has many missing values. For example, the output above shows NaN in the Cabin column. Do other columns also contain missing values, and how should we handle them?

2.1.1 task 1: missing value observation

(1) Check the number of missing values for each feature
(2) View the data in the Age, Cabin and Embarked columns
There are several ways to do each of these, so it is worth learning as many as you can

#Method 1
df.info()
#Method 2
df.isnull().sum()
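
Task (2) has no code above, so here is a minimal sketch of both tasks. It runs on a small made-up DataFrame named toy (an assumption standing in for train.csv) so that it works without the data file; the same calls apply to df.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for train.csv
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
    "Embarked": ["S", "C", "S", np.nan],
})

# Number of missing values per column
missing_counts = toy.isnull().sum()

# Share of missing values per column (the mean of a boolean mask)
missing_ratio = toy.isnull().mean()

# View only the columns of interest
subset = toy[["Age", "Cabin", "Embarked"]].head(3)

print(missing_counts)
print(subset)
```

The ratio is often more useful than the raw count when deciding whether a column such as Cabin is worth keeping at all.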

2.1.2 task 2: deal with missing values

(1) What are the ways to deal with missing values?

(2) Try handling the missing values in the Age column

(3) Try different methods to handle the missing values of the whole table directly

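#Comparing with None or np.nan never matches, because NaN is not equal to anything (itself included),
#so df[df['Age'] == None] and df[df['Age'] == np.nan] select no rows.
#Locate the missing values with isnull() and assign to the Age column only;
#df[df['Age'].isnull()] = 0 would overwrite every column of those rows with 0
df.loc[df['Age'].isnull(), 'Age'] = 0
#Drop rows that contain missing values
df.dropna().head(3)
#Fill missing values with 0
df.fillna(0).head(3)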
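
Equality comparisons such as == None or == np.nan do not find missing values, and 0 is rarely a sensible Age. The sketch below checks the first point on a made-up DataFrame (toy is an assumption, not the Titanic data) and shows two common alternatives: filling with the median and dropping rows only when a chosen column is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"Age": [22.0, np.nan, 26.0], "Fare": [7.25, 71.28, np.nan]})

# NaN never compares equal to anything, so these masks are all False
mask_eq_nan = toy["Age"] == np.nan
mask_eq_none = toy["Age"] == None  # noqa: E711 (deliberate, to show it fails)

# isnull() is the reliable way to locate missing values
mask_isnull = toy["Age"].isnull()

# Fill only the Age column, here with its median instead of 0
filled = toy.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].median())

# Drop rows only when Age is missing; other columns may still contain NaN
dropped = toy.dropna(subset=["Age"])

print(mask_eq_nan.sum(), mask_eq_none.sum(), mask_isnull.sum())
print(filled)
```

Whether the median, the mean, or a model-based estimate is the right fill value depends on the analysis; the median is used here only as an example.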

2.2 observation and treatment of repeated values

For one reason or another, the data may contain duplicate values. If it does, how should we handle them?

2.2.1 task 1: please check the duplicate values in the data

df[df.duplicated()]

2.2.2 task 2: process duplicate values

(1) What are the processing methods for duplicate values?

(2) Process duplicate values of our data

The more methods, the better

df = df.drop_duplicates()
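
As one more way to handle duplicates, drop_duplicates can be restricted to a subset of columns and told which occurrence to keep. A minimal sketch on made-up data (the toy DataFrame and its values are assumptions):

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 are exact duplicates
toy = pd.DataFrame({
    "Name": ["Ann", "Bob", "Ann", "Ann"],
    "Ticket": ["A1", "B2", "A1", "A9"],
})

# duplicated() marks every repeat of an earlier row
dup_mask = toy.duplicated()
n_dups = int(dup_mask.sum())

# Remove exact duplicate rows and renumber the index
deduped = toy.drop_duplicates().reset_index(drop=True)

# Treat rows as duplicates when Name alone repeats, keeping the last one
deduped_by_name = toy.drop_duplicates(subset=["Name"], keep="last")

print(n_dups)
print(deduped_by_name)
```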

2.2.3 task 3: save the previously cleaned data in csv format

df.to_csv('test_clear.csv')
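
By default to_csv also writes the row index, which comes back as an extra unnamed column when the file is read again. The sketch below shows the difference using in-memory buffers instead of real files (the toy data is an assumption); passing index=False is a common choice when the index carries no information.

```python
import io

import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"PassengerId": [1, 2], "Age": [22.0, 38.0]})

# Default: the index is written as the first, unnamed column
buf_with_index = io.StringIO()
toy.to_csv(buf_with_index)

# index=False: only the data columns are written
buf_without_index = io.StringIO()
toy.to_csv(buf_without_index, index=False)

reloaded_with = pd.read_csv(io.StringIO(buf_with_index.getvalue()))
reloaded_without = pd.read_csv(io.StringIO(buf_without_index.getvalue()))

print(list(reloaded_with.columns))
print(list(reloaded_without.columns))
```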

2.3 feature observation and treatment

By observing the features, we can roughly divide them into two categories:
Numerical features: Survived, Pclass, Age, SibSp, Parch, Fare. Among them, Survived, Pclass, SibSp and Parch are discrete numerical features, and Age and Fare are continuous numerical features.
Text type features: Name, Sex, Cabin, Embarked, Ticket, where Sex, Cabin, Embarked and Ticket are category type text features.

Numerical features can generally be directly used for model training, but sometimes continuous variables are discretized for the sake of model stability and robustness. Text features often need to be converted into numerical features before they can be used for modeling and analysis.

2.3.1 task 1: bin (discretize) Age

(1) What is binning?

(2) Divide the continuous variable Age into five equal-width age groups, represented by the category labels 1 to 5

(3) Divide the continuous variable Age into the five age groups (0,5] (5,15] (15,30] (30,50] (50,80], represented by the category labels 1 to 5

(4) Divide the continuous variable Age into five groups at the 10%, 30%, 50%, 70% and 90% quantiles, represented by the category labels 1 to 5

(5) Save the data obtained above in csv format

#Divide Age into five equal-width groups, labeled 1 to 5
df['AgeBand'] = pd.cut(df['Age'], 5,labels = [1,2,3,4,5])
#Divide Age into the groups (0,5] (5,15] (15,30] (30,50] (50,80], labeled 1 to 5
df['AgeBand'] = pd.cut(df['Age'],[0,5,15,30,50,80],labels = [1,2,3,4,5])
#Divide Age at the 10%, 30%, 50%, 70% and 90% quantiles, labeled 1 to 5
df['AgeBand'] = pd.qcut(df['Age'],[0,0.1,0.3,0.5,0.7,0.9],labels = [1,2,3,4,5])
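
To make the difference between the three binning calls concrete, the sketch below applies them to six made-up ages (the ages Series is an assumption). Note that with the quantile list ending at 0.9, any value above the 90% quantile falls outside the last bin and gets no label; ending the list at 1 would avoid that.

```python
import pandas as pd

# Hypothetical ages
ages = pd.Series([2, 10, 20, 40, 60, 80])

# Equal-width bins: the range 2..80 is split into 5 intervals of the same width
equal_width = pd.cut(ages, 5, labels=[1, 2, 3, 4, 5])

# Explicit edges: intervals are right-inclusive, so 80 belongs to (50, 80]
by_edges = pd.cut(ages, [0, 5, 15, 30, 50, 80], labels=[1, 2, 3, 4, 5])

# Quantile edges: bin boundaries are computed from the data itself
by_quantile = pd.qcut(ages, [0, 0.1, 0.3, 0.5, 0.7, 0.9], labels=[1, 2, 3, 4, 5])

print(equal_width.tolist())
print(by_edges.tolist())
print(by_quantile.tolist())
```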

2.3.2 task 2: convert text variables

(1) View the names and categories of the text variables
(2) Represent the text variables Sex, Cabin and Embarked with the numerical labels 1 to 5
(3) Represent the text variables Sex, Cabin and Embarked with one-hot encoding

The more methods, the better

#View category text variable name and category

#Method 1: value_counts
df['Sex'].value_counts()
df['Cabin'].value_counts()
df['Embarked'].value_counts()
#Method 2: unique
df['Sex'].unique()
df['Sex'].nunique()
#Convert category text to 12345

#Method 1: replace
df['Sex_num'] = df['Sex'].replace(['male','female'],[1,2])
df.head()
#Method 2: map
df['Sex_num'] = df['Sex'].map({'male': 1, 'female': 2})
df.head()
#Method 3: use sklearn LabelEncoder for preprocessing
from sklearn.preprocessing import LabelEncoder
for feat in ['Cabin', 'Ticket']:
    lbl = LabelEncoder()
    #Cast to str first so that missing values are encoded as the category 'nan'
    df[feat + "_labelEncode"] = lbl.fit_transform(df[feat].astype(str))
    #Alternative without sklearn: build the mapping by hand and apply it with map
    #label_dict = dict(zip(df[feat].unique(), range(df[feat].nunique())))
    #df[feat + "_labelEncode"] = df[feat].map(label_dict)
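
Integer codes can also be produced with pandas alone. The sketch below uses pd.factorize and the category dtype on a made-up Cabin column (the cabin Series is an assumption); unlike the str cast used with LabelEncoder, both mark missing values with -1 instead of turning them into a category of their own.

```python
import numpy as np
import pandas as pd

# Hypothetical cabin values with one missing entry
cabin = pd.Series(["C85", np.nan, "E46", "C85"])

# factorize numbers the categories in order of first appearance; NaN becomes -1
codes, uniques = pd.factorize(cabin)

# The category dtype numbers the categories in sorted order; NaN becomes -1
cat_codes = cabin.astype("category").cat.codes

print(codes.tolist(), list(uniques))
print(cat_codes.tolist())
```

Whether missing values should be a separate category or a sentinel depends on the model that consumes the codes.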
#Convert category text to one hot encoding

#Method 1: pd.get_dummies
#Encode categorical text columns; one-hot encoding the continuous Age column would create one column per distinct age
for feat in ["Sex", "Embarked"]:
    x = pd.get_dummies(df[feat], prefix=feat)
    df = pd.concat([df, x], axis=1)
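
get_dummies can also encode several columns in one call through its columns parameter, which replaces the loop and the concat. A minimal sketch on made-up data (the toy DataFrame is an assumption):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
})

# One indicator column per category; dtype=int gives 0/1 instead of booleans
dummies = pd.get_dummies(toy, columns=["Sex", "Embarked"], dtype=int)

# drop_first=True removes one redundant column per feature
reduced = pd.get_dummies(toy, columns=["Sex", "Embarked"], drop_first=True, dtype=int)

print(dummies)
print(list(reduced.columns))
```

Dropping the first level is mainly useful for linear models, where the full set of indicator columns is perfectly collinear.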

2.3.3 task 3 (additional): extract the Title feature from the plain-text Name feature (titles are Mr, Miss, Mrs, etc.)

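#Use a raw string so that \. is passed to the regex engine unchanged; it matches the letters just before a period
df['Title'] = df.Name.str.extract(r'([A-Za-z]+)\.', expand=False)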
df.head()
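
To see what the regular expression captures, the sketch below applies it to three names written in the "Surname, Title. Given names" layout (the names Series is an illustrative assumption) and counts the resulting titles.

```python
import pandas as pd

# Hypothetical names in the "Surname, Title. Given names" layout
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
])

# Capture the run of letters immediately before the first period
titles = names.str.extract(r"([A-Za-z]+)\.", expand=False)

# Frequency of each title, useful for spotting rare ones to merge
title_counts = titles.value_counts()

print(titles.tolist())
```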

Topics: Python Data Analysis Data Mining