Hands-on Deep Learning, Chapter 2: Learning Notes

Posted by WildcatRudy on Mon, 20 Dec 2021 11:11:07 +0100

Section I: Data Cleaning and Feature Processing

Viewing missing values in a dataset

Use isnull() or isna() to locate missing values, or info() to view the number of non-null entries in each column of the dataset.
Here are some pitfalls:

  1. np.nan in numpy is a float, and two np.nan values never compare equal (np.nan == np.nan is False).
  2. None is of type NoneType; comparing np.nan with None also evaluates to False.
  3. numpy's isnan() method only supports numeric input, so it raises an error if given a string.
  4. If the only null values present are the np.nan produced by the numpy module, or we only want to test for np.nan, then np.isnan() is usable, but it is not recommended in other situations (the math module, for example, also produces math.nan).
  5. pd.isna() and pd.isnull() are identical in usage; both handle the np.nan, string, and None values found in a DataFrame. This method is recommended and the least error-prone. All of these behaviours are demonstrated in the sketch below.
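A minimal sketch of the behaviour described above (nothing beyond numpy and pandas):

import numpy as np
import pandas as pd

# Pitfall 1: np.nan is a float, and nan never compares equal to nan
print(type(np.nan))         # <class 'float'>
print(np.nan == np.nan)     # False

# Pitfall 2: None is a NoneType; comparing it with np.nan is also False
print(np.nan == None)       # False

# Pitfall 3: np.isnan() accepts only numeric input; a string raises TypeError
# np.isnan('abc')           # TypeError

# Pitfall 5: pd.isna() / pd.isnull() handle np.nan, None and strings alike
s = pd.Series([np.nan, None, 'abc', 1.0])
print(pd.isna(s).tolist())  # [True, True, False, False]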

Handling missing values in a dataset

  • Drop missing values. Dropping shrinks the dataset; when the dataset is small this can noticeably distort the analysis. dropna() has a commonly used how parameter: with how='all', a row is dropped only when all of its values are NA; by default, any row containing a missing value is dropped. To drop columns instead, add the parameter axis=1.
train['Age'].dropna()

You can also set the parameter thresh=n to keep only the rows that contain at least n non-NA values.
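For example, a quick sketch of these dropna() variants on a toy frame (the column names here are invented for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, np.nan, np.nan],
                   'b': [2, np.nan, 5],
                   'c': [3, np.nan, np.nan]})

df.dropna()            # drops every row containing a missing value
df.dropna(how='all')   # drops only the all-NA row (the middle one here)
df.dropna(thresh=2)    # keeps rows with at least 2 non-NA values
df.dropna(axis=1)      # drops columns containing missing values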

  • Fill in missing values with fillna().
    method = {'ffill'/'pad', 'bfill'/'backfill', None}: 'ffill' fills a missing value with the previous non-null value, 'bfill' fills it with the next non-null value, and the default None means missing values are filled with a value you supply;
    the inplace parameter controls whether the original dataset is modified in place;
    the limit parameter caps the number of fills;
    the axis parameter sets the fill direction: the default axis=0 fills down each column, and axis=1 fills across each row.
train['Age']
train['Age'].fillna(method='ffill', limit=100)

Beyond these simple fills, missing values can also be filled with the mean, mode, or median, or by exploiting patterns in the data: for example, build a regression model between the feature to be filled and other related features, using the rows with known values as the training set and the rows to be filled as the test set. Of course, a more complex method is not automatically better; for each concrete problem, the characteristics of the data and of the business should determine the approach.
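As a sketch of the statistical fills just mentioned, assuming the same Titanic-style train frame used throughout these notes:

train['Age'].fillna(train['Age'].mean())     # fill with the mean
train['Age'].fillna(train['Age'].median())   # fill with the median
train['Age'].fillna(train['Age'].mode()[0])  # mode() returns a Series, take its first value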

Handling duplicate values in a dataset

  • Viewing duplicate values: use the duplicated() method.
    duplicated() takes two parameters: subset, which restricts the check to a subset of the columns, and keep='first'/'last'/False. Rows that are kept are marked False and rows considered duplicates are marked True: 'first' keeps the first occurrence, 'last' keeps the last, and keep=False marks every duplicated row.
  • Deleting duplicate values: the drop_duplicates() method takes the same parameters as duplicated(), plus an inplace parameter. With the default inplace=False the dataset is not modified and a new dataset with the duplicates removed is returned; inplace=True modifies the current dataset. A sketch of both methods follows.
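A sketch of both methods (the subset columns are assumed from the Titanic dataset):

train[train.duplicated()]                    # inspect the fully duplicated rows first
train.duplicated(subset=['Sex', 'Pclass'])   # restrict the check to given columns
train_unique = train.drop_duplicates(keep='first')  # returns a copy; the original is untouched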

Data is hard-won. Don't delete it lightly!

Feature observation and processing

To make modeling and computation easier, non-numerical features need to be converted into numerical ones, and continuous numerical features into discrete ones.
Discretizing a continuous feature: binning.
Unsupervised discretization falls into two categories: (1) equal-width binning and (2) equal-frequency binning.
Equal-width binning:

pd.cut(train['Age'], 5, labels=[1, 2, 3, 4, 5])

pd.cut() returns a special Categorical object.

# You can also pass in custom bin boundaries
bins = [0, 5, 15, 30, 50, 80]
pd.cut(train['Age'], bins, labels=[1, 2, 3, 4, 5])

Equal-frequency binning:

pd.qcut(data, 4)
# Custom quantiles can also be passed in
q = pd.qcut(train['Age'], [0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0], labels=[1, 2, 3, 4, 5, 6])
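To sanity-check either binning, count the rows per bin: equal-width bins usually hold unequal counts, while equal-frequency bins hold roughly equal ones.

pd.cut(train['Age'], 5, labels=[1, 2, 3, 4, 5]).value_counts()   # uneven counts
pd.qcut(train['Age'], 4).value_counts()                          # roughly equal counts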

Converting features to numbers:

  1. The replace() method:
train['Sex'].replace({'male': 1, 'female': 2})
# You can also pass in matching lists of old and new values
df['Sex_num'] = df['Sex'].replace(['male', 'female'], [1, 2])
  2. The LabelEncoder class in scikit-learn
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
for i in ['Cabin', 'Embarked']:
    # Build a dictionary mapping each distinct category to a consecutive integer
    label_dict = dict(zip(train[i].unique(), range(train[i].nunique())))
    train[i + "_labelEncode"] = train[i].map(label_dict)
    # Equivalently with LabelEncoder; its input must be numeric or string,
    # hence the astype(str). The result is a column of consecutive integer codes.
    train[i + "_labelEncode"] = le.fit_transform(train[i].astype(str))
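A short usage note, assuming the loop above has just run: the fitted encoder remembers its classes, so the mapping can be inspected or reversed.

print(le.classes_)   # the categories, in encoded order (last fit: 'Embarked')
decoded = le.inverse_transform(train['Embarked_labelEncode'])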

One-hot encoding
One-hot encoding, also known as one-bit-effective encoding, uses an n-bit status register to encode n states: each state gets its own register bit, and only one bit is active at any time. In other words, one-hot encoding represents a categorical variable as a binary vector: each category is first mapped to an integer, and each integer is then represented as a vector that is all zeros except for a 1 at the index of that integer.

  1. The OneHotEncoder class:
# OneHotEncoder
from sklearn.preprocessing import OneHotEncoder

train1 = train.dropna()
ohe = OneHotEncoder()
features = ohe.fit_transform(train1[['Cabin']])
features.toarray()[:, 1:].shape[1]   # number of columns after dropping the first

Only the columns after the first are counted here because the first column of a one-hot encoding is in fact redundant: three distinct values need only two bits of encoding.
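Recent scikit-learn versions (0.21+) can drop that redundant first column directly through the encoder's drop parameter instead of slicing it off afterwards; a sketch:

ohe = OneHotEncoder(drop='first')
features = ohe.fit_transform(train1[['Cabin']])
features.toarray().shape[1]   # one column fewer than the number of categories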

  2. The get_dummies() method
pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False)

prefix sets the prefix for the generated column names, and drop_first=True drops the redundant first column; the other parameters are rarely needed.

# x = pd.get_dummies(df["Age"] // 6)        # // is floor division
# x = pd.get_dummies(pd.cut(df['Age'], 5))  # dummies from equal-width bins
x = pd.get_dummies(df['Embarked'], prefix='Embarked')
df = pd.concat([df, x], axis=1)
# or equivalently: df = df.join(x)

df.head()

Extracting features from strings
For this part I referred to a summary by a blogger on CSDN, which is written in detail. I have studied regular-expression usage many times and forget it again after a while, so I simply consult the reference whenever I need it; each round of use makes it stick a little better.
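As a concrete sketch of extracting a feature from a string column with a regular expression, assuming the Titanic Name column (values like 'Braund, Mr. Owen Harris'); the Title column name is my own choice:

# Capture the title (Mr, Mrs, Miss, ...) between the comma and the period
df['Title'] = df['Name'].str.extract(r',\s*([^.]+)\.', expand=False)
df['Title'].value_counts()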

Topics: Python Data Analysis