44_Converting categorical variables into dummy variables with pandas (get_dummies)

Posted by dodgeqwe on Thu, 27 Jan 2022 18:42:21 +0100

To convert categorical variables (categorical data, qualitative data) into dummy variables in pandas, use the pandas.get_dummies() function.

String-valued categorical data such as gender can be converted into, for example, 0 for male and 1 for female, and multi-class features can be converted into a one-hot representation. This is commonly done as a preprocessing step for machine learning.

This article covers the following.

  • Basic usage of pandas.get_dummies()
  • Excluding the first category: drop_first
  • Creating a dummy variable for the missing value NaN: dummy_na
  • Specifying the column names of the dummy variables for a pandas.DataFrame: prefix, prefix_sep
  • Converting numeric / Boolean columns by specifying columns of a pandas.DataFrame: columns
  • Mapping each category (level) to an arbitrary number: map() method

The following data is used as an example.

import pandas as pd
import numpy as np

df = pd.read_csv('./data/44/sample_pandas_normal.csv', index_col=0)

df['sex'] = ['female', np.nan, 'male', 'male', 'female', 'male']
df['rank'] = [2, 1, 1, 0, 2, 0]

print(df)
#          age state  point     sex  rank
# name
# Alice     24    NY     64  female     2
# Bob       42    CA     92     NaN     1
# Charlie   18    CA     70    male     1
# Dave      68    TX     70    male     0
# Ellen     24    CA     88  female     2
# Frank     30    NY     57    male     0

Basic usage of pandas.get_dummies()

Pass a pandas.Series, an array-like object (Python list, NumPy array ndarray, etc.), or a pandas.DataFrame as the first parameter, data.

In every case, a new pandas.DataFrame is returned. To update the original object, assign the result back to it, for example as follows.

df = pd.get_dummies(df)
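
As a minimal sketch of the point above, the following uses a small hypothetical frame (toy, not the article's CSV data) to show that get_dummies() leaves its input untouched and returns a new pandas.DataFrame.

```python
import pandas as pd

# Hypothetical toy frame used only for illustration.
toy = pd.DataFrame({'state': ['NY', 'CA', 'CA'], 'point': [64, 92, 70]})

encoded = pd.get_dummies(toy)

print(list(toy.columns))      # the original frame still has 'state' and 'point'
print(list(encoded.columns))  # the new frame has 'point' plus one column per state
```

Columns that are not encoded come first in the result, followed by the generated dummy columns.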

When a pandas.Series or an array is specified as the argument

For a pandas.Series or an array (Python list, NumPy array ndarray, etc.), the category names become the column names.

print(pd.get_dummies(df['sex']))
#          female  male
# name                 
# Alice         1     0
# Bob           0     0
# Charlie       0     1
# Dave          0     1
# Ellen         1     0
# Frank         0     1

print(pd.get_dummies(['male', 1, 1, 2]))
#    1  2  male
# 0  0  0     1
# 1  1  0     0
# 2  1  0     0
# 3  0  1     0

print(pd.get_dummies(np.arange(6)))
#    0  1  2  3  4  5
# 0  1  0  0  0  0  0
# 1  0  1  0  0  0  0
# 2  0  0  1  0  0  0
# 3  0  0  0  1  0  0
# 4  0  0  0  0  1  0
# 5  0  0  0  0  0  1

Arrays (Python list, NumPy array ndarray, etc.) must be one-dimensional. Passing a two-dimensional array raises an error.

# print(pd.get_dummies(np.arange(6).reshape((2, 3))))
# Exception: Data must be 1-dimensional
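
If the values of a two-dimensional array really are one set of categories, one possible workaround is to flatten it first. The sketch below assumes that flattening with ravel() is acceptable for your data; the exact exception type and message for the two-dimensional case may differ between pandas versions, so it catches the general Exception.

```python
import numpy as np
import pandas as pd

arr2d = np.arange(6).reshape((2, 3))

try:
    pd.get_dummies(arr2d)
    raised = False
except Exception:  # exact type and message vary by pandas version
    raised = True

print(raised)

# Flatten to one dimension before encoding.
flat = pd.get_dummies(arr2d.ravel())
print(flat.shape)
```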

When a pandas.DataFrame is specified as the argument

For a pandas.DataFrame, by default every column whose data type dtype is object (mainly strings) or category is converted into dummy variables.

Numeric (int, float) and Boolean (bool) columns are not converted and are left as they are. How to convert numeric and Boolean columns into dummy variables is described later.

For a pandas.DataFrame, each dummy column is named <original column name>_<category name>. How to change this is described later.

print(pd.get_dummies(df))
#          age  point  rank  state_CA  state_NY  state_TX  sex_female  sex_male
# name                                                                         
# Alice     24     64     2         0         1         0           1         0
# Bob       42     92     1         1         0         0           0         0
# Charlie   18     70     1         1         0         0           0         1
# Dave      68     70     0         0         0         1           0         1
# Ellen     24     88     2         1         0         0           1         0
# Frank     30     57     0         0         1         0           0         1
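
Note that the 0 / 1 output shown in this article corresponds to older pandas releases. In pandas 2.0 and later, the dummy columns default to bool (True / False). A minimal sketch, using a hypothetical toy frame, of requesting integer output with the dtype parameter:

```python
import pandas as pd

# Hypothetical toy frame used only for illustration.
toy = pd.DataFrame({'state': ['NY', 'CA', 'TX'], 'age': [24, 42, 68]})

# dtype=int asks for integer 0 / 1 columns regardless of the version default.
as_int = pd.get_dummies(toy, dtype=int)
print(as_int)
```

Passing dtype explicitly keeps the result the same across versions, which can matter if later code does arithmetic on the dummy columns.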

Exclude the first category: drop_first

Representing K categories requires only K-1 dummy variables, but get_dummies() creates K dummy variables by default. With the parameter drop_first=True, the first category is excluded and K-1 dummy variables are created.

print(pd.get_dummies(df, drop_first=True))
#          age  point  rank  state_NY  state_TX  sex_male
# name                                                   
# Alice     24     64     2         1         0         0
# Bob       42     92     1         0         0         0
# Charlie   18     70     1         0         0         1
# Dave      68     70     0         0         1         1
# Ellen     24     88     2         0         0         0
# Frank     30     57     0         1         0         1
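
The K versus K-1 relationship can be checked on a small hypothetical Series. In this sketch, rows where every remaining dummy is 0 correspond to the dropped first category (this holds only when the data contains no NaN).

```python
import pandas as pd

# Hypothetical Series with K = 3 categories and no missing values.
s = pd.Series(['CA', 'NY', 'TX', 'CA'], name='state')

full = pd.get_dummies(s, dtype=int)
reduced = pd.get_dummies(s, drop_first=True, dtype=int)

print(list(full.columns))     # K columns
print(list(reduced.columns))  # K-1 columns; the first category is gone

# Rows that are all 0 in the reduced frame belong to the dropped category.
dropped_rows = reduced.sum(axis=1) == 0
print(s[dropped_rows].unique())
```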

Creating a dummy variable for the missing value NaN: dummy_na

By default, the missing value NaN is ignored: no dummy variable is created for it, so a row containing NaN gets 0 in every dummy column.

To treat NaN as a category of its own, set the parameter dummy_na=True. Note that a NaN dummy variable is then also generated for columns that contain no NaN; all of its elements are 0.

print(pd.get_dummies(df, drop_first=True, dummy_na=True))
#          age  point  rank  state_NY  state_TX  state_nan  sex_male  sex_nan
# name                                                                       
# Alice     24     64     2         1         0          0         0        0
# Bob       42     92     1         0         0          0         0        1
# Charlie   18     70     1         0         0          0         1        0
# Dave      68     70     0         0         1          0         1        0
# Ellen     24     88     2         0         0          0         0        0
# Frank     30     57     0         1         0          0         1        0
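
One reason to consider dummy_na=True together with drop_first=True: without it, a NaN row is indistinguishable from a row of the dropped category, since both are all 0. A minimal sketch on a hypothetical Series:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value.
sex = pd.Series(['female', np.nan, 'male'], name='sex')

without_na = pd.get_dummies(sex, drop_first=True, dtype=int)
with_na = pd.get_dummies(sex, drop_first=True, dummy_na=True, dtype=int)

print(without_na)  # 'female' and NaN rows both show 0 for 'male'
print(with_na)     # an extra NaN column marks the missing row
```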

Specifying the column names of the dummy variables for a pandas.DataFrame: prefix, prefix_sep

For a pandas.DataFrame, the column names of the generated dummy variables default to <original column name>_<category name>.

This can be changed with the parameters prefix and prefix_sep. The column names become <prefix><prefix_sep><category name>.

The parameter prefix accepts a string, a list of strings, or a dictionary of strings.

If a string is given, the same prefix is used for every column. To use only the category name as the dummy column name, pass the empty string '' for both prefix and prefix_sep.

print(pd.get_dummies(df, drop_first=True, prefix='', prefix_sep=''))
#          age  point  rank  NY  TX  male
# name                                   
# Alice     24     64     2   1   0     0
# Bob       42     92     1   0   0     0
# Charlie   18     70     1   0   0     1
# Dave      68     70     0   0   1     1
# Ellen     24     88     2   0   0     0
# Frank     30     57     0   1   0     1

With a list or a dictionary, a value must also be given for columns that should keep their original column name as the prefix. An error is raised if the number of elements in the list or dictionary does not match the number of columns being converted.

print(pd.get_dummies(df, drop_first=True, prefix=['ST', 'sex'], prefix_sep='-'))
#          age  point  rank  ST-NY  ST-TX  sex-male
# name                                             
# Alice     24     64     2      1      0         0
# Bob       42     92     1      0      0         0
# Charlie   18     70     1      0      0         1
# Dave      68     70     0      0      1         1
# Ellen     24     88     2      0      0         0
# Frank     30     57     0      1      0         1

print(pd.get_dummies(df, drop_first=True, prefix={'state': 'ST', 'sex': 'sex'}, prefix_sep='-'))
#          age  point  rank  ST-NY  ST-TX  sex-male
# name                                             
# Alice     24     64     2      1      0         0
# Bob       42     92     1      0      0         0
# Charlie   18     70     1      0      0         1
# Dave      68     70     0      0      1         1
# Ellen     24     88     2      0      0         0
# Frank     30     57     0      1      0         1
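
The length mismatch mentioned above can be reproduced with a small hypothetical frame. This sketch assumes the error is a ValueError, which is what current pandas versions raise; the message text is printed rather than relied on.

```python
import pandas as pd

# Hypothetical toy frame with two object columns to encode.
toy = pd.DataFrame({'state': ['NY', 'CA'],
                    'sex': ['female', 'male'],
                    'age': [24, 42]})

try:
    pd.get_dummies(toy, prefix=['ST'])  # two columns to encode, only one prefix
    error_message = None
except ValueError as e:
    error_message = str(e)

print(error_message)

# One prefix per encoded column works as expected.
ok = pd.get_dummies(toy, prefix={'state': 'ST', 'sex': 'sex'}, prefix_sep='-')
print(list(ok.columns))
```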

Converting numeric / Boolean columns by specifying columns of a pandas.DataFrame: columns

As mentioned above, for a pandas.DataFrame only columns whose data type dtype is object (mainly strings) or category are converted into dummy variables by default.

If you pass the names of the columns to convert to the parameter columns, numeric and Boolean columns can also be converted into dummy variables. Columns that are not listed are left unconverted.

print(pd.get_dummies(df, drop_first=True, columns=['sex', 'rank']))
#          age state  point  sex_male  rank_1  rank_2
# name                                               
# Alice     24    NY     64         0       0       1
# Bob       42    CA     92         0       1       0
# Charlie   18    CA     70         1       1       0
# Dave      68    TX     70         1       0       0
# Ellen     24    CA     88         0       0       1
# Frank     30    NY     57         1       0       0

If you do not want to list a large number of columns, it may be easier to use astype() to convert the numeric or Boolean columns you want to encode to object.

Note that if you convert a column to object and update the original object, you need to convert it back to its original type before using that column in numeric or Boolean operations.

df['rank'] = df['rank'].astype(object)
print(pd.get_dummies(df, drop_first=True))
#          age  point  state_NY  state_TX  sex_male  rank_1  rank_2
# name                                                             
# Alice     24     64         1         0         0       0       1
# Bob       42     92         0         0         0       1       0
# Charlie   18     70         0         0         1       1       0
# Dave      68     70         0         1         1       0       0
# Ellen     24     88         0         0         0       0       1
# Frank     30     57         1         0         1       0       0
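
One way to avoid having to restore the original type afterwards is to convert the dtype on a temporary copy instead of updating the original object. A minimal sketch, using a hypothetical toy frame and passing a dictionary to DataFrame.astype():

```python
import pandas as pd

# Hypothetical toy frame used only for illustration.
toy = pd.DataFrame({'sex': ['female', 'male', 'male'], 'rank': [2, 1, 0]})

# astype() returns a new frame, so toy itself keeps its integer column.
encoded = pd.get_dummies(toy.astype({'rank': object}), drop_first=True)

print(toy['rank'].dtype)      # still an integer dtype
print(list(encoded.columns))
```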

Mapping each category (level) to an arbitrary number: map() method

If you want to replace each category of a string column with an arbitrary number, instead of generating a column of 0s and 1s for each category (level) as with dummy variables, use the map() method.

Pass a dictionary of the form {original value: converted value} as the argument.

print(df['state'].map({'CA': 0, 'NY': 1, 'TX': 2}))
# name
# Alice      1
# Bob        0
# Charlie    0
# Dave       2
# Ellen      0
# Frank      1
# Name: state, dtype: int64

map() is a method of pandas.Series. To update the values of a pandas.DataFrame column, assign the result back to the original column as follows.

df['state'] = df['state'].map({'CA': 0, 'NY': 1, 'TX': 2})
print(df)
#          age  state  point     sex rank
# name                                   
# Alice     24      1     64  female    2
# Bob       42      0     92     NaN    1
# Charlie   18      0     70    male    1
# Dave      68      2     70    male    0
# Ellen     24      0     88  female    2
# Frank     30      1     57    male    0
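
One thing to watch for with map(): a value that has no entry in the dictionary becomes NaN. The sketch below shows this on a hypothetical Series and, as one possible alternative when the specific numbers do not matter, uses pd.factorize() to assign integer codes automatically.

```python
import pandas as pd

# Hypothetical Series used only for illustration.
state = pd.Series(['NY', 'CA', 'TX'], name='state')

# 'TX' is missing from the dictionary, so it maps to NaN.
partial = state.map({'CA': 0, 'NY': 1})
print(partial.isna().tolist())

# factorize() returns integer codes and the unique values they refer to;
# sort=True numbers the categories in sorted order.
codes, uniques = pd.factorize(state, sort=True)
print(codes.tolist())
print(list(uniques))
```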

Topics: Python Machine Learning Data Mining pandas