44_pandas: Convert categorical variables into dummy variables (get_dummies)
To convert categorical variables (categorical data, qualitative data) into dummy variables in pandas, use the pandas.get_dummies() function.
Data classified by strings, such as gender, can be converted into 0 for male and 1 for female, and multi-class features can be converted into a one-hot representation. This is commonly done as a preprocessing step for machine learning.
This article covers the following topics.
- Basic usage of pandas.get_dummies()
- Exclude the first category: drop_first
- Convert the missing value NaN into a dummy variable: dummy_na
- Specify the column names of dummy variables for a pandas.DataFrame: prefix, prefix_sep
- Convert numeric / Boolean columns of a pandas.DataFrame into dummy variables: columns
- Map each category (level) to an arbitrary number: map() method
The following data is used as an example.
import pandas as pd
import numpy as np

df = pd.read_csv('./data/44/sample_pandas_normal.csv', index_col=0)
df['sex'] = ['female', np.nan, 'male', 'male', 'female', 'male']
df['rank'] = [2, 1, 1, 0, 2, 0]

print(df)
#          age state  point     sex  rank
# name
# Alice     24    NY     64  female     2
# Bob       42    CA     92     NaN     1
# Charlie   18    CA     70    male     1
# Dave      68    TX     70    male     0
# Ellen     24    CA     88  female     2
# Frank     30    NY     57    male     0
Basic usage of pandas.get_dummies()
Pass a pandas.Series, an array-like object (Python list, NumPy array ndarray, etc.), or a pandas.DataFrame as the first argument, data.
In every case, a new pandas.DataFrame is returned. If you want to update the original object, assign the result back to it, as shown below.
df = pd.get_dummies(df)
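The following is a minimal sketch of the point above, assuming a small frame built inline rather than the sample CSV (the variable name `dummies` is only illustrative). It shows that the input is left untouched unless the result is assigned back.

```python
import pandas as pd

df = pd.DataFrame({'state': ['NY', 'CA', 'CA', 'TX'], 'point': [64, 92, 70, 70]})

# get_dummies() returns a new DataFrame; df itself is not modified.
dummies = pd.get_dummies(df)

print(list(df.columns))
# ['state', 'point']
print(list(dummies.columns))
# ['point', 'state_CA', 'state_NY', 'state_TX']
```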
When a pandas.Series or array is specified as the argument
For a pandas.Series or an array (Python list, NumPy array ndarray, etc.), the category names become the column names.
print(pd.get_dummies(df['sex']))
#          female  male
# name
# Alice         1     0
# Bob           0     0
# Charlie       0     1
# Dave          0     1
# Ellen         1     0
# Frank         0     1

print(pd.get_dummies(['male', 1, 1, 2]))
#    1  2  male
# 0  0  0     1
# 1  1  0     0
# 2  1  0     0
# 3  0  1     0

print(pd.get_dummies(np.arange(6)))
#    0  1  2  3  4  5
# 0  1  0  0  0  0  0
# 1  0  1  0  0  0  0
# 2  0  0  1  0  0  0
# 3  0  0  0  1  0  0
# 4  0  0  0  0  1  0
# 5  0  0  0  0  0  1
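The 0/1 values shown in this article reflect the pandas version the examples were run with. In pandas 2.0 and later, dummy columns default to bool and print as True/False. Passing the dtype argument is one way to get integer columns regardless of version. A small sketch with an inline Series (the name `s` is illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series(['female', np.nan, 'male', 'male'], name='sex')

# dtype=int requests integer 0/1 columns whatever the default dtype is.
dummies = pd.get_dummies(s, dtype=int)
print(dummies)
```

The row holding NaN gets 0 in every column, matching the row for Bob in the output above.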
Arrays (Python list, NumPy array ndarray, etc.) must be one-dimensional. A two-dimensional array raises an error.
# print(pd.get_dummies(np.arange(6).reshape((2, 3))))
# Exception: Data must be 1-dimensional
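If the data really is a flat list of categories that happens to be stored in two dimensions, flattening it first is one possible workaround. The sketch below catches a generic Exception on the assumption that the exact exception class may differ between pandas versions.

```python
import numpy as np
import pandas as pd

arr_2d = np.arange(6).reshape((2, 3))

try:
    pd.get_dummies(arr_2d)
    raised = False
except Exception as e:  # the concrete exception class may vary by version
    raised = True
    print(type(e).__name__, e)

# Flattening to one dimension first works.
flat = pd.get_dummies(arr_2d.ravel())
print(flat.shape)
# (6, 6)
```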
When a pandas.DataFrame is specified as the argument
For a pandas.DataFrame, by default all columns whose data type (dtype) is object (mainly strings) or category are converted into dummy variables.
Numeric (int, float) and Boolean (bool) columns are not converted and are left as they are. How to convert numeric and Boolean columns into dummy variables is described later.
For a pandas.DataFrame, each dummy column is named <original column name>_<category name>. How to change this is described later.
print(pd.get_dummies(df))
#          age  point  rank  state_CA  state_NY  state_TX  sex_female  sex_male
# name
# Alice     24     64     2         0         1         0           1         0
# Bob       42     92     1         1         0         0           0         0
# Charlie   18     70     1         1         0         0           0         1
# Dave      68     70     0         0         0         1           0         1
# Ellen     24     88     2         1         0         0           1         0
# Frank     30     57     0         0         1         0           0         1
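To check which columns were actually encoded, one option is to compare the column names before and after. This sketch rebuilds the sample data inline (an assumption, standing in for the CSV file); the names `encoded` and `untouched` are illustrative.

```python
import numpy as np
import pandas as pd

# Inline stand-in for the sample CSV used in this article.
df = pd.DataFrame(
    {
        'age': [24, 42, 18, 68, 24, 30],
        'state': ['NY', 'CA', 'CA', 'TX', 'CA', 'NY'],
        'point': [64, 92, 70, 70, 88, 57],
        'sex': ['female', np.nan, 'male', 'male', 'female', 'male'],
        'rank': [2, 1, 1, 0, 2, 0],
    },
    index=pd.Index(['Alice', 'Bob', 'Charlie', 'Dave', 'Ellen', 'Frank'], name='name'),
)

dummies = pd.get_dummies(df, dtype=int)

# Columns that no longer exist under their original name were encoded.
encoded = [col for col in df.columns if col not in dummies.columns]
untouched = [col for col in df.columns if col in dummies.columns]
print(encoded)
# ['state', 'sex']
print(untouched)
# ['age', 'point', 'rank']
```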
Exclude the first category: drop_first
To encode K categories, K-1 dummy variables are sufficient, but get_dummies() creates K dummy variables by default. With the argument drop_first=True, the first category is excluded and K-1 dummy variables are created.
print(pd.get_dummies(df, drop_first=True))
#          age  point  rank  state_NY  state_TX  sex_male
# name
# Alice     24     64     2         1         0         0
# Bob       42     92     1         0         0         0
# Charlie   18     70     1         0         0         1
# Dave      68     70     0         0         1         1
# Ellen     24     88     2         0         0         0
# Frank     30     57     0         1         0         1
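Why K-1 columns are enough can be seen by reconstructing the dropped column from the remaining ones. The sketch below uses an inline state column and assumes there are no missing values; with NaN present, an all-zero row would be ambiguous, as in the row for Bob above.

```python
import pandas as pd

state = pd.Series(['NY', 'CA', 'CA', 'TX', 'CA', 'NY'], name='state')

full = pd.get_dummies(state, dtype=int)                      # columns: CA, NY, TX
reduced = pd.get_dummies(state, drop_first=True, dtype=int)  # columns: NY, TX

# A row that is all zeros in the reduced frame belongs to the dropped category.
recovered_ca = 1 - reduced.sum(axis=1)
print((recovered_ca == full['CA']).all())
# True
```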
Convert the missing value NaN into a dummy variable: dummy_na
By default, the missing value NaN is ignored and no dummy variable is created for it.
If you want to treat NaN as a category and create a dummy variable for it, set the argument dummy_na=True. In this case, a NaN dummy variable is also generated for columns that do not contain NaN; all of its elements are 0.
print(pd.get_dummies(df, drop_first=True, dummy_na=True))
#          age  point  rank  state_NY  state_TX  state_nan  sex_male  sex_nan
# name
# Alice     24     64     2         1         0          0         0        0
# Bob       42     92     1         0         0          0         0        1
# Charlie   18     70     1         0         0          0         1        0
# Dave      68     70     0         0         1          0         1        0
# Ellen     24     88     2         0         0          0         0        0
# Frank     30     57     0         1         0          0         1        0
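If the all-zero NaN columns are unwanted, one possible follow-up is to drop every column that contains only zeros. This is a sketch on inline data, not something get_dummies() does itself; the name `trimmed` is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        'state': ['NY', 'CA', 'CA', 'TX', 'CA', 'NY'],
        'sex': ['female', np.nan, 'male', 'male', 'female', 'male'],
    }
)

dummies = pd.get_dummies(df, dummy_na=True, dtype=int)
print(dummies['sex_nan'].tolist())
# [0, 1, 0, 0, 0, 0]
print(dummies['state_nan'].tolist())
# [0, 0, 0, 0, 0, 0]

# Keep only the columns that have at least one non-zero element.
trimmed = dummies.loc[:, (dummies != 0).any(axis=0)]
print(list(trimmed.columns))
# ['state_CA', 'state_NY', 'state_TX', 'sex_female', 'sex_male', 'sex_nan']
```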
Specify the column names of dummy variables for a pandas.DataFrame: prefix, prefix_sep
For a pandas.DataFrame, the column names of the generated dummy variables default to <original column name>_<category name>.
This can be changed with the arguments prefix and prefix_sep; the column names become <prefix><prefix_sep><category name>.
The argument prefix can be a string, a list of strings, or a dictionary of strings.
If a string is given, the same prefix is used for all columns. If you want to use only the category names as the dummy column names, set both prefix and prefix_sep to the empty string ''.
print(pd.get_dummies(df, drop_first=True, prefix='', prefix_sep=''))
#          age  point  rank  NY  TX  male
# name
# Alice     24     64     2   1   0     0
# Bob       42     92     1   0   0     0
# Charlie   18     70     1   0   0     1
# Dave      68     70     0   0   1     1
# Ellen     24     88     2   0   0     0
# Frank     30     57     0   1   0     1
If a list or dictionary is given, you must also specify values for the columns whose original column name you want to keep as the prefix. An error occurs if the number of elements in the list or dictionary does not match the number of columns to be converted into dummy variables.
print(pd.get_dummies(df, drop_first=True, prefix=['ST', 'sex'], prefix_sep='-'))
#          age  point  rank  ST-NY  ST-TX  sex-male
# name
# Alice     24     64     2      1      0         0
# Bob       42     92     1      0      0         0
# Charlie   18     70     1      0      0         1
# Dave      68     70     0      0      1         1
# Ellen     24     88     2      0      0         0
# Frank     30     57     0      1      0         1

print(pd.get_dummies(df, drop_first=True, prefix={'state': 'ST', 'sex': 'sex'}, prefix_sep='-'))
#          age  point  rank  ST-NY  ST-TX  sex-male
# name
# Alice     24     64     2      1      0         0
# Bob       42     92     1      0      0         0
# Charlie   18     70     1      0      0         1
# Dave      68     70     0      0      1         1
# Ellen     24     88     2      0      0         0
# Frank     30     57     0      1      0         1
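The length mismatch mentioned above can be reproduced with a short sketch on inline data. Here one prefix is supplied for two encoded columns, which is expected to raise a ValueError; the flag `mismatch_raised` is illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'state': ['NY', 'CA', 'TX'],
        'sex': ['female', 'male', 'male'],
        'point': [64, 92, 70],
    }
)

# A dictionary maps each encoded column to its prefix explicitly.
by_dict = pd.get_dummies(df, prefix={'state': 'ST', 'sex': 'sex'}, prefix_sep='-', dtype=int)
print(list(by_dict.columns))
# ['point', 'ST-CA', 'ST-NY', 'ST-TX', 'sex-female', 'sex-male']

# One prefix for two encoded columns does not match and is rejected.
try:
    pd.get_dummies(df, prefix=['ST'])
    mismatch_raised = False
except ValueError as e:
    mismatch_raised = True
    print(e)
```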
Convert numeric / Boolean columns of a pandas.DataFrame into dummy variables: columns
As mentioned above, for a pandas.DataFrame only columns whose dtype is object (mainly strings) or category are converted into dummy variables by default.
If you pass the names of the columns to convert to the argument columns, numeric and Boolean columns can also be converted. Columns that are not listed are left unchanged, including object columns such as state in the example below.
print(pd.get_dummies(df, drop_first=True, columns=['sex', 'rank']))
#          age state  point  sex_male  rank_1  rank_2
# name
# Alice     24    NY     64         0       0       1
# Bob       42    CA     92         0       1       0
# Charlie   18    CA     70         1       1       0
# Dave      68    TX     70         1       0       0
# Ellen     24    CA     88         0       0       1
# Frank     30    NY     57         1       0       0
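The same argument works for Boolean columns. The sketch below uses an inline frame with a hypothetical bool column `passed`, which is not part of the sample data, and compares the default behavior with an explicit columns list.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'age': [24, 42, 18],
        'rank': [2, 1, 1],
        'passed': [True, False, True],  # hypothetical Boolean column
    }
)

# By default nothing is encoded: there are no object or category columns.
print(list(pd.get_dummies(df).columns))
# ['age', 'rank', 'passed']

# Listing the columns explicitly encodes the numeric and Boolean columns.
dummies = pd.get_dummies(df, columns=['rank', 'passed'], dtype=int)
print(list(dummies.columns))
# ['age', 'rank_1', 'rank_2', 'passed_False', 'passed_True']
```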
If you would rather not list a large number of columns, it may be easier to use astype() to convert the numeric or Boolean columns you want to encode to object.
Note that if you convert a column to object and update the original object, you need to convert it back to its original type before using the column in numerical or Boolean operations.
df['rank'] = df['rank'].astype(object)

print(pd.get_dummies(df, drop_first=True))
#          age  point  state_NY  state_TX  sex_male  rank_1  rank_2
# name
# Alice     24     64         1         0         0       0       1
# Bob       42     92         0         0         1       0       0
# Charlie   18     70         0         0         1       1       0
# Dave      68     70         0         1         1       0       0
# Ellen     24     88         0         0         0       0       1
# Frank     30     57         1         0         1       0       0
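One way to avoid converting the column back afterwards is to apply astype() only to a temporary copy passed to get_dummies(), so the original column keeps its numeric type. A sketch on inline data:

```python
import pandas as pd

df = pd.DataFrame({'age': [24, 42, 18, 68, 24, 30], 'rank': [2, 1, 1, 0, 2, 0]})

# astype() returns a new DataFrame, so df['rank'] itself stays an integer column.
dummies = pd.get_dummies(df.astype({'rank': object}), drop_first=True, dtype=int)
print(list(dummies.columns))
# ['age', 'rank_1', 'rank_2']

# Numerical operations on the original column still work.
print(df['rank'].sum())
# 6
```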
Map each category (level) to an arbitrary number: map() method
If you want to replace each category of a string column with an arbitrary number, instead of generating a 0/1 column for each category (level) as with dummy variables, use the map() method.
Pass a dictionary of the form {original value: converted value} as the argument.
print(df['state'].map({'CA': 0, 'NY': 1, 'TX': 2}))
# name
# Alice      1
# Bob        0
# Charlie    0
# Dave       2
# Ellen      0
# Frank      1
# Name: state, dtype: int64
map() is a method of pandas.Series. To process a column of a pandas.DataFrame and update its values, assign the result back to the original column as follows.
df['state'] = df['state'].map({'CA': 0, 'NY': 1, 'TX': 2})

print(df)
#          age  state  point     sex rank
# name
# Alice     24      1     64  female    2
# Bob       42      0     92     NaN    1
# Charlie   18      0     70    male    1
# Dave      68      2     70    male    0
# Ellen     24      0     88  female    2
# Frank     30      1     57    male    0
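One thing to watch with map() and a dictionary is that values missing from the dictionary become NaN, which also turns the result into a float column. The sketch below adds a hypothetical category 'WA', which is not in the sample data, to show this.

```python
import pandas as pd

state = pd.Series(['NY', 'CA', 'TX', 'WA'], name='state')  # 'WA' is hypothetical

# 'WA' has no entry in the dictionary, so it is mapped to NaN.
mapped = state.map({'CA': 0, 'NY': 1, 'TX': 2})
print(mapped.tolist())
# [1.0, 0.0, 2.0, nan]
```

Checking the result with isna() after mapping is a simple way to catch categories that were left out of the dictionary.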