Data cleaning
- Data cleaning is a key step in data analysis; its quality directly affects all subsequent processing.
- Does the data need to be modified? Is there anything to fix? How should the data be adjusted for the next stage of analysis and mining?
- Cleaning is an iterative process; a real project may need to perform these operations more than once.
- Handling missing data: DataFrame.fillna() and DataFrame.dropna() (see the sketch below).
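A minimal sketch of both methods on a toy DataFrame (the frame and the fill value 0 are illustrative assumptions, not part of the original examples):

import numpy as np
import pandas as pd

# Hypothetical DataFrame with missing values, for illustration only
df = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                   'B': [np.nan, 5.0, 6.0]})
print(df.fillna(0))   # fill NaN with a constant
print(df.dropna())    # drop rows that contain NaN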
Data joining (pd.merge)
- pd.merge
- Joins rows of different DataFrames on one or more keys
- Similar to a database join operation
Example code:
import pandas as pd
import numpy as np

df_obj1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                        'data1': np.random.randint(0, 10, 7)})
df_obj2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                        'data2': np.random.randint(0, 10, 3)})
print(df_obj1)
print(df_obj2)
Operation result:
   data1 key
0      8   b
1      8   b
2      3   a
3      5   c
4      4   a
5      9   a
6      6   b
   data2 key
0      9   a
1      0   b
2      3   d
1. By default, the overlapping column name is used as the join key ("foreign key")
Example code:
# By default, the overlapping column name is used as the join key
print(pd.merge(df_obj1, df_obj2))
Operation result:
   data1 key  data2
0      8   b      0
1      8   b      0
2      6   b      0
3      3   a      9
4      4   a      9
5      9   a      9
2. on explicitly specifies the join key
Example code:
# on explicitly specifies the join key
print(pd.merge(df_obj1, df_obj2, on='key'))
Operation result:
   data1 key  data2
0      8   b      0
1      8   b      0
2      6   b      0
3      3   a      9
4      4   a      9
5      9   a      9
3. left_on and right_on specify the join keys of the left and right data respectively
Example code:
# left_on and right_on specify the join keys of the left and right data respectively
# Rename the columns first so the two DataFrames no longer share a column name
df_obj1 = df_obj1.rename(columns={'key': 'key1'})
df_obj2 = df_obj2.rename(columns={'key': 'key2'})
print(pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2'))
Operation result:
   data1 key1  data2 key2
0      8    b      0    b
1      8    b      0    b
2      6    b      0    b
3      3    a      9    a
4      4    a      9    a
5      9    a      9    a
how specifies the join method. The default is 'inner', i.e. the keys in the result are the intersection of the two key sets.
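As a quick check (a sketch reusing the renamed df_obj1/df_obj2 from above, not part of the original examples), passing how='inner' explicitly gives the same result as the default:

# Explicit inner join, equivalent to the default
print(pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2', how='inner'))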
4. how='outer': the keys in the result are the union
Example code:
# "External connection" print(pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2', how='outer'))
Operation result:
   data1 key1  data2 key2
0    8.0    b    0.0    b
1    8.0    b    0.0    b
2    6.0    b    0.0    b
3    3.0    a    9.0    a
4    4.0    a    9.0    a
5    9.0    a    9.0    a
6    5.0    c    NaN  NaN
7    NaN  NaN    3.0    d
5. Left join (how='left')
Example code:
# Left join
print(pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2', how='left'))
Operation result:
   data1 key1  data2 key2
0      8    b    0.0    b
1      8    b    0.0    b
2      3    a    9.0    a
3      5    c    NaN  NaN
4      4    a    9.0    a
5      9    a    9.0    a
6      6    b    0.0    b
6. "Right"
Example code:
# Right join
print(pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2', how='right'))
Operation result:
   data1 key1  data2 key2
0    8.0    b      0    b
1    8.0    b      0    b
2    6.0    b      0    b
3    3.0    a      9    a
4    4.0    a      9    a
5    9.0    a      9    a
6    NaN  NaN      3    d
7. Handling duplicate column names
suffixes: defaults to ('_x', '_y')
Example code:
# Handle duplicate column names
df_obj1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                        'data': np.random.randint(0, 10, 7)})
df_obj2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                        'data': np.random.randint(0, 10, 3)})
print(pd.merge(df_obj1, df_obj2, on='key', suffixes=('_left', '_right')))
Operation result:
   data_left key  data_right
0          9   b           1
1          5   b           1
2          1   b           1
3          2   a           8
4          2   a           8
5          5   a           8
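For comparison, if suffixes is omitted, pandas falls back to the default '_x'/'_y' names; a minimal sketch using the same df_obj1 and df_obj2 (not part of the original example):

# Without suffixes, the overlapping 'data' columns become data_x and data_y
print(pd.merge(df_obj1, df_obj2, on='key'))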
8. Join by index
left_index=True or right_index=True
Example code:
# Join by index
df_obj1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                        'data1': np.random.randint(0, 10, 7)})
df_obj2 = pd.DataFrame({'data2': np.random.randint(0, 10, 3)},
                       index=['a', 'b', 'd'])
print(pd.merge(df_obj1, df_obj2, left_on='key', right_index=True))
Operation result:
   data1 key  data2
0      3   b      6
1      4   b      6
6      8   b      6
2      6   a      0
4      3   a      0
5      0   a      0
Data merging (pd.concat)
- Combine multiple objects along an axis
1. NumPy's concatenate
np.concatenate
Example code:
import numpy as np
import pandas as pd

arr1 = np.random.randint(0, 10, (3, 4))
arr2 = np.random.randint(0, 10, (3, 4))
print(arr1)
print(arr2)
print(np.concatenate([arr1, arr2]))
print(np.concatenate([arr1, arr2], axis=1))
Operation result:
# print(arr1)
[[3 3 0 8]
 [2 0 3 1]
 [4 8 8 2]]
# print(arr2)
[[6 8 7 3]
 [1 6 8 7]
 [1 4 7 1]]
# print(np.concatenate([arr1, arr2]))
[[3 3 0 8]
 [2 0 3 1]
 [4 8 8 2]
 [6 8 7 3]
 [1 6 8 7]
 [1 4 7 1]]
# print(np.concatenate([arr1, arr2], axis=1))
[[3 3 0 8 6 8 7 3]
 [2 0 3 1 1 6 8 7]
 [4 8 8 2 1 4 7 1]]
2. pd.concat
- Note the axis direction; axis=0 by default
- join specifies the merge method; the default is 'outer'
- When concatenating Series, check whether the row indexes overlap
1) Indexes do not overlap
Example code:
# Indexes do not overlap
ser_obj1 = pd.Series(np.random.randint(0, 10, 5), index=range(0, 5))
ser_obj2 = pd.Series(np.random.randint(0, 10, 4), index=range(5, 9))
ser_obj3 = pd.Series(np.random.randint(0, 10, 3), index=range(9, 12))
print(ser_obj1)
print(ser_obj2)
print(ser_obj3)
print(pd.concat([ser_obj1, ser_obj2, ser_obj3]))
print(pd.concat([ser_obj1, ser_obj2, ser_obj3], axis=1))
Operation result:
# print(ser_obj1)
0    1
1    8
2    4
3    9
4    4
dtype: int64
# print(ser_obj2)
5    2
6    6
7    4
8    2
dtype: int64
# print(ser_obj3)
9     6
10    2
11    7
dtype: int64
# print(pd.concat([ser_obj1, ser_obj2, ser_obj3]))
0      1
1      8
2      4
3      9
4      4
5      2
6      6
7      4
8      2
9      6
10     2
11     7
dtype: int64
# print(pd.concat([ser_obj1, ser_obj2, ser_obj3], axis=1))
      0    1    2
0   1.0  NaN  NaN
1   5.0  NaN  NaN
2   3.0  NaN  NaN
3   2.0  NaN  NaN
4   4.0  NaN  NaN
5   NaN  9.0  NaN
6   NaN  8.0  NaN
7   NaN  3.0  NaN
8   NaN  6.0  NaN
9   NaN  NaN  2.0
10  NaN  NaN  3.0
11  NaN  NaN  3.0
2) Indexes overlap
Example code:
# Indexes overlap
ser_obj1 = pd.Series(np.random.randint(0, 10, 5), index=range(5))
ser_obj2 = pd.Series(np.random.randint(0, 10, 4), index=range(4))
ser_obj3 = pd.Series(np.random.randint(0, 10, 3), index=range(3))
print(ser_obj1)
print(ser_obj2)
print(ser_obj3)
print(pd.concat([ser_obj1, ser_obj2, ser_obj3]))
print(pd.concat([ser_obj1, ser_obj2, ser_obj3], axis=1, join='inner'))
Operation result:
# print(ser_obj1)
0    0
1    3
2    7
3    2
4    5
dtype: int64
# print(ser_obj2)
0    5
1    1
2    9
3    9
dtype: int64
# print(ser_obj3)
0    8
1    7
2    9
dtype: int64
# print(pd.concat([ser_obj1, ser_obj2, ser_obj3]))
0    0
1    3
2    7
3    2
4    5
0    5
1    1
2    9
3    9
0    8
1    7
2    9
dtype: int64
# print(pd.concat([ser_obj1, ser_obj2, ser_obj3], axis=1, join='inner'))
# join='inner' drops the rows or columns that contain NaN
   0  1  2
0  0  5  8
1  3  1  7
2  7  9  9
3) When concatenating DataFrames, check whether the row and column indexes overlap
Example code:
df_obj1 = pd.DataFrame(np.random.randint(0, 10, (3, 2)),
                       index=['a', 'b', 'c'], columns=['A', 'B'])
df_obj2 = pd.DataFrame(np.random.randint(0, 10, (2, 2)),
                       index=['a', 'b'], columns=['C', 'D'])
print(df_obj1)
print(df_obj2)
print(pd.concat([df_obj1, df_obj2]))
print(pd.concat([df_obj1, df_obj2], axis=1, join='inner'))
Operation result:
# print(df_obj1)
   A  B
a  3  3
b  5  4
c  8  6
# print(df_obj2)
   C  D
a  1  9
b  6  8
# print(pd.concat([df_obj1, df_obj2]))
     A    B    C    D
a  3.0  3.0  NaN  NaN
b  5.0  4.0  NaN  NaN
c  8.0  6.0  NaN  NaN
a  NaN  NaN  1.0  9.0
b  NaN  NaN  6.0  8.0
# print(pd.concat([df_obj1, df_obj2], axis=1, join='inner'))
   A  B  C  D
a  3  3  1  9
b  5  4  6  8
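If the duplicated row labels produced by the default concatenation are unwanted, ignore_index=True renumbers the result from 0; a brief sketch beyond the original example, reusing the same df_obj1 and df_obj2:

# ignore_index=True discards the original row labels and renumbers from 0
print(pd.concat([df_obj1, df_obj2], ignore_index=True))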
Data reshaping
1. stack
- Rotates the column index into the row index, producing a hierarchical index
- DataFrame->Series
Example code:
import numpy as np
import pandas as pd

df_obj = pd.DataFrame(np.random.randint(0, 10, (5, 2)),
                      columns=['data1', 'data2'])
print(df_obj)
stacked = df_obj.stack()
print(stacked)
Operation result:
# print(df_obj)
   data1  data2
0      7      9
1      7      8
2      8      9
3      4      1
4      1      2
# print(stacked)
0  data1    7
   data2    9
1  data1    7
   data2    8
2  data1    8
   data2    9
3  data1    4
   data2    1
4  data1    1
   data2    2
dtype: int64
2. unstack
- Expands the hierarchical index
- Series->DataFrame
- Operates on the innermost index level by default, i.e. level=-1
Example code:
# By default, operates on the innermost index level
print(stacked.unstack())
# level specifies which index level to operate on
print(stacked.unstack(level=0))
Operation result:
# print(stacked.unstack())
   data1  data2
0      7      9
1      7      8
2      8      9
3      4      1
4      1      2
# print(stacked.unstack(level=0))
       0  1  2  3  4
data1  7  7  8  4  1
data2  9  8  9  1  2
Data conversion
I. Handling duplicate data
1. duplicated() returns a Boolean Series indicating whether each row is a duplicate
Example code:
import numpy as np
import pandas as pd

df_obj = pd.DataFrame({'data1': ['a'] * 4 + ['b'] * 4,
                       'data2': np.random.randint(0, 4, 8)})
print(df_obj)
print(df_obj.duplicated())
Operation result:
# print(df_obj)
  data1  data2
0     a      3
1     a      2
2     a      3
3     a      3
4     b      1
5     b      0
6     b      3
7     b      0
# print(df_obj.duplicated())
0    False
1    False
2     True
3     True
4    False
5    False
6    False
7     True
dtype: bool
2. drop_duplicates() filters out duplicate rows
- By default, all columns are considered
- A subset of columns can be specified instead
Example code:
print(df_obj.drop_duplicates())
print(df_obj.drop_duplicates('data2'))
Operation result:
# print(df_obj.drop_duplicates())
  data1  data2
0     a      3
1     a      2
4     b      1
5     b      0
6     b      3
# print(df_obj.drop_duplicates('data2'))
  data1  data2
0     a      3
1     a      2
4     b      1
5     b      0
3. Transforming data with map
- Series.map() applies the function passed in to each element
Example code:
ser_obj = pd.Series(np.random.randint(0, 10, 10))
print(ser_obj)
print(ser_obj.map(lambda x: x ** 2))
Operation result:
# print(ser_obj)
0    1
1    4
2    8
3    6
4    8
5    6
6    6
7    4
8    7
9    3
dtype: int64
# print(ser_obj.map(lambda x : x ** 2))
0     1
1    16
2    64
3    36
4    64
5    36
6    36
7    16
8    49
9     9
dtype: int64
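Besides a function, Series.map also accepts a dict (or another Series) as the mapping; values not found in the mapping become NaN. A brief sketch beyond the original example, reusing ser_obj (the chosen keys are illustrative):

# Map selected values through a dict; anything unmapped becomes NaN
print(ser_obj.map({4: 'four', 8: 'eight'}))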
II. Data replacement
replace substitutes values based on their content
Example code:
# Replace a single value with a single value
print(ser_obj.replace(1, -100))
# Replace multiple values with one value
print(ser_obj.replace([6, 8], -100))
# Replace multiple values with multiple values
print(ser_obj.replace([4, 7], [-100, -200]))
Operation result:
# print(ser_obj.replace(1, -100))
0   -100
1      4
2      8
3      6
4      8
5      6
6      6
7      4
8      7
9      3
dtype: int64
# print(ser_obj.replace([6, 8], -100))
0      1
1      4
2   -100
3   -100
4   -100
5   -100
6   -100
7      4
8      7
9      3
dtype: int64
# print(ser_obj.replace([4, 7], [-100, -200]))
0      1
1   -100
2      8
3      6
4      8
5      6
6      6
7   -100
8   -200
9      3
dtype: int64
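replace also accepts a dict form, which combines several substitutions in one call; a brief sketch beyond the original example, reusing ser_obj:

# Dict form: each key is replaced by its corresponding value
print(ser_obj.replace({4: -100, 7: -200}))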