Data cleaning, merging, transformation and reconstruction

Posted by mooshuligan on Thu, 17 Oct 2019 13:14:30 +0200

Data cleaning

  • Data cleaning is a key step in data analysis, and its quality directly affects all subsequent processing.
  • Before analysis, ask: does the data need to be modified? What should change? How should the data be adjusted for the next round of analysis and mining?
  • Cleaning is an iterative process; a real project may need to repeat these operations several times.
  • Handling missing data: DataFrame.fillna() and DataFrame.dropna() (both are also available on Series).
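The post does not show the missing-data methods in action, so here is a minimal sketch of how they behave. The DataFrame and its column names are hypothetical sample data, not taken from the original.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with missing values (an assumption for illustration)
df = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                   'B': [np.nan, 5.0, 6.0]})

# dropna() removes every row that contains at least one NaN
dropped = df.dropna()

# fillna() replaces each NaN with the given value and keeps all rows
filled = df.fillna(0)

print(dropped)
print(filled)
```

Both calls return new objects and leave df unchanged, which makes them easy to repeat during an iterative cleaning pass.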

Data joining (pd.merge)

  • pd.merge
  • Joins the rows of different DataFrames based on one or more keys
  • Works like a join operation in a database

Example code:

import pandas as pd
import numpy as np

df_obj1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                        'data1' : np.random.randint(0,10,7)})
df_obj2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                        'data2' : np.random.randint(0,10,3)})

print(df_obj1)
print(df_obj2)

Operation result:

   data1 key
0      8   b
1      8   b
2      3   a
3      5   c
4      4   a
5      9   a
6      6   b

   data2 key
0      9   a
1      0   b
2      3   d

1. By default, the overlapping column names are used as the join key ("foreign key")

Example code:

# By default, the overlapping column names are used as the join key ("foreign key")
print(pd.merge(df_obj1, df_obj2))

Operation result:

   data1 key  data2
0      8   b      0
1      8   b      0
2      6   b      0
3      3   a      9
4      4   a      9
5      9   a      9

2. on explicitly specifies the "foreign key"

Example code:

# on explicitly specifies the "foreign key"
print(pd.merge(df_obj1, df_obj2, on='key'))

Operation result:

   data1 key  data2
0      8   b      0
1      8   b      0
2      6   b      0
3      3   a      9
4      4   a      9
5      9   a      9
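The on parameter also accepts a list of column names, which covers the "multiple keys" case mentioned earlier. The following is a small sketch under that assumption; the frames and column names (key1, key2, lval, rval) are hypothetical.

```python
import pandas as pd

# Hypothetical sample data with a two-column key
left = pd.DataFrame({'key1': ['a', 'a', 'b'],
                     'key2': ['x', 'y', 'x'],
                     'lval': [1, 2, 3]})
right = pd.DataFrame({'key1': ['a', 'b', 'b'],
                      'key2': ['x', 'x', 'y'],
                      'rval': [10, 20, 30]})

# A row is matched only when BOTH key1 and key2 are equal
merged = pd.merge(left, right, on=['key1', 'key2'])
print(merged)
```

Only the pairs ('a', 'x') and ('b', 'x') appear on both sides, so the inner join keeps two rows.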

3. left_on and right_on specify the "foreign key" of the left data and the right data respectively

Example code:

# left_on and right_on specify the "foreign key" of the left data and the right data respectively

# Change column names
df_obj1 = df_obj1.rename(columns={'key':'key1'})
df_obj2 = df_obj2.rename(columns={'key':'key2'})

print(pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2'))

Operation result:

   data1 key1  data2 key2
0      8    b      0    b
1      8    b      0    b
2      6    b      0    b
3      3    a      9    a
4      4    a      9    a
5      9    a      9    a

The how parameter specifies the join method. The default is inner, that is, the keys in the result are the intersection of the keys from both sides.

4. Outer join: the keys in the result are the union

Example code:

# Outer join
print(pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2', how='outer'))

Operation result:

   data1 key1  data2 key2
0    8.0    b    0.0    b
1    8.0    b    0.0    b
2    6.0    b    0.0    b
3    3.0    a    9.0    a
4    4.0    a    9.0    a
5    9.0    a    9.0    a
6    5.0    c    NaN  NaN
7    NaN  NaN    3.0    d

5. Left join

Example code:

# Left join
print(pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2', how='left'))

Operation result:

   data1 key1  data2 key2
0      8    b    0.0    b
1      8    b    0.0    b
2      3    a    9.0    a
3      5    c    NaN  NaN
4      4    a    9.0    a
5      9    a    9.0    a
6      6    b    0.0    b

6. Right join

Example code:

# Right join
print(pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2', how='right'))

Operation result:

   data1 key1  data2 key2
0    8.0    b      0    b
1    8.0    b      0    b
2    6.0    b      0    b
3    3.0    a      9    a
4    4.0    a      9    a
5    9.0    a      9    a
6    NaN  NaN      3    d
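To compare the four join methods side by side without random numbers, here is a minimal sketch using small fixed frames. The data and the names left, right, lval and rval are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample data, not taken from the examples above
left = pd.DataFrame({'key': ['a', 'b', 'c'], 'lval': [1, 2, 3]})
right = pd.DataFrame({'key': ['a', 'b', 'd'], 'rval': [4, 5, 6]})

# Collect the sorted join keys that each join method keeps
keys_by_how = {}
for how in ['inner', 'outer', 'left', 'right']:
    result = pd.merge(left, right, on='key', how=how)
    keys_by_how[how] = sorted(result['key'])
    print(how, keys_by_how[how])
```

inner keeps only the shared keys, outer keeps every key, and left or right keeps all keys of the corresponding side.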

7. Handling duplicate column names

The suffixes parameter sets the suffixes appended to overlapping column names; the default is ('_x', '_y')

Example code:

# Handle duplicate column names
df_obj1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                        'data' : np.random.randint(0,10,7)})
df_obj2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                        'data' : np.random.randint(0,10,3)})

print(pd.merge(df_obj1, df_obj2, on='key', suffixes=('_left', '_right')))

Operation result:

   data_left key  data_right
0          9   b           1
1          5   b           1
2          1   b           1
3          2   a           8
4          2   a           8
5          5   a           8
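For comparison, this sketch shows what the column names look like when suffixes is left at its default. The two small frames are hypothetical.

```python
import pandas as pd

# Hypothetical sample data: both frames have a non-key column named 'data'
left = pd.DataFrame({'key': ['a', 'b'], 'data': [1, 2]})
right = pd.DataFrame({'key': ['a', 'b'], 'data': [3, 4]})

# Without suffixes, pandas appends '_x' (left) and '_y' (right)
default_merge = pd.merge(left, right, on='key')
print(default_merge.columns.tolist())
```

The key column is not renamed because it is shared by the join itself; only the overlapping non-key columns receive a suffix.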

8. Join by index

left_index=True or right_index=True

Example code:

# Join by index
df_obj1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                        'data1' : np.random.randint(0,10,7)})
df_obj2 = pd.DataFrame({'data2' : np.random.randint(0,10,3)}, index=['a', 'b', 'd'])

print(pd.merge(df_obj1, df_obj2, left_on='key', right_index=True))

Operation result:

   data1 key  data2
0      3   b      6
1      4   b      6
6      8   b      6
2      6   a      0
4      3   a      0
5      0   a      0
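Both flags can be set together when the key lives in the index on both sides. A minimal sketch, assuming two hypothetical frames indexed by letters:

```python
import pandas as pd

# Hypothetical sample data: the join key is the index of both frames
left = pd.DataFrame({'lval': [1, 2, 3]}, index=['a', 'b', 'c'])
right = pd.DataFrame({'rval': [4, 5]}, index=['a', 'b'])

# Match rows whose index labels are equal (inner join by default)
merged = pd.merge(left, right, left_index=True, right_index=True)
print(merged)
```

The label 'c' exists only on the left, so the default inner join drops it.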

Data merging (pd.concat)

  • Combine multiple objects along an axis

1. Concatenation in NumPy

np.concatenate

Example code:

import numpy as np
import pandas as pd

arr1 = np.random.randint(0, 10, (3, 4))
arr2 = np.random.randint(0, 10, (3, 4))

print(arr1)
print(arr2)

print(np.concatenate([arr1, arr2]))
print(np.concatenate([arr1, arr2], axis=1))

Operation result:

# print(arr1)
[[3 3 0 8]
 [2 0 3 1]
 [4 8 8 2]]

# print(arr2)
[[6 8 7 3]
 [1 6 8 7]
 [1 4 7 1]]

# print(np.concatenate([arr1, arr2]))
 [[3 3 0 8]
 [2 0 3 1]
 [4 8 8 2]
 [6 8 7 3]
 [1 6 8 7]
 [1 4 7 1]]

# print(np.concatenate([arr1, arr2], axis=1)) 
[[3 3 0 8 6 8 7 3]
 [2 0 3 1 1 6 8 7]
 [4 8 8 2 1 4 7 1]]
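The effect of axis is easiest to see in the resulting shapes. This sketch uses fixed arrays of ones and zeros (an assumption, chosen so the result is reproducible) instead of random integers.

```python
import numpy as np

# Two hypothetical 3x4 arrays with fixed contents
ones = np.ones((3, 4), dtype=int)
zeros = np.zeros((3, 4), dtype=int)

# axis=0 (default) stacks rows; axis=1 places the arrays side by side
by_rows = np.concatenate([ones, zeros])
by_cols = np.concatenate([ones, zeros], axis=1)

print(by_rows.shape)
print(by_cols.shape)
```

The arrays must agree in every dimension except the one being concatenated along.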

2. pd.concat

  • Pay attention to the axis direction; the default is axis=0
  • join specifies the merge method; the default is outer
  • When concatenating Series, check whether the row indexes contain duplicates

1) Index has no duplicates

Example code:

# Index has no duplicates
ser_obj1 = pd.Series(np.random.randint(0, 10, 5), index=range(0,5))
ser_obj2 = pd.Series(np.random.randint(0, 10, 4), index=range(5,9))
ser_obj3 = pd.Series(np.random.randint(0, 10, 3), index=range(9,12))

print(ser_obj1)
print(ser_obj2)
print(ser_obj3)

print(pd.concat([ser_obj1, ser_obj2, ser_obj3]))
print(pd.concat([ser_obj1, ser_obj2, ser_obj3], axis=1))

Operation result:

# print(ser_obj1)
0    1
1    8
2    4
3    9
4    4
dtype: int64

# print(ser_obj2)
5    2
6    6
7    4
8    2
dtype: int64

# print(ser_obj3)
9     6
10    2
11    7
dtype: int64

# print(pd.concat([ser_obj1, ser_obj2, ser_obj3]))
0     1
1     8
2     4
3     9
4     4
5     2
6     6
7     4
8     2
9     6
10    2
11    7
dtype: int64

# print(pd.concat([ser_obj1, ser_obj2, ser_obj3], axis=1))
      0    1    2
0   1.0  NaN  NaN
1   8.0  NaN  NaN
2   4.0  NaN  NaN
3   9.0  NaN  NaN
4   4.0  NaN  NaN
5   NaN  2.0  NaN
6   NaN  6.0  NaN
7   NaN  4.0  NaN
8   NaN  2.0  NaN
9   NaN  NaN  6.0
10  NaN  NaN  2.0
11  NaN  NaN  7.0

2) Index has duplicates

Example code:

# Index has duplicates
ser_obj1 = pd.Series(np.random.randint(0, 10, 5), index=range(5))
ser_obj2 = pd.Series(np.random.randint(0, 10, 4), index=range(4))
ser_obj3 = pd.Series(np.random.randint(0, 10, 3), index=range(3))

print(ser_obj1)
print(ser_obj2)
print(ser_obj3)

print(pd.concat([ser_obj1, ser_obj2, ser_obj3]))
print(pd.concat([ser_obj1, ser_obj2, ser_obj3], axis=1, join='inner'))

Operation result:

# print(ser_obj1)
0    0
1    3
2    7
3    2
4    5
dtype: int64

# print(ser_obj2)
0    5
1    1
2    9
3    9
dtype: int64

# print(ser_obj3)
0    8
1    7
2    9
dtype: int64

# print(pd.concat([ser_obj1, ser_obj2, ser_obj3]))
0    0
1    3
2    7
3    2
4    5
0    5
1    1
2    9
3    9
0    8
1    7
2    9
dtype: int64

# print(pd.concat([ser_obj1, ser_obj2, ser_obj3], axis=1, join='inner'))
# join='inner' keeps only the index labels shared by all objects, so rows that would contain NaN are dropped
   0  1  2
0  0  5  8
1  3  1  7
2  7  9  9
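When duplicate row labels are unwanted, pd.concat offers two options: ignore_index=True renumbers the result, and verify_integrity=True raises an error instead of silently producing duplicates. A minimal sketch with hypothetical fixed Series:

```python
import pandas as pd

# Hypothetical sample data: both Series share the labels 0 and 1
s1 = pd.Series([1, 2, 3], index=range(3))
s2 = pd.Series([4, 5], index=range(2))

# ignore_index=True discards the original labels and renumbers from 0
renumbered = pd.concat([s1, s2], ignore_index=True)
print(renumbered)

# verify_integrity=True raises ValueError if the result has duplicate labels
try:
    pd.concat([s1, s2], verify_integrity=True)
    has_duplicates = False
except ValueError:
    has_duplicates = True
print(has_duplicates)
```

Which option fits depends on whether the original labels carry meaning; if they do, renumbering loses information.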

3) When merging DataFrames, check whether both the row index and the column index contain duplicates.

Example code:

df_obj1 = pd.DataFrame(np.random.randint(0, 10, (3, 2)), index=['a', 'b', 'c'],
                       columns=['A', 'B'])
df_obj2 = pd.DataFrame(np.random.randint(0, 10, (2, 2)), index=['a', 'b'],
                       columns=['C', 'D'])
print(df_obj1)
print(df_obj2)

print(pd.concat([df_obj1, df_obj2]))
print(pd.concat([df_obj1, df_obj2], axis=1, join='inner'))

Operation result:

# print(df_obj1)
   A  B
a  3  3
b  5  4
c  8  6

# print(df_obj2)
   C  D
a  1  9
b  6  8

# print(pd.concat([df_obj1, df_obj2]))
     A    B    C    D
a  3.0  3.0  NaN  NaN
b  5.0  4.0  NaN  NaN
c  8.0  6.0  NaN  NaN
a  NaN  NaN  1.0  9.0
b  NaN  NaN  6.0  8.0

# print(pd.concat([df_obj1, df_obj2], axis=1, join='inner'))
   A  B  C  D
a  3  3  1  9
b  5  4  6  8
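To make the difference between join='outer' and join='inner' along axis=1 reproducible, here is a sketch with small fixed frames. The data and column names are hypothetical.

```python
import pandas as pd

# Hypothetical sample data: df2 has no row labelled 'c'
df1 = pd.DataFrame({'A': [1, 2, 3]}, index=['a', 'b', 'c'])
df2 = pd.DataFrame({'C': [4, 5]}, index=['a', 'b'])

# outer (default): union of row labels, missing cells become NaN
outer = pd.concat([df1, df2], axis=1)

# inner: only the row labels present in every frame
inner = pd.concat([df1, df2], axis=1, join='inner')

print(outer)
print(inner)
```

Note that the NaN introduced by the outer join turns column C into a float column, while the inner result keeps its integer dtype.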

Data reconstruction

1. stack

  • Rotates the column index into the row index, producing a hierarchical index
  • DataFrame->Series

Example code:

import numpy as np
import pandas as pd

df_obj = pd.DataFrame(np.random.randint(0,10, (5,2)), columns=['data1', 'data2'])
print(df_obj)

stacked = df_obj.stack()
print(stacked)

Operation result:

# print(df_obj)
   data1  data2
0      7      9
1      7      8
2      8      9
3      4      1
4      1      2

# print(stacked)
0  data1    7
   data2    9
1  data1    7
   data2    8
2  data1    8
   data2    9
3  data1    4
   data2    1
4  data1    1
   data2    2
dtype: int64

2. unstack

  • Expands a hierarchical index
  • Series->DataFrame
  • By default it operates on the innermost index level, i.e. level=-1

Example code:

# By default, operate on the innermost index level
print(stacked.unstack())

# Use level to specify which index level to operate on
print(stacked.unstack(level=0))

Operation result:

# print(stacked.unstack())
   data1  data2
0      7      9
1      7      8
2      8      9
3      4      1
4      1      2

# print(stacked.unstack(level=0))
       0  1  2  3  4
data1  7  7  8  4  1
data2  9  8  9  1  2
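stack and unstack are inverses of each other in the simple case. The following sketch checks this round trip on a small fixed frame; the data is hypothetical.

```python
import pandas as pd

# Hypothetical sample data with fixed values
df = pd.DataFrame({'data1': [1, 2], 'data2': [3, 4]})

# stack: columns become the inner level of a two-level row index
stacked = df.stack()
print(stacked.index.nlevels)

# unstack (innermost level by default) restores the original layout
restored = stacked.unstack()
print(restored)

# unstack(level=0) moves the outer level to the columns instead,
# which gives the transpose of the original frame
transposed = stacked.unstack(level=0)
print(transposed)
```

The round trip is exact here because the frame has no missing values; with NaN present, the result can differ depending on how stack treats missing entries.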

Data conversion

I. Handling duplicate data

1. duplicated() returns a Boolean Series indicating whether each row is a duplicate of an earlier row

Example code:

import numpy as np
import pandas as pd

df_obj = pd.DataFrame({'data1' : ['a'] * 4 + ['b'] * 4,
                       'data2' : np.random.randint(0, 4, 8)})
print(df_obj)

print(df_obj.duplicated())

Operation result:

# print(df_obj)
  data1  data2
0     a      3
1     a      2
2     a      3
3     a      3
4     b      1
5     b      0
6     b      3
7     b      0

# print(df_obj.duplicated())
0    False
1    False
2     True
3     True
4    False
5    False
6    False
7     True
dtype: bool

2. drop_duplicates() filters out duplicate rows

By default, all columns are compared.

You can also specify a subset of columns to compare.

Example code:

print(df_obj.drop_duplicates())
print(df_obj.drop_duplicates('data2'))

Operation result:

# print(df_obj.drop_duplicates())
  data1  data2
0     a      3
1     a      2
4     b      1
5     b      0
6     b      3

# print(df_obj.drop_duplicates('data2'))
  data1  data2
0     a      3
1     a      2
4     b      1
5     b      0
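Both methods also accept a keep parameter that controls which occurrence counts as the original. A minimal sketch with hypothetical fixed data:

```python
import pandas as pd

# Hypothetical sample data: rows 0 and 1 are identical
df = pd.DataFrame({'data1': ['a', 'a', 'b', 'b'],
                   'data2': [1, 1, 2, 3]})

# keep='first' (default): later occurrences are flagged as duplicates
flags = df.duplicated().tolist()
print(flags)

# keep='last': the last occurrence survives instead of the first
last_kept = df.drop_duplicates(keep='last')
print(last_kept)

# subset: compare only the listed columns
by_data1 = df.drop_duplicates(subset=['data1'])
print(by_data1)
```

The original row labels are preserved in the result, which is why the index of the output is no longer contiguous.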

3. Transform each element with the function passed to map

  • Series.map applies the function passed in to each element of the Series

Example code:

ser_obj = pd.Series(np.random.randint(0,10,10))
print(ser_obj)

print(ser_obj.map(lambda x : x ** 2))

Operation result:

# print(ser_obj)
0    1
1    4
2    8
3    6
4    8
5    6
6    6
7    4
8    7
9    3
dtype: int64

# print(ser_obj.map(lambda x : x ** 2))
0     1
1    16
2    64
3    36
4    64
5    36
6    36
7    16
8    49
9     9
dtype: int64

II. Data replacement

replace() substitutes values based on their content

Example code:

# Replace a single value with a single value
print(ser_obj.replace(1, -100))

# Replace multiple values with a single value
print(ser_obj.replace([6, 8], -100))

# Replace multiple values with multiple values
print(ser_obj.replace([4, 7], [-100, -200]))

Operation result:

# print(ser_obj.replace(1, -100))
0   -100
1      4
2      8
3      6
4      8
5      6
6      6
7      4
8      7
9      3
dtype: int64

# print(ser_obj.replace([6, 8], -100))
0      1
1      4
2   -100
3   -100
4   -100
5   -100
6   -100
7      4
8      7
9      3
dtype: int64

# print(ser_obj.replace([4, 7], [-100, -200]))
0      1
1   -100
2      8
3      6
4      8
5      6
6      6
7   -100
8   -200
9      3
dtype: int64
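The multiple-to-multiple case can also be written with a dict, which keeps each old value next to its replacement. A minimal sketch on a hypothetical fixed Series:

```python
import pandas as pd

# Hypothetical sample data
ser = pd.Series([1, 4, 7, 4])

# Keys are the values to find, dict values are their replacements
replaced = ser.replace({4: -100, 7: -200})
print(replaced)
```

Values that do not appear as a key (1 here) are left untouched.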
