Pandas processing missing data

Posted by jxrd on Mon, 30 Mar 2020 07:50:09 +0200

Check for missing values

To make it easier to detect missing values (and dtypes across different arrays), Pandas provides isnull() and notnull() functions, which are also methods for Series and DataFrame objects

Example:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

print (df['one'].isnull())

Execute the above example code and get the following results-

a    False
b     True
c    False
d     True
e    False
f    False
g     True
h    False
Name: one, dtype: bool

Calculation of missing data

  • When summing data, NA will be treated as 0
  • If the data is all NA, the result will be NA

Example:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

print (df['one'].sum())

Execute the above example code and get the following results-

-2.6163354325445014

Clean / fill missing data

Pandas provides various methods to clear missing values. The fillna() function can "fill" NA values with non null data in several ways

Replace NaN with scalar value

The following procedure shows how to replace NaN with 0.

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'],columns=['one',
'two', 'three'])
df = df.reindex(['a', 'b', 'c'])
print (df)
print ("NaN replaced with '0':")
print (df.fillna(0))

Execute the above example code and get the following results-

        one       two     three
a -0.479425 -1.711840 -1.453384
b       NaN       NaN       NaN
c -0.733606 -0.813315  0.476788
NaN replaced with '0':
        one       two     three
a -0.479425 -1.711840 -1.453384
b  0.000000  0.000000  0.000000
c -0.733606 -0.813315  0.476788

Fill in NA forward and backward

  • pad/fill method forward

  • bfill/backfill fill method backward

Example:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

print (df.fillna(method='pad'))

Execute the above example code and get the following results-

        one       two     three
a  0.614938 -0.452498 -2.113057
b  0.614938 -0.452498 -2.113057
c -0.118390  1.333962 -0.037907
d -0.118390  1.333962 -0.037907
e  0.699733  0.502142 -0.243700
f  0.544225 -0.923116 -1.123218
g  0.544225 -0.923116 -1.123218
h -0.669783  1.187865  1.112835

Missing value

If you want to exclude only missing values, use the dropna function and axis parameters. By default, axis = 0, which is applied on the row, meaning that if any value in the row is NA, the entire row is excluded.

Example:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print (df.dropna())

Execute the above example code and get the following results-

        one       two     three
a -0.719623  0.028103 -1.093178
c  0.040312  1.729596  0.451805
e -1.029418  1.920933  1.289485
f  1.217967  1.368064  0.527406
h  0.667855  0.147989 -1.035978

Replace missing (or) common values

Many times, you have to replace a generic value with a specific value. This can be achieved by applying replacement methods. Replacing NA with a scalar value is the equivalent behavior of the fillna() function.

Example:

import pandas as pd
import numpy as np
df = pd.DataFrame({'one':[10,20,30,40,50,2000],
'two':[1000,0,30,40,50,60]})
print (df.replace({1000:10,2000:60}))

Execute the above example and get the following results-

   one  two
0   10   10
1   20    0
2   30   30
3   40   40
4   50   50
5   60   60