Check for missing values
To make it easier to detect missing values (and dtypes across different arrays), Pandas provides isnull() and notnull() functions, which are also methods for Series and DataFrame objects
Example:
import pandas as pd import numpy as np df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],columns=['one', 'two', 'three']) df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']) print (df['one'].isnull())
Execute the above example code and get the following results-
a False b True c False d True e False f False g True h False Name: one, dtype: bool
Calculation of missing data
- When summing data, NA will be treated as 0
- If the data is all NA, the result will be NA
Example:
import pandas as pd import numpy as np df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],columns=['one', 'two', 'three']) df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']) print (df['one'].sum())
Execute the above example code and get the following results-
-2.6163354325445014
Clean / fill missing data
Pandas provides various methods to clear missing values. The fillna() function can "fill" NA values with non null data in several ways
Replace NaN with scalar value
The following procedure shows how to replace NaN with 0.
import pandas as pd import numpy as np df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'],columns=['one', 'two', 'three']) df = df.reindex(['a', 'b', 'c']) print (df) print ("NaN replaced with '0':") print (df.fillna(0))
Execute the above example code and get the following results-
one two three a -0.479425 -1.711840 -1.453384 b NaN NaN NaN c -0.733606 -0.813315 0.476788 NaN replaced with '0': one two three a -0.479425 -1.711840 -1.453384 b 0.000000 0.000000 0.000000 c -0.733606 -0.813315 0.476788
Fill in NA forward and backward
-
pad/fill method forward
-
bfill/backfill fill method backward
Example:
import pandas as pd import numpy as np df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],columns=['one', 'two', 'three']) df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']) print (df.fillna(method='pad'))
Execute the above example code and get the following results-
one two three a 0.614938 -0.452498 -2.113057 b 0.614938 -0.452498 -2.113057 c -0.118390 1.333962 -0.037907 d -0.118390 1.333962 -0.037907 e 0.699733 0.502142 -0.243700 f 0.544225 -0.923116 -1.123218 g 0.544225 -0.923116 -1.123218 h -0.669783 1.187865 1.112835
Missing value
If you want to exclude only missing values, use the dropna function and axis parameters. By default, axis = 0, which is applied on the row, meaning that if any value in the row is NA, the entire row is excluded.
Example:
import pandas as pd import numpy as np df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],columns=['one', 'two', 'three']) df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']) print (df.dropna())
Execute the above example code and get the following results-
one two three a -0.719623 0.028103 -1.093178 c 0.040312 1.729596 0.451805 e -1.029418 1.920933 1.289485 f 1.217967 1.368064 0.527406 h 0.667855 0.147989 -1.035978
Replace missing (or) common values
Many times, you have to replace a generic value with a specific value. This can be achieved by applying replacement methods. Replacing NA with a scalar value is the equivalent behavior of the fillna() function.
Example:
import pandas as pd import numpy as np df = pd.DataFrame({'one':[10,20,30,40,50,2000], 'two':[1000,0,30,40,50,60]}) print (df.replace({1000:10,2000:60}))
Execute the above example and get the following results-
one two 0 10 10 1 20 0 2 30 30 3 40 40 4 50 50 5 60 60