These 20 Pandas functions can make your data cleaning 100 times more efficient

Posted by icm on Wed, 15 Dec 2021 16:09:12 +0100

Hello, I'm Jiejie

Today I'm going to share a really in-depth, hands-on write-up!

Pandas is a NumPy-based tool built for data analysis tasks. It provides a large number of functions and methods that let us process data quickly and conveniently.

The 20 functions introduced in this article (grouped into 15 sections) are real workhorses for data processing, and you will love them once you start using them.

Construct dataset

Here we first construct a data set to demonstrate these 20 functions.

import pandas as pd

#Build the demo data as a dict first, then convert it to a DataFrame
#Note: some names deliberately carry a leading or trailing space
data = {'full name':[' Classmate Huang','Huang Zhizun','Huang Laoxie ','Da Mei Chen','Sun Shangxiang'],
        'English name':['Huang tong_xue','huang zhi_zun','Huang Lao_xie','Chen Da_mei','sun shang_xiang'],
        'Gender':['male','female','male','female','male'],
        'ID':['463895200003128433','429475199912122345','420934199110102311','431085200005230122','420953199509082345'],
        'height':['mid:175_good','low:165_bad','low:159_bad','high:180_verygood','low:172_bad'],
        'Home address':['Guangshui, Hubei','Xinyang, Henan','Guangxi Guilin','Hubei Xiaogan','Guangzhou, Guangdong'],
        'Telephone number':['13434813546','19748672895','16728613064','14561586431','19384683910'],
        'income':['1.1 ten thousand','8.5 thousand','0.9 ten thousand','6.5 thousand','2.0 ten thousand']}
df = pd.DataFrame(data)
df

1. cat function

This function is mainly used to concatenate strings, element by element;

df["full name"].str.cat(df["Home address"],sep='-'*3)
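For anyone who wants to see the result without a screenshot, here is a minimal sketch of str.cat on two short Series. The sample values are made up for illustration, and na_rep is an optional parameter that the example above does not use.

```python
import pandas as pd

# Made-up sample values, not the article's dataset
names = pd.Series(["Huang Zhizun", "Sun Shangxiang", None])
cities = pd.Series(["Xinyang", "Guangzhou", "Wuhan"])

# Element-wise concatenation with a separator;
# a missing value on either side gives a missing result
joined = names.str.cat(cities, sep="---")

# na_rep substitutes a placeholder for missing values instead
joined_filled = names.str.cat(cities, sep="---", na_rep="?")

print(joined.tolist())
print(joined_filled.tolist())
```

If the two Series have different indexes, it is worth checking how they align before concatenating.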

2. contains function

This function is mainly used to check whether each string contains a given substring or pattern;

df["Home address"].str.contains("Guang")
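A small sketch of the options that tend to matter in cleaning work: regex=False for a literal match, case=False for a case-insensitive one, and na=False so that missing values count as "no match". The sample addresses are made up.

```python
import pandas as pd

# Made-up sample values, including one missing entry
addresses = pd.Series(["Guangshui, Hubei", "Xinyang, Henan", None, "guangzhou"])

# Literal, case-sensitive match; missing values are reported as False
literal = addresses.str.contains("Guang", regex=False, na=False)

# Case-insensitive match; missing values are reported as False
relaxed = addresses.str.contains("guang", case=False, na=False)

print(literal.tolist())  # [True, False, False, False]
print(relaxed.tolist())  # [True, False, False, True]

# The boolean result can be used directly as a row filter
print(addresses[relaxed].tolist())  # ['Guangshui, Hubei', 'guangzhou']
```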

3. startswith and endswith functions

These functions are mainly used to check whether a string starts or ends with a given substring;

#Note that ' Classmate Huang' in the first row begins with a space
df["full name"].str.startswith("Huang")
df["English name"].str.endswith("e")
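Stray whitespace is the usual reason these checks fail unexpectedly. The sketch below, on made-up names, compares the result before and after calling strip.

```python
import pandas as pd

# Made-up names: one with a leading space, one with a trailing space
names = pd.Series([" Huang Zhizun", "Huang Laoxie ", "Sun Shangxiang"])

raw_start = names.str.startswith("Huang")
raw_end = names.str.endswith("xie")

# Strip surrounding whitespace first, then test again
clean = names.str.strip()
clean_start = clean.str.startswith("Huang")
clean_end = clean.str.endswith("xie")

print(raw_start.tolist())    # [False, True, False]
print(clean_start.tolist())  # [True, True, False]
print(raw_end.tolist())      # [False, False, False]
print(clean_end.tolist())    # [False, True, False]
```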

4. count function

This function is mainly used to count how many times a given character or pattern occurs in each string;

df["Telephone number"].str.count("3")
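The argument to count is interpreted as a regular expression, so a character class can count several characters at once. A minimal sketch on two of the phone numbers from the dataset:

```python
import pandas as pd

phones = pd.Series(["13434813546", "19748672895"])

# Occurrences of the single digit 3
threes = phones.str.count("3")

# Occurrences of either 3 or 4, using a regex character class
threes_or_fours = phones.str.count(r"[34]")

print(threes.tolist())           # [3, 0]
print(threes_or_fours.tolist())  # [6, 1]
```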

5. get function

This function is mainly used to get the element at a specified position: a single character for a plain string, or one list item after a split;

df["full name"].str.get(-1)
df["height"].str.split(":")
df["height"].str.split(":").str.get(0)
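To make the two uses of get concrete, here is a short sketch on values shaped like the height column. It also shows that a position outside the string gives a missing value rather than an error.

```python
import pandas as pd

heights = pd.Series(["mid:175_good", "low:165_bad"])

# On plain strings, get returns one character
first_char = heights.str.get(0)

# After split, each element is a list, and get returns one list item
second_part = heights.str.split(":").str.get(1)

# A position that does not exist yields a missing value
out_of_range = heights.str.get(50)

print(first_char.tolist())   # ['m', 'l']
print(second_part.tolist())  # ['175_good', '165_bad']
print(out_of_range.isna().all())
```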

6. len function

This function is mainly used to calculate the string length;

df["Gender"].str.len()

7. upper and lower functions

These functions are mainly used to convert English letters to upper or lower case;

df["English name"].str.upper()
df["English name"].str.lower()

8. pad function with side parameter / center function

These functions are mainly used to pad a string to a given width by adding a fill character on the left, on the right, or on both sides;

df["Home address"].str.pad(10,fillchar="*")      #side="left" is the default; equivalent to rjust()
df["Home address"].str.pad(10,side="right",fillchar="*")    #Equivalent to ljust()
df["Home address"].str.center(10,fillchar="*")
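As a quick sanity check on how pad relates to the justify methods, the sketch below compares them on made-up values. Filling on the left pushes the text to the right, so side="left" matches rjust and side="right" matches ljust.

```python
import pandas as pd

s = pd.Series(["ab", "abcd"])  # made-up sample values

left = s.str.pad(6, side="left", fillchar="*")    # fill on the left
right = s.str.pad(6, side="right", fillchar="*")  # fill on the right
both = s.str.pad(6, side="both", fillchar="*")    # fill on both sides

print(left.tolist())   # ['****ab', '**abcd']
print(right.tolist())  # ['ab****', 'abcd**']
print(both.tolist())   # ['**ab**', '*abcd*']

# Same results through the dedicated methods
print(left.equals(s.str.rjust(6, fillchar="*")))
print(right.equals(s.str.ljust(6, fillchar="*")))
print(both.equals(s.str.center(6, fillchar="*")))
```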

9. repeat function

This function is mainly used to repeat the string several times;

df["Gender"].str.repeat(3)

10. slice_replace function

This function is mainly used to replace the slice between two positions (start included, stop excluded) with a given string;

df["Telephone number"].str.slice_replace(4,8,"*"*4)
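A typical use is masking part of an identifier. The sketch below runs the same call on two phone numbers from the dataset and adds a variant without a stop position, which replaces everything from start to the end of the string.

```python
import pandas as pd

phones = pd.Series(["13434813546", "19748672895"])

# Replace the characters at positions 4 to 7 with four asterisks
masked_middle = phones.str.slice_replace(4, 8, "*" * 4)

# With no stop given, everything from position 7 onward is replaced
masked_tail = phones.str.slice_replace(start=7, repl="####")

print(masked_middle.tolist())  # ['1343****546', '1974****895']
print(masked_tail.tolist())    # ['1343481####', '1974867####']
```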

11. replace function

This function is mainly used to replace every occurrence of a given substring with another string;

df["height"].str.replace(":","-")

This function also accepts a regular expression as the pattern. Since pandas 2.0 the pattern is treated as a literal string by default, so pass regex=True explicitly.

df["income"].str.replace(r"\d+\.\d+","regular",regex=True)
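To see what the regex flag changes, here is a minimal sketch on two income strings from the dataset. The replacement text "N" is an arbitrary placeholder.

```python
import pandas as pd

income = pd.Series(["1.1 ten thousand", "8.5 thousand"])

# Pattern interpreted as a regular expression: the number is replaced
as_regex = income.str.replace(r"\d+\.\d+", "N", regex=True)

# Same pattern interpreted literally: nothing matches, nothing changes
as_literal = income.str.replace(r"\d+\.\d+", "N", regex=False)

# A literal replacement of the dot; as a regex, "." would match any character
dot_to_comma = income.str.replace(".", ",", regex=False)

print(as_regex.tolist())      # ['N ten thousand', 'N thousand']
print(as_literal.tolist())    # ['1.1 ten thousand', '8.5 thousand']
print(dot_to_comma.tolist())  # ['1,1 ten thousand', '8,5 thousand']
```

Stating regex explicitly in every call keeps the behavior the same across pandas versions.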

12. split method + expand parameter

This function is mainly used to split each string on a delimiter; with expand=True the pieces are returned as separate columns;

#Common usage
df["height"].str.split(":")
#split method with expand parameter
df[["Height description","final height"]] = df["height"].str.split(":",expand=True)
df
#split method with join method
df["height"].str.split(":").str.join("?"*5)
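Splitting is often the first step toward a numeric column. The sketch below splits values shaped like the height column twice and converts the middle piece to an integer. The column names level, detail, cm and rating are hypothetical, chosen only for this example, and the n parameter is shown as an optional extra.

```python
import pandas as pd

heights = pd.Series(["mid:175_good", "low:165_bad"])

# First split on ":" into two columns
parts = heights.str.split(":", expand=True)
parts.columns = ["level", "detail"]  # hypothetical column names

# Second split on "_" and convert the number to an integer
detail = parts["detail"].str.split("_", expand=True)
parts["cm"] = detail[0].astype(int)
parts["rating"] = detail[1]

# n limits the number of splits; the remainder stays in the last piece
limited = pd.Series(["a:b:c"]).str.split(":", n=1, expand=True)

print(parts)
print(limited.iloc[0].tolist())  # ['a', 'b:c']
```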

13. strip, rstrip and lstrip functions

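These functions are mainly used to remove whitespace, including line breaks, from the ends of a string: strip trims both ends, lstrip only the left end and rstrip only the right end;

#Length before stripping (two names carry an extra space)
df["full name"].str.len()
df["full name"] = df["full name"].str.strip()
#Length after stripping
df["full name"].str.len()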
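Since only strip appears in the example above, here is a short sketch of all three functions side by side, plus the optional argument that strips specific characters instead of whitespace. The sample values are made up.

```python
import pandas as pd

# Made-up values with spaces, a tab, a newline and asterisks
s = pd.Series(["  Huang ", "\tSun\n", "**Chen**"])

both_ends = s.str.strip()    # whitespace removed from both ends
left_only = s.str.lstrip()   # whitespace removed from the left end
right_only = s.str.rstrip()  # whitespace removed from the right end
stars = s.str.strip("*")     # only asterisks removed; whitespace is kept

print(both_ends.tolist())   # ['Huang', 'Sun', '**Chen**']
print(left_only.tolist())   # ['Huang ', 'Sun\n', '**Chen**']
print(right_only.tolist())  # ['  Huang', '\tSun', '**Chen**']
print(stars.tolist())       # ['  Huang ', '\tSun\n', 'Chen']
```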

14. findall function

This function is mainly used to match each string against a regular expression and return all matches as a list;

df["height"]
df["height"].str.findall("[a-zA-Z]+")
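A minimal sketch of what findall returns for values shaped like the height column: one list per row, which may hold several matches or none.

```python
import pandas as pd

heights = pd.Series(["mid:175_good", "low:165_bad", "unknown"])

# Every run of letters in each string
words = heights.str.findall(r"[a-zA-Z]+")

# Every run of digits; the last row has none, so its list is empty
numbers = heights.str.findall(r"\d+")

print([list(x) for x in words])    # [['mid', 'good'], ['low', 'bad'], ['unknown']]
print([list(x) for x in numbers])  # [['175'], ['165'], []]
```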

15. extract and extractall functions

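These functions take a regular expression and extract the text matched by its capture groups, so the pattern must contain at least one pair of parentheses;

#extract returns only the first match in each string
df["height"].str.extract("([a-zA-Z]+)")
#extractall returns every match, indexed by a MultiIndex (row, match number)
df["height"].str.extractall("([a-zA-Z]+)")
#extract with two capture groups: one column per group (expand=True is the default)
df["height"].str.extract("([a-zA-Z]+).*?([a-zA-Z]+)",expand=True)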
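To show the difference in shape between the two functions, here is a sketch on values shaped like the height column. The column names level, cm and rating are hypothetical labels assigned after extraction.

```python
import pandas as pd

heights = pd.Series(["mid:175_good", "low:165_bad"])

# extract: one row per input string, one column per capture group
fields = heights.str.extract(r"([a-z]+):(\d+)_([a-z]+)")
fields.columns = ["level", "cm", "rating"]  # hypothetical column names

# extractall: one row per match, with a (row, match number) MultiIndex
all_words = heights.str.extractall(r"([a-zA-Z]+)")

print(fields)
print(all_words.index.tolist())  # [(0, 0), (0, 1), (1, 0), (1, 1)]
print(all_words[0].tolist())     # ['mid', 'good', 'low', 'bad']
```

The extracted values are still strings, so a numeric column such as cm would need a conversion like astype(int) before doing arithmetic.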
