These 20 Pandas functions can improve your "data cleaning" ability by 100 times

Posted by happyme on Wed, 15 Dec 2021 06:52:24 +0100

Hello, I'm Jiejie

Today I'm bringing you some seriously hardcore content! Pandas is a tool built on NumPy, created to solve data analysis tasks. It provides a large number of functions and methods that let us process data quickly and conveniently. The 20 functions introduced in this article (divided into 15 groups) are real data-processing powerhouses that you will love once you start using them.

Construct dataset

Here we first construct a data set to demonstrate these 20 functions.

import pandas as pd

#Build the raw data as a dict first, then turn it into a DataFrame
data = {'full name':[' Classmate Huang','Huang Zhizun','Huang Laoxie ','Da Mei Chen','Sun Shangxiang'],
        'English name':['Huang tong_xue','huang zhi_zun','Huang Lao_xie','Chen Da_mei','sun shang_xiang'],
        'Gender':['male','women','men','female','male'],
        'ID':['463895200003128433','429475199912122345','420934199110102311','431085200005230122','420953199509082345'],
        'height':['mid:175_good','low:165_bad','low:159_bad','high:180_verygood','low:172_bad'],
        'Home address':['Guangshui, Hubei','Xinyang, Henan','Guangxi Guilin','Hubei Xiaogan','Guangzhou, Guangdong'],
        'Telephone number':['13434813546','19748672895','16728613064','14561586431','19384683910'],
        'income':['1.1 ten thousand','8.5 thousand','0.9 ten thousand','6.5 thousand','2.0 ten thousand']}
df = pd.DataFrame(data)
df

1. cat function

This function is mainly used to concatenate strings:

df["full name"].str.cat(df["Home address"],sep='-'*3)
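
To make the behavior of `str.cat` a bit more concrete, here is a minimal sketch on two made-up Series (`names` and `cities` are illustrative and not part of the dataset above). It joins them element by element, fills a missing value through `na_rep`, and collapses a single Series into one string when no second Series is passed.

```python
import pandas as pd

# Illustrative data, not the article's dataset
names = pd.Series(["Huang", "Chen", None])
cities = pd.Series(["Hubei", "Henan", "Guangxi"])

# Element-wise concatenation; na_rep stands in for missing values
joined = names.str.cat(cities, sep="---", na_rep="?")

# With no second Series, cat joins all elements into a single string
collapsed = cities.str.cat(sep=",")

print(joined.tolist())
print(collapsed)
```

Without `na_rep`, a missing value on either side makes the whole result for that row missing.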

2. contains function

This function is mainly used to check whether each string contains a given substring or regular-expression pattern:

df["Home address"].str.contains("Guang")
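
One thing worth knowing is that `str.contains` treats its pattern as a regular expression by default. The sketch below, on an illustrative `addresses` Series, compares a plain substring, a case-insensitive search, and the difference between `"."` as a regex and as literal text.

```python
import pandas as pd

# Illustrative data modeled on the "Home address" column
addresses = pd.Series(["Guangshui, Hubei", "Xinyang, Henan", "Guangzhou, Guangdong"])

has_guang = addresses.str.contains("Guang")
has_hubei_ci = addresses.str.contains("hubei", case=False)

# As a regex, "." matches any character, so every row is True
any_char = addresses.str.contains(".")
# regex=False treats "." as a literal dot, which none of these rows contain
literal_dot = addresses.str.contains(".", regex=False)

print(has_guang.tolist(), has_hubei_ci.tolist())
print(any_char.tolist(), literal_dot.tolist())
```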

3. startswith and endswith functions

These functions are mainly used to check whether a string begins / ends with a given substring:

#Note: " Classmate Huang" in the first row begins with a space
df["full name"].str.startswith("Huang")
df["English name"].str.endswith("e")
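
Stray whitespace is the usual reason these checks return unexpected results. The following sketch, on an illustrative `names` Series, compares the result before and after stripping.

```python
import pandas as pd

# Illustrative data: the first name has a leading space, the last a trailing one
names = pd.Series([" Classmate Huang", "Huang Zhizun", "Huang Laoxie "])

raw_starts = names.str.startswith("Classmate")   # leading space blocks the match
raw_ends = names.str.endswith("xie")             # trailing space blocks the match

cleaned = names.str.strip()
clean_starts = cleaned.str.startswith("Classmate")
clean_ends = cleaned.str.endswith("xie")

print(raw_starts.tolist(), clean_starts.tolist())
print(raw_ends.tolist(), clean_ends.tolist())
```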

4. count function

This function is mainly used to count how many times a given character or pattern occurs in each string:

df["Telephone number"].str.count("3")
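
Because `str.count` interprets its argument as a regular expression, it can count more than a single character. A small sketch on two phone numbers taken from the dataset:

```python
import pandas as pd

phones = pd.Series(["13434813546", "19748672895"])

# Count a single digit
threes = phones.str.count("3")
# Count any character from a set, using a regex character class
threes_or_fours = phones.str.count(r"[34]")

print(threes.tolist())
print(threes_or_fours.tolist())
```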

5. get function

This function is mainly used to get the element at a specified position: a single character of a string, or one item of a list produced by split:

df["full name"].str.get(-1)
df["height"].str.split(":")
df["height"].str.split(":").str.get(0)
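
Chaining `split` and `get` is a handy way to pull a clean value out of a composite field. As a sketch (the variable names `level` and `cm` are my own), this separates the label from the numeric height and converts the latter to an integer:

```python
import pandas as pd

heights = pd.Series(["mid:175_good", "low:165_bad"])

# Part before the colon
level = heights.str.split(":").str.get(0)

# Part after the colon, then the piece before the underscore, as an integer
cm = heights.str.split(":").str.get(1).str.split("_").str.get(0).astype(int)

print(level.tolist())
print(cm.tolist())
```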

6. len function

This function is mainly used to calculate the length of each string:

df["Gender"].str.len()

7. upper and lower functions

These functions are mainly used to convert English letters to upper or lower case:

df["English name"].str.upper()
df["English name"].str.lower()

8. pad function with the side parameter / center function

This function is mainly used to pad a string with a given character on the left, on the right, or on both sides until it reaches the given width:

df["Home address"].str.pad(10,fillchar="*")      #side="left" is the default; equivalent to rjust()
df["Home address"].str.pad(10,side="right",fillchar="*")    #Equivalent to ljust()
df["Home address"].str.center(10,fillchar="*")    #Equivalent to pad(side="both")
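
The side names are easy to mix up: `side` says where the fill characters go, so padding on the left right-justifies the text. A quick sketch on an illustrative Series to confirm the equivalences:

```python
import pandas as pd

s = pd.Series(["ab", "abcd"])

left_padded = s.str.pad(6, fillchar="*")                 # default side="left"
right_padded = s.str.pad(6, side="right", fillchar="*")
centered = s.str.center(6, fillchar="*")

print(left_padded.tolist())
print(right_padded.tolist())
print(centered.tolist())

# Padding on the left matches rjust; padding on the right matches ljust
same_as_rjust = left_padded.equals(s.str.rjust(6, fillchar="*"))
same_as_ljust = right_padded.equals(s.str.ljust(6, fillchar="*"))
```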

9. repeat function

This function is mainly used to repeat each string a given number of times:

df["Gender"].str.repeat(3)

10. slice_replace function

This function is mainly used to replace the characters in a given positional slice (start, stop) with a given string:

df["Telephone number"].str.slice_replace(4,8,"*"*4)
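
A typical use is masking part of a sensitive value. The sketch below masks the middle four digits, and also shows that omitting `stop` replaces everything from `start` to the end of the string (the `#` mask is just an illustrative choice).

```python
import pandas as pd

phones = pd.Series(["13434813546", "19748672895"])

# Replace positions 4..7 (stop is exclusive) with four asterisks
masked_middle = phones.str.slice_replace(4, 8, "****")

# With no stop, everything from start to the end is replaced
masked_tail = phones.str.slice_replace(start=-4, repl="####")

print(masked_middle.tolist())
print(masked_tail.tolist())
```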

11. replace function

This function is mainly used to replace every occurrence of a given substring with another string:

df["height"].str.replace(":","-")

This function also accepts a regular expression and replaces every match with the given string. Since pandas 2.0 the pattern is treated as literal text by default, so pass regex=True explicitly:

df["income"].str.replace(r"\d+\.\d+","regular",regex=True)
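
To see the difference between the two modes side by side, here is a minimal sketch on two values from the income column (the replacement string `"N"` is only a placeholder of mine):

```python
import pandas as pd

income = pd.Series(["1.1 ten thousand", "8.5 thousand"])

# regex=True: the pattern matches any decimal number
as_regex = income.str.replace(r"\d+\.\d+", "N", regex=True)

# regex=False: "." is a literal dot, so only the dots are removed
as_literal = income.str.replace(".", "", regex=False)

print(as_regex.tolist())
print(as_literal.tolist())
```

Passing `regex` explicitly keeps the code behaving the same way across pandas versions.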

12. split method + expand parameter

With expand=True, split is mainly used to spread one column into several columns:

#Common usage: each row becomes a list
df["height"].str.split(":")
#split with expand=True: each piece becomes its own column
df[["Height description","final height"]] = df["height"].str.split(":",expand=True)
df
#split combined with join: glue the pieces back together with a new separator
df["height"].str.split(":").str.join("?"*5)
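
As a small sketch of what `expand=True` returns, the example below names the resulting columns and uses the `n` parameter to cap the number of splits (the column names `level` and `detail` are illustrative):

```python
import pandas as pd

heights = pd.Series(["mid:175_good", "low:165_bad"])

# expand=True returns a DataFrame with one column per piece
parts = heights.str.split(":", expand=True)
parts.columns = ["level", "detail"]

# n=1 splits only at the first separator, keeping the rest intact
limited = pd.Series(["a:b:c"]).str.split(":", n=1, expand=True)

print(parts)
print(limited)
```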

13. strip, rstrip and lstrip functions

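These functions are mainly used to remove whitespace (including line breaks) from both ends (strip), the right end (rstrip) or the left end (lstrip) of a string:

#Compare the lengths before and after stripping
df["full name"].str.len()
df["full name"] = df["full name"].str.strip()
df["full name"].str.len()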
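
The example above only uses `strip`, so here is a short sketch of all three variants on illustrative data, plus the optional argument that strips specific characters instead of whitespace:

```python
import pandas as pd

names = pd.Series([" Classmate Huang", "Huang Laoxie ", "\tname\n"])

both = names.str.strip()     # both ends
left = names.str.lstrip()    # left end only
right = names.str.rstrip()   # right end only

# Passing characters strips those instead of whitespace
stars = pd.Series(["**x**"]).str.strip("*")

print(both.tolist())
print(left.tolist())
print(right.tolist())
print(stars.tolist())
```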

14. findall function

This function is mainly used to match each string against a regular expression and return a list of all matches:

df["height"]
df["height"].str.findall("[a-zA-Z]+")
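
Each row of the result is a Python list, which may hold zero, one or several matches. A minimal sketch on two values from the height column:

```python
import pandas as pd

heights = pd.Series(["mid:175_good", "low:165_bad"])

# All runs of letters in each string
words = heights.str.findall("[a-zA-Z]+")
# All runs of digits in each string
numbers = heights.str.findall(r"\d+")

print(words.tolist())
print(numbers.tolist())
```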

15. extract and extractall functions

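These functions take a regular expression and extract the text matched by its capture groups, so the pattern must contain at least one pair of parentheses. extract returns only the first match in each string, while extractall returns every match:

df["height"].str.extract("([a-zA-Z]+)")
#extractall returns all matches, indexed by a MultiIndex (original row, match number)
df["height"].str.extractall("([a-zA-Z]+)")
#extract with two capture groups and the expand parameter: one column per group
df["height"].str.extract("([a-zA-Z]+).*?([a-zA-Z]+)",expand=True)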
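
To illustrate the shapes these functions return, the sketch below uses named capture groups (the names `level`, `cm` and `rating` are my own) so the extracted columns get readable labels, and then inspects the MultiIndex produced by `extractall`.

```python
import pandas as pd

heights = pd.Series(["mid:175_good", "low:165_bad"])

# Named groups become column names in the resulting DataFrame
fields = heights.str.extract(r"(?P<level>[a-zA-Z]+):(?P<cm>\d+)_(?P<rating>[a-zA-Z]+)")

# With a single group and expand=False, extract returns a Series
first_word = heights.str.extract("([a-zA-Z]+)", expand=False)

# extractall returns one row per match, indexed by (original row, match number)
all_words = heights.str.extractall("([a-zA-Z]+)")

print(fields)
print(first_word.tolist())
print(all_words)
```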

If you found this article useful, don't forget to like, bookmark, and share it, because that is the strongest motivation for me to keep producing more high-quality articles!

Topics: Python Data Mining