The term "deduplication" means what it says: deleting duplicate data. You find the duplicate records in a data set and delete them, so that only one copy of each record remains. Duplicate data is a common problem in data analysis. Removing it saves memory and can improve write performance, and it also makes the data set more accurate, because the results are no longer skewed by repeated records.
The pandas DataFrame object provides a deduplication function, drop_duplicates(). This section describes how to use it in detail.
Function format
The syntax of the drop_duplicates() function is as follows:
df.drop_duplicates(subset=['A','B','C'],keep='first',inplace=True)
The parameters are described as follows:
- subset: the column label, or list of column labels, to compare when identifying duplicates. The default value is None, which means all columns are compared.
- keep: takes one of three values: 'first', 'last', or False. The default, 'first', keeps the first occurrence of each set of duplicates and deletes the rest. 'last' keeps the last occurrence and deletes the rest. False deletes every duplicated row and keeps none of them.
- inplace: Boolean. The default value is False, which returns a deduplicated copy and leaves the original data unchanged. If True, the duplicates are deleted directly from the original data and the function returns None.
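The parameter descriptions above can be checked with a small sketch. The DataFrame and the variable names here (first, last, neither, work) are made up for illustration and are not part of the original examples; rows 0 and 3 are identical in every column.

```python
import pandas as pd

# Illustrative data: rows 0 and 3 are identical in every column.
df = pd.DataFrame({"A": [1, 0, 1, 1], "B": [0, 2, 5, 0]})

# subset is left at its default (None), so all columns are compared.
first = df.drop_duplicates()                # keeps row 0, drops row 3
last = df.drop_duplicates(keep="last")      # keeps row 3, drops row 0
neither = df.drop_duplicates(keep=False)    # drops both row 0 and row 3

print(first.index.tolist())    # [0, 1, 2]
print(last.index.tolist())     # [1, 2, 3]
print(neither.index.tolist())  # [1, 2]

# inplace=True changes the DataFrame itself and returns None.
work = df.copy()
result = work.drop_duplicates(inplace=True)
print(result)                  # None
print(len(work), len(df))      # 3 4
```

Because inplace=True returns None, writing df = df.drop_duplicates(inplace=True) would replace df with None; use either assignment or inplace, not both.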
Practical application
First, create a DataFrame object with duplicate values, as follows:
import pandas as pd
data={
    'A':[1,0,1,1],
    'B':[0,2,5,0],
    'C':[4,0,4,4],
    'D':[1,0,1,1]
}
df=pd.DataFrame(data=data)
print(df)
Output result:
   A  B  C  D
0  1  0  4  1
1  0  2  0  0
2  1  5  4  1
3  1  0  4  1
1) The first occurrence of duplicates is retained by default
import pandas as pd
data={
    'A':[1,0,1,1],
    'B':[0,2,5,0],
    'C':[4,0,4,4],
    'D':[1,0,1,1]
}
df=pd.DataFrame(data=data)
# The first occurrence of duplicates is retained by default
print(df.drop_duplicates())
Output result:
   A  B  C  D
0  1  0  4  1
1  0  2  0  0
2  1  5  4  1
2) keep=False deletes all duplicates
import pandas as pd
data={
    'A':[1,0,1,1],
    'B':[0,2,5,0],
    'C':[4,0,4,4],
    'D':[1,0,1,1]
}
df=pd.DataFrame(data=data)
# keep=False deletes every duplicated row, including the first occurrence
print(df.drop_duplicates(keep=False))
Output result:
   A  B  C  D
1  0  2  0  0
2  1  5  4  1
3) Deduplication by a specified column label
import pandas as pd
data={
    'A':[1,3,3,3],
    'B':[0,1,2,0],
    'C':[4,5,4,4],
    'D':[3,3,3,3]
}
df=pd.DataFrame(data=data)
print(df)
# The result is not assigned and inplace is False, so df is unchanged
df.drop_duplicates(subset=['B'],keep=False)
print(df)
# Remove all duplicates; in column B, the two zeros are duplicates
df1=df.drop_duplicates(subset=['B'],keep=False)
print(df1)
# inplace=True modifies df directly
df.drop_duplicates(subset=['B'],keep=False,inplace=True)
print(df)
# Abbreviated form, omitting the subset keyword
#df.drop_duplicates(['B'],keep=False,inplace=True)
print(df1)
Output result:
   A  B  C  D
0  1  0  4  3
1  3  1  5  3
2  3  2  4  3
3  3  0  4  3
   A  B  C  D
0  1  0  4  3
1  3  1  5  3
2  3  2  4  3
3  3  0  4  3
   A  B  C  D
1  3  1  5  3
2  3  2  4  3
   A  B  C  D
1  3  1  5  3
2  3  2  4  3
   A  B  C  D
1  3  1  5  3
2  3  2  4  3
As the example above shows, the row labels keep their original values after duplicates are deleted and do not restart from 0. To renumber the index from 0, use the reset_index() function provided by pandas, as follows:
import pandas as pd
data={
    'A':[1,3,3,3],
    'B':[0,1,2,0],
    'C':[4,5,4,4],
    'D':[3,3,3,3]
}
df=pd.DataFrame(data=data)
# Remove all duplicates; in column B, the two zeros are duplicates
df=df.drop_duplicates(subset=['B'],keep=False)
# Reset the index so that it restarts from 0; drop=True discards the old index
print(df.reset_index(drop=True))
Output result:
   A  B  C  D
0  3  1  5  3
1  3  2  4  3
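As an alternative to calling reset_index() afterwards, drop_duplicates() itself accepts an ignore_index parameter. The sketch below assumes pandas 1.0 or later, where this parameter is available; the names two_step and one_step are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 3, 3, 3],
    "B": [0, 1, 2, 0],
    "C": [4, 5, 4, 4],
    "D": [3, 3, 3, 3],
})

# Two-step form: deduplicate on column B, then renumber the rows.
two_step = df.drop_duplicates(subset=["B"], keep=False).reset_index(drop=True)

# One-step form: ignore_index=True relabels the result 0, 1, ...
# (assumes pandas >= 1.0).
one_step = df.drop_duplicates(subset=["B"], keep=False, ignore_index=True)

print(one_step)
print(one_step.index.tolist())  # [0, 1]
```

Both forms leave the same two rows (B equal to 1 and 2); the one-step form simply avoids the extra call.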
4) Deduplication on multiple columns at the same time
Create a DataFrame object as follows:
import pandas as pd
df = pd.DataFrame({
    'Country ID':[1,1,2,12,34,23,45,34,23,12,2,3,4,1],
    'Age':[12,12,15,18,19,25,21,25,25,18,25,12,32,18],
    'Group ID':['a','z','c','a','b','s','d','a','b','s','a','d','a','f']
})
# keep='last' keeps only the last occurrence of each duplicate
print(df.drop_duplicates(['Age','Group ID'],keep='last'))
Output result:
    Country ID  Age Group ID
0            1   12        a
1            1   12        z
2            2   15        c
3           12   18        a
4           34   19        b
5           23   25        s
6           45   21        d
8           23   25        b
9           12   18        s
10           2   25        a
11           3   12        d
12           4   32        a
13           1   18        f
In the data set above, rows 7 and 10 have the same values in the Age and Group ID columns (25 and 'a'). Because keep='last' retains the last occurrence, row 10 is kept and row 7 is deleted.
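To confirm which rows are treated as duplicates before deleting anything, the related duplicated() method returns a Boolean mask using the same subset and keep parameters. This is a small sketch on the same data; the names cols, mask, and dropped are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Country ID": [1, 1, 2, 12, 34, 23, 45, 34, 23, 12, 2, 3, 4, 1],
    "Age": [12, 12, 15, 18, 19, 25, 21, 25, 25, 18, 25, 12, 32, 18],
    "Group ID": ["a", "z", "c", "a", "b", "s", "d", "a", "b", "s", "a", "d", "a", "f"],
})

cols = ["Age", "Group ID"]

# keep=False marks every row that belongs to a duplicate group.
mask = df.duplicated(subset=cols, keep=False)
print(df[mask].index.tolist())  # [7, 10]

# Row labels present in df but missing after keep='last' deduplication.
dropped = df.index.difference(df.drop_duplicates(cols, keep="last").index)
print(dropped.tolist())         # [7]
```

Note that Country ID differs between rows 7 and 10 (34 and 2), but it is not part of the subset, so it plays no role in the comparison.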