DataFrame: data deduplication

Posted by phrygius on Sun, 30 Jan 2022 12:20:46 +0100

Deduplication means what the name suggests: deleting duplicate data. You find the duplicated records in a data set and delete them so that only one copy of each remains. Deleting duplicate data is a common task in data analysis. It can save memory and improve write performance, and it also improves the accuracy of the data set, because repeated rows no longer skew the results.

The pandas DataFrame object provides a deduplication function, drop_duplicates(). This section describes how to use it in detail.

Function format

The syntax of the drop_duplicates() function is as follows:

df.drop_duplicates(subset=['A','B','C'],keep='first',inplace=True)

The parameters are described as follows:

subset: the column label or list of column labels used to identify duplicates. The default value is None, which means that all columns are compared.
keep: accepts one of three values: 'first', 'last' and False. The default is 'first', which keeps the first occurrence of each duplicate and deletes the rest. 'last' keeps the last occurrence and deletes the rest. False deletes all duplicates.
inplace: Boolean parameter. The default value is False, which means that a copy is returned after the duplicates are deleted. If it is True, the duplicates are deleted directly from the original data and the function returns None.
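As a minimal sketch of how the keep and inplace parameters behave, the snippet below runs drop_duplicates() on a small hypothetical frame (the column names and values are made up for illustration) and records which row labels survive under each setting.

```python
import pandas as pd

# Hypothetical sample data: rows 0 and 2 are identical.
df = pd.DataFrame({'A': [1, 2, 1], 'B': ['x', 'y', 'x']})

# keep='first' (the default) keeps row 0 and drops row 2.
first_idx = df.drop_duplicates(keep='first').index.tolist()
# keep='last' keeps row 2 and drops row 0.
last_idx = df.drop_duplicates(keep='last').index.tolist()
# keep=False drops every row that has a duplicate.
none_idx = df.drop_duplicates(keep=False).index.tolist()
print(first_idx, last_idx, none_idx)

# inplace=True modifies the frame itself and returns None.
df_copy = df.copy()
result = df_copy.drop_duplicates(inplace=True)
print(result, len(df_copy))
```

Because the inplace=True call returns None, assigning its result back to the original variable would discard the data, so it is safer to pick one style: either reassign the returned copy or use inplace=True without assignment.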

Practical application

First, create a DataFrame object with duplicate values, as follows:

import pandas as pd
data={
'A':[1,0,1,1],
'B':[0,2,5,0],
'C':[4,0,4,4],
'D':[1,0,1,1]
}
df=pd.DataFrame(data=data)
print(df)

Output result:

   A  B  C  D
0  1  0  4  1
1  0  2  0  0
2  1  5  4  1
3  1  0  4  1

1) The first occurrence of duplicates is retained by default

import pandas as pd
data={
    'A':[1,0,1,1],
    'B':[0,2,5,0],
    'C':[4,0,4,4],
    'D':[1,0,1,1]
}
df=pd.DataFrame(data=data)
#The first occurrence of duplicates is retained by default
df.drop_duplicates()

Output result:

   A  B  C  D
0  1  0  4  1
1  0  2  0  0
2  1  5  4  1

2) keep=False deletes all duplicates

import pandas as pd
data={
'A':[1,0,1,1],
'B':[0,2,5,0],
'C':[4,0,4,4],
'D':[1,0,1,1]
}
df=pd.DataFrame(data=data)
#keep=False deletes every row that has a duplicate, including the first occurrence
df.drop_duplicates(keep=False)

Output result:

  A B C D
1 0 2 0 0
2 1 5 4 1

3) Deduplication by specified column labels

import pandas as pd
data={
'A':[1,3,3,3],
'B':[0,1,2,0],
'C':[4,5,4,4],
'D':[3,3,3,3]
}
df=pd.DataFrame(data=data)
print(df)
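#Without inplace=True the result is returned as a copy, so df itself is unchanged
df.drop_duplicates(subset=['B'],keep=False)
print(df)
#Remove all duplicates: in column B, the two 0 values are duplicates
df1=df.drop_duplicates(subset=['B'],keep=False)
print(df1)
#inplace=True deletes the duplicates directly in df
df.drop_duplicates(subset=['B'],keep=False,inplace=True)
print(df)
#Shorthand: pass the column list positionally and omit the subset keyword
#df.drop_duplicates(['B'],keep=False,inplace=True)
print(df1)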

Output result:

   A  B  C  D
0  1  0  4  3
1  3  1  5  3
2  3  2  4  3
3  3  0  4  3
   A  B  C  D
0  1  0  4  3
1  3  1  5  3
2  3  2  4  3
3  3  0  4  3
   A  B  C  D
1  3  1  5  3
2  3  2  4  3
   A  B  C  D
1  3  1  5  3
2  3  2  4  3
   A  B  C  D
1  3  1  5  3
2  3  2  4  3

As the example above shows, after the duplicates are deleted the row labels keep their original numbers and do not restart from 0. How can the index be reset so that it starts from 0 again? The reset_index() function provided by pandas does this. With drop=True it discards the old labels instead of keeping them as a new column, as follows:

import pandas as pd

data={
    'A':[1,3,3,3],
    'B':[0,1,2,0],
    'C':[4,5,4,4],
    'D':[3,3,3,3]
}
df=pd.DataFrame(data=data)
#Remove all duplicates: in column B, the two 0 values are duplicates
df=df.drop_duplicates(subset=['B'],keep=False)
#Reset the index so that it restarts from 0; drop=True discards the old labels
df.reset_index(drop=True)

Output result:

  A B C D
0 3 1 5 3
1 3 2 4 3
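As an alternative sketch, and assuming pandas 1.0 or later, drop_duplicates() itself accepts an ignore_index parameter that renumbers the result in the same call. The snippet below reuses the data from the example above and compares the two approaches.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 3, 3, 3],
    'B': [0, 1, 2, 0],
    'C': [4, 5, 4, 4],
    'D': [3, 3, 3, 3],
})

# Two-step version: drop the duplicates, then renumber the rows.
two_step = df.drop_duplicates(subset=['B'], keep=False).reset_index(drop=True)

# One-step version; assumes pandas >= 1.0, where ignore_index is available.
one_step = df.drop_duplicates(subset=['B'], keep=False, ignore_index=True)

print(two_step)
print(one_step)
```

Both frames should contain the same two rows, labeled 0 and 1. On older pandas versions the ignore_index argument is not available, and the two-step form is the one to use.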

4) Deduplication on multiple columns at the same time

Create a DataFrame object as follows:

import pandas as pd
df = pd.DataFrame({'Country ID':[1,1,2,12,34,23,45,34,23,12,2,3,4,1],
'Age':[12,12,15,18, 19, 25, 21, 25, 25, 18, 25,12,32,18],
'Group ID':['a','z','c','a','b','s','d','a','b','s','a','d','a','f']})
#keep='last' retains only the last occurrence of each duplicate
df.drop_duplicates(['Age','Group ID'],keep='last')

Output result:

    Country ID  Age  Group ID
0            1   12         a
1            1   12         z
2            2   15         c
3           12   18         a
4           34   19         b
5           23   25         s
6           45   21         d
8           23   25         b
9           12   18         s
10           2   25         a
11           3   12         d
12           4   32         a
13           1   18         f

In the data set above, rows 7 and 10 have the same values in the Age and Group ID columns (25 and a). Because keep='last' retains the last occurrence of a duplicate, row 7 is deleted and row 10 is kept.
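To check which rows count as duplicates before deleting anything, one option is the related duplicated() method, which returns a Boolean mask instead of a filtered frame. The sketch below applies it to the same data with the same subset and keep arguments; the variable names are chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Country ID': [1, 1, 2, 12, 34, 23, 45, 34, 23, 12, 2, 3, 4, 1],
    'Age': [12, 12, 15, 18, 19, 25, 21, 25, 25, 18, 25, 12, 32, 18],
    'Group ID': ['a', 'z', 'c', 'a', 'b', 's', 'd', 'a', 'b', 's', 'a', 'd', 'a', 'f'],
})

# True marks the rows that drop_duplicates() would remove
# for the same subset and keep arguments.
mask = df.duplicated(subset=['Age', 'Group ID'], keep='last')
dropped_labels = df.index[mask].tolist()
print(dropped_labels)

# Filtering with the inverted mask keeps the same rows as drop_duplicates().
kept = df[~mask]
deduped = df.drop_duplicates(['Age', 'Group ID'], keep='last')
print(kept.index.tolist() == deduped.index.tolist())
```

Inspecting the mask first is a cheap way to confirm that the chosen subset matches the intended notion of a duplicate before any rows are removed.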
