Python Topic 9: Advanced use of the popular jieba library: common and unique words

Posted by coderWil on Sun, 05 Dec 2021 04:16:56 +0100

There are two text files, extracted from the 2019 and 2018 government work reports. For each file, count the ten most frequent words as its subject words; every counted word must be at least 2 characters long.

Output example: 2019:reform:10,enterprise:9,...,deepening:2

Then the two groups of words need to be associated: compare them and output the words they share and the words unique to each group.

Output example:

Common words:reform,...,deepening

2019 unique:enterprises,...,strengthen

2018 unique:benefits,...,innovation

Contents

Preface

1. Approach

2. Steps

1. Word segmentation with jieba and word-frequency statistics

2. Data association

2.1 Counting word frequencies

2.2 Filtering common words

2.3 Printing the common words in a loop

2.4 Printing the unique words in a loop

3. Consolidated code

Summary





Preface

In the first few topics, the jieba library was only used for simple word segmentation and word-frequency statistics. In this topic it is put to a new use: outputting the common words and the unique words of two groups of subject words.





1. Approach

The method of counting word frequencies was already described in Topic 6; readers who need a refresher can review it there. The main difficulty of this problem is the second requirement, data association: outputting the common and unique words. First an empty dictionary is used to collect each year's subject words, then the common words are found by a simple traversal and comparison, and finally the unique words are found by comparing the common words against each year's subject words. The whole process is a little cumbersome.
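As a minimal, hedged sketch of the comparison idea on made-up data (the dictionaries and words here are only an illustration, not the real report data):

da = {0: 'reform', 1: 'enterprise'}   # made-up 2019 top words, keyed by rank as in the code below
db = {0: 'reform', 1: 'benefit'}      # made-up 2018 top words

# a word present in both dictionaries counts as a common word
common = [w for w in da.values() if w in db.values()]
print(common)   # ['reform']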





2. Steps





1. Word segmentation with jieba and word-frequency statistics

The code is as follows (example):

import jieba  # import the jieba word-segmentation library
f1 = open('data2019.txt', 'r')
d = {}
txt = f1.read()             # read the whole text
words = jieba.lcut(txt)     # segment the whole text into words
for word in words:
    if len(word) == 1:      # skip single-character words
        continue
    else:
        d[word] = d.get(word, 0) + 1
lt = list(d.items())
lt.sort(key=lambda x: x[1], reverse=True)
print('2019:', end='')
for i in range(10):
    if i < 9:               # the last word must not be followed by ',', so a conditional branch is used
        print('{}:{}'.format(lt[i][0], lt[i][1]), end=',')
    else:
        print('{}:{}'.format(lt[i][0], lt[i][1]))
f1.close()
f2 = open('data2018.txt', 'r')  # repeat the same steps for the 2018 file
d = {}
txt = f2.read()             # read the whole text
words = jieba.lcut(txt)     # segment the whole text into words
for word in words:
    if len(word) == 1:
        continue
    else:
        d[word] = d.get(word, 0) + 1
lt = list(d.items())
lt.sort(key=lambda x: x[1], reverse=True)
print('2018:', end='')
for i in range(10):
    if i < 9:
        print('{}:{}'.format(lt[i][0], lt[i][1]), end=',')
    else:
        print('{}:{}'.format(lt[i][0], lt[i][1]))
f2.close()

Although the code above is lengthy, it simply consists of two nearly identical parts, one for 2019 and one for 2018.

In this part of the code:

for i in range(10):
    if i < 9:               # the last word must not be followed by ',', so a conditional branch is used
        print('{}:{}'.format(lt[i][0], lt[i][1]), end=',')
    else:
        print('{}:{}'.format(lt[i][0], lt[i][1]))

the conditional branch is used when printing so that the output meets the problem's format requirements: an ASCII colon and comma are used, there is no space before or after punctuation, the words are separated by commas, and no comma follows the last word.
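As a hedged aside, the same format requirement could also be met by building the ten "word:count" strings first and joining them with commas, so that no trailing comma can appear (lt is the sorted list from the code above):

print('2019:' + ','.join('{}:{}'.format(w, c) for w, c in lt[:10]))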

For readers who are not yet familiar with format, here is a short explanation:

Since Python 2.6, strings have had a str.format() method, which enhances string formatting.

Its basic syntax uses {} and : in place of the older % formatting.

format() accepts any number of arguments, and they can be referenced out of order.

For example:

>>>"{} {}".format("hello", "world")    # Do not set the specified location, in the default order
'hello world'
 
>>> "{0} {1}".format("hello", "world")  # Set specified location
'hello world'
 
>>> "{1} {0} {1}".format("hello", "world")  # Set specified location
'world hello world'
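Arguments can also be passed by keyword and referenced by name:

>>> "{word}:{count}".format(word="reform", count=10)  # arguments referenced by name
'reform:10'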





2. Data association

2.1 Counting word frequencies

This part is similar to the first section, for example:

# 2019 subject words
import jieba
f = open('data2019.txt', 'r')
txt = f.read()
f.close()
words = jieba.lcut(txt)
d = {}                      # empty dictionary for all 2019 word frequencies
for word in words:
    if len(word) < 2:
        continue
    else:
        d[word] = d.get(word, 0) + 1
ls = list(d.items())
ls.sort(key=lambda x: x[1], reverse=True)
print('2019:', end='')
da = {}                     # da stores the top ten subject words of 2019
for i in range(10):
    da[i] = ls[i][0]        # store the top ten subject words in the dictionary

Compared with the first part, the print statements for the individual words are dropped; instead, the top-ten subject words are stored in a variable so they can be compared later.

# Find the 2018 subject words
f = open('data2018.txt', 'r')
txt = f.read()
f.close()
words = jieba.lcut(txt)
d = {}                      # empty dictionary for all 2018 word frequencies
for word in words:
    if len(word) < 2:
        continue
    else:
        d[word] = d.get(word, 0) + 1
ls = list(d.items())
ls.sort(key=lambda x: x[1], reverse=True)
print('2018:', end='')
db = {}                     # db stores the top ten subject words of 2018
for i in range(10):
    db[i] = ls[i][0]        # store the top ten subject words in the dictionary

At this point, the top ten subject words of 2019 and 2018 have been stored in da and db.
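As a hedged aside, the same association could also be expressed with set operations on the stored words; this is only a sketch, and the walkthrough below sticks to the dictionary traversal:

common = set(da.values()) & set(db.values())   # words in both top-ten groups
only_2019 = set(da.values()) - common          # words only in the 2019 top ten
only_2018 = set(db.values()) - common          # words only in the 2018 top ten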

2.2 Filtering common words

# Find the m common words, store them in gy, and blank out the common words in da and db
gy = {}
m = 0
for i in range(10):
    for j in range(10):               # compare the two top-ten groups word by word
        if da[i] == db[j]:
            gy[m] = da[i]
            da[i] = ''
            db[j] = ''                # blank out the common word once it has been saved
            m = m + 1                 # move on to the next slot in gy
            break                     # a match was found, no need to keep comparing this word
print('Common words:', end='')

2.3 Printing the common words in a loop

# Print the common words in a loop
for i in range(m):
    if i < m - 1:
        print('{}'.format(gy[i]), end=',')
    else:
        print('{}'.format(gy[i]))

2.4 Printing the unique words in a loop

# Print the words unique to 2019
print('2019 unique:', end='')
j = 0
for i in range(10):
    if da[i] != '':                   # skip the entries that were blanked out as common words
        if j < 10 - m - 1:
            print('{}'.format(da[i]), end=',')
        else:
            print('{}'.format(da[i]))
        j = j + 1

A note on the condition j < 10 - m - 1: m is the number of common words, so each year has 10 - m unique words, printed at positions j = 0 to 10 - m - 1. The extra -1 makes the last unique word fall into the else branch, so it is printed without a trailing comma.
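As a quick check of that bound with a made-up value of m:

m = 3                  # suppose 3 common words were found (made-up value)
unique_count = 10 - m  # then 7 unique words remain for the year
# j runs from 0 to unique_count - 1 == 6; the comma branch fires while j < 6,
# so the last word (j == 6) is printed by the else branch without a comma.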

# Print the words unique to 2018
print('2018 unique:', end='')
j = 0
for i in range(10):
    if db[i] != '':
        if j < 10 - m - 1:
            print('{}'.format(db[i]), end=',')
        else:
            print('{}'.format(db[i]))
        j = j + 1

3. Consolidated code

import jieba

# Count the 2019 subject words
f = open('data2019.txt', 'r')
txt = f.read()
f.close()
words = jieba.lcut(txt)
d = {}                      # empty dictionary for all 2019 word frequencies
for word in words:
    if len(word) < 2:
        continue
    else:
        d[word] = d.get(word, 0) + 1
ls = list(d.items())
ls.sort(key=lambda x: x[1], reverse=True)

da = {}                     # da stores the top ten subject words of 2019
for i in range(10):
    da[i] = ls[i][0]

# Then count the 2018 subject words
f = open('data2018.txt', 'r')
txt = f.read()
f.close()
words = jieba.lcut(txt)
d = {}                      # empty dictionary for all 2018 word frequencies
for word in words:
    if len(word) < 2:
        continue
    else:
        d[word] = d.get(word, 0) + 1
ls = list(d.items())
ls.sort(key=lambda x: x[1], reverse=True)

db = {}                     # db stores the top ten subject words of 2018
for i in range(10):
    db[i] = ls[i][0]

# Find the m common words, store them in gy, and blank them out in da and db
gy = {}
m = 0
for i in range(10):
    for j in range(10):
        if da[i] == db[j]:
            gy[m] = da[i]
            da[i] = ''
            db[j] = ''      # blank out the common word once it has been saved
            m = m + 1       # move on to the next slot in gy
            break

# Print the common words in a loop
print('Common words:', end='')
for i in range(m):
    if i < m - 1:
        print('{}'.format(gy[i]), end=',')
    else:
        print('{}'.format(gy[i]))

# Print the words unique to 2019
print('2019 unique:', end='')
j = 0
for i in range(10):
    if da[i] != '':
        if j < 10 - m - 1:
            print('{}'.format(da[i]), end=',')
        else:
            print('{}'.format(da[i]))
        j = j + 1

# Print the words unique to 2018
print('2018 unique:', end='')
j = 0
for i in range(10):
    if db[i] != '':
        if j < 10 - m - 1:
            print('{}'.format(db[i]), end=',')
        else:
            print('{}'.format(db[i]))
        j = j + 1
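Finally, as a hedged, more compact sketch of the same task using collections.Counter and set operations; this is not the original solution, just an alternative, and the helper name top_words is only an illustration:

import jieba
from collections import Counter

def top_words(path, n=10):
    # count words of at least two characters and return the n most frequent ones
    with open(path, 'r') as f:
        words = jieba.lcut(f.read())
    counts = Counter(w for w in words if len(w) >= 2)
    return [w for w, _ in counts.most_common(n)]

top2019 = top_words('data2019.txt')
top2018 = top_words('data2018.txt')
common = set(top2019) & set(top2018)

print('Common words:' + ','.join(common))
print('2019 unique:' + ','.join(w for w in top2019 if w not in common))
print('2018 unique:' + ','.join(w for w in top2018 if w not in common))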





Summary

jieba is one of the most widely used Python libraries for Chinese word segmentation, with fun applications such as generating word clouds. Beginners should practise looping word-frequency statistics and the data association shown here until they come naturally.

Topics: Python Data Analysis Data Mining