There are two text documents, extracted from the 2019 and 2018 government work reports. The task is to count the ten most frequent words in each file as that year's subject words, where each word must be at least 2 characters long.
Output example: 2019: reform:10, enterprise:9, ..., deepening:2
Then data association is required: compare the differences between the two groups of words and output the words common to both groups and the words unique to each.
Output example:
Common words: reform, ..., deepening
2019 unique: enterprises, ..., strengthen
2018 unique: benefits, ..., innovation
Contents
1. jieba library word segmentation and word-frequency statistics
2. Data association
2.1 Counting word frequency
2.2 Filtering common words
2.3 Extracting common words in a loop
2.4 Extracting unique words in a loop
3. Consolidation
Preface
In the earlier problems of this series, the jieba library was used only for simple word segmentation and word-frequency statistics. This problem adds a new twist: outputting the common and unique words of two groups of subject words.
1. Approach
How to count word frequency was already covered in problem 6 of this series; readers who are unsure can review it first. The main difficulty of this problem is the second requirement, data association: outputting the common and unique words. The approach is as follows. First, use an empty dictionary to collect each year's subject words. Then filter out the common words with a simple nested traversal and comparison. Finally, obtain each year's unique words by comparing the common words against that year's subject words. The whole process is somewhat tedious.
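As an aside, Python's built-in set operations can express the same comparison far more compactly. A minimal sketch, assuming the two top-ten lists are already available (the word lists below are made-up placeholders, not the real report data):

```python
# Hypothetical top-ten lists standing in for the real counting results
top_2019 = ["reform", "development", "enterprise", "policy", "promote",
            "economy", "strengthen", "construction", "government", "deepening"]
top_2018 = ["reform", "development", "benefits", "policy", "promote",
            "economy", "innovation", "construction", "government", "deepening"]

common = set(top_2019) & set(top_2018)   # intersection: words shared by both years
only_2019 = set(top_2019) - common       # difference: words unique to 2019
only_2018 = set(top_2018) - common       # difference: words unique to 2018

print("Common words:" + ",".join(sorted(common)))
print("2019 unique:" + ",".join(sorted(only_2019)))
print("2018 unique:" + ",".join(sorted(only_2018)))
```

The nested-loop method below does the same job by hand, which is instructive for beginners, but sets are the idiomatic tool for this kind of membership comparison.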
2. Steps
1. jieba library word segmentation and word-frequency statistics
The code is as follows (example):
import jieba  # import the word-segmentation library

f1 = open('data2019.txt', 'r', encoding='utf-8')
d = {}
txt = f1.read()           # read out the text content
words = jieba.lcut(txt)   # segment the whole text into words
for word in words:
    if len(word) == 1:
        continue
    else:
        d[word] = d.get(word, 0) + 1
lt = list(d.items())
lt.sort(key=lambda x: x[1], reverse=True)
print('2019:', end='')
for i in range(10):
    if i < 9:  # the last word must not be followed by ',', hence the branch
        print('{}:{}'.format(lt[i][0], lt[i][1]), end=',')
    else:
        print('{}:{}'.format(lt[i][0], lt[i][1]))
f1.close()

f2 = open('data2018.txt', 'r', encoding='utf-8')  # repeat the same steps for 2018
d = {}
txt = f2.read()           # read out the text content
words = jieba.lcut(txt)   # segment the whole text into words
for word in words:
    if len(word) == 1:
        continue
    else:
        d[word] = d.get(word, 0) + 1
lt = list(d.items())
lt.sort(key=lambda x: x[1], reverse=True)
print('2018:', end='')
for i in range(10):
    if i < 9:
        print('{}:{}'.format(lt[i][0], lt[i][1]), end=',')
    else:
        print('{}:{}'.format(lt[i][0], lt[i][1]))
f2.close()
Although the code above is lengthy, it splits naturally into two nearly identical halves, one for 2019 and one for 2018.
In the code:
for i in range(10):
    if i < 9:  # the last word must not be followed by ',', hence the branch
        print('{}:{}'.format(lt[i][0], lt[i][1]), end=',')
    else:
        print('{}:{}'.format(lt[i][0], lt[i][1]))
The conditional branch exists to satisfy the format required by the problem: "the output uses an English colon and English comma, with no spaces around punctuation; words are separated by commas, with no comma after the last word".
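As a side note, the conditional branch can be avoided entirely with str.join, which inserts the separator only between items; a minimal sketch, assuming lt already holds the sorted (word, count) pairs (the values here are placeholders):

```python
# Hypothetical sorted (word, count) pairs
lt = [("reform", 10), ("enterprise", 9), ("deepening", 2)]

# join puts ',' only *between* items, so the last word never gets a trailing comma
line = ",".join("{}:{}".format(w, c) for w, c in lt)
print("2019:" + line)
```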
Some readers may not be familiar with the usage of format, so a brief explanation:
Since Python 2.6, strings have had a str.format() method, which enhances string formatting.
Its basic syntax replaces the older % operator with {} placeholders and optional : format specifiers.
The format function accepts any number of arguments, and their positions can be referenced out of order.
For example:
>>> "{} {}".format("hello", "world")      # positions unspecified: default order
'hello world'
>>> "{0} {1}".format("hello", "world")    # explicit positions
'hello world'
>>> "{1} {0} {1}".format("hello", "world")  # positions may repeat and reorder
'world hello world'
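Beyond positional arguments, format also accepts keyword arguments and format specifiers after the colon; a few extra examples for reference:

```python
# Keyword arguments referenced by name
s1 = "{name} scored {count}".format(name="reform", count=10)

# Format specifiers after the colon: alignment/width and precision
s2 = "{:>8}".format("hi")       # right-align in a field 8 characters wide
s3 = "{:.2f}".format(3.14159)   # round to two decimal places

print(s1)
print(s2)
print(s3)
```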
2. Data Association
2.1 Counting word frequency
This part is similar to the first section:
# 2019 subject words
import jieba

f = open('data2019.txt', 'r', encoding='utf-8')
txt = f.read()
f.close()
words = jieba.lcut(txt)
d = {}  # empty dictionary holding counts for every 2019 candidate word
for word in words:
    if len(word) < 2:
        continue
    else:
        d[word] = d.get(word, 0) + 1
ls = list(d.items())
ls.sort(key=lambda x: x[1], reverse=True)
print('2019:', end='')
da = {}  # dictionary da stores the top ten 2019 subject words
for i in range(10):
    da[i] = ls[i][0]  # store the top ten subject words by index
Compared with the first part, the print calls for the word list are dropped; instead, the subject words are stored in a variable to make the later comparison easier.
# Find the 2018 subject words
f = open('data2018.txt', 'r', encoding='utf-8')
txt = f.read()
f.close()
words = jieba.lcut(txt)
d = {}  # empty dictionary holding counts for every 2018 candidate word
for word in words:
    if len(word) < 2:
        continue
    else:
        d[word] = d.get(word, 0) + 1
ls = list(d.items())
ls.sort(key=lambda x: x[1], reverse=True)
print('2018:', end='')
db = {}  # dictionary db stores the top ten 2018 subject words
for i in range(10):
    db[i] = ls[i][0]  # store the top ten subject words by index
At this point, the top ten subject words for both 2019 and 2018 are stored in variables.
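Incidentally, keying the dictionaries da and db by the loop index makes them behave like lists; a list built from a slice does the same job more directly. A sketch, assuming ls is the sorted (word, count) list (placeholder values below):

```python
# Hypothetical sorted (word, count) pairs, count descending
ls = [("reform", 10), ("enterprise", 9), ("development", 8), ("policy", 7),
      ("promote", 6), ("economy", 5), ("construction", 4), ("government", 3),
      ("strengthen", 3), ("deepening", 2), ("market", 1)]

# Slice off the first ten pairs and keep only the words
top_words = [word for word, count in ls[:10]]
print(top_words)
```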
2.2 Filtering common words
# Find the m common words, store them in gy, and blank out the common
# words in da and db
gy = {}
m = 0
for i in range(10):
    for j in range(10):  # compare the two top-ten lists pairwise
        if da[i] == db[j]:
            gy[m] = da[i]
            da[i] = ''
            db[j] = ''  # once a common word is saved, clear it from both lists
            m = m + 1   # advance the next storage slot in gy
            break       # a word can match at most once; stop comparing
print('Common words:', end='')
2.3 Extracting common words in a loop
# Print the common words in a loop
for i in range(m):
    if i < m - 1:
        print('{}'.format(gy[i]), end=',')
    else:
        print('{}'.format(gy[i]))
2.4 Extracting unique words in a loop
# Print the words unique to 2019 in a loop
print('2019 unique:', end='')
j = 0
for i in range(10):
    if da[i] != '':  # skip entries that were blanked out as common words
        if j < 10 - m - 1:
            print('{}'.format(da[i]), end=',')
        else:
            print('{}'.format(da[i]))
        j = j + 1
A note on the condition j < 10 - m - 1: m is the number of common words, so each year has 10 - m unique words. Subtracting 1 more identifies the last of them, which falls to the else branch and is printed without a trailing comma.
# Print the words unique to 2018 in a loop
print('2018 unique:', end='')
j = 0
for i in range(10):
    if db[i] != '':  # skip entries that were blanked out as common words
        if j < 10 - m - 1:
            print('{}'.format(db[i]), end=',')
        else:
            print('{}'.format(db[i]))
        j = j + 1
3. Consolidation
import jieba

# 2019 subject words
f = open('data2019.txt', 'r', encoding='utf-8')
txt = f.read()
f.close()
words = jieba.lcut(txt)
d = {}  # counts for every 2019 candidate word
for word in words:
    if len(word) < 2:
        continue
    else:
        d[word] = d.get(word, 0) + 1
ls = list(d.items())
ls.sort(key=lambda x: x[1], reverse=True)
da = {}  # top ten 2019 subject words
for i in range(10):
    da[i] = ls[i][0]

# 2018 subject words
f = open('data2018.txt', 'r', encoding='utf-8')
txt = f.read()
f.close()
words = jieba.lcut(txt)
d = {}  # counts for every 2018 candidate word
for word in words:
    if len(word) < 2:
        continue
    else:
        d[word] = d.get(word, 0) + 1
ls = list(d.items())
ls.sort(key=lambda x: x[1], reverse=True)
db = {}  # top ten 2018 subject words
for i in range(10):
    db[i] = ls[i][0]

# Find the m common words, store them in gy, and blank them out in da and db
gy = {}
m = 0
for i in range(10):
    for j in range(10):
        if da[i] == db[j]:
            gy[m] = da[i]
            da[i] = ''
            db[j] = ''  # once a common word is saved, clear it from both lists
            m = m + 1   # advance the next storage slot in gy
            break

# Print the common words in a loop
print('Common words:', end='')
for i in range(m):
    if i < m - 1:
        print('{}'.format(gy[i]), end=',')
    else:
        print('{}'.format(gy[i]))

# Print the words unique to 2019
print('2019 unique:', end='')
j = 0
for i in range(10):
    if da[i] != '':
        if j < 10 - m - 1:
            print('{}'.format(da[i]), end=',')
        else:
            print('{}'.format(da[i]))
        j = j + 1

# Print the words unique to 2018
print('2018 unique:', end='')
j = 0
for i in range(10):
    if db[i] != '':
        if j < 10 - m - 1:
            print('{}'.format(db[i]), end=',')
        else:
            print('{}'.format(db[i]))
        j = j + 1
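For comparison, the whole comparison step can also be written with collections.Counter and sets. In this sketch the token lists are made-up placeholders standing in for jieba.lcut(f.read()) on the two files:

```python
from collections import Counter

# Hypothetical token lists standing in for jieba.lcut() output on each file
words_2019 = ["改革", "改革", "发展", "了", "企业", "发展", "改革"]
words_2018 = ["改革", "发展", "的", "创新", "创新", "改革"]

def top_words(words, n=10):
    """Return up to n most frequent words of length >= 2."""
    counts = Counter(w for w in words if len(w) >= 2)
    return [w for w, _ in counts.most_common(n)]

w2019 = top_words(words_2019)
w2018 = top_words(words_2018)

common = set(w2019) & set(w2018)  # intersection replaces the nested loops
print('Common words:' + ','.join(sorted(common)))
print('2019 unique:' + ','.join(w for w in w2019 if w not in common))
print('2018 unique:' + ','.join(w for w in w2018 if w not in common))
```

Counter.most_common replaces the manual sort, and the set intersection replaces the nested comparison loops, so no blanking-out bookkeeping is needed.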
Summary
jieba is one of the most widely used third-party Python libraries, with fun applications such as generating word clouds. Beginners should make sure to master the techniques shown here: loop-based word-frequency statistics and data association.