Import data
# jieba word segmentation
import jieba  # Chinese word segmentation
import re  # regular expressions
Consider the following passage of text:
content = '''Better Postgraduate Examination das Teaching videos, English recommendation for Zhu Wei's love ew Words, Political Recommendation Xiao Xiu.//Honorable, High Number of Recommendations[''. . Yu's'''
word_sep = jieba.cut(content)
print(list(word_sep))
['compare', 'good', 'Of', 'Postgraduate Examination', 'das', 'teaching', 'video', ',', '\n', 'English', 'Recommend', 'Zhu Wei', 'Of', 'passionately attached', 'ew', 'Words', ',', '\n', 'Politics', 'Recommend', 'Xiao Xiu', '.', '/', '/', 'Rong', 'Of', ',', '\n', 'High Number', 'Recommend', 'Zhang', '[', "'", "'", '. ', '. ', 'Yu', 'Of']
The result contains some interfering characters (Latin letters, punctuation, and line breaks), so let's strip them out before segmenting:
content = '''Better Postgraduate Examination das Teaching videos, English recommendation for Zhu Wei's love ew Words, Political Recommendation Xiao Xiu.//Honorable, High Number of Recommendations[''. . Yu's'''
content = re.sub(r"[\s. .''\[a-zA-Z\],\[,/]",'',content)
word_sep = jieba.cut(content)
print(list(word_sep))
['compare', 'good', 'Of', 'Postgraduate Examination', 'teaching', 'video', 'English', 'Recommend', 'Zhu Wei', 'Of', 'passionately attached', 'Words', 'Politics', 'Recommend', 'Xiao Xiurong', 'Of', 'High Number', 'Recommend', 'Zhang Yu', 'Of']
Next, part-of-speech tagging. Let's start with a small example:
content = "Xiao Gang and Xiao Qiang go to a nightclub to dance and meet Xiao Hong, Xiao Hong is Xiao Ming's girlfriend"
import jieba.posseg as posseg
for word, flag in posseg.cut(content):
    print(word, flag)
Building prefix dict from the default dictionary ...
Dumping model to file cache C:\Users\kingS\AppData\Local\Temp\jieba.cache
Loading model cost 0.779 seconds.
Prefix dict has been built successfully.
Xiao Gang nr
and c
Xiao Qiang nr
To the Night t
store n
Disco dancing v
, x
encounter v
Yes ul
Small a
red a
, x
Xiao Hong nr
yes v
Xiao Ming nr
Of uj
Girl friend n
Here nr marks a person name, c a conjunction, and v a verb.
For the meaning of the other part-of-speech tags, see: http://www.cnblogs.com/adienhsuan/p/5674033.html
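For quick reference, the tags that appear in the output above can be collected in a small lookup table. This is only a sketch: the meanings are my reading of the ICTCLAS-style tag set that jieba follows, the list is not exhaustive, and the names `POS_MEANINGS` and `describe` are invented for this example.

```python
# Assumed meanings of the tags seen in the output above (not an exhaustive list).
POS_MEANINGS = {
    "nr": "person name",
    "c": "conjunction",
    "t": "time word",
    "n": "noun",
    "v": "verb",
    "a": "adjective",
    "x": "non-word token such as punctuation",
    "ul": "particle",
    "uj": "particle",
}


def describe(flag: str) -> str:
    """Hypothetical helper: return a readable meaning for a tag."""
    return POS_MEANINGS.get(flag, "unknown tag")


print(describe("nr"))
print(describe("uj"))
```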
for word, flag in posseg.cut(content):
    if flag == 'nr':
        print(word, flag)  # Get names only
Xiao Gang nr
Xiao Qiang nr
Xiao Hong nr
Xiao Ming nr
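The same filter can be wrapped in a small function so that it works for any tag. The sketch below runs on plain (word, flag) tuples, so it does not need jieba installed. The function name `words_with_flag` and the sample pairs are illustrative, not part of jieba.

```python
from typing import Iterable, List, Tuple


def words_with_flag(pairs: Iterable[Tuple[str, str]], wanted: str = "nr") -> List[str]:
    """Hypothetical helper: keep the words whose tag equals `wanted`."""
    return [word for word, flag in pairs if flag == wanted]


# Hand-written (word, flag) pairs standing in for tagger output.
tagged = [("Xiao Gang", "nr"), ("and", "c"), ("Xiao Ming", "nr"), ("girlfriend", "n")]
print(words_with_flag(tagged))        # ['Xiao Gang', 'Xiao Ming']
print(words_with_flag(tagged, "n"))   # ['girlfriend']
```

Since each item yielded by posseg.cut unpacks into a word and a flag, as the loop above relies on, passing posseg.cut(content) as `pairs` should work the same way.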
Here is a novel: Descendants of Maoshan
Link Address: https://www.xiaobaipan.com/file-30111359.html
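import pandas as pd

# Note: error_bad_lines was deprecated in pandas 1.3 and removed in 2.0;
# on newer versions use on_bad_lines='skip' instead.
content_story = pd.read_csv(
    r'F:\1 All Postgraduate Data\First year graduated school student\Download Content\Descendants of Maoshan.txt',
    error_bad_lines=False,
    encoding='gbk',
)
content_story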
b'Skipping line 10837: expected 2 fields, saw 4\nSkipping line 10838: expected 2 fields, saw 3\nSkipping line 10839: expected 2 fields, saw 4\nSkipping line 10840: expected 2 fields, saw 5\nSkipping line 10841: expected 2 fields, saw 4\nSkipping line 10842: expected 2 fields, saw 3\nSkipping line 10844: expected 2 fields, saw 8\nSkipping line 10846: expected 2 fields, saw 3\nSkipping line 10850: expected 2 fields, saw 5\n'
[Descendants of Maoshan/DaoGang] | |
---|---|
Txt Edition Reading of Black Dragon Novels | For more reading, please visit: http://www.hlj3.com |
Book Introduction: | NaN |
This is a novel describing Maoshan Taoism, a traditional Chinese mystery. The story tells the story of Zhang Guozhong and Zhang Yicheng's father and son who used Maoshan Taoism to set foot on the world. From exorcising evil spirits to burying graves, civil injustice cases and ancient puzzles will be uncovered one by one. Their footprints will even spread across different Asian and European regions, different cultures, different regions and different beliefs. Can Maoshandao, the most powerful technique in China, stretch its entire length? | NaN |
There is no battle with the shadow of swords or the miracle of eaves and walls in the book. This is not a fantastic visual blockbuster, but a real fantasy novel. It will take you to appreciate the broad and profound Maoshan Road, will take you to solve puzzles at foreign miracles, and a real feast of ideas, which will unfold from here! (starting point) | NaN |
-------Beginning of Chapter------- | NaN |
... | ... |
Notes: | NaN |
*Jurong: The historic city of Jiangsu, located in the south of Jiangsu, has a long history of more than 2000 years. Maoshan, a municipal Taoist resort (where the "Maoshan Taoism" described in this article originated), Baohua Mountain, a Buddhist holy place, Wawu Mountain, which is called "Jiuzhaigou" in Jiangsu, and other famous scenic spots. | NaN |
Copyright (C) 2000-2007 http://www.hlj3.com All Rights Reserved | NaN |
This book has been authorized by the author on Black Dragon Novels. http://www.hlj3.com) and Black Dragon Fiction Network Partners, do not reprint without authorship or permission of Black Dragon Fiction Network. | NaN |
The work itself only represents the author's own point of view and has nothing to do with the black dragon novel web position. Readers who find that the content of their work does conflict with the law can report it to Black Dragon Fiction. If any legal issues or consequences arise as a result, Black Dragon Fiction Network shall not be held responsible. | NaN |
11891 rows × 1 columns
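Here error_bad_lines=False tells pandas to skip lines it cannot parse (lines with more fields than expected) instead of raising an error; the "Skipping line ..." messages above report which lines were dropped. In pandas 1.3 and later the equivalent option is on_bad_lines='skip'.
encoding='gbk' is needed because the file is saved in GBK, a common encoding for Chinese text.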
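To make the skipping behaviour concrete, here is a rough standard-library sketch of the same idea: rows with more fields than expected are dropped and their positions recorded. It is not how pandas implements the option, and the function name `read_skipping_bad_lines` and the sample text are invented for this example.

```python
import csv
import io
from typing import List, Tuple


def read_skipping_bad_lines(text: str, expected_fields: int) -> Tuple[List[List[str]], List[int]]:
    """Hypothetical helper: drop rows with too many fields, report their positions."""
    good_rows: List[List[str]] = []
    skipped: List[int] = []
    # Row positions match file line numbers as long as no field spans several lines.
    for line_number, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        if len(row) > expected_fields:
            skipped.append(line_number)
        else:
            good_rows.append(row)
    return good_rows, skipped


sample = "a,b\nc,d,e,f\ng,h\n"
rows, skipped = read_skipping_bad_lines(sample, expected_fields=2)
print(rows)     # [['a', 'b'], ['g', 'h']]
print(skipped)  # [2]
```

A novel is not really tabular data, so every comma in the text is treated as a field separator, which is why so many lines are rejected.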
Of course, the file can also be read as plain text with a context manager (with open), which suits a novel better than a CSV parser:
with open(r'F:\1 All Postgraduate Data\First year graduated school student\Download Content\Descendants of Maoshan.txt', encoding='gbk') as f:  # same encoding as above
    content = f.read()  # Read the whole file as a string

figure = []
for word, flag in posseg.cut(content):
    if flag == 'nr':  # keep person names only
        figure.append(word)

figure_forehead_20 = pd.Series(figure).value_counts()[:20]  # 20 most frequent names
print(figure_forehead_20)
Zhang Guozhong         4062
Liu                    2640
Qinge                  1271
arise from the east    1206
Zhang Yicheng           886
plum                    745
Dai Jinshuang           526
Alison                  519
Sun Ting                515
Old Liu Tou             298
Mazhen                  287
Brothers                259
Master worker           240
Mr. Liu                 228
National loyalty        221
even                    206
Liu Dan                 192
Understand?             188
Zhang Guoyi             185
Qin Dynasty             183
dtype: int64
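If pandas is not needed for anything else, the same frequency count can be done with the standard library. This is a minimal sketch of that alternative, not what the post itself uses. The function name `top_names` and the sample list are illustrative.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_names(names: Iterable[str], n: int = 20) -> List[Tuple[str, int]]:
    """Hypothetical helper: return the n most frequent names with their counts."""
    return Counter(names).most_common(n)


# Small hand-made list standing in for the names collected from the novel.
sample = [
    "Zhang Guozhong", "Old Liu Tou", "Zhang Guozhong",
    "Zhang Yicheng", "Zhang Guozhong", "Old Liu Tou",
]
print(top_names(sample, n=2))  # [('Zhang Guozhong', 3), ('Old Liu Tou', 2)]
```

Note that some entries in the real result are not character names at all, so the tagger's nr label is best treated as a starting point that still needs manual review.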