[NLP Basic Chinese Processing] Basic use of jieba word segmentation: finding the main characters of Descendants of Maoshan

Posted by chanfuterboy on Mon, 31 Jan 2022 18:48:12 +0100

Import the libraries

# jieba word segmentation
import jieba  # Chinese word segmentation library
import re  # Regular expressions

Suppose we have the following passage of text:

content = '''Better Postgraduate Examination das Teaching videos,
English recommendation for Zhu Wei's love ew Words,
Political Recommendation Xiao Xiu.//Honorable,
High Number of Recommendations[''. . Yu's'''
word_sep = jieba.cut(content)
print(list(word_sep))
['compare', 'good', 'Of', 'Postgraduate Examination', 'das', 'teaching', 'video', ',', '\n', 'English', 'Recommend', 'Zhu Wei', 'Of', 'passionately attached', 'ew', 'Words', ',', '\n', 'Politics', 'Recommend', 'Xiao Xiu', '.', '/', '/', 'Rong', 'Of', ',', '\n', 'High Number', 'Recommend', 'Zhang', '[', "'", "'", '. ', '. ', 'Yu', 'Of']

The text contains noise characters (stray letters, punctuation and whitespace), so we strip them with a regular expression before segmenting:

content = '''Better Postgraduate Examination das Teaching videos,
English recommendation for Zhu Wei's love ew Words,
Political Recommendation Xiao Xiu.//Honorable,
High Number of Recommendations[''. . Yu's'''
content = re.sub(r"[\s.'\[\]a-zA-Z,/]", '', content)
word_sep = jieba.cut(content)
print(list(word_sep))
['compare', 'good', 'Of', 'Postgraduate Examination', 'teaching', 'video', 'English', 'Recommend', 'Zhu Wei', 'Of', 'passionately attached', 'Words', 'Politics', 'Recommend', 'Xiao Xiurong', 'Of', 'High Number', 'Recommend', 'Zhang Yu', 'Of']

Next, a small example of part-of-speech tagging:

import jieba.posseg as posseg  # part-of-speech tagging

content = "Xiao Gang and Xiao Qiang go dancing at a nightclub and run into Xiao Hong; Xiao Hong is Xiao Ming's girlfriend"
for word, flag in posseg.cut(content):
    print(word, flag)
Building prefix dict from the default dictionary ...
Dumping model to file cache C:\Users\kingS\AppData\Local\Temp\jieba.cache
Loading model cost 0.779 seconds.
Prefix dict has been built successfully.


Xiao Gang nr
and c
Xiao Qiang nr
To the Night t
store n
Disco dancing v
, x
encounter v
Yes ul
Small a
red a
, x
Xiao Hong nr
yes v
Xiao Ming nr
Of uj
Girl friend n

Here nr stands for a person's name, c for a conjunction, v for a verb, n for a noun, a for an adjective, and x for punctuation and other non-words.

For the meaning of the other part-of-speech tags, see: http://www.cnblogs.com/adienhsuan/p/5674033.html

for word, flag in posseg.cut(content):
    if flag == 'nr':
        print(word, flag)  # Print person names only

Xiao Gang nr
Xiao Qiang nr
Xiao Hong nr
Xiao Ming nr
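
The filter above can be wrapped in a small reusable function. The sketch below works on any iterable of (word, flag) pairs, which is the shape the loop over posseg.cut unpacks. The pairs are typed in by hand, modelled on the tagged output above, so the example runs without jieba installed. The function name words_with_flag is hypothetical.

```python
from typing import Iterable, List, Tuple

def words_with_flag(pairs: Iterable[Tuple[str, str]], wanted: str) -> List[str]:
    """Collect the words whose part-of-speech flag equals `wanted`."""
    return [word for word, flag in pairs if flag == wanted]

# Hand-written (word, flag) pairs modelled on the tagged output above.
tagged = [
    ("Xiao Gang", "nr"), ("and", "c"), ("Xiao Qiang", "nr"),
    ("encounter", "v"), ("Xiao Hong", "nr"), ("Xiao Ming", "nr"),
    ("Girl friend", "n"),
]

names = words_with_flag(tagged, "nr")
print(names)
```

With jieba installed, passing posseg.cut(content) as the first argument should give the same kind of list, since the loop in the post already unpacks its items as word and flag.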

Now let's apply this to a whole novel: Descendants of Maoshan

Link Address: https://www.xiaobaipan.com/file-30111359.html

import pandas as pd

content_story = pd.read_csv(
    r'F:\1 All Postgraduate Data\First year graduated school student\Download Content\Descendants of Maoshan.txt',
    error_bad_lines=False,
    encoding='gbk',
)
content_story
b'Skipping line 10837: expected 2 fields, saw 4\nSkipping line 10838: expected 2 fields, saw 3\nSkipping line 10839: expected 2 fields, saw 4\nSkipping line 10840: expected 2 fields, saw 5\nSkipping line 10841: expected 2 fields, saw 4\nSkipping line 10842: expected 2 fields, saw 3\nSkipping line 10844: expected 2 fields, saw 8\nSkipping line 10846: expected 2 fields, saw 3\nSkipping line 10850: expected 2 fields, saw 5\n'
[Descendants of Maoshan/DaoGang]
Txt Edition Reading of Black Dragon NovelsFor more reading, please visit: http://www.hlj3.com
Book Introduction:NaN
This is a novel describing Maoshan Taoism, a traditional Chinese mystery. The story tells the story of Zhang Guozhong and Zhang Yicheng's father and son who used Maoshan Taoism to set foot on the world. From exorcising evil spirits to burying graves, civil injustice cases and ancient puzzles will be uncovered one by one. Their footprints will even spread across different Asian and European regions, different cultures, different regions and different beliefs. Can Maoshandao, the most powerful technique in China, stretch its entire length?NaN
There is no battle with the shadow of swords or the miracle of eaves and walls in the book. This is not a fantastic visual blockbuster, but a real fantasy novel. It will take you to appreciate the broad and profound Maoshan Road, will take you to solve puzzles at foreign miracles, and a real feast of ideas, which will unfold from here! (starting point)NaN
-------Beginning of Chapter-------NaN
......
Notes:NaN
*Jurong: The historic city of Jiangsu, located in the south of Jiangsu, has a long history of more than 2000 years. Maoshan, a municipal Taoist resort (where the "Maoshan Taoism" described in this article originated), Baohua Mountain, a Buddhist holy place, Wawu Mountain, which is called "Jiuzhaigou" in Jiangsu, and other famous scenic spots.NaN
Copyright (C) 2000-2007 http://www.hlj3.com  All Rights ReservedNaN
This book has been authorized by the author on Black Dragon Novels. http://www.hlj3.com) and Black Dragon Fiction Network Partners, do not reprint without authorship or permission of Black Dragon Fiction Network.NaN
The work itself only represents the author's own point of view and has nothing to do with the black dragon novel web position. Readers who find that the content of their work does conflict with the law can report it to Black Dragon Fiction. If any legal issues or consequences arise as a result, Black Dragon Fiction Network shall not be held responsible.NaN

11891 rows × 1 columns

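Here error_bad_lines=False tells pandas to skip any line it cannot parse (in this file, lines with more comma-separated fields than expected) instead of raising an error; the skipped lines are listed in the warnings above. The parameter was deprecated in pandas 1.3 and removed in 2.0, where on_bad_lines='skip' does the same job.
encoding='gbk' is needed because the file is saved in GBK, an encoding commonly used for Simplified Chinese text.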
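
Because read_csv splits on commas by default, lines of the novel with more commas than expected are dropped, as the warnings above show. If the goal is simply to keep every line of text, one option (an assumption, not what the original post does) is to read the file as plain text with the same GBK encoding. The sketch writes a tiny temporary file so that it runs anywhere; the file contents and the helper name read_lines are made up.

```python
import os
import tempfile
from typing import List

def read_lines(path: str, encoding: str = "gbk") -> List[str]:
    """Read a text file and return its non-empty lines, stripped."""
    # errors="ignore" silently drops bytes that are not valid in `encoding`.
    with open(path, encoding=encoding, errors="ignore") as f:
        return [line.strip() for line in f if line.strip()]

# Build a small GBK-encoded stand-in for the novel (contents are made up).
text = "第一章\n你好,世界,再见\n\n完\n"
with tempfile.NamedTemporaryFile("w", encoding="gbk", suffix=".txt",
                                 delete=False) as tmp:
    tmp.write(text)
    tmp_path = tmp.name

lines = read_lines(tmp_path)
os.remove(tmp_path)
print(lines)
```

The line containing commas is kept whole here, whereas a comma-separated parser would treat it as several fields.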

Alternatively, the file can be read with a context manager (a with statement), which gives the whole text as a single string:

with open(r'F:\1 All Postgraduate Data\First year graduated school student\Download Content\Descendants of Maoshan.txt', encoding='gbk') as f:
    content = f.read()  # Read the whole novel as one string

figure = []  # Every token tagged as a person name (nr)
for word, flag in posseg.cut(content):
    if flag == 'nr':
        figure.append(word)

# The 20 most frequent names
figure_forehead_20 = pd.Series(figure).value_counts()[:20]
print(figure_forehead_20)
Zhang Guozhong         4062
Liu                    2640
Qinge                  1271
arise from the east    1206
Zhang Yicheng           886
plum                    745
Dai Jinshuang           526
Alison                  519
Sun Ting                515
Old Liu Tou             298
Mazhen                  287
Brothers                259
Master worker           240
Mr. Liu                 228
National loyalty        221
even                    206
Liu Dan                 192
Understand?             188
Zhang Guoyi             185
Qin Dynasty             183
dtype: int64
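
The top-20 list contains some tokens that are clearly not characters in the story (for example "Understand?" and "even"), since the tagger is not perfect. A rough post-filter, offered as a sketch and not as part of the original analysis, is to count with collections.Counter and drop tokens that are only one character long or that appear in a hand-made exclusion list. The sample tokens, the exclusion list and the function name top_names are all hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

def top_names(tokens: Iterable[str], exclude: Set[str], n: int = 20,
              min_len: int = 2) -> List[Tuple[str, int]]:
    """Count tokens, skipping very short ones and those in `exclude`."""
    kept = (t for t in tokens if len(t) >= min_len and t not in exclude)
    return Counter(kept).most_common(n)

# Hypothetical tokens standing in for the `figure` list built above.
tokens = ["王小明", "王小明", "李", "李雷", "明白", "王小明", "李雷"]
ranking = top_names(tokens, exclude={"明白"}, n=2)
print(ranking)
```

A single-character cutoff will also discard genuine one-character references to a person (a bare surname, for instance), so the threshold and the exclusion list would need tuning by hand for a real text.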

Topics: NLP