R language Chinese word segmentation package jiebaR

The "R's Geek Ideal" series covers key points about R's ideas, usage, tools and innovations, interpreting the power of R through my personal learning and experience.

As a language for statistics, R long shone only in a niche field. It was not until the explosion of big data that R became a popular tool for data analysis. With more and more people from engineering backgrounds joining, the R community is expanding and growing rapidly. Today R is used not only in statistics, but also in education, banking, e-commerce and the Internet industry.

To become an ideal geek, we should not stop at syntax; we also need a solid grasp of mathematics, probability and statistics, as well as the spirit of innovation, to bring R into full play in every field. Let's move forward together and start R's geek ideal.

About the author:

  • Conan Zhang, programmer: Java, R, PHP, Javascript
  • weibo: @Conan_Z
  • blog: http://blog.fens.me
  • email: bsspirit@gmail.com

Please indicate the source for Reprint:
http://blog.fens.me/r-word-jiebar/

Preface

Text mining is a very important part of data mining, with very broad application scenarios. For example, we can analyze news events to understand national affairs; we can analyze microblog posts to follow public sentiment and see what everyone cares about. Through text mining we can find hidden information in an article, analyze its structure, and judge whether two articles were written by the same author; we can also analyze e-mail and, combined with the Bayes algorithm, decide which messages are spam and which are useful.

The first step of text mining is word segmentation, and its quality directly affects the mining results. R has good support for word segmentation. Below I will introduce a good R language Chinese word segmentation package, jiebaR ("stuttering" segmentation).

Contents

  1. Introduction to the jiebaR package
  2. Five minutes to get started
  3. Word segmentation engines
  4. Configuring the dictionary
  5. Stop word filtering
  6. Keyword extraction

1. Introduction to the jiebaR package

jiebaR is an efficient Chinese word segmentation package for the R language. The underlying implementation is in C++ and is called through Rcpp, so it is very fast. jiebaR is released under the MIT license and is free and open source. Thanks to the Chinese author's support, R can now handle Chinese text conveniently.

Official Github address: https://github.com/qinwf/jiebaR

The system environment used in this article:

  • Win10 64bit
  • R: 3.2.3 x86_64-w64-mingw32/x64 64bit

The jiebaR package is a standard library published on CRAN, so installation is very simple: two commands are enough.

~ R
> install.packages("jiebaR")
> library("jiebaR")

If you want to install the development version, you can use devtools. For an introduction to devtools, please refer to the article: Catalyzing R package development on the shoulders of giants

> library(devtools)
> install_github("qinwf/jiebaRD")
> install_github("qinwf/jiebaR")
> library("jiebaR")

For the development version, the author recommends compiling with gcc >= 4.6 on Linux; on Windows, Rtools needs to be installed.

2. Start in 5 minutes

To get started in five minutes, look directly at the first example: segmenting a short piece of text.

> wk = worker()

> wk["I am< R Author of the book geek ideal"]
[one] "I am" "R"    "of"   "Geek" "ideal" "books" "author"

> wk["I am R Depth of language users"]
[one] "I"   "yes"   "R"    "language" "of"   "depth" "user"

Very simple: two lines of code complete the Chinese word segmentation.

jiebaR provides three ways to write a word segmentation statement. Besides the [] syntax shown above, you can also use the <= symbol or the segment() function. Although the forms differ, the segmentation results are the same. The syntax using the <= symbol is as follows:

> wk <= 'Another consistent syntax'
[1] "another"   "one kind" "accord with" "of"   "grammar"

The syntax for using the segment() function is as follows

> segment( "segment()Writing method of function statement" , wk )
[one] "segment" "function"    "sentence"    "of"      "Writing method" 

If you find this interesting and want to know how the custom operators are defined, you can check the quick.R file in the project's source code.

# <= symbol definition
`<=.qseg`<-function(qseg, code){
  if(!exists("quick_worker",envir = .GlobalEnv ,inherits = F) || 
       .GlobalEnv$quick_worker$PrivateVarible$timestamp != TIMESTAMP){
    
    if(exists("qseg",envir = .GlobalEnv,inherits = FALSE ) ) 
      rm("qseg",envir = .GlobalEnv)
    
    modelpath  = file.path(find.package("jiebaR"),"model","model.rda")
    quickparam = readRDS(modelpath)
    
    if(quickparam$dict == "AUTO") quickparam$dict = DICTPATH
    if(quickparam$hmm == "AUTO") quickparam$hmm = HMMPATH
    if(quickparam$user == "AUTO") quickparam$user = USERPATH
    if(quickparam$stop_word == "AUTO") quickparam$stop_word = STOPPATH
    if(quickparam$idf == "AUTO") quickparam$idf = IDFPATH
    
    createquickworker(quickparam)
    setactive()
  } 

  # ... code omitted
}

# [ symbol definition
`[.qseg`<- `<=.qseg`

We can also segment the text file directly and create a new text file idea.txt in the current directory.

~ notepad idea.txt

The R geek ideal series covers a series of key points about R's thought, usage, tools and innovation, interpreting the power of R through my personal learning and experience.

As a language for statistics, R has long shone in a niche field. It was not until the outbreak of big data that R became a popular tool for data analysis. With more and more people with engineering backgrounds joining, the R community is expanding and growing rapidly. Now R is used not only in statistics, but also in education, banking, e-commerce and the Internet.

When we run the word segmentation program on this file, a new file containing the segmentation results is generated in the current directory.

> wk['./idea.txt']
[one] "./idea.segment.20 one 6-07-20_23_25_34.txt"

Open the file idea.segment.20 one 6-07-20_23_25_34.txt, and the whole article is segmented with spaces.

~ notepad idea.segment.20 one 6-07-20_23_25_34.txt

R The geek ideal series covers R A series of key points such as thought use tools and innovation explained with my personal learning and experience R Powerful R As a language of statistics has been shining in the minority field until the outbreak of big data R Language has become a hot tool for data analysis With the addition of more and more people with engineering background R The language community is expanding and growing rapidly Now not only the field of statistics but also education banking e-commerce and the Internet are using R language

Isn't it simple? You can complete a word segmentation task within five minutes.

3. Word segmentation engines

When calling the worker() function, we are actually loading the word segmentation engine of the jiebaR library, which provides seven word segmentation engines.

  • Mix model (MixSegment): among the four segmentation engines (mix, mp, hmm, query) this is the one with the best overall results; it combines the maximum probability method and the hidden Markov model.
  • Maximum probability method (MPSegment): builds a directed acyclic graph from a Trie tree and applies dynamic programming; it is the core of the segmentation algorithm.
  • Hidden Markov model (HMMSegment): segments with an HMM trained on corpora such as the People's Daily. The main idea is to label the hidden state of each character with one of four tags (B, E, M, S). The model is provided by dict/hmm_model.utf8, and decoding uses the Viterbi algorithm.
  • Query model (QuerySegment): first segments with the mix model, then enumerates all possible sub-words of the long words it produced, so that dictionary words inside them can also be found.
  • Tag model (tag): part-of-speech tagging (a short sketch follows this list).
  • Simhash model (simhash): document fingerprinting (see the same sketch below).
  • Keywords model (keywords): keyword extraction, demonstrated in section 6.
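
The tag, simhash and keywords engines are built on top of the segmenters and are created with worker() in the same way. A quick, hedged sketch (my own illustration, not from the original article; output shapes may differ slightly between jiebaR versions):

> tagger = worker("tag")                   # part-of-speech tagging engine
> tagger <= "I am a deep user of the R language"

> simhasher = worker("simhash", topn = 2)  # simhash fingerprint engine
> simhasher <= "I am a deep user of the R language"

The tag engine returns the segmented words labelled with the ICTCLAS tags listed in section 4, and the simhash engine returns a document fingerprint together with its top keywords; the keywords engine is demonstrated in section 6.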

If you do not care much about which engine to use, just use the officially recommended mix model (the default). Let's check the definition of the worker() function.

worker(type = "mix", dict = DICTPATH, hmm = HMMPATH, user = USERPATH,
  idf = IDFPATH, stop_word = STOPPATH, write = T, qmax = 20, topn = 5,
  encoding = "UTF-8", detect = T, symbol = F, lines = one e+05,
  output = NULL, bylines = F, user_weight = "max")

Parameter list (a small example overriding a few of these follows the list):

  • type, engine type
  • dict, system dictionary
  • hmm, HMM model path
  • user, user dictionary
  • idf, IDF dictionary
  • stop_word, stop word list
  • write, whether to write the segmentation result of an input file to an output file; the default is TRUE
  • qmax, maximum query word length, 20 characters by default
  • topn, number of keywords, 5 by default
  • encoding, input file encoding, UTF-8 by default
  • detect, whether to detect the encoding, TRUE by default
  • symbol, whether to keep symbols, FALSE by default
  • lines, the maximum number of lines read from a file at a time, used to control the length of each read; large files are read in several passes
  • output, output path
  • bylines, whether to return output by line
  • user_weight, user dictionary word weight, "max" by default
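
Most of these parameters can simply be overridden when constructing the engine. A minimal sketch (the values are my own illustration, not recommendations from the original article): keep punctuation symbols and group results by input line.

> wk2 = worker(type = "mix", symbol = TRUE, bylines = TRUE)
> wk2[c("First sentence, with a comma.", "Second sentence!")]

With bylines = TRUE the result comes back as a list with one element per input element, which is convenient when segmenting a character vector read from a file.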

When we call worker(), the word segmentation engine is loaded. You can print it and view the configuration of the word segmentation engine.

> wk = worker()
> wk
Worker Type:  Jieba Segment

Default Method  :  mix     # Hybrid model
Detect Encoding :  TRUE    # Detect the encoding
Default Encoding:  UTF-8   # UTF-8
Keep Symbols    :  FALSE   # Do not keep symbols
Output Path     :          # Output file directory
Write File      :  TRUE    # Write results to file
By Lines        :  FALSE   # Do not return results by line
Max Word Length :  20      # Maximum word length
Max Read Lines  :  1e+05   # Maximum lines read from a file at a time

Fixed Model Components:  

$dict                      # System dictionary
[1] "D:/tool/R-3.2.3/library/jiebaRD/dict/jieba.dict.utf8"

$user                      # User dictionary
[1] "D:/tool/R-3.2.3/library/jiebaRD/dict/user.dict.utf8"

$hmm                       # Hidden Markov model
[1] "D:/tool/R-3.2.3/library/jiebaRD/dict/hmm_model.utf8"

$stop_word                 # Stop words, none configured
NULL

$user_weight               # User dictionary weight
[1] "max"

$timestamp                 # Timestamp
[1] 1469027302

$default $detect $encoding $symbol $output $write $lines $bylines can be reset.

If we want to change a configuration item of the segmentation engine, we can either pass it when calling worker() to create the engine, or set it afterwards through wk$XX (a small sketch follows the next code block). If you want to know what type of object wk is, check it with the otype() function of the pryr package. For details on the pryr package, please refer to the article: pryr, the advanced toolkit for prying into the R kernel

# Load pryr package
> library(pryr)
> otype(wk)  # Object oriented type checking
[one] "S3"

> class(wk)  # View class properties
[one] "jiebar"  "segment" "jieba" 

4. Configuring the dictionary

The key factor in the quality of segmentation results is the dictionary. jiebaR ships with a standard dictionary by default, but in practice it is best to use a specialized dictionary for the industry or text type at hand. The show_dictpath() function shows where the default standard dictionary lives, and our own dictionaries can be specified through the configuration items described in the previous section; for everyday language a common choice is the thesaurus of the Sogou input method.
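
For example, a domain-specific engine can be built simply by pointing the configuration items at our own files. A hedged sketch (the file names below are hypothetical placeholders, not files shipped with jiebaR):

> wk_custom = worker(dict = "finance.dict.utf8",        # domain system dictionary
+                    user = "finance.user.utf8",        # domain user dictionary
+                    stop_word = "finance.stop.utf8")   # domain stop word list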

# View default thesaurus location
> show_dictpath()
[1] "D:/tool/R-3.2.3/library/jiebaRD/dict"

# View directory
> dir(show_dictpath())
[1] "D:/tool/R-3.2.3/library/jiebaRD/dict"
 [1] "backup.rda"      "hmm_model.utf8"  "hmm_model.zip"  
 [4] "idf.utf8"        "idf.zip"         "jieba.dict.utf8"
 [7] "jieba.dict.zip"  "model.rda"       "README.md"      
[10] "stop_words.utf8" "user.dict.utf8" 

You can see that multiple files are included in the dictionary directory.

  • jieba.dict.utf8, system dictionary file for the maximum probability method, UTF-8 encoded
  • hmm_model.utf8, system dictionary file for the hidden Markov model, UTF-8 encoded
  • user.dict.utf8, user dictionary file, UTF-8 encoded
  • stop_words.utf8, stop word file, UTF-8 encoded
  • idf.utf8, IDF corpus, UTF-8 encoded
  • jieba.dict.zip, compressed package of jieba.dict.utf8
  • hmm_model.zip, compressed package of hmm_model.utf8
  • idf.zip, compressed package of idf.utf8
  • backup.rda, no description
  • model.rda, no description
  • README.md, description file

Open the system dictionary file jieba.dict.utf8 and print the first 50 lines.

> scan(file="D:/tool/R-3.2.3/library/jiebaRD/dict/jieba.dict.utf8",
+           what=character(),nlines=50,sep='\n',
+           encoding='utf-8',fileEncoding='utf-8')
Read 50 items
 [1] "1 Shop 3 n"  "1 Shop 3 n"  "4S Shop 3 n"   "4s Shop 3 n"  
 [5] "AA System 3 n"   "AB Type 3 n"   "AT&T 3 nz"  "A Type 3 n"   
 [9] "A Seat 3 n"    "A Shares 3 n"    "A Wheel 3 n"    "A Wheel 3 n"   
[13] "BB Machine 3 n"   "BB Machine 3 n"   "BP Machine 3 n"   "BP Machine 3 n"  
[17] "B Type 3 n"    "B Seat 3 n"    "B Unit 3 n"    "B Super 3 n"   
[21] "B Wheel 3 n"    "B Wheel 3 n"    "C# 3 nz"    "C++ 3 nz"  
[25] "CALL Machine 3 n" "CALL Machine 3 n" "CD Machine 3 n"   "CD Machine 3 n"  
[29] "CD Box 3 n"   "C Seat 3 n"    "C Disc 3 n"    "C Disc 3 n"   
[33] "C Language 3 n"  "C Language 3 n"  "D Seat 3 n"    "D Version 3 n"   
[37] "D Disc 3 n"    "D Disc 3 n"    "E Chemical 3 n"    "E Seat 3 n"   
[41] "E Disc 3 n"    "E Disc 3 n"    "E Pass 3 n"    "F Seat 3 n"   
[45] "F Disc 3 n"    "F Disc 3 n"    "G Disc 3 n"    "G Disc 3 n"   
[49] "H Disc 3 n"    "H Disc 3 n"

We can see that each line of the system dictionary has three columns separated by spaces: the first column is the word, the second is its word frequency, and the third is its part-of-speech tag.

Open the user dictionary file user.dict.utf8 and print the first 50 lines.

> scan(file="D:/tool/R-3.2.3/library/jiebaRD/dict/user.dict.utf8",
+      what=character(),nlines=50,sep='\n',
+      encoding='utf-8',fileEncoding='utf-8')
Read 5 items
[1] "cloud computing"   "Han Yu appreciation" "Lan Xiang nz"  "CEO"      "River Bridge"  

Each line of the user dictionary has at most two columns: the first is the word and the second is its part-of-speech tag; there is no word frequency column. The default word frequency of a user dictionary entry is the maximum word frequency in the system dictionary.

The jiebaR package uses the ICTCLAS scheme for the part-of-speech tags in its dictionaries. The ICTCLAS Chinese part-of-speech tag set (code, name and mnemonic) is listed below.

  • Ag, adjective morpheme: the adjective code is a, and the morpheme code g is preceded by A.
  • a, adjective: takes the first letter of the English word adjective.
  • ad, adverbial adjective: an adjective used directly as an adverbial; combines adjective code a and adverb code d.
  • an, nominal adjective: an adjective with a noun function; combines adjective code a and noun code n.
  • b, distinguishing word: takes the initial consonant of the Chinese "bie".
  • c, conjunction: takes the first letter of the English word conjunction.
  • Dg, adverb morpheme: the adverb code is d, and the morpheme code g is preceded by D.
  • d, adverb: takes the second letter of adverb, because the first letter is already used for adjectives.
  • e, interjection: takes the first letter of the English word exclamation.
  • f, locative word: takes the initial consonant of the Chinese "fang".
  • g, morpheme: most morphemes can serve as the "root" of compound words; takes the initial consonant of the Chinese "gen" (root).
  • h, prefix component: takes the first letter of the English word head.
  • i, idiom: takes the first letter of the English word idiom.
  • j, abbreviation: takes the initial consonant of the Chinese "jian".
  • k, suffix component.
  • l, idiomatic phrase: not yet a full idiom, somewhat "temporary"; takes the initial consonant of the Chinese "lin" (temporary).
  • m, numeral: takes the third letter of the English word numeral, because n and u are already used.
  • Ng, noun morpheme: the noun code is n, and the morpheme code g is preceded by N.
  • n, noun: takes the first letter of the English word noun.
  • nr, personal name: combines the noun code n with the initial consonant of the Chinese "ren" (person).
  • ns, place name: combines the noun code n with the locative code s.
  • nt, organization name: the initial consonant of the Chinese "tuan" is t; combines the noun codes n and t.
  • nz, other proper noun: the first letter of the initial consonant of the Chinese "zhuan" is z; combines the noun codes n and z.
  • o, onomatopoeia: takes the first letter of the English word onomatopoeia.
  • p, preposition: takes the first letter of the English word preposition.
  • q, classifier: takes the first letter of the English word quantity.
  • r, pronoun: takes the second letter of the English word pronoun, because p is already used for prepositions.
  • s, place word: takes the first letter of the English word space.
  • Tg, tense morpheme: the time-word code is t, and t is placed before the morpheme code g.
  • t, time word: takes the first letter of the English word time.
  • u, auxiliary: takes the second letter of the English word auxiliary, because a is already used for adjectives.
  • Vg, verb morpheme: the verb code is v, and the morpheme code g is preceded by V.
  • v, verb: takes the first letter of the English word verb.
  • vd, adverbial verb: a verb used directly as an adverbial; combines the verb and adverb codes.
  • vn, nominal verb: a verb with a noun function; combines the verb and noun codes.
  • w, punctuation.
  • x, non-morpheme character: a non-morpheme character is only a symbol; the letter x is usually used to represent unknowns and symbols.
  • y, modal particle: takes the initial consonant of the Chinese "yu".
  • z, stative word: takes the first letter of the initial consonant of the Chinese "zhuang".

Let's customize a user dictionary to try the effect. Write a dictionary file, user.utf8.

~ notepad user.utf8

R language
R geek ideal
big data
data

Use our custom user dictionary to segment the text just now.

> wk = worker(user='user.utf8')
> wk['./idea.txt']
[1] "./idea.segment.2016-07-21_11_14_24.txt"

Compare the segmentation results generated in the two runs, idea.segment.2016-07-20_23_25_34.txt and idea.segment.2016-07-21_11_14_24.txt; a small sketch for comparing the two files follows.
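
A minimal sketch for eyeballing the difference (the file names are the two generated above; adjust the timestamps to your own runs):

> old = scan("idea.segment.2016-07-20_23_25_34.txt", what = character())
> new = scan("idea.segment.2016-07-21_11_14_24.txt", what = character())
> setdiff(new, old)    # words that only appear with the custom user dictionary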

In practice, the user dictionary provided by jiebaR by default has only five words, which is far too small for real use. We can enrich our own thesaurus with the Sogou dictionary. Next, let's configure a Sogou dictionary. You need to have a Sogou input method installed; the installation process itself is not covered here.

I installed the Sogou Wubi input method, found the Sogou installation directory and found the dictionary file. My Sogou dictionary is installed in the following location.

C:\Program Files (x86)\SogouWBInput\2.1.0.1288\scd\17960.scel

Copy the 17960.scel file to your project directory and open it with a text editor; you will find that it is binary, so we need a tool to convert this binary dictionary into a text file we can use. The author of the jiebaR package has also developed the cidian project, which can convert Sogou dictionaries, so we only need to install the cidian package.

Install cidian project

> install.packages("devtools")
> install.packages("stringi")
> install.packages("pbapply")
> install.packages("Rcpp")
> install.packages("RcppProgress")
> library(devtools)
> install_github("qinwf/cidian")
> library(cidian)

Convert binary dictionary to text file.

# transformation
> decode_scel(scel = "./17960.scel",cpp = TRUE)
output file: ./17960.scel_2016-07-21_00_22_11.dict

# View the generated dictionary file
> scan(file="./17960.scel_2016-07-21_00_22_11.dict",
+      what=character(),nlines=50,sep='\n',
+      encoding='utf-8',fileEncoding='utf-8')
Read 50 items
 [1] "Aba Prefecture n"         "A Baichuan n"         "Aban n"          
 [4] "Abin n"           "Apophis n"       "Abudureshiti n"  
 [7] "Abudusikur n"   "Abulik wood n"     "Almgren n"    
[10] "Andrey ARSHAVIN  n"       "A Feixing n"         "The true story of Alfie n"      
[13] "A-mit  n"         "Amu n"           "Amuron n"        
[16] "Apalusa town n"     "abhisit  n"         "Ah Shuai n"          
[19] "A Xia n"           "Iowa n"         "Love without guilt n"        
[22] "Dislocation of love n"       "Dirty Love  n"       "Flame of love n"      
[25] "Ai no rukeichi  n"     "Can love n"         "Efron  n"        
[28] "Love net n"         "Patriotic hearts n"       "Aihu n"          
[31] "Love is together n"     "Ekhil n"       "There is nothing wrong with love n"      
[34] "Emmons n"         "New biography of Ainu n"       "Love starting point n"        
[37] "Teeth of love n"     "Love beach n"       "Love festival n"        
[40] "The beauty of love n"   "Infinite spectrum of love n"     "Love is busy n"      
[43] "Love transfer n"       "Love left light right line n"   "Falling in love with you is a mistake n"
[46] "Dwarf sentry n"         "Love is compromise n"       "Love is like Narcissus n"      
[49] "Love hurts too much n"         "Love is boundless n"    

Next, configure the Sogou dictionary into our word segmentation library so it can be used directly. Rename the converted file 17960.scel_2016-07-21_00_22_11.dict to user.dict.utf8, and use it to replace user.dict.utf8 in the directory D:\tool\R-3.2.3\library\jiebaRD\dict. The default user dictionary is now the Sogou dictionary. Cool!
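
Overwriting the package's default file is convenient but affects every engine. If you prefer not to touch the jiebaRD installation, the converted dictionary can, as far as I can tell, also be passed to a single engine through the user parameter; a hedged sketch using the file generated above:

> wk = worker(user = "./17960.scel_2016-07-21_00_22_11.dict")
> wk['./idea.txt']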

5. Stop word filtering

Stop words are words that we do not want to appear in the segmentation results. For example, English sentences contain many occurrences of a, the, or and and; Chinese likewise has many, such as the particles 的, 地, 得 and the pronouns 我 (I), 你 (you), 他 (he). Because these words are used so frequently, they appear in large numbers in any text and add a lot of noise to word frequency statistics computed on the segmentation results, so we usually filter them out.

In jiebaR, there are two ways to filter stop words: one is to configure a stop_word file, and the other is to use the filter_segment() function.

Let's start with the stop_word file method. Create a new file stop_word.txt.

~ notepad stop_word.txt

I
I am

Load the word segmentation engine and configure stop word filtering.

> wk = worker(stop_word='stop_word.txt')
> segment<-wk["I am< R Author of the book geek ideal"]
> segment
[1] "R"    "of"   "Geek" "ideal" "books" "author"

In the example above, we filtered out "I am" through the stop word file. If you also want to filter out the word "author", you can call the filter_segment() function dynamically.

> filter<-c("author")
> filter_segment(segment,filter)
[1] "R"    "of"   "Geek" "ideal" "books"

6. Keyword extraction

Keyword extraction is a very important part of text processing. A classic algorithm for it is TF-IDF, where TF (term frequency) is how often a word appears in the document and IDF (inverse document frequency) measures how rare the word is across documents. If a word appears many times in an article and is not a stop word, it very likely reflects the characteristics of the article; this is exactly the kind of keyword we are looking for. IDF then gives each word a weight: the rarer the word, the larger the weight. The formula for TF-IDF is:

TF-IDF = TF (term frequency) * IDF (inverse document frequency)

Calculate the TF-IDF value of each word in the document and sort the results from large to small to get the document's keyword ranking. For an explanation of TF-IDF, refer to the article: The application of TF-IDF and cosine similarity (I): automatic keyword extraction.
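
As a toy illustration of the ranking idea (the numbers below are made up for illustration, not taken from idf.utf8): multiply each word's term frequency by its IDF weight and sort in decreasing order.

> tf  = c(geek = 1, cover = 1, of = 5)           # hypothetical term frequencies
> idf = c(geek = 11.7, cover = 8.2, of = 0.5)    # hypothetical IDF weights
> sort(tf * idf[names(tf)], decreasing = TRUE)   # larger value = more likely a keyword

Even though "of" is by far the most frequent word, its tiny IDF weight keeps it out of the keyword list; that is exactly the behaviour TF-IDF is designed to give.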

The keyword extraction in the jiebaR package is also implemented with the TF-IDF algorithm. The idf.utf8 file in the installation directory is the IDF corpus. Let's view the contents of idf.utf8.

> scan(file="D:/tool/R-3.2.3/library/jiebaRD/dict/idf.utf8",
+      what=character(),nlines=50,sep='\n',
+      encoding='utf-8',fileEncoding='utf-8')
Read 50 items
 [1] "Labor protection 13.900677652"      "Biochemistry 13.900677652"       
 [3] "Osabel 13.900677652"      "Investigation team 13.900677652"     
 [5] "Post 11.5027823792"         "Reverse gear 12.2912397395"      
 [7] "Compile 9.21854642485"         "Butterfly 11.1926274509"        
 [9] "Outsourcing 11.8212361103"         "Pretend to be profound 11.9547675029"    
[11] "Wei Suicheng 13.2075304714"       "Cardiogenic 11.1926274509"      
[13] "Active servicemen 10.642581114"      "Dubliu13.2075304714"      
[15] "Bao Tianxiao 13.900677652"        "Jia Zhengpei 13.2075304714"      
[17] "TOL Bay 13.900677652"        "Dowa 12.5143832909"        
[19] "Multiple lobes 13.900677652"          "Baster 11.598092559"     
[21] "Emperor Liu 12.8020653633"       "Alexandrov 13.2075304714"
[23] "Public 8.90346537821"     "Five hundred 12.8020653633"      
[25] "Two point threshold 12.5143832909"       "Multi bottle 13.900677652"         
[27] "Ice sky 12.2912397395"         "Kubuzi 11.598092559"       
[29] "Longchuan County 12.8020653633"       "Yinyan 11.9547675029"        
[31] "Historical features 11.8212361103"     "Faithfulness 13.2075304714"    
[33] "Lecherous 10.0088573539"         "Walk slowly 12.5143832909"    
[35] "Stool 8.36728816325"         "Part II 9.93038573842"        
[37] "Luba 12.1089181827"         "Five hundred and fifty-three.2075304714"      
[39] "Talk freely 11.598092559"          "Wu Zhezi 13.2075304714"      
[41] "Quiz 13.900677652"      "Kubon 13.2075304714"        
[43] "Injustice 11.3357282945"       "Compilation 10.2897597393"        
[45] "Sorrow 12.8020653633"         "Chen zhuangta 13.2075304714"      
[47] "Erlang 9.62401153296"         "Lightning 11.8212361103"    
[49] "Grab the ball 11.9547675029"         "South Australia 10.9562386728"  

Each line of the idf.utf8 file has two columns: the first is the word and the second is its IDF weight. We then compute the term frequency (TF) of each word in the document and multiply it by the corpus IDF value to obtain the TF-IDF value, which is used to extract the document's keywords.

For example, we extract keywords from the following text content.

> wk = worker()
> segment<-wk["R Geek ideal series, covering R A series of key points of thought, use, tools, innovation, etc. are explained by my personal learning and experience R Powerful."]

# term frequency 
> freq(segment)
     char freq
1    innovate    1
2      Yes    1
3    article    1
4    powerful    1
5       R    3
6    personal    1
7      of    5
8    annotation    1
9      and    1
10 A series    1
11   use    1
12     with    1
13     etc.    1
14   Geek    1
15   ideal    1
16   thought    1
17   cover    1
18   series    1
19     go    1
20     I    1
21   tool    1
22   study    1
23   experience    1
24   main points    1

# Take the top 5 keywords of TF-IDF
> keys = worker("keywords",topn=5)

# Calculation keywords
> vector_keywords(segment,keys)
11.7392 8.97342 8.23425  8.2137 7.43298 
 "Geek"  "annotation"  "main points"  "cover"  "experience" 

Processing word segmentation with the jiebaR package is really simple: a few lines of code can run the various segmentation algorithms. With this tool we can look for linguistic patterns in documents and do text mining. In the next article, let's dig into the announcements of listed companies; maybe we can find some market patterns there.

This article only introduces the basic usage of the jiebaR package. For detailed operations, please refer to the package author's official documentation. Thanks again to the jiebaR author @qinwenfeng for providing R with such a good toolkit for Chinese word segmentation!

Please indicate the source for Reprint:
http://blog.fens.me/r-word-jiebar/