[NLP] Summary of Chinese and English keyword extraction techniques in Python

Posted by theCro on Mon, 25 Oct 2021 04:40:26 +0200

Keyword extraction has significant practical and analytical value for both Chinese and English text. This post introduces several common keyword extraction methods for each language in a Python environment.

1. English

Several methods of extracting English keywords:

1.1 spaCy

spaCy is an integrated, industrial-strength natural language processing library. Its main functions include tokenization, part-of-speech tagging, lemmatization, named entity recognition, noun phrase extraction and so on.

text = "Private investment firm Carlyle Group,which has a reputation for making well-timed and occasionally controversial plays in the defense industry, has quietly placed its bets on another part of the market."
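## spaCy
import spacy  # the model must be installed first: python -m spacy download en_core_web_sm
spacy_nlp = spacy.load("en_core_web_sm")
doc = spacy_nlp(text)
# doc.ents holds the named entities recognized in the text
for ent in doc.ents:
    print(ent.text, ent.label_)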
    
## output
Carlyle Group ORG

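The entities in doc.ents may be 1-grams, 2-grams, 3-grams and so on; this length cannot be configured manually. Note also that doc.ents returns named entities only, which is why this sentence yields a single result.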
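
Because the phrase length cannot be set in spaCy itself, one workaround is to filter the extracted phrases afterwards. The following is a minimal sketch, assuming the entity strings have already been collected into a plain list; `filter_by_ngram_length` is a hypothetical helper (not part of spaCy) and the sample entities are made up for illustration.

```python
def filter_by_ngram_length(phrases, max_n=2):
    """Keep only phrases with at most max_n whitespace-separated tokens."""
    return [p for p in phrases if len(p.split()) <= max_n]


# Hypothetical entity strings, e.g. collected with [ent.text for ent in doc.ents]
entities = ["Carlyle Group", "Bank of New York Mellon", "Boeing"]
short_entities = filter_by_ngram_length(entities, max_n=2)
print(short_entities)
```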

1.2 yake

yake uses statistical features of the text itself to select the most important keywords from an article. In its output, a lower score means a more relevant keyword.

## yake
import yake
language = "en"
max_ngram_size = 2  # maximum number of words in a keyword
deduplication_threshold = 0.9  # similarity limit above which near-duplicate keywords are dropped
numOfKeywords = 20  # number of keywords to return
custom_kw_extractor = yake.KeywordExtractor(lan=language, n=max_ngram_size, dedupLim=deduplication_threshold, top=numOfKeywords, features=None)
keywords = custom_kw_extractor.extract_keywords(text)

## output
[('Carlyle Group,which', 0.007444681613352736),
 ('firm Carlyle', 0.013797198203993007),
 ('Private investment', 0.015380821171891606),
 ('defense industry', 0.015380821171891606),
 ('investment firm', 0.02570861714399338),
 ('making well-timed', 0.02570861714399338),
 ('occasionally controversial', 0.02570861714399338),
 ('controversial plays', 0.02570861714399338),
 ('Carlyle', 0.08596317751626563),
 ('Group,which', 0.08596317751626563),
 ('Private', 0.09568045026443411),
 ('industry', 0.09568045026443411),
 ('market', 0.09568045026443411),
 ('investment', 0.15831692877998726),
 ('firm', 0.15831692877998726),
 ('reputation', 0.15831692877998726),
 ('making', 0.15831692877998726),
 ('well-timed', 0.15831692877998726),
 ('occasionally', 0.15831692877998726),
 ('controversial', 0.15831692877998726)]
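
To make the role of the deduplication threshold more concrete, here is a rough sketch of threshold-based deduplication. It is an illustration only: it uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure, which may differ from the one yake applies internally, and `deduplicate` and the near-duplicate sample entry are hypothetical.

```python
from difflib import SequenceMatcher


def deduplicate(scored_keywords, threshold=0.9):
    """Drop keywords that are too similar to an already accepted one.

    scored_keywords: list of (keyword, score) pairs, lower score = better.
    """
    kept = []
    for kw, score in sorted(scored_keywords, key=lambda pair: pair[1]):
        too_similar = any(
            SequenceMatcher(None, kw.lower(), other.lower()).ratio() >= threshold
            for other, _ in kept
        )
        if not too_similar:
            kept.append((kw, score))
    return kept


candidates = [
    ("defense industry", 0.0154),
    ("defense industries", 0.0300),  # hypothetical near-duplicate
    ("market", 0.0957),
]
strict = deduplicate(candidates, threshold=0.8)  # merges the near-duplicate
loose = deduplicate(candidates, threshold=0.9)   # keeps all three
print(strict)
print(loose)
```

A lower threshold removes more near-duplicates, while a value close to 1.0 lets almost everything through.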

1.3 rake_nltk

rake_nltk implements Rapid Automatic Keyword Extraction (RAKE), a fast automatic keyword extraction algorithm, on top of the NLTK toolkit.

## Rake
from rake_nltk import Rake  # relies on NLTK's "stopwords" data
rake_nltk_var = Rake(max_length=2)  # phrases of at most two words
rake_nltk_var.extract_keywords_from_text(text)
keyword_extracted = rake_nltk_var.get_ranked_phrases()  # highest score first

## output
['quietly placed',
 'making well',
 'defense industry',
 'another part',
 'timed',
 'reputation',
 'market',
 'bets']

On this sample, rake_nltk performs about as well as spaCy.
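
For readers who want to see the idea behind RAKE, the following is a simplified sketch of the scoring scheme: stopwords and punctuation split the text into candidate phrases, each word is scored as degree divided by frequency, and a phrase score is the sum of its word scores. The tiny stopword list and the `rake_keywords` function are assumptions made for this illustration; rake_nltk uses NLTK's stopword list and its own tokenization, so its results will differ.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption, not NLTK's list)
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the"}


def rake_keywords(text, stopwords=STOPWORDS):
    """Return (phrase, score) pairs ranked by RAKE score, highest first."""
    phrases = []
    # Punctuation (except hyphens) splits the text into fragments.
    for fragment in re.split(r"[^\w\s-]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in stopwords:
                # A stopword ends the current candidate phrase.
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))

    freq = defaultdict(int)    # how often each word occurs
    degree = defaultdict(int)  # summed length of the phrases containing it
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)

    scores = {
        " ".join(p): sum(degree[w] / freq[w] for w in p) for p in set(phrases)
    }
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


ranked = rake_keywords("keyword extraction is fast and keyword ranking is simple")
print(ranked)
```

Longer phrases tend to score higher under this scheme, which is one reason a length limit such as max_length is useful.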

1.4 gensim

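## gensim
# keywords() is a TextRank-based extractor; the gensim.summarization module
# was removed in gensim 4.0, so this code needs gensim < 4.0
from gensim.summarization import keywords
gensim_kw = keywords(text, words=10, split=True, scores=True)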

## output
[('carlyle', 0.3454150914587271),
 ('investment firm', 0.3405754844080691),
 ('occasionally controversial plays', 0.28088699744995504),
 ('defense', 0.2808869974499539),
 ('industry', 0.28088699744995377),
 ('quietly placed', 0.2808869974499534)]

On this sample, gensim's keyword extraction does not reach the level of spaCy and rake_nltk; gensim still has room for improvement on keyword extraction tasks.

2. Chinese

2.1 TextRank

TextRank is a graph-based ranking algorithm inspired by PageRank. It was first used for summary extraction, and existing Python libraries can extract both keywords and summaries. TextRank is not necessarily better than TF-IDF (given a suitable corpus), but its advantage is that it can analyze a single article without any corpus training, so it is a practical choice when no adequate corpus is available.

# Sample text, shown here in English translation; the libraries in this section target Chinese input.
test_text = 'Knowledge map (Knowledge Graph), in the field of library and information science, is called knowledge domain visualization or knowledge domain mapping. It is a series of different graphics that show the relationship between the knowledge development process and its structure. It uses visualization technology to describe knowledge resources and their carriers, and to mine, analyze, construct, draw and display knowledge and the relationships between them. The knowledge atlas combines the theories and methods of applied mathematics, graphics, information visualization technology, information science and other disciplines with methods such as metrological citation analysis and co-occurrence analysis, and uses a visual atlas to vividly display the core structure, development history, frontier fields and overall knowledge structure of a discipline, so as to achieve multi-disciplinary integration.'

from textrank4zh import TextRank4Keyword, TextRank4Sentence

tr4w = TextRank4Keyword()
tr4w.analyze(text=test_text,lower=True,window=5)
print('Test text keywords are as follows:\n')
for item in tr4w.get_keywords(5,word_min_len=1):
    print(item['word'],item['weight'])

tr4s = TextRank4Sentence()
tr4s.analyze(text=test_text, lower=True, source='no_stop_words')
key_sentences = tr4s.get_key_sentences(num=5,sentence_min_len=2)
for sentence in key_sentences:
    print(sentence['weight'],sentence['sentence'])
    
## output
 Knowledge 0.07136000358189572
 Visualization 0.050885665487685895
 Discipline 0.04382397117871283
 Atlas 0.034505489704179104
 Analysis 0.032669435513839745
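
The mechanism behind these scores can be sketched in a few lines: words that appear close together are linked in a graph, and PageRank is run on that graph. The code below is a simplified illustration under stated assumptions, not the library's implementation: `textrank_keywords` is a hypothetical function, the input is an already tokenized list, and the window here covers the current token plus the next `window - 1` tokens, which may not match how the library defines its window.

```python
from collections import defaultdict


def textrank_keywords(tokens, window=5, damping=0.85, iterations=50):
    """Rank tokens by PageRank over an undirected co-occurrence graph."""
    # Link every pair of distinct tokens that appear within the window.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Power iteration: each word shares its score equally among its neighbors.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }
    return sorted(scores.items(), key=lambda item: -item[1])


# Toy token list; "knowledge" co-occurs with every other word.
tokens = ["knowledge", "graph", "knowledge", "map", "knowledge", "domain"]
ranking = textrank_keywords(tokens, window=2)
print(ranking[0][0])
```

Since the graph is built from the document alone, no external corpus is needed, which matches the advantage described above.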

2.2 TF-IDF

TF-IDF is an algorithm based on term frequency and inverse document frequency. It is relatively simple and fast, and the results are reasonably good. However, the IDF values must be computed from an existing corpus, and the quality of that corpus has a great impact on the results. To apply TF-IDF in a real project, obtaining a suitable corpus is a problem that must be solved first.

import jieba.analyse

keywords_tfidf = jieba.analyse.extract_tags(test_text, topK = 20, withWeight=True)
for item in keywords_tfidf:
    print(item[0],item[1])
## output
 Knowledge 0.6291566258975609
 Visualization 0.5459818268731708
 Atlas 0.36000858444256095
 Discipline 0.24579239780817075
 Analysis 0.17720506157560972
 Intelligence community 0.16952045917073172
 Graphics 0.16952045917073172
Knowledge 0.14578984759634145
Graph 0.14578984759634145
 Co occurrence 0.14578984759634145
 Information science 0.1364954567182927
 Domain 0.1320072083514634
 Metrology 0.12932732665853658
 Citation 0.1284558758780488
 Theory 0.12825460906317074
 Method 0.12116597634243903
 Map 0.12043080440536585
 Structure 0.1178287709707317
 Technology 0.11510871167243902
 Show 0.11022549280878048
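
To show why the corpus matters, here is a small sketch that computes TF-IDF by hand against a toy corpus. It is an illustration under stated assumptions: `tfidf_keywords` is a hypothetical function, the corpus is made up, and the smoothed IDF formula is one common variant. jieba relies on its own precomputed IDF table, so its weights are not reproduced by this code.

```python
import math
from collections import Counter


def tfidf_keywords(doc_tokens, corpus, top_k=3):
    """Score the tokens of one document against a reference corpus.

    doc_tokens: list of tokens of the document to analyze.
    corpus: list of token lists used to estimate document frequencies.
    """
    n_docs = len(corpus)
    doc_freq = Counter(word for doc in corpus for word in set(doc))
    term_freq = Counter(doc_tokens)
    total = len(doc_tokens)
    scores = {
        # Smoothed IDF so that words missing from the corpus do not divide by zero.
        word: (count / total) * math.log((1 + n_docs) / (1 + doc_freq[word]))
        for word, count in term_freq.items()
    }
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top_k]


# Toy corpus: "the" occurs in every document, so its IDF is zero.
corpus = [["the", "knowledge", "graph"], ["the", "market"], ["the", "industry"]]
doc = ["the", "knowledge", "graph", "the", "knowledge"]
top = tfidf_keywords(doc, corpus)
print(top)
```

A word that is frequent in the document but common across the corpus ends up with a low weight, so a corpus that does not resemble the target text gives misleading IDF values.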

2.3 TextRank (jieba)

# jieba also ships a TextRank-based keyword extractor
keywords_textrank_jieba = jieba.analyse.textrank(test_text, topK=20, withWeight=True)
for item in keywords_textrank_jieba:
    print(item[0],item[1])
## output
 Knowledge 1.0
 Visualization 0.6295270480086527
 Analysis 0.48002211534304284
 Discipline 0.448184981430669
 Technology 0.33755106917904004
 Structure 0.3218718222753201
 Development 0.3176518370007447
 Method 0.3153386943262916
 Domain 0.3120507490347102
 Atlas 0.2845462026540273
 Show 0.28294240822624933
 Graphics 0.23556758861253835
 Reach 0.22453237874888934
 Resource 0.2231116119429888
 Schema 0.21777848096566976
 Co occurrence 0.21498592836125222
 Fusion 0.1941284901445365
 Theory 0.19230572366470158
 Carrier 0.18945172483525918
 Core 0.18851788164895478
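
One simple way to compare the three Chinese methods is to measure how much their top keywords overlap. The sketch below computes the Jaccard similarity of the top-5 lists copied from the outputs in sections 2.1 to 2.3; the `jaccard` helper is a hypothetical name introduced for this illustration.

```python
def jaccard(a, b):
    """Jaccard similarity between two keyword collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


# Top-5 keywords copied from the outputs in sections 2.1, 2.2 and 2.3.
textrank_top5 = ["Knowledge", "Visualization", "Discipline", "Atlas", "Analysis"]
tfidf_top5 = ["Knowledge", "Visualization", "Atlas", "Discipline", "Analysis"]
jieba_textrank_top5 = ["Knowledge", "Visualization", "Analysis", "Discipline", "Technology"]

same_set = jaccard(textrank_top5, tfidf_top5)          # identical sets, different order
partial = jaccard(tfidf_top5, jieba_textrank_top5)     # 4 shared out of 6 distinct words
print(same_set, partial)
```

On this sample the methods agree on most of the top keywords and differ mainly in ordering and in the lower-ranked terms.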
