[NLP] Summary of Chinese and English keyword extraction techniques in Python
Keyword extraction is valuable for both applications and text analysis, in Chinese as well as in English. This post introduces several common keyword extraction methods for both languages in a Python environment.
1. English
Several methods of extracting English keywords:
1.1 spaCy
spaCy is an industrial-strength natural language processing library. Its main functions include tokenization, part-of-speech tagging, lemmatization, named entity recognition and noun phrase extraction.
text = "Private investment firm Carlyle Group,which has a reputation for making well-timed and occasionally controversial plays in the defense industry, has quietly placed its bets on another part of the market."

## spaCy
import spacy

# Requires the model: python -m spacy download en_core_web_sm
spacy_nlp = spacy.load("en_core_web_sm")
doc = spacy_nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)

## output
Carlyle Group ORG
doc.ents returns named entities, which may be 1-grams, 2-grams, 3-grams and so on; their length cannot be set manually.
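Since the entity length cannot be configured, one possible workaround is to filter the results after extraction. The following is a minimal sketch, assuming the entities have already been collected as (text, label) pairs. `filter_by_ngram` is a hypothetical helper and not part of spaCy, and the second entity in the sample list is made up for illustration.

```python
def filter_by_ngram(entities, max_n):
    """Keep only (text, label) pairs whose text has at most max_n space-separated tokens."""
    return [(text, label) for text, label in entities if len(text.split()) <= max_n]


# Illustrative input, e.g. built with [(ent.text, ent.label_) for ent in doc.ents].
entities = [("Carlyle Group", "ORG"), ("New York Stock Exchange", "ORG")]
print(filter_by_ngram(entities, max_n=2))
```

With `max_n=2` only the two-word entity is kept. Counting tokens by splitting on whitespace is a simplification; spaCy's own tokenization may count tokens differently.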
1.2 yake
yake (Yet Another Keyword Extractor) uses statistical features of the text to select the most important keywords from an article. A lower score means a more relevant keyword.
## yake
import yake

# `text` is the sample sentence defined in section 1.1
language = "en"
max_ngram_size = 2             # maximum number of words in a keyword
deduplication_threshold = 0.9  # similarity limit above which candidate keywords count as duplicates
numOfKeywords = 20             # number of keywords to return
custom_kw_extractor = yake.KeywordExtractor(lan=language, n=max_ngram_size, dedupLim=deduplication_threshold, top=numOfKeywords, features=None)
keywords = custom_kw_extractor.extract_keywords(text)

## output
[('Carlyle Group,which', 0.007444681613352736),
 ('firm Carlyle', 0.013797198203993007),
 ('Private investment', 0.015380821171891606),
 ('defense industry', 0.015380821171891606),
 ('investment firm', 0.02570861714399338),
 ('making well-timed', 0.02570861714399338),
 ('occasionally controversial', 0.02570861714399338),
 ('controversial plays', 0.02570861714399338),
 ('Carlyle', 0.08596317751626563),
 ('Group,which', 0.08596317751626563),
 ('Private', 0.09568045026443411),
 ('industry', 0.09568045026443411),
 ('market', 0.09568045026443411),
 ('investment', 0.15831692877998726),
 ('firm', 0.15831692877998726),
 ('reputation', 0.15831692877998726),
 ('making', 0.15831692877998726),
 ('well-timed', 0.15831692877998726),
 ('occasionally', 0.15831692877998726),
 ('controversial', 0.15831692877998726)]
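The keyword 'Carlyle Group,which' in the output above presumably appears because the sample text has no space after that comma. One way to avoid such artifacts is to normalize punctuation spacing before extraction. This is a minimal sketch using only the standard library; `normalize_punctuation_spacing` is a hypothetical helper name, not part of yake.

```python
import re


def normalize_punctuation_spacing(text):
    """Insert a space after a comma, semicolon or colon that is directly followed by a letter."""
    return re.sub(r"([,;:])(?=[A-Za-z])", r"\1 ", text)


sample = "Private investment firm Carlyle Group,which has a reputation"
print(normalize_punctuation_spacing(sample))
# Private investment firm Carlyle Group, which has a reputation
```

The lookahead only matches letters, so numbers such as "1,000" are left unchanged. The cleaned string can then be passed to `extract_keywords` in place of the raw text.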
1.3 rake_nltk
rake_nltk implements Rapid Automatic Keyword Extraction (RAKE), a fast automatic keyword extraction algorithm, on top of NLTK.
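## Rake
# Requires the NLTK data packages: nltk.download("stopwords") and nltk.download("punkt")
from rake_nltk import Rake

# `text` is the sample sentence defined in section 1.1
rake_nltk_var = Rake(max_length=2)  # keep phrases of at most two words
rake_nltk_var.extract_keywords_from_text(text)
keyword_extracted = rake_nltk_var.get_ranked_phrases()

## output
['quietly placed', 'making well', 'defense industry', 'another part', 'timed', 'reputation', 'market', 'bets']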
On this example, rake_nltk performs about as well as spaCy.
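To make the idea behind RAKE more concrete, here is a minimal pure-Python sketch of its usual scoring scheme: the text is split into candidate phrases at stop words and punctuation, each word is scored as degree divided by frequency, and a phrase scores the sum of its word scores. This is an illustration under stated assumptions, not the rake_nltk implementation: the stop word list is a tiny hand-picked one, and `candidate_phrases` and `rake_scores` are hypothetical names.

```python
import re
from collections import defaultdict

# A tiny illustrative stop word list; rake_nltk relies on NLTK's much larger one.
STOPWORDS = {"a", "an", "and", "for", "has", "in", "its", "of", "on", "the", "which"}


def candidate_phrases(text, stopwords=STOPWORDS):
    """Split text into candidate phrases at stop words and punctuation."""
    phrases = []
    # Punctuation (anything that is not a word character, whitespace or hyphen) ends a phrase.
    for fragment in re.split(r"[^\w\s-]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in stopwords:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rake_scores(text, stopwords=STOPWORDS):
    """Score each phrase as the sum of degree(word) / frequency(word) over its words."""
    phrases = candidate_phrases(text, stopwords)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences, counting the word itself
    scores = {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in phrases}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


print(rake_scores("Carlyle Group has a reputation for plays in the defense industry"))
```

In this sketch the two-word phrases "carlyle group" and "defense industry" rank above the single words, because words that occur in longer phrases have a higher degree.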
1.4 gensim
## gensim
# gensim.summarization is only available in gensim < 4.0; the module was removed in 4.0
from gensim.summarization import keywords

# `text` is the sample sentence defined in section 1.1
gensim_kw = keywords(text, words=10, split=True, scores=True)

## output
[('carlyle', 0.3454150914587271),
 ('investment firm', 0.3405754844080691),
 ('occasionally controversial plays', 0.28088699744995504),
 ('defense', 0.2808869974499539),
 ('industry', 0.28088699744995377),
 ('quietly placed', 0.2808869974499534)]
On this example, gensim's keyword extraction does not reach the level of spaCy or rake_nltk. gensim still has room for improvement on keyword extraction tasks.
2. Chinese
2.1 TextRank
TextRank is an algorithm inspired by PageRank. It was first used for summary extraction, and the existing Python library can extract both keywords and summary sentences. TextRank does not necessarily give better results than TF-IDF (given an appropriate corpus), but its advantage is that it can analyze a single article without any corpus training, so it is a useful option when the corpus is insufficient.
from textrank4zh import TextRank4Keyword, TextRank4Sentence

test_text = 'Knowledge map( Knowledge Graph),In the field of Library and information, it is called knowledge domain visualization or knowledge domain mapping map. It is a series of different graphics that show the relationship between knowledge development process and structure. It uses visualization technology to describe knowledge resources and their carriers, mine, analyze, construct, draw and display knowledge and the relationship between them. Knowledge atlas is through the visualization of Applied Mathematics, graphics and information The theories and methods of chemical technology, information science and other disciplines are combined with metrology citation analysis, co-occurrence analysis and other methods, and the visual atlas is used to vividly display the core structure, development history, frontier fields and overall knowledge structure of the discipline, so as to achieve the purpose of multi-disciplinary integration.'

# Keyword extraction
tr4w = TextRank4Keyword()
tr4w.analyze(text=test_text, lower=True, window=5)
print('Test text keywords are as follows:\n')
for item in tr4w.get_keywords(5, word_min_len=1):
    print(item['word'], item['weight'])

# Key sentence (summary) extraction
tr4s = TextRank4Sentence()
tr4s.analyze(text=test_text, lower=True, source='no_stop_words')
key_sentences = tr4s.get_key_sentences(num=5, sentence_min_len=2)
for sentence in key_sentences:
    print(sentence['weight'], sentence['sentence'])

## output
Knowledge 0.07136000358189572
Visualization 0.050885665487685895
Discipline 0.04382397117871283
Atlas 0.034505489704179104
Analysis 0.032669435513839745
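The idea behind TextRank for keywords can be sketched in a few lines: words that occur close to each other are linked in a graph, and a PageRank-style iteration then ranks the words. The following is a simplified illustration, not the textrank4zh implementation. `textrank_keywords` is a hypothetical function, its `window` parameter follows this sketch's own definition, and the input is assumed to be already tokenized and filtered.

```python
from collections import defaultdict


def textrank_keywords(words, window=2, damping=0.85, iterations=50):
    """Rank words with a PageRank-style iteration over an undirected co-occurrence graph."""
    # Link every pair of distinct words that appear within `window` positions of each other.
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1 : i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        updated = {}
        for word in neighbors:
            # Each neighbor shares its score equally among its own neighbors.
            incoming = sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            updated[word] = (1 - damping) + damping * incoming
        scores = updated
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


tokens = ["knowledge", "graph", "knowledge", "visualization", "knowledge", "structure"]
for word, score in textrank_keywords(tokens):
    print(word, round(score, 3))
```

In this toy input "knowledge" is linked to every other word and therefore ranks first. No corpus is involved, which matches the advantage described above: the ranking depends only on the single text being analyzed.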
2.2 TF-IDF
TF-IDF scores a word by its term frequency multiplied by its inverse document frequency. It is relatively simple and fast, and gives relatively good results. However, the IDF values have to be computed from an existing corpus, so the quality of the corpus has a great impact on the results. Before applying TF-IDF to a project, the question of a suitable corpus must be solved.
import jieba.analyse

# `test_text` is the sample passage defined in section 2.1
keywords_tfidf = jieba.analyse.extract_tags(test_text, topK=20, withWeight=True)
for item in keywords_tfidf:
    print(item[0], item[1])

## output
Knowledge 0.6291566258975609
Visualization 0.5459818268731708
Atlas 0.36000858444256095
Discipline 0.24579239780817075
Analysis 0.17720506157560972
Intelligence community 0.16952045917073172
Graphics 0.16952045917073172
Knowledge 0.14578984759634145
Graph 0.14578984759634145
Co occurrence 0.14578984759634145
Information science 0.1364954567182927
Domain 0.1320072083514634
Metrology 0.12932732665853658
Citation 0.1284558758780488
Theory 0.12825460906317074
Method 0.12116597634243903
Map 0.12043080440536585
Structure 0.1178287709707317
Technology 0.11510871167243902
Show 0.11022549280878048
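To show why the corpus matters, here is a minimal sketch that computes TF-IDF by hand from a small tokenized corpus. It is an illustration only and does not reproduce how jieba computes its weights. `tfidf_keywords` is a hypothetical function, the toy corpus is made up, and the add-one smoothing of the IDF is one common choice among several.

```python
import math
from collections import Counter


def tfidf_keywords(document, corpus, top_k=3):
    """Rank the words of `document` by term frequency times inverse document frequency."""
    n_docs = len(corpus)
    # Document frequency: the number of corpus documents in which each word appears.
    doc_freq = Counter(word for doc in corpus for word in set(doc))
    term_counts = Counter(document)
    total = len(document)
    scores = {}
    for word, count in term_counts.items():
        tf = count / total
        # Add-one smoothing keeps the IDF finite for words missing from the corpus.
        idf = math.log((1 + n_docs) / (1 + doc_freq[word])) + 1
        scores[word] = tf * idf
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]


# Toy corpus of already tokenized documents (made up for illustration).
corpus = [
    ["knowledge", "graph", "shows", "structure"],
    ["the", "method", "shows", "structure"],
    ["the", "structure", "of", "the", "method"],
]
document = ["knowledge", "graph", "shows", "knowledge", "structure"]
print(tfidf_keywords(document, corpus))
```

Here "structure" occurs in every corpus document, so its IDF is low and it drops out of the top three, while "knowledge" is both frequent in the document and rare in the corpus. A different corpus would change the IDF values and therefore the ranking.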
2.3 TextRank (jieba)
import jieba.analyse

# `test_text` is the sample passage defined in section 2.1
keywords_textrank_jieba = jieba.analyse.textrank(test_text, topK=20, withWeight=True)
for item in keywords_textrank_jieba:
    print(item[0], item[1])

## output
Knowledge 1.0
Visualization 0.6295270480086527
Analysis 0.48002211534304284
Discipline 0.448184981430669
Technology 0.33755106917904004
Structure 0.3218718222753201
Development 0.3176518370007447
Method 0.3153386943262916
Domain 0.3120507490347102
Atlas 0.2845462026540273
Show 0.28294240822624933
Graphics 0.23556758861253835
Reach 0.22453237874888934
Resource 0.2231116119429888
Schema 0.21777848096566976
Co occurrence 0.21498592836125222
Fusion 0.1941284901445365
Theory 0.19230572366470158
Carrier 0.18945172483525918
Core 0.18851788164895478