[NLP] ⚠️ Hit me if you can't learn it! Learn the basics in half an hour, part 2 ⚠️ Keywords

Posted by kundan on Mon, 06 Sep 2021 05:57:58 +0200

Overview

From today on, we start a journey into natural language processing (NLP). NLP lets us process, understand, and use human language, building a bridge between machine language and human language.

Keywords

Keywords capture the essence of an article and have important applications in document retrieval, automatic summarization, text clustering / classification, and so on.

Keyword extraction methods

Keyword extraction: for a new document, an algorithm analyzes the document and picks some of its own words as its keywords
Keyword assignment: given an existing keyword thesaurus, several words from the thesaurus are assigned to a new document as its keywords
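
The difference between the two approaches can be illustrated with a toy sketch. It assumes whitespace-tokenized English text and raw word counts as the ranking signal; the function names and the sample thesaurus are hypothetical and only serve the illustration.

```python
from collections import Counter


def extract_keywords(document, top_k=3):
    """Keyword extraction: candidates come from the document itself."""
    counts = Counter(document.lower().split())
    return [word for word, _ in counts.most_common(top_k)]


def assign_keywords(document, thesaurus, top_k=3):
    """Keyword assignment: candidates come from a fixed thesaurus."""
    counts = Counter(document.lower().split())
    present = [term for term in thesaurus if counts[term] > 0]
    # sorted() is stable, so ties keep the thesaurus order.
    return sorted(present, key=lambda term: -counts[term])[:top_k]


document = "nlp models process language and nlp models learn language patterns"
thesaurus = ["language", "vision", "models", "speech"]  # hypothetical thesaurus

extracted = extract_keywords(document)
assigned = assign_keywords(document, thesaurus)
print(extracted)
print(assigned)
```

Extraction can return any word found in the document, while assignment can only return terms that already exist in the thesaurus.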

TF-IDF keyword extraction

TF-IDF (term frequency-inverse document frequency) is a weighting technique commonly used in information retrieval and data mining. It is a numerical statistic that reflects how important a word is to one document within a corpus, which makes it useful for mining keywords from articles.

TF

TF (term frequency) measures how often a word appears in a text.

Formula:

TF = (number of times the word appears in the document) / (total number of words in the document)
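
As a minimal sketch, term frequency can be computed by counting tokens and dividing by the document length. The helper name `term_frequency` and the whitespace tokenization are assumptions made for the example.

```python
from collections import Counter


def term_frequency(tokens):
    """Return {word: count / total_tokens} for one tokenized document."""
    total = len(tokens)
    counts = Counter(tokens)
    return {word: count / total for word, count in counts.items()}


tokens = "natural language processing studies natural language".split()
tf = term_frequency(tokens)
print(tf["natural"])   # 2 occurrences out of 6 tokens
print(tf["studies"])   # 1 occurrence out of 6 tokens
```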

IDF

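IDF (inverse document frequency) measures how rare a word is across the corpus: the fewer documents contain the word, the higher its IDF. It is usually computed as the logarithm of the inverse of the fraction of documents that contain the word.

Formula:

IDF = log(total number of documents / number of documents containing the word)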
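
A small sketch of inverse document frequency, assuming the common logarithmic form log(N / df) without smoothing. The helper name and the three-document toy corpus are hypothetical.

```python
import math


def inverse_document_frequency(documents):
    """Return {word: log(N / df)}, where df counts the documents containing the word."""
    n_docs = len(documents)
    vocabulary = set().union(*documents)
    return {
        word: math.log(n_docs / sum(1 for doc in documents if word in doc))
        for word in vocabulary
    }


# Each document is represented as the set of words it contains.
documents = [
    {"the", "cat", "sat"},
    {"the", "dog", "ran"},
    {"the", "cat", "ran"},
]
idf = inverse_document_frequency(documents)
print(idf["the"])  # appears in every document, so log(3 / 3)
print(idf["dog"])  # appears in one document, so log(3 / 1)
```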

TF-IDF

Formula:

TF-IDF = TF × IDF = (occurrences of the word / total words in the document) × log(total documents / documents containing the word)

If a word is very common across the corpus, its IDF is low; if it is rare, its IDF is high. TF-IDF therefore helps us filter out common words and extract keywords.
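
The filtering effect can be seen in a short sketch that scores every word of a toy corpus. It assumes the logarithmic IDF form log(N / df); the function name `tf_idf` and the sample sentences are made up for the illustration and are unrelated to jieba's internals.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Score every word of every tokenized document with TF * log(N / df)."""
    n_docs = len(corpus)
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter(word for doc in corpus for word in set(doc))
    scores = []
    for doc in corpus:
        counts = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]
scores = tf_idf(corpus)
for word, score in sorted(scores[0].items(), key=lambda item: -item[1]):
    print(f"{word:<5}{score:.4f}")
```

In the first document, "the" is the most frequent word but appears in every document, so its score drops to zero, while words unique to that document score highest.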

Jieba TF-IDF keyword extraction

Format:

jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=())

Parameters:

sentence: the text to extract keywords from
topK: the number of keywords to return. The default value is 20
withWeight: whether to return each keyword's weight together with the keyword. The default value is False
allowPOS: only words with the specified parts of speech are considered. It is empty by default, which means no filtering

jieba part-of-speech tags

Code	Part of speech	Description
Ag	Adjective morpheme	Adjective morpheme. The adjective code is a, and the morpheme code g is preceded by A.
a	Adjective	Take the first letter of the English word adjective.
ad	Adverbial adjective	An adjective used directly as an adverbial. The adjective code a and the adverb code d are combined.
an	Nominal adjective	An adjective with a noun function. The adjective code a and the noun code n are combined.
b	Distinguishing word	Take the initial consonant of the Chinese character "bie".
c	Conjunction	Take the first letter of the English word conjunction.
dg	Adverb morpheme	Adverb morpheme. The adverb code is d, and the morpheme code g is preceded by d.
d	Adverb	Take the second letter of the English word adverb, because its first letter is already used for adjectives.
e	Interjection	Take the first letter of the English word exclamation.
f	Location word	Take the initial consonant of the Chinese character "fang".
g	Morpheme	Most morphemes can serve as the "root" of compound words. Take the initial consonant of the Chinese character "gen" (root).
h	Prefix component	Take the first letter of the English word head.
i	Idiom	Take the first letter of the English word idiom.
j	Abbreviation	Take the initial consonant of the Chinese character "jian".
k	Suffix component
l	Idiomatic phrase	A habitual expression that has not yet become an idiom and is somewhat "temporary". Take the initial consonant of the Chinese character "lin" (temporary).
m	Numeral	Take the third letter of the English word numeral, because n and u are already used.
Ng	Noun morpheme	Noun morpheme. The noun code is n, and the morpheme code g is preceded by N.
n	Noun	Take the first letter of the English word noun.
nr	Person name	The noun code n is combined with the initial consonant of the Chinese character "ren".
ns	Place name	The noun code n is combined with the place code s.
nt	Organization name	The initial consonant of "tuan" is t; the noun code n and t are combined.
nz	Other proper noun	The initial consonant of "zhuan" is z; the noun code n and z are combined.
o	Onomatopoeia	Take the first letter of the English word onomatopoeia.
p	Preposition	Take the first letter of the English word preposition.
q	Classifier	Take the first letter of the English word quantity.
r	Pronoun	Take the second letter of the English word pronoun, because p is already used for prepositions.
s	Place word	Take the first letter of the English word space.
tg	Time morpheme	Time morpheme. The time word code is t, and t is placed in front of the morpheme code g.
t	Time word	Take the first letter of the English word time.
u	Auxiliary word	Taken from the English word auxiliary.
vg	Verb morpheme	Verb morpheme. The verb code is v, and v is placed in front of the morpheme code g.
v	Verb	Take the first letter of the English word verb.
vd	Adverbial verb	A verb used directly as an adverbial. The verb code v and the adverb code d are combined.
vn	Nominal verb	A verb with a noun function. The verb code v and the noun code n are combined.
w	Punctuation
x	Non-morpheme word	A non-morpheme word is just a symbol. The letter x is commonly used to represent unknowns and symbols.
y	Modal particle	Take the initial consonant of the Chinese character "yu".
z	State word	Take the initial consonant of the Chinese character "zhuang" (state).
un	Unknown word

Without keyword weight

example:

import jieba.analyse

# Define text
text = "Natural language processing is an important direction in the field of computer science and artificial intelligence." \
       "It studies various theories and methods that can realize effective communication between human and computer with natural language." \
       "Natural language processing is a science integrating linguistics, computer science and mathematics." \
       "Therefore, the research in this field will involve natural language, that is, people's daily language," \
       "Therefore, it is closely related to the study of linguistics, but there are important differences." \
       "Natural language processing is not a general study of natural language," \
       "It is to develop a computer system that can effectively realize natural language communication, especially the software system." \
       "So it is a part of computer science"

# Extract keywords
keywords = jieba.analyse.extract_tags(text, topK=20, withWeight=False)

# Print the extracted keywords
print(keywords)

Output:

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\Windows\AppData\Local\Temp\jieba.cache
Loading model cost 0.890 seconds.
Prefix dict has been built successfully.
['natural language', 'computer science', 'linguistics', 'Research', 'field', 'handle', 'signal communication', 'Effective', 'software system', 'artificial intelligence', 'realization', 'computer system', 'important', 'one', 'One door', 'daily', 'computer', 'close', 'mathematics', 'development']

With keyword weight

import jieba.analyse

# Define text
text = "Natural language processing is an important direction in the field of computer science and artificial intelligence." \
       "It studies various theories and methods that can realize effective communication between human and computer with natural language." \
       "Natural language processing is a science integrating linguistics, computer science and mathematics." \
       "Therefore, the research in this field will involve natural language, that is, people's daily language," \
       "Therefore, it is closely related to the study of linguistics, but there are important differences." \
       "Natural language processing is not a general study of natural language," \
       "It is to develop a computer system that can effectively realize natural language communication, especially the software system." \
       "So it is a part of computer science"

# Extract keywords (with weight)
keywords = jieba.analyse.extract_tags(text, topK=20, withWeight=True)

# Print the extracted (keyword, weight) pairs
print(keywords)

Output:

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\Windows\AppData\Local\Temp\jieba.cache
Loading model cost 1.110 seconds.
Prefix dict has been built successfully.
[('natural language', 1.1237629576061539), ('computer science', 0.4503481350267692), ('linguistics', 0.27566262244215384), ('Research', 0.2660770221507693), ('field', 0.24979825580353845), ('handle', 0.24973179957046154), ('signal communication', 0.2043557391963077), ('Effective', 0.16296019853692306), ('software system', 0.16102600688461538), ('artificial intelligence', 0.14550809839215384), ('realization', 0.14389939312584615), ('computer system', 0.1402028601413846), ('important', 0.12347581087876922), ('one', 0.11349408224353846), ('One door', 0.11300493477184616), ('daily', 0.10913612756276922), ('computer', 0.1046889912443077), ('close', 0.10181409957492307), ('mathematics', 0.10166677655076924), ('development', 0.09868653898630769)]
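
With `withWeight=True`, each result is a `(keyword, weight)` tuple, so the list is easy to post-process. The sketch below reuses the first five pairs from the output above; the cutoff value of 0.26 is an arbitrary assumption chosen only to show filtering.

```python
# First five (keyword, weight) pairs copied from the output above.
weighted = [
    ("natural language", 1.1237629576061539),
    ("computer science", 0.4503481350267692),
    ("linguistics", 0.27566262244215384),
    ("Research", 0.2660770221507693),
    ("field", 0.24979825580353845),
]

# Print an aligned two-column view of the pairs.
for word, weight in weighted:
    print(f"{word:<20}{weight:.4f}")

# Keep only keywords above an arbitrary, hypothetical cutoff.
threshold = 0.26
strong = [word for word, weight in weighted if weight >= threshold]
print(strong)
```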

TextRank

TextRank builds a graph from the adjacency (co-occurrence) relationships between words, then uses PageRank to iteratively compute a rank value for each node. Sorting the nodes by rank value yields the keywords.
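
The idea can be sketched in a few lines of plain Python. This is a simplified illustration, not jieba's implementation: it assumes whitespace-tokenized input, an unweighted undirected graph, a hypothetical co-occurrence window of 2, and the usual PageRank damping factor of 0.85.

```python
from collections import defaultdict


def textrank(tokens, window=2, damping=0.85, iterations=30):
    """Rank words by PageRank over an undirected co-occurrence graph."""
    # Link every pair of distinct words that appear within `window` positions.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Iteratively redistribute each node's rank evenly among its neighbors.
    ranks = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        ranks = {
            word: (1 - damping) + damping * sum(
                ranks[n] / len(neighbors[n]) for n in neighbors[word]
            )
            for word in neighbors
        }
    return sorted(ranks, key=ranks.get, reverse=True)


tokens = "language models process language and people study language".split()
ranked = textrank(tokens)
print(ranked)
```

Words that co-occur with many different neighbors, such as "language" in this toy input, collect the most rank and end up at the front of the list.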

import jieba.analyse

# Define text
text = "Natural language processing is an important direction in the field of computer science and artificial intelligence." \
       "It studies various theories and methods that can realize effective communication between human and computer with natural language." \
       "Natural language processing is a science integrating linguistics, computer science and mathematics." \
       "Therefore, the research in this field will involve natural language, that is, people's daily language," \
       "Therefore, it is closely related to the study of linguistics, but there are important differences." \
       "Natural language processing is not a general study of natural language," \
       "It is to develop a computer system that can effectively realize natural language communication, especially the software system." \
       "So it is a part of computer science"

# TextRank keyword extraction
keywords = jieba.analyse.textrank(text, topK=20, withWeight=False)

# Print the extracted keywords
print(keywords)

Output:

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\Windows\AppData\Local\Temp\jieba.cache
Loading model cost 1.062 seconds.
Prefix dict has been built successfully.
['Research', 'field', 'computer science', 'realization', 'handle', 'linguistics', 'mathematics', 'people', 'computer', 'involve', 'Have', 'one', 'method', 'language', 'development', 'use', 'artificial intelligence', 'lie in', 'contact', 'science']

Topics: Algorithm Machine Learning NLP
