[NLP] ⚠️ Can't learn it? Blame me! Basic operations in half an hour, part 2 ⚠️ Keywords

Posted by kundan on Mon, 06 Sep 2021 05:57:58 +0200

Overview

Starting today, we begin a journey into natural language processing (NLP). NLP lets us process, understand, and use human language, building a bridge between machine language and human language.

Keywords

Keywords capture the essence of an article and have important applications in document retrieval, automatic summarization, and text clustering / classification.

Keyword extraction methods

  1. Keyword extraction: for a new document, an algorithm analyzes the text and picks some of the document's own words as its keywords
  2. Keyword assignment: given an existing keyword thesaurus, several entries from the thesaurus are assigned to a new document as its keywords
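The difference between the two approaches can be sketched in a few lines of plain Python. This is only an illustration: the function names are hypothetical, extraction is reduced to simple word counting, and assignment is reduced to checking which thesaurus entries occur in the document.

```python
from collections import Counter

def extract_keywords(words, top_k=3):
    """Keyword extraction: rank the document's own words (here simply by count)."""
    return [word for word, _ in Counter(words).most_common(top_k)]

def assign_keywords(words, thesaurus):
    """Keyword assignment: keep only the thesaurus entries found in the document."""
    present = set(words)
    return [term for term in thesaurus if term in present]

doc = "language model language corpus model language".split()
print(extract_keywords(doc, top_k=2))                        # ['language', 'model']
print(assign_keywords(doc, ["corpus", "grammar", "model"]))  # ['corpus', 'model']
```

Extraction can return any word of the document, while assignment can only return entries of the predefined thesaurus.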

TF-IDF keyword extraction

TF-IDF (term frequency-inverse document frequency) is a weighting technique commonly used in information retrieval and data mining. It is a numerical statistic that reflects how important a word is to one document within a corpus, which makes it useful for mining keywords from articles.

TF

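TF (Term Frequency) is how often a word occurs in a text, normalized by the length of the text

Formula:

TF = number of times the word occurs in the document / total number of words in the document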
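As a rough sketch, term frequency for one tokenized document can be computed as follows. The helper name term_frequency and the sample tokens are made up for illustration.

```python
from collections import Counter

def term_frequency(words):
    """Return {word: count / total number of words} for one tokenized document."""
    total = len(words)
    return {word: count / total for word, count in Counter(words).items()}

doc = ["natural", "language", "processing", "natural", "language"]
tf = term_frequency(doc)
print(tf["natural"])     # 2 of 5 tokens -> 0.4
print(tf["processing"])  # 1 of 5 tokens -> 0.2
```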

IDF

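IDF (Inverse Document Frequency) measures how rare a word is across the corpus. It is based on the inverse of the fraction of documents that contain the word, usually with a logarithm applied to dampen the ratio

Formula:

IDF = log(total number of documents / number of documents containing the word)

TF-IDF

Formula:

TF-IDF = TF × IDF = (number of times the word occurs in the document / total number of words in the document) × log(total number of documents / number of documents containing the word)

If a word appears in most documents, its IDF is very low; if it appears in only a few, its IDF is very high. A word therefore scores highly when it is frequent in one document but rare in the corpus, which lets TF-IDF filter out common words and surface keywords.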
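To make the filtering effect concrete, here is a minimal from-scratch sketch over a tiny made-up corpus. It takes the logarithm of the document ratio for IDF, which is the usual convention; the function name tf_idf and the sample sentences are assumptions for illustration, and real libraries differ in smoothing and normalization.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Score every word of every tokenized document with TF * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word occurs.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]
first = tf_idf(docs)[0]
print(first["the"])                  # 0.0, because "the" occurs in every document
print(first["sat"] > first["cat"])   # True, "sat" is rarer in the corpus than "cat"
```

Even though "the" is the most frequent word in the first document, its score drops to zero because it appears in all three documents.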

Jieba TF-IDF keyword extraction

Format:

jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=())

Parameters:

  • sentence: the text to extract keywords from
  • topK: the number of keywords to return, highest weight first. The default is 20
  • withWeight: whether to return each keyword's weight along with it. The default is False
  • allowPOS: keep only words with the specified parts of speech. Empty by default, meaning no filtering
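How these parameters interact can be mimicked in plain Python. The sketch below is not jieba's code: select_tags is a hypothetical helper, and the words, part-of-speech tags, and weights are invented purely to show the effect of each parameter.

```python
def select_tags(scored, topK=20, withWeight=False, allowPOS=()):
    """Hypothetical helper: pick keywords from (word, pos, weight) triples."""
    # An empty allowPOS means "no filtering", mirroring the default above.
    if allowPOS:
        scored = [item for item in scored if item[1] in allowPOS]
    # Highest weight first, then cut the list to topK entries.
    ranked = sorted(scored, key=lambda item: item[2], reverse=True)[:topK]
    if withWeight:
        return [(word, weight) for word, _, weight in ranked]
    return [word for word, _, _ in ranked]

# Made-up words, tags, and weights, purely for illustration.
scored = [
    ("language", "n", 0.9),
    ("process", "v", 0.5),
    ("important", "a", 0.3),
    ("Beijing", "ns", 0.2),
]
print(select_tags(scored, topK=2))
# ['language', 'process']
print(select_tags(scored, topK=3, withWeight=True, allowPOS=("n", "ns")))
# [('language', 0.9), ('Beijing', 0.2)]
```

With allowPOS set, fewer than topK results can come back, because filtering happens before the cut.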

jieba part-of-speech tags

Code | Part of speech | Description
Ag | Adjective morpheme | Adjectival morpheme. The adjective code is a, and the morpheme code g is preceded by A.
a | Adjective | The first letter of the English word "adjective".
ad | Adverbial adjective | An adjective used directly as an adverbial. Combines the adjective code a and the adverb code d.
an | Nominal adjective | An adjective that functions as a noun. Combines the adjective code a and the noun code n.
b | Distinguishing word | The initial consonant of the Chinese character "bie".
c | Conjunction | The first letter of the English word "conjunction".
dg | Adverb morpheme | Adverbial morpheme. The adverb code is d, and the morpheme code g is preceded by D.
d | Adverb | The second letter of "adverb", because its first letter is already used for adjectives.
e | Interjection | The first letter of the English word "exclamation".
f | Location word | The initial consonant of the Chinese character "fang".
g | Morpheme | Most morphemes can serve as the "root" of compound words, so the code takes the initial consonant of the Chinese character "gen" (root).
h | Prefix component | The first letter of the English word "head".
i | Idiom | The first letter of the English word "idiom".
j | Abbreviation | The initial consonant of the Chinese character "jian".
k | Suffix component |
l | Idiomatic expression | An expression that has not yet become a fixed idiom and is somewhat "temporary", so the code takes the initial consonant of the Chinese character "lin" (temporary).
m | Numeral | The third letter of the English word "numeral", because n and u are already used for other codes.
Ng | Noun morpheme | Nominal morpheme. The noun code is n, and the morpheme code g is preceded by N.
n | Noun | The first letter of the English word "noun".
nr | Person name | Combines the noun code n with the initial consonant of the Chinese character "ren" (person).
ns | Place name | Combines the noun code n and the place word code s.
nt | Organization | The initial consonant of "tuan" is t. Combines the noun code n and t.
nz | Other proper noun | The first letter of the initial consonant of "zhuan" is z. Combines the noun code n and z.
o | Onomatopoeia | The first letter of the English word "onomatopoeia".
p | Preposition | The first letter of the English word "preposition".
q | Classifier | The first letter of the English word "quantity".
r | Pronoun | The second letter of the English word "pronoun", because p is already used for prepositions.
s | Place word | The first letter of the English word "space".
tg | Time morpheme | Temporal morpheme. The time word code is t, and the morpheme code g is preceded by T.
t | Time word | The first letter of the English word "time".
u | Auxiliary word | Taken from the English word "auxiliary".
vg | Verb morpheme | Verbal morpheme. The verb code is v, and the morpheme code g is preceded by V.
v | Verb | The first letter of the English word "verb".
vd | Adverbial verb | A verb used directly as an adverbial. Combines the verb and adverb codes.
vn | Nominal verb | A verb that functions as a noun. Combines the verb and noun codes.
w | Punctuation |
x | Non-morpheme character | A non-morpheme character is just a symbol. The letter x is commonly used to stand for unknowns and symbols.
y | Modal particle | The initial consonant of the Chinese character "yu".
z | State word | The first letter of the initial consonant of the Chinese character "zhuang".
un | Unknown word |

Without keyword weight

Example:

import jieba.analyse

# Define text (adjacent string literals are concatenated, so each ends with a space)
text = (
    "Natural language processing is an important direction in the field of computer science and artificial intelligence. "
    "It studies various theories and methods that can realize effective communication between human and computer with natural language. "
    "Natural language processing is a science integrating linguistics, computer science and mathematics. "
    "Therefore, the research in this field will involve natural language, that is, people's daily language, "
    "Therefore, it is closely related to the study of linguistics, but there are important differences. "
    "Natural language processing is not a general study of natural language, "
    "It is to develop a computer system that can effectively realize natural language communication, especially the software system. "
    "So it is a part of computer science"
)

# Extract the top 20 keywords by TF-IDF weight (words only)
keywords = jieba.analyse.extract_tags(text, topK=20, withWeight=False)

# Print the keywords
print(keywords)

Output results:

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\Windows\AppData\Local\Temp\jieba.cache
Loading model cost 0.890 seconds.
Prefix dict has been built successfully.
['natural language', 'computer science', 'linguistics', 'Research', 'field', 'handle', 'signal communication', 'Effective', 'software system', 'artificial intelligence', 'realization', 'computer system', 'important', 'one', 'One door', 'daily', 'computer', 'close', 'mathematics', 'development']

With keyword weight

import jieba.analyse

# Define text (adjacent string literals are concatenated, so each ends with a space)
text = (
    "Natural language processing is an important direction in the field of computer science and artificial intelligence. "
    "It studies various theories and methods that can realize effective communication between human and computer with natural language. "
    "Natural language processing is a science integrating linguistics, computer science and mathematics. "
    "Therefore, the research in this field will involve natural language, that is, people's daily language, "
    "Therefore, it is closely related to the study of linguistics, but there are important differences. "
    "Natural language processing is not a general study of natural language, "
    "It is to develop a computer system that can effectively realize natural language communication, especially the software system. "
    "So it is a part of computer science"
)

# Extract keywords (with weight): returns a list of (word, weight) tuples
keywords = jieba.analyse.extract_tags(text, topK=20, withWeight=True)

# Print the keywords and their weights
print(keywords)

Output results:

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\Windows\AppData\Local\Temp\jieba.cache
Loading model cost 1.110 seconds.
Prefix dict has been built successfully.
[('natural language', 1.1237629576061539), ('computer science', 0.4503481350267692), ('linguistics', 0.27566262244215384), ('Research', 0.2660770221507693), ('field', 0.24979825580353845), ('handle', 0.24973179957046154), ('signal communication', 0.2043557391963077), ('Effective', 0.16296019853692306), ('software system', 0.16102600688461538), ('artificial intelligence', 0.14550809839215384), ('realization', 0.14389939312584615), ('computer system', 0.1402028601413846), ('important', 0.12347581087876922), ('one', 0.11349408224353846), ('One door', 0.11300493477184616), ('daily', 0.10913612756276922), ('computer', 0.1046889912443077), ('close', 0.10181409957492307), ('mathematics', 0.10166677655076924), ('development', 0.09868653898630769)]

TextRank

TextRank builds a graph from the adjacency (co-occurrence) relationships between words, then uses the PageRank algorithm to iteratively compute a rank value for each node. Sorting the words by rank value yields the keywords.
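The idea can be sketched from scratch in plain Python. This is a simplified illustration under stated assumptions, not jieba's implementation: the function name, the window and damping values, and the sample tokens are all made up here, and jieba additionally applies details such as part-of-speech filtering and score normalization.

```python
from collections import defaultdict

def textrank(words, window=2, damping=0.85, iterations=50):
    """Rank words by PageRank over an undirected co-occurrence graph."""
    # Link every word to the words that follow it inside the window.
    weights = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(words):
        for other in words[i + 1:i + window]:
            if other != word:
                weights[word][other] += 1.0
                weights[other][word] += 1.0

    rank = {word: 1.0 for word in weights}
    for _ in range(iterations):
        new_rank = {}
        for word in weights:
            # Each neighbor passes on a share of its rank proportional to the edge weight.
            incoming = sum(
                weights[nbr][word] / sum(weights[nbr].values()) * rank[nbr]
                for nbr in weights[word]
            )
            new_rank[word] = (1 - damping) + damping * incoming
        rank = new_rank
    return sorted(rank, key=rank.get, reverse=True)

tokens = "natural language processing language model language understanding".split()
ranking = textrank(tokens, window=2)
print(ranking[0])  # language
```

With window=2 only directly adjacent words are linked, so "language" becomes the hub of the graph and receives the highest rank. Unlike TF-IDF, no corpus statistics are needed: the ranking comes from the structure of the single document.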

import jieba.analyse

# Define text (adjacent string literals are concatenated, so each ends with a space)
text = (
    "Natural language processing is an important direction in the field of computer science and artificial intelligence. "
    "It studies various theories and methods that can realize effective communication between human and computer with natural language. "
    "Natural language processing is a science integrating linguistics, computer science and mathematics. "
    "Therefore, the research in this field will involve natural language, that is, people's daily language, "
    "Therefore, it is closely related to the study of linguistics, but there are important differences. "
    "Natural language processing is not a general study of natural language, "
    "It is to develop a computer system that can effectively realize natural language communication, especially the software system. "
    "So it is a part of computer science"
)

# TextRank keyword extraction (words only)
keywords = jieba.analyse.textrank(text, topK=20, withWeight=False)

# Print the keywords
print(keywords)

Output results:

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\Windows\AppData\Local\Temp\jieba.cache
['Research', 'field', 'computer science', 'realization', 'handle', 'linguistics', 'mathematics', 'people', 'computer', 'involve', 'Have', 'one', 'method', 'language', 'development', 'use', 'artificial intelligence', 'lie in', 'contact', 'science']
Loading model cost 1.062 seconds.
Prefix dict has been built successfully.
