Like other data preprocessing in machine learning, natural language processing requires text preprocessing, such as word segmentation for Chinese and lemmatization for English.
Common preprocessing methods
1. Text normalization
- Convert uppercase to lowercase
output_str = input_str.lower()
- Digit handling
Remove all digits:
import re
output_str = re.sub(r'\d', '', input_str)
- Punctuation processing
Python strings have a translate method, which replaces specific characters in a string with other specified characters according to a translation table. The table can be a dictionary: its keys must be the Unicode code points of the characters to replace, and its values are the replacement characters (or code points, or None to delete). The maketrans static method helps build this table and accepts up to three parameters:
- When there is only one parameter, it must be a dictionary in the same format as the table expected by translate
translation = str.maketrans({ord('A'): 'a', ord('B'): ord('b')})
- When there are two parameters, they must be strings of the same length; each character in the first is replaced by the character at the same position in the second
translation = str.maketrans('A', 'a')
- When there are three parameters, the first two have the same meaning as above; every character appearing in the third is mapped to None (i.e., deleted), and this takes precedence over the first two parameters
translation = str.maketrans('AB', 'ab', 'ACD')
import string
output_str = input_str.translate(str.maketrans("", "", string.punctuation))
- Whitespace handling
Remove leading and trailing whitespace:
output_str = input_str.strip()
- Stemming
English words take many inflected forms, such as plurals and progressive tenses. Extracting the stem of a word is useful in information retrieval, but the stem may not express the complete meaning, because some words are no longer valid words after stemming.
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()
tokens = word_tokenize(input_str)
for word in tokens:
    print(stemmer.stem(word))
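To illustrate the caveat above, the Porter stemmer can produce strings that are not dictionary words (the sample words are illustrative):
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
print(stemmer.stem('studies'))  # 'studi' -- not a valid English word
print(stemmer.stem('running'))  # 'run'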
- Lemmatization
This step restores inflected words, such as different tense forms, to their base form, for example reducing was to be.
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()
tokens = word_tokenize(input_str)
for word in tokens:
    print(lemmatizer.lemmatize(word))
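Note that WordNetLemmatizer treats every word as a noun unless told otherwise, so restoring was to be requires passing the part of speech explicitly; a minimal sketch:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('was'))           # not restored when treated as a noun
print(lemmatizer.lemmatize('was', pos='v'))  # 'be'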
2. Semantic analysis
- Part of speech tagging
Generally speaking, the verbs in a text carry more information while auxiliary words carry less, so part-of-speech tagging can provide key information for text processing. It can be performed with common natural language toolkits.
import nltk

tokens = nltk.word_tokenize(input_str)
output = nltk.pos_tag(tokens)
- Named entity recognition
It identifies entities with specific meanings in the text, such as person names, place names, organization names, times, etc.
from nltk import word_tokenize, pos_tag, ne_chunk

output = ne_chunk(pos_tag(word_tokenize(input_str)))
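The NLTK examples in this section assume the required models have already been downloaded; if not, fetch them first (resource names may differ slightly across NLTK versions):
import nltk
nltk.download('punkt')                       # tokenizer models for word_tokenize
nltk.download('averaged_perceptron_tagger')  # POS tagger model
nltk.download('maxent_ne_chunker')           # named entity chunker
nltk.download('words')                       # word list used by the chunker
nltk.download('wordnet')                     # data for WordNetLemmatizer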
- Phrase extraction
Extract fixed, common collocations, such as keep in mind, speed up, etc.
from ICE import CollocationExtractor

extractor = CollocationExtractor.with_collocation_pipeline("T1", bing_key="Temp", pos_check=False)
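If the ICE package is unavailable, NLTK's collocations module offers a comparable capability; a minimal sketch ranking bigrams by pointwise mutual information:
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

tokens = nltk.word_tokenize(input_str)
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)  # drop bigrams seen fewer than 2 times
bigram_measures = BigramAssocMeasures()
print(finder.nbest(bigram_measures.pmi, 10))  # 10 most strongly associated bigrams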
3. Word segmentation
Chinese presents a special problem: there are no delimiters between the words in a sentence, which gives rise to the task of word segmentation. Word segmentation has several difficulties, such as new word recognition, the lack of a unified standard for word boundaries, and ambiguity introduced by segmentation.
- Mechanical word segmentation
Based on a dictionary, segmentation follows certain strategies; common strategies include the maximum matching method, the minimum segmentation method, and so on (see the sketch after this list).
- Word segmentation based on N-grams
The first step is to enumerate all possible segmentations of the sentence, then score each candidate segmentation sequence with an N-gram language model and choose the most probable one.
- Word segmentation based on hidden Markov models
An HMM describes a process in which a hidden Markov chain randomly generates an unobservable state sequence, and each state in turn generates an observation, yielding the observation sequence. Word segmentation is thus transformed into a sequence labeling problem, in which each character is tagged as B (begin), M (middle), or E (end) of a word.
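To make the dictionary-based approach concrete, here is a minimal forward maximum matching sketch; the dictionary and sentence are toy examples:
def forward_max_match(sentence, dictionary, max_len=4):
    # Greedily take the longest dictionary word starting at each position.
    words = []
    i = 0
    while i < len(sentence):
        match = sentence[i]  # fall back to a single character
        for j in range(min(len(sentence), i + max_len), i + 1, -1):
            if sentence[i:j] in dictionary:
                match = sentence[i:j]
                break
        words.append(match)
        i += len(match)
    return words

dictionary = {'研究', '研究生', '生命', '的', '起源'}
print(forward_max_match('研究生命的起源', dictionary))
# ['研究生', '命', '的', '起源'] -- the greedy match misses the intended
# reading '研究 / 生命 / 的 / 起源', illustrating the ambiguity problem above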
There are also segmentation methods based on conditional random fields, deep learning, and so on.
Common word segmentation tools include Stanford CoreNLP, HanLP, THULAC, SnowNLP, jieba, etc. Existing tools already achieve good segmentation results, and handling of special terms in specialized domains can be improved by adding custom dictionaries, as the jieba sketch below shows.
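A minimal jieba sketch (the user dictionary path is a placeholder):
import jieba

# jieba.load_userdict('user_dict.txt')  # optional: domain-specific words, one per line
print(jieba.lcut('研究生命的起源'))  # segment in the default (accurate) mode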
4. Text error correction
Text errors can be divided into non-word spelling errors and real-word spelling errors. The main techniques involved are Bayes' theorem, language models, edit distance, dictionary construction, corpus statistics, and so on.
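Edit distance measures how many single-character insertions, deletions, and substitutions turn one word into another, and is used to generate and rank correction candidates; a minimal Levenshtein sketch:
def edit_distance(a, b):
    # dp[i][j] = edits needed to turn a[:i] into b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i
    for j in range(len(b) + 1):
        dp[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(a)][len(b)]

print(edit_distance('recieve', 'receive'))  # 2 (the ie/ei swap costs two substitutions)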
Reference: Natural Language Processing: From Introduction to Practice [M]