Natural language processing - text preprocessing

Posted by khaitan_anuj on Tue, 12 Oct 2021 22:15:07 +0200

As with other data preprocessing in machine learning, natural language processing requires text preprocessing, such as word segmentation for Chinese and lemmatization for English.

Common preprocessing methods

1. Text normalization

  • Convert uppercase to lowercase
output_str = input_str.lower()
  • Number handling
    Remove all digits:
import re
output_str = re.sub(r'\d','',input_str)
  • Punctuation processing
    Python strings have a translate method that replaces specific characters in a string with other specified characters. Its translation table can be a dictionary whose keys are the Unicode code points of the characters to replace and whose values are the replacement characters. The maketrans method helps build this table and accepts one to three arguments:
    • With one argument, it must be a dictionary in the same form as the translation table used by translate:
    translation = str.maketrans({ord('A'): 'a', ord('B'): ord('b')})
    
    • With two arguments, both must be strings of equal length; each character in the first is replaced by the character at the same position in the second:
    translation = str.maketrans('A', 'a')
    
    • With three arguments, the first two have the same meaning as above. Characters in the third argument are mapped to None (i.e., deleted), and this takes precedence over the first two arguments:
    translation = str.maketrans('AB', 'ab', 'ACD')
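    As a quick check of the three-argument form (the sample string 'ABCD' here is just illustrative), characters in the third argument are deleted even if they also appear in the first:
    # 'A', 'C', 'D' are deleted (the third argument wins), 'B' is mapped to 'b'.
    print('ABCD'.translate(str.maketrans('AB', 'ab', 'ACD')))  # prints: b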
    
import string
output_str = input_str.translate(str.maketrans("", "", string.punctuation))
  • Whitespace handling
    Remove leading and trailing whitespace:
output_str = input_str.strip()
  • Stemming
    English words take many inflected forms, such as plurals and continuous tenses. Extracting the stem of a word is useful in information retrieval, but the stem may no longer express the complete semantics, because some words are not valid words after stemming.
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()
tokens = word_tokenize(input_str)
for word in tokens:
    print(stemmer.stem(word))
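    For example, the Porter stemmer can produce stems that are not words themselves, which is the caveat mentioned above:
# Stems are not always valid words.
print(stemmer.stem('studies'))  # studi
print(stemmer.stem('running'))  # run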
  • Lemmatization
    Lemmatization restores inflected words, such as different tense forms, to their base dictionary form, for example was to be.
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()
tokens = word_tokenize(input_str)
for word in tokens:
    print(lemmatizer.lemmatize(word))
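    Note that WordNetLemmatizer treats every word as a noun by default, so verb forms pass through unchanged unless a part of speech is supplied:
# The lemmatizer assumes pos='n' by default, so 'was' comes back unchanged;
# pos='v' yields the verb lemma.
print(lemmatizer.lemmatize('was'))           # was
print(lemmatizer.lemmatize('was', pos='v'))  # be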

2. Semantic analysis

  1. Part-of-speech tagging
    Generally speaking, the verbs in a text carry more information while auxiliary words carry less, so part-of-speech tagging can provide key information for text processing. It can be done with common natural language toolkits.
import nltk
tokens = nltk.word_tokenize(input_str)
output = nltk.pos_tag(tokens)
  2. Named entity recognition
    Named entity recognition identifies entities with specific meanings in text, such as person names, place names, organization names, times, etc.
from nltk import word_tokenize, pos_tag, ne_chunk
output = ne_chunk(pos_tag(word_tokenize(input_str)))
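    The pipelines above rely on pretrained NLTK models and corpora that must be downloaded once (resource names current for NLTK releases around this time; newer versions may differ):
import nltk
# One-time downloads used by word_tokenize, pos_tag, and ne_chunk.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')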
  3. Phrase extraction
    Phrase extraction finds fixed, common collocations, such as keep in mind, speed up, etc.
from ICE import CollocationExtractor
extractor = CollocationExtractor.with_collocation_pipeline("T1", bing_key="Temp", pos_check=False)
# Assumed usage per the package's published example ("Temp" is a placeholder key).
print(extractor.get_collocations_of_length([input_str], length=3))

3. Word segmentation

Chinese presents a special problem: there are no delimiters between the words in a sentence, which gives rise to the word segmentation task. Its main difficulties include recognizing new words, the lack of a unified standard for word boundaries, and ambiguity introduced by segmentation.

  1. Mechanical word segmentation
    Dictionary-based: segmentation is carried out according to certain strategies, commonly the maximum matching method, the minimum segmentation method, and so on.
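    A minimal sketch of the forward maximum matching strategy (the dictionary argument stands in for a real lexicon):
def forward_max_match(sentence, dictionary, max_len=4):
    # Greedily take the longest dictionary word starting at each position,
    # falling back to a single character when nothing matches.
    words = []
    i = 0
    while i < len(sentence):
        for j in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + j]
            if j == 1 or candidate in dictionary:
                words.append(candidate)
                i += j
                break
    return words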

  2. Word segmentation based on N-grams
    The first step is to enumerate all possible segmentations of the sentence; an N-gram language model is then used to compute the probability of each segmentation sequence and select the most likely one.
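    A toy sketch under simplifying assumptions: a unigram model stands in for the full N-gram model, and probs maps words to probabilities estimated from a corpus:
import math

def best_segmentation(sentence, probs, max_word_len=4):
    # best[i] holds (log-probability, segmentation) for sentence[:i];
    # dynamic programming implicitly scores every possible segmentation.
    best = [(0.0, [])] + [(-math.inf, None)] * len(sentence)
    for i in range(1, len(sentence) + 1):
        for j in range(max(0, i - max_word_len), i):
            word = sentence[j:i]
            if word in probs and best[j][1] is not None:
                score = best[j][0] + math.log(probs[word])
                if score > best[i][0]:
                    best[i] = (score, best[j][1] + [word])
    return best[-1][1]  # None if the sentence cannot be segmented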

  3. Word segmentation based on hidden Markov models
    An HMM describes a process in which a hidden Markov chain randomly generates an unobservable state sequence, and each state in turn generates an observation, producing the observation sequence. Word segmentation is thereby transformed into a sequence labeling problem in which each character is tagged B, M, or E for the beginning, middle, or end of a word (often with an extra S tag for single-character words), and the best tag sequence is found with the Viterbi algorithm.
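    A minimal Viterbi decoding sketch for this tagging formulation (start_p, trans_p, and emit_p are assumed to be trained log-probability tables; the names are illustrative):
states = ['B', 'M', 'E', 'S']

def viterbi(chars, start_p, trans_p, emit_p):
    # V[s] is the best log-probability of a tag path ending in state s;
    # path[s] remembers that path.
    V = {s: start_p[s] + emit_p[s].get(chars[0], -1e9) for s in states}
    path = {s: [s] for s in states}
    for ch in chars[1:]:
        new_V, new_path = {}, {}
        for s in states:
            prob, prev = max(
                (V[p] + trans_p[p].get(s, -1e9) + emit_p[s].get(ch, -1e9), p)
                for p in states)
            new_V[s] = prob
            new_path[s] = path[prev] + [s]
        V, path = new_V, new_path
    return path[max(states, key=lambda s: V[s])]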

There are also word segmentation approaches based on conditional random fields, deep learning, and so on.
Common word segmentation tools include Stanford CoreNLP, HanLP, THULAC, SnowNLP, jieba, etc. Existing tools already achieve good segmentation results, and handling of special terms in specialized domains can be improved by adding custom dictionaries.
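For example, with jieba (the sample sentence is illustrative, and the user dictionary path is a hypothetical placeholder):
import jieba

# Segment a sentence into a list of words.
print(jieba.lcut("自然语言处理很有趣"))
# Domain-specific terms can be added through a custom dictionary:
# jieba.load_userdict("user_dict.txt")  # hypothetical file path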

4. Text error correction

Text errors can be divided into non-word spelling errors and real-word spelling errors. The main techniques involved include Bayes' theorem, language models, edit distance, dictionary construction, corpus statistics, and so on.
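Edit distance, one of the techniques listed above, is the minimum number of single-character insertions, deletions, and substitutions needed to turn one word into another; a minimal sketch:
def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance.
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[m][n]

print(edit_distance("speling", "spelling"))  # 1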

Reference: Natural Language Processing: From Introduction to Practice [M]

Topics: Python Machine Learning NLP