Chengzhihe: Artificial Intelligence and Machine Learning Series - NLTK Natural Language Processing

Posted by phpfolk2003 on Sun, 19 Dec 2021 17:16:21 +0100

This article is the sixth module in our series on learning Python and its applications in machine learning (ML) and artificial intelligence (AI). In the previous module, we discussed image recognition with OpenCV. Now let's see what the Natural Language Toolkit (NLTK) can do.

Installation

The first way to install NLTK is with Anaconda:

conda install nltk

The second way is to run pip from a Jupyter Notebook cell:

!pip install --upgrade nltk

If the following Python code runs without errors, the installation is successful:

import nltk

NLTK comes with a large amount of downloadable data (corpora, grammars, models, etc.), so just run the following Python command and an interactive download window will appear:

nltk.download()

For this module, you will also need to install the "stopwords" corpus. After downloading it, create an environment variable named NLTK_DATA containing the path of the download directory (this is not needed if you did a central installation; see the documentation for a complete guide to installing the data).
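
If you would rather not use the interactive window, individual resources can also be downloaded by name. A minimal sketch (the identifiers below are the standard NLTK names for the stop word list and for the tokenizer models that word_tokenize uses later in this module):

import nltk

# Download the stop word corpus used in this module.
nltk.download("stopwords")
# Download the Punkt tokenizer models, which nltk.word_tokenize relies on.
nltk.download("punkt")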

Text classification

Classifying text means assigning it a label. Text can be classified in many ways, such as sentiment analysis (positive / negative / neutral), spam classification (spam / not spam), by document topic, and so on.

In this module, we will walk through a text classification example using the Large Movie Review Dataset, which provides 25,000 movie reviews (both positive and negative) for training and the same number for testing.
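
The code in this module assumes the dataset archive has been unpacked next to the notebook, so the reviews sit in roughly this directory layout (these aclImdb paths are the ones used by the code below):

aclImdb/
    train/
        pos/    <- 12,500 positive training reviews, one text file each
        neg/    <- 12,500 negative training reviews
    test/
        pos/    <- 12,500 positive test reviews
        neg/    <- 12,500 negative test reviews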

NLTK provides a Naive Bayes classifier to handle the machine learning work. Our job is mainly to write a function that extracts "features" from the text; the classifier uses these features to perform its classification.

Our function, called a feature extractor, takes a string (the text) as an argument and returns a dictionary that maps feature names to their values, called a feature set.

For the movie reviews, our features will be the top N words (excluding stop words). The feature extractor therefore returns a feature set with those N words as keys and a Boolean indicating their presence or absence as values.
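
As an illustration only (the words and values here are made up), a feature set for a single review might look like this, with one key per top word:

example_feature_set = {
    "great": True,       # "great" appears in this review
    "boring": False,     # "boring" does not
    "waste_NEG": False,  # the negated form "waste_NEG" does not either
}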

The first step is to go through the reviews, store all the words (except stop words), and find the most frequently used words.

First, a helper function that takes a text and returns its non-stop-word tokens:

import nltk
import nltk.sentiment.util
from nltk.corpus import stopwords

# The set of English stop words to filter out.
stop = set(stopwords.words("english"))

def extract_words_from_text(text):
    # Split the text into tokens (words and punctuation).
    tokens = nltk.word_tokenize(text)
    # Mark tokens that appear after a negation word with a "_NEG" suffix.
    tokens_neg_marked = nltk.sentiment.util.mark_negation(tokens)
    # Keep only alphanumeric, non-stop-word tokens (ignoring the "_NEG" suffix).
    return [t for t in tokens_neg_marked
            if t.replace("_NEG", "").isalnum() and
            t.replace("_NEG", "") not in stop]

word_tokenize splits the text into a list of tokens (still retaining punctuation).

mark_negation marks the tokens that come after a negation word with a _NEG suffix. So, for example, "I did not enjoy this." becomes this after tokenization and negation marking:

["I", "did", "not", "enjoy_NEG", "this_NEG", "."].

The last line removes all remaining stop words (including the negated ones) and punctuation. There are still many useless words in the text, such as "I" or "this", but this filtering is enough for our demonstration.
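
As a quick sanity check (assuming the "stopwords" and "punkt" resources are installed), running the helper on the example sentence above should give roughly the following; the capitalized "I" survives because the stop word list is all lowercase:

print(extract_words_from_text("I did not enjoy this."))
# Prints something like: ['I', 'enjoy_NEG']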

Next, we build a list of all the words read from the review files. We keep separate lists of positive and negative words to ensure a balance when we pick the top words. (I also tested without separating the word lists, and then most positive reviews were classified as negative.) At the same time, we also create lists of all the positive reviews and all the negative reviews.

import os

# One text file per review in the training set.
positive_files = os.listdir("aclImdb/train/pos")
negative_files = os.listdir("aclImdb/train/neg")

positive_words = []
negative_words = []

positive_reviews = []
negative_reviews = []

for pos_file in positive_files:
    with open("aclImdb/train/pos/" + pos_file, "r") as f:
        # Strip the HTML line breaks that appear in the raw review text.
        txt = f.read().replace("<br />", " ")
        positive_reviews.append(txt)
        positive_words.extend(extract_words_from_text(txt))

for neg_file in negative_files:
    with open("aclImdb/train/neg/" + neg_file, "r") as f:
        txt = f.read().replace("<br />", " ")
        negative_reviews.append(txt)
        negative_words.extend(extract_words_from_text(txt))

It may take some time to run this code because there are many files.

Then we keep only the top N words (2,000 in this case) from the positive and negative word lists and combine them.

N = 2000

freq_pos = nltk.FreqDist(positive_words)
top_word_counts_pos = sorted(freq_pos.items(), key=lambda kv: kv[1], reverse=True)[:N]
top_words_pos = [twc[0] for twc in top_word_counts_pos]

freq_neg = nltk.FreqDist(negative_words)
top_word_counts_neg = sorted(freq_neg.items(), key=lambda kv: kv[1], reverse=True)[:N]
top_words_neg = [twc[0] for twc in top_word_counts_neg]

top_words = list(set(top_words_pos + top_words_neg))

Now we can write the feature extractor. As mentioned earlier, it should return a dictionary with each top word as a key and True or False as a value, depending on whether the word is present in the text.

def extract_features(text):
    # Use a set for fast membership checks against the top word list.
    text_words = set(extract_words_from_text(text))
    return { w: w in text_words for w in top_words }

Then we create a training set, which we feed to the Naive Bayes classifier. The training set should be a list of tuples, where the first element of each tuple is the feature set and the second element is the label.

training = [(extract_features(review), "pos") for review in positive_reviews] + [(extract_features(review), "neg") for review in negative_reviews]

The lines above take a lot of RAM and are slow, so you may want to use only a subset of the reviews by taking a slice of each review list.
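
For example, here is a sketch of how the training set can be built from only the first 5,000 reviews of each class, which is the subset size used for the accuracy figure reported at the end of this module:

SUBSET = 5000  # number of reviews per class to train on
training = [(extract_features(review), "pos") for review in positive_reviews[:SUBSET]] + \
           [(extract_features(review), "neg") for review in negative_reviews[:SUBSET]]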

Training the classifier is simple:

classifier = nltk.NaiveBayesClassifier.train(training)

To classify a new review on the fly, call the classify method on its feature set:

print(classifier.classify(extract_features("Your review goes here.")))

If you want to see the probability of each label, use prob_classify instead:

def get_prob_dist(text):
    prob_dist = classifier.prob_classify(extract_features(text))
    return { "pos": prob_dist.prob("pos"), "neg": prob_dist.prob("neg") }

print(get_prob_dist("Your review goes here."))

NLTK provides a built-in function for measuring the accuracy of the model against a test set, which has the same shape as the training set. The movie review dataset has a separate test directory containing reviews that can be used for this purpose.

# Take the first 2,500 test reviews of each class.
test_positive = os.listdir("aclImdb/test/pos")[:2500]
test_negative = os.listdir("aclImdb/test/neg")[:2500]

test = []

for pos_file in test_positive:
    with open("aclImdb/test/pos/" + pos_file, "r") as f:
        txt = f.read().replace("<br />", " ")
        test.append((extract_features(txt), "pos"))
for neg_file in test_negative:
    with open("aclImdb/test/neg/" + neg_file, "r") as f:
        txt = f.read().replace("<br />", " ")
        test.append((extract_features(txt), "neg"))

print(nltk.classify.accuracy(classifier, test))

With N = 2000 and a training set of 5,000 positive and 5,000 negative reviews, I obtained an accuracy of about 85% with this code.

Topics: Python