Spam detection - Part 1

Posted by BigBrother on Wed, 10 Nov 2021 12:39:05 +0100

Pan Chuang AI sharing

Author | Md Sohel Mahmood Compile | VK Source: Towards Data Science

Spam detection is an important application of machine learning algorithm in filtering spam. In the field of natural language processing, several algorithms can be used for such classification. Usually spam has some typical words.

In this article, we will use the nltk package to introduce spam and non spam text processing in detail. In particular, we will see the stem analysis and morphological reduction process of NLP. We will also implement NB classifier, SVC and random forest classifier to detect spam, and compare the accuracy of the classifier. Let's start.

According to the nltk document, "nltk is the leading platform for building Python programs to process human language data". Using nltk to process and mark up text is very simple, such as stemming and morphological reduction, which we will see later.

First, we need to import the necessary packages.

import numpy as np 
import pandas as pd 
import nltk
from nltk.corpus import stopwords
import matplotlib.pyplot as plt

from wordcloud import WordCloud, ImageColorGenerator
from PIL import Image
import re
from nltk.tokenize import word_tokenize as wt
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

Data label

After importing the csv file containing spam and non spam tag text, I created two data frames: one for real email and the other for spam, which we will use for analysis.

spam_data = spam[spam['label'] == 'spam']
real_data = spam[spam['label'] == 'ham']

Stem analysis and form reduction

Let's do stem analysis first, and then restore their forms. According to Stanford NLP team, "stemming analysis usually refers to a rough heuristic process, that is, cutting off the suffix, hoping to achieve this goal correctly in most cases, usually including removing derived affixes.

Word form reduction usually refers to the correct handling of things by using vocabulary and word form analysis. Usually, the purpose is to remove the ending of word form changes and return the basic form or dictionary form of a word, which is called word form. "Here, stem analysis is applied to all data, spam data and real data respectively.

all_data_stem = []
spam_data_stem = []
real_data_stem = []
stemmer = PorterStemmer()
for i in range(spam.shape[0]):
    sms = spam.iloc[i, 1]
    sms = re.sub('[^A-Za-z]', ' ', sms)
    sms = sms.lower()
    tokenized_sms = wt(sms)
    sms_processed = []
    for word in tokenized_sms:
        if word not in set(stopwords.words('english')):
            sms_processed.append(stemmer.stem(word))
    sms_text = " ".join(sms_processed)
    all_data_stem.append(sms_text)

for i in range(spam_data.shape[0]):
    sms = spam_data.iloc[i, 1]
    sms = re.sub('[^A-Za-z]', ' ', sms)
    sms = sms.lower()
    tokenized_sms = wt(sms)
    sms_processed = []
    for word in tokenized_sms:
        if word not in set(stopwords.words('english')):
            sms_processed.append(stemmer.stem(word))
    sms_text = " ".join(sms_processed)
    spam_data_stem.append(sms_text)

for i in range(real_data.shape[0]):
    sms = real_data.iloc[i, 1]
    sms = re.sub('[^A-Za-z]', ' ', sms)
    sms = sms.lower()
    tokenized_sms = wt(sms)
    sms_processed = []
    for word in tokenized_sms:
        if word not in set(stopwords.words('english')):
            sms_processed.append(stemmer.stem(word))
    sms_text = " ".join(sms_processed)
    real_data_stem.append(sms_text)
all_data_stem[0]

Then word form restoration is applied to all data, spam data and real data respectively.

all_data_lemma = []
spam_data_lemma = []
real_data_lemma = []
lemmatizer = WordNetLemmatizer()
for i in range(spam.shape[0]):
    sms = spam.iloc[i, 1]
    sms = re.sub('[^A-Za-z]', ' ', sms)
    sms = sms.lower()
    tokenized_sms = wt(sms)
    sms_processed = []
    for word in tokenized_sms:
        if word not in set(stopwords.words('english')):
            sms_processed.append(lemmatizer.lemmatize(word))
    sms_text = " ".join(sms_processed)
    all_data_lemma.append(sms_text)

for i in range(spam_data.shape[0]):
    sms = spam_data.iloc[i, 1]
    sms = re.sub('[^A-Za-z]', ' ', sms)
    sms = sms.lower()
    tokenized_sms = wt(sms)
    sms_processed = []
    for word in tokenized_sms:
        if word not in set(stopwords.words('english')):
            sms_processed.append(lemmatizer.lemmatize(word))
    sms_text = " ".join(sms_processed)
    spam_data_lemma.append(sms_text)

for i in range(real_data.shape[0]):
    sms = real_data.iloc[i, 1]
    sms = re.sub('[^A-Za-z]', ' ', sms)
    sms = sms.lower()
    tokenized_sms = wt(sms)
    sms_processed = []
    for word in tokenized_sms:
        if word not in set(stopwords.words('english')):
            sms_processed.append(lemmatizer.lemmatize(word))
    sms_text = " ".join(sms_processed)
    real_data_lemma.append(sms_text)
all_data_lemma[0]

If we look at the stemming analysis of the first data text, we will get:

'go jurong point crazi avail bugi n great world la e buffet cine got amor wat'.

Word form restore provides:

'go jurong point crazy available bugis n great world la e buffet cine got amore wat'

It is obvious from the first batch of data that stem analysis and morphological reduction work in different ways. For example, the stem of the word "availability" is "available", but the word form is "available".

Wordcloud

After marking with numbers, let's create a wordcloud to see the most common words.

spam['num_label'] = spam['label'].map({'ham': 0, 'spam': 1})
spam_words = ' '.join(list(spam[spam['num_label'] == 1]['text']))
spam_wc = WordCloud(width = 600,height = 512).generate(spam_words)
plt.figure(figsize = (12, 8), facecolor = 'k')
plt.imshow(spam_wc)
plt.axis('off')
plt.tight_layout(pad = 0)
plt.show()

There are many attractive words in spam, while the text in real e-mail is very random, as shown below.

frequency distribution

We may be interested in looking at the most commonly used words in spam. It can be obtained by the following frequency distribution

from nltk import FreqDist
spam_token = nltk.tokenize.word_tokenize(spam_words)
spam_freq = FreqDist(spam_token)
spam_freq

FreqDist({'.': 1004, 'to': 608, '!': 542, ',': 371, 'a': 358, 'you': 189, 'call': 187, 'your': 187, 'or': 185, '&': 178, ...})

At other times, we may be interested in focusing on the most common parts of repeated sentences in spam.

FreqDist(spam_data_lemma).most_common(5)

Obviously, sentence parts such as "private account statement" or "secret admirer" are one of the most attractive phrases in spam.

Discrete graph

We can get the dispersion diagram of the target words to see the distribution. It will provide information about the occurrence of a specific word based on the total number of words. I chose the words "free", "private", "account", "contact" as the presentation words.

classifier

Here, I created a classifier from the scikit learn library. We need to convert the text into a token count matrix. The CountVectorizer of scikit learn is very convenient.

We will first try the NaiveBayes function, which is easy to implement and has shorter training time. In order to facilitate training, I chose 80% of the data.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

matrix = CountVectorizer(max_features=1000)
X = matrix.fit_transform(all_data_lemma).toarray()
y = spam.iloc[:, 0]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

from sklearn.naive_bayes import GaussianNB

classifier = GaussianNB()
classifier.fit(X_train, y_train)

from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
rep = classification_report(y_test, y_pred)

Accuracy is the ratio of TP (true positive) to the sum of TP and FP (false positive). Recall rate is the ratio of TP to the sum of TP and FN (false negative). If a real spam is mistakenly recognized as a real email, it is a false positive. On the other hand, if a real email is identified as spam, it is false negative.

The report shows that the model performs well in detecting real mail, but not in detecting SPAM. The accuracy of SPAM is ~ 0.38, indicating that a large number of false positives are obtained from the model.

Although the accuracy of the model is 0.79, it may be misleading. The recall rate of spam is high and the accuracy is low. This shows that the model is biased towards spam. It can correctly identify most spam, but it also mistakenly identifies some normal emails as spam.

array([[744, 224],
       [ 12, 135]], dtype=int64)

The confusion matrix shows a similar scenario. There are no large numbers on the diagonal. This means that the performance is not good enough for naive Bayes.

Let's try support vector classification and random forest algorithm.

from sklearn.svm import SVC

classifier = SVC()
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
rep = classification_report(y_test, y_pred)

The performance of the model is good. The accuracy and recall rate of normal and spam are very high. Finally, let's try to use random forest as classifier.

from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier()
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
rep = classification_report(y_test, y_pred)

In this spam detection case, Random Forest is also a good classifier, which can provide high precision and recall for real and spam.

conclusion

In this paper, nltk library is used to demonstrate the process of word stem analysis and word form restoration of NLP, and several binary classifier algorithms are compared.

Naive Bayes has lower accuracy, while SVC and random forest provide higher accuracy, recall and accuracy.

Cross validation techniques can be used to evaluate the skills of these classifiers. Now, there are many open source platforms that can be trained or cross validated without any code, which I will discuss in another article.

Github provides code blocks: https://mdsohelmahmood.github.io/2021/06/23/Spam-email-classification-using-NB-SVC-Random-Forest.html

Thanks for reading!