Notes of "deep learning practice of Web security": Chapter 8 harassment message recognition

Posted by neo926 on Thu, 10 Mar 2022 12:44:26 +0100

This chapter takes the SMS Spam Collection dataset as an example to introduce techniques for identifying harassing SMS messages. It covers the feature extraction methods used for this task, including the bag-of-words and TF-IDF models, the vocabulary model, and the Word2Vec and Doc2Vec models, and then presents the classification models and their validation results, covering the Naive Bayes, support vector machine, XGBoost, and MLP algorithms. The task is similar to the spam identification in Chapter 6 and the negative comment identification in Chapter 7, except that the content to identify is now harassing SMS; all three are binary classification problems.

1, Dataset

The test data comes from the SMS Spam Collection dataset, a classic dataset for harassing SMS identification. It consists entirely of real SMS content: 4,831 normal messages and 747 harassing messages. Download the dataset archive from the official website and decompress it; the normal and harassing messages are saved together in a single text file, SMSSpamCollection.txt, which serves as the test dataset.

Read the data file SMSSpamCollection.txt line by line. Each line consists of a label and the SMS content separated by a tab, so it can be split with the split function to obtain the label and the SMS content directly:

from sklearn.model_selection import train_test_split

def load_all_files():
    x = []
    y = []
    datafile = "../data/sms/smsspamcollection/SMSSpamCollection.txt"
    with open(datafile, encoding='utf-8') as f:
        for line in f:
            line = line.strip('\n')
            # Each line is "<label>\t<message>"; split on the first tab only,
            # in case the message itself contains a tab
            label, text = line.split('\t', 1)
            x.append(text)
            # 'ham' marks a normal message (0); 'spam' marks a harassing one (1)
            if label == 'ham':
                y.append(0)
            else:
                y.append(1)
    # Hold out 40% of the samples as the test set
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4)
    return x_train, x_test, y_train, y_test

The function is called as follows:

x_train, x_test, y_train, y_test=load_all_files()
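For reference, each line of SMSSpamCollection.txt has the following tab-separated form (shown schematically here, not as verbatim dataset content):

ham	<normal SMS content>
spam	<harassing SMS content>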

2, Feature extraction

(1) Word set (vocabulary) model

Vectorization models fall into two types: the word set (vocabulary) model and the bag-of-words model. The bag-of-words model counts the frequency of each word, whereas the vocabulary model uses a generated vocabulary to encode the original sentence word by word as a sequence of indexes. For example, for the sentence "snow angels snow", a bag-of-words vector records {snow: 2, angels: 1}, while the vocabulary model yields the index sequence [1, 2, 1]. The processing is as follows:

import numpy as np
import tflearn

max_document_length = 160  # max words kept per SMS; inferred from the 160-element vectors shown below

def get_features_by_tf():
    global max_document_length
    x_train, x_test, y_train, y_test = load_all_files()

    # Build a vocabulary from the training set and encode each message as a
    # fixed-length sequence of word indexes (index 0 is reserved for padding)
    vp = tflearn.data_utils.VocabularyProcessor(max_document_length=max_document_length,
                                                min_frequency=0,
                                                vocabulary=None,
                                                tokenizer_fn=None)
    x_train = vp.fit_transform(x_train, unused_y=None)
    x_train = np.array(list(x_train))

    # Encode the test set with the vocabulary learned from the training set
    x_test = vp.transform(x_test)
    x_test = np.array(list(x_test))

    return x_train, x_test, y_train, y_test

1. train[0] example

Here train[0] is used as an example to demonstrate the vocabulary model's processing. Before processing, the message reads:

And stop being an old man. You get to build snowman snow angels and snowball fights.

It is then passed through the vocabulary processor:

vp = tflearn.data_utils.VocabularyProcessor(max_document_length=max_document_length,
                                            min_frequency=0,
                                            vocabulary=None,
                                            tokenizer_fn=None)

x_train = vp.fit_transform(x_train, unused_y=None)

The result for train[0] is shown below: each word is replaced by its vocabulary index (assigned in order of first appearance, starting from 1), and the sequence is zero-padded to max_document_length (160 here):

[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]

2. train[1] example

train[1] provides another example. Before processing it reads:

What's ur pin?

After vocabulary processing, the result is as follows; the three new words continue the numbering from the vocabulary built so far and receive indexes 17 to 19:

[17 18 19  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
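To make the encoding concrete, here is a minimal pure-Python sketch (my own, not the book's code, and using a simplified whitespace tokenizer rather than VocabularyProcessor's) that reproduces the behavior seen above: words are numbered from 1 in order of first appearance, each message becomes the sequence of its word indexes, and the result is zero-padded to a fixed length:

def make_vocabulary_encoder(max_document_length):
    vocab = {}  # word -> index; 0 is reserved for padding

    def encode(text):
        ids = []
        for word in text.split():  # simplified tokenizer (whitespace split)
            if word not in vocab:
                vocab[word] = len(vocab) + 1  # new words get the next index, starting at 1
            ids.append(vocab[word])
        ids = ids[:max_document_length]
        return ids + [0] * (max_document_length - len(ids))  # zero padding

    return encode

encode = make_vocabulary_encoder(max_document_length=160)
print(encode("And stop being an old man. You get to build snowman snow angels and snowball fights."))
print(encode("What's ur pin?"))  # new words continue the numbering: 17, 18, 19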

(2) Bag-of-words model

from sklearn.feature_extraction.text import CountVectorizer

max_features = 5000  # vocabulary size cap; an assumed placeholder value

def get_features_by_wordbag():
    global max_features
    x_train, x_test, y_train, y_test = load_all_files()

    # Learn the bag-of-words vocabulary from the training set, keeping at
    # most max_features terms and dropping English stop words
    vectorizer = CountVectorizer(decode_error='ignore',
                                 strip_accents='ascii',
                                 max_features=max_features,
                                 stop_words='english',
                                 max_df=1.0,
                                 min_df=1)
    print(vectorizer)
    x_train = vectorizer.fit_transform(x_train)
    x_train = x_train.toarray()
    vocabulary = vectorizer.vocabulary_

    # Vectorize the test set with the vocabulary learned from the training
    # set, so that train and test features share the same columns
    vectorizer = CountVectorizer(decode_error='ignore',
                                 strip_accents='ascii',
                                 vocabulary=vocabulary,
                                 stop_words='english',
                                 max_df=1.0,
                                 min_df=1)
    print(vectorizer)
    x_test = vectorizer.transform(x_test)  # transform only; the vocabulary is already fixed
    x_test = x_test.toarray()

    return x_train, x_test, y_train, y_test
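As a quick illustration (my own example, not from the book), running CountVectorizer on the two sample messages above shows how each message becomes a vector of word counts over the learned vocabulary:

from sklearn.feature_extraction.text import CountVectorizer

samples = [
    "And stop being an old man. You get to build snowman snow angels and snowball fights.",
    "What's ur pin?",
]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(samples)
print(vectorizer.vocabulary_)  # maps each term to its column index
print(counts.toarray())        # one row of word counts per message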

For the TF-IDF example belonging to this part, refer to the follow-up notes on "Web Security Deep Learning Practice": Chapter 8, Harassing SMS Recognition (2), which covers the TF-IDF processing logic.

Note: this chapter contains a lot of material, so the notes are split into a series; see the subsequent posts for the remaining content.

Topics: AI Deep Learning Cyber Security