Technical / advertising article classifier

Posted by Mateobus on Tue, 04 Jan 2022 04:00:25 +0100

Preface

This article builds on the previous post, Technical / advertising article classifier (I); a few optimizations improved the accuracy from 84.5% to 94.4%.

1, Optimization methods

1. Add more training data

The previous training set had only about 500 samples for each of the two classes, which is too little training data.
The data set used here contains more than 45,000 samples, a roughly 90-fold increase, which should be plenty.

2. Change the classification model

Multinomial Naive Bayes was used before, with mediocre results: it assumes that sample features are independent, so it performs poorly when features are correlated. Switching directly to ensemble learning gave a better result; a toy sketch of the swap follows.
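
To make the swap concrete, here is a toy sketch with made-up documents; the real corpus and training code are in the next section:

# Toy sketch of the model swap; the documents below are made up.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ['深度 学习 模型 训练', '源码 解析 性能 优化',
        '限时 优惠 立即 购买', '扫码 关注 领取 福利']
labels = ['technology', 'technology', 'advertisement', 'advertisement']

X = TfidfVectorizer().fit_transform(docs)
for model in (MultinomialNB(), AdaBoostClassifier()):
    model.fit(X, labels)
    print(type(model).__name__, model.predict(X))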

3. Add a user dictionary during word segmentation

Some key terms were not segmented as desired. Compared with not using a user dictionary, adding one improved accuracy by about 1%; a minimal illustration follows.
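
The dictionary file name and the sample term below are made up; the real dictionary path comes from get_blog_cls_jieba_user_dict_path():

import jieba

# Without the user dictionary, a domain term may be split into pieces
print(' '.join(jieba.cut('预训练语言模型')))
# user_dict.txt holds one entry per line: "word [freq] [pos_tag]",
# e.g. a line "预训练语言模型 3 n"
jieba.load_userdict('user_dict.txt')
print(' '.join(jieba.cut('预训练语言模型')))  # now kept as one token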

4. Remove stop words and special symbols

Emojis and other special symbols are removed before word segmentation. I also tried removing special symbols after segmentation, and the results show that removing them before segmentation works better. This step improved accuracy by about 2%; a sketch of such a filter follows.
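
The filter_content_for_blog_cls helper used in the code below is not shown in this post; a minimal sketch of such a pre-segmentation filter might look like this:

import re

def filter_content_for_blog_cls(text):
    """Illustrative stand-in: keep CJK characters, ASCII letters and
    digits, drop emojis and other special symbols before segmentation."""
    text = re.sub(r'[^\u4e00-\u9fa5A-Za-z0-9]+', ' ', text)
    return text.strip()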

2, TFIDF + AdaBoost

Full code:

# Imports were not shown in the original post; the common.* module
# paths below are assumed to mirror the fasttext section further down.
import os
import logging

import jieba
import joblib
import pandas as pd
from sklearn import metrics
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

from common.utils import get_files_path
from common.utils import filter_content_for_blog_cls
from common.path.dataset.blog import get_blog_cls_jieba_user_dict_path
from common.path.dataset.blog import get_blog_cls_train_data_optimize_dir
from common.path.model.blog import get_tfidf_path, get_adaboost_model_path

logger = logging.getLogger(__name__)


class TrainBlogClsTfidfAdaBoost:
    def __init__(self):
        jieba.load_userdict(get_blog_cls_jieba_user_dict_path())

        self.train_data_dir = get_blog_cls_train_data_optimize_dir()
        self.tfidf_path = get_tfidf_path()
        self.model_path = get_adaboost_model_path()

        # self.train_data_dir = get_blog_cls_train_data_dev_dir()
        # self.tfidf_path = get_test_tfidf_path()
        # self.model_path = get_adaboost_test_model_path()

    def load(self):
        if not os.path.exists(self.model_path):
            logger.warning("Start training, target model data:", self.model_path)
            self.train()

        logger.info("Loading model")
        self.model = joblib.load(self.model_path)
        self.tf_idf = joblib.load(self.tfidf_path)

    def load_data(self):
        '''Load file contents and labels'''
        files = get_files_path(self.train_data_dir, '.txt')
        contents = []
        labels = []

        for file in files:
            with open(file, 'r') as f:
                data = f.read()
            data = filter_content_for_blog_cls(data)
            data_cut = ' '.join(jieba.cut(data))
            contents.append(data_cut)
            # the parent directory name is the class label
            label = file.split('/')[-2]
            labels.append(label)
        X_train, X_test, y_train, y_test = train_test_split(contents,
                                                            labels,
                                                            test_size=0.2,
                                                            random_state=123456)
        return X_train, X_test, y_train, y_test

    def load_stopwords(self):
        path = './data/pro/datasets/stopwords/cn_stopwords.txt'
        with open(path, 'r') as f:
            stopwords = f.read().split('\n')
        return stopwords

    def train(self):
        logger.info('Start training...')
        stopwords = self.load_stopwords()
        X_train, X_test, y_train, y_test = self.load_data()
        tfidf = TfidfVectorizer(stop_words=stopwords, max_df=0.5)
        train_data = tfidf.fit_transform(X_train)
        test_data = tfidf.transform(X_test)

        joblib.dump(tfidf, self.tfidf_path, compress=1)

        model = AdaBoostClassifier()  # ~99% accuracy on the held-out split

        model.fit(train_data, y_train)

        predict_test = model.predict(test_data)

        joblib.dump(model, self.model_path, compress=1)

        print("The accuracy is:", metrics.accuracy_score(predict_test, y_test))

    def predict(self, test_data):
        test_data = filter_content_for_blog_cls(test_data)
        test_data = ' '.join(jieba.cut(test_data))

        test_vec = self.tf_idf.transform([test_data])
        res = self.model.predict(test_vec)
        return res

    def test_acc(self):
        data_path = './data/pro/datasets/blogs/blog_adver_cls/test_dev.csv'
        data = pd.read_csv(data_path)
        data = data.dropna(axis=0)
        test_text = data['content']
        text_list = []
        for text in test_text:
            text = filter_content_for_blog_cls(text)
            text = ' '.join(jieba.cut(text))
            text_list.append(text)
        label = data['label']
        test_data = self.tf_idf.transform(text_list)
        predict_test = self.model.predict(test_data)
        print("In the test set, the accuracy is:", metrics.accuracy_score(predict_test, label))

Result:

In the test set, the accuracy is: 0.9646315789473684

The test set contains about 5,000 samples, so the result is fairly convincing.

3, Fasttext

fasttext was used for book classification before (see fasttext text classification): the accuracy was 93% on three categories and 75.6% on 35 categories, which is decent overall. So I wanted to try fasttext here and see whether it would do better.

Full code:

import os
import fasttext
import jieba
import logging
import random
from tqdm import tqdm
import pandas as pd
from sklearn import metrics
from common.utils import get_files_path
from common.utils import filter_content_for_blog_cls
from common.path.dataset.blog import get_blog_cls_jieba_user_dict_path

from common.path.dataset.blog import get_blog_cls_train_data_dev_dir, get_fasttext_train_data_path
from common.path.model.blog import get_blog_cls_fasttext_model_path

logger = logging.getLogger(__name__)


class TrainBlogClsFasttext:
    def __init__(self):
        jieba.load_userdict(get_blog_cls_jieba_user_dict_path())
        self.train_data_dev_dir = get_blog_cls_train_data_dev_dir()
        self.train_data_path = get_fasttext_train_data_path()
        self.fasttext_model_path = get_blog_cls_fasttext_model_path()
        self.class_name_mapping = {
            '__label__0': 'technology',
            '__label__1': 'advertisement'
        }

    
    def load(self):
        if not os.path.exists(self.fasttext_model_path):
            logger.info('Start training model...')
            self.train_fasttext()
        logger.info("Loading model")
        self.model = fasttext.load_model(self.fasttext_model_path)
        

    def data_process(self):
        data_dir = self.train_data_dev_dir
        files = get_files_path(data_dir, '.txt')
        
        # train_data_path is a directory that will hold train.txt / test.txt
        os.makedirs(self.train_data_path, exist_ok=True)
        random.shuffle(files)

        fasttext_train_data_path = os.path.join(self.train_data_path, 'train.txt')
        fasttext_test_data_path = os.path.join(self.train_data_path, 'test.txt')
        if os.path.exists(fasttext_train_data_path) and os.path.exists(fasttext_test_data_path):
            return
        lines_train = []
        lines_test = []
        all_data = []
        for file in tqdm(files, desc='Building training data: '):
            with open(file, 'r') as f:
                data = f.read()
            data = filter_content_for_blog_cls(data)
            data = ' '.join(jieba.cut(data))

            if file.find('technology') != -1:
                label = '__label__{}'.format(0)
            elif file.find('advertisement') != -1:
                label = '__label__{}'.format(1)
            else:
                print("Bad data: {}".format(file))
                continue  # skip files that match neither class
            # fastText format: the text plus a '__label__x' token on one line
            line = data + '\t' + label + '\n'
            all_data.append(line)
        
        lines_train = all_data[:int(len(all_data)*0.8)]
        lines_test = all_data[int(len(all_data)*0.8):]
        with open(fasttext_train_data_path, 'a') as f:
            f.writelines(lines_train)
        with open(fasttext_test_data_path, 'a') as f:
            f.writelines(lines_test)


    def load_stopwords(self):
        path = './data/pro/datasets/stopwords/cn_stopwords.txt'
        with open(path, 'r') as f:
            stopwords = f.read().split('\n')
        return stopwords


    def train_fasttext(self):
        self.data_process()
        data_dir = self.train_data_path
        train_path = os.path.join(data_dir, 'train.txt')
        test_path = os.path.join(data_dir, 'test.txt')

        classifier = fasttext.train_supervised(input=train_path,
                                            label="__label__",
                                            dim=100,
                                            epoch=10,
                                            lr=0.1,
                                            wordNgrams=2,
                                            loss='softmax',
                                            thread=8,
                                            verbose=True)
        classifier.save_model(self.fasttext_model_path)
        # classifier.test returns (sample count, precision@1, recall@1)
        result = classifier.test(test_path)
        logger.info('Train Result: {}'.format(result))
        logger.info('F1 Score: {}'.format(result[1] * result[2] * 2 /
                                    (result[2] + result[1])))
    
    def predict(self, text):

        test_data = filter_content_for_blog_cls(text)
        test_data = ' '.join(jieba.cut(test_data))

        result = self.model.predict(test_data)
        class_name = result[0][0]
        res_label = self.class_name_mapping[class_name]
        return res_label

    def test_acc(self):

        data_path = './data/pro/datasets/blogs/blog_adver_cls/test_dev.csv'
        data = pd.read_csv(data_path)
        data = data.dropna(axis=0)
        test_text = data['content']
        text_list = []
        for text in test_text:
            text = filter_content_for_blog_cls(text)
            text = ' '.join(jieba.cut(text))
            text_list.append(text)
        labels = data['label']
        res_labels = []
        for text in text_list:
            label = self.model.predict(text)
            class_name = label[0][0]
            res_label = self.class_name_mapping[class_name]
            res_labels.append(res_label)
        print("In the test set, the accuracy is:", metrics.accuracy_score(res_labels, labels))

The code is not difficult; most of the work is data processing. Here too, special symbols are removed before word segmentation, and the results do improve. You can try it yourself.

Let's look at the results directly:

[INFO][2022-01-03 14:39:23][fasttext_classifier.py:33 at load]: Start training model...
Read 5M words
Number of words:  261664
Number of labels: 2
Progress: 100.0% words/sec/thread: 1415852 lr:  0.000000 avg.loss:  0.059132 ETA:   0h 0m 0s
[INFO][2022-01-03 14:39:33][fasttext_classifier.py:101 at train_fasttext]: Train Result:
[INFO][2022-01-03 14:39:33][fasttext_classifier.py:102 at train_fasttext]: F1 Score: 0.9638259736027375
[INFO][2022-01-03 14:39:33][fasttext_classifier.py:35 at load]: Loading model
Warning : `load_model` does not return WordVectorModel or SupervisedModel any more, but a `FastText` object which is very similar.
In the test set, the accuracy is: 0.9661052631578947

On the same test set, fasttext's accuracy is higher by about 0.15 percentage points (0.9661 vs. 0.9646), but its model weighs 912 MB, while the TFIDF + AdaBoost artifacts add up to only 4.9 MB.

Inference speed has not been measured yet, so the more memory-friendly TFIDF + AdaBoost model is what's used for now.

Summary

Looking at more data and understanding its characteristics goes a long way toward improving a model.

Practice has shown that:

1. Adding user dictionaries can improve accuracy
2. Removing special characters from text can improve accuracy

Related articles:

1. Technical / advertising article classifier (I)
2. fasttext text classification

Topics: Python Machine Learning NLP