Preface
This article builds on the previous post, Technical / advertising article classifier (I). Several optimizations were made that raise the accuracy from 84.5% to 94.4%.
1, Optimization methods
1. Add training data
The previous training set had only about 500 samples per class, which was far too little.
The data set used here contains more than 45,000 samples, roughly a 90-fold increase, which should be ample.
2. Change the classification model
Multinomial Naive Bayes was used before, with mediocre results. Because it assumes that sample attributes are independent, it performs poorly when the attributes are correlated. Switching to ensemble learning directly gave a better result (a minimal comparison sketch follows this list).
3. Add a user dictionary during word segmentation
Some key terms were not being segmented into the desired words. Compared with not loading the user dictionary, accuracy improves by about 1%.
4. Remove stop words and special symbols
Emoticons and some special symbols are removed before word segmentation. I also tried removing special symbols after segmentation; the results show that removing them before segmentation works better, improving accuracy by about 2%. A preprocessing sketch for points 3 and 4 also follows this list.
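For point 2, here is a minimal comparison sketch on TF-IDF features. It assumes train_texts/test_texts are already segmented, space-joined documents with labels train_labels/test_labels; these names are placeholders, not the project's actual variables.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import AdaBoostClassifier
from sklearn import metrics

tfidf = TfidfVectorizer(max_df=0.5)
train_vec = tfidf.fit_transform(train_texts)
test_vec = tfidf.transform(test_texts)

# Multinomial Naive Bayes treats features as conditionally independent.
nb = MultinomialNB().fit(train_vec, train_labels)
print('NB accuracy:      ', metrics.accuracy_score(test_labels, nb.predict(test_vec)))

# AdaBoost over decision stumps (the sklearn default base estimator)
# can pick up feature interactions that Naive Bayes ignores.
ada = AdaBoostClassifier().fit(train_vec, train_labels)
print('AdaBoost accuracy:', metrics.accuracy_score(test_labels, ada.predict(test_vec)))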
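For points 3 and 4, a minimal sketch of the preprocessing, assuming a hypothetical userdict.txt and an illustrative regex; in the project this filtering is wrapped in filter_content_for_blog_cls:

import re
import jieba

# One term per line; frequency and POS tag are optional ("词 词频 词性").
jieba.load_userdict('userdict.txt')

def clean_text(text):
    # Strip emoticons/special symbols BEFORE segmentation, keeping CJK
    # characters, letters, digits and basic punctuation.
    # The character class here is illustrative, not the project's actual one.
    return re.sub(r'[^\u4e00-\u9fa5a-zA-Z0-9，。！？、,.!?\s]', ' ', text)

raw_text = '深度学习模型部署实战→★点击领取优惠券!!!'
tokens = ' '.join(jieba.cut(clean_text(raw_text)))
print(tokens)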
2, TFIDF + AdaBoost
Full code:
import os
import logging
import jieba
import joblib
import pandas as pd
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import AdaBoostClassifier
from common.utils import get_files_path
from common.utils import filter_content_for_blog_cls
from common.path.dataset.blog import get_blog_cls_jieba_user_dict_path
# The remaining path helpers (get_blog_cls_train_data_optimize_dir, get_tfidf_path,
# get_adaboost_model_path, ...) are project-specific; their module paths are assumed
# to mirror the imports shown in the fasttext section below.

logger = logging.getLogger(__name__)


class TrainBlogClsTfidfAdaBoost:
    def __init__(self):
        jieba.load_userdict(get_blog_cls_jieba_user_dict_path())
        self.train_data_dir = get_blog_cls_train_data_optimize_dir()
        self.tfidf_path = get_tfidf_path()
        self.model_path = get_adaboost_model_path()
        # self.train_data_dir = get_blog_cls_train_data_dev_dir()
        # self.tfidf_path = get_test_tfidf_path()
        # self.model_path = get_adaboost_test_model_path()

    def load(self):
        if not os.path.exists(self.model_path):
            logger.warning("Start training, target model data: %s", self.model_path)
            self.train()
        logger.info("Loading model")
        self.model = joblib.load(self.model_path)
        self.tf_idf = joblib.load(self.tfidf_path)

    def load_data(self):
        '''Load file contents and labels'''
        files = get_files_path(self.train_data_dir, '.txt')
        contents = []
        labels = []
        for file in files:
            with open(file, 'r') as f:
                data = f.read()
            data = filter_content_for_blog_cls(data)
            data_cut = ' '.join(jieba.cut(data))
            contents.append(data_cut)
            # The parent directory name of each file is its class label.
            label = file.split('/')[-2]
            labels.append(label)
        X_train, X_test, y_train, y_test = train_test_split(
            contents, labels, test_size=0.2, random_state=123456)
        return X_train, X_test, y_train, y_test

    def load_stopwords(self):
        path = './data/pro/datasets/stopwords/cn_stopwords.txt'
        with open(path, 'r') as f:
            stopwords = f.read().split('\n')
        return stopwords

    def train(self):
        logger.info('Start training...')
        stopwords = self.load_stopwords()
        X_train, X_test, y_train, y_test = self.load_data()
        tfidf = TfidfVectorizer(stop_words=stopwords, max_df=0.5)
        tfidf.fit(X_train)
        train_data = tfidf.transform(X_train)
        test_data = tfidf.transform(X_test)
        joblib.dump(tfidf, self.tfidf_path, compress=1)
        model = AdaBoostClassifier()  # ~99% accuracy on the held-out split
        model.fit(train_data, y_train)
        predict_test = model.predict(test_data)
        joblib.dump(model, self.model_path, compress=1)
        print("The accuracy is:", metrics.accuracy_score(predict_test, y_test))

    def predict(self, test_data):
        test_data = filter_content_for_blog_cls(test_data)
        test_data = ' '.join(jieba.cut(test_data))
        test_vec = self.tf_idf.transform([test_data])
        res = self.model.predict(test_vec)
        return res

    def test_acc(self):
        data_path = './data/pro/datasets/blogs/blog_adver_cls/test_dev.csv'
        data = pd.read_csv(data_path)
        data = data.dropna(axis=0)
        test_text = data['content']
        text_list = []
        for text in test_text:
            text = filter_content_for_blog_cls(text)
            text = ' '.join(jieba.cut(text))
            text_list.append(text)
        label = data['label']
        test_data = self.tf_idf.transform(text_list)
        predict_test = self.model.predict(test_data)
        print("In the test set, the accuracy is:", metrics.accuracy_score(predict_test, label))
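Assuming the path helpers resolve and the class is importable, usage might look like this (a sketch; the sample string is illustrative):

clf = TrainBlogClsTfidfAdaBoost()
clf.load()                       # trains and saves a model first if none exists
print(clf.predict('文章内容...'))  # one-element array with the predicted class label
clf.test_acc()                   # accuracy on test_dev.csv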
Result:
In the test set, the accuracy is: 0.9646315789473684
The test set contains about 5,000 samples, so this result is fairly convincing.
3, Fasttext
fasttext was used for book classification before (see "fasttext text classification"): the accuracy was 93% with three categories and 75.6% with 35 categories, a good overall result. So I tried fasttext here to see whether it would do better.
Full code:
import os
import fasttext
import jieba
import logging
import random
from tqdm import tqdm
import pandas as pd
from sklearn import metrics
from common.utils import get_files_path
from common.utils import filter_content_for_blog_cls
from common.path.dataset.blog import get_blog_cls_jieba_user_dict_path
from common.path.dataset.blog import get_blog_cls_train_data_dev_dir, get_fasttext_train_data_path
from common.path.model.blog import get_blog_cls_fasttext_model_path

logger = logging.getLogger(__name__)


class TrainBlogClsFasttext:
    def __init__(self):
        jieba.load_userdict(get_blog_cls_jieba_user_dict_path())
        self.train_data_dev_dir = get_blog_cls_train_data_dev_dir()
        self.train_data_path = get_fasttext_train_data_path()
        self.fasttext_model_path = get_blog_cls_fasttext_model_path()
        self.class_name_mapping = {
            '__label__0': 'technology',
            '__label__1': 'advertisement'
        }

    def load(self):
        if not os.path.exists(self.fasttext_model_path):
            logger.info('Start training model...')
            self.train_fasttext()
        logger.info("Loading model")
        self.model = fasttext.load_model(self.fasttext_model_path)

    def data_process(self):
        data_dir = self.train_data_dev_dir
        files = get_files_path(data_dir, '.txt')
        if not os.path.exists(self.train_data_path):
            os.mkdir(self.train_data_path)
        random.shuffle(files)
        fasttext_train_data_path = os.path.join(self.train_data_path, 'train.txt')
        fasttext_test_data_path = os.path.join(self.train_data_path, 'test.txt')
        if os.path.exists(fasttext_train_data_path) and os.path.exists(fasttext_test_data_path):
            return
        all_data = []
        for file in tqdm(files, desc='Building training data: '):
            with open(file, 'r') as f:
                data = f.read()
            data = filter_content_for_blog_cls(data)
            data = ' '.join(jieba.cut(data))
            if file.find('technology') != -1:
                label = '__label__{}'.format(0)
            elif file.find('advertisement') != -1:
                label = '__label__{}'.format(1)
            else:
                print("Bad data:{}".format(file))
                continue  # skip files whose label cannot be derived from the path
            line = data + '\t' + label + '\n'
            all_data.append(line)
        # 80/20 train/test split over the shuffled data
        lines_train = all_data[:int(len(all_data) * 0.8)]
        lines_test = all_data[int(len(all_data) * 0.8):]
        with open(fasttext_train_data_path, 'a') as f:
            f.writelines(lines_train)
        with open(fasttext_test_data_path, 'a') as f:
            f.writelines(lines_test)

    def load_stopwords(self):
        path = './data/pro/datasets/stopwords/cn_stopwords.txt'
        with open(path, 'r') as f:
            stopwords = f.read().split('\n')
        return stopwords

    def train_fasttext(self):
        self.data_process()
        data_dir = self.train_data_path
        train_path = os.path.join(data_dir, 'train.txt')
        test_path = os.path.join(data_dir, 'test.txt')
        classifier = fasttext.train_supervised(input=train_path,
                                               label="__label__",
                                               dim=100,
                                               epoch=10,
                                               lr=0.1,
                                               wordNgrams=2,
                                               loss='softmax',
                                               thread=8,
                                               verbose=True)
        classifier.save_model(self.fasttext_model_path)
        result = classifier.test(test_path)  # (n_samples, precision, recall)
        logger.info('Train Result: {}'.format(result))
        logger.info('F1 Score: {}'.format(result[1] * result[2] * 2 / (result[2] + result[1])))

    def predict(self, text):
        test_data = filter_content_for_blog_cls(text)
        test_data = ' '.join(jieba.cut(test_data))
        result = self.model.predict(test_data)
        class_name = result[0][0]
        res_label = self.class_name_mapping[class_name]
        return res_label

    def test_acc(self):
        data_path = './data/pro/datasets/blogs/blog_adver_cls/test_dev.csv'
        data = pd.read_csv(data_path)
        data = data.dropna(axis=0)
        test_text = data['content']
        text_list = []
        for text in test_text:
            text = filter_content_for_blog_cls(text)
            text = ' '.join(jieba.cut(text))
            text_list.append(text)
        labels = data['label']
        res_labels = []
        for text in text_list:
            label = self.model.predict(text)
            class_name = label[0][0]
            res_label = self.class_name_mapping[class_name]
            res_labels.append(res_label)
        print("In the test set, the accuracy is:", metrics.accuracy_score(res_labels, labels))
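Note that data_process writes one segmented document per line with the label appended after a tab; fastText treats any token starting with the label prefix as a label, so this format trains fine. Usage mirrors the AdaBoost class (a sketch; the sample string is illustrative):

clf = TrainBlogClsFasttext()
clf.load()                       # builds train.txt/test.txt and trains on first run
print(clf.predict('文章内容...'))  # 'technology' or 'advertisement'
clf.test_acc()                   # accuracy on test_dev.csv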
The code is not difficult; the main work is data processing. Here, too, special symbols are removed before word segmentation, and the effect does improve. You can try it yourself.
Let's look at the results directly:
[INFO][2022-01-03 14:39:23][fasttext_classifier.py:33 at load]: Start training model...
Read 5M words
Number of words: 261664
Number of labels: 2
Progress: 100.0% words/sec/thread: 1415852 lr: 0.000000 avg.loss: 0.059132 ETA: 0h 0m 0s
[INFO][2022-01-03 14:39:33][fasttext_classifier.py:101 at train_fasttext]: Train Result:
[INFO][2022-01-03 14:39:33][fasttext_classifier.py:102 at train_fasttext]: F1 Score: 0.9638259736027375
[INFO][2022-01-03 14:39:33][fasttext_classifier.py:35 at load]: Loading model
Warning : `load_model` does not return WordVectorModel or SupervisedModel any more, but a `FastText` object which is very similar.
In the test set, the accuracy is: 0.9661052631578947
On the same test set, fasttext's accuracy is 0.2 percentage points higher, but its model weighs 912 MB, while the TFIDF + AdaBoost model and vectorizer together total only 4.9 MB.
Actual inference speed has not been measured yet, so the more memory-efficient TFIDF + AdaBoost is what's currently used. A rough way to measure it is sketched below.
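If one wanted to verify this, a comparison could look like the following sketch; ada_clf, ft_clf and sample_texts are placeholders for the two loaded classifiers and a list of held-out texts, and this is wall-clock timing, not a rigorous benchmark.

import time

def mean_latency(predict_fn, texts, repeat=3):
    # Average seconds per prediction over several passes.
    start = time.time()
    for _ in range(repeat):
        for t in texts:
            predict_fn(t)
    return (time.time() - start) / (repeat * len(texts))

print('TFIDF + AdaBoost:', mean_latency(ada_clf.predict, sample_texts))
print('fasttext:        ', mean_latency(ft_clf.predict, sample_texts))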
Summary
Looking at more data and understanding its characteristics goes a long way toward improving a model.
The experiments confirm that:
1. Adding a user dictionary can improve accuracy
2. Removing special characters from the text can improve accuracy
Related articles:
1, Technical / advertising article classifier (I)
2, fasttext text classification