[Paddle Competition] iFLYTEK Academic Paper Classification Challenge 0.8+ Baseline

Posted by kingleo on Mon, 08 Nov 2021 03:23:48 +0100


1, Project introduction

1. Project introduction:

This project is a PaddlePaddle baseline for the iFLYTEK Academic Paper Classification Challenge, with a submission score of 0.8+. There is still considerable room for optimization, and further experiments can improve the score. Those interested can also migrate it to similar text classification projects.

2. Competition address (see the competition page for details):

Academic paper classification challenge

3. Introduction to the competition task:

This is a fairly conventional English long-text classification competition. The training set contains 50,000 papers, each with four fields: paper id, title, abstract, and category. The test set contains 10,000 papers, each with the paper id, title, and abstract, but without the category field. Contestants must use the paper information (paper id, title, and abstract) to predict each paper's category. Each paper belongs to exactly one category; there is no multi-label case. Accuracy is the evaluation metric. Pay special attention to the competition rules: no data other than the data provided may be used.
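For reference, the accuracy metric is simply the fraction of correctly predicted categories. A minimal sketch with scikit-learn; the labels and predictions below are made-up placeholders, not competition data:

from sklearn.metrics import accuracy_score

# Hypothetical ground-truth labels and model predictions, for illustration only.
y_true = ['cs.CL', 'cs.CV', 'cs.NE', 'cs.CV']
y_pred = ['cs.CL', 'cs.CV', 'cs.CL', 'cs.CV']

print(accuracy_score(y_true, y_pred))  # 3 of 4 correct -> 0.75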

4. Baseline idea:

This Baseline is built on PaddleHub: it fine-tunes a pre-trained model on the competition dataset to obtain a paper text classification model, then predicts the test set and exports the result file for submission. Note that the code in this project must run in a GPU environment; if GPU memory is insufficient, reduce the batch size.

   The competition datasets have been uploaded to AI Studio. Search for the dataset 'iFLYTEK competition - Academic Paper Classification Challenge dataset' and add it to your project.

2, Data reading and processing

2.1 Reading and inspecting the data

# Decompress the competition dataset
%cd /home/aistudio/data/data100192/
!unzip data.zip
/home/aistudio/data/data100192
Archive:  data.zip
  inflating: sample_submit.csv       
  inflating: test.csv                
  inflating: train.csv               
# Read dataset
import pandas as pd
train = pd.read_csv('train.csv', sep='\t')  # Labeled training data file
test = pd.read_csv('test.csv', sep='\t')    # Test data file to predict
sub = pd.read_csv('sample_submit.csv')      # Sample submission file
# View the top 5 items of training data
train.head()
       paperid                                              title                                           abstract  categories
0  train_00000  Hard but Robust, Easy but Sensitive: How Encod...    Neural machine translation (NMT) typically a...       cs.CL
1  train_00001  An Easy-to-use Real-world Multi-objective Opti...    Although synthetic test problems are widely ...       cs.NE
2  train_00002  Exploration of reproducibility issues in scien...    This is the first part of a small-scale expl...       cs.DL
3  train_00003                Scheduled Sampling for Transformers    Scheduled sampling is a technique for avoidi...       cs.CL
4  train_00004  Hybrid Forests for Left Ventricle Segmentation...    Machine learning models produce state-of-the...       cs.CV
# View training data file information
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   paperid     50000 non-null  object
 1   title       50000 non-null  object
 2   abstract    50000 non-null  object
 3   categories  50000 non-null  object
dtypes: object(4)
memory usage: 1.5+ MB
# View the distribution of total categories in training data
train['categories'].value_counts()
cs.CV    11038
cs.CL     4260
cs.NI     3218
cs.CR     2798
cs.AI     2706
cs.DS     2509
cs.DC     1994
cs.SE     1940
cs.RO     1884
cs.LO     1741
cs.LG     1352
cs.SY     1292
cs.CY     1228
cs.DB      998
cs.GT      984
cs.HC      943
cs.PL      841
cs.IR      770
cs.CC      719
cs.NE      704
cs.CG      683
cs.OH      677
cs.SI      603
cs.DL      537
cs.DM      523
cs.FL      469
cs.AR      363
cs.CE      362
cs.GR      314
cs.MM      261
cs.ET      230
cs.MA      210
cs.NA      176
cs.SC      172
cs.SD      140
cs.PF      139
cs.MS      105
cs.OS       99
cs.GL       18
Name: categories, dtype: int64
# View the top 5 test data to be predicted
test.head()
      paperid                                              title                                           abstract
0  test_00000  Analyzing 2.3 Million Maven Dependencies to Re...    This paper addresses the following question:...
1  test_00001  Finding Higher Order Mutants Using Variational...    Mutation testing is an effective but time co...
2  test_00002  Automatic Detection of Search Tactic in Indivi...    Information seeking process is an important ...
3  test_00003  Polygon Simplification by Minimizing Convex Co...    Let $P$ be a polygon with $r>0$ reflex verti...
4  test_00004  Differentially passive circuits that switch an...    The concept of passivity is central to analy...
# View test data file information
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   paperid   10000 non-null  object
 1   title     10000 non-null  object
 2   abstract  10000 non-null  object
dtypes: object(3)
memory usage: 234.5+ KB

2.2 Data processing and splitting into training and validation sets

# Process the dataset: concatenate the paper title and abstract, and arrange it in text_a / label format
train['text_a'] = train['title'] + ' ' + train['abstract']
test['text_a'] = test['title'] + ' ' + test['abstract']
train['label'] = train['categories']
train = train[['text_a', 'label']]
# Check the first 5 rows of the processed data to confirm they follow the text_a / label format
train.head()
                                              text_a  label
0  Hard but Robust, Easy but Sensitive: How Encod...  cs.CL
1  An Easy-to-use Real-world Multi-objective Opti...  cs.NE
2  Exploration of reproducibility issues in scien...  cs.DL
3    Scheduled Sampling for Transformers Schedule...  cs.CL
4  Hybrid Forests for Left Ventricle Segmentation...  cs.CV
# Split into training and validation sets:
# The 50,000 labeled training samples are split 9:1 by index into training and validation sets
train_data = train[['text_a', 'label']][:45000]
valid_data = train[['text_a', 'label']][45000:]

# Randomly shuffle the data
from sklearn.utils import shuffle
train_data = shuffle(train_data)
valid_data = shuffle(valid_data)

# Save training and validation set files
train_data.to_csv('train_data.csv', sep='\t', index=False)
valid_data.to_csv('valid_data.csv', sep='\t', index=False)

3, Constructing the baseline model with PaddleHub

PaddleHub makes it easy to obtain pre-trained models from the PaddlePaddle ecosystem and handles model management and one-click prediction. Combined with its Fine-tune API, transfer learning can be completed quickly on top of large-scale pre-trained models, so that they better serve users' specific application scenarios.

3.1 Environment preparation

# Install the latest version of paddlehub
!pip install -U paddlehub -i https://pypi.tuna.tsinghua.edu.cn/simple
# Import paddlehub and paddle packages
import paddlehub as hub
import paddle

3.2 Selecting a pre-trained model

# Set the 39 paper categories to classify
label_list=list(train.label.unique())
print(label_list)
label_map = { 
    idx: label_text for idx, label_text in enumerate(label_list)
}
print(label_map)
['cs.CL', 'cs.NE', 'cs.DL', 'cs.CV', 'cs.LG', 'cs.DS', 'cs.IR', 'cs.RO', 'cs.DM', 'cs.CR', 'cs.AR', 'cs.NI', 'cs.AI', 'cs.SE', 'cs.CG', 'cs.LO', 'cs.SY', 'cs.GR', 'cs.PL', 'cs.SI', 'cs.OH', 'cs.HC', 'cs.MA', 'cs.GT', 'cs.ET', 'cs.FL', 'cs.CC', 'cs.DB', 'cs.DC', 'cs.CY', 'cs.CE', 'cs.MM', 'cs.NA', 'cs.PF', 'cs.OS', 'cs.SD', 'cs.SC', 'cs.MS', 'cs.GL']
{0: 'cs.CL', 1: 'cs.NE', 2: 'cs.DL', 3: 'cs.CV', 4: 'cs.LG', 5: 'cs.DS', 6: 'cs.IR', 7: 'cs.RO', 8: 'cs.DM', 9: 'cs.CR', 10: 'cs.AR', 11: 'cs.NI', 12: 'cs.AI', 13: 'cs.SE', 14: 'cs.CG', 15: 'cs.LO', 16: 'cs.SY', 17: 'cs.GR', 18: 'cs.PL', 19: 'cs.SI', 20: 'cs.OH', 21: 'cs.HC', 22: 'cs.MA', 23: 'cs.GT', 24: 'cs.ET', 25: 'cs.FL', 26: 'cs.CC', 27: 'cs.DB', 28: 'cs.DC', 29: 'cs.CY', 30: 'cs.CE', 31: 'cs.MM', 32: 'cs.NA', 33: 'cs.PF', 34: 'cs.OS', 35: 'cs.SD', 36: 'cs.SC', 37: 'cs.MS', 38: 'cs.GL'}
# Select the ernie_v2_eng_large pre-trained model and configure the fine-tuning task as 39-class classification
model = hub.Module(name="ernie_v2_eng_large", task='seq-cls', num_classes=39, label_map=label_map) # For multi-class tasks, num_classes must explicitly specify the number of classes; it is set to 39 for this dataset
Download https://bj.bcebos.com/paddlehub/paddlehub_dev/ernie_v2_eng_large_2.0.2.tar.gz
[##################################################] 100.00%
Decompress /home/aistudio/.paddlehub/tmp/tmp8enjb395/ernie_v2_eng_large_2.0.2.tar.gz
[##################################################] 100.00%


[2021-08-01 17:43:16,200] [    INFO] - Successfully installed ernie_v2_eng_large-2.0.2
[2021-08-01 17:43:16,203] [    INFO] - Downloading https://paddlenlp.bj.bcebos.com/models/transformers/ernie_v2_large/ernie_v2_eng_large.pdparams and saved to /home/aistudio/.paddlenlp/models/ernie-2.0-large-en
[2021-08-01 17:43:16,205] [    INFO] - Downloading ernie_v2_eng_large.pdparams from https://paddlenlp.bj.bcebos.com/models/transformers/ernie_v2_large/ernie_v2_eng_large.pdparams
100%|██████████| 1309198/1309198 [00:19<00:00, 68253.50it/s]
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py:1297: UserWarning: Skip loading for classifier.weight. classifier.weight is not found in the provided dict.
  warnings.warn(("Skip loading for {}. ".format(key) + str(err)))
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py:1297: UserWarning: Skip loading for classifier.bias. classifier.bias is not found in the provided dict.
  warnings.warn(("Skip loading for {}. ".format(key) + str(err)))

The parameters of hub.Module are as follows:

  • name: the model name. Options include ernie, ernie_tiny, bert-base-cased, bert-base-chinese, roberta-wwm-ext, roberta-wwm-ext-large, etc.
  • task: the fine-tuning task. Here it is 'seq-cls', indicating a text classification task.
  • num_classes: the number of classes in the current text classification task, determined by the specific dataset; the default is 2.

PaddleHub also provides other models, such as BERT, to choose from. The modules currently supporting text classification tasks and their corresponding loading calls are as follows:

Model name                        PaddleHub Module
ERNIE, Chinese                    hub.Module(name='ernie')
ERNIE tiny, Chinese               hub.Module(name='ernie_tiny')
ERNIE 2.0 Base, English           hub.Module(name='ernie_v2_eng_base')
ERNIE 2.0 Large, English          hub.Module(name='ernie_v2_eng_large')
BERT-Base, English Cased          hub.Module(name='bert-base-cased')
BERT-Base, English Uncased        hub.Module(name='bert-base-uncased')
BERT-Large, English Cased         hub.Module(name='bert-large-cased')
BERT-Large, English Uncased       hub.Module(name='bert-large-uncased')
BERT-Base, Multilingual Cased     hub.Module(name='bert-base-multilingual-cased')
BERT-Base, Multilingual Uncased   hub.Module(name='bert-base-multilingual-uncased')
BERT-Base, Chinese                hub.Module(name='bert-base-chinese')
BERT-wwm, Chinese                 hub.Module(name='chinese-bert-wwm')
BERT-wwm-ext, Chinese             hub.Module(name='chinese-bert-wwm-ext')
RoBERTa-wwm-ext, Chinese          hub.Module(name='roberta-wwm-ext')
RoBERTa-wwm-ext-large, Chinese    hub.Module(name='roberta-wwm-ext-large')
RBT3, Chinese                     hub.Module(name='rbt3')
RBTL3, Chinese                    hub.Module(name='rbtl3')
ELECTRA-Small, English            hub.Module(name='electra-small')
ELECTRA-Base, English             hub.Module(name='electra-base')
ELECTRA-Large, English            hub.Module(name='electra-large')
ELECTRA-Base, Chinese             hub.Module(name='chinese-electra-base')
ELECTRA-Small, Chinese            hub.Module(name='chinese-electra-small')

Through the single line of code above, the model is initialized for the text classification task: a fully-connected classification head is appended on top of the ERNIE pre-trained model.
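Conceptually, the resulting network looks roughly like the following minimal sketch. This is not PaddleHub's actual implementation, and the hidden size of 1024 for ERNIE 2.0 Large is an assumption:

import paddle.nn as nn

# Minimal conceptual sketch of "pre-trained encoder + fully connected head" (not PaddleHub's real class).
class SeqClsSketch(nn.Layer):
    def __init__(self, encoder, hidden_size=1024, num_classes=39, dropout=0.1):
        super().__init__()
        self.encoder = encoder                      # e.g. an ERNIE 2.0 Large encoder
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, input_ids, token_type_ids=None):
        # ERNIE-style encoders return (sequence_output, pooled_output)
        _, pooled_output = self.encoder(input_ids, token_type_ids=token_type_ids)
        return self.classifier(self.dropout(pooled_output))   # logits over the 39 classes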

3.3 Loading and processing the data

# Compute text length statistics to help choose max_seq_len
print('Max length of title + abstract in the training set: {}'.format(train['text_a'].map(lambda x: len(x)).max()))
print('Min length of title + abstract in the training set: {}'.format(train['text_a'].map(lambda x: len(x)).min()))
print('Mean length of title + abstract in the training set: {}'.format(train['text_a'].map(lambda x: len(x)).mean()))
print('Max length of title + abstract in the test set: {}'.format(test['text_a'].map(lambda x: len(x)).max()))
print('Min length of title + abstract in the test set: {}'.format(test['text_a'].map(lambda x: len(x)).min()))
print('Mean length of title + abstract in the test set: {}'.format(test['text_a'].map(lambda x: len(x)).mean()))
Max length of title + abstract in the training set: 3713
Min length of title + abstract in the training set: 69
Mean length of title + abstract in the training set: 1131.28478
Max length of title + abstract in the test set: 3501
Min length of title + abstract in the test set: 74
Mean length of title + abstract in the test set: 1127.0977
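Note that these lengths are character counts, while max_seq_len is measured in subword tokens. A quick hedged check of token counts, assuming the tokenizer returned by model.get_tokenizer() follows the usual PaddleNLP interface with a tokenize() method:

# Rough token-length check (assumes a PaddleNLP-style tokenizer with .tokenize()).
tokenizer = model.get_tokenizer()
sample = train['text_a'].iloc[0]
print(len(sample), 'characters ->', len(tokenizer.tokenize(sample)), 'subword tokens')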
import os, io, csv
from paddlehub.datasets.base_nlp_dataset import InputExample, TextClassificationDataset

# Data set storage location
DATA_DIR="/home/aistudio/data/data100192/"
# Process the training data into a format acceptable to the model
class Papers(TextClassificationDataset):
    def __init__(self, tokenizer, mode='train', max_seq_len=128):
        if mode == 'train':
            data_file = 'train_data.csv'
        elif mode == 'dev':
            data_file = 'valid_data.csv'
        super(Papers, self).__init__(
            base_path=DATA_DIR,
            data_file=data_file,
            tokenizer=tokenizer,
            max_seq_len=max_seq_len,
            mode=mode,
            is_file_with_header=True,
            label_list=label_list
            )

    # Parsing samples in text files
    def _read_file(self, input_file, is_file_with_header: bool = False):
        if not os.path.exists(input_file):
            raise RuntimeError("The file {} is not found.".format(input_file))
        else:
            with io.open(input_file, "r", encoding="UTF-8") as f:
                reader = csv.reader(f, delimiter="\t")  # '\t'-delimited data
                examples = []
                seq_id = 0
                header = next(reader) if is_file_with_header else None
                for line in reader:
                    example = InputExample(guid=seq_id, text_a=line[0], label=line[1])
                    seq_id += 1
                    examples.append(example)
                return examples

# The maximum sequence length max_seq_len is an adjustable parameter. The recommended value is 128; it can be tuned to the length of the task's text but must not exceed 512. The texts here are long, so it is set to 512.
train_dataset = Papers(model.get_tokenizer(), mode='train', max_seq_len=512)
dev_dataset = Papers(model.get_tokenizer(), mode='dev', max_seq_len=512)

# After processing, view the first 2 examples of the training and validation datasets
for e in train_dataset.examples[:2]:
    print(e)
for e in dev_dataset.examples[:2]:
    print(e)
[2021-08-01 17:44:12,576] [    INFO] - Downloading vocab.txt from https://paddlenlp.bj.bcebos.com/models/transformers/ernie_v2_large/vocab.txt
100%|██████████| 227/227 [00:00<00:00, 2484.96it/s]
[2021-08-01 17:46:28,717] [    INFO] - Found /home/aistudio/.paddlenlp/models/ernie-2.0-large-en/vocab.txt


text=Enforcing Label and Intensity Consistency for IR Target Detection   This study formulates the IR target detection as a binary classification
problem of each pixel. Each pixel is associated with a label which indicates
whether it is a target or background pixel. The optimal label set for all the
pixels of an image maximizes aposteriori distribution of label configuration
given the pixel intensities. The posterior probability is factored into (or
proportional to) a conditional likelihood of the intensity values and a prior
probability of label configuration. Each of these two probabilities are
computed assuming a Markov Random Field (MRF) on both pixel intensities and
their labels. In particular, this study enforces neighborhood dependency on
both intensity values, by a Simultaneous Auto Regressive (SAR) model, and on
labels, by an Auto-Logistic model. The parameters of these MRF models are
learned from labeled examples. During testing, an MRF inference technique,
namely Iterated Conditional Mode (ICM), produces the optimal label for each
pixel. The detection performance is further improved by incorporating temporal
information through background subtraction. High performances on benchmark
datasets demonstrate effectiveness of this method for IR target detection.
	label=cs.CV
text=Saliency Preservation in Low-Resolution Grayscale Images   Visual salience detection originated over 500 million years ago and is one of
nature's most efficient mechanisms. In contrast, many state-of-the-art
computational saliency models are complex and inefficient. Most saliency models
process high-resolution color (HC) images; however, insights into the
evolutionary origins of visual salience detection suggest that achromatic
low-resolution vision is essential to its speed and efficiency. Previous
studies showed that low-resolution color and high-resolution grayscale images
preserve saliency information. However, to our knowledge, no one has
investigated whether saliency is preserved in low-resolution grayscale (LG)
images. In this study, we explain the biological and computational motivation
for LG, and show, through a range of human eye-tracking and computational
modeling experiments, that saliency information is preserved in LG images.
Moreover, we show that using LG images leads to significant speedups in model
training and detection times and conclude by proposing LG images for fast and
efficient salience detection.
	label=cs.CV
text=RT-Gang: Real-Time Gang Scheduling Framework for Safety-Critical Systems   In this paper, we present RT-Gang: a novel real-time gang scheduling
framework that enforces a one-gang-at-a-time policy. We find that, in a
multicore platform, co-scheduling multiple parallel real-time tasks would
require highly pessimistic worst-case execution time (WCET) and schedulability
analysis - even when there are enough cores - due to contention in shared
hardware resources such as cache and DRAM controller. In RT-Gang, all threads
of a parallel real-time task form a real-time gang and the scheduler globally
enforces the one-gang-at-a-time scheduling policy to guarantee tight and
accurate task WCET. To minimize under-utilization, we integrate a
state-of-the-art memory bandwidth throttling framework to allow safe execution
of best-effort tasks. Specifically, any idle cores, if exist, are used to
schedule best-effort tasks but their maximum memory bandwidth usages are
strictly throttled to tightly bound interference to real-time gang tasks. We
implement RT-Gang in the Linux kernel and evaluate it on two representative
embedded multicore platforms using both synthetic and real-world DNN workloads.
The results show that RT-Gang dramatically improves system predictability and
the overhead is negligible.
	label=cs.DC
text=AI Enabling Technologies: A Survey   Artificial Intelligence (AI) has the opportunity to revolutionize the way the
United States Department of Defense (DoD) and Intelligence Community (IC)
address the challenges of evolving threats, data deluge, and rapid courses of
action. Developing an end-to-end artificial intelligence system involves
parallel development of different pieces that must work together in order to
provide capabilities that can be used by decision makers, warfighters and
analysts. These pieces include data collection, data conditioning, algorithms,
computing, robust artificial intelligence, and human-machine teaming. While
much of the popular press today surrounds advances in algorithms and computing,
most modern AI systems leverage advances across numerous different fields.
Further, while certain components may not be as visible to end-users as others,
our experience has shown that each of these interrelated components play a
major role in the success or failure of an AI system. This article is meant to
highlight many of these technologies that are involved in an end-to-end AI
system. The goal of this article is to provide readers with an overview of
terminology, technical details and recent highlights from academia, industry
and government. Where possible, we indicate relevant resources that can be used
for further reading and understanding.
	label=cs.AI

3.4 Optimization strategy and runtime configuration

# Optimizer selection
optimizer = paddle.optimizer.AdamW(learning_rate=4e-6, parameters=model.parameters())
# Runtime configuration
trainer = hub.Trainer(model, optimizer, checkpoint_dir='./ckpt', use_gpu=True, use_vdl=True)      # Executor of the fine-tuning task

3.5 Model training and validation

trainer.train(train_dataset, epochs=4, batch_size=12, eval_dataset=dev_dataset, save_interval=1)   # Configure training parameters, start training, and specify the validation set

trainer.train controls the training process through the following configurable parameters (a hedged example call with these parameters spelled out follows this list):

  • train_dataset: the dataset used for training;
  • epochs: the number of training epochs;
  • batch_size: the training batch size; if a GPU is used, adjust batch_size to the available memory;
  • num_workers: the number of worker processes, 0 by default;
  • eval_dataset: the validation set;
  • log_interval: the logging interval, measured in training batches (steps);
  • save_interval: the interval for saving the model, measured in training epochs.
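For reference, a variant of the training call above with the optional parameters written out; the extra values here (num_workers=0, log_interval=10) are illustrative defaults, not tuned settings:

trainer.train(
    train_dataset,
    epochs=4,                  # number of training epochs
    batch_size=12,             # reduce if GPU memory is insufficient
    num_workers=0,             # data loading workers
    eval_dataset=dev_dataset,  # validation set evaluated during training
    log_interval=10,           # log every 10 training steps
    save_interval=1            # save a checkpoint every epoch
)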

3.6 Model prediction and saving the result file

# Predict the test set
import numpy as np
# Process the input data into list format
new = pd.DataFrame(columns=['text'])
new['text'] = test["text_a"]
# First convert the pandas data to a numpy array
data_array = np.array(new)
# Then convert it to a list
data_list = data_array.tolist()

# Define categories to classify
label_list=list(train.label.unique())
label_map = { 
    idx: label_text for idx, label_text in enumerate(label_list)
}

# Load the trained model
model = hub.Module(
    name="ernie_v2_eng_large", 
    version='2.0.2', 
    task='seq-cls', 
    load_checkpoint='./ckpt/best_model/model.pdparams',
    num_classes=39, 
    label_map=label_map)

# Prediction of test set data
predictions = model.predict(data_list, max_seq_len=512, batch_size=2, use_gpu=True)
# Generate the result file to submit
sub = pd.read_csv('./sample_submit.csv')
sub['categories'] = predictions
sub.to_csv('submission.csv',index=False)
# Move the result file to the work directory for easy saving
!cp -r /home/aistudio/data/data100192/submission.csv /home/aistudio/work/

After prediction completes, open data/data100192/ in the left-hand file panel, download the generated result file submission.csv, and submit it for a score of 0.8+. There is still plenty of room for improvement; those who are interested can experiment further!

4, Directions for further improvement

  1. Data augmentation (synonym replacement, random insertion, random swapping, random deletion, back-translation, etc.)

  2. Hyperparameter tuning and trying improved pre-trained models (see: A sharp tool for text classification: BERT fine-tuning)

  3. 5-fold cross-validation and voting fusion of multiple models' results (see: The big weapon of machine learning competitions: model fusion); a hedged sketch follows this list

  4. Refer to Top solutions from similar competitions, e.g. the Internet news sentiment analysis competition write-up
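As a starting point for item 3, here is a minimal sketch of building stratified 5-fold splits with scikit-learn and combining per-fold test predictions by majority vote. It reuses the train DataFrame from section 2.2 (the full labeled set in text_a / label format); the per-fold file names and the fold_predictions variable are hypothetical placeholders, not part of the original baseline:

from sklearn.model_selection import StratifiedKFold
import pandas as pd

# Build stratified 5-fold train/validation splits (hypothetical per-fold file names).
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=2021)
for fold, (trn_idx, val_idx) in enumerate(skf.split(train['text_a'], train['label'])):
    train.iloc[trn_idx].to_csv('train_fold{}.csv'.format(fold), sep='\t', index=False)
    train.iloc[val_idx].to_csv('valid_fold{}.csv'.format(fold), sep='\t', index=False)

# After training one model per fold and predicting the test set with each,
# combine the predicted labels by majority vote. fold_predictions is a
# placeholder: a list of five prediction lists, one per fold.
def majority_vote(fold_predictions):
    votes = pd.DataFrame(fold_predictions).T    # rows: test samples, columns: folds
    return votes.mode(axis=1)[0].tolist()       # most frequent label per sample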

About the use of PaddleHub:

GitHub repository of PaddleHub (open an issue if you have questions):

https://github.com/PaddlePaddle/PaddleHub

API reference for Paddle-related functions:

https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/index_cn.html

Topics: NLP paddlepaddle