iFLYTEK Chinese idiom fill-in-the-blank challenge: ERNIE MaskLM with 80% accuracy

Posted by semlabs on Sun, 02 Jan 2022 14:17:53 +0100

Competition background

Chinese culture is broad, deep and long-lived, and idioms are among its finest products. Most idioms consist of four characters and usually trace back to an allusion or a classical source. Some can be understood literally, such as "making a mountain out of a molehill" or "catching up from behind". Others only make sense if you know the source or allusion behind them, such as "every day" or "the shadow of a bow and a snake in a cup".

Idioms are an important part of the Chinese curriculum in primary and junior middle school. How do you choose the right idiom for a sentence? In this competition, contestants build a model that understands Chinese idioms.

https://challenge.xfyun.cn/topic/info?type=chinese-idioms

Competition task

Given a Chinese sentence with a blank, the contestant must select the most appropriate idiom from a list of candidates. In other words, the model uses the sentence context to fill the correct idiom into the marked position.

Competition dataset

The training set contains 50,000 examples and the test set contains 10,000. The label field in the test set is empty and must be predicted by contestants.

Sample record:

  • Original text: At present, it has long been "[MASK]" for government departments to use the Internet to disclose information, collect public opinion, improve work efficiency and narrow the distance with the masses.
  • Candidate idioms: ['vulnerable', 'common practice', 'like wind passing through ears', 'uncanny workmanship and lightning axe']
  • Correct answer: common practice
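The candidate list arrives as the string form of a Python list, so it has to be parsed before use. Below is a minimal sketch of handling one record. The `record` dict and the `fill_mask` helper are hypothetical, and the field names `text`, `candidate` and `label` are assumed to match the CSV columns.

```python
import ast

# Hypothetical record mirroring the sample above (text shortened for brevity).
record = {
    "text": 'it has long been "[MASK]" for government departments to use the Internet',
    "candidate": "['vulnerable', 'common practice', 'like wind passing through ears', "
                 "'uncanny workmanship and lightning axe']",
    "label": "common practice",
}

def fill_mask(text, idiom, mask="[MASK]"):
    """Replace the first mask placeholder in text with the chosen idiom."""
    return text.replace(mask, idiom, 1)

# literal_eval only accepts Python literals, so it is safer than eval().
candidates = ast.literal_eval(record["candidate"])
filled = fill_mask(record["text"], record["label"])
print(candidates)
print(filled)
```

Parsing with `ast.literal_eval` rather than `eval` avoids executing arbitrary code if a row happens to be malformed.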

Evaluation rules

  1. Data description
    The competition data consists of a training set and a test set. The training set has 50,000 examples and the test set has 10,000, both in CSV format with columns separated by \t. For the submission format, see sample_submit.csv: the file needs no header, and the 10,000 predicted idioms are written one per line in test-set order.

  2. Evaluation metric
    The competition is scored by classification accuracy, with a maximum score of 1. Reference evaluation code:

from sklearn.metrics import accuracy_score
y_pred = [0, 2, 1, 3]
y_true = [0, 1, 2, 3]
accuracy_score(y_true, y_pred)
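For intuition, the same metric can be computed without scikit-learn. This is only a sketch of what accuracy means here, not the organizers' scoring script. With the toy labels above, two of the four predictions match.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the reference."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

score = accuracy([0, 1, 2, 3], [0, 2, 1, 3])
print(score)  # 0.5
```

The same function works unchanged on lists of idiom strings, which is the form the submission takes.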
# Notebook setup: install the ERNIE package (pip output goes to log.log)
!pip install paddle-ernie > log.log
import sys
import numpy as np
import pandas as pd
from sklearn.metrics import f1_score
import paddle as P

# Both files are tab-separated
test_df = pd.read_csv('test.csv', sep='\t')
train_df = pd.read_csv('train.csv', sep='\t')
train_df.head()
  | text | candidate | label
0 | As the most popular digital product, Apple iPad tablet has been started by a large number of users, but many netizens are putting their beloved iP | ['Polite', 'inexplicable', 'reluctant to leave', 'looking for a suitcase'] | be rather baffling
1 | Take the latest iPad product, commonly known as iPad 2, as an example (although Apple has never called this product iPad 2 | ['don't take your clothes off', 'smack your lips', 'make plans and borrow money', 'follow the good as the flow'] | follow correct opinions or well-intentioned advice like water flowing swiftly and smoothly downward
2 | At present, government departments have long been "[MASK] | ['vulnerable', 'common practice', 'like wind passing by', 'uncanny workmanship'] | become a common practice
3 | One is called "twin achievements". The two towns in the Central Plains have posted almost identical political performance reports on their respective government websites | ['unique', 'indomitable', 'unparalleled in the world', 'overworked'] | it happens that there is a similar case
4 | Another one is called "the question has been read". Not long ago, a netizen in a certain place reported the difficulty of travel to the district head's mailbox. Six days later, when you come | ['don't be surprised at strange things', 'nepotism', 'root cause', 'deserve death'] | become inured to the unusual
train_df['text'].iloc[10]
'The first problem in developing emerging markets is to start from scratch. In these areas, there is a demand for online games and entertainment, but there is no ready-made market to rely on and learn from. "I don't even know such terms as in-game test and public test.[MASK][MASK][MASK][MASK]The difficulty of promotion. " Chen Xing told reporters.'
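Note that in the training rows the blank appears as four consecutive [MASK] tokens, which matches the four-token idioms that the scoring code in this post assumes. A small sketch for checking this before prediction follows; the `count_masks` helper and the shortened sample string are hypothetical.

```python
def count_masks(text, mask="[MASK]"):
    """Count how many mask placeholders a text contains."""
    return text.count(mask)

# Shortened version of the row printed above.
sample = "I don't even know such terms.[MASK][MASK][MASK][MASK]The difficulty of promotion."
n_masks = count_masks(sample)
print(n_masks)

# Rows whose count is not 4 would need special handling before scoring.
needs_attention = n_masks != 4
```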

Approach

This task is a typical mask-filling (cloze) problem, and there are two common ways to approach it:

  • Predict the blank with a masked language model (MaskLM)
  • Frame it as text question answering (QA)

Here we use ERNIE to predict the tokens at the masked positions, which fits ERNIE's masked-language-model pretraining well.
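The MaskLM idea can be sketched without any model. Given one row of vocabulary scores per [MASK] position, each candidate idiom is scored by averaging the scores of its tokens at the matching positions, and the highest average wins. The toy vocabulary, token ids and numbers below are invented purely for illustration.

```python
def score_candidate(mask_logits, token_ids):
    """Mean logit of the candidate's i-th token at the i-th mask position."""
    if len(mask_logits) != len(token_ids):
        raise ValueError("candidate length must equal the number of masks")
    return sum(row[t] for row, t in zip(mask_logits, token_ids)) / len(token_ids)

def pick_best(mask_logits, candidates):
    """candidates maps a name to its token ids; return the best-scoring name."""
    return max(candidates, key=lambda name: score_candidate(mask_logits, candidates[name]))

# Two mask positions over a toy 4-token vocabulary (values are invented).
toy_logits = [
    [0.1, 2.0, 0.3, 0.0],
    [0.5, 0.2, 3.0, 0.1],
]
toy_candidates = {"A": [1, 2], "B": [0, 3]}

best = pick_best(toy_logits, toy_candidates)
print(best)
```

Averaging raw logits is a simple heuristic. Summing log-probabilities would be another reasonable choice, but it is not what this post does.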

Define model

from ernie.modeling_ernie import ErnieModelForPretraining, ErnieModel
from ernie.tokenizing_ernie import ErnieTokenizer
import paddle.fluid.layers as L
import paddle.fluid.dygraph as D

class ErnieCloset(ErnieModelForPretraining):
    def __init__(self, *args, **kwargs):
        super(ErnieCloset, self).__init__(*args, **kwargs)
        del self.pooler_heads
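    def forward(self, src_ids, *args, **kwargs):
        pooled, encoded = ErnieModel.forward(self, src_ids, *args, **kwargs)
        # Keep only the hidden states at [MASK] positions.
        # Note: mask_id is a module-level global and must be set
        # (from tokenizer.mask_id) before the model is called.
        encoded_2d = L.gather_nd(encoded, L.where(src_ids == mask_id))
        # MLM head: dense transform followed by layer norm
        encoded_2d = self.mlm(encoded_2d)
        encoded_2d = self.mlm_ln(encoded_2d)

        # Project onto the vocabulary by reusing the word-embedding matrix
        logits_2d = L.matmul(encoded_2d, self.word_emb.weight, transpose_y=True) + self.mlm_bias
        return logits_2d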
tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0')
rev_dict = {v: k for k, v in tokenizer.vocab.items()}
rev_dict[tokenizer.pad_id] = '' # replace [PAD]
rev_dict[tokenizer.sep_id] = '' # replace [SEP]
rev_dict[tokenizer.unk_id] = '' # replace [UNK]

@np.vectorize
def rev_lookup(i):
    return rev_dict[i]

ernie = ErnieCloset.from_pretrained('ernie-1.0')
ernie.eval()
mask_id = tokenizer.mask_id  # read as a global inside ErnieCloset.forward
ids, _ = tokenizer.encode('The Reform Movement of 1898, also known as the Hundred Days Reform, was a bourgeois reform carried out through Emperor Guangxu by [MASK][MASK][MASK], Liang Qichao and other reformists.')
print(ids)

ids = np.expand_dims(ids, 0)
ids = D.to_variable(ids)
logits = ernie(ids).numpy()
output_ids = np.argmax(logits, -1)
seg_txt = rev_lookup(output_ids)
print(seg_txt)
[    1  3299  3721   282    72     4   311   351   502   139   534   102
     4    10     3     3     3     6  1164  1087   634    43   534   102
   809     8   391   124    93   325  1659   757    71    40     5     7
   191   138    66   976   222   285   698 12043     2]
['康' '有' '为']

(The three predicted characters spell the name Kang Youwei.)
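In the ids printed above, the run of three 3s corresponds to the three [MASK] tokens, and the model returns one row of logits per mask. The gather step in forward can be mimicked in plain Python. This sketch assumes the mask indices come back in row-major order, as they do with NumPy's nonzero. The `mask_positions` helper is hypothetical and the second row of the batch is made up.

```python
def mask_positions(batch_ids, mask_id):
    """Return (row, column) pairs of every mask token, in row-major order."""
    return [
        (r, c)
        for r, row in enumerate(batch_ids)
        for c, tok in enumerate(row)
        if tok == mask_id
    ]

# First row: a fragment of the ids printed above. Second row: invented.
batch = [
    [1, 10, 3, 3, 3, 6, 2],
    [1, 3, 3, 3, 7, 8, 2],
]
positions = mask_positions(batch, mask_id=3)
print(positions)
```

Because all masks of the first sentence come before those of the second, a batch where every sentence has the same number of masks can be split back into per-sentence blocks by simple slicing.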

Simple test

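import ast

for _, row in train_df.iloc[:10].iterrows():
    print(row.text)
    print(f'Correct answer: {row.label}')
    ids, _ = tokenizer.encode(row.text)
    mask_id = tokenizer.mask_id
    ids = np.expand_dims(ids, 0)
    ids = D.to_variable(ids)
    logits = ernie(ids).numpy()  # shape: [number of masks, vocab size]

    # The candidate column holds a list literal; parse it without eval()
    candidates = ast.literal_eval(row.candidate)
    candidate_logits = []
    for s in candidates:
        # Token ids of the idiom without [CLS]/[SEP]; assumes four tokens and four masks
        token_ids = tokenizer.encode(s)[0][1:-1]
        # Mean logit of the i-th idiom token at the i-th mask position
        candidate_logits.append(logits[[0, 1, 2, 3], token_ids].mean())

    print(f'Prediction result: {candidates[np.argmax(candidate_logits)]}')
    print('')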
As the most popular digital product, the Apple iPad tablet has been bought by a large number of users, but many netizens, on taking their beloved iPad home, find it somewhat [MASK][MASK][MASK][MASK]: iTunes, iOS, firmware versions, AppStore, JailbreakME, jailbreaking, synchronization and so on look like words assembled from letters and terms picked at random from a dictionary, and they leave new Apple fans feeling anxious. Being confused by Apple products is really nothing to be ashamed of. Every Apple user, this editor included, has been through it. After all, no one has the habit of reading the manual before buying.
Correct answer: inexplicable
Prediction result: inexplicable

Take the latest iPad product, commonly known as iPad 2, as an example (although Apple has never called this product iPad 2, it is indeed an upgraded version of the first generation, so we might as well [MASK][MASK][MASK][MASK] and call it iPad 2). The product has the following specifications that we need to understand:
Correct answer: follow good advice
Prediction result: follow good advice

At present, it has long been "[MASK][MASK][MASK][MASK]" for government departments to use the Internet to disclose information, gather public opinion, improve work efficiency and narrow the distance with the masses. Looking around, there seems to be no government without its own website. There are countless mayors' mailboxes and district heads' mailboxes, and official blogs are very popular.
Correct answer: common practice
Prediction result: common practice

One is called "twin achievements". Two towns in the Central Plains posted almost identical performance reports on their respective government websites. Tongbai.com and suixian people's government.com published articles on "highlighting three characteristics of forestry ecological construction in CHENGWAN township" and "highlighting three characteristics of forestry ecological construction in Pinggang Town, suixian county". The two townships not only have the same land area and population, but even the same seedlings! [MASK][MASK][MASK][MASK], at the beginning of this year, on the "China Fire Online" website, the publicity drafts of the Henan Kaifeng fire brigade and the Luohe fire brigade were identical, which netizens jokingly called "Kaifeng leading Luohe guiding work".
Correct answer: coincidentally
Prediction result: coincidentally

Another is called "problem 'read'". Not long ago, a netizen somewhere reported travel difficulties to the district head's mailbox. Six days later, the online reply consisted of a single word: "read". The word "read" says neither yes nor no; the problem is kicked along level by level and pushed down level by level. This [MASK][MASK][MASK][MASK] official practice has now been kicked onto the Internet and into the district head's mailbox.
Correct answer: no wonder
Prediction result: no wonder

The appearance of "twin achievements", "official blogs" and "problem 'read'" leaves people [MASK][MASK][MASK][MASK], and makes even more of them stay away. Through such official websites, the masses see the perfunctoriness and dereliction of duty of some officials, and the rudeness and neglect of treating the masses as "fools". Where is the equal dissemination of information?
Correct answer: neither laugh nor cry
Prediction result: neither laugh nor cry

A netizen left a message: the leadership blog is not what matters. What matters is whether officials can really sympathize with the people's feelings, pay attention to people's livelihood, and take the people's interests seriously. If so, then whether online or offline, officialdom will be [MASK][MASK][MASK][MASK].
Correct answer: gone
Prediction result: gone

11 On June 18, Xi Guohua, Vice Minister of the Ministry of information industry, said that the fixed network life is not easy now, and the Ministry of information industry will[MASK][MASK][MASK][MASK]Promote licensing. This statement is beneficial to China Telecom and China Netcom without mobile license. This time, China Mobile also revealed for the first time that it will obtain fixed network license and develop related businesses. Sina
Correct answer: duty bound
Prediction result: keep your eyes on it

"The discussion on whether IMEI should be controlled [MASK][MASK][MASK][MASK], and the original conclusion was that there was no need," a senior industry insider said. China's mobile phones access the network via SIM, not IMEI. In fact, every time a user makes a call, the phone's IMEI code is first sent to the operator's back end. So in other countries, after a phone is lost, the user can ask the operator to stop it from being used by restricting the IMEI code. Mobile stakeholders confirmed that Chinese operators do not yet provide this service.
Correct answer: it has a long history
Prediction result: it has a long history

With the strong growth of China's online game market, domestic online games seem to be enjoying boundless glory in the global market. Netdragon's "Conquest" officially entered the Middle East and other regions, the first time Chinese online games appeared in the Arab region; the licensing deal in Vietnam for Blue Harbor's "Journey to the West OL" reached US $1 million, setting a record for exports of domestic online games to Southeast Asia; a Chinese "fleet" of 15 enterprises such as Jinshan, Lianzhong, target software and perfect time and space appeared at the Japanese video game exhibition for the first time. For a while, it seems that Chinese online games overseas are in a state of [MASK][MASK][MASK][MASK]. But is it really such smooth sailing?
Correct answer: blossom and bear fruit
Prediction result: spring is like a sea
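Tallying the ten samples above gives a quick sanity check on the headline number. The pairs below are transcribed from the printed output. Eight of the ten predictions match, i.e. 80% on this slice, which is of course far too small to be a reliable estimate.

```python
# (correct answer, prediction) pairs transcribed from the output above.
results = [
    ("inexplicable", "inexplicable"),
    ("follow good advice", "follow good advice"),
    ("common practice", "common practice"),
    ("coincidentally", "coincidentally"),
    ("no wonder", "no wonder"),
    ("neither laugh nor cry", "neither laugh nor cry"),
    ("gone", "gone"),
    ("duty bound", "keep your eyes on it"),
    ("it has a long history", "it has a long history"),
    ("blossom and bear fruit", "spring is like a sea"),
]

correct = sum(answer == pred for answer, pred in results)
accuracy = correct / len(results)
print(f"{correct}/{len(results)} correct, accuracy = {accuracy:.0%}")
```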
MAX_SEQLEN = 400

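def make_data(df):
    data = []
    for _, row in df.iterrows():
        text_id, _ = tokenizer.encode(row.text)
        # Truncate, then right-pad with 0 up to MAX_SEQLEN.
        # Caution: truncation can cut off [MASK] tokens in very long texts.
        text_id = text_id[:MAX_SEQLEN]
        text_id = np.pad(text_id, [0, MAX_SEQLEN-len(text_id)], mode='constant')
        data.append((text_id, 1))  # 1 is a dummy label; the test set has none
    return data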

test_data = make_data(test_df)
BATCH = 30
def get_batch_data(data, i):
    d = data[i*BATCH: (i + 1) * BATCH]
    feature, label = zip(*d)
    feature = np.stack(feature)  # Stack the BATCH samples into one numpy array
    label = np.stack(list(label))
    feature = P.to_tensor(feature) # Convert the numpy array to a paddle tensor
    label = P.to_tensor(label)
    return feature, label
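When batching with a fixed size, it is worth remembering that integer division of the dataset length by the batch size silently drops a final partial batch. For example, 10,000 test rows with a batch size of 30 leave 10 rows over. Below is a sketch of an iterator that covers every row; the name `iter_batches` is hypothetical.

```python
def iter_batches(data, batch_size):
    """Yield consecutive slices of data, including the last partial one."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]

# Stand-in for the 10,000 test rows.
batches = list(iter_batches(list(range(10000)), 30))
print(len(batches), len(batches[-1]))
```

Stepping the start index by the batch size avoids computing the number of batches at all, so there is no rounding to get wrong.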

Batch prediction

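import ast

with P.no_grad():
    ernie.eval()
    test_idx = 0
    test_pred = []
    # Ceiling division so the final partial batch is not dropped
    n_batches = (len(test_data) + BATCH - 1) // BATCH
    for j in range(n_batches):
        feature, label = get_batch_data(test_data, j)
        # One row of logits per [MASK] token, in sample order
        logits = ernie(feature).numpy()

        for ins_idx in range(feature.shape[0]):
            candidates = ast.literal_eval(test_df['candidate'].iloc[test_idx])
            # Assumes every sample kept exactly four [MASK] tokens after truncation
            ins_logits = logits[ins_idx*4:(ins_idx+1)*4]
            candidate_logits = []
            for s in candidates:
                candidate_logits.append(
                    ins_logits[[0,1,2,3], tokenizer.encode(s)[0][1:-1]].mean()
                )

            # Append once per sample, after all candidates are scored
            test_pred.append(
                candidates[np.argmax(candidate_logits)]
            )
            test_idx+=1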
# The rules say the submission file needs no header row
pd.DataFrame({
    'label': test_pred,
}).to_csv('submit.csv', index=False, header=False)
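Before uploading, it may be worth checking that the file has exactly one idiom per test row. Here is a standard-library sketch. The helper names, the temporary path and the three placeholder predictions are hypothetical.

```python
import csv
import os
import tempfile

def write_submission(path, predictions):
    """Write one prediction per line, with no header and no index column."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for p in predictions:
            writer.writerow([p])

def count_rows(path):
    """Count the rows in a CSV file."""
    with open(path, newline="", encoding="utf-8") as f:
        return sum(1 for _ in csv.reader(f))

preds = ["idiom_a", "idiom_b", "idiom_c"]  # placeholder predictions
path = os.path.join(tempfile.mkdtemp(), "submit.csv")
write_submission(path, preds)
n_rows = count_rows(path)
print(n_rows == len(preds))
```

For the real submission, the row count should equal the number of rows in the test set.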

Summary

This post uses the MaskLM task to fill in the idiom blanks with the pretrained ERNIE model alone, and accuracy reaches more than 80%. Fine-tuning the model on the training set should raise the accuracy further.
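One way such training could be set up is to treat the four candidate scores as a 4-way classification and minimise cross-entropy against the labelled idiom. This is an assumption, not something implemented in this post. The arithmetic looks like this, with invented scores:

```python
import math

def candidate_cross_entropy(scores, gold_index):
    """Negative log-probability of the gold candidate under a softmax over scores."""
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[gold_index]

scores = [0.1, 2.5, 0.3, 0.0]  # invented candidate scores
loss_if_gold_is_top = candidate_cross_entropy(scores, gold_index=1)
loss_if_gold_is_low = candidate_cross_entropy(scores, gold_index=3)
print(loss_if_gold_is_top < loss_if_gold_is_low)
```

The loss is small when the labelled idiom already has the highest score and large otherwise, so minimising it pushes the correct candidate's score up relative to the other three.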

Topics: Machine Learning Deep Learning NLP paddlepaddle