Event background
Chinese culture is broad and profound, with a long history, and idioms are among its finest elements. Most idioms consist of four characters and usually trace back to an allusion or a classical source. Some are easy to understand literally, such as "making a mountain out of a molehill" and "catching up from behind". Others can only be understood by knowing the source or allusion, such as "every day" and "the shadow of a bow and a snake in a cup".
Idioms are an important part of the Chinese curriculum in primary and junior middle school. How do you choose the appropriate idiom for a sentence? In this competition, contestants are expected to build a model that understands Chinese idioms.
https://challenge.xfyun.cn/topic/info?type=chinese-idioms
Event task
Given a Chinese sentence, the contestant must select the most appropriate idiom from a list of candidates for the given context. In other words, the model uses the sentence context to fill the blank position with the correct idiom.
Event data set
The training set contains 50,000 (5w) samples and the test set contains 10,000 (1w) samples. The label field in the test set is empty and must be predicted by the contestants.
Sample dataset:
- Original text: at present, it has long been "[MASK]" for government departments to make use of the Internet to disclose information, collect public opinion, improve work efficiency and narrow the distance with the masses
- Idioms to be chosen: ['vulnerable', 'common practice', 'like wind passing through ears', 'uncanny workmanship and lightning axe']
- Correct answer: common practice
Review rules
Data description
The contest data consists of a training set and a test set. The training set has 50,000 samples and the test set has 10,000, both in csv format with tab-separated (\t) columns. For the submission format, see sample_submit.csv: the file needs no header, and the 10,000 predicted idioms are written one per line in test-set order.
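The layout described above can be sketched with the standard library alone. This is a minimal illustration, not the competition's own tooling: the two-row sample, the helper names `read_rows` and `write_submission`, and the assumption that the training file carries a `text`/`candidate`/`label` header row are all hypothetical.

```python
import ast
import csv
import io

# Hypothetical excerpt in the assumed layout: text, candidate, label,
# separated by tabs, with a header row.
SAMPLE_TSV = (
    "text\tcandidate\tlabel\n"
    "It has long been [MASK] to publish information online.\t"
    "['vulnerable', 'common practice']\tcommon practice\n"
)


def read_rows(handle):
    """Yield (text, candidates, label) tuples from a tab-separated file object."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        # The candidate column holds a Python-style list literal;
        # ast.literal_eval parses it without executing arbitrary code.
        candidates = ast.literal_eval(row["candidate"])
        yield row["text"], candidates, row["label"]


def write_submission(handle, predictions):
    """Write one predicted idiom per line, with no header row."""
    for idiom in predictions:
        handle.write(f"{idiom}\n")


rows = list(read_rows(io.StringIO(SAMPLE_TSV)))
out = io.StringIO()
write_submission(out, [label for _, _, label in rows])
print(rows[0][1])
print(repr(out.getvalue()))
```

Parsing the candidate list with `ast.literal_eval` accepts only literals, which makes it a safer choice than `eval` for data read from a file.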
Evaluation index
The competition is scored by classification accuracy, with a maximum score of 1. Reference evaluation code:
    from sklearn.metrics import accuracy_score

    y_pred = [0, 2, 1, 3]
    y_true = [0, 1, 2, 3]
    accuracy_score(y_true, y_pred)
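For readers who want to see what the metric computes, classification accuracy is simply the fraction of positions where prediction and label agree. The helper `accuracy` below is a hypothetical stand-in written for illustration, not the organizers' scoring script.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the reference labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


y_pred = [0, 2, 1, 3]
y_true = [0, 1, 2, 3]
print(accuracy(y_true, y_pred))  # 0.5: positions 0 and 3 match

# The same function works when labels are idiom strings.
print(accuracy(["common practice", "inexplicable"],
               ["common practice", "vulnerable"]))  # 0.5
```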
    !pip install paddle-ernie > log.log

    import sys
    import numpy as np
    import pandas as pd
    from sklearn.metrics import f1_score
    import paddle as P

    test_df = pd.read_csv('test.csv', sep='\t')
    train_df = pd.read_csv('train.csv', sep='\t')
    train_df.head()
| | text | candidate | label |
---|---|---|---|
0 | As the most popular digital product, Apple iPad tablet has been started by a large number of users, but many netizens are putting their beloved iP | ['Polite', 'inexplicable', 'reluctant to leave', 'looking for a suitcase'] | be rather baffling |
1 | Take the latest iPad product, commonly known as iPad 2, as an example (although Apple has never called this product iPad 2 | ["don't take your clothes off", 'smack your lips', 'make plans and borrow money', 'follow the good as the flow'] | follow correct opinions or well-intentioned advice like water flowing swiftly and smoothly downward |
2 | At present, government departments have long been "[MASK] | ['vulnerable', 'common practice', 'like wind passing by', 'uncanny workmanship'] | become a common practice |
3 | One is called "twin achievements". The two towns in the Central Plains have posted almost identical political performance reports on their respective government websites | ['unique', 'indomitable', 'unparalleled in the world', 'overworked'] | it happens that there is a similar case |
4 | Another one is called "the question has been read". Not long ago, a netizen in a certain place reported the difficulty of travel to the district head's mailbox. Six days later, when you come | ["don't be surprised at strange things", 'nepotism', 'root cause', 'deserve death'] | become inured to the unusual |
train_df['text'].iloc[10]
'The first problem in developing emerging markets is to start from scratch. In these areas, there is a demand for online games and entertainment, but there is no ready-made market to rely on and learn from. "I don't even know such terms as in-game test and public test.[MASK][MASK][MASK][MASK]The difficulty of promotion. " Chen Xing told reporters.'
Problem solving ideas
This competition task is a typical mask-filling (cloze) problem, which can be approached in two ways:
- Masked language model (MaskLM) prediction
- Question answering (QA) over the text
Here we use ERNIE to predict the tokens at the masked positions, which matches ERNIE's masked-language-model pretraining objective.
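The MaskLM route comes down to one scoring rule: read the model's score for each candidate token at the corresponding mask position, average those scores, and pick the candidate with the highest mean. The sketch below shows that rule on invented numbers. The logits, token ids, candidate names, and the helpers `score_candidate` and `pick_candidate` are all hypothetical; in the real pipeline the logits come from ERNIE and the ids from its tokenizer.

```python
def score_candidate(logits, token_ids):
    """Mean logit of the candidate's tokens, one token per mask position."""
    if len(token_ids) != len(logits):
        raise ValueError("candidate length must match the number of mask positions")
    return sum(row[tok] for row, tok in zip(logits, token_ids)) / len(token_ids)


def pick_candidate(logits, candidates):
    """Return the name of the candidate that scores highest at the mask positions."""
    return max(candidates, key=lambda name: score_candidate(logits, candidates[name]))


# Toy setup: 2 mask positions and a vocabulary of 4 token ids (values invented).
toy_logits = [
    [0.1, 2.0, 0.3, 0.0],  # scores at mask position 0
    [0.2, 0.1, 1.5, 0.4],  # scores at mask position 1
]
toy_candidates = {
    "idiom_a": [1, 2],  # 2.0 and 1.5, mean 1.75
    "idiom_b": [0, 3],  # 0.1 and 0.4, mean 0.25
}
print(pick_candidate(toy_logits, toy_candidates))  # idiom_a
```

Averaging raw logits is only one possible scoring choice; summing log-probabilities after a softmax would be a reasonable alternative.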
Define model
    from ernie.modeling_ernie import ErnieModelForPretraining, ErnieModel
    from ernie.tokenizing_ernie import ErnieTokenizer
    import paddle.fluid.layers as L
    import paddle.fluid.dygraph as D


    class ErnieCloset(ErnieModelForPretraining):
        def __init__(self, *args, **kwargs):
            super(ErnieCloset, self).__init__(*args, **kwargs)
            # pooler_heads is not used for mask filling, so drop it
            del self.pooler_heads

        def forward(self, src_ids, *args, **kwargs):
            pooled, encoded = ErnieModel.forward(self, src_ids, *args, **kwargs)
            # Keep only the hidden states at [MASK] positions
            # (mask_id is a global that must be set before the first call)
            encoded_2d = L.gather_nd(encoded, L.where(src_ids == mask_id))
            encoded_2d = self.mlm(encoded_2d)
            encoded_2d = self.mlm_ln(encoded_2d)
            # Convert to token prediction results
            logits_2d = L.matmul(encoded_2d, self.word_emb.weight, transpose_y=True) + self.mlm_bias
            return logits_2d
    tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0')
    rev_dict = {v: k for k, v in tokenizer.vocab.items()}
    rev_dict[tokenizer.pad_id] = ''  # replace [PAD]
    rev_dict[tokenizer.sep_id] = ''  # replace [SEP]
    rev_dict[tokenizer.unk_id] = ''  # replace [UNK]


    @np.vectorize
    def rev_lookup(i):
        return rev_dict[i]


    ernie = ErnieCloset.from_pretrained('ernie-1.0')
    ernie.eval()
    ids, _ = tokenizer.encode('The reform movement of 1898, also known as the hundred day reform, was[MASK][MASK][MASK] ,Liang Qichao and other reformists carried out a bourgeois reform through Emperor Guangxu.')
    mask_id = tokenizer.mask_id
    print(ids)
    ids = np.expand_dims(ids, 0)
    ids = D.to_variable(ids)
    logits = ernie(ids).numpy()
    output_ids = np.argmax(logits, -1)
    seg_txt = rev_lookup(output_ids)
    print(seg_txt)
    [ 1 3299 3721 282 72 4 311 351 502 139 534 102 4 10 3 3 3 6 1164 1087 634 43 534 102 809 8 391 124 93 325 1659 757 71 40 5 7 191 138 66 976 222 285 698 12043 2]
    ['Kang' 'have' 'by']
Simple test
    for row in train_df.iloc[:10].iterrows():
        print(row[1].text)
        print(f'Correct answer: {row[1].label}')
        ids, _ = tokenizer.encode(row[1].text)
        mask_id = tokenizer.mask_id
        ids = np.expand_dims(ids, 0)
        ids = D.to_variable(ids)
        logits = ernie(ids).numpy()
        # The candidate column stores a Python list literal as a string
        candidates = eval(row[1].candidate)
        candidate_logits = []
        for s in candidates:
            # Mean logit of the idiom's four tokens at the four mask positions
            # ([1:-1] strips the [CLS] and [SEP] ids)
            candidate_logits.append(logits[[0,1,2,3], tokenizer.encode(s)[0][1:-1]].mean())
        print(f'Prediction result: {candidates[np.argmax(candidate_logits)]}')
        print('')
As the most popular digital product, apple iPad Tablet computers have been started by a large number of users, but many netizens are putting their beloved iPad When I took it home, I found that it was somewhat strange[MASK][MASK][MASK][MASK]: iTunes, iOS System, firmware version AppStore, JailbreakME Prison break, synchronization and so on seem to be words composed of letters and words randomly extracted from the dictionary, which appear in front of everyone, making the new fruit fans feel anxious. Although confused by Apple products, it's actually nothing to be ashamed of. Every Apple user, including Xiaobian me, has had this experience. After all, no one has the habit of reading the manual before buying.
Correct answer: inexplicable
Prediction result: inexplicable

with iPad The latest product, commonly known as iPad 2 take as an example (Although Apple has never called this product iPad 2, But this product is indeed an upgraded version of the first generation of products, so we are not as good as[MASK][MASK][MASK][MASK], Call it iPad 2), The specification of this product has the following features that we must understand:
Correct answer: follow good advice
Prediction result: follow good advice

At present, it is long ago for government departments to use the Internet to make public information, gather public opinion, improve work efficiency and narrow the distance with the masses "[MASK][MASK][MASK][MASK]". Looking around, it seems that no government does not have its own website. There are countless mayors' mailboxes and district heads' mailboxes, and official blogs are very popular.
Correct answer: common practice
Prediction result: common practice

One is called "twin achievements". The two towns in the Central Plains have posted almost identical performance reports on their respective government websites. Tongbai.com and suixian people's government.com have published articles on "highlighting three characteristics of forestry ecological construction in CHENGWAN township" and "highlighting three characteristics of forestry ecological construction in Pinggang Town, suixian county". The two different townships not only have the same land area and population, but also the same seedlings![MASK][MASK][MASK][MASK], At the beginning of this year, on the "China Fire Online" website, the publicity drafts of Henan Kaifeng fire brigade and Luohe fire brigade were the same, which was jokingly called "Kaifeng leading Luohe guiding work" by netizens.
Correct answer: coincidentally
Prediction result: coincidentally

Another one is called "problem 'Read'". Not long ago, a netizen in a certain place reported the difficulty of travel to the district head's mailbox. After 6 days, when I came to reply online, there were two words: "read". "Read" two words, do not say yes, do not say no, play the ball level by level, and turn down level by level. this[MASK][MASK][MASK][MASK]The official practice of kicked the Internet and the mailbox of the district head.
Correct answer: no wonder
Prediction result: no wonder

"Twin achievements", "official blog" and "problems 'Read'" The appearance of makes people[MASK][MASK][MASK][MASK], More people stay away. Because through such an official website, the masses see the perfunctory and dereliction of duty of some officials, and the rudeness and neglect of treating the masses as "fools". Where is the equal dissemination of information?
Correct answer: neither laugh nor cry
Prediction result: neither laugh nor cry

A netizen left a message: the leadership blog is not important at all. The problem is that we can really sympathize with the people's feelings, pay attention to the people's livelihood, and really take the interests of the people seriously. If so, whether online or offline, officialdom will be[MASK][MASK][MASK][MASK].
Correct answer: gone
Prediction result: gone

11 On June 18, Xi Guohua, Vice Minister of the Ministry of information industry, said that the fixed network life is not easy now, and the Ministry of information industry will[MASK][MASK][MASK][MASK]Promote licensing. This statement is beneficial to China Telecom and China Netcom without mobile license. This time, China Mobile also revealed for the first time that it will obtain fixed network license and develop related businesses. Sina
Correct answer: duty bound
Prediction result: keep your eyes on it

"yes IMEI Should there be a discussion on control[MASK][MASK][MASK][MASK], The original conclusion was no need." A senior industry insider said that China's mobile phones have passed SIM Network access, not IMEI. The fact is that every time a user calls, the mobile phone IMEI The code will be sent to the operator's background first. Therefore, in foreign countries, after the user's mobile phone is lost, the operator can be required to pass the restriction IMEI Code to stop using the phone. Mobile stakeholders confirmed that Chinese operators have not yet provided this service.
Correct answer: it has a long history
Prediction result: it has a long history

With the strong growth of China's online game market, domestic online games seem to have unlimited scenery in the global market. Netdragon's "Conquest" officially entered the Middle East and other places, and Chinese online games appeared in the Arab region for the first time; Journey to the west by Blue Harbor OL> The signing authorization in Vietnam reached US $1 million, setting a record for the export of domestic online games to Southeast Asia; The Chinese "fleet" composed of 15 enterprises such as Jinshan, Lianzhong, target software and perfect time and space appeared in the Japanese video game exhibition for the first time. For a while, it seems that Chinese online games have arrived overseas[MASK][MASK][MASK][MASK]The situation. But is it really so smooth sailing?
Correct answer: blossom and bear fruit
Prediction result: spring is like a sea
    MAX_SEQLEN = 400


    def make_data(df):
        data = []
        for i, row in enumerate(df.iterrows()):
            text_id, _ = tokenizer.encode(row[1].text)
            text_id = text_id[:MAX_SEQLEN]
            text_id = np.pad(text_id, [0, MAX_SEQLEN-len(text_id)], mode='constant')
            data.append((text_id, 1))
        return data


    test_data = make_data(test_df)
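One thing worth checking with fixed-length inputs is whether truncation removes any `[MASK]` tokens, since the candidate scoring relies on four mask positions per sample. The sketch below is a pure-Python illustration under stated assumptions: the padding id is 0 (the default fill value of `np.pad`), the mask id is 3 (the id shown for `[MASK]` in the printed example earlier), and the helpers `pad_or_truncate` and `masks_kept` are hypothetical.

```python
MAX_SEQLEN = 8  # kept small for illustration; the article uses 400
PAD_ID = 0      # assumed padding id
MASK_ID = 3     # assumed [MASK] id


def pad_or_truncate(ids, max_len=MAX_SEQLEN, pad_id=PAD_ID):
    """Clip ids to max_len, then right-pad with pad_id up to max_len."""
    clipped = list(ids[:max_len])
    return clipped + [pad_id] * (max_len - len(clipped))


def masks_kept(ids, max_len=MAX_SEQLEN, mask_id=MASK_ID):
    """Count the [MASK] tokens that survive truncation to max_len."""
    return sum(1 for tok in ids[:max_len] if tok == mask_id)


short = [1, 5, 3, 3, 3, 3, 2]               # fits, gets one pad token
long = [1, 5, 6, 7, 8, 9, 3, 3, 3, 3, 2]    # masks straddle the cut-off

print(pad_or_truncate(short))  # [1, 5, 3, 3, 3, 3, 2, 0]
print(masks_kept(short))       # 4
print(masks_kept(long))        # 2, two masks were cut off
```

A sample that loses masks this way would shift the per-sample slices of the logits, so it may be worth counting such rows before running the batch prediction.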
    BATCH = 30


    def get_batch_data(data, i):
        d = data[i*BATCH: (i + 1) * BATCH]
        feature, label = zip(*d)
        feature = np.stack(feature)  # Stack the BATCH samples into one numpy array
        label = np.stack(list(label))
        feature = P.to_tensor(feature)  # Convert the numpy array to a paddle tensor
        label = P.to_tensor(label)
        return feature, label
Batch forecast
    with P.no_grad():
        ernie.eval()
        test_idx = 0
        test_pred = []
        # Round up so the final partial batch is predicted too
        # (10,000 samples are not a multiple of BATCH = 30)
        for j in range((len(test_data) + BATCH - 1) // BATCH):
            feature, label = get_batch_data(test_data, j)
            logits = ernie(feature).numpy()
            for ins_idx in range(feature.shape[0]):
                candidates = eval(test_df['candidate'].iloc[test_idx])
                candidate_logits = []
                for s in candidates:
                    # Rows ins_idx*4 .. ins_idx*4+3 hold this sample's four mask positions
                    candidate_logits.append(
                        logits[ins_idx*4:(ins_idx+1)*4][[0,1,2,3], tokenizer.encode(s)[0][1:-1]].mean()
                    )
                test_pred.append(candidates[np.argmax(candidate_logits)])
                test_idx += 1
    # The submission file takes no header: one idiom per line, in test-set order
    pd.DataFrame({
        'label': test_pred,
    }).to_csv('submit.csv', index=None, header=False)
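When the number of samples is not a multiple of the batch size, a loop bound computed with floor division stops before the last partial batch. The sketch below shows a batching helper that covers every row. The name `iter_batches` is hypothetical, and the numbers mirror the 10,000-row test set and `BATCH = 30` used in this article.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of items; the last slice may be shorter."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


n_samples, batch = 10000, 30

# Rows reached when the number of batches is computed with floor division.
print(n_samples // batch * batch)  # 9990

sizes = [len(b) for b in iter_batches(list(range(n_samples)), batch)]
print(len(sizes), sum(sizes), sizes[-1])  # 334 10000 10
```

Checking that the number of predictions equals the number of test rows before writing the submission file catches this kind of mismatch early.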
Summary
This article uses the MaskLM task to fill in the idiom blanks. With the pretrained model alone, accuracy can exceed 80%, and it can be raised further by training the model on the competition's training set.
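The article does not specify a training objective. One possible choice, offered here purely as an assumption and not as the author's method, is softmax cross-entropy over the four candidate scores, so that training pushes the correct idiom's score above the others. The function `candidate_cross_entropy` and the example scores are hypothetical.

```python
import math


def candidate_cross_entropy(scores, gold_index):
    """Negative log-probability of the gold candidate under a softmax over scores."""
    top = max(scores)
    # Subtract the maximum before exponentiating for numerical stability.
    log_sum_exp = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_sum_exp - scores[gold_index]


scores = [1.75, 0.25, 0.0, -1.0]  # invented mean logits for four candidates
loss_when_right = candidate_cross_entropy(scores, 0)  # gold candidate ranked first
loss_when_wrong = candidate_cross_entropy(scores, 3)  # gold candidate ranked last
print(loss_when_right < loss_when_wrong)  # True
```

In a real training loop this loss would be computed on framework tensors so that gradients flow back into the model; the plain-Python version only illustrates the arithmetic.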