python3 seq2seq_convert.py code interpretation — extractive + generative summarization, code interpretation 4

Posted by Jocka on Fri, 22 Oct 2021 08:52:18 +0200

First, look at the main function and how it reads the files:

if __name__ == '__main__':
    # data_extract_json = /home/xiaoguzai/code/French research cup text summary/datasets/train_extract.json
    data = load_data(data_extract_json)
    data_x = np.load(data_extract_npy)
    data_seq2seq_json = data_json[:-5] + '_seq2seq.json'
    convert(data_seq2seq_json, data, data_x)
    print(u'Output path:%s' % data_seq2seq_json)

To understand what the data_extract_json and data_extract_npy files contain, we need to go back to extract_model.py and check where data_extract_json and data_extract_npy are used.

Back in extract_model.py, we find:

data = load_data(data_extract_json)
data_x = np.load(data_extract_npy)

which load data and data_x.

Now return to seq2seq_convert.py.

Here len(data) and len(data_x) are both 20 (the total length of all the data is 20), and data_seq2seq_json = train_seq2seq.json.
Next, enter the data conversion section

convert(data_seq2seq_json, data, data_x)

convert loads the corresponding data. The data_split function used here is defined in snippets.py:

def data_split(data, fold, num_folds, mode):
    """Divide training set and verification set
    """
    if mode == 'train':
        D = [d for i, d in enumerate(data) if i % num_folds != fold]
    else:
        D = [d for i, d in enumerate(data) if i % num_folds == fold]

    if isinstance(data, np.ndarray):
        return np.array(D)
    else:
        return D
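To see concretely how data_split carves the samples into folds, here is a small self-contained sketch (num_folds = 2 is assumed, with integers standing in for the 20 real samples):

```python
import numpy as np

def data_split(data, fold, num_folds, mode):
    """Split the data into training and validation sets."""
    if mode == 'train':
        D = [d for i, d in enumerate(data) if i % num_folds != fold]
    else:
        D = [d for i, d in enumerate(data) if i % num_folds == fold]
    if isinstance(data, np.ndarray):
        return np.array(D)
    return D

data = list(range(20))  # stand-in for the 20 samples
folds = [data_split(data, f, 2, 'valid') for f in range(2)]
print(folds[0])  # [0, 2, 4, ..., 18] -> samples at even indices
print(folds[1])  # [1, 3, 5, ..., 19] -> samples at odd indices
```

Note that the validation slices of all folds together cover every sample exactly once — this is what lets the conversion step later produce a prediction for the whole dataset.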

That is, each fold_convert call receives the same data; valid_data and valid_x are the slices cut out for that fold. Let's look at the data retrieved in each fold:

for fold in range(num_folds):
    total_results.append(fold_convert(data, data_x, fold))

Step into fold_convert to see what it does:

def fold_convert(data, data_x, fold):
    """Each fold Use the corresponding model for data conversion
    """
    valid_data = data_split(data, fold, num_folds, 'valid')
    valid_x = data_split(data_x, fold, num_folds, 'valid')
    model.load_weights('./weights/extract_model.%s.weights' % fold)
    y_pred = model.predict(valid_x)[:, :, 0]
    results = []
    for d, yp in tqdm(zip(valid_data, y_pred), desc=u'Conversion in progress'):
        yp = yp[:len(d[0])]
        yp = np.where(yp > threshold)[0]
        source_1 = ''.join([d[0][int(i)] for i in yp])
        source_2 = ''.join([d[0][int(i)] for i in d[1]])
        result = {
            'source_1': source_1,
            'source_2': source_2,
            'target': d[2],
        }
        results.append(result)

    return results
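The two lines `yp = yp[:len(d[0])]` and `np.where(yp > threshold)[0]` are what turn the model's per-sentence scores into an extraction. A minimal sketch of just that step (the sentence list, scores, and threshold value here are made up for illustration; the real threshold is defined elsewhere in the project):

```python
import numpy as np

threshold = 0.2                                   # assumed value for illustration
d0 = ['sent A.', 'sent B.', 'sent C.', 'sent D.'] # d[0]: the sentences of one sample
yp = np.array([0.9, 0.1, 0.7, 0.05, 0.3])         # model scores, padded beyond len(d0)

yp = yp[:len(d0)]                  # drop the padding positions
idx = np.where(yp > threshold)[0]  # indices of sentences scored above threshold
source_1 = ''.join(d0[int(i)] for i in idx)
print(idx)       # [0 2]
print(source_1)  # 'sent A.sent C.'
```

So source_1 is simply the concatenation of the sentences whose predicted score exceeds the threshold, in their original order.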

The results extracted with fold = 0 and fold = 1 are different: the same data and data_x are split into different valid_data and valid_x for each fold.
To understand why the same data is split into different valid_data and valid_x, first ask: what data do we have?
We have the original text and the summary content. In addition, we have the officially provided indices mapping original-text sentences to the summary, and a trained model that extracts from the original text toward the summary.
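Putting that together, each record d in data can be assumed to have the following layout — inferred from how d[0], d[1], and d[2] are indexed in fold_convert, not stated explicitly anywhere:

```python
# Assumed layout of one record, inferred from the indexing in fold_convert:
#   d[0] -> list of source-text sentences
#   d[1] -> official labels: indices of the sentences forming the extractive summary
#   d[2] -> target summary string
d = (
    ['First sentence.', 'Second sentence.', 'Third sentence.'],  # d[0]
    [0, 2],                                                      # d[1]
    'Reference summary text.',                                   # d[2]
)

# source_2 is built exactly as in fold_convert: join the labelled sentences
source_2 = ''.join(d[0][int(i)] for i in d[1])
print(source_2)  # 'First sentence.Third sentence.'
```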
Here, two pieces of data at a time are extracted from the 20 pieces of data, because num_folds = 2 (the author originally defined num_folds = 15; this is tied to the training during the extraction stage — however many folds were trained there, the same number of folds is used here). Therefore, a total of 2 * 2 = 4 groups of data are extracted.
For each group of data, we have both the extraction predicted by the model (source_1) and the extraction given by the original labels (source_2). (So there is a question here: treating the model as fixed, should we use the model-predicted extraction or the originally labelled one?)

for d, yp in tqdm(zip(valid_data, y_pred), desc=u'Conversion in progress'):
    yp = yp[:len(d[0])]
    yp = np.where(yp > threshold)[0]
    source_1 = ''.join([d[0][int(i)] for i in yp])
    source_2 = ''.join([d[0][int(i)] for i in d[1]])
    result = {
        'source_1': source_1,
        'source_2': source_2,
        'target': d[2],
    }
    results.append(result)

Next, write the data to a file

n = 0
while True:
    i, j = n % num_folds, n // num_folds
    try:
        d = total_results[i][j]
    except IndexError:  # one fold has run out of results
        break
    F.write(json.dumps(d, ensure_ascii=False) + '\n')
    n += 1
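The index arithmetic `i, j = n % num_folds, n // num_folds` interleaves the folds so that the written-out records follow the original sample order (sample i lives in fold i % num_folds). A minimal sketch, with toy strings in place of the real result dicts and file write:

```python
num_folds = 2
total_results = [['a0', 'a1', 'a2'],   # fold 0: original samples 0, 2, 4
                 ['b0', 'b1', 'b2']]   # fold 1: original samples 1, 3, 5

written = []
n = 0
while True:
    i, j = n % num_folds, n // num_folds  # fold index, position within fold
    try:
        d = total_results[i][j]
    except IndexError:  # one fold exhausted -> stop
        break
    written.append(d)
    n += 1
print(written)  # ['a0', 'b0', 'a1', 'b1', 'a2', 'b2']
```

Because data_split assigned sample i to fold i % num_folds, this round-robin read restores exactly the original dataset order in the output file.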

(In fact, source_2 in the middle is not a true gold standard: it was obtained by computing the corresponding ROUGE scores. We treat it as the standard answer, but the source_1 actually predicted by the model may differ from the standard answer source_2.)

Topics: Python Machine Learning Deep Learning