[AI Talent Creation Camp Phase II] Modern poetry generator based on LSTM

Posted by powlow on Sun, 06 Mar 2022 03:27:57 +0100

Linked from the AI Studio project: https://aistudio.baidu.com/aistudio/projectdetail/3458536?contributionType=1

Automatic generation of modern poetry based on LSTM

Project background

Emotion is ennobled by poetry, and poetry spreads because of emotion. Classical-poetry generation has already matured to the point where the output is hard to tell from the real thing. As a young person of the 21st century, however, I seem to prefer modern poetry, precisely because it is incomprehensible... fine, I admit it, I write some myself.

However, the poems I wrote were so bad that I was genuinely embarrassed to give them to my significant other. Since I had just learned about the LSTM model, this project was born, drawing on the text-generation projects of many more experienced authors.

Dataset usage

Two datasets are used in this project:

  1. Modern love poetry dataset: roughly 2,000 modern poems that I collected with a web crawler.
  2. Ancient poetry collection dataset: since dataset 1 is only about 0.5 MB, which is little data, some classical poetry is mixed in for training (see the loading sketch below).
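
The notebooks handle data loading themselves; purely for illustration, here is a minimal sketch of how the two corpora could be read and mixed into one training corpus. The file paths are the ones from the Config class later in this post, but the one-poem-per-line layout and the load_poems helper are my assumptions, not the project's actual code.

# Illustrative only: mix the two corpora into a single list of poems.
# Assumes each file stores one poem per line (not verified against the real data).
def load_poems(path):
    with open(path, encoding='utf-8') as fp:
        return [line.strip() for line in fp if line.strip()]

poems = load_poems('vocab/poems_without_title.txt')   # ~2,000 modern poems
poems += load_poems('vocab/poems_zh.txt')             # classical poems mixed in
print(len(poems))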

Introduction to LSTM model

LSTM is a very popular recurrent neural network. A simple RNN struggles with longer sentences because it overwrites its cell state indiscriminately, so earlier information is lost. The LSTM instead updates its cell state selectively through a set of gates, which gives it a much better grasp of long-range context when understanding text.
The underlying mechanics have been explained thoroughly by many people; the post Understanding LSTM Networks is a particularly clear treatment.
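
For reference, the standard LSTM update (as described in that post) can be written as:

$$
\begin{aligned}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) \\
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) \\
\tilde{c}_t &= \tanh(W_c [h_{t-1}, x_t] + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

Here $\sigma$ is the sigmoid function and $\odot$ is element-wise multiplication. The forget gate $f_t$ decides what to drop from the cell state, the input gate $i_t$ decides what to write into it, and the output gate $o_t$ decides what to expose as the hidden state; this is the selective cell-state update described above.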

Effect demonstration

I'm worried about the falling mountain with the bright moon. I walk high and short along a song of returning to silver flowers. As a result, I don't stop flowing in autumn. A thread is in his mouth. Your hand is still in my body. It's far away. You won't put it on if I don't light up all day
I send my sorrow to the bright moon. Will not, once again, do not ask why ruthless.
Nothing is known, nothing is born, nothing is wrong

Model training and model evaluation

See model_LSTM.ipynb for the detailed training procedure; the notebook is thoroughly commented and walks through the core ideas.
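
model_LSTM.ipynb is the authoritative account of training; the condensed sketch below only illustrates the usual shape of such a loop. The Poetry class, vocab, and config are the objects defined in the code further down this post, while train_loader, the learning rate, and the epoch count are placeholders I am assuming, not the notebook's actual settings.

# Condensed, illustrative training loop; the real procedure is in model_LSTM.ipynb.
# `train_loader` is assumed to yield [batch_size, maxl] int64 tensors of character ids.
import paddle

model = Poetry(len(vocab), config.embedding_dim, config.hidden_dim)
optimizer = paddle.optimizer.Adam(learning_rate=1e-3, parameters=model.parameters())
loss_fn = paddle.nn.CrossEntropyLoss()

for epoch in range(10):                         # placeholder epoch count
    for batch in train_loader:
        x = batch[:, :-1]                       # input characters
        y = paddle.reshape(batch[:, 1:], [-1])  # next-character targets, flattened
        logits, _ = model(x)                    # [batch*seq, vocab_size]
        loss = loss_fn(logits, y)               # teacher-forced cross-entropy
        loss.backward()
        optimizer.step()
        optimizer.clear_grad()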

Document organization

  • main.ipynb can be run directly to play with the generator; it loads the trained model automatically.
  • model_LSTM.ipynb is the training notebook, with detailed comments. You can train your own model with it.
  • The [static diagram test] folder is abandoned... I originally intended to export a static-graph model but ran into a problem; it is left as a TODO (see the export sketch after this list).
  • The [models] folder stores two models trained on different numbers of samples; either can be loaded directly.
  • The [vocab] folder stores the data used to build the vocabulary.
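
For the abandoned static-graph export mentioned above, a typical Paddle 2.x export looks roughly like the sketch below. This is not the project's working code (the hidden-state tuple returned by forward is a plausible source of the trouble I hit), and the output path is illustrative.

# Rough sketch of a Paddle static-graph export; untested against this model
import paddle
from paddle.static import InputSpec

# Declare the input signature: a [1, 1] int64 tensor of token ids
spec = [InputSpec(shape=[1, 1], dtype='int64', name='tokens')]
static_poetry = paddle.jit.to_static(poetry, input_spec=spec)
paddle.jit.save(static_poetry, 'static/poetry')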

Load the model for testing

# Import dependencies (trimmed to those this snippet actually uses)
import paddle
import paddle.nn
from paddlenlp.data import Vocab
# Define hyperparameters
class Config(object):
    # version = 'models/version1-modern/version1.pdparams' # modern-poetry style
    version = 'models/version2-ancient/version2.pdparams' # classical-poetry style
    maxl = 120
    filepath = "vocab/poems_without_title.txt"
    filepath2 = "vocab/poems_zh.txt"
    embedding_dim = 300
    hidden_dim = 512
    num_layers = 3

    max_gen_len = 150
    prefix = "Love you all my life" # style prompt: warms up the hidden state to steer the generated text
    beginning = "Accompany is the longest confession of love" # the given opening of the poem; the model continues from here

config = Config()

# Build a character-level vocabulary from both corpora
dic = {'[PAD]': 0, '<start>': 1, '<end>': 2, '[UNK]': 3}
cnt = 4
for path in (config.filepath, config.filepath2):
    with open(path, encoding='utf-8') as fp:
        for line in fp:
            for char in line:
                if char not in dic:
                    dic[char] = cnt
                    cnt += 1

vocab = Vocab.from_dict(dic, unk_token='[UNK]')
# Define the model
class Poetry(paddle.nn.Layer):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super().__init__()
        # Token embedding -> multi-layer LSTM -> linear projection to vocabulary logits
        self.embeddings = paddle.nn.Embedding(vocab_size, embedding_dim)
        self.lstm = paddle.nn.LSTM(
            input_size=embedding_dim,
            hidden_size=hidden_dim,
            num_layers=config.num_layers,
        )
        self.linear = paddle.nn.Linear(in_features=hidden_dim, out_features=vocab_size)

    def forward(self, input, hidden=None):
        # input: [batch_size, seq_len] tensor of character ids
        batch_size, seq_len = input.shape
        embeds = self.embeddings(input)
        if hidden is None:
            output, hidden = self.lstm(embeds)
        else:
            output, hidden = self.lstm(embeds, hidden)
        # Flatten batch and time steps before the output projection
        output = paddle.reshape(output, [seq_len * batch_size, config.hidden_dim])
        output = self.linear(output)
        return output, hidden

# Load the trained weights
poetry = Poetry(len(vocab), config.embedding_dim, config.hidden_dim)
poetry.set_state_dict(paddle.load(config.version))

results = list(config.beginning)
start_words_len = len(results)
input = paddle.to_tensor(vocab('<start>')).reshape([1, 1])
hidden = None

# Feed the prefix through the network to warm up the hidden state and steer the style
if config.prefix:
    for word in config.prefix:
        _, hidden = poetry(input, hidden)
        input = paddle.to_tensor(vocab(word)).reshape([1, 1])

# Greedy decoding: first feed the given beginning, then repeatedly pick the
# most likely next character until <end> is produced or max_gen_len is reached
for i in range(config.max_gen_len):
    output, hidden = poetry(input, hidden)
    if i < start_words_len:
        word = results[i]
        input = paddle.to_tensor(vocab(word)).reshape([1, 1])
    else:
        _, top_index = paddle.topk(output[0], k=1)
        top_index = top_index.item()
        word = vocab.to_tokens(top_index)
        results.append(word)
        input = paddle.to_tensor([top_index]).reshape([1, 1])
    if word == '<end>':
        del results[-1]
        break

results = ''.join(results)
print(results)
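# Note: greedy top-1 decoding makes the output deterministic for a given
# prefix and beginning; sampling from the softmax instead would give more
# varied poems. That change is mine to suggest, not part of this project.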
Accompany is the longest confession of love, I said, so many, so many people, I said I don't need my lover, I dare not give you such a gesture, I will be in your stomach, I won't talk about you as the shadow of some small days, your shadow is the same as the autumn water, but you can't be a villain. You say you can't: you prove me a cup. I sit on my back and look at your head and your neck. You talk about it

Project summary

  1. This is an experiment with the LSTM model. The core idea is to use the LSTM to predict the next character, one step at a time, which is how the text is generated. Judging from the actual output, the model does learn some things: how to segment sentences with punctuation, how to form phrases, and roughly how to arrange the subject, predicate, and object of a sentence.
  2. The modern-poetry dataset is still very small (only about 500 KB), which is far too little for a text-generation task, so baffling sentences still appear. By comparison, the model trained on the ancient-poetry dataset is better (that dataset is a full 45 MB), so adding data is one clear way to improve the model.
  3. LSTM-based text generation is by now a well-worn technique, so the next step is to learn more advanced models~
  4. This is my first public project. I hope you like it~

Topics: Deep Learning NLP paddlepaddle BERT