AI Studio project link: https://aistudio.baidu.com/aistudio/projectdetail/3458536?contributionType=1
Automatic generation of modern poetry based on LSTM
Project background
Emotion is ennobled by poetry, and poetry spreads because of emotion. Generation of classical-style poetry is already so polished that its output is hard to tell from the real thing, but as a 21st-century youth I seem to prefer modern poetry, because it is incomprehensible... well, the incomprehensible kind I wrote myself.
However, the poems I wrote were so bad that I was honestly too embarrassed to give them to my significant other. As it happened, I had just learned about the LSTM model, so this project was born, drawing on many excellent text-generation projects by others.
Dataset usage
This project uses two datasets.
- Modern love poetry dataset: about 2,000 modern poems I collected with a web crawler.
- Classical poetry dataset: since dataset 1 is only about 0.5 MB, which is not much data, some classical poetry is mixed in for training.
Introduction to the LSTM model
LSTM is a very popular recurrent neural network. A simple RNN struggles to understand longer sentences because it updates its state indiscriminately at every step, so earlier information gets washed out. The LSTM instead updates a separate cell state selectively through gating units (a forget gate, an input gate, and an output gate), which gives it much better contextual reasoning over long text.
The underlying principles have been explained thoroughly by many others; see, for example, Understanding LSTM Networks.
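To get a concrete feel for the shapes involved, here is a tiny, self-contained sketch (not from the project) that runs paddle.nn.LSTM over a random toy batch, using the same dimensions as the model further below (embedding_dim=300, hidden_dim=512, num_layers=3):

```python
import paddle

# Toy example: a 3-layer LSTM over a batch of 4 sequences of 10 "character embeddings".
lstm = paddle.nn.LSTM(input_size=300, hidden_size=512, num_layers=3)
x = paddle.randn([4, 10, 300])          # [batch_size, seq_len, embedding_dim]
output, (h, c) = lstm(x)                # output: [4, 10, 512]
print(output.shape, h.shape, c.shape)   # h, c: [3, 4, 512] (one state per layer)
```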
Sample output
I'm worried about the falling mountain with the bright moon. I walk high and short along a song of returning to silver flowers. As a result, I don't stop flowing in autumn. A thread is in his mouth. Your hand is still in my body. It's far away. You won't put it on if I don't light up all day
I send my sorrow to the bright moon. Will not, once again, do not ask why ruthless.
Nothing is known, nothing is born, nothing is wrong
Model training and evaluation
See model_LSTM.ipynb for the detailed training procedure; it is thoroughly commented and explains the core ideas.
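The core objective is next-character prediction with cross-entropy. Below is a minimal sketch of one training step under that objective; it is illustrative only, reusing the Poetry class, config, and vocab from the "Load the model for testing" section further down, and the optimizer, learning rate, and batching in model_LSTM.ipynb may differ.

```python
import paddle

model = Poetry(len(vocab), config.embedding_dim, config.hidden_dim)
optimizer = paddle.optimizer.Adam(learning_rate=1e-3, parameters=model.parameters())
loss_fn = paddle.nn.CrossEntropyLoss()

def train_step(batch):
    # batch: int64 tensor [batch_size, seq_len] of character ids.
    # Inputs drop the last character; targets are the sequence shifted left by one,
    # so the model learns to predict the next character at every position.
    inputs, targets = batch[:, :-1], batch[:, 1:]
    logits, _ = model(inputs)                          # [batch_size*(seq_len-1), vocab_size]
    loss = loss_fn(logits, paddle.reshape(targets, [-1]))
    loss.backward()
    optimizer.step()
    optimizer.clear_grad()
    return loss
```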
File organization
- main.ipynb can be run directly to play with the project; it automatically loads a trained model.
- model_LSTM.ipynb is the training notebook, with detailed comments; run it to train your own model.
- The [static diagram test] folder is abandoned... I originally intended to export a static-graph model but ran into a small problem; it is left as a TODO for later.
- The [models] folder stores two models trained on different numbers of samples; they can be loaded directly.
- The [vocab] folder stores the data used to build the vocabulary.
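For orientation, the layout implied by the file names and the paths referenced in the code below is roughly:

```
.
├── main.ipynb
├── model_LSTM.ipynb
├── models/
│   ├── version1-modern/version1.pdparams
│   └── version2-ancient/version2.pdparams
├── vocab/
│   ├── poems_without_title.txt
│   └── poems_zh.txt
└── static diagram test/        (abandoned static-graph export experiment)
```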
Load the model for testing
```python
# Import dependencies
from paddle.io import Dataset
import paddle.fluid as fluid
import numpy as np
import paddle
import paddle.nn
from paddlenlp.embeddings import TokenEmbedding
from paddlenlp.data import JiebaTokenizer, Vocab
import visualdl
```
```python
# Define hyperparameters
class Config(object):
    # version = 'models/version1-modern/version1.pdparams'  # modern-poetry style
    version = 'models/version2-ancient/version2.pdparams'   # classical-poetry style
    maxl = 120
    filepath = "vocab/poems_without_title.txt"
    filepath2 = "vocab/poems_zh.txt"
    embedding_dim = 300
    hidden_dim = 512
    num_layers = 3
    max_gen_len = 150
    prefix = "Love you all my life"  # style prompt: adjust it to change the style of the generated text
    beginning = "Accompany is the longest confession of love"  # the given opening; the model continues from it

config = Config()
```
```python
# Build the character vocabulary from both corpora
dic = {'[PAD]': 0, '<start>': 1, '<end>': 2, '[UNK]': 3}
cnt = 4
with open(config.filepath) as fp:
    for line in fp:
        for char in line:
            if char not in dic:
                dic[char] = cnt
                cnt += 1
with open(config.filepath2) as fp:
    for line in fp:
        for char in line:
            if char not in dic:
                dic[char] = cnt
                cnt += 1
vocab = Vocab.from_dict(dic, unk_token='[UNK]')
```
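A quick sanity check of the vocabulary (not in the original notebook): map a token to its id and back, and print the vocabulary size, which is also the model's output dimension.

```python
# Illustrative round trip: token -> id -> token
idx = vocab('<start>')
print(idx, vocab.to_tokens(idx))
print(len(vocab))  # vocabulary size
```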
```python
# Define the network and load the trained weights
class Poetry(paddle.nn.Layer):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super().__init__()
        self.embeddings = paddle.nn.Embedding(vocab_size, embedding_dim)
        self.lstm = paddle.nn.LSTM(
            input_size=embedding_dim,
            hidden_size=hidden_dim,
            num_layers=config.num_layers,
        )
        self.linear = paddle.nn.Linear(in_features=hidden_dim, out_features=vocab_size)

    def forward(self, input, hidden=None):
        batch_size, seq_len = paddle.shape(input)
        embeds = self.embeddings(input)          # [batch_size, seq_len, embedding_dim]
        if hidden is None:
            output, hidden = self.lstm(embeds)
        else:
            output, hidden = self.lstm(embeds, hidden)
        # Flatten the time dimension so every position gets a vocab-sized prediction
        output = paddle.reshape(output, [seq_len * batch_size, Config.hidden_dim])
        output = self.linear(output)
        return output, hidden

poetry = Poetry(len(vocab), config.embedding_dim, config.hidden_dim)
poetry.set_state_dict(paddle.load(config.version))
```
```python
# Generate a poem: feed the style prefix, then the given opening, then decode greedily
results = [i for i in config.beginning]
start_words_len = len(results)
input = (paddle.to_tensor(vocab("<start>"))).reshape([1, 1])
hidden = None

if config.prefix:
    # Run the prefix through the network only to condition the hidden state
    words = [i for i in config.prefix]
    for word in words:
        _, hidden = poetry(input, hidden)
        input = (paddle.to_tensor(vocab(word))).reshape([1, 1])

for i in range(config.max_gen_len):
    output, hidden = poetry(input, hidden)
    if i < start_words_len:
        # Still inside the given opening: force-feed its characters
        word = results[i]
        input = (paddle.to_tensor(vocab(word))).reshape([1, 1])
    else:
        # Greedy decoding: take the most probable next character
        _, top_index = paddle.fluid.layers.topk(output[0], k=1)
        top_index = top_index.item()
        word = vocab.to_tokens(top_index)
        results.append(word)
        input = paddle.to_tensor([top_index])
        input = paddle.reshape(input, [1, 1])
    if word == '<end>':
        del results[-1]
        break

results = ''.join(results)
print(results)
```
Company is the longest love confession, I said, so many, so many people, I said I don't need my lover, I dare not give you such a gesture, I will be in your stomach, I won't talk about you as the shadow of some small days, your shadow is the same as the autumn water, but you can't be a villain. You say you can't: you prove me a cup. I sit on my back and look at your head and your neck. You talk about it
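The decoding loop above is greedy: it always picks the single most probable next character, which tends to make the output repetitive. An easy variant to experiment with, not part of the original project, is temperature sampling; a minimal sketch:

```python
import paddle

def sample_next(logits, temperature=0.8):
    # Sample the next character id from the softmax distribution instead of taking top-1.
    # Lower temperature approaches greedy decoding; higher temperature gives more varied
    # (and more chaotic) output.
    probs = paddle.nn.functional.softmax(logits / temperature)
    return paddle.multinomial(probs, num_samples=1).item()

# Usage: inside the generation loop, replace the topk call with
#   top_index = sample_next(output[0])
```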
Project summary
- This project is an experiment with the LSTM model. The core idea is to use the LSTM to predict the next character, which is enough for text generation. Judging from the actual output, the model does learn some things, such as how to segment sentences with punctuation, how to form phrases, and how to arrange the subject, predicate, and object of a sentence.
- The modern-poetry dataset is still very small, only about 500 KB, which is far too little for text generation, so puzzling sentences still appear. By contrast, the model trained on the classical poetry dataset (a full 45 MB) performs better, so adding more data is one way to improve the model.
- Using an LSTM for text generation is by now a well-worn technique, so the next step is to try more advanced models~
- This is my first public project. I hope you like it~