Automatic generation of Chinese acrostic poems based on LSTM

Posted by axo on Sun, 26 Dec 2021 13:35:37 +0100

Like ordinary RNN units, LSTM cells maintain a memory along the sequence, but they can handle sequence and time-series problems without the vanishing-gradient problem degrading their performance.

Using a data set of classical Chinese poems, an LSTM neural network is trained to automatically generate Chinese acrostic poems, in which a given "head" character starts each line.

import pandas as pd
import string
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential     
from tensorflow.keras.layers import Embedding,LSTM,Dense,Activation
from tensorflow.keras import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import load_model
import tensorflow
tensorflow.__version__
'2.7.0'

1, Reading in the data

import pandas as pd
poems_text = pd.read_table('poems.txt', header=None)
poems_text.columns = ["text"]
poems_text.head()
   text
0  Answer Li Siku Kui: if you come down from a distance, take a photo of the envoy's piano. It's appropriate to pour wine at night. It's expensive. In spring, Sima stays in the solitary hall. Pisces gives it to my old friend in the Ming Dynasty
1  Send it to Taoist Sima on the rooftop: lying on the afterlife, white hair suddenly turns into silk, far ashamed of the child's face of the meal Xiazi, and holding on to the old tour, cherish the open and micro, and still don't send it to the phosphorus group on the day
2  Let's go to Tianping army Mayo and Chen Ziang's new township for a period of time, and still don't meet: the son of entering the Wei period calls for many lovers to go, where to Qi, the water day is long
3  Pass Hangu pass: for 240 years, he duozhong at home, the soldiers of the six countries are united, and the seven heroes are not divided. Emperor Qin decided to ask Su Jun Jiming to steal the dog
4  Send him to the new moon. He Shilang: I heard that the messenger in the cloud returned to the river by riding back and forth. The soldiers guarded the sun and the moon. The prisoners lost their posts in the Yin Mountain. I tried to follow the Pumi Ming skill and didn't let ban Xuan hear it

The classical Chinese poetry data set (poems.txt) contains 15374 commonly used ancient poems, one per line in the form "title:body". First, read the text file line by line, strip the title and keep the body, then split each poem into a list of single Chinese characters and store the lists.

import string
import numpy as np

poems = []
with open('poems.txt', "r", encoding='utf-8') as f:
    for line in f:
        # Each line has the form "title:body"; keep only the body
        title, poem = line.split(':', 1)
        poem = poem.replace(' ', '').replace('\n', '')
        # Store the poem as a list of single Chinese characters
        poems.append(list(poem))

print(poems[0][:])

Taking the first poem as an example, the processed data looks like this (characters shown here in English translation):
['far', 'Square', 'come', 'down', 'guest', 'Lu', 'Xuan', 'photo', 'envoy', 'minister', 'Nong', 'Qin', 'appropriate', 'in', 'Night', 'pour', 'wine', 'expensive', 'meet', 'spring', 'Si', 'horse', 'stay', 'lonely', 'house', 'double', 'fish', 'gift', 'reason', 'person', 'Ming', 'Dynasty', 'scattered', 'cloud', 'rain', 'remote', 'Yang', 'virtue', 'for', 'neighbor']

2, Data preprocessing

1. Map the characters to positive integers and truncate / pad the sequences.

Mapping from characters to positive integers
The import from keras.preprocessing.sequence import pad_sequences tends to fail depending on the version of the third-party library; it can be fixed by importing from tensorflow.keras instead of keras.
The Tokenizer class supports two ways of vectorizing a text corpus: convert each text into a sequence of integers (each integer being the index of a token in the dictionary), or convert it into a vector in which the coefficient of each token can be a binary value, a word count, a TF-IDF weight, etc.
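
For intuition, here is a toy illustration of the two modes on a made-up two-line corpus (this snippet is an illustrative sketch, not part of the original pipeline):

from tensorflow.keras.preprocessing.text import Tokenizer

toy_corpus = [['明', '月', '松', '间', '照'],
              ['清', '泉', '石', '上', '流']]
toy_tokenizer = Tokenizer()
toy_tokenizer.fit_on_texts(toy_corpus)

# Mode 1: each text becomes a sequence of integer indices
print(toy_tokenizer.texts_to_sequences(toy_corpus))   # e.g. [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
# Mode 2: each text becomes a fixed-length vector (binary / count / tfidf)
print(toy_tokenizer.texts_to_matrix(toy_corpus, mode='binary'))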

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer()
tokenizer.fit_on_texts(poems)
vocab_size = len(tokenizer.word_index) + 1   # +1 for the reserved padding index 0
poems_digit = tokenizer.texts_to_sequences(poems)
poems_digit = pad_sequences(poems_digit, maxlen=50, padding='post')
poems_digit[0]

The texts_to_sequences(poems) method converts the list of poem texts into a list of integer sequences, one sequence per input text.
pad_sequences truncates or pads multiple sequences to the same length. Here the maximum sequence length (maxlen) is set to 50, and padding='post' appends the padding at the end of each sequence.
Result after encoding and padding:

array([  46,  171,   12,   40,   29, 3342,  528, 3176,  322,  592,  774,
        352,  608,   44,   23,  648,   73,  593,  158,    8, 2020,   92,
        188,  149,  548,  268,  305,  740,  114,    2,   61,   93,  333,
          9,   45,  201, 1352,  781,   43,  491,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0])
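
A small aside (a toy example with made-up numbers, not from the data set): padding='post' appends zeros at the end of short sequences, but by default pad_sequences truncates long sequences from the front (truncating='pre'), so poems longer than 50 characters lose their opening characters.

from tensorflow.keras.preprocessing.sequence import pad_sequences

toy = [[3, 1, 4, 1, 5, 9, 2, 6], [2, 7]]
print(pad_sequences(toy, maxlen=5, padding='post'))
# [[1 5 9 2 6]     <- truncated from the front (default truncating='pre')
#  [2 7 0 0 0]]    <- padded with zeros at the end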

2. Extract dependent and independent variables

X = poems_digit[:, :-1]   # input: every character except the last
Y = poems_digit[:, 1:]    # target: the same sequence shifted left by one character
X example    Y example
46           171
171          12
12           40
40           29
29           3342
3342         528
......
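
To make the shift explicit, the first few positions of X and Y for the first poem can be printed (the values are the ones from the encoded array above):

print(X[0][:6])   # [  46  171   12   40   29 3342]
print(Y[0][:6])   # [ 171   12   40   29 3342  528]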

3. One-hot encoding

One-hot encoding turns categorical variables into a form that deep learning algorithms can use directly. The to_categorical function converts a vector of class integers into a binary matrix representation. The resulting encoding can then be fed into the deep learning model for further processing.

# from keras.utils import to_categorical
from tensorflow.keras.utils import to_categorical
Y = to_categorical(Y,num_classes=vocab_size)
print(Y.shape)
(15374, 49, 5040)
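
As a side note (a sketch of an alternative, not what this post does): the large one-hot tensor can be avoided entirely by keeping the targets as integer indices and, later on, compiling the model with sparse_categorical_crossentropy.

# Alternative sketch (not used below): keep integer targets and let the loss
# handle the one-hot encoding, avoiding the (15374, 49, 5040) tensor.
# Y_sparse = poems_digit[:, 1:]
# ... build the model as in the next section, then:
# model.compile(loss='sparse_categorical_crossentropy',
#               optimizer=Adam(learning_rate=0.01), metrics=['accuracy'])
# model.fit(X, Y_sparse, epochs=10, batch_size=64, validation_split=0.2)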

3, Constructing the deep neural network model

An introduction to the various layers in keras.layers: https://www.cnblogs.com/lhxsoft/p/13534667.html

The main model in Keras is the Sequential model, a stack of network layers: layers are added one after another with add() to build the model.
 The Embedding layer maps a dictionary of vocab_size = 5040 tokens into a hidden_size1 = 128 dimensional vector space; the length of the input sequence is 49.
 The output dimension of the LSTM layer is set to 64.
 The output dimension of the fully connected Dense layer is set to 5040.
 The Activation layer applies an activation function to the output of the previous layer.

from tensorflow.keras.models import Sequential     
from tensorflow.keras.layers import Embedding,LSTM,Dense,Activation
from tensorflow.keras import Model

hidden_size1=128
hidden_size2=64
# Stack the network layers one by one with add() to build the model
model = Sequential()
model.add(Embedding(input_dim=vocab_size,output_dim=hidden_size1,input_length=49,mask_zero=True))
#  Treat '0' in the input as a 'padding' value that should be ignored
model.add(LSTM(hidden_size2,return_sequences=True))
model.add(Dense(vocab_size))
model.add(Activation('softmax'))

model.summary()
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 embedding (Embedding)       (None, 49, 128)           645120    
                                                                 
 lstm (LSTM)                 (None, 49, 64)            49408     
                                                                 
 dense (Dense)               (None, 49, 5040)          327600    
                                                                 
 activation (Activation)     (None, 49, 5040)          0         
                                                                 
=================================================================
Total params: 1,022,128
Trainable params: 1,022,128
Non-trainable params: 0
_________________________________________________________________
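
As a quick sanity check (simple arithmetic, not from the original post), the parameter counts in the summary can be reproduced from the standard layer formulas:

vocab_size, emb_dim, lstm_units = 5040, 128, 64
print(vocab_size * emb_dim)                                     # Embedding: 645120
print(4 * (emb_dim * lstm_units + lstm_units**2 + lstm_units))  # LSTM (4 gates): 49408
print(lstm_units * vocab_size + vocab_size)                     # Dense + bias: 327600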

4, Model training and testing

After the model is built, it must be compiled with the compile() method and then fitted to the data by iterating over mini-batches for a given number of epochs.
The loss function and the optimizer must be specified when compiling the model.

from tensorflow.keras.optimizers import Adam
model.compile(loss='categorical_crossentropy',optimizer=Adam(lr=0.01),metrics=['accuracy'])
model.fit(X,Y,epochs=10,batch_size=64,validation_split=0.2)
D:\Anaconda\Anaconda3\Lib\site-packages\keras\optimizer_v2\adam.py:105: UserWarning: The `lr` argument is deprecated, use `learning_rate` instead.
  super(Adam, self).__init__(name, **kwargs)
Epoch 1/10
193/193 [==============================] - 60s 298ms/step - loss: 4.7847 - accuracy: 0.0301 - val_loss: 4.6860 - val_accuracy: 0.0337
Epoch 2/10
193/193 [==============================] - 56s 290ms/step - loss: 4.5226 - accuracy: 0.0480 - val_loss: 4.5059 - val_accuracy: 0.0552
Epoch 3/10
193/193 [==============================] - 59s 306ms/step - loss: 4.3119 - accuracy: 0.0724 - val_loss: 4.3876 - val_accuracy: 0.0719
Epoch 4/10
193/193 [==============================] - 60s 309ms/step - loss: 4.1503 - accuracy: 0.0889 - val_loss: 4.3279 - val_accuracy: 0.0798
Epoch 5/10
193/193 [==============================] - 59s 304ms/step - loss: 4.0391 - accuracy: 0.0986 - val_loss: 4.3107 - val_accuracy: 0.0838
Epoch 6/10
193/193 [==============================] - 58s 299ms/step - loss: 3.9588 - accuracy: 0.1041 - val_loss: 4.3057 - val_accuracy: 0.0860
Epoch 7/10
193/193 [==============================] - 56s 290ms/step - loss: 3.8951 - accuracy: 0.1090 - val_loss: 4.3062 - val_accuracy: 0.0872
Epoch 8/10
193/193 [==============================] - 56s 291ms/step - loss: 3.8426 - accuracy: 0.1130 - val_loss: 4.3138 - val_accuracy: 0.0879
Epoch 9/10
193/193 [==============================] - 56s 291ms/step - loss: 3.7964 - accuracy: 0.1173 - val_loss: 4.3226 - val_accuracy: 0.0881
Epoch 10/10
193/193 [==============================] - 56s 290ms/step - loss: 3.7565 - accuracy: 0.1208 - val_loss: 4.3346 - val_accuracy: 0.0887
  <keras.callbacks.History at 0x19d0b5a6b80>

Saving the model

model.save('Poetry_LSTM.h5')

Loading the model

from tensorflow.keras.models import load_model
model = load_model('Poetry_LSTM.h5')

Load the trained model, supply the "head" characters, and generate the acrostic poem.

# Each head character is followed by four '*' placeholders to be predicted.
# (In the original data set the heads are single Chinese characters; they are
# shown here in translation.)
poem_incomplete = 'rain****Xuan****can****love****'
poem_index = []
poem_text = ''
for i in range(len(poem_incomplete)):
    current_word = poem_incomplete[i]

    if current_word != '*':
        # A given head character: look up its integer index
        index = tokenizer.word_index[current_word]
    else:
        # Predict the next character from the poem generated so far
        x = np.expand_dims(poem_index, axis=0)
        x = pad_sequences(x, maxlen=49, padding='post')
        y = model.predict(x)[0, i]

        y[0] = 0                      # never predict the padding index
        index = y.argmax()
        current_word = tokenizer.index_word[index]

    poem_index.append(index)
    poem_text = poem_text + current_word

print(poem_text[0:5])
print(poem_text[5:10])
print(poem_text[10:15])
print(poem_text[15:20])
Rain drops and wind blowing
 Xuan tree wind flute
 Poor song no
 Don't love people at this time
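
The hard-coded loop above can be wrapped into a reusable function. The sketch below is an addition (the function name and the heads and line_length parameters are not from the original post); it reuses the model, tokenizer and pad_sequences already defined above, and the heads must be single characters present in tokenizer.word_index.

def generate_acrostic(model, tokenizer, heads, line_length=5, max_len=49):
    """Generate a poem whose i-th line starts with heads[i] (single characters)."""
    poem_index = []
    poem_text = ''
    # Build the template: each head character followed by placeholders
    template = ''.join(h + '*' * (line_length - 1) for h in heads)
    for i, ch in enumerate(template):
        if ch != '*':
            index = tokenizer.word_index[ch]           # given head character
        else:
            x = pad_sequences([poem_index], maxlen=max_len, padding='post')
            y = model.predict(x, verbose=0)[0, i]      # distribution at position i
            y[0] = 0                                   # never emit the padding index
            index = y.argmax()
            ch = tokenizer.index_word[index]
        poem_index.append(index)
        poem_text += ch
    # Split the generated text into lines of line_length characters
    return [poem_text[j:j + line_length] for j in range(0, len(poem_text), line_length)]

# Hypothetical usage (each head must be a single character from the vocabulary):
# for line in generate_acrostic(model, tokenizer, heads=[...]):
#     print(line)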

Topics: neural networks Deep Learning NLP keras lstm