1, Foreword
DRQN comes from a paper published in 2015. It is an early algorithm and its idea is easy to understand: combine the traditional DQN with an LSTM so that the agent has a form of memory. It achieved good results and performs better than DQN in POMDP environments.
This article assumes you already know DQN and can follow its code and how it works. If not, have a look at https://aistudio.baidu.com/aistudio/projectdetail/2231135 first. The code in this article is modified directly from that project.
The following is a link to the original paper for interested readers.
- https://readpaper.com/pdf-annotate/note?noteId=614295504096829440
2, Principle
The following discussion is organized around the paper.
1. What is POMDP
First, we need to know what an MDP is. MDP is short for Markov decision process. Simply put, in an MDP the observation the agent receives is equal to the full state of the environment it interacts with; the environment holds no secrets from the agent. POMDP stands for partially observable Markov decision process: the agent's observation is not equal to the state, and the agent can only see part of the environment. Roughly speaking, an MDP is the God's-eye view and a POMDP is the player's view.
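In standard notation (this formalism is not spelled out in the original write-up, but it matches the description above), the two settings can be written as:

$$\text{MDP}: \langle S, A, P, R \rangle \qquad\qquad \text{POMDP}: \langle S, A, P, R, \Omega, O \rangle$$

In an MDP the agent observes the state $s \in S$ directly; in a POMDP it only receives an observation $o \in \Omega$ drawn from $O(o \mid s)$, so $o$ generally carries less information than $s$.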
2. Why consider POMDP
MDP environments are generally simple. For example, in Atari games, DQN stacks four consecutive frames as its input, which lets it infer the complete state of the game and makes the environment effectively an MDP. In real life, however, environments that fit the POMDP setting are the mainstream: they are usually complex and not fully observable. It is therefore necessary to consider how agents perform in POMDPs.
3. How to build a POMDP from an MDP
In the paper, the authors modify the classic Pong game. A probability parameter is set so that each frame has a certain chance of being "obscured", i.e. the screen turns black, and the agent cannot obtain all the information it needs. This builds an environment that fits the POMDP setting: Flickering Pong.
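As a rough sketch of the idea (this is not the paper's code; the wrapper name and the probability value are illustrative), a gym observation wrapper that blanks frames could look like this:

```python
import numpy as np
import gym

class FlickerWrapper(gym.ObservationWrapper):
    """Blank each observation with probability p, turning the game into a POMDP."""
    def __init__(self, env, p=0.5):
        super().__init__(env)
        self.p = p

    def observation(self, obs):
        # With probability p the agent sees an all-black (all-zero) frame
        if np.random.rand() < self.p:
            return np.zeros_like(obs)
        return obs

# flickering_pong = FlickerWrapper(gym.make('Pong-v0'), p=0.5)
```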
4. How does DRQN perform on POMDP
The paper mainly observes the performance of DRQN on the Flickering Pong game.
The effect is examined by visualizing the convolutional layers and the LSTM layer.
It can be seen that during the game the model can still detect important events such as the ball being missed or bouncing off the paddle, which shows that DRQN's performance is adequate in a POMDP environment.
Even when only a single frame is input at each time step, DRQN can complete the task well. This shows that the recurrent network can effectively integrate information across frames and obtain results similar to feeding multiple stacked frames into the convolutional layers.
5. Conclusion
Whether trained on the MDP and evaluated on the POMDP, or trained on the POMDP and evaluated on the MDP, DRQN achieves good results.
A network trained with DRQN can reach quite good performance even when the input is only one frame. The downside is that DRQN does not differ much from DQN in MDP environments, and in POMDPs it is essentially an alternative to DQN's multi-frame input rather than a systematic improvement.
3, Network structure and update mode
1. Network structure
First comes the convolutional neural network that processes the image. The image features extracted by the convolutional layers are fed into the LSTM, and the LSTM output is then fed into the fully connected layers of DQN. The network structure is therefore quite simple: the main change is adding an LSTM layer in front of DQN's output layers. There are still some points to pay attention to in the concrete implementation and in the input/output shapes. Also, because we use a third-party library (gym's CartPole) whose observations are low-dimensional vectors rather than images, the convolutional network can be omitted and the model can start directly from the layer that feeds the LSTM.
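To summarize the data flow of the network used in this project (a schematic only; the shapes use the values set in the code below: batch_size=8, num_step=10, LSTM hidden size 64, CartPole's 4-dimensional observation):

```python
# obs:  [batch_size, num_steps, obs_dim]      e.g. [8, 10, 4] for CartPole
#  -> fc1 (Linear + ReLU):   [batch_size, num_steps, 128]
#  -> LSTM:                  [batch_size, num_steps, 64]
#  -> reshape:               [batch_size * num_steps, 64]
#  -> fc2 (Linear + ReLU):   [batch_size * num_steps, 128]
#  -> fc3 (Linear):          [batch_size * num_steps, act_dim]   (Q values)
```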
2. Update method
There are two main update methods. The first is sequential updates: randomly select an episode from the experience pool and replay it from its beginning to its end. With sequential updates the LSTM hidden state at the start of each training step is carried over from the previous one.
The other is random updates: randomly select an episode from the experience pool, then randomly select a time step within it, and run for a preset number of steps instead of to the end of the episode. With random updates the LSTM hidden state is reset to zero at the beginning of each training step.
For example, suppose the selected episode has 10 steps. A sequential update replays steps 1 through 10, while a random update with a preset length of 5 might start at step 3 and use steps 3 through 7.
The paper mainly uses random updates, and so does the code below.
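As a minimal sketch of how a random update picks its window (the function name and the toy data are illustrative; the EpisodeMemory class later in the code does the same thing):

```python
import numpy as np

def sample_window(episode, num_step):
    """Pick a random fixed-length window of transitions from one episode (random update)."""
    num_step = min(num_step, len(episode))                     # guard against short episodes
    start = np.random.randint(0, len(episode) - num_step + 1)  # random starting time step
    return episode[start:start + num_step]

# e.g. for a 10-step episode and num_step=5, one possible window is steps 3..7
window = sample_window(list(range(1, 11)), 5)
```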
4, Code implementation
The code is modified directly from the DQN project linked at the beginning. It is best to read that first and then continue here. Below, only the modified parts are discussed.
```python
# Import the third-party libraries that will be used
import parl
from parl.utils import logger
import paddle
import copy
import numpy as np
import os
import gym
import random
import collections
```
```python
# Set the hyperparameters that will be used
learn_freq = 3           # Number of episodes collected between training steps; learning after accumulating new experience improves efficiency
memory_warmup_size = 50  # Number of episodes to store in the episode replay memory before training starts
batch_size = 8           # Number of episode sequences sampled from the replay memory for each learning step
lr = 6e-4                # Learning rate
gamma = 0.99             # Reward discount factor, usually between 0.9 and 0.999
num_step = 10            # Length of the time window used for each random update
episode_size = 500       # Capacity of the episode replay memory; the larger it is, the more memory it takes
```
An LSTM layer is added to the network (after the first fully connected layer), together with a function that initializes the LSTM state. Because we use random updates, the hidden and cell states are initialized to all zeros.
The LSTM output has shape [batch_size, num_steps, hidden_size], while in the original DQN code the network output has shape [batch_size, output_size]. To align the two, paddle.reshape is used to flatten the time dimension into the batch dimension; the reshape does not change which row corresponds to which (episode, step) pair.
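A quick numpy check of this point (numpy's reshape uses the same row-major order as paddle.reshape), showing that row b*num_steps + t of the flattened tensor is exactly element (b, t) of the original:

```python
import numpy as np

x = np.arange(2 * 3 * 4).reshape(2, 3, 4)   # [batch_size=2, num_steps=3, hidden_size=4]
flat = x.reshape(-1, 4)                     # [batch_size*num_steps=6, hidden_size=4]
assert (flat[1 * 3 + 2] == x[1, 2]).all()   # batch 1, step 2 -> row 5: correspondence is kept
```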
```python
# Build the network
class Model(paddle.nn.Layer):
    def __init__(self, obs_dim, act_dim):
        super(Model, self).__init__()
        self.hidden_size = 64
        self.first = False
        self.act_dim = act_dim
        # Three fully connected layers plus an LSTM
        self.fc1 = paddle.nn.Sequential(
            paddle.nn.Linear(obs_dim, 128),
            paddle.nn.ReLU())
        self.fc2 = paddle.nn.Sequential(
            paddle.nn.Linear(self.hidden_size, 128),
            paddle.nn.ReLU())
        self.fc3 = paddle.nn.Linear(128, act_dim)
        self.lstm = paddle.nn.LSTM(128, self.hidden_size, 1)  # [input_size, hidden_size, num_layers]

    def init_lstm_state(self, batch_size):
        # Zero the hidden and cell states; with random updates the state is reset before every training step
        self.h = paddle.zeros(shape=[1, batch_size, self.hidden_size], dtype='float32')
        self.c = paddle.zeros(shape=[1, batch_size, self.hidden_size], dtype='float32')
        self.first = True

    def forward(self, obs):
        # Input a state sequence, output the Q values of all actions: [Q(s,a1), Q(s,a2), Q(s,a3), ...]
        obs = self.fc1(obs)
        # Use the freshly zeroed state on the first forward pass after each reset
        if self.first:
            x, (h, c) = self.lstm(obs, (self.h, self.c))  # obs: [batch_size, num_steps, input_size]
            self.first = False
        else:
            x, (h, c) = self.lstm(obs)  # obs: [batch_size, num_steps, input_size]
        x = paddle.reshape(x, shape=[-1, self.hidden_size])
        h2 = self.fc2(x)
        Q = self.fc3(h2)
        return Q
```
In the algorithm class, the shapes of action, reward and done are flattened; everything else remains unchanged from the DQN version.
```python
# DRQN algorithm
class DRQN(parl.Algorithm):
    def __init__(self, model, act_dim=None, gamma=None, lr=None):
        self.model = model
        self.target_model = copy.deepcopy(model)  # Copy the predict network to get the target network (fixed Q-target)

        # Check that the hyperparameter types are correct
        assert isinstance(act_dim, int)
        assert isinstance(gamma, float)
        assert isinstance(lr, float)
        self.act_dim = act_dim
        self.gamma = gamma
        self.lr = lr
        self.optimizer = paddle.optimizer.Adam(
            learning_rate=self.lr, parameters=self.model.parameters())  # Adam optimizer

    # Prediction function
    def predict(self, obs):
        return self.model.forward(obs)

    def learn(self, obs, action, reward, next_obs, terminal):
        # Flatten the sequence data so it aligns with the network output of shape [batch_size*num_steps, ...]
        action = paddle.reshape(action, shape=[-1])
        reward = paddle.reshape(reward, shape=[-1])
        terminal = paddle.reshape(terminal, shape=[-1])

        # Get max Q' from target_model to compute target_Q
        next_predict_Q = self.target_model.forward(next_obs)
        best_v = paddle.max(next_predict_Q, axis=-1)  # Max over each row; each row corresponds to one state s_t
        best_v.stop_gradient = True  # Stop the gradient: the target network's parameters are fixed
        terminal = paddle.cast(terminal, dtype='float32')  # Convert data type to float32
        target = reward + (1.0 - terminal) * self.gamma * best_v  # Target value of Q

        predict_Q = self.model.forward(obs)  # Predicted Q values
        # Next, pick out Q(s,a) for the action that was actually taken
        action_onehot = paddle.nn.functional.one_hot(action, self.act_dim)  # Convert the action to a one-hot vector, e.g. 3 => [0,0,0,1,0]
        action_onehot = paddle.cast(action_onehot, dtype='float32')
        predict_action_Q = paddle.sum(
            paddle.multiply(action_onehot, predict_Q),  # Element-wise product keeps only Q(s,a) of the chosen action
            axis=1)  # Summing each row reduces it to a vector with the same shape as target
        # For example: predict_Q = [[2.3, 5.7, 1.2, 3.9, 1.4]], action_onehot = [[0,0,0,1,0]]
        # ==> predict_action_Q = [3.9]

        # Compute the mean squared error between Q(s,a) and target_Q. Making one set of outputs
        # approach another is a regression problem, so the squared loss is used.
        loss = paddle.nn.functional.square_error_cost(predict_action_Q, target)
        cost = paddle.mean(loss)
        cost.backward()              # Back propagation
        self.optimizer.step()        # Update parameters
        self.optimizer.clear_grad()  # Clear gradients

    def sync_target(self):
        self.target_model = copy.deepcopy(self.model)  # Copy the predict network into the target network (fixed Q-target)
```
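For reference, the learn function above implements the standard fixed-Q-target regression (written out here for clarity; the notation is mine, not the article's):

$$y = r + (1 - \text{done})\,\gamma \max_{a'} Q_{\text{target}}(s', a'), \qquad \mathcal{L} = \frac{1}{N}\sum_{i=1}^{N}\big(Q(s_i, a_i) - y_i\big)^2$$

Because of the reshape, the index $i$ runs over every (episode, time step) pair in the sampled batch of sequences.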
```python
class Agent(parl.Agent):
    def __init__(self, algorithm, act_dim, e_greed=0.1, e_greed_decrement=0):
        # Check that act_dim is an int
        assert isinstance(act_dim, int)
        self.act_dim = act_dim
        # Call the parent class constructor and pass in the algorithm so we can use its members
        super(Agent, self).__init__(algorithm)

        self.global_step = 0            # Total number of training steps
        self.update_target_steps = 200  # Copy the model parameters into target_model every 200 training steps

        self.e_greed = e_greed                      # Probability of choosing a random action (exploration)
        self.e_greed_decrement = e_greed_decrement  # Exploration decreases gradually as training converges

    # The obs parameter here is a single input, unlike the batched parameters of the learn function
    def sample(self, obs):
        sample = np.random.rand()  # Random number between 0 and 1
        if sample < self.e_greed:
            act = np.random.randint(self.act_dim)  # Exploration: every action has a chance of being chosen
        else:
            act = self.predict(obs)  # Exploitation: choose the best action
        self.e_greed = max(
            0.01, self.e_greed - self.e_greed_decrement)  # Decay exploration as training converges
        return act

    # Get the output through the neural network
    def predict(self, obs):  # Choose the best action
        obs = paddle.to_tensor(obs, dtype='float32')  # Convert the array to a tensor
        predict_Q = self.alg.predict(obs).numpy()     # Convert the resulting tensor back to an array
        act = np.argmax(predict_Q)                    # Pick the index with the largest Q, i.e. the corresponding action
        return act

    # The learn function does two things: 1. synchronize the target model parameters, 2. update the model.
    # Both are done by calling functions of the algorithm. Note that the inputs here are batches of data.
    def learn(self, obs, act, reward, next_obs, terminal):
        # Synchronize the parameters of model and target_model every 200 training steps
        if self.global_step % self.update_target_steps == 0:
            self.alg.sync_target()
        self.global_step += 1  # Count one more call of the learn function

        # Convert to tensors
        obs = paddle.to_tensor(obs, dtype='float32')
        act = paddle.to_tensor(act, dtype='int32')
        reward = paddle.to_tensor(reward, dtype='float32')
        next_obs = paddle.to_tensor(next_obs, dtype='float32')
        terminal = paddle.to_tensor(terminal, dtype='float32')
        # Learn
        self.alg.learn(obs, act, reward, next_obs, terminal)
```
Because the data DRQN needs is sampled from whole episodes, each entry in the dataset should be an episode, so the experience pool is rewritten. The original class still collects one step at a time, but it now also checks at each step whether the episode has finished, i.e. whether done is True. A new EpisodeMemory class is added that stores whole episodes and, during sampling, randomly selects a time window from each of them.
```python
class EpisodeMemory(object):
    def __init__(self, episode_size, num_step):
        self.buffer = collections.deque(maxlen=episode_size)
        self.num_step = num_step  # Length of the sampled time window

    def put(self, episode):
        self.buffer.append(episode)

    def sample(self, batch_size):
        mini_batch = random.sample(self.buffer, batch_size)  # The return value is a list of episodes
        obs_batch, action_batch, reward_batch, next_obs_batch, done_batch = [], [], [], [], []

        for experience in mini_batch:
            self.num_step = min(self.num_step, len(experience))  # Prevent the window from being longer than the episode

        for experience in mini_batch:
            idx = np.random.randint(0, len(experience) - self.num_step + 1)  # Randomly select the starting time step
            s, a, r, s_p, done = [], [], [], [], []
            for i in range(idx, idx + self.num_step):
                e1, e2, e3, e4, e5 = experience[i][0]
                s.append(e1[0][0]), a.append(e2), r.append(e3), s_p.append(e4), done.append(e5)
            obs_batch.append(s)
            action_batch.append(a)
            reward_batch.append(r)
            next_obs_batch.append(s_p)
            done_batch.append(done)

        # Convert the lists to arrays with a consistent data type
        obs_batch = np.array(obs_batch).astype('float32')
        action_batch = np.array(action_batch).astype('float32')
        reward_batch = np.array(reward_batch).astype('float32')
        next_obs_batch = np.array(next_obs_batch).astype('float32')
        done_batch = np.array(done_batch).astype('float32')
        return obs_batch, action_batch, reward_batch, next_obs_batch, done_batch

    # Number of episodes currently stored
    def __len__(self):
        return len(self.buffer)
```
```python
class ReplayMemory(object):
    def __init__(self, e_rpm):
        # The episode memory that finished episodes are handed over to
        self.e_rpm = e_rpm
        self.buff = []

    # Add one step of experience to the current episode buffer
    def append(self, exp, done):
        self.buff.append([exp])
        # When the episode ends, move the whole episode into the episode memory
        if done:
            self.e_rpm.put(self.buff)
            self.buff = []

    # Number of steps in the current (unfinished) episode
    def __len__(self):
        return len(self.buff)
```
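A small standalone sketch (the dummy data here is purely illustrative) of how the two buffers cooperate: ReplayMemory collects single steps and hands the whole episode to EpisodeMemory once done is True, after which fixed-length sequences can be sampled:

```python
e_rpm_demo = EpisodeMemory(episode_size=10, num_step=3)
rpm_demo = ReplayMemory(e_rpm_demo)

for t in range(5):                              # a toy 5-step episode with 4-dimensional observations
    obs = np.zeros((1, 1, 4), dtype='float32')  # shaped the way run_episode reshapes it
    next_obs = np.zeros(4, dtype='float32')     # raw observation returned by env.step
    done = (t == 4)                             # the last step closes the episode
    rpm_demo.append((obs, 0, 1.0, next_obs, done), done)

print(len(e_rpm_demo))                          # 1 episode stored
obs_b, act_b, rew_b, next_b, done_b = e_rpm_demo.sample(batch_size=1)
print(obs_b.shape)                              # (1, 3, 4): one sequence of num_step=3 observations
```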
Each call of run_episode collects learn_freq episodes (breaking out of the inner loop when an episode ends) and then trains once, provided the episode memory already holds enough data. The LSTM hidden state is re-initialized before every training step.
```python
# Collect experience and train
def run_episode(env, agent, rpm, e_rpm, obs_shape):  # rpm is the step-level experience pool
    for step in range(1, learn_freq + 1):
        # Reset the environment
        obs = env.reset()
        while True:
            obs = obs.reshape(1, 1, obs_shape)
            action = agent.sample(obs)  # Sample an action; every action has a chance of being tried
            next_obs, reward, done, _ = env.step(action)
            rpm.append((obs, action, reward, next_obs, done), done)  # Collect the data
            obs = next_obs
            if done:
                break

    # After enough experience has been stored, train once per call
    if len(e_rpm) > memory_warmup_size:
        # Reset the LSTM state before each training step
        model.init_lstm_state(batch_size)
        (batch_obs, batch_action, batch_reward, batch_next_obs,
         batch_done) = e_rpm.sample(batch_size)
        agent.learn(batch_obs, batch_action, batch_reward,
                    batch_next_obs, batch_done)  # s, a, r, s', done


# Evaluate the agent: run 5 episodes and average the total reward
def evaluate(env, agent, obs_shape, render=False):
    eval_reward = []  # Stores the reward of every evaluation episode
    for i in range(5):
        obs = env.reset()
        episode_reward = 0
        while True:
            obs = obs.reshape(1, 1, obs_shape)
            action = agent.predict(obs)  # Greedy: select only the best action
            obs, reward, done, _ = env.step(action)
            episode_reward += reward
            if render:
                env.render()
            if done:
                break
        eval_reward.append(episode_reward)
    return np.mean(eval_reward)  # Average
```
```python
env = gym.make('CartPole-v1')
action_dim = env.action_space.n
obs_shape = env.observation_space.shape
save_path = './dqn_model.ckpt'

e_rpm = EpisodeMemory(episode_size, num_step)
rpm = ReplayMemory(e_rpm)  # Instantiate the experience replay pool

# Build the agent with the parl framework
model = Model(obs_dim=obs_shape[0], act_dim=action_dim)
algorithm = DRQN(model, act_dim=action_dim, gamma=gamma, lr=lr)
agent = Agent(
    algorithm,
    act_dim=action_dim,
    e_greed=0.1,             # Probability of choosing a random action (exploration)
    e_greed_decrement=8e-7)  # Exploration decreases gradually as training converges

# First fill the experience pool with some data to avoid a lack of sample diversity at the start of training
while len(e_rpm) < memory_warmup_size:
    run_episode(env, agent, rpm, e_rpm, obs_shape[0])

# Number of training rounds
max_train_num = 2000
best_acc = 377.0
agent.restore(save_path)

# Start training
train_num = 0
while train_num < max_train_num:  # Train for max_train_num rounds; the test part is not counted
    # train part: the inner loop exists so that we evaluate once every 50 rounds
    for i in range(0, 50):
        run_episode(env, agent, rpm, e_rpm, obs_shape[0])
        train_num += 1

    # test part
    eval_reward = evaluate(env, agent, obs_shape[0], render=False)  # Set render=True to watch the agent play
    if eval_reward > best_acc:
        best_acc = eval_reward
        agent.save(save_path)
    # Write the information to the log
    logger.info('train_num:{} e_greed:{} test_reward:{}'.format(
        train_num, agent.e_greed, eval_reward))
```
```
[10-30 21:27:56] nvidia-smi -L found gpu count: 1
UserWarning: Skip loading for fc1.0.weight. fc1.0.weight receives a shape [64, 128], but the expected shape is [4, 128].
UserWarning: Skip loading for fc2.0.weight. fc2.0.weight receives a shape [128, 128], but the expected shape is [64, 128].
UserWarning: Skip loading for lstm.weight_ih_l0. lstm.weight_ih_l0 receives a shape [256, 4], but the expected shape is [256, 128].
UserWarning: Skip loading for lstm.0.cell.weight_ih. lstm.0.cell.weight_ih receives a shape [256, 4], but the expected shape is [256, 128].
[10-30 21:27:59] train_num:50 e_greed:0.09840000000000951 test_reward:10.0
[10-30 21:28:01] train_num:100 e_greed:0.09717920000001676 test_reward:9.8
[10-30 21:28:04] train_num:150 e_greed:0.09591920000002424 test_reward:10.2
[10-30 21:28:07] train_num:200 e_greed:0.0944768000000328 test_reward:11.0
[10-30 21:28:09] train_num:250 e_greed:0.09330160000003979 test_reward:9.0
[10-30 21:28:12] train_num:300 e_greed:0.09211440000004684 test_reward:9.0
[10-30 21:28:14] train_num:350 e_greed:0.09093200000005386 test_reward:9.6
[10-30 21:28:17] train_num:400 e_greed:0.08976000000006082 test_reward:9.2
[10-30 21:28:19] train_num:450 e_greed:0.0885680000000679 test_reward:9.8
[10-30 21:28:22] train_num:500 e_greed:0.087371200000075 test_reward:9.2
[10-30 21:28:24] train_num:550 e_greed:0.08620640000008192 test_reward:9.6
[10-30 21:28:27] train_num:600 e_greed:0.0850312000000889 test_reward:9.4
[10-30 21:28:29] train_num:650 e_greed:0.08388160000009573 test_reward:9.4
[10-30 21:28:32] train_num:700 e_greed:0.08273040000010257 test_reward:9.6
[10-30 21:28:34] train_num:750 e_greed:0.0815128000001098 test_reward:9.8
[10-30 21:28:37] train_num:800 e_greed:0.08027520000011715 test_reward:9.8
[10-30 21:28:40] train_num:850 e_greed:0.07881680000012581 test_reward:10.6
[10-30 21:28:44] train_num:900 e_greed:0.07678400000013788 test_reward:13.0
[10-30 21:28:48] train_num:950 e_greed:0.07478480000014975 test_reward:9.0
[10-30 21:28:53] train_num:1000 e_greed:0.07232720000016435 test_reward:12.0
[10-30 21:28:57] train_num:1050 e_greed:0.07040160000017578 test_reward:9.4
[10-30 21:29:03] train_num:1100 e_greed:0.06750000000019302 test_reward:105.2
[10-30 21:29:23] train_num:1150 e_greed:0.057448000000208894 test_reward:86.2
[10-30 21:29:38] train_num:1200 e_greed:0.04959200000018741 test_reward:53.2
[10-30 21:30:08] train_num:1250 e_greed:0.03536800000014851 test_reward:376.2
[10-30 21:30:53] train_num:1300 e_greed:0.012563200000160545 test_reward:142.4
[10-30 21:31:24] train_num:1350 e_greed:0.01 test_reward:16.2
[10-30 21:31:53] train_num:1400 e_greed:0.01 test_reward:189.2
[10-30 21:32:20] train_num:1450 e_greed:0.01 test_reward:177.8
[10-30 21:32:58] train_num:1500 e_greed:0.01 test_reward:119.8
[10-30 21:33:36] train_num:1550 e_greed:0.01 test_reward:192.8
[10-30 21:34:37] train_num:1600 e_greed:0.01 test_reward:200.6
[10-30 21:35:13] train_num:1650 e_greed:0.01 test_reward:19.4
[10-30 21:36:00] train_num:1700 e_greed:0.01 test_reward:181.8
[10-30 21:36:47] train_num:1750 e_greed:0.01 test_reward:139.2
[10-30 21:37:43] train_num:1800 e_greed:0.01 test_reward:193.8
[10-30 21:38:50] train_num:1850 e_greed:0.01 test_reward:322.4
[10-30 21:40:11] train_num:1900 e_greed:0.01 test_reward:500.0
[10-30 21:41:29] train_num:1950 e_greed:0.01 test_reward:438.4
[10-30 21:42:43] train_num:2000 e_greed:0.01 test_reward:500.0
```
The result is fairly good, but the current hyperparameters are clearly not optimal and there is still room for improvement. You can try tuning the hyperparameters above so that the model reaches better results in less time.
A trained model is already stored in the project space; if you don't want to train from scratch, you can load it directly.
Personal profile
Author: Wang Zhenhao
Undergraduate (class of 2020), Computer Science and Technology, Northeastern University at Qinhuangdao
Areas of interest: CV, RL
I have reached Silver level on AI Studio and earned 2 badges.