Q-learning (DQN): CartPole with reinforcement learning

Posted by vigge89 on Tue, 04 Jan 2022 02:42:09 +0100

1. Project introduction

Q-learning: Q-learning was first proposed in 1989 and was originally formulated in tabular form.
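As a point of comparison with the network-based version built later, here is a minimal sketch of a single tabular update. The names q_table and q_update and the step size alpha are illustrative assumptions, not part of this project.

```python
from collections import defaultdict

# Hypothetical helper for illustration: q_table maps (state, action) -> value.
def q_update(q_table, state, action, reward, next_state, n_actions,
             alpha=0.5, gamma=0.99, done=False):
    """Apply one tabular Q-learning update in place and return the new value."""
    # Bootstrapped value of the next state (0 if the episode has ended)
    best_next = 0.0 if done else max(q_table[(next_state, a)] for a in range(n_actions))
    td_target = reward + gamma * best_next
    # Move the stored value a fraction alpha towards the TD target
    q_table[(state, action)] += alpha * (td_target - q_table[(state, action)])
    return q_table[(state, action)]

q_table = defaultdict(float)  # every unseen (state, action) pair starts at 0.0
new_value = q_update(q_table, state=0, action=1, reward=1.0, next_state=1, n_actions=2)
print(new_value)  # 0.5
```

DQN keeps the same update idea but replaces the table with a neural network that maps a state to one Q-value per action.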

DQN: DQN (deep Q-network) was proposed in 2013. It is a Q-learning algorithm built on a deep neural network, and it is currently the most commonly used form of Q-learning.

Objectives of the project:
The DQN algorithm is used to train an agent to obtain a higher reward in the CartPole environment. For a detailed description of CartPole, please refer to the blog: Introduction to the classic control environment of OpenAI Gym -- CartPole

The reference contents of this project include:

2. Import dependencies

import os
import gym
import numpy as np
import paddle
from collections import deque
from visualdl import LogWriter
import copy
import time

3. Build model

3.1 DQN network structure

Because our state s is a one-dimensional vector, we use fully connected layers.

class MyDQNnetwork(paddle.nn.Layer):

    # state_size: the size of the state space
    # action_size: the size of the action space

    def __init__(self, state_size, action_size):
        super(MyDQNnetwork, self).__init__()

        self.fc1 = paddle.nn.Linear(state_size, 128)
        self.fc2 = paddle.nn.Linear(128, 128)
        self.fc3 = paddle.nn.Linear(128, action_size)
        # ReLU activation used in forward(); it has to be defined here
        self.relu = paddle.nn.ReLU()

    def forward(self, state):
        out = self.relu(self.fc1(state))
        out = self.relu(self.fc2(out))
        q = self.fc3(out)
        return q

3.2 Experience replay buffer

Experience replay is an important technique in reinforcement learning that can greatly improve performance. It means storing the records of the interaction between the agent and the environment (i.e. experience) in an array and then reusing these experiences repeatedly to train the agent. This array is called the replay buffer.

Advantages of experience replay:

  • Breaking the correlation of sequences: when training DQN, each parameter update uses one quadruple. We want two consecutive quadruples to be independent. However, when the agent collects experience, two adjacent quadruples (st, at, rt, st+1) and (st+1, at+1, rt+1, st+2) are strongly correlated, and training DQN on such correlated quadruples in order often works very poorly. Experience replay instead draws a quadruple at random from the buffer to update the DQN parameters, which largely removes the correlation between consecutive updates.

  • Reusing the collected experience instead of discarding it after one use, so that the same performance can be reached with fewer samples.

Changes made in the actual implementation:

  • Use deque to store experience. deque is Python's double-ended queue. When a maximum length is specified and the queue is full, appending an element to the end automatically removes the element at the head of the queue.

  • In practice the replay buffer does not have to store quadruples; it may store n-tuples (this project uses 5-tuples), depending on the situation.
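The eviction behavior of a bounded deque and uniform sampling from it can be sketched with the standard library alone. The capacity of 3 and the toy integer states are assumptions made only to keep the output short.

```python
import random
from collections import deque

buffer = deque(maxlen=3)  # capacity 3, chosen only for illustration
for t in range(5):
    # 5-tuple (state, action, reward, next_state, done) with toy integer states
    buffer.append((t, 0, 1.0, t + 1, False))

stored_states = [exp[0] for exp in buffer]
print(stored_states)  # [2, 3, 4]: the two oldest experiences were evicted

random.seed(0)
batch = random.sample(list(buffer), 2)  # uniform sampling without replacement
# Regroup the batch by field, as the training loop does with zip(*experiences)
states, actions, rewards, next_states, dones = zip(*batch)
```

Because the batch is drawn at random, the sampled experiences are not necessarily adjacent in time, which is the point of the first advantage listed above.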

class MyMemoryBuffer(object):
    def __init__(self, memory_size):
        # Bodies of __init__, add and clear are reconstructed from the comments;
        # the listing omits them. deque(maxlen=...) drops the oldest element when full.
        self.buffer = deque(maxlen=memory_size)

    # Add an experience. The buffer is a bounded deque (double-ended queue):
    # when it is full, appending removes the element at the head automatically
    def add(self, experience):
        self.buffer.append(experience)

    def size(self):
        return len(self.buffer)

    # continuous=True means batch_size consecutive experiences are taken
    def sample(self, batch_size, continuous=True):

        # If more experiences are requested than the buffer holds, take all of them
        if batch_size > len(self.buffer):
            batch_size = len(self.buffer)
        # Whether experiences are taken consecutively
        if continuous:
            # np.random.randint(low, high) returns an integer in [low, high)
            rand = np.random.randint(0, len(self.buffer) - batch_size + 1)
            return [self.buffer[i] for i in range(rand, rand + batch_size)]
        else:
            # numpy.random.choice(a, size=None, replace=True, p=None)
            # If a is an array, sample from it; if a is an integer, sample from [0, a-1]
            # size is the number of samples; replace=False means no index is repeated
            # p gives the sampling probability of each element; None means uniform
            indexes = np.random.choice(len(self.buffer), size=batch_size, replace=False)
            return [self.buffer[i] for i in indexes]

    def clear(self):
        self.buffer.clear()

4. Define agent

Off-policy learning, behavior policy and target policy:

  • Behavior policy: in reinforcement learning, we let the agent interact with the environment, record the observed states, actions and rewards, and use these experiences to learn a policy function. The policy that controls the agent during this interaction is called the behavior policy. Its job is to collect experience, that is, observed states, actions and rewards.
  • Target policy: the purpose of training is to obtain a policy function that controls the agent after training. This policy function is called the target policy.
  • The behavior policy and the target policy can be the same or different. On-policy learning uses the same policy for both; off-policy learning uses different policies.

DQN is off-policy. In this project the behavior policy is the ε-greedy policy, which is simple and commonly used. It corresponds to the sample function in our code.

The target policy is deterministic: it picks the action with the largest Q-value predicted by the DQN network, and corresponds to the predict function in the code.
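A minimal, framework-free sketch of the ε-greedy behavior policy and its decay is given below. The helper names epsilon_greedy and decay are assumptions for illustration; the decrement of 1e-6 and the floor of 0.01 are the values used in this project's code.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy choice: index of the largest Q-value (this is the target policy)
    return max(range(len(q_values)), key=lambda a: q_values[a])

def decay(epsilon, decrement=1e-6, floor=0.01):
    """Linearly decrease epsilon, but never below the floor."""
    return max(floor, epsilon - decrement)

q_values = [0.2, 0.7]  # toy Q-values for the two CartPole actions
greedy_action = epsilon_greedy(q_values, epsilon=0.0)   # never explores, so always 1
explore_action = epsilon_greedy(q_values, epsilon=1.0)  # always explores: 0 or 1
next_epsilon = decay(0.1)
```

With epsilon=0.0 the function behaves like the deterministic target policy, which is why the same Q-network can serve both policies.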

Target network:
The Q-learning algorithm has a defect: a DQN trained with Q-learning overestimates the true values, and the overestimation is usually non-uniform. This defect hurts the performance of DQN. The overestimation is not a defect of the DQN model itself but of the Q-learning algorithm. One cause is the propagation of bias through bootstrapping (there is more than one cause; please refer to Wang Shusen's book for the others).

Bootstrapping here means that DQN is fitted to its own estimates. If an estimate is too high, the value fitted to it is too high as well, the next estimate is then based on that value, and so on, so the values stay too high. One way to alleviate the bootstrapping problem is to use a target network.
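The feedback loop can be illustrated with a toy model that is not part of this project: a single state in which every action truly has the same value, a scalar estimate q, and noisy per-action estimates whose maximum is used in the TD target. All constants and names below are assumptions chosen for illustration.

```python
import random

random.seed(0)
GAMMA, ALPHA, REWARD = 0.9, 0.1, 1.0
TRUE_Q = REWARD / (1.0 - GAMMA)  # about 10 for every action in this toy problem
NOISE_STD, N_ACTIONS = 0.5, 3

q = TRUE_Q  # start from the correct value
history = []
for step in range(5000):
    # Noisy estimates of the next state's action values; taking the max favors
    # whichever estimate happens to be too high
    noisy_next = max(q + random.gauss(0.0, NOISE_STD) for _ in range(N_ACTIONS))
    # The TD target is built from our own estimate (bootstrapping)
    td_target = REWARD + GAMMA * noisy_next
    q += ALPHA * (td_target - q)
    history.append(q)

avg_q = sum(history[-1000:]) / 1000
print(TRUE_Q, avg_q)  # the learned value is expected to settle well above TRUE_Q
```

Even though the estimate starts at the true value, the small upward bias in each target is fed back into the next target and accumulates. A target network does not remove this bias, but it stops the estimate from chasing a target that moves with every update.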

The training process of the agent is as follows (it is implemented mainly in the learn function):

  1. DQN forward pass: the input is s_t, and the output for the action a_t that was taken is q_t = Q(s_t, a_t)
  2. Target network forward pass: the input is s_t+1, and the output is q'_t+1 = max over a of Q_target(s_t+1, a)
  3. Calculate the TD target y = r_t + gamma * q'_t+1 (the code multiplies the second term by 1 - terminal, so y = r_t when the episode ends)
  4. Calculate the TD error: error = q_t - y
  5. DQN backward pass: compute the gradient and update the parameters
  6. Update the parameters of the target network (this can be done once every several time steps)
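Steps 2 to 4 can be checked by hand with a small worked example. The numbers and the helper names td_target and td_error are made up for illustration; only the discount factor 0.99 comes from this project.

```python
GAMMA = 0.99  # same discount factor as in this project

def td_target(reward, next_q_values, terminal, gamma=GAMMA):
    """y = r + (1 - terminal) * gamma * max_a Q_target(s', a)."""
    return reward + (1.0 - terminal) * gamma * max(next_q_values)

def td_error(q_value, target):
    """Difference between the current estimate and the TD target."""
    return q_value - target

q_t = 1.5                                          # Q(s_t, a_t) from the DQN
y = td_target(1.0, [2.0, 3.0], terminal=0.0)       # 1.0 + 0.99 * 3.0 = 3.97
y_end = td_target(1.0, [2.0, 3.0], terminal=1.0)   # episode over: just the reward
delta = td_error(q_t, y)                           # 1.5 - 3.97 = -2.47
loss = delta ** 2                                  # squared TD error for one sample
```

The learn function does the same computation on a whole batch at once and averages the squared errors with MSELoss.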
# Function that updates the target network; it is called in MyDQNAgent.learn()
# The body is missing from the listing, so the update rule below is an assumption:
# tau is the fraction of the old target weights that is kept, which makes the
# default tau=0 a plain copy of the source weights.
def soft_update(target,source,tau=0):
    # zip() pairs up the corresponding elements of its arguments and yields them as tuples
    for target_param,param in zip(target.parameters(),source.parameters()):
        target_param.set_value(tau * target_param + (1.0 - tau) * param)

class MyDQNAgent():

    def __init__(self, model, action_size,gamma=None, lr=None, e_greed=0.1, e_greed_decrement=0):
        self.action_size = action_size
        self.global_step = 0
        self.update_target_steps = 200  # The parameters of the target network are updated every 200 time steps
        self.e_greed = e_greed          # The ε in ε-greedy
        self.e_greed_decrement = e_greed_decrement # Amount subtracted from ε at every step
        self.model = model
        self.target_model = copy.deepcopy(model)
        self.gamma = gamma  # Return discount factor
        self.lr = lr
        self.mse_loss = paddle.nn.MSELoss(reduction='mean')
        self.optimizer = paddle.optimizer.Adam(learning_rate=lr, parameters=self.model.parameters())

    # Generate actions using behavior policies
    def sample(self, state):

        sample = np.random.random()  # [0.0, 1.0)
        if sample < self.e_greed:
            # Explore: random integer in [0, action_size), here 0 or 1
            act = np.random.randint(self.action_size)
        else:
            # Even when exploiting, take a random action 1% of the time
            if np.random.random() < 0.01:
                act = np.random.randint(self.action_size)
            else:
                act = self.predict(state)

        # Decay e_greed, but never below 0.01
        self.e_greed = max(0.01, self.e_greed - self.e_greed_decrement)

        return act

    # DQN network prediction
    def predict(self, state):

        state = paddle.to_tensor(state, dtype='float32')
        # DQN network prediction
        pred_q = self.model(state)
        # Select the action with the highest Q-value
        act = pred_q.argmax().numpy()[0]
        return act

    # Update DQN network
    def learn(self, state, action, reward, next_state, terminal):
        """Update the model with a batch of transitions.

            state(np.float32): shape of (batch_size, state_size)
            action(np.int32): shape of (batch_size)
            reward(np.float32): shape of (batch_size)
            next_state(np.float32): shape of (batch_size, state_size)
            terminal(np.float32): shape of (batch_size)
        """

        if self.global_step % self.update_target_steps == 0:
            # 6. Update target network (call reconstructed; the listing omits it)
            soft_update(self.target_model, self.model)

        self.global_step += 1

        action = np.expand_dims(action, axis=-1)
        reward = np.expand_dims(reward, axis=-1)
        terminal = np.expand_dims(terminal, axis=-1)

        state = paddle.to_tensor(state, dtype='float32')
        action = paddle.to_tensor(action, dtype='int32')
        reward = paddle.to_tensor(reward, dtype='float32')
        next_state = paddle.to_tensor(next_state, dtype='float32')
        terminal = paddle.to_tensor(terminal, dtype='float32')
        # 1. DQN network does forward propagation
        pred_values = self.model(state)

        # Number of actions (2 for CartPole)
        action_dim = pred_values.shape[-1]

        # Remove the trailing size-1 dimension from the action tensor
        action = paddle.squeeze(action, axis=-1)

        # One-hot encoding of the action
        action_onehot = paddle.nn.functional.one_hot(action, num_classes=action_dim)

        # Keep only the Q-value of the action that was actually taken
        pred_value = paddle.multiply(pred_values, action_onehot)
        pred_value = paddle.sum(pred_value, axis=1, keepdim=True)

        # target Q
        with paddle.no_grad():
            # 2. Forward propagation of target network
            max_v = self.target_model(next_state).max(1, keepdim=True)
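            # 3. TD target (no bootstrapping from terminal states)
            target = reward + (1 - terminal) * self.gamma * max_v
        # 4. Loss: mean squared TD error
        loss = self.mse_loss(pred_value, target)

        # 5. Update the parameters of DQN
        # Clear the gradients
        self.optimizer.clear_grad()
        # Backpropagate to compute the gradients
        loss.backward()
        # Apply the gradient update
        self.optimizer.step()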

        return loss.numpy()[0]

5. Training

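5.1 Define visualization file path

# The original line is not shown; './log' is a placeholder directory, not the author's path
writer = LogWriter(logdir='./log')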


5.2 Training

LEARN_FREQ = 5  # Train once every 5 environment steps
MEMORY_SIZE = 200000  # Capacity of the replay buffer
MEMORY_WARMUP_SIZE = 200  # Number of stored experiences required before training starts
BATCH_SIZE = 64  # Assumed value: used below but not defined in the listing
LEARNING_RATE = 0.0005  # Assumed value: used in main() but not defined in the listing
GAMMA = 0.99  # Discount factor

# Run one training episode; it ends when done is True and returns the total reward
def run_train_episode(agent, env, rpmemory):
    total_reward = 0
    state = env.reset()
    step = 0
    while True:
        step += 1
        # The agent samples an action with the behavior policy
        action = agent.sample(state)
        next_state, reward, done, _ = env.step(action)
        rpmemory.add((state, action, reward, next_state, done))

        # Once the replay buffer holds more experiences than the warm-up threshold, train once every 5 time steps
        if (rpmemory.size() > MEMORY_WARMUP_SIZE) and (step % LEARN_FREQ == 0):
            # Draw a random batch (call reconstructed; continuous=False is an
            # assumption that matches the random sampling described in 3.2)
            experiences = rpmemory.sample(BATCH_SIZE, False)
            # s,a,r,s',done
            batch_state, batch_action, batch_reward, batch_next_state,batch_done = zip(*experiences)
            # The agent updates the value network
            train_loss = agent.learn(batch_state, batch_action, batch_reward,batch_next_state, batch_done)

        total_reward += reward
        state = next_state
        if done:
            break
    return total_reward

# Evaluate the agent for 5 episodes and return the average reward
def run_evaluate_episodes(agent, env, eval_episodes=5, render=False):
    eval_reward = []
    for i in range(eval_episodes):
        state = env.reset()
        episode_reward = 0
        while True:
            # The agent selects an action greedily and executes it
            action = agent.predict(state)
            state, reward, done, _ = env.step(action)
            episode_reward += reward
            # render is not supported on the AI Studio platform but can be enabled on your own computer
            if render:
                env.render()

            if done:
                break
        eval_reward.append(episode_reward)
    return np.mean(eval_reward)

def main():
    # Loading environment
    env = gym.make('CartPole-v0')
    state_size = env.observation_space.shape[0] # 4
    action_size = env.action_space.n  # 2

    # Initialize experience array
    rpm = MyMemoryBuffer(MEMORY_SIZE)

    # build an agent
    model = MyDQNnetwork(state_size, action_size)

    agent = MyDQNAgent(model, action_size,gamma=GAMMA, lr=LEARNING_RATE, e_greed=0.1, e_greed_decrement=1e-6)

    max_episode = 1200

    # start training
    start_time = time.time()
    episode = 0
    while episode < max_episode:
        # train part
        for i in range(50):
            total_reward = run_train_episode(agent, env, rpm)
            episode += 1

        # test part
        eval_reward= run_evaluate_episodes(agent, env, render=False)
        writer.add_scalar('eval reward',eval_reward,episode)
        if episode%50==0:
            print('episode:{}    e_greed:{}   Test reward:{}'.format(episode, agent.e_greed, eval_reward))
    print('all used time {:.2}s = {:.2}h'.format(time.time()-start_time,(time.time()-start_time)/3600))

if __name__ == '__main__':
    main()

W1227 16:36:14.529779  4613 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.0, Runtime API Version: 10.1
W1227 16:36:14.534389  4613 device_context.cc:465] device: 0, cuDNN Version: 7.6.
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/tensor/creation.py:130: DeprecationWarning: `np.object` is a deprecated alias for the builtin `object`. To silence this warning, use `object` by itself. Doing this will not modify any behavior and is safe. 
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  if data.dtype == np.object:

episode:50    e_greed:0.09952399999999953   Test reward:8.8
episode:100    e_greed:0.09902599999999903   Test reward:8.6
episode:150    e_greed:0.09853399999999854   Test reward:9.6
episode:200    e_greed:0.09804899999999805   Test reward:9.6
episode:250    e_greed:0.09756399999999757   Test reward:9.8
episode:300    e_greed:0.09706799999999707   Test reward:9.6
episode:350    e_greed:0.09656799999999657   Test reward:9.4
episode:400    e_greed:0.09608299999999609   Test reward:9.6
episode:450    e_greed:0.0955879999999956   Test reward:9.6
episode:500    e_greed:0.0950979999999951   Test reward:9.8
episode:550    e_greed:0.09447199999999448   Test reward:15.8
episode:600    e_greed:0.09283599999999284   Test reward:53.0
episode:650    e_greed:0.08931699999998932   Test reward:52.4
episode:700    e_greed:0.08250499999998251   Test reward:145.8
episode:750    e_greed:0.07330899999997331   Test reward:200.0
episode:800    e_greed:0.06380499999996381   Test reward:200.0
episode:850    e_greed:0.054163999999954165   Test reward:193.2
episode:900    e_greed:0.044443999999944445   Test reward:164.8
episode:950    e_greed:0.03524099999993524   Test reward:139.8
episode:1000    e_greed:0.026618999999926618   Test reward:157.4
episode:1050    e_greed:0.01726199999991726   Test reward:194.4
episode:1100    e_greed:0.01   Test reward:200.0
episode:1150    e_greed:0.01   Test reward:154.2
episode:1200    e_greed:0.01   Test reward:195.0
all used time 1.3e+02s = 0.036h

5.3 Results

The reward is low at the beginning because the replay buffer does not yet hold enough experience. As training continues and more experience accumulates, the reward rises sharply: in the log above it goes from about 9 at episode 500 to 200 at episode 750.
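The printed e_greed values also reveal how many training steps have been taken, because e_greed drops by e_greed_decrement on every call to sample. A small back-of-the-envelope check, using only numbers from the code and the log:

```python
E_GREED_START = 0.1        # e_greed passed to MyDQNAgent in main()
E_GREED_DECREMENT = 1e-6   # e_greed_decrement passed to MyDQNAgent in main()
E_GREED_FLOOR = 0.01       # lower bound enforced in sample()

e_greed_at_50 = 0.09952399999999953  # value printed after 50 training episodes

# Each training step lowers e_greed by one decrement, so the gap counts the steps
steps = round((E_GREED_START - e_greed_at_50) / E_GREED_DECREMENT)
steps_per_episode = steps / 50
print(steps, steps_per_episode)  # 476 9.52

# Number of training steps needed before e_greed reaches its floor
steps_to_floor = round((E_GREED_START - E_GREED_FLOOR) / E_GREED_DECREMENT)
print(steps_to_floor)  # 90000
```

CartPole gives a reward of 1 per step, so roughly 9.5 steps per training episode is in line with the test rewards of about 9 at the start of the log. The log shows e_greed at 0.01 from episode 1100 onward, so about 90000 training steps had been taken by then.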

6. Project summary

This project builds a DQN from scratch, and I think it is fairly detailed. Until now I had only read books and had only theoretical knowledge; this is my first attempt at putting it into practice. Going from theory to practice is still difficult, but there are many examples to refer to, which speeds up learning.

I am a beginner and have only just started with reinforcement learning, so there are bound to be many shortcomings. Criticism and corrections are welcome!

Topics: AI Deep Learning paddlepaddle