UR5 assembly simulation and comparison based on DDPG and TD3

Posted by android6011 on Fri, 08 Oct 2021 04:23:45 +0200

This project was the final assignment for the course "Optimal Control" taught by Le Xinyi at Shanghai Jiao Tong University in the fall semester of 2020. Most of the content is based on the work of senior student Fang Xiaomeng, "Research on multi-manipulator cooperative control technology based on neural network algorithms". Recently, while preparing my thesis proposal, I went over reinforcement learning again, so I am sharing here what I learned in the course.

Project introduction

In this project, a UR5 manipulator in a simulation environment inserts a peg mounted at the end of the robot into a fixed hole, as shown in the figure below. Unlike the direct assembly in the original paper, this scheme adopts two improvements:
1) The initial position is moved to near the hole opening (plus a random disturbance); in practice this can be achieved with visual servoing, which is a relatively mature technology
2) The insertion scheme is changed from a direct, one-shot insertion to a search-then-insert strategy (a purely illustrative sketch of such a search follows this list)
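
The original paper does not spell out how the "search" should be done; here it is presumably left to the RL agent to learn. Purely as an illustration of what a scripted search could look like (a hypothetical helper, not part of this project's code), an Archimedean spiral around the estimated hole centre is a common peg-in-hole search pattern:

import numpy as np

def spiral_search_waypoints(center_xy, step=0.0005, turns=5, points_per_turn=16):
    """Generate XY waypoints of an Archimedean spiral around an estimated hole centre.

    Hypothetical helper for illustration only; in this project the search behaviour
    is learned rather than scripted.
    """
    thetas = np.linspace(0.0, 2.0 * np.pi * turns, turns * points_per_turn)
    radii = step * thetas / (2.0 * np.pi)  # radius grows by `step` every full turn
    offsets = np.stack([radii * np.cos(thetas),
                        radii * np.sin(thetas)], axis=1)
    return np.asarray(center_xy) + offsets

# e.g. search around a hole whose centre is estimated at (0.40 m, 0.05 m)
waypoints = spiral_search_waypoints([0.40, 0.05])
print(waypoints.shape)  # (80, 2)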

Combining these with the problems in the original paper, the following improvements are proposed:

Problems in the original paper | Improvement ideas
Training time too long (4 days and 4 nights) | Shorten the steps and episodes
Poor training quality and low success rate | Change the initial position so that as much effective data as possible is collected
Low learning efficiency | Reduce the replay_buffer depth

System configuration

CoppeliaSim (V-REP simulation environment)

This project runs on Ubuntu 16.04 with CoppeliaSim V4.2.0 rev5. For other systems, select the corresponding version from the official CoppeliaSim download page. After downloading, the CoppeliaSim_Edu_V4_2_0_Ubuntu16_04 folder is the simulation environment folder.

Open a terminal in the CoppeliaSim_Edu_V4_2_0_Ubuntu16_04 folder and run ./coppeliaSim.sh:

 way@way:~/CoppeliaSim_Edu_V4_2_0_Ubuntu16_04$ ./coppeliaSim.sh
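
The training environment (ArmEnv in vrepEnv.py) talks to this running CoppeliaSim instance, typically through the legacy remote API. The following minimal connection check is only a sketch, not the project's ArmEnv code; it assumes the remote API Python bindings (sim.py, simConst.py and the remoteApi shared library; the module is called vrep in older V-REP releases) are on the Python path and that the default port 19997 is enabled:

import sim  # legacy CoppeliaSim remote API bindings

sim.simxFinish(-1)  # close any stale connections first
client_id = sim.simxStart('127.0.0.1', 19997, True, True, 5000, 5)
if client_id == -1:
    raise RuntimeError('Could not connect to CoppeliaSim on port 19997')

# Start and stop one simulation run just to verify the connection
sim.simxStartSimulation(client_id, sim.simx_opmode_blocking)
sim.simxStopSimulation(client_id, sim.simx_opmode_blocking)
sim.simxFinish(client_id)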

Creating a virtual environment using conda

This project uses TensorFlow 1.14.0, a 1.x GPU build (requiring CUDA and cuDNN). The graphics card is a GeForce GTX 1070/PCIe/SSE2. Anaconda3 is recommended for creating a virtual environment.
First, add the anaconda3 path. For convenience, it is recommended to define aliases in .bashrc for the path and for activating the corresponding environments:

alias condaenv='export PATH="/home/way/anaconda3/bin:$PATH"'
alias condaact='source activate'
# To activate this environment, use
alias tf='conda activate tf114py36'
alias pytorch='conda activate pytorch'

Here tf is the environment we use. The conda commands are:
Create a new environment: conda create -n tf114py36 python=3.6.0
Install TensorFlow 1.14.0: conda install -n tf114py36 tensorflow-gpu==1.14.0
Delete an existing environment: conda remove -n xxx --all
Switch to an environment: conda activate xxx
Exit the environment: conda deactivate
List the installed packages: conda list

Once everything is installed, entering the environment takes just three steps: condaenv, condaact, tf
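
To confirm that the GPU build is actually picked up inside the tf environment, a quick check from a Python shell is:

import tensorflow as tf

print(tf.__version__)              # expected: 1.14.0
print(tf.test.is_gpu_available())  # True if the CUDA/cuDNN installation is detected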

Program running

  1. First run V-REP and open the scene ur5_force_vision_Insert-v0.ttt, then import the peg-and-hole models; their dimensions are listed in the table below
FILE NAME | SHAPE | SIZE
peg0 | circular | ϕ20 × 30
hole0 | circular | ϕ22 × 30
peg1 | circular | ϕ20 × 30
hole1 | circular | ϕ23 × 30
peg2 | square | 20 × 20 × 30
hole2 | square | 23 × 23 × 30
  2. Run the learning script in the tf environment; take TD3 as an example: python main_TD3.py
from vrepEnv import ArmEnv
import numpy as np
import os
from logger import Logger
import time
import datetime
import torch
import utils
import TD3



MAX_EPISODES = 10000
MAX_EP_STEPS = 10
ON_TRAIN = False  # Set to True for training, False for testing
MEMORY_CAPACITY = 1000
BATCH_SIZE = 200
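# A 10-step episode and a 1000-transition replay buffer implement the improvement
# ideas above: shorter episodes and a shallower replay buffer than in the original paper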
logging_directory = os.path.abspath('logs')
# logger = Logger(logging_directory)

# Set environment
env = ArmEnv()
state_dim = env.state_dim
action_dim = env.action_dim
a_bound = env.action_bound
seed = 0
torch.manual_seed(seed)
np.random.seed(seed)
pn = a_bound*0.2
nc = a_bound*0.5
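# TD3 target-policy smoothing parameters, passed to the TD3 constructor below:
# noise standard deviation = 20% of the action bound, clipped to 50% of the action bound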


# Set method: pd/RL

rl = TD3.TD3(state_dim, action_dim, a_bound,policy_noise=pn,noise_clip=nc)
replay_buffer = utils.ReplayBuffer(state_dim, action_dim,MEMORY_CAPACITY)

def train():
    """ Training the actor """

    logger = Logger(logging_directory)
    env.start_simulation()

    action_var = a_bound  # Action noise factor
    reward_value = []  # Record the reward value for each episode

    for episode in range(MAX_EPISODES):

        s = env.reset()
        episode_reward = 0.  # Total reward for an episode

        for step in range(MAX_EP_STEPS):
            print('step: %i '%(step))
            pn = action_var * 0.2
            nc = action_var * 0.5
            a = (
                    rl.select_action(np.array(s))
                    + np.random.normal(0, pn, size=action_dim)
            ).clip(-action_var, action_var)
            print('Action: ', a)

            #Perform action
            s_, r, done, safety = env.step(a)
            # Store data in replay buffer s, a, s_, r, done->not_done
            replay_buffer.add(s, a, s_, r, done)

            s = s_
            episode_reward += r
            mean_episode_reward = episode_reward/(step+1)

            if episode>MEMORY_CAPACITY:
                print('the memory is full')
                rl.train(replay_buffer,BATCH_SIZE)  # Start to learn once has fulfilled the memory
                action_var *= .999  # Reduce the action randomness once start to learn

            if done or step == MAX_EP_STEPS - 1 or safety == 1:
                print('Episode: %i | %s | %s | MER: %f | Steps: %i | Var: %f'
                      % (episode, 'Not inserted' if not done else 'Inserted', 'Unsafe' if safety == 1 else 'Safe',
                         mean_episode_reward, step, action_var))
                break

        reward_value.append(mean_episode_reward)

    logger.save_reward_value(reward_value)
    rl.save(logger.models_directory)

def test():
    models_directory = '/home/way/py/RL_UR_Project/UR5_RL_Assembly-inserted_discrete/logs/2020-12-21.22:51:59/models'
    rl.load(models_directory)
    s = env.reset()
    while True:
        a = rl.select_action(s)
        s, r, done, _ = env.step(a)
        if done and np.abs(s[12]) <= 10 and np.abs(s[13]) <= 10 and np.abs(s[14]) <= 10:
            env.move_to(np.array([s[0],s[1],0.011]),np.array([0, 0, 0]))
            time.sleep(3)
            break



if __name__ == '__main__':

    timestamp = time.time()
    timestamp_value = datetime.datetime.fromtimestamp(timestamp)
    print("The program starts at:  ", timestamp_value.strftime('%Y-%m-%d.%H:%M:%S'))

    if ON_TRAIN:
        train()
    else:
        test()
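
One detail worth noting in train() above: the exploration noise scale action_var decays by a factor of 0.999 on every learning step, so it halves only after roughly 700 updates; a quick check of that number:

import math
print(math.log(0.5) / math.log(0.999))  # ≈ 692.8 learning steps until action_var halves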

Comparison of experimental results

DDPG simulation effect

Result of a successful training run (figure)


Result of a failed training run (figure)

Periodic oscillation occurs and convergence is difficult

DDPG simulation results

Assembly time under one minute, maximum assembly force under 14 N, success rate 85%

TD3 simulation effect

Result of a successful training run (figure)

Result of a failed training run (figure)

The final result diverges

TD3 simulation results

Assembly time under 30 seconds, maximum assembly force under 20 N with a small variance, success rate 60%

TD3 vs DDPG

Item | TD3 | DDPG
Success rate | 60% | 85%
Failure mode | Exponential divergence | Periodic oscillation, hard to converge
Maximum assembly force | 20 N | 14 N
Average assembly time | 15 s | 30 s
Variance of assembly force | Small | Large
Training time | One night | One night
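
The algorithmic difference behind this comparison is how each method builds its critic target: DDPG bootstraps from a single critic with a deterministic target action, while TD3 adds target-policy smoothing and clipped double-Q learning (taking the minimum of two target critics). A minimal PyTorch sketch of the two target computations, using toy networks and dimensions rather than the project's TD3.py:

import torch
import torch.nn as nn

# Toy dimensions and target networks, only to illustrate the two critic targets
state_dim, action_dim, max_action, gamma = 12, 3, 1.0, 0.99
actor_t = nn.Linear(state_dim, action_dim)
critic1_t = nn.Linear(state_dim + action_dim, 1)
critic2_t = nn.Linear(state_dim + action_dim, 1)

batch = 200  # matches BATCH_SIZE above
next_state = torch.randn(batch, state_dim)
reward = torch.randn(batch, 1)
not_done = torch.ones(batch, 1)

with torch.no_grad():
    # DDPG target: one critic, deterministic target action
    a_ddpg = torch.tanh(actor_t(next_state)) * max_action
    q_ddpg = critic1_t(torch.cat([next_state, a_ddpg], dim=1))
    target_ddpg = reward + gamma * not_done * q_ddpg

    # TD3 target: smoothed target action and the minimum of two critics
    # (the 0.2 and 0.5 factors mirror policy_noise and noise_clip in main_TD3.py)
    noise = (torch.randn(batch, action_dim) * 0.2 * max_action).clamp(-0.5 * max_action, 0.5 * max_action)
    a_td3 = (torch.tanh(actor_t(next_state)) * max_action + noise).clamp(-max_action, max_action)
    sa = torch.cat([next_state, a_td3], dim=1)
    target_td3 = reward + gamma * not_done * torch.min(critic1_t(sa), critic2_t(sa))

print(target_ddpg.shape, target_td3.shape)  # both torch.Size([200, 1])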

Topics: TensorFlow Deep Learning