Reinforcement learning: Q-learning, with a source code walkthrough

Posted by davidx714 on Thu, 13 Jan 2022 21:47:46 +0100

Reinforcement learning is one of the three main branches of machine learning, alongside supervised and unsupervised learning. It mainly solves decision-making problems, especially sequential decision-making problems.

(A reinforcement learning block diagram goes here, where:)

Agent: the "learner" in the reinforcement learning system;
Environment: what the agent acts in; the environment in turn affects the agent;
State: the set of states the environment feeds back to the agent;
Reward: the set of feedback signals the environment returns for the agent's actions;
Action: the set of actions the agent can take in the environment.

Reinforcement learning algorithms can be classified along several dimensions. This article covers a value-based, off-policy algorithm: Q-learning.


The purpose of Q-learning is to learn the value of taking a specific Action in a specific State. We create a Q-table, with States as rows and Actions as columns, and update it using the Reward produced by each Action.

Q-learning is an off-policy method: it learns about one policy while following another. "Off-policy" means the action (behavior) policy and the evaluation (target) policy are not the same policy.

Q is Q(s,a): the expected return of taking Action a (a ∈ A) in State s (s ∈ S) at a given time. The environment feeds back a Reward r for each Action the agent takes. The main idea of the algorithm is therefore to store the Q values in a table indexed by State and Action, and then select the Action with the largest Q value.
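As a minimal sketch of this idea (the state key and Q values here are illustrative, not taken from the post's source code):

```python
from collections import defaultdict

# Hypothetical illustration of the Q-table idea: keys are states,
# values are the Q values of the four actions (0=up, 1=down, 2=left, 3=right).
q_table = defaultdict(lambda: [0.0, 0.0, 0.0, 0.0])

# Suppose that after some training, state '[0, 0]' has these values:
q_table['[0, 0]'] = [0.0, 0.5, 0.0, 0.2]

# Greedy selection: pick the action with the largest Q value.
best_action = max(range(4), key=lambda a: q_table['[0, 0]'][a])
print(best_action)  # action 1 (down), whose value 0.5 is the largest
```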

The source code analysis follows. Readers who want the formula derivation can refer to the blogger's other post:

A few notes:
1. The window mentioned below can also be called a canvas; I prefer to call it a window.
2. A Baidu cloud link to the source is attached at the end. Since the files are small, Baidu cloud works well here; GitHub may be unreachable from some networks.
3. If anything in the explanation is wrong, please forgive me and let the blogger know promptly so we can learn from each other. Thank you.

The walkthrough below follows the order of execution:

Running the main function first creates an Env object, a subclass of tk.Tk, whose __init__ runs first:

Main function:

if __name__ == "__main__":
    env = Env()
    agent = QLearningAgent(actions=list(range(env.n_actions)))
    for episode in range(1000):   # 1000 training episodes
        state = env.reset()
        while True:
            env.render()          # refresh the window
            # the agent picks an action
            action = agent.get_action(str(state))   # str() turns the state list into a dictionary key
            next_state, reward, done = env.step(action)
            # update the Q-table
            agent.learn(str(state), action, reward, str(next_state))
            state = next_state    # state update
            env.print_value_all(agent.q_table)      # redraw the Q values on the window
            # on reaching a terminal cell, stop and start a new episode
            if done:
                break

env = Env():

class Env(tk.Tk):    # the environment window, inheriting tkinter's Tk
    def __init__(self):
        super(Env, self).__init__()    # super(Env, self) finds Env's parent class (tk.Tk) and initializes it
        self.action_space = ['u', 'd', 'l', 'r']   # action space: up, down, left, right
        self.n_actions = len(self.action_space)    # number of actions
        self.title('Q Learning')                   # window title
        self.geometry('{0}x{1}'.format(HEIGHT * UNIT, HEIGHT * UNIT))   # window size
        self.shapes = self.load_images()           # pre-load the images
        self.canvas = self._build_canvas()         # build the canvas and everything on it
        self.texts = []                            # empty list that will hold the on-screen Q-value texts

The comments cover most of it; a few points deserve attention. The window is 500 × 500, and the constants are assigned at the top of the file. There are four actions in total, moving up, down, left and right, encoded as 0, 1, 2 and 3 respectively;
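The constants themselves are not shown in the post; given the 500 × 500 window and 100 × 100 cells, they are presumably defined at the top of the file roughly like this:

```python
# Assumed top-of-file constants: the post states the window is 500 x 500
# and each grid cell is 100 x 100, which implies a 5 x 5 grid.
UNIT = 100    # pixel size of one grid cell
HEIGHT = 5    # number of rows
WIDTH = 5     # number of columns

print(HEIGHT * UNIT, WIDTH * UNIT)  # 500 500
```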

Then load_images():

    def load_images(self):
        # Pre-load the three images and scale each to 65 x 65
        # (each grid cell of the window is 100 x 100).
        # Image.open is PIL (Pillow); PhotoImage wraps the result for tkinter.
        rectangle = PhotoImage(
            Image.open("../img/rectangle.png").resize((65, 65)))
        triangle = PhotoImage(
            Image.open("../img/triangle.png").resize((65, 65)))
        circle = PhotoImage(
            Image.open("../img/circle.png").resize((65, 65)))

        return rectangle, triangle, circle

Why each grid cell of the window is 100 × 100 comes up later. The load_images function can be understood as pre-loading the images so that they can then be placed into the window;


    def _build_canvas(self):   # build the canvas and everything on it
        canvas = tk.Canvas(self, bg='white',
                           height=HEIGHT * UNIT,
                           width=WIDTH * UNIT)   # canvas properties: white background, height and width both 500
        # create grid lines
        for c in range(0, WIDTH * UNIT, UNIT):   # c = 0, 100, ..., 400: one vertical line every 100 px
            x0, y0, x1, y1 = c, 0, c, HEIGHT * UNIT
            canvas.create_line(x0, y0, x1, y1)
        for r in range(0, HEIGHT * UNIT, UNIT):  # r = 0, 100, ..., 400: one horizontal line every 100 px
            x0, y0, x1, y1 = 0, r, HEIGHT * UNIT, r
            canvas.create_line(x0, y0, x1, y1)

        # add the pre-loaded images to the canvas
        self.rectangle = canvas.create_image(50, 50, image=self.shapes[0])
        self.triangle1 = canvas.create_image(250, 150, image=self.shapes[1])
        self.triangle2 = canvas.create_image(150, 250, image=self.shapes[1])
        self.circle = canvas.create_image(250, 250, image=self.shapes[2])

        # pack all
        canvas.pack()   # lay the canvas out in the window

        return canvas

What may be less obvious here is the for loop. It slices the window into a 5 × 5 grid, like a grid map. As for why the loop runs over 0 to 400 in steps of 100: range(0, 500, 100) stops before 500, which is simply Python's range semantics.
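The loop bounds can be checked directly (assuming UNIT = 100 and WIDTH = 5, as implied by the 500 × 500 window):

```python
UNIT, WIDTH = 100, 5   # assumed constants from the top of the file

# range stops before WIDTH * UNIT = 500, so grid lines are drawn at:
lines = list(range(0, WIDTH * UNIT, UNIT))
print(lines)  # [0, 100, 200, 300, 400]
```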

An enlarged drawing to aid understanding:

For example, the red box on the left (there is no actual box; it is just for illustration) is the window, and on the right it is divided into a 5 × 5 grid.

Then the pre-loaded shapes are placed: two triangles and a circle. Hitting a triangle gives reward < 0, a punishment; hitting the circle gives reward > 0, a reward.

Next, the QLearningAgent:

Its initialization defines several parameters, such as the learning rate and the exploration coefficient, and defines the Q-table in dictionary form:

class QLearningAgent:
    def __init__(self, actions):
        # actions = [0, 1, 2, 3]
        self.actions = actions        # four actions: 0, 1, 2, 3 stand for up, down, left, right
        self.learning_rate = 0.01     # learning rate
        self.discount_factor = 0.9    # reward discount factor
        self.epsilon = 0.1            # exploration probability of the ε-greedy policy
        self.q_table = defaultdict(lambda: [0.0, 0.0, 0.0, 0.0])   # state -> [Q(s,0), Q(s,1), Q(s,2), Q(s,3)]

Enter the main function loop for episode in range(1000):

state = env.reset():

    def reset(self):
        self.update()    # process pending tkinter events and redraw the window
        time.sleep(0.5)  # pause the calling thread for 0.5 s so the redraw is visible before the next episode
        x, y = self.canvas.coords(self.rectangle)    # current position of the rectangle; the initial position is (50, 50)
        self.canvas.move(self.rectangle, UNIT / 2 - x, UNIT / 2 - y)   # move it back to the start, the first cell
        self.render()    # refresh the window
        # return observation
        return self.coords_to_state(self.canvas.coords(self.rectangle))   # the rectangle's state, i.e. its grid coordinates in the Q-table

reset means that after an episode finishes, the rectangle (the rectangle is the agent) is put back at the origin, the top-left cell of the 5 × 5 grid;

Here is a function coords_to_state():

    def coords_to_state(self, coords):
        # canvas pixel coordinates -> grid indices
        x = int((coords[0] - 50) / 100)
        y = int((coords[1] - 50) / 100)
        return [x, y]

That is, if the pixel coordinates are, say, the origin (50, 50), the corresponding position in the Q-table is (0, 0); the 5 × 5 grid can be viewed as the Q-table itself;
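A standalone copy of the mapping makes it easy to check, using the canvas positions shown earlier:

```python
def coords_to_state(coords):
    # pixel coordinates of a cell's center -> [column, row] grid indices
    x = int((coords[0] - 50) / 100)
    y = int((coords[1] - 50) / 100)
    return [x, y]

print(coords_to_state([50, 50]))    # the origin maps to [0, 0]
print(coords_to_state([250, 150]))  # triangle1's cell maps to [2, 1]
```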

Enter the while loop of the main function:

env.render() is the function that refreshes the window:

    def render(self):
        time.sleep(0.03)   # delay 0.03 s so the motion is visible
        self.update()      # redraw the window

action = agent.get_action(str(state)):

    def get_action(self, state):
        if np.random.rand() < self.epsilon:    # with probability ε, explore
            # random exploration
            action = np.random.choice(self.actions)   # pick uniformly from self.actions = [0, 1, 2, 3]
        else:
            # otherwise exploit: select from the Q-table
            state_action = self.q_table[state]
            action = self.arg_max(state_action)   # if several actions share the maximum Q value, pick one of them at random
        return action

Literally, this obtains an action. Early in training the Q values are all zero, so even the exploit branch effectively chooses at random. The key expression is self.q_table[state]: as mentioned earlier, q_table is the dictionary we defined, and q_table[state] looks up the value stored under the key state. For example, the entry '[0, 0]': [0.0, 0.0, 0.0, 0.0] has the key '[0, 0]' and the value [0.0, 0.0, 0.0, 0.0], so self.q_table[state] returns [0.0, 0.0, 0.0, 0.0] (apologies if some terminology is non-standard);
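The defaultdict behavior is what makes this lookup safe even for states that have never been visited; a minimal demonstration:

```python
from collections import defaultdict

q_table = defaultdict(lambda: [0.0, 0.0, 0.0, 0.0])

# A key that has never been stored still returns a fresh all-zero row:
row = q_table['[0, 0]']
print(row)  # [0.0, 0.0, 0.0, 0.0]
```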

self.arg_max(state_action):

    @staticmethod
    def arg_max(state_action):
        # collect the indices of all actions that share the maximum Q value
        max_index_list = []
        max_value = state_action[0]
        for index, value in enumerate(state_action):
            if value > max_value:
                max_index_list.clear()
                max_value = value
                max_index_list.append(index)
            elif value == max_value:
                max_index_list.append(index)
        return random.choice(max_index_list)

This is the action selection. As mentioned, actions are random at the start of training. Substitute a concrete value for analysis: with [0.0, 0.0, 0.0, 0.0], the loop's first iteration has index 0 and value = 0.0, enters the elif branch, and appends the index to max_index_list. All four iterations append, so by the return statement max_index_list = [0, 1, 2, 3], meaning the Q values at indices 0, 1, 2, 3 (each corresponding to an action, as mentioned earlier) are all equal; one of them, say 2, is chosen at random and returned;

In the situation shown in this figure, the if branch is entered instead: while traversing the four Q values of the rectangle's current state, 0.0 > -1.0 is found, so the index of the -1.0 entry is discarded and only the three indices with Q value 0.0 remain in max_index_list; one of them is then chosen at random. This matches the basic idea of Q-learning: choose the action with the greatest expected return according to the Q values.
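The tie-breaking can be reproduced with a standalone copy of arg_max and values like those in the figure (three zeros and one -1.0):

```python
import random

def arg_max(state_action):
    # standalone copy of the agent's tie-breaking argmax
    max_index_list = []
    max_value = state_action[0]
    for index, value in enumerate(state_action):
        if value > max_value:
            max_index_list.clear()
            max_value = value
            max_index_list.append(index)
        elif value == max_value:
            max_index_list.append(index)
    return random.choice(max_index_list)

# -1.0 is discarded; the choice is uniform among indices 0, 1 and 3:
print(arg_max([0.0, 0.0, -1.0, 0.0]))  # one of 0, 1, 3
```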

next_state, reward, done = env.step(action):

    def step(self, action):
        state = self.canvas.coords(self.rectangle)   # the rectangle's pixel coordinates on the canvas
        base_action = np.array([0, 0])

        if action == 0:  # up
            if state[1] > UNIT:
                base_action[1] -= UNIT
        elif action == 1:  # down
            if state[1] < (HEIGHT - 1) * UNIT:
                base_action[1] += UNIT
        elif action == 2:  # left
            if state[0] > UNIT:
                base_action[0] -= UNIT
        elif action == 3:  # right
            if state[0] < (WIDTH - 1) * UNIT:
                base_action[0] += UNIT

        # move
        self.canvas.move(self.rectangle, base_action[0], base_action[1])   # move the rectangle by the chosen offset
        self.canvas.tag_raise(self.rectangle)   # raise the rectangle to the top of the drawing order
        next_state = self.canvas.coords(self.rectangle)
        # judge the scoring condition and assign the Reward
        if next_state == self.canvas.coords(self.circle):
            reward = 100
            done = True
        elif next_state in [self.canvas.coords(self.triangle1),
                            self.canvas.coords(self.triangle2)]:
            reward = -100
            done = True
        else:
            reward = 0
            done = False

        next_state = self.coords_to_state(next_state)  # back to Q-table (grid) coordinates
        return next_state, reward, done

Once the next action has been obtained, the step function updates the rectangle's position, computes the reward, and checks whether the rectangle has hit a triangle (obstacle) or the circle (goal). If so, the episode is over, done returns True, the while loop exits, and a new episode begins;

This code is easy to follow; the one point worth noting is self.canvas.tag_raise(self.rectangle), which raises the rectangle to the top of the drawing order. Without this line, if the rectangle moved onto the circle's cell you would know it arrived but could not see it; with it, the rectangle is drawn on top of the circle.

The following is to update the Q table:

agent.learn(str(state), action, reward, str(next_state)):

    def learn(self, state, action, reward, next_state):
        current_q = self.q_table[state][action]   # current Q value of the (state, action) pair
        # Bellman equation update
        new_q = reward + self.discount_factor * max(self.q_table[next_state])   # target Q value
        self.q_table[state][action] += self.learning_rate * (new_q - current_q)

This is the Q-value update itself, so I won't elaborate;
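For concreteness, here is one update computed by hand with the constants defined earlier (learning_rate = 0.01, discount_factor = 0.9), assuming the agent has just reached the circle from an unvisited state:

```python
learning_rate, discount_factor = 0.01, 0.9   # constants from QLearningAgent.__init__

current_q = 0.0                    # Q(s, a) before the update
reward = 100                       # the agent just reached the circle
next_q_row = [0.0, 0.0, 0.0, 0.0]  # Q values of the (terminal) next state

new_q = reward + discount_factor * max(next_q_row)        # 100.0
updated = current_q + learning_rate * (new_q - current_q)
print(updated)  # 1.0
```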

Then, after the state update, print_value_all:

    def print_value_all(self, q_table):
        for i in self.texts:
            self.canvas.delete(i)   # clear the Q-value texts drawn last time
        self.texts.clear()
        for i in range(HEIGHT):
            for j in range(WIDTH):
                for action in range(0, 4):   # visit every cell of the window and refresh its text
                    state = [i, j]
                    if str(state) in q_table.keys():
                        temp = q_table[str(state)][action]   # look up the state in the q_table dictionary, then index by action
                        self.text_value(j, i, round(temp, 2), action)   # round(temp, 2) rounds to two decimals

print_value_all redraws the Q values on the window; a screenshot makes this clear at a glance.

Finally, attach Baidu cloud link:
Extraction code: 8888

Topics: Python