MySQL Group Replication: why large transactions are not recommended

Posted by wild_dog on Wed, 19 Jan 2022 05:05:09 +0100

I recently tested backup and restore under different data volumes, which meant migrating large amounts of data into an MGR cluster. Whenever the migrated data set was somewhat large, the import frequently aborted with errors. Most of what you find online is just the advice to avoid large transactions; the official documentation is a little better, suggesting the problems may come from memory allocation and network bandwidth. But with a handover looming and daily and weekly reports to fill in, knowing the symptom without the cause was not enough: I had to find out the real reason large transactions break MGR!

Reproducing the problem

The error is almost guaranteed to appear: migrating a few hundred megabytes of data into the MGR cluster reproduces it. The error log on the cluster master looks like this:

2021-07-15T05:38:44.345845Z 0 [ERROR] [MY-011495] [Repl] Plugin group_replication reported: 'This server is not able to reach a majority of members in the group. This server will now block all updates. The server will remain blocked until contact with the majority is restored. It is possible to use group_replication_force_members to force a new group membership.'
......
2021-07-15T05:38:44.690220Z 0 [Warning] [MY-011498] [Repl] Plugin group_replication reported: 'The member has resumed contact with a majority of the members in the group. Regular operation is restored and transactions are unblocked.'
2021-07-15T05:39:03.249057Z 0 [Warning] [MY-011630] [Repl] Plugin group_replication reported: 'Due to a plugin error, some transactions were unable to be certified and will now rollback.'
2021-07-15T05:39:03.233267Z 0 [ERROR] [MY-011505] [Repl] Plugin group_replication reported: 'Member was expelled from the group due to network failures, changing member status to ERROR.'

The client side is more concise; it only prints:

error on observer while running replication hook 'before_commit'

This error relates to what we covered in the last article: MGR exists as a MySQL plugin, and MySQL invokes these hook callbacks after the prepare step of the two-phase commit and before the binlog is written. The message is therefore quite generic; it only tells you that something failed while an MGR plugin hook was executing. If you search Google or Baidu for this error, the usual advice is to raise the group_replication_transaction_size_limit parameter, but even after adjusting it the errors kept occurring.

Checking the official MySQL documentation, we find the description of the following parameter:

group_replication_transaction_size_limit

This reminds us to look at the problem from the angle of the heartbeat mechanism between group members.

Introduction to terms and functions

The source code of the group replication plugin is on GitHub at https://github.com/mysql/mysql-server/blob/3e90d07c3578e4da39dc1bce73559bbdf655c28c/plugin/group_replication/libmysqlgcs/src/bindings/xcom/xcom/xcom_base.cc, where these terms and some of the functions are introduced.

/*
    A node is an instance of the xcom thread. There is only one instance
    of the xcom thread in the agent.
    A client is the application which is using xcom to send messages.
    A thread is a real OS thread.
    A task is a logical process. It is implemented by coroutines and
    an explicit stack.
*/
/*
 static int tcp_server(task_arg);
 The tcp_server listens on the xcom port and starts an
 acceptor_learner_task whenever a new connection is detected.
*/
/*  
 static int sender_task(task_arg);
    The sender_task waits for tcp messages on its input queue and
    sends it on the tcp socket. If the socket is closed for any
    reason, the sender_task will reconnect the socket. There is one
    sender_task for each socket. The sender task exists mainly to
    simplify the logic in the other tasks, but it could have been
    replaced with a coroutine which handles the connection logic after
    having reserved the socket for its client task.
    
    static int acceptor_learner_task(task_arg);
    This is the server part of the xcom thread. There is one
    acceptor_learner_task for each node in the system. The acceptor
    learner_task reads messages from the socket, finds the correct
    Paxos state machine, and dispatches to the correct message handler
    with the state machine and message as arguments.

    static int alive_task(task_arg);
    Sends i-am-alive to other nodes if there has been no normal traffic
    for a while. It also pings nodes which seem to be inactive.
    static int detector_task(task_arg);
    
    The detector_task periodically scans the set of connections from
    other nodes and sees if there has been any activity. If there has
    been no activity for some time, it will assume that the node is
    dead, and send a view message to the client.
*/

alive_task

Looking at the code related to heartbeat sending, we can see that heartbeats are sent in a loop. Source address: L408-L462

int alive_task(task_arg arg MY_ATTRIBUTE((unused))) {
  DECL_ENV
  pax_msg *i_p;
  pax_msg *you_p;
  END_ENV;
  TASK_BEGIN

  ep->i_p = ep->you_p = NULL;

  while (!xcom_shutdown) {
    {
      double sec = task_now();
      synode_no alive_synode = get_current_message();
      site_def const *site = find_site_def(alive_synode);

      /*
        If there are some configuration changes, apply them immediately
      */
      validate_update_configuration(site, alive_synode);

      if (site && get_nodeno(site) != VOID_NODE_NO) {
        /* 
        If this node has not sent a heartbeat for a while,
        send one now
        */
        if (server_active(site, get_nodeno(site)) < sec - 0.5) {
          replace_pax_msg(&ep->i_p, pax_msg_new(alive_synode, site));
          ep->i_p->op = i_am_alive_op;
          /* This path eventually calls send_msg to put the heartbeat message on each server's queue */
          send_to_all_site(site, ep->i_p, "alive_task");
        }

        /* 
        If a node doesn't have a heartbeat,
        ping it to see if it's alive or dead
        */
        {
          node_no i;
          for (i = 0; i < get_maxnodes(site); i++) {
            if (i != get_nodeno(site) && may_be_dead(site->detected, i, sec)) {
              replace_pax_msg(&ep->you_p, pax_msg_new(alive_synode, site));
              ep->you_p->op = are_you_alive_op;

              ep->you_p->a = new_app_data();
              ep->you_p->a->app_key.group_id = ep->you_p->a->group_id =
                  get_group_id(site);
              ep->you_p->a->body.c_t = xcom_boot_type;
              init_node_list(1, &site->nodes.node_list_val[i],
                             &ep->you_p->a->body.app_u_u.nodes);

              IFDBG(D_DETECT, FN; COPY_AND_FREE_GOUT(
                        dbg_list(&ep->you_p->a->body.app_u_u.nodes)););

              send_server_msg(site, i, ep->you_p);
            }
          }
        }
      }
    }
    TASK_DELAY(1.0);
  }
  ......
  TASK_END;
}

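The two checks above boil down to a couple of timestamp comparisons. Here is a small Python sketch of that logic; the 0.5 s idle threshold comes from the snippet, while `SUSPECT_AFTER` is a made-up placeholder for the real timeout used by `may_be_dead`:

```python
# Sketch of alive_task's two checks, modeled on the snippet above.
# The 0.5 s idle threshold is from the code; SUSPECT_AFTER is a
# made-up placeholder for may_be_dead's real timeout.
HEARTBEAT_IDLE = 0.5
SUSPECT_AFTER = 5.0

def alive_actions(now, my_last_send, peer_last_seen):
    """Return the messages one tick of the loop would emit."""
    msgs = []
    # 1) Broadcast i-am-alive if we have been silent too long.
    if my_last_send < now - HEARTBEAT_IDLE:
        msgs.append(("i_am_alive", "all"))
    # 2) Ping every peer that looks inactive.
    for node, last_seen in sorted(peer_last_seen.items()):
        if last_seen < now - SUSPECT_AFTER:
            msgs.append(("are_you_alive", node))
    return msgs

print(alive_actions(10.0, 9.0, {1: 9.8, 2: 3.0}))
```

Node 2 has been silent for 7 seconds here, so it gets pinged, while node 1 does not.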
send_msg

send_msg is a general-purpose xcom method: whether it is a heartbeat task, a ping to another member, or a binlog event, the message is ultimately pushed onto the corresponding server's queue through this method. Source address: L846-L864

/* Push message into queue*/
int send_msg(server *s, node_no from, node_no to, uint32_t group_id,
             pax_msg *p) {
  assert(p);
  assert(s);
  {
    /* Wrap the message in a msg_link queue element */
    msg_link *link = msg_link_new(p, to);
    IFDBG(D_NONE, FN; PTREXP(&s->outgoing);
          COPY_AND_FREE_GOUT(dbg_msg_link(link)););
    p->from = from;
    p->group_id = group_id;
    p->max_synode = get_max_synode();
    p->delivered_msg = get_delivered_msg();
    IFDBG(D_NONE, FN; PTREXP(p); STREXP(s->srv); NDBG(p->from, d);
          NDBG(p->to, d); NDBG(p->group_id, u));
    channel_put(&s->outgoing, &link->l);
  }
  return 0;
}
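The point worth noticing is that every message type goes through the same per-server FIFO, with no priorities. A minimal Python sketch of that idea (the names are mine, not xcom's):

```python
from collections import deque

# Minimal sketch: one outgoing FIFO per peer server, shared by every
# message type (heartbeats, pings, binlog events). Names are mine.
outgoing = deque()

def send_msg(queue, kind, payload):
    # channel_put equivalent: strict FIFO append, no priorities.
    queue.append({"kind": kind, "payload": payload})

send_msg(outgoing, "binlog_event", b"x" * 1024)
send_msg(outgoing, "i_am_alive", b"")
# The heartbeat now sits behind the binlog event in the queue.
```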

sender_task

sender_task takes tasks from the queue and sends them over the TCP connection via _send_msg. Source address: L1426-L1570

/* Fetch messages from queue and send to other server.  Having a
   separate queue and task for doing this simplifies the logic since we
   never need to wait to send. */
int sender_task(task_arg arg) {
  ......
  TASK_BEGIN
  ......
  ep->s = (server *)get_void_arg(arg);
  ep->link = NULL;
  ep->tag = TAG_START;
  srv_ref(ep->s);

  while (!xcom_shutdown) {
    /* Loop until connected */
    G_DEBUG("Connecting to %s:%d", ep->s->srv, ep->s->port);
    for (;;) {
      ......
    }

    G_DEBUG("Connected to %s:%d on fd=%d", ep->s->srv, ep->s->port,
            ep->s->con.fd);

    /* We are ready to start sending messages.
       Insert a message in the input queue to negotiate the protocol.
    */
    start_protocol_negotiation(&ep->s->outgoing);
    while (is_connected(&ep->s->con)) {
      int64_t ret;
      assert(!ep->link);
      ......
      if (link_empty(&ep->s->outgoing.data)) {
        TASK_CALL(flush_srv_buf(ep->s, &ret));
      }
      CHANNEL_GET(&ep->s->outgoing, &ep->link, msg_link);
      {
        int64_t ret_code;
        if (ep->link->p) {
          ......
          /* Call _send_msg to send the message to the other server */
          TASK_CALL(_send_msg(ep->s, ep->link->p, ep->link->to, &ret_code));
        }
      }
    next:
      msg_link_delete(&ep->link);
      /* TASK_YIELD; */
    }
  }
}
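Because there is exactly one sender_task per socket, the queue is drained by a single worker, and one slow send delays everything behind it. A toy Python model of that loop (timings are simulated, not measured):

```python
from collections import deque

# Toy model of sender_task: ONE worker per socket drains the queue,
# so a single slow send stalls every message queued behind it.
def sender_task(outgoing, send_seconds):
    """Drain the queue; return (msg, completion_time) pairs."""
    clock, finished = 0, []
    while outgoing:
        msg = outgoing.popleft()      # CHANNEL_GET equivalent
        clock += send_seconds(msg)    # _send_msg blocks the worker
        finished.append((msg, clock))
    return finished

queue = deque(["big_transaction", "heartbeat"])
cost = lambda m: 20 if m == "big_transaction" else 1
print(sender_task(queue, cost))
```

With these made-up costs, the heartbeat does not complete until second 21, long after the big transaction started hogging the worker.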

_send_msg

Finally, the buffer is written or flushed in the following step. Source address: L256-L322

/* Send a message to server s */
static int _send_msg(server *s, pax_msg *p, node_no to, int64_t *ret) {
    ......
    TASK_CALL(flush_srv_buf(s, ret));
    /* or */
    TASK_CALL(task_write(&s->con, ep->buf, ep->buflen, &sent));
    ......
    alive(s); /* Note activity */
    X_FREE(ep->buf);
    /* UNLOCK_FD(s->con.fd, 'w'); */
    if (sent <= 0) {
        shutdown_connection(&s->con);
    }
    ......
}

Conclusion

Now we can explain why large transactions cause problems in an MGR cluster. Each socket connection has only one sender_task, which is equivalent to a single worker fetching and processing tasks from the queue. When a large transaction executes, both the memory allocation and the sending of its binlog events take a long time, so the heartbeat packets queued behind them are blocked. The other members therefore mistakenly conclude that the master is dead and expel it. After expulsion the master is automatically set to read-only; if the client keeps importing SQL, errors naturally follow, ending in the before_commit error shown earlier.
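To make the effect concrete, here is a back-of-the-envelope Python model; the bandwidth and suspicion timeout are illustrative assumptions, not MySQL defaults:

```python
# Back-of-the-envelope model: a heartbeat queued behind a large
# transaction's binlog events waits for the whole payload to be sent.
# BANDWIDTH and EXPEL_TIMEOUT are illustrative assumptions.
BANDWIDTH = 100 * 2**20   # assume a 100 MiB/s effective link
EXPEL_TIMEOUT = 5.0       # assume suspicion after 5 s of silence

def heartbeat_delay(queued_bytes):
    """Seconds a heartbeat waits behind queued_bytes in the FIFO."""
    return queued_bytes / BANDWIDTH

small_txn = 10 * 2**20    # 10 MiB of binlog events
large_txn = 2 * 2**30     # 2 GiB of binlog events

assert heartbeat_delay(small_txn) < EXPEL_TIMEOUT   # heartbeat in time
assert heartbeat_delay(large_txn) > EXPEL_TIMEOUT   # member gets expelled
```

Under these assumptions the 2 GiB transaction delays the heartbeat by about 20 seconds, comfortably past the point where the other members give up on the master.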

Solution

The solution is simple:

  1. Raise group_replication_transaction_size_limit
  2. Raise group_replication_member_expel_timeout

Remember to restore the original values after the migration completes.
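For example, during the migration the two variables can be raised on each member. The values below are illustrative: setting group_replication_transaction_size_limit to 0 disables the size check entirely, and group_replication_member_expel_timeout is capped at 3600 seconds in MySQL 8.0.

```sql
-- Illustrative values only; tune to your own migration and network.
SET GLOBAL group_replication_transaction_size_limit = 0;     -- 0 = no limit
SET GLOBAL group_replication_member_expel_timeout   = 3600;  -- seconds
```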

Closing thoughts

Although in the end only two parameters were adjusted, the process was far from simple. Building the environment, reproducing the scene, narrowing down the problem, searching for material, and reading the source code all took a lot of time and energy. Another lesson is that my English needs work: the official documentation and the source comments are all in English, many words were unfamiliar and slow to read, and one-click translation is machine translation, often far from the original meaning.

Topics: Database MySQL cluster