Recently we have been testing backup and restore under different data volumes, which meant migrating a large amount of data into an MGR cluster. Whenever the migrated data set got reasonably large, the import kept failing with errors and interruptions. Most search results simply advise avoiding large transactions; the official documentation is a little better, suggesting the problem may be related to memory allocation and network bandwidth. But since this work has to be handed over and written up in daily and weekly reports, knowing the symptom without understanding the cause is not good enough. We still have to find out what really prevents large transactions from going through.
Reproducing the problem
This error is almost guaranteed to occur: migrating a few hundred megabytes of data into the MGR cluster is enough to reproduce it. The error log on the cluster's master looks like this:
2021-07-15T05:38:44.345845Z 0 [ERROR] [MY-011495] [Repl] Plugin group_replication reported: 'This server is not able to reach a majority of members in the group. This server will now block all updates. The server will remain blocked until contact with the majority is restored. It is possible to use group_replication_force_members to force a new group membership.'
......
2021-07-15T05:38:44.690220Z 0 [Warning] [MY-011498] [Repl] Plugin group_replication reported: 'The member has resumed contact with a majority of the members in the group. Regular operation is restored and transactions are unblocked.'
2021-07-15T05:39:03.249057Z 0 [Warning] [MY-011630] [Repl] Plugin group_replication reported: 'Due to a plugin error, some transactions were unable to be certified and will now rollback.'
2021-07-15T05:39:03.233267Z 0 [ERROR] [MY-011505] [Repl] Plugin group_replication reported: 'Member was expelled from the group due to network failures, changing member status to ERROR.'
The client side is more concise and simply reports:
error on observer while running replication hook 'before_commit'
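For reference, a full migration is not even needed to reproduce this: a single transaction of a few hundred megabytes on the primary is usually enough, assuming group_replication_transaction_size_limit has already been raised or disabled (otherwise that limit rejects the transaction first). A minimal sketch, with a purely illustrative table name and payload size:
-- Illustrative reproduction only: build up a table, then copy it in one large transaction
CREATE TABLE big_t (
  id      BIGINT PRIMARY KEY AUTO_INCREMENT,
  payload VARCHAR(1024)
);
INSERT INTO big_t (payload) VALUES (REPEAT('x', 1024));

-- Re-run this statement repeatedly: each run doubles the table, and once the
-- copied data reaches a few hundred MB this single INSERT ... SELECT becomes a
-- large transaction of the kind that triggers the before_commit error above.
INSERT INTO big_t (payload) SELECT payload FROM big_t;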
As mentioned in our last article, MGR is implemented as a MySQL plugin: MySQL invokes these hook callbacks after the prepare step of the two-phase commit and before the binlog is written. The error itself is therefore quite generic; it only tells us that something went wrong while the MGR plugin hooks were running. Searching Google or Baidu for it, the standard advice is to increase the group_replication_transaction_size_limit parameter, but even after adjusting it the errors kept occurring.
Checking the official MySQL documentation for group_replication_transaction_size_limit, we find a hint that large transactions can cause a member to be suspected of having failed while it is busy processing them. This points us toward the heartbeat and failure-detection mechanism between group members.
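Before digging into the source, it is worth knowing how to observe this from the SQL side: the current limits, and the membership view each member maintains through exactly that heartbeat mechanism. These are standard MySQL 8.0 variables and performance_schema tables:
-- Current transaction size limit and expel timeout on this member
SHOW GLOBAL VARIABLES LIKE 'group_replication_transaction_size_limit';
SHOW GLOBAL VARIABLES LIKE 'group_replication_member_expel_timeout';

-- Membership as seen by this member; a partitioned or expelled member
-- shows up here as UNREACHABLE or ERROR
SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE, MEMBER_ROLE
  FROM performance_schema.replication_group_members;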
Introduction to terms and functions
We went to GitHub for the source code of the group replication plugin, https://github.com/mysql/mysql-server/blob/3e90d07c3578e4da39dc1bce73559bbdf655c28c/plugin/group_replication/libmysqlgcs/src/bindings/xcom/xcom/xcom_base.cc, where the comments introduce these terms and the main tasks.
/*
A node is an instance of the xcom thread. There is only one instance
of the xcom thread in the agent.
A client is the application which is using xcom to send messages.
A thread is a real OS thread.
A task is a logical process. It is implemented by coroutines and
an explicit stack.
*/
/*
static int tcp_server(task_arg);
The tcp_server listens on the xcom port and starts an
acceptor_learner_task whenever a new connection is detected.
*/
/*
static int sender_task(task_arg);
The sender_task waits for tcp messages on its input queue and
sends it on the tcp socket. If the socket is closed for any
reason, the sender_task will reconnect the socket. There is one
sender_task for each socket. The sender task exists mainly to
simplify the logic in the other tasks, but it could have been
replaced with a coroutine which handles the connection logic after
having reserved the socket for its client task.
static int acceptor_learner_task(task_arg);
This is the server part of the xcom thread. There is one
acceptor_learner_task for each node in the system. The acceptor
learner_task reads messages from the socket, finds the correct
Paxos state machine, and dispatches to the correct message handler
with the state machine and message as arguments.
static int alive_task(task_arg);
Sends i-am-alive to other nodes if there has been no normal traffic
for a while. It also pings nodes which seem to be inactive.
static int detector_task(task_arg);
The detector_task periodically scans the set of connections from
other nodes and sees if there has been any activity. If there has
been no activity for some time, it will assume that the node is
dead, and send a view message to the client.
*/
alive_task
Looking at the code related to heartbeat sending, we can see that heartbeats are sent in a loop. Source address: L408-L462
int alive_task(task_arg arg MY_ATTRIBUTE((unused))) {
DECL_ENV
pax_msg *i_p;
pax_msg *you_p;
END_ENV;
TASK_BEGIN
ep->i_p = ep->you_p = NULL;
while (!xcom_shutdown) {
{
double sec = task_now();
synode_no alive_synode = get_current_message();
site_def const *site = find_site_def(alive_synode);
/*
If there are some configuration changes, apply them immediately
*/
validate_update_configuration(site, alive_synode);
if (site && get_nodeno(site) != VOID_NODE_NO) {
/*
If the node has not sent heartbeat for a period of time,
Send heartbeat now
*/
if (server_active(site, get_nodeno(site)) < sec - 0.5) {
replace_pax_msg(&ep->i_p, pax_msg_new(alive_synode, site));
ep->i_p->op = i_am_alive_op;
/* This call goes through send_msg to put the heartbeat task onto the queue */
send_to_all_site(site, ep->i_p, "alive_task");
}
/*
If a node doesn't have a heartbeat,
ping it to see if it's alive or dead
*/
{
node_no i;
for (i = 0; i < get_maxnodes(site); i++) {
if (i != get_nodeno(site) && may_be_dead(site->detected, i, sec)) {
replace_pax_msg(&ep->you_p, pax_msg_new(alive_synode, site));
ep->you_p->op = are_you_alive_op;
ep->you_p->a = new_app_data();
ep->you_p->a->app_key.group_id = ep->you_p->a->group_id =
get_group_id(site);
ep->you_p->a->body.c_t = xcom_boot_type;
init_node_list(1, &site->nodes.node_list_val[i],
&ep->you_p->a->body.app_u_u.nodes);
IFDBG(D_DETECT, FN; COPY_AND_FREE_GOUT(
dbg_list(&ep->you_p->a->body.app_u_u.nodes)););
send_server_msg(site, i, ep->you_p);
}
}
}
}
}
TASK_DELAY(1.0);
}
......
}
send_msg
send_msg is a general-purpose method in xcom: whether it is a heartbeat task, a ping to another member, or a binlog event, the message is ultimately pushed into the corresponding queue through this method. Source address: L846-L864
/* Push message into queue*/
int send_msg(server *s, node_no from, node_no to, uint32_t group_id,
pax_msg *p) {
assert(p);
assert(s);
{
/* Build the msg_link queue element for this message */
msg_link *link = msg_link_new(p, to);
IFDBG(D_NONE, FN; PTREXP(&s->outgoing);
COPY_AND_FREE_GOUT(dbg_msg_link(link)););
p->from = from;
p->group_id = group_id;
p->max_synode = get_max_synode();
p->delivered_msg = get_delivered_msg();
IFDBG(D_NONE, FN; PTREXP(p); STREXP(s->srv); NDBG(p->from, d);
NDBG(p->to, d); NDBG(p->group_id, u));
channel_put(&s->outgoing, &link->l);
}
return 0;
}
sender_task
sender_task takes tasks from the queue and sends them over each TCP connection via _send_msg. Source address: L1426-L1570
/* Fetch messages from queue and send to other server. Having a
separate queue and task for doing this simplifies the logic since we
never need to wait to send. */
int sender_task(task_arg arg) {
......
TASK_BEGIN
......
ep->s = (server *)get_void_arg(arg);
ep->link = NULL;
ep->tag = TAG_START;
srv_ref(ep->s);
while (!xcom_shutdown) {
/* Loop until connected */
G_DEBUG("Connecting to %s:%d", ep->s->srv, ep->s->port);
for (;;) {
......
}
G_DEBUG("Connected to %s:%d on fd=%d", ep->s->srv, ep->s->port,
ep->s->con.fd);
/* We are ready to start sending messages.
Insert a message in the input queue to negotiate the protocol.
*/
start_protocol_negotiation(&ep->s->outgoing);
while (is_connected(&ep->s->con)) {
int64_t ret;
assert(!ep->link);
......
if (link_empty(&ep->s->outgoing.data)) {
TASK_CALL(flush_srv_buf(ep->s, &ret));
}
CHANNEL_GET(&ep->s->outgoing, &ep->link, msg_link);
{
int64_t ret_code;
if (ep->link->p) {
......
/* Call _send_msg to send the message to the other server */
TASK_CALL(_send_msg(ep->s, ep->link->p, ep->link->to, &ret_code));
}
}
next:
msg_link_delete(&ep->link);
/* TASK_YIELD; */
}
}
}
_send_msg
Finally, _send_msg writes or flushes the buffer in the following steps. Source address: L256-L322
/* Send a message to server s */
static int _send_msg(server *s, pax_msg *p, node_no to, int64_t *ret) {
......
TASK_CALL(flush_srv_buf(s, ret));
/* or */
TASK_CALL(task_write(&s->con, ep->buf, ep->buflen, &sent));
......
alive(s); /* Note activity */
X_FREE(ep->buf);
/* UNLOCK_FD(s->con.fd, 'w'); */
if (sent <= 0) {
shutdown_connection(&s->con);
}
......
}
Conclusion
Now we can explain why large transactions cause problems in an MGR cluster: each socket connection has only one sender_task, which means only one worker takes tasks from the queue and processes them. When a large transaction is executed, both the memory allocation and the sending of its binlog events take a long time, so the heartbeat packets queued behind it are blocked. The master is therefore mistakenly considered dead by the other members and expelled from the group. After expulsion, the master is automatically switched to read-only mode, so when the client keeps importing SQL the statements obviously fail, and the before_commit error is finally reported.
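This is easy to confirm on the expelled master itself: once it drops into ERROR state it is switched to read-only (the behaviour controlled by group_replication_exit_state_action), which is exactly why the client-side import fails. A quick check, using only standard variables and tables:
-- Run on the expelled member: its own member state and the read-only flags
SELECT MEMBER_STATE
  FROM performance_schema.replication_group_members
 WHERE MEMBER_ID = @@server_uuid;

SELECT @@global.super_read_only, @@global.read_only;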
Solution
The solution is simple:
- Increase group_replication_transaction_size_limit
- Increase group_replication_member_expel_timeout
Remember to restore the original values once the migration is complete, as shown in the example below.
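Concretely, something like the following on each member; both variables are dynamic, so no restart is needed, and the exact values here are only an example:
-- Before the migration: allow larger transactions and give a busy member
-- more time before it is suspected and expelled (example values)
SET GLOBAL group_replication_transaction_size_limit = 0;   -- 0 = no limit
SET GLOBAL group_replication_member_expel_timeout = 60;    -- seconds

-- After the migration: put back whatever your cluster used before
-- (150000000 bytes and 5 seconds are the defaults in recent MySQL 8.0 releases)
SET GLOBAL group_replication_transaction_size_limit = 150000000;
SET GLOBAL group_replication_member_expel_timeout = 5;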
Closing thoughts
Although in the end only two parameters were adjusted, the process was anything but easy: building the environment, reproducing the problem, narrowing down its scope, searching for material, and reading the source code all took a lot of time and energy. Another takeaway is that my English needs work. The official documentation and the source code comments are all in English; many words were unfamiliar, reading was slow, and one-click machine translation often drifts far from the original meaning.