Slve Delay Monitoring in MySQL Replication

Posted by colake on Sat, 01 Jun 2019 02:45:21 +0200


In MySQL replication environments, we usually judge SLAVE latency only based on the value of Seconds_Behind_Master. This is acceptable in most cases, but not accurate enough, and more factors should be taken into account.


First, let's look at the state of SLAVE:

> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
***
Master_Log_File: mysql-bin.000327
Read_Master_Log_Pos: 668711237
Relay_Log_File: mysql-relay-bin.002999
Relay_Log_Pos: 214736858
Relay_Master_Log_File: mysql-bin.000327
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
***
Skip_Counter: 0
Exec_Master_Log_Pos: 654409041
Relay_Log_Space: 229039311
***
Seconds_Behind_Master: 3296

You can see that the value of Seconds_Behind_Master is 3296, which means that SLAVE is delayed by at least 3296 seconds.

Let's look at the two REPLICATION process states on SLAVE:

> show full processlist\G
*************************** 1. row ***************************
Id: 6
User: system user
Host:
db: NULL
Command: Connect
Time: 22005006
State: Waiting for master to send event
Info: NULL
*************************** 2. row ***************************
Id: 7
User: system user
Host:
db: NULL
Command: Connect
Time: 3293
State: Updating
Info: UPDATE ** SET ** WHERE **

You can see that the SQL thread has been executing the UPDATE operation. Note that the value of Time is 3293. It looks like the UPDATE operation has been executing for 3293 seconds. A normal SQL is just a simple one. It certainly won't take so long.
In fact, in the REPLICATION process, the value of the Time column may have several situations:
1. The difference between the timestamp in the binlog (actually relay log) currently executed by the SQL thread and the latest timestamp of the IO thread, which is commonly considered as the value of econds_Behind_Master, is not the actual execution time-consuming of a certain SQL;
2. If the active SQL thread is not currently executed, the Time value of the SQL thread is idle time.

The Time value of the IO thread is the total Time (how many seconds) that the thread has spent since it started. If the system Time changes after the IO thread started, it may cause the Time value to be abnormal, such as negative, or very large.

Let's look at the following states:

#Set pager to view only a few status values of concern
> pager cat | egrep -i 'system user|Exec_Master_Log_Pos|Seconds_Behind_Master|Read_Master_Log_Pos'

#This is the case where there is no active SQL, the Time value is idle time, and Seconds_Behind_Master is 0
> show processlist; show slave status\G
| 6 | system user | | NULL | Connect | 22004245 | Waiting for master to send event | NULL |
| 7 | system user | | NULL | Connect | 13 | Has read all relay log;**
Read_Master_Log_Pos: 445167889
Exec_Master_Log_Pos: 445167889
Seconds_Behind_Master: 0

#Same as above
> show processlist; show slave status\G
| 6 | system user | | NULL | Connect | 22004248 | Waiting for master to send event | NULL |
| 7 | system user | | NULL | Connect | 16 | Has read all relay log;**
Read_Master_Log_Pos: 445167889
Exec_Master_Log_Pos: 445167889
Seconds_Behind_Master: 0

#There is active SQL, and the Time value is the same as that of Econds_Behind_Master, which means that the SQL thread is "slower" than the IO thread by one second.
> show processlist; show slave status\G
| 6 | system user | | NULL | Connect | 22004252 | Waiting for master to send event | NULL |
| 7 | system user | | floweradmin | Connect | 1 | Updating | update **
Read_Master_Log_Pos: 445182239
Exec_Master_Log_Pos: 445175263
Seconds_Behind_Master: 1

#Same as above
> show processlist; show slave status\G
| 6 | system user | | NULL | Connect | 22004254 | Waiting for master to send event | NULL |
| 7 | system user | | floweradmin | Connect | 1 | Updating | update **
Read_Master_Log_Pos: 445207174
Exec_Master_Log_Pos: 445196837
Seconds_Behind_Master: 1

Well, finally, let's talk about how to correctly judge the delay of SLAVE:
1. First, see if there are differences between Relay_Master_Log_File and Master_Log_File.
2. If Relay_Master_Log_File and Master_Log_File are the same, then look at the differences between Exec_Master_Log_Pos and Read_Master_Log_Pos, and compare how many binlog events are slower in SQL threads than in IO threads.
3. If Relay_Master_Log_File is different from Master_Log_File, it means that the delay may be large and binlog needs to be obtained from MASTER. status, judging the current gap between binlog and MASTER;


Therefore, a more rigorous approach is to:
On the third-party monitoring node, SHOW BINARY LOGS and SHOW are launched simultaneously for MASTER and SLAVE SLAVE STATUSG request, and finally judge the difference between the binlog and Exec_Master_Log_Pos and Read_Master_Log_Pos.

For example:
The results of executing SHOW BINARY LOGS on MASTER are as follows:

+------------------+--------------+
| Log_name | File_size |
+------------------+--------------+
| mysql-bin.000009 | 1073742063 |
| mysql-bin.000010 | 107374193 |
+------------------+--------------+

On SLAVE, SHOW is executed The results of SLAVE STATUSG are as follows:

Master_Log_File: mysql-bin.000009
 Read_Master_Log_Pos: 668711237
Relay_Master_Log_File: mysql-bin.000009
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
***
Exec_Master_Log_Pos: 654409041

***
Seconds_Behind_Master: 3296
***

At this point, the actual delay of SLAVE should be:
mysql-bin.000009 binlog in this binlog position 1073742063 and The difference delay between binlog position s read on SLAVE is:

1073742063 - 654409041 = 419333022 individual binlog event

And add mysql-bin.000010, a binlog that has generated 107374193 binlogs. event, altogether

107374193 + 419333022 = 526707215 binlog event s


Update of postscript:

1. A monitor table can be maintained on MASTER. It has only one field to store the latest timestamp (event_scheduler can be used to update the high version and cron can be used to update the low version with automatic loop script). The time of reading the field on SLAVE can be quickly known as long as the system time of MASTER and SLAVE is the same. However, in highly concurrent systems, this timestamp can be refined to milliseconds, otherwise even if the time is consistent, it may delay thousands of binlog event s.

2. The calculation delay on careful expenditure is wrong. It should be the maximum number of events in mysql-bin.000009 minus the number of events that have been executed, that is, 1073742063-654409041= 419333022 binlog events, plus 107374193 binlog events produced by mysql-bin.000010 binlog, totaling 526707215 binlog events.


Topics: MySQL SQL