Background
Received an alarm with the content: MySQL master-slave replication delay is too high (instance: XXXX), current value: 1032.00, threshold: 1000 - 100000000, occurrence time: 2021-12-30 02:27:16.
Analysis process
Generally, when a replication delay occurs, the first things to look at are the replication status and the active sessions on the slave. By the time we analyzed the issue the next morning, the delay had already disappeared. Fortunately, our monitoring and operations scripts are fairly complete: the monitoring script runs every minute and records both pieces of information whenever a delay is detected. From the retained monitoring logs, the earliest record is at 2021-12-30 02:02:48, with a delay of 9 seconds.
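For reference, the per-minute check works roughly along these lines. This is only a minimal sketch, assuming a passwordless local login via ~/.my.cnf and a hypothetical log path; it is not the actual script used here.

#!/bin/bash
# Minimal sketch of a per-minute delay check (paths and login method are assumptions).
LOG=/mysql/scripts/monitor/replication_delay.log   # hypothetical location
TS=$(date '+%F %T')
DELAY=$(mysql -e "SHOW SLAVE STATUS\G" | awk '/Seconds_Behind_Master/ {print $2}')
# Record replication status and active sessions only when a delay is observed.
if [ -n "$DELAY" ] && [ "$DELAY" != "NULL" ] && [ "$DELAY" -gt 0 ]; then
    {
        echo "=== $TS  Seconds_Behind_Master=$DELAY ==="
        mysql -e "SHOW SLAVE STATUS\G"
        mysql -e "SHOW FULL PROCESSLIST;"
    } >> "$LOG"
fi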
- View replication status
The replication status shows that both the IO thread and the SQL thread are running normally (Yes), the replication delay Seconds_Behind_Master is 9 seconds, and the SQL thread state is Waiting for dependent transaction to commit, i.e. it is waiting for a transaction to commit.
[kingly@rhel7]$ mysql > show slave status\G
        Relay_Master_Log_File: master-bin.039806
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 927883985
              Relay_Log_Space: 929519379
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 9
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 10
                  Master_UUID: 157bcec0-fc94-11e9-a77d-48df372eb155
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Waiting for dependent transaction to commit
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp:
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set: 157bcec0-fc94-11e9-a77d-48df372eb155:10146560435-10186081960
            Executed_Gtid_Set: 157bcec0-fc94-11e9-a77d-48df372eb155:1-10186081526,
                               68eb64a3-fc94-11e9-b9c7-48df372eaef5:1-228645
                Auto_Position: 1
         Replicate_Rewrite_DB:
                 Channel_Name:
           Master_TLS_Version:
- View active sessions
From the active sessions, several DELETE statements are in the System lock state. Common causes of System lock include: 1) a large number of small transactions, such as updating a single row of a large table; 2) tables without a primary key; 3) an InnoDB-lock-like conflict, where a session on the slave modifies the same data as the SQL thread; 4) excessive IO pressure. In addition, the other SQL replay (worker) threads are waiting for transactions to commit. (A quick check for cause 2 is sketched after the process list below.)
[kingly@rhel7]$ mysql > show processlist;
2021-12-30 02:02:48  NULL  NULL  114314  Waiting for master to send event             NULL
2021-12-30 02:02:48  NULL  NULL  0       Waiting for dependent transaction to commit  NULL
2021-12-30 02:02:48  NULL  NULL  48      System lock                                  DELETE FROM offlinenews_48 WHERE id=45702564
2021-12-30 02:02:48  NULL  NULL  48      System lock                                  DELETE FROM offlinenews_75 WHERE id=47490222
2021-12-30 02:02:48  NULL  NULL  48      Waiting for preceding transaction to commit  NULL
2021-12-30 02:02:48  NULL  NULL  48      Waiting for preceding transaction to commit  NULL
2021-12-30 02:02:48  NULL  NULL  48      System lock                                  DELETE FROM offlinenews_84 WHERE id=46076462
2021-12-30 02:02:48  NULL  NULL  48      Waiting for preceding transaction to commit  NULL
2021-12-30 02:02:48  NULL  NULL  48      Waiting for preceding transaction to commit  NULL
2021-12-30 02:02:48  NULL  NULL  48      System lock                                  DELETE FROM offlinenews_95 WHERE id=46705915
2021-12-30 02:02:48  NULL  NULL  48      Waiting for preceding transaction to commit  NULL
2021-12-30 02:02:48  NULL  NULL  48      Waiting for preceding transaction to commit  NULL
2021-12-30 02:02:48  NULL  NULL  48      Waiting for preceding transaction to commit  NULL
2021-12-30 02:02:48  NULL  NULL  48      Waiting for preceding transaction to commit  NULL
2021-12-30 02:02:48  NULL  NULL  48      Waiting for preceding transaction to commit  NULL
2021-12-30 02:02:48  NULL  NULL  48      Waiting for preceding transaction to commit  NULL
2021-12-30 02:02:48  NULL  NULL  48      Waiting for preceding transaction to commit  NULL
2021-12-30 02:02:48  NULL  NULL  48      Waiting for preceding transaction to commit  NULL
2021-12-30 02:02:48  NULL  NULL  49      Waiting for an event from Coordinator        NULL
2021-12-30 02:02:48  NULL  NULL  49      Waiting for an event from Coordinator        NULL
2021-12-30 02:02:48  NULL  NULL  49      Waiting for an event from Coordinator        NULL
2021-12-30 02:02:48  NULL  NULL  55      Waiting for an event from Coordinator        NULL
2021-12-30 02:02:48  NULL  NULL  55      Waiting for an event from Coordinator        NULL
2021-12-30 02:02:48  NULL  NULL  55      Waiting for an event from Coordinator        NULL
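As a quick way to rule out cause 2 above (tables without a primary key), a check like the following can be run against information_schema; this query is a generic sketch and was not part of the original troubleshooting output:

[kingly@rhel7]$ mysql -e "
SELECT t.table_schema, t.table_name
  FROM information_schema.tables t
  LEFT JOIN information_schema.table_constraints c
         ON c.table_schema = t.table_schema
        AND c.table_name = t.table_name
        AND c.constraint_type = 'PRIMARY KEY'
 WHERE t.table_type = 'BASE TABLE'
   AND t.table_schema NOT IN ('mysql','sys','information_schema','performance_schema')
   AND c.constraint_name IS NULL;"

Here the DELETE statements filter on id, single-row and by primary key, so missing primary keys were unlikely to be the issue.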
Therefore, the next step is to check whether any uncommitted transaction executed on the slave itself was blocking the SQL thread's replay of these DELETE operations. Analyzing the binlog around 02:00 on 2021-12-30 shows no large transactions in that window, only a small number of single-row, primary-key-based DELETEs, which should execute quickly. Moreover, the slave is set to read-only and normally does not execute any transactions of its own. So the troubleshooting focus shifted to the server itself.
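For reference, the binlog inspection can be done roughly as follows, assuming the corresponding binlog file (master-bin.039806, from Relay_Master_Log_File above) is available locally; the exact commands used in the original analysis are not shown:

[kingly@rhel7]$ mysqlbinlog -vv --base64-output=decode-rows \
    --start-datetime="2021-12-30 01:55:00" --stop-datetime="2021-12-30 02:10:00" \
    master-bin.039806 > /tmp/binlog_0200.sql
# Count the decoded DELETE row events and eyeball transaction boundaries
# for anything unusually large in this window.
[kingly@rhel7]$ grep -cE '^### DELETE FROM' /tmp/binlog_0200.sql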
- View server load
Analysis of the vmstat output collected by the deployed OSWBB script shows that around 02:00 on 2021-12-30 only 1-2 threads were running on the server while a large number of threads (more than 40) were in the blocked state, which is very abnormal; the CPU columns also show roughly 10% of CPU time in IO wait. This indicates that the server was under heavy IO pressure at the time.
zzz ***Thu Dec 30 02:02:35 CST 2021
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd    free buff     cache si so     bi    bo    in   cs us sy id wa st
 2 42 126088 1259068  404 110934224  0  0    169   166     0    0  0  0 99  0  0
 1 42 126088 1251800  404 110934512  0  0 612144   502  8380 4156  0  0 91  8  0
 2 41 126088 1252672  404 110934576  0  0 620544   358  8844 4115  0  1 90  9  0
zzz ***Thu Dec 30 02:02:50 CST 2021
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd    free buff     cache si so     bi    bo    in   cs us sy id wa st
 1 41 126088 1256476  404 110936800  0  0    169   166     0    0  0  0 99  0  0
 1 41 126088 1249628  404 110937072  0  0 596992  2147  9037 4735  0  0 89 11  0
 2 42 126088 1247176  404 110937536  0  0 630704   605 12100 6991  1  0 87 12  0
zzz ***Thu Dec 30 02:03:05 CST 2021
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd    free buff     cache si so     bi    bo    in   cs us sy id wa st
 1 42 126088 1251024  404 110939968  0  0    169   166     0    0  0  0 99  0  0
 3 37 126088 1242940  404 110940208  0  0 626448   475  8316 3920  0  0 89 10  0
 1 42 126088 1246844  404 110940472  0  0 606976 11179 12130 8121  0  0 88 11  0
zzz ***Thu Dec 30 02:03:20 CST 2021
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd    free buff     cache si so     bi    bo    in   cs us sy id wa st
 2 39 126088 1247112  404 110943920  0  0    169   166     0    0  0  0 99  0  0
 1 43 126088 1240348  404 110944080  0  0 629248   442  9168 4325  0  0 88 11  0
 1 41 126088 1242900  404 110944848  0  0 578576   498  7578 3871  0  0 89 10  0
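Device-level statistics (not retained in the OSWBB logs shown here) would show the same picture from the disk side; a typical way to sample them, assuming the sysstat package is installed:

[kingly@rhel7]$ iostat -xm 5 3
# Watch %util and await on the SAN LUNs backing the MySQL datadir:
# sustained %util close to 100% together with rising await means the device,
# or the path to it, is saturated rather than the CPU.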
- Analyze database logs
Looking at the database error log also turns up clues: a large number of [Note] InnoDB: page_cleaner messages indicate that page flushing was slow at the time, which further confirms that the server's IO pressure was too high.
2021-12-30T01:55:17.492848+08:00 2 [Note] Multi-threaded slave statistics for channel '': seconds elapsed = 120; events assigned = 270201857; worker queues filled over overrun level = 0; waited due a Worker queue full = 0; waited due the total size = 0; waited at clock conflicts = 18682298959400 waited (count) when Workers occupied = 2554253 waited when Workers occupied = 6778341143100
2021-12-30T01:57:18.156330+08:00 2 [Note] Multi-threaded slave statistics for channel '': seconds elapsed = 121; events assigned = 270245889; worker queues filled over overrun level = 0; waited due a Worker queue full = 0; waited due the total size = 0; waited at clock conflicts = 18682611334700 waited (count) when Workers occupied = 2554253 waited when Workers occupied = 6778341143100
2021-12-30T01:59:20.712082+08:00 2 [Note] Multi-threaded slave statistics for channel '': seconds elapsed = 122; events assigned = 270290945; worker queues filled over overrun level = 0; waited due a Worker queue full = 0; waited due the total size = 0; waited at clock conflicts = 18683053354600 waited (count) when Workers occupied = 2554253 waited when Workers occupied = 6778341143100
2021-12-30T02:01:20.422439+08:00 2 [Note] Multi-threaded slave statistics for channel '': seconds elapsed = 120; events assigned = 270330881; worker queues filled over overrun level = 0; waited due a Worker queue full = 0; waited due the total size = 0; waited at clock conflicts = 18683501041500 waited (count) when Workers occupied = 2554254 waited when Workers occupied = 6778342662900
2021-12-30T02:01:33.788727+08:00 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 5527ms. The settings might not be optimal. (flushed=74 and evicted=0, during the time.)
2021-12-30T02:01:52.969122+08:00 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 18181ms. The settings might not be optimal. (flushed=76 and evicted=0, during the time.)
2021-12-30T02:02:36.051478+08:00 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 13227ms. The settings might not be optimal. (flushed=73 and evicted=0, during the time.)
2021-12-30T02:03:22.916208+08:00 2 [Note] Multi-threaded slave statistics for channel '': seconds elapsed = 122; events assigned = 270349313; worker queues filled over overrun level = 0; waited due a Worker queue full = 0; waited due the total size = 0; waited at clock conflicts = 18770439067400 waited (count) when Workers occupied = 2554266 waited when Workers occupied = 6779684449500
2021-12-30T02:03:49.254684+08:00 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 13291ms. The settings might not be optimal. (flushed=72 and evicted=0, during the time.)
2021-12-30T02:05:23.176365+08:00 2 [Note] Multi-threaded slave statistics for channel '': seconds elapsed = 121; events assigned = 270361601; worker queues filled over overrun level = 0; waited due a Worker queue full = 0; waited due the total size = 0; waited at clock conflicts = 18859577400400 waited (count) when Workers occupied = 2554270 waited when Workers occupied = 6780589144300
2021-12-30T02:06:25.751734+08:00 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 19666ms. The settings might not be optimal. (flushed=70 and evicted=0, during the time.)
2021-12-30T02:07:38.989406+08:00 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 18069ms. The settings might not be optimal. (flushed=73 and evicted=0, during the time.)
2021-12-30T02:07:44.028187+08:00 2 [Note] Multi-threaded slave statistics for channel '': seconds elapsed = 141; events assigned = 270376961; worker queues filled over overrun level = 0; waited due a Worker queue full = 0; waited due the total size = 0; waited at clock conflicts = 18970653888400 waited (count) when Workers occupied = 2554270 waited when Workers occupied = 6780589144300
2021-12-30T02:08:22.735042+08:00 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 15828ms. The settings might not be optimal. (flushed=76 and evicted=0, during the time.)
2021-12-30T02:09:32.283672+08:00 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 12917ms. The settings might not be optimal. (flushed=75 and evicted=0, during the time.)
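When page_cleaner warnings like these pile up, it is also worth checking how the InnoDB flush settings compare with what the storage can actually deliver; a generic check (not part of the original analysis) looks like this:

[kingly@rhel7]$ mysql -e "SHOW GLOBAL VARIABLES WHERE Variable_name IN
    ('innodb_io_capacity','innodb_io_capacity_max',
     'innodb_page_cleaners','innodb_buffer_pool_instances',
     'innodb_lru_scan_depth');"

In this incident the settings themselves were not the problem; as shown next, the IO path to the storage was being throttled.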
The truth of the problem
Given that this is a high-spec server dedicated to the database (no other application services run on it), hosting a single database instance, with data on SAN storage (rather extravagant), why would there be an IO bottleneck? Combined with the time the problem occurred, my suspicion fell on the backup task. The backup strategy is shown below: a full backup every Sunday at midnight and an incremental backup at 02:00 every day.
[kingly@rhel7]$ crontab -l
00 00 * * 7 sh /mysql/scripts/backup/mysql_backup.sh FULL 1>/dev/null 2>&1
00 2 * * * sh /mysql/scripts/backup/mysql_backup.sh INCR 1>/dev/null 2>&1
Why did this only start happening recently? The database holds several TB of data and the business has grown considerably of late, so backups take a long time. To keep the backup from consuming too much IO bandwidth and affecting other systems that share the SAN switch, we agreed with the storage team to rate-limit this server's switch port. As a result, when xtrabackup starts and copies the ibd files directly, it generates a very large burst of IO traffic; the server's own IO capability is sufficient, but this traffic backs up at the rate-limited switch port and interferes with the server's writes to the SAN. Hence the heavy IO wait seen on the server, the slow replay by the SQL threads, and the resulting replication delay.
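Besides revisiting the switch port limit, one way to soften the burst is to throttle the backup itself; xtrabackup supports limiting IO operations per second, for example (the target directory and values below are illustrative, not taken from the original backup script):

[kingly@rhel7]$ xtrabackup --backup --target-dir=/backup/$(date +%F) \
    --throttle=100 --parallel=2
# --throttle caps the number of read/write IO pairs per second during --backup,
# spreading the copy of the ibd files over a longer window instead of one burst.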
In short, the cause of this MySQL replication delay was insufficient IO capacity on the storage path.
Summary
There are many possible causes of MySQL replication delay, and I will sort out more cases later. Replication delay caused by insufficient IO capacity is probably relatively rare, but if your server configuration is low or its IO capability is weak, it is worth keeping an eye on.
================ You are also welcome to follow my articles on other platforms ================
WeChat Official Account | Lao Wang Talks About MySQL
Zhihu | Lao Wang Talks About MySQL | https://www.zhihu.com/people/old_king
CSDN | Lao Wang Talks About MySQL | https://blog.csdn.net/wly619?spm=1010.2135.3001.5421
Jianshu | Lao Wang Talks About MySQL | https://www.jianshu.com/u/80e4c9cd3af5