High MySQL replication delay due to insufficient IO capacity

Posted by swasheck on Wed, 09 Mar 2022 02:35:03 +0100

Background

Received an alert with the following content: MySQL master-slave replication delay is too high (instance: XXXX), current value: 1032.00, threshold: 1000 - 100000000, occurrence time: 12/30/2021 02:27:16.

Analysis process

When replication delay occurs, we generally start by looking at the replication status and the active sessions on the slave. By the time we analyzed it the next morning, the delay was already gone. Fortunately, our monitoring and operations scripts are fairly complete: the monitoring script runs every minute and records both pieces of information whenever a delay is detected. From the retained monitoring logs, the earliest record is at 2021-12-30 02:02:48, with a delay of 9 seconds.
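
For reference, here is a minimal sketch of what such a per-minute monitoring job can look like; the log path and threshold handling are assumptions, not the actual script:

#!/bin/bash
# Minimal sketch of a per-minute replication-delay check (paths and names are assumptions).
LOG=/mysql/scripts/monitor/repl_delay.log
DELAY=$(mysql -e "SHOW SLAVE STATUS\G" | awk '/Seconds_Behind_Master/ {print $2}')
if [ -n "$DELAY" ] && [ "$DELAY" != "NULL" ] && [ "$DELAY" -gt 0 ]; then
    {
        date '+%F %T'
        mysql -e "SHOW SLAVE STATUS\G"     # replication status
        mysql -e "SHOW FULL PROCESSLIST;"  # active sessions
    } >> "$LOG"
fi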

  • View replication status

The replication status shows that both the IO thread and the SQL thread are running normally (Yes), the replication delay (Seconds_Behind_Master) is 9 seconds, and the SQL thread state is Waiting for dependent transaction to commit, i.e. it is waiting for a transaction to commit.

[kingly@rhel7]$ 
mysql > show slave status\G
        Relay_Master_Log_File: master-bin.039806
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 927883985
              Relay_Log_Space: 929519379
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 9
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 10
                  Master_UUID: 157bcec0-fc94-11e9-a77d-48df372eb155
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Waiting for dependent transaction to commit
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp:
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set: 157bcec0-fc94-11e9-a77d-48df372eb155:10146560435-10186081960
            Executed_Gtid_Set: 157bcec0-fc94-11e9-a77d-48df372eb155:1-10186081526,
68eb64a3-fc94-11e9-b9c7-48df372eaef5:1-228645
                Auto_Position: 1
         Replicate_Rewrite_DB:
                 Channel_Name: 
           Master_TLS_Version:

  • View active sessions

Looking at the active sessions, several DELETE statements are in the System lock state. Common causes of System lock include: 1) a large number of small transactions, such as updating a single row in a large table; 2) the table has no primary key; 3) an InnoDB lock-like conflict, where a session on the slave modifies the same data that the SQL thread is applying; 4) IO pressure is too high. In addition, the other SQL replay threads are waiting for transactions to commit.

[kingly@rhel7]$ 
mysql > show processlist;
2021-12-30 02:02:48     NULL    NULL    114314  Waiting for master to send event        NULL
2021-12-30 02:02:48     NULL    NULL    0       Waiting for dependent transaction to commit     NULL
2021-12-30 02:02:48     NULL    NULL    48      System lock     DELETE FROM offlinenews_48 WHERE  id=45702564
2021-12-30 02:02:48     NULL    NULL    48      System lock     DELETE FROM offlinenews_75 WHERE  id=47490222
2021-12-30 02:02:48     NULL    NULL    48      Waiting for preceding transaction to commit     NULL
2021-12-30 02:02:48     NULL    NULL    48      Waiting for preceding transaction to commit     NULL
2021-12-30 02:02:48     NULL    NULL    48      System lock     DELETE FROM offlinenews_84 WHERE  id=46076462
2021-12-30 02:02:48     NULL    NULL    48      Waiting for preceding transaction to commit     NULL
2021-12-30 02:02:48     NULL    NULL    48      Waiting for preceding transaction to commit     NULL
2021-12-30 02:02:48     NULL    NULL    48      System lock     DELETE FROM offlinenews_95 WHERE  id=46705915
2021-12-30 02:02:48     NULL    NULL    48      Waiting for preceding transaction to commit     NULL
2021-12-30 02:02:48     NULL    NULL    48      Waiting for preceding transaction to commit     NULL
2021-12-30 02:02:48     NULL    NULL    48      Waiting for preceding transaction to commit     NULL
2021-12-30 02:02:48     NULL    NULL    48      Waiting for preceding transaction to commit     NULL
2021-12-30 02:02:48     NULL    NULL    48      Waiting for preceding transaction to commit     NULL
2021-12-30 02:02:48     NULL    NULL    48      Waiting for preceding transaction to commit     NULL
2021-12-30 02:02:48     NULL    NULL    48      Waiting for preceding transaction to commit     NULL
2021-12-30 02:02:48     NULL    NULL    48      Waiting for preceding transaction to commit     NULL
2021-12-30 02:02:48     NULL    NULL    49      Waiting for an event from Coordinator   NULL
2021-12-30 02:02:48     NULL    NULL    49      Waiting for an event from Coordinator   NULL
2021-12-30 02:02:48     NULL    NULL    49      Waiting for an event from Coordinator   NULL
2021-12-30 02:02:48     NULL    NULL    55      Waiting for an event from Coordinator   NULL
2021-12-30 02:02:48     NULL    NULL    55      Waiting for an event from Coordinator   NULL
2021-12-30 02:02:48     NULL    NULL    55      Waiting for an event from Coordinator   NULL
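
In addition to the processlist, the state of each replication worker can be checked in performance_schema (a sketch, assuming MySQL 5.7 or later with performance_schema enabled; this output was not retained for this incident):

# Per-worker applier status: service state and last error, if any (columns vary by version).
mysql -e "SELECT * FROM performance_schema.replication_applier_status_by_worker\G"
# Coordinator status for the multi-threaded replica.
mysql -e "SELECT * FROM performance_schema.replication_applier_status_by_coordinator\G"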

Therefore, we needed to check further whether any uncommitted transactions executed on the slave were blocking the SQL thread and these DELETE operations. Analyzing the binlog from around 2:00 a.m. on December 30, 2021 showed no large transactions in that window, only a small number of single-row DELETEs by primary key, which execute quickly. Moreover, the slave is set to read-only and normally does not run any transactions of its own. So the troubleshooting direction switched to the server.
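
The binlog check itself can be done with mysqlbinlog; below is a sketch using the master binlog file name from the slave status above, decoding row events so that the single-row DELETEs become visible:

# A sketch: decode row events in the incident window and count the row operations per table.
mysqlbinlog --base64-output=decode-rows -vv \
    --start-datetime="2021-12-30 01:55:00" \
    --stop-datetime="2021-12-30 02:10:00" \
    master-bin.039806 | grep -E '^### (INSERT|UPDATE|DELETE)' | sort | uniq -c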

  • View server load

Analyzing the vmstat output collected by the deployed OSWBB script shows that at around 02:00 on December 30, 2021 only 1-2 threads were running on the server while a large number of threads (more than 40) were in the blocked state, which is very abnormal, and the wa (IO wait) value in the CPU columns was around 10%, indicating that the server was under heavy IO pressure at that time.

zzz ***Thu Dec 30 02:02:35 CST 2021
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2 42 126088 1259068    404 110934224    0    0   169   166    0    0  0  0 99  0  0
 1 42 126088 1251800    404 110934512    0    0 612144   502 8380 4156  0  0 91  8  0
 2 41 126088 1252672    404 110934576    0    0 620544   358 8844 4115  0  1 90  9  0
zzz ***Thu Dec 30 02:02:50 CST 2021
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1 41 126088 1256476    404 110936800    0    0   169   166    0    0  0  0 99  0  0
 1 41 126088 1249628    404 110937072    0    0 596992  2147 9037 4735  0  0 89 11  0
 2 42 126088 1247176    404 110937536    0    0 630704   605 12100 6991  1  0 87 12  0
zzz ***Thu Dec 30 02:03:05 CST 2021
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1 42 126088 1251024    404 110939968    0    0   169   166    0    0  0  0 99  0  0
 3 37 126088 1242940    404 110940208    0    0 626448   475 8316 3920  0  0 89 10  0
 1 42 126088 1246844    404 110940472    0    0 606976 11179 12130 8121  0  0 88 11  0
zzz ***Thu Dec 30 02:03:20 CST 2021
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2 39 126088 1247112    404 110943920    0    0   169   166    0    0  0  0 99  0  0
 1 43 126088 1240348    404 110944080    0    0 629248   442 9168 4325  0  0 88 11  0
 1 41 126088 1242900    404 110944848    0    0 578576   498 7578 3871  0  0 89 10  0
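
If per-device statistics had been needed to corroborate this, extended iostat output would be the natural next step (a sketch; this was not part of the retained OSWBB data):

# Extended per-device IO statistics every 5 seconds, 12 samples;
# high await and %util on the SAN-backed device would confirm the bottleneck.
iostat -xm 5 12
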
  • Analyze database logs

The database error log also provided clues: a large number of [Note] InnoDB: page_cleaner messages show that page flushing was slow at the time, which further confirms that the server was under excessive IO pressure.

2021-12-30T01:55:17.492848+08:00 2 [Note] Multi-threaded slave statistics for channel '': seconds elapsed = 120; events assigned = 270201857; worker queues filled over overrun level = 0; waited due a Worker queue full = 0; waited due the total size = 0; waited at clock conflicts = 18682298959400 waited (count) when Workers occupied = 2554253 waited when Workers occupied = 6778341143100
2021-12-30T01:57:18.156330+08:00 2 [Note] Multi-threaded slave statistics for channel '': seconds elapsed = 121; events assigned = 270245889; worker queues filled over overrun level = 0; waited due a Worker queue full = 0; waited due the total size = 0; waited at clock conflicts = 18682611334700 waited (count) when Workers occupied = 2554253 waited when Workers occupied = 6778341143100
2021-12-30T01:59:20.712082+08:00 2 [Note] Multi-threaded slave statistics for channel '': seconds elapsed = 122; events assigned = 270290945; worker queues filled over overrun level = 0; waited due a Worker queue full = 0; waited due the total size = 0; waited at clock conflicts = 18683053354600 waited (count) when Workers occupied = 2554253 waited when Workers occupied = 6778341143100
2021-12-30T02:01:20.422439+08:00 2 [Note] Multi-threaded slave statistics for channel '': seconds elapsed = 120; events assigned = 270330881; worker queues filled over overrun level = 0; waited due a Worker queue full = 0; waited due the total size = 0; waited at clock conflicts = 18683501041500 waited (count) when Workers occupied = 2554254 waited when Workers occupied = 6778342662900
2021-12-30T02:01:33.788727+08:00 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 5527ms. The settings might not be optimal. (flushed=74 and evicted=0, during the time.)
2021-12-30T02:01:52.969122+08:00 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 18181ms. The settings might not be optimal. (flushed=76 and evicted=0, during the time.)
2021-12-30T02:02:36.051478+08:00 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 13227ms. The settings might not be optimal. (flushed=73 and evicted=0, during the time.)
2021-12-30T02:03:22.916208+08:00 2 [Note] Multi-threaded slave statistics for channel '': seconds elapsed = 122; events assigned = 270349313; worker queues filled over overrun level = 0; waited due a Worker queue full = 0; waited due the total size = 0; waited at clock conflicts = 18770439067400 waited (count) when Workers occupied = 2554266 waited when Workers occupied = 6779684449500
2021-12-30T02:03:49.254684+08:00 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 13291ms. The settings might not be optimal. (flushed=72 and evicted=0, during the time.)
2021-12-30T02:05:23.176365+08:00 2 [Note] Multi-threaded slave statistics for channel '': seconds elapsed = 121; events assigned = 270361601; worker queues filled over overrun level = 0; waited due a Worker queue full = 0; waited due the total size = 0; waited at clock conflicts = 18859577400400 waited (count) when Workers occupied = 2554270 waited when Workers occupied = 6780589144300
2021-12-30T02:06:25.751734+08:00 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 19666ms. The settings might not be optimal. (flushed=70 and evicted=0, during the time.)
2021-12-30T02:07:38.989406+08:00 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 18069ms. The settings might not be optimal. (flushed=73 and evicted=0, during the time.)
2021-12-30T02:07:44.028187+08:00 2 [Note] Multi-threaded slave statistics for channel '': seconds elapsed = 141; events assigned = 270376961; worker queues filled over overrun level = 0; waited due a Worker queue full = 0; waited due the total size = 0; waited at clock conflicts = 18970653888400 waited (count) when Workers occupied = 2554270 waited when Workers occupied = 6780589144300
2021-12-30T02:08:22.735042+08:00 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 15828ms. The settings might not be optimal. (flushed=76 and evicted=0, during the time.)
2021-12-30T02:09:32.283672+08:00 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 12917ms. The settings might not be optimal. (flushed=75 and evicted=0, during the time.)
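
These page_cleaner warnings mean the flush loop cannot keep up with its 1000ms target; the InnoDB flushing-related settings can be reviewed as follows (a sketch; in this case they did not need tuning, since the bottleneck turned out to be external):

# Flushing-related InnoDB settings (values depend on the instance configuration).
mysql -e "SHOW GLOBAL VARIABLES LIKE 'innodb_io_capacity%';"
mysql -e "SHOW GLOBAL VARIABLES LIKE 'innodb_page_cleaners';"
mysql -e "SHOW GLOBAL VARIABLES LIKE 'innodb_flush_neighbors';"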

The truth of the problem

Given that this is a high-spec server running only the database service (no other applications), hosting a single database instance, with data on SAN storage (rather extravagant), why would there be an IO bottleneck? Combined with the time the problem occurred, I suspected the backup task. The backup strategy is shown below: a full backup runs at midnight every Sunday and an incremental backup runs at 2 o'clock every day.

[kingly@rhel7]$crontab -l
00 00 * * 7 sh /mysql/scripts/backup/mysql_backup.sh FULL 1>/dev/null 2>&1
00 2 * * *   sh /mysql/scripts/backup/mysql_backup.sh INCR 1>/dev/null 2>&1

Why did it only go wrong recently and not before? This database holds several TB of data and the business has grown significantly of late, so backups take a long time. To keep the backup from consuming too much IO and affecting other systems sharing the switch, after discussion with the storage colleagues, the switch port used by this server was rate-limited. As a result, when xtrabackup starts a backup it copies the ibd files directly, producing a very large burst of IO traffic. The server's own IO capability is sufficient, but this traffic backs up at the rate-limited switch port, which in turn slows the server's writes to the SAN. That is why the server showed a lot of IO wait, the SQL threads replayed slowly, and replication delay appeared.
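
One way to mitigate this, assuming mysql_backup.sh calls Percona XtraBackup, is to throttle the backup's own IO rather than (or in addition to) limiting the switch port; the option is real, but the value and paths below are illustrative only:

# A sketch: cap the backup at roughly 100 read/write IO pairs per second (value and paths are illustrative).
xtrabackup --backup --throttle=100 --target-dir=/mysql/backup/incr \
    --incremental-basedir=/mysql/backup/full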

Therefore, the cause of this MySQL replication delay was insufficient IO capability, introduced in this case by the rate-limited switch port rather than by the server itself.

Summary

There are many causes of MySQL replication delay, and I will sort out more cases later. Replication delay caused by insufficient IO capability is relatively rare, but if your server has a low specification or weak IO capability, it is worth watching out for.

================ You are also welcome to follow my articles on other platforms ================
WeChat official account | Lao Wang talks about MySQL
Zhihu | Lao Wang talks about MySQL | https://www.zhihu.com/people/old_king
CSDN | Lao Wang talks about MySQL | https://blog.csdn.net/wly619?spm=1010.2135.3001.5421
Jianshu | Lao Wang talks about MySQL | https://www.jianshu.com/u/80e4c9cd3af5

Topics: Database MySQL SQL