By Ananias Tsalouchidis Translated by: Meng Wei Original text: https://www.percona.com/blog/2020/01/07/how-securing-mysql-with-tcp-wrappers-can-cause-an-outage/
case
Protecting MySQL is always a challenge. There are general best practices for installing reinforcement, but the more complex your settings are, the more likely you are to encounter problems that are difficult to troubleshoot.
We recently looked at a case where MySQL began to become unavailable when active threads were high and exceeded a threshold (but not always the same).
During this period, there are many logs like the following, mysqld does not respond for a few seconds.
2019-11-27T10:26:03.476282Z 7736563 [Note] Got an error writing communication packets 2019-11-27T10:26:03.476305Z 7736564 [Note] Got an error writing communication packets
**"Got an error writing communication packets" * "is a very common log message, which may be caused by many reasons. (for official documents, please refer to the link at the end)
How do we deal with this problem and find the root cause
The first thing to do is to execute a simple loop remotely to determine whether this is a random occurrence, a network problem or a problem related to mysqld itself.
[RDBA] percona@monitoring1: ~ $ time for i in {1..100}; \ do mysql -h 10.0.2.14 -Bsse "show status like '%uptime';"; \ done Uptime 3540 Uptime 3540 Uptime 3540 Uptime 3541 Uptime 3541 Uptime 3541 Uptime 3541 Uptime 3542 Uptime 3542 Uptime 3542 Uptime 3543 Uptime 3543 Uptime 3543 Uptime 3543 Uptime 3543 Uptime 3544 ^C
What I want to do at first is to confirm the behavior of customer reports. Therefore, since all application servers are in remote locations (so the client is linked via TCP), would you like to confirm if any remote connections have been dropped (this is due to network problems)? Or is MySQL not responding for any reason?). You also want to verify if there is a scenario where a connection in X is dropped or dropped after a certain period of time. Identifying scenarios often helps to identify what the root cause is. Another reason to perform this remote connection loop is to verify that the problem occurs only at remote connections or at local connections (local connections will be tested later). In the network layer troubleshooting, no problems were found, so we decided to use another loop to link to mysqld locally through TCP. This test shows that MySQL is really not available (or at least not randomly accessible). Unfortunately, the local connection was not tested through the socket at that time. The network layer is completely bypassed by a socket connection. If you try to connect using a socket, you immediately realize that this is not actually a MySQL problem, because MySQL is always available (so something is blocking the connection at the network level). Here are more details. Continue troubleshooting and netstat shows that many connections are in the time \. Time? Wait indicates that the source side has closed the connection. Here is an example of using netstat to identify a TCP connection in a test environment.
[RDBA] percona@db4-atsaloux: ~ $ sudo netstat -a -t Active Internet connections (servers and established) Proto Recv-Q Send-Q Local Address Foreign Address State tcp 0 0 0.0.0.0:sunrpc 0.0.0.0:* LISTEN tcp 0 0 db4-atsaloux:42000 0.0.0.0:* LISTEN tcp 0 0 localhost:domain 0.0.0.0:* LISTEN tcp 0 0 0.0.0.0:ssh 0.0.0.0:* LISTEN tcp 0 0 0.0.0.0:nrpe 0.0.0.0:* LISTEN tcp 0 0 db4-atsaloux:ssh 10.0.2.10:35230 ESTABLISHED tcp 0 36 db4-atsaloux:ssh 10.0.2.10:39728 ESTABLISHED tcp 0 0 db4-atsaloux:49154 10.0.2.11:mysql ESTABLISHED tcp6 0 0 [::]:mysql [::]:* LISTEN tcp6 0 0 [::]:sunrpc [::]:* LISTEN tcp6 0 0 [::]:ssh [::]:* LISTEN tcp6 0 0 [::]:nrpe [::]:* LISTEN tcp6 0 0 db4-atsaloux:mysql 10.0.2.10:50950 TIME_WAIT tcp6 0 0 db4-atsaloux:mysql 10.0.2.10:50964 TIME_WAIT tcp6 0 0 db4-atsaloux:mysql 10.0.2.10:50938 TIME_WAIT tcp6 0 0 db4-atsaloux:mysql 10.0.2.10:50940 TIME_WAIT tcp6 0 0 db4-atsaloux:mysql 10.0.2.10:51010 TIME_WAIT tcp6 0 0 db4-atsaloux:mysql 10.0.2.10:50994 TIME_WAIT tcp6 0 0 db4-atsaloux:mysql 10.0.2.10:50986 TIME_WAIT tcp6 0 0 db4-atsaloux:mysql 10.0.2.10:44110 ESTABLISHED tcp6 0 0 db4-atsaloux:mysql 10.0.2.10:50984 TIME_WAIT tcp6 0 0 db4-atsaloux:mysql 10.0.2.10:50978 TIME_WAIT tcp6 0 0 db4-atsaloux:mysql 10.0.2.10:51030 TIME_WAIT tcp6 0 0 db4-atsaloux:mysql 10.0.2.10:50954 TIME_WAIT tcp6 0 0 db4-atsaloux:mysql 10.0.2.10:51032 TIME_WAIT tcp6 0 0 db4-atsaloux:mysql 10.0.2.10:51042 TIME_WAIT tcp6 0 0 db4-atsaloux:mysql 10.0.2.10:50996 TIME_WAIT tcp6 0 0 db4-atsaloux:mysql 10.0.2.10:51046 TIME_WAIT tcp6 0 0 db4-atsaloux:mysql 10.0.2.10:51000 TIME_WAIT tcp6 0 0 db4-atsaloux:mysql 10.0.2.10:50942 TIME_WAIT tcp6 0 0 db4-atsaloux:mysql 10.0.2.10:51004 TIME_WAIT tcp6 0 0 db4-atsaloux:mysql 10.0.2.10:44108 ESTABLISHED tcp6 0 0 db4-atsaloux:mysql 10.0.2.10:50992 TIME_WAIT tcp6 0 0 db4-atsaloux:mysql 10.0.2.10:50988 TIME_WAIT
This makes us realize that we may have exhausted the TCP connections on the TCP layer, because the number of TCP sessions has increased, and these sessions remain open until the time ﹐ wait timeout occurs. We wrote a related blog before (please refer to the link at the end of the article). This gives you a good idea of what a "time \. We initially tried to fine tune the port range IP [local] port [range] and adjust some kernel related options, such as TCP [TW] reuse, but unfortunately, we failed and still had the same problem. When checking network traffic, it was found that the host sent a large number of requests to the DNS server defined in / etc/resolv.conf. When it comes to network traffic detection, because the network infrastructure is not managed by us, we are unable to verify some things in the network layer. We confirmed from the customer's IT department that no errors were found at the network level. What we can do is to detect packets of traffic in and out of MySQL. Tcpdump helps to identify a large number of DNS requests and slow responses. The first command used for packet detection on the DB node is tcpdump dst port 3306 or src port 3306, and then use more specific filtering rules to exclude and filter out useless information, such as the traffic between the master server and the slave server. At that time, another thought occurred to verify that mysqld was trying DNS resolution for any reason. This explains one of the first problems. Check the variable skip ﹣ name ﹣ resolve and we find that it is set to ON, so mysqld should not perform any type of DNS resolution.
db4-atsaloux (none)> select @@skip_name_resolve; +---------------------+ | @@skip_name_resolve | +---------------------+ | 1 | +---------------------+ 1 row in set (0.00 sec)
In order to further debug what MySQL actually does, we strace mysqld process.
Root cause
We noticed that the mysqld process accessed the / etc/hosts.allow and / etc/hosts.deny files too frequently.
root@db4-atsaloux:~# strace -e open,read -p$(pidof mysqld) strace: Process 693 attached # /etc/hosts.deny: list of hosts that are _not_ allowed to access the system. read(51, "# /etc/hosts.allow: list of host"..., 4096) = 464 read(51, "", 4096) = 0 # /etc/hosts.allow: list of hosts that are allowed to access the system. read(51, "# /etc/hosts.deny: list of hosts"..., 4096) = 721 read(51, "", 4096) = 0 read(51, "# /etc/hosts.allow: list of host"..., 4096) = 464 read(51, "", 4096) = 0 read(51, "# /etc/hosts.deny: list of hosts"..., 4096) = 721 read(51, "", 4096) = 0 read(51, "# /etc/hosts.allow: list of host"..., 4096) = 464 read(51, "", 4096) = 0 read(51, "# /etc/hosts.deny: list of hosts"..., 4096) = 721 read(51, "", 4096) = 0 read(51, "# /etc/hosts.allow: list of host"..., 4096) = 464 read(51, "", 4096) = 0 read(51, "# /etc/hosts.deny: list of hosts"..., 4096) = 721 read(51, "", 4096) = 0 read(51, "# /etc/hosts.allow: list of host"..., 4096) = 464 read(51, "", 4096) = 0 read(51, "# /etc/hosts.deny: list of hosts"..., 4096) = 721 read(51, "", 4096) = 0 read(51, "# /etc/hosts.allow: list of host"..., 4096) = 464 read(51, "", 4096) = 0 read(51, "# /etc/hosts.deny: list of hosts"..., 4096) = 721 read(51, "", 4096) = 0 read(51, "# /etc/hosts.allow: list of host"..., 4096) = 464 read(51, "", 4096) = 0 read(51, "# /etc/hosts.deny: list of hosts"..., 4096) = 721 read(51, "", 4096) = 0 read(51, "# /etc/hosts.allow: list of host"..., 4096) = 464 read(51, "", 4096) = 0
As we can see, some new connections take a long time to connect to MySQL. strace on mysqld pid shows frequent access to / etc/hosts.allow and / etc/hosts.deny. These files are directly related to TCP wrappers! Many system administrators think that TCP wrappers are outdated software (software development has stopped, but there are many alternatives), but they are still widely used. When using TCP wrappers, you must check each new connection based on the ACL and decide whether to allow remote hosts to connect to the service based on this ACL. During troubleshooting, it is found that DNS resolution has nothing to do with MySQL's skip ﹣ name ﹣ resolve function or mysqld itself. We know that this is not actually a mysqld problem. MySQL has started and can handle requests. Personally, I think it's a fatal weakness of "designed" software. Continue to check and find that there is a wrong DNS configuration / etc/resolv.conf, so when DNS response is slow or DNS does not respond, TCP wrappers make the connection to mysql stagnate or be discarded while waiting for DNS response.
summary
1. If you need to use TCP wrappers for any reason, please always note that any DNS problems may cause a pause or interrupt. 2. With TCP wrappers, even if skip ﹣ name ﹣ resolve is enabled, resolution will occur.
- Even if you don't have TCP wrappers installed, some operating systems, such as some versions of CentOS or Ubuntu, are available by default.
- If you want to set security rules for MySQL, be careful. Note that not every binary can use them. Binaries should be connected to the library of TCP wrappers. 3. If the MySQL service has TCP wrappers enabled and you really need them, you should ensure that DNS response is fast and secure, even if it is not easy to manage.
- Add to the / etc/hosts file for hosts connected to mysqld service, so that no real DNS resolution will be done for each connection.
- There are also some best times to configure DNS resolution. In my opinion, you should configure at least multiple DNS servers in / etc/resolv.conf, and you should use some local servers or servers closest to Linux servers, and possibly enable caching. 4. If you encounter a similar problem, you can check whether mysqld is built for TCP wrappers, execute ldd on mysqld binary, and then check whether it is connected to the TCP wrappers library. If yes, and the same problem is encountered, check the DNS configuration of the system and whether DNS is performing properly.
[RDBA] percona@db4-atsaloux: ~ $ ldd /usr/sbin/mysqld | grep libwrap libwrap.so.0 => /lib/x86_64-linux-gnu/libwrap.so.0 (0x00007fa80532c000)
Related documents and blog links: <Communication Errors and Aborted Connections> https://dev.mysql.com/doc/refman/5.7/en/communication-errors.html <Application Cannot Open Another Connection to MySQL> https://www.percona.com/blog/2014/12/08/what-happens-when-your-application-cannot-open-yet-another-connection-to-mysql/>)