This morning I got a call from a colleague saying that access to our online services was badly delayed. He had already checked: there was still plenty of free memory and CPU usage was not high, yet the load average was surprisingly high. I started a series of checks as soon as I arrived.
1. View server resource usage
1.1 top
First, look at the server with top -c: CPU usage is normal, swap is not in use, there is still plenty of free memory, there are no zombie processes, and nothing else looks abnormal, except the load average, which is very high.
# top -c
top - 11:26:23 up 364 days, 22:00,  2 users,  load average: 60.13, 70.80, 71.02
Tasks: 327 total,   2 running, 325 sleeping,   0 stopped,   0 zombie
%Cpu0  : 22.9 us,  3.0 sy,  0.0 ni, 73.4 id,  0.3 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu1  :  6.0 us,  2.0 sy,  0.0 ni, 92.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  :  8.0 us,  2.7 sy,  0.0 ni, 89.0 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu3  :  7.7 us,  2.3 sy,  0.0 ni, 90.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  8008668 total,  3247440 free,  2468668 used,  2292560 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  5232164 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
13189 test      20   0  173660  19808   5056 R  18.2  0.2   0:06.89 php-fpm: pool www
13307 test      20   0  153120   7620   2256 S   2.6  0.1   1:47.54 nginx: worker process
13174 test      20   0  170408  17136   5156 S   2.0  0.2   0:05.99 php-fpm: pool www
13156 test      20   0  172616  18220   4784 S   1.7  0.2   0:05.98 php-fpm: pool www
......

Inside top, press 1 to show per-CPU usage, and press P to sort processes by CPU usage from high to low.
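For context on why those load figures are alarming: this is a 4-CPU machine, so a load average of 60-71 means roughly fifteen times more runnable-or-blocked tasks than cores. A minimal check to put load in proportion (nothing here is specific to this incident):

# compare the 1/5/15-minute load averages against the core count;
# sustained values far above `nproc` mean tasks are either queuing
# for CPU or stuck waiting on something
nproc
cat /proc/loadavg
uptime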
1.2 vmstat
Next, use vmstat to check the server's IO activity:
# vmstat 2
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 437072 145988 4787532    0    0     0     2    0    0  2  0 97  0  0
 0  0      0 429252 145988 4787552    0    0     0    22 5032 6082  6  1 92  1  0
 0  0      0 432304 145988 4787564    0    0     0     0 4637 5141  7  1 92  0  0
 0  0      0 433544 145988 4787576    0    0     0   106 4385 5080  6  1 93  0  0
 0  0      0 436604 145988 4787588    0    0     0     0 3570 4752  3  1 97  0  0
 1  0      0 436912 145988 4788972    0    0     0    46 4866 6283  5  1 93  0  0
 0  0      0 432680 145988 4788984    0    0     0    76 3882 5349  3  1 97  0  0
From this output, the running (r) and blocked (b) process counts look normal, and IO traffic is light. This is strange: by these indicators everything looks healthy, yet the load is very high.
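The b column is worth watching over time, since it counts processes in uninterruptible sleep. A small sketch for sampling just those two columns (the awk filter is my own addition, not part of the original session):

# sample vmstat 5 times, 2 seconds apart, printing only the
# r (runnable) and b (uninterruptible sleep) columns; a persistently
# non-zero b points at tasks blocked on IO
vmstat 2 5 | awk 'NR > 2 { print "r=" $1, "b=" $2 }'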
1.3 iostat
Let's follow up with iostat
# iostat
Linux 3.10.0-1127.10.1.el7.x86_64 (tougao-web02)   07/15/2021   _x86_64_   (4 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.08    0.00    0.47    0.04    0.00   97.41

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
vda               0.79         0.04         8.42    1273836  265368220
Unexpectedly, iostat shows nothing unusual either: the local disk is almost idle.
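When plain iostat looks clean, extended per-device statistics are still worth one look before moving on; a sketch (not part of the original session):

# extended stats every 2 seconds, 5 samples: await (average wait per
# request) and %util (device saturation) would expose a struggling
# local disk even when throughput is low
iostat -x 2 5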
1.4 ps
With one last hope, try ps and look for processes stuck in the D state:
# ps aux | grep 'D'
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
www       1954  0.0  0.1 158272 13132 ?        D    Jul04   8:34 nginx: worker process
www       1955  0.1  0.1 158096 12976 ?        D    Jul04  18:52 nginx: worker process
www       1956  1.1  0.1 158020 13176 ?        D    Jul04 184:32 nginx: worker process
www       1957  0.4  0.1 158500 13352 ?        D    Jul04  76:12 nginx: worker process
root     10036  0.0  0.0      0     0 ?        D    10:07   0:00 [10.0.4.23-mana]
www      28037  0.1  0.2 352872 19420 ?        D    09:30   0:02 php-fpm: pool www
www      28038  0.0  0.1 272964 15460 ?        D    09:30   0:01 php-fpm: pool www
www      28041  0.0  0.2 276044 17636 ?        D    09:30   0:02 php-fpm: pool www
www      28045  0.0  0.2 276044 18572 ?        D    09:30   0:01 php-fpm: pool www
www      28047  0.0  0.2 275792 17388 ?        D    09:30   0:02 php-fpm: pool www
www      28050  0.1  0.2 276544 17992 ?        D    09:30   0:02 php-fpm: pool www
......

A STAT of D means the process is in uninterruptible sleep, almost always because it is blocked waiting for IO to complete.
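On Linux, D-state tasks count toward the load average just like runnable ones, which is exactly how load can be sky-high while the CPU sits idle. A sketch for counting them precisely and peeking at what one is blocked on (the PID is taken from the listing above; reading /proc/PID/stack requires root):

# count tasks currently in uninterruptible sleep; matching on the
# state column avoids the false positives of a bare grep 'D'
ps -eo state,pid,cmd | awk '$1 ~ /^D/ { print; n++ } END { print n+0, "tasks in D state" }'

# the kernel stack of one stuck task shows where it is blocked
# (in this incident one would expect nfs/rpc functions here)
cat /proc/1954/stack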
Seeing this, the cause is fairly clear: 10.0.4.23 is our mounted NFS network storage, and both php-fpm and nginx read and write data there. Going into the NFS mount directory to view or edit files confirms it: even a cd into the directory stutters, and editing a file with vim is noticeably worse. So this is essentially where the problem lies: the workers are piling up in uninterruptible sleep, all waiting on the NFS mount.
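A quick way to confirm a suspect NFS mount without hanging your shell; a sketch, where the mount point /data/nfs is a made-up example (substitute your own):

# list NFS mounts and their options
mount | grep nfs

# probe the mount under a timeout so a hung mount cannot hang the shell
timeout 5 ls /data/nfs    || echo "NFS mount not responding"   # hypothetical path
timeout 5 df -h /data/nfs || echo "NFS mount not responding"   # hypothetical path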
2. Problem solving
The quick recovery was: stop php-fpm and nginx, umount the NFS share, mount it again, then restart php-fpm and nginx. That resolved the problem, at least temporarily.
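A minimal sketch of that sequence, assuming systemd service names and a hypothetical mount point (adjust both to your environment; the export path is also an assumption):

# stop everything holding files open on the mount
systemctl stop nginx php-fpm

# unmount the stale share; fall back to a lazy unmount if it is stuck
umount /data/nfs || umount -l /data/nfs     # hypothetical mount point

# remount (or `mount -a` if the share is listed in /etc/fstab)
mount -t nfs 10.0.4.23:/export /data/nfs    # export path is an assumption

# bring the services back up
systemctl start php-fpm nginx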
What we need, though, is not only a fix, but also the cause of the problem and a way to keep it from recurring.
# Query the server's RPC registrations
# rpcinfo -p
    100003    2   udp   2049  nfs
    100003    3   udp   2049  nfs
    100003    2   tcp   2049  nfs
    100003    3   tcp   2049  nfs
You can see that both NFSv2 and NFSv3 are registered. Under NFSv2 the transfer size limit is 8 KB; under NFSv3 the limit is set by NFSSVC_MAXBLKSIZE in the kernel source (/usr/include/linux/nfsd/const.h), which is also commonly 8 KB. In general, on Linux 2.4 kernels the NFS transfer size tops out around 8 KB; if conditions permit, use a Linux 2.6 kernel, where the limit can reach 32 KB, which performs considerably better.
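To see which protocol version and transfer sizes a client actually negotiated for each mount, nfsstat can report the live mount options (run on the NFS client):

# per-mount NFS statistics, including the vers=, rsize= and wsize=
# values as actually negotiated with the server
nfsstat -m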
Of course, it is not always easy to settle on a block size: besides NFS's own upper limit, factors such as the network MTU also come into play.
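As an illustration, the transfer sizes can be requested explicitly at mount time; a sketch with an assumed export and mount point (the kernel silently clamps rsize/wsize to what server and client actually support):

# request 32 KB read/write transfers over NFSv3/TCP
mount -t nfs -o vers=3,proto=tcp,rsize=32768,wsize=32768 \
    10.0.4.23:/export /data/nfs    # export and mount point are assumptions

# verify what was actually negotiated
grep nfs /proc/mounts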
I've suddenly run out of steam here; for the rest, refer to this link.