Notes on a server overload incident

Posted by dmikester1 on Mon, 17 Jan 2022 02:40:00 +0100

This morning a colleague called to say that access to our online services was very slow. He had already checked: there was still plenty of memory and CPU usage was not high, but the load average was surprisingly high. I started a series of checks as soon as I arrived.

1. View server resource usage

1.1 top

First I looked with top -c. CPU usage is normal, swap is not in use, plenty of memory is free, there are no zombie processes, and nothing else looks abnormal. The only figure that stands out is the load average of 60 to 70 on a 4-CPU machine.

# top -c
top - 11:26:23 up 364 days, 22:00,  2 users,  load average: 60.13, 70.80, 71.02
Tasks: 327 total,   2 running, 325 sleeping,   0 stopped,   0 zombie
%Cpu0  : 22.9 us,  3.0 sy,  0.0 ni, 73.4 id,  0.3 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu1  :  6.0 us,  2.0 sy,  0.0 ni, 92.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  :  8.0 us,  2.7 sy,  0.0 ni, 89.0 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu3  :  7.7 us,  2.3 sy,  0.0 ni, 90.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  8008668 total,  3247440 free,  2468668 used,  2292560 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  5232164 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND       
13189 test      20   0  173660  19808   5056 R  18.2  0.2   0:06.89 php-fpm: pool www
13307 test      20   0  153120   7620   2256 S   2.6  0.1   1:47.54 nginx: worker process
13174 test      20   0  170408  17136   5156 S   2.0  0.2   0:05.99 php-fpm: pool www
13156 test      20   0  172616  18220   4784 S   1.7  0.2   0:05.98 php-fpm: pool www
......
Press 1 to list the usage of each CPU.
Press P to sort by CPU usage, from high to low.
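
One way to put the load average in context is to divide it by the number of CPUs. The sketch below assumes a Linux host with awk available; load_per_cpu is a hypothetical helper name, not part of the original checks. A result well above 1 suggests that more tasks are running or waiting than the CPUs can serve.

```shell
# load_per_cpu LOAD NCPU: print the load average divided by the CPU count.
# (Hypothetical helper, named here only for illustration.)
load_per_cpu() {
    awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.2f\n", load / cpus }'
}

# The figures from the top output above: a 1-minute load of 60.13 on 4 CPUs.
load_per_cpu 60.13 4

# On a live Linux host, read the 1-minute load from /proc/loadavg instead.
if [ -r /proc/loadavg ]; then
    load_per_cpu "$(cut -d ' ' -f 1 /proc/loadavg)" "$(nproc)"
fi
```

With the numbers from this incident the ratio comes out at roughly 15 per CPU, which is far too high for a machine whose CPUs are mostly idle.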

1.2 vmstat

View the server's I/O information:

# vmstat 2
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 437072 145988 4787532    0    0     0     2    0    0  2  0 97  0  0
 0  0      0 429252 145988 4787552    0    0     0    22 5032 6082  6  1 92  1  0
 0  0      0 432304 145988 4787564    0    0     0     0 4637 5141  7  1 92  0  0
 0  0      0 433544 145988 4787576    0    0     0   106 4385 5080  6  1 93  0  0
 0  0      0 436604 145988 4787588    0    0     0     0 3570 4752  3  1 97  0  0
 1  0      0 436912 145988 4788972    0    0     0    46 4866 6283  5  1 93  0  0
 0  0      0 432680 145988 4788984    0    0     0    76 3882 5349  3  1 97  0  0

From the output above, there is nothing abnormal in the running and sleeping process counts, and I/O is fine. This is strange: every indicator looks normal, yet the load on the server is very high.

1.3 iostat

Next, let's check with iostat:

# iostat
Linux 3.10.0-1127.10.1.el7.x86_64 (tougao-web02) 	07/15/2021 	_x86_64_	(4 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.08    0.00    0.47    0.04    0.00   97.41

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
vda               0.79         0.04         8.42    1273836  265368220

Surprisingly, iostat shows nothing unusual either.

1.4 ps

One last hope: ps.

# ps aux | grep 'D'
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
www       1954  0.0  0.1 158272 13132 ?        D    Jul04   8:34 nginx: worker process
www       1955  0.1  0.1 158096 12976 ?        D    Jul04  18:52 nginx: worker process
www       1956  1.1  0.1 158020 13176 ?        D    Jul04 184:32 nginx: worker process
www       1957  0.4  0.1 158500 13352 ?        D    Jul04  76:12 nginx: worker process
root     10036  0.0  0.0      0     0 ?        D    10:07   0:00 [10.0.4.23-mana]
www      28037  0.1  0.2 352872 19420 ?        D    09:30   0:02 php-fpm: pool www
www      28038  0.0  0.1 272964 15460 ?        D    09:30   0:01 php-fpm: pool www
www      28041  0.0  0.2 276044 17636 ?        D    09:30   0:02 php-fpm: pool www
www      28045  0.0  0.2 276044 18572 ?        D    09:30   0:01 php-fpm: pool www
www      28047  0.0  0.2 275792 17388 ?        D    09:30   0:02 php-fpm: pool www
www      28050  0.1  0.2 276544 17992 ?        D    09:30   0:02 php-fpm: pool www
......
A STAT of D means the process is in uninterruptible sleep, usually waiting on I/O.
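
Note that grep 'D' matches any line containing a capital D, not only the STAT column. A slightly stricter filter can test the STAT field itself. This is a minimal sketch assuming the procps version of ps; dstate_only is a hypothetical helper name.

```shell
# dstate_only: read ps output on stdin (STAT must be the first column) and
# keep the header line plus every row whose STAT starts with D.
# (Hypothetical helper, named here only for illustration.)
dstate_only() {
    awk 'NR == 1 || $1 ~ /^D/'
}

# Put STAT first; wchan shows the kernel function each task is sleeping in,
# which can hint at what the process is waiting for.
ps -eo stat,pid,user,wchan:32,args | dstate_only
```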

At this point the cause is probably clear. 10.0.4.23 is our mounted NFS network storage, which PHP and nginx read from and write to. Linux counts processes in uninterruptible sleep toward the load average, which explains how the load can be high while the CPUs are mostly idle. I went into the NFS mount directory and tried viewing and editing files there: cd lagged noticeably, and editing a file in vim was even worse. So this is essentially where the problem lies.
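
Rather than poking around by hand, a mount can be probed under a time limit. The sketch below assumes GNU coreutils (timeout and stat -t); probe_path is a hypothetical helper, and /mnt/nfs in the comment is a placeholder for the real mount point. Two caveats: a stat may be answered from the client's attribute cache and so miss a stall, and a process stuck on a hard NFS mount may only react to SIGKILL, which is why that signal is used here.

```shell
# probe_path PATH [SECONDS]: stat PATH under a time limit and print
# "ok", "timeout", or "error". (Hypothetical helper.)
probe_path() {
    local path="$1" limit="${2:-5}" rc=0
    timeout -s KILL "$limit" stat -t -- "$path" >/dev/null 2>&1 || rc=$?
    case "$rc" in
        0)       echo "ok" ;;
        124|137) echo "timeout" ;;   # exit codes timeout uses after a kill
        *)       echo "error" ;;
    esac
}

# Harmless demonstration on the root directory; on the affected host you
# would point this at the NFS mount instead, e.g. probe_path /mnt/nfs 5
probe_path / 5
```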

2. Problem solving

The quick fix is to stop PHP and nginx, unmount the NFS network storage, mount it again, and then restart the PHP and nginx services. That resolves the problem for the time being.
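
The recovery sequence can be sketched as a small script. Everything named here is an assumption: the systemd unit names php-fpm and nginx, the mount point /mnt/nfs, and an /etc/fstab entry that lets mount find the export. The script defaults to a dry run that only prints the commands.

```shell
# Hypothetical names: adjust the units and the mount point to your system.
SERVICES="php-fpm nginx"
MOUNT_POINT="/mnt/nfs"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the commands only, 0 = execute them

# run CMD...: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

remount_nfs() {
    run systemctl stop $SERVICES     # intentionally unquoted: two unit names
    run umount "$MOUNT_POINT"
    run mount "$MOUNT_POINT"         # relies on an /etc/fstab entry
    run systemctl start $SERVICES
}

remount_nfs
```

If a plain umount blocks because the server is unresponsive, umount also offers -f (force) and -l (lazy); whether either is appropriate depends on the situation.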

What we need is not only the solution to the problem, but also the cause of the problem and the way to avoid it.

# Query the server's RPC information
rpcinfo -p
100003    2   udp   2049 nfs
100003    3   udp   2049 nfs
100003    2   tcp   2049 nfs
100003    3   tcp   2049 nfs

You can see that NFSv2 and NFSv3 are mixed. In NFSv2 the block size limit is 8K; in NFSv3 the limit depends on the value of NFSSVC_MAXBLKSIZE in the source code (/usr/include/linux/nfsd/const.h), which is also generally 8K. Overall, on the Linux 2.4 kernel the upper limit for NFS is usually 8K. If conditions permit, use a Linux 2.6 kernel where possible, so that the NFS upper limit can reach 32K, which is much better.

Of course, the right block size is sometimes hard to determine: besides NFS's own upper limit, factors such as the network MTU also play a part.
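
To see which version and block sizes a client actually ended up with, the mount options can be read from /proc/mounts. This is a minimal sketch assuming a Linux client; nfs_mount_opts is a hypothetical helper name.

```shell
# nfs_mount_opts: read /proc/mounts-style lines on stdin and print, for each
# NFS mount, the mount point followed by its vers, rsize and wsize options.
# (Hypothetical helper, named here only for illustration.)
nfs_mount_opts() {
    awk '$3 ~ /^nfs4?$/ {
        n = split($4, opts, ",")
        out = $2
        for (i = 1; i <= n; i++)
            if (opts[i] ~ /^(vers|rsize|wsize)=/)
                out = out " " opts[i]
        print out
    }'
}

# Prints nothing on a host without NFS mounts.
if [ -r /proc/mounts ]; then
    nfs_mount_opts < /proc/mounts
fi
```

The nfsstat -m command from nfs-utils reports similar per-mount information, if it is installed.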

I suddenly don't feel like writing any more; for the rest, refer to this link.

Topics: Linux nfs