A basic approach to troubleshooting a Linux server

Posted by xymbo on Wed, 02 Mar 2022 04:09:02 +0100

Server architecture

The server runs CentOS 7.

First, you need to know the system's external architecture.

Common architectures:

1. Domain name -> ECS

2. Domain name -> CDN -> object storage (OSS)

3. Domain name -> CDN -> ECS + RDS + Redis cache

4. Domain name -> CDN -> load balancer -> ECS + RDS (master-slave) + Redis cache

5. Domain name -> CDN -> WAF firewall -> load balancer -> ECS + RDS (master-slave) + Redis cache

Then troubleshoot step by step according to the actual setup.

Discovering problems

1. Find the problem

First, find out as early as possible which service has the problem, so that it can be located quickly, and identify the affected domain name and device. Problems are usually reported through:

1. Zabbix monitoring alerts sent via DingTalk

2. Alibaba Cloud monitoring alarm SMS, for example:

[Alibaba Cloud] Dear ***, Cloud Monitor - Cloud Database RDS edition <South China 1 (Shenzhen)-*****-read-only> raised an alarm at <09:54>: CPU utilization (91.88% >= 80%), lasting 4 minutes

3. Shell script email alarms

4. Reports from colleagues

Customer service and marketing colleagues report problems by phone.
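Item 3 can be a very small script run from cron. The sketch below checks disk usage with df and mails an alarm; the 80% threshold, the ALERT_TO variable and the use of the mail command (from the mailx package) are assumptions, and without a recipient it simply prints the message.

```shell
#!/usr/bin/env bash
# Minimal disk-usage alarm sketch. THRESHOLD and ALERT_TO are assumptions.
THRESHOLD=${THRESHOLD:-80}

# Read `df -P` output on stdin and print "mountpoint usage%" for every
# filesystem at or above the threshold given as $1.
disk_alerts() {
  awk -v t="$1" 'NR > 1 {
    use = $5
    sub(/%/, "", use)
    if (use + 0 >= t) print $6 " " use "%"
  }'
}

# Mail the alarm if a recipient and a mail command exist, otherwise print it.
send_alarm() {
  msg="$1"
  if [ -n "${ALERT_TO:-}" ] && command -v mail >/dev/null 2>&1; then
    printf '%s\n' "$msg" | mail -s "Disk alarm on $(uname -n)" "$ALERT_TO"
  else
    printf '%s\n' "$msg"
  fi
}

alerts=$(df -P | disk_alerts "$THRESHOLD")
if [ -n "$alerts" ]; then
  send_alarm "Disk usage alarm on $(uname -n): $alerts"
fi
```

A crontab entry such as `*/5 * * * * ALERT_TO=ops@example.com /root/disk_alarm.sh` (hypothetical path and address) would run it every five minutes.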

2. Locate the problem quickly

Network bandwidth (is the CDN abnormal?)

Check whether the domain name resolves to the CDN or directly to the origin server.

Log in to the Alibaba Cloud CDN console to check the corresponding traffic.

Load balancing

Check whether the load balancer is working normally and whether its traffic is abnormal.

Application layer servers

Check whether the ECS server load is normal: is CPU or memory usage too high, and has disk usage reached 100%?
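These checks can be scripted from /proc so they are quick to run on any ECS instance. This is a sketch: load_per_core and mem_used_pct are hypothetical helper names, and reading a load per core above about 1 as "too high" is only a rough rule of thumb.

```shell
#!/usr/bin/env bash
# Quick ECS health snapshot from /proc (helper names are hypothetical).

# Print the 1-minute load average divided by the number of CPU cores.
# $1 = contents of /proc/loadavg, $2 = number of cores.
load_per_core() {
  printf '%s\n' "$1" | awk -v cores="$2" '{ printf "%.2f\n", $1 / cores }'
}

# Read /proc/meminfo-format text on stdin and print used memory in percent,
# counting MemAvailable (what `free` shows as "available") as free.
mem_used_pct() {
  awk '/^MemTotal:/     { total = $2 }
       /^MemAvailable:/ { avail = $2 }
       END { if (total > 0) printf "%.1f\n", (total - avail) * 100 / total }'
}

if [ -r /proc/loadavg ] && [ -r /proc/meminfo ]; then
  echo "load per core: $(load_per_core "$(cat /proc/loadavg)" "$(nproc)")"
  echo "memory used %: $(mem_used_pct < /proc/meminfo)"
  df -h
fi
```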

Cache server

Check whether the Redis server load is normal and how much memory it is using.
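Redis memory can be read from `redis-cli info memory`. The sketch below summarizes the used_memory, used_memory_human and maxmemory fields; it assumes redis-cli is on the PATH and connects to the default local instance (add -h, -p or -a as needed). A maxmemory of 0 means no limit is configured.

```shell
#!/usr/bin/env bash
# Summarize Redis memory usage from `redis-cli info memory` output on stdin.
# INFO replies use CRLF line endings, so carriage returns are stripped first.
redis_mem_summary() {
  tr -d '\r' | awk -F: '
    $1 == "used_memory"       { used = $2 }
    $1 == "used_memory_human" { human = $2 }
    $1 == "maxmemory"         { max = $2 }
    END {
      if (max > 0) printf "used %s (%.1f%% of maxmemory)\n", human, used * 100 / max
      else         printf "used %s (maxmemory not set)\n", human
    }'
}

if command -v redis-cli >/dev/null 2>&1; then
  redis-cli info memory | redis_mem_summary
fi
```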

Database server

Is the number of database connections normal?

List all current connections (all threads if the account has the PROCESS privilege, otherwise only its own):

show full processlist; 

Generate KILL statements for non-idle queries that have been running for more than 2 minutes, longest first:

select concat('kill ', id, ';') from information_schema.processlist where command != 'Sleep' and time > 2*60 order by time desc; 

Send the offending SQL statements to the backend developers for analysis.
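If the shell is more convenient than a SQL client, the same filter can be applied to `mysql -N -B` output. This is a sketch with a hypothetical kill_statements helper; unlike the SQL above it also skips system and replication threads, and it only prints the statements so they can be reviewed before anything is killed.

```shell
#!/usr/bin/env bash
# Build KILL statements from tab-separated SHOW FULL PROCESSLIST output
# (columns: Id User Host db Command Time State Info), longest-running first.
# $1 = minimum running time in seconds. Sleep connections and server or
# replication threads are skipped so they are never killed by mistake.
kill_statements() {
  awk -F'\t' -v limit="$1" '
    $5 != "Sleep" && $5 != "Daemon" && $5 != "Binlog Dump" &&
    $2 != "system user" && $6 + 0 > limit { print $6 "\t" "KILL " $1 ";" }' |
    sort -k1,1nr | cut -f2
}
```

For example, `mysql -N -B -e 'SHOW FULL PROCESSLIST' | kill_statements 120` prints the statements; after checking them, pipe the output into `mysql` to execute them.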

Log in to the server remotely

Problem: high CPU, high load and slow access (database is normal)

System level

View load

View the load, CPU, memory, uptime and the most resource-hungry processes:

# top
Install htop (on CentOS 7 it comes from the EPEL repository): yum -y install epel-release && yum -y install htop
# htop

Use top to check server load and memory consumption, and df -h to check the disks:

top
df -h

View nginx logs

If nginx logging is enabled, go to the nginx log directory, sort the log files by size, and check the access volume, response times, URLs, etc.

cd /data/wwwroot/log
ls -lSrh
tail -f XXX.XXX.COM-access.log
Analyze the logs to find the IPs and URLs with the most requests.
GoAccess or ELK can also be used to view and analyze the logs in a web dashboard.
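For a log in nginx's default combined format (client IP in field 1, request path in field 7, status code in field 9), one small helper answers the most common questions. top_field is a hypothetical name; if your log_format is customized, adjust the field numbers.

```shell
#!/usr/bin/env bash
# Top-N report for an nginx access log in the default "combined" format:
#   $remote_addr - $remote_user [$time_local] "$request" $status ...
# Field 1 is the client IP, field 7 the request path, field 9 the status.

# $1 = awk field number, $2 = number of rows to show; log lines on stdin.
top_field() {
  awk -v f="$1" '{ print $f }' | sort | uniq -c | sort -nr | head -n "$2"
}

log=${1:-}
if [ -n "$log" ] && [ -r "$log" ]; then
  echo "== top client IPs =="; top_field 1 10 < "$log"
  echo "== top URLs ==";       top_field 7 10 < "$log"
  echo "== status codes ==";   top_field 9 10 < "$log"
fi
```

Saved as a script, it would be run as, for example, `sh top_report.sh XXX.XXX.COM-access.log` (hypothetical script name).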

View disk usage

df -h

View the current disk I/O status (iostat is part of the sysstat package: yum install -y sysstat):

iostat -x -k 3 3
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.70    0.00    2.25    0.41    0.00   93.64

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda               0.01     0.83    0.30    1.48    11.34    12.13    26.30     0.01    6.15    7.41    5.89   0.24   0.04
vdb               0.00     0.17    0.02    0.28     0.08     2.75    19.15     0.00    3.22    2.01    3.29   0.26   0.01
vdc               0.10     0.84    3.09    0.56   105.22    20.57    68.94     0.02    7.96    3.29   33.74   1.33   0.49
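To spot a saturated disk, look at the %util column (the last one): values approaching 100 mean the device is busy almost all of the time. The sketch below flags devices at or above a threshold; busy_disks is a hypothetical name, and since the first iostat report shows averages since boot, the later samples are the ones that reflect the current situation.

```shell
#!/usr/bin/env bash
# Flag busy disks in `iostat -x` output. %util is the last column of each
# device row; $1 is the %util threshold (an example value such as 80).
busy_disks() {
  awk -v t="$1" '
    /^Device/ { in_dev = 1; next }   # device table starts after its header
    /^$/      { in_dev = 0; next }   # and ends at the next blank line
    in_dev && $NF + 0 >= t { print $1, $NF }'
}

if [ -n "${1:-}" ] && command -v iostat >/dev/null 2>&1; then
  iostat -x -k 3 3 | busy_disks "$1"
fi
```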

If a disk is busy, find out which process (PID) is responsible:
Install: yum install -y iotop
# iotop -o -P -k -d 5

View the services listening on ports (netstat is in the net-tools package; ss -tunlp gives similar information):

# netstat -tunpl
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 0.0.0.0:62920           0.0.0.0:*               LISTEN      29177/vsftpd        
tcp        0      0 0.0.0.0:8080            0.0.0.0:*               LISTEN      4393/httpd      
tcp        0      0 0.0.0.0:7300            0.0.0.0:*               LISTEN      4697/php-fpm: maste 
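When the listener list is long, it helps to reduce it to port and owning process. A sketch with a hypothetical listening_ports helper, assuming the net-tools netstat output format shown above (root is needed to see other users' PIDs):

```shell
#!/usr/bin/env bash
# Reduce `netstat -tunlp` output to "port PID/program", sorted by port.
# TCP rows have a State column ($6) before PID/Program ($7); UDP rows do not.
listening_ports() {
  awk '$1 ~ /^tcp/ && $6 == "LISTEN" { n = split($4, a, ":"); print a[n], $7 }
       $1 ~ /^udp/                   { n = split($4, a, ":"); print a[n], $6 }' |
    sort -n | uniq
}

if command -v netstat >/dev/null 2>&1; then
  netstat -tunlp 2>/dev/null | listening_ports
fi
```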

Inspect a specific PID

Install: yum install -y lsof

lsof -p PID
lsof -p 29177
lsof -p 4697 
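If lsof is not installed, the same basic information is available under /proc. A sketch with a hypothetical proc_info helper; inspecting other users' processes requires root.

```shell
#!/usr/bin/env bash
# Show a process's executable, working directory and open-file count via /proc.
# $1 = PID. Returns 1 if the process does not exist.
proc_info() {
  pid="$1"
  [ -d "/proc/$pid" ] || { echo "no such process: $pid" >&2; return 1; }
  echo "exe: $(readlink "/proc/$pid/exe")"
  echo "cwd: $(readlink "/proc/$pid/cwd")"
  echo "fds: $(ls "/proc/$pid/fd" | wc -l)"
}
```

Usage: `proc_info 29177`.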

View the system log

tail -n 400 -f /var/log/messages
tail -f /var/log/messages
tail -n100 /var/log/messages
head -n100 /var/log/messages
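A long messages file is easier to read after filtering for typical trouble signs such as OOM killer activity, segfaults and I/O errors. The keyword list in this sketch is an assumption; extend it for your own environment.

```shell
#!/usr/bin/env bash
# Keep only syslog lines that mention common trouble signs (keyword list is
# an assumption). Reads log text on stdin.
log_trouble() {
  grep -iE 'out of memory|killed process|segfault|i/o error'
}

if [ -r /var/log/messages ]; then
  log_trouble < /var/log/messages | tail -n 50
fi
```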

View a simplified process tree

pstree -a >> /root/pstree.log

Network problems

Ping the domain name

ping www.XXX.com

Trace the network path (intermediate hops)

Install: yum install -y traceroute

traceroute www.baidu.com
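ping's summary line gives the packet loss directly. The sketch below extracts it so it can be checked in a script; packet_loss is a hypothetical name, and it assumes the iputils ping output format used on CentOS 7.

```shell
#!/usr/bin/env bash
# Extract the packet-loss percentage from ping's summary line, e.g.
#   4 packets transmitted, 3 received, 25% packet loss, time 3004ms
packet_loss() {
  awk -F', ' '/packets transmitted/ {
    for (i = 1; i <= NF; i++)
      if ($i ~ /packet loss/) { split($i, a, "%"); print a[1] }
  }'
}

if [ -n "${1:-}" ]; then
  ping -c 4 "$1" | packet_loss
fi
```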

Problem: low CPU, high load and slow access (suspect the database)

Check the database

1. Slow query

Check the slow query log, since slow queries can drive the load up. Find its location in the MySQL configuration file: slow_query_log_file (log_slow_queries on older versions).
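Once you have found the slow query log, a quick way to see the worst statements is to sort entries by their Query_time header. This sketch uses a hypothetical slowest_queries helper and keeps only the first line of multi-line queries; MySQL also ships mysqldumpslow, e.g. `mysqldumpslow -s t -t 10 /path/to/slow.log`, for a fuller summary.

```shell
#!/usr/bin/env bash
# Print the N slowest entries of a MySQL slow query log as "seconds<TAB>query".
# Each entry carries a header such as:
#   # Query_time: 2.000123  Lock_time: 0.000050 Rows_sent: 1  Rows_examined: 100000
# $1 = N; log text on stdin.
slowest_queries() {
  awk '
    /^# Query_time:/ { t = $3; next }
    t != "" && !/^#/ && !/^SET timestamp/ && !/^use / { print t "\t" $0; t = "" }' |
    sort -k1,1nr | head -n "$1"
}
```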

2. Is there a system bottleneck?

Upgrade the server's CPU, memory and disks.

Optimize the architecture: add replication, e.g. one master with multiple slaves.

3. Are there too many Sleep connections?

show full processlist;
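To see how many connections are idle, count the processlist rows per Command. A sketch with a hypothetical count_by_command helper, assuming tab-separated output from `mysql -N -B -e 'SHOW FULL PROCESSLIST'`:

```shell
#!/usr/bin/env bash
# Count connections per Command (Sleep, Query, ...) from tab-separated
# SHOW FULL PROCESSLIST output on stdin; Command is the 5th column.
count_by_command() {
  awk -F'\t' '{ count[$5]++ } END { for (c in count) print count[c], c }' | sort -nr
}
```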

4. Check the maximum number of connections

View the configured maximum:
show variables like 'max_connections';
Change it at runtime (lost on restart unless max_connections is also set in my.cnf):
set GLOBAL max_connections=300;
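Before raising max_connections, compare it with the number of connections actually in use (the Threads_connected status variable). A sketch with a hypothetical conn_usage helper; the mysql client is assumed to find its credentials on its own, for example in ~/.my.cnf.

```shell
#!/usr/bin/env bash
# Print "current/max (percent%)" for MySQL connections.
# $1 = Threads_connected, $2 = max_connections.
conn_usage() {
  awk -v cur="$1" -v max="$2" 'BEGIN { printf "%d/%d (%.0f%%)\n", cur, max, cur * 100 / max }'
}

if command -v mysql >/dev/null 2>&1; then
  cur=$(mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'Threads_connected'" 2>/dev/null | cut -f2)
  max=$(mysql -N -B -e "SHOW VARIABLES LIKE 'max_connections'" 2>/dev/null | cut -f2)
  if [ -n "$cur" ] && [ -n "$max" ]; then
    conn_usage "$cur" "$max"
  fi
fi
```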

Basic Nginx protection

If there is abnormal traffic, you can also put Alibaba Cloud's WAF in front of the site.

Top 10 real client IPs by number of requests (the field numbers in these commands depend on the custom log_format in use):

cat www.XXXX.com-access.log |awk '{print $5}'| awk -F":" '{print $NF}' |sort|uniq -c|sort -nr|head -10

Top 10 most requested URLs:

cat  www.XXX.com-access.log | awk '{print $10}' | sort | uniq -c | sort -nr | head -n 10

Top 10 slowest requests (this only works if the log_format puts the request time at the start of each line):

cat  www.XXX.com-access.log | sort -nr | head -n 10
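Sorting whole lines only works if the request time comes first. A more robust sketch assumes a log_format (hypothetically named timed) that appends $request_time as the last field, and sorts on that field; slowest_requests is also a hypothetical name.

```shell
#!/usr/bin/env bash
# Assumes a log_format that appends $request_time as the LAST field, e.g.
#   log_format timed '$remote_addr - $remote_user [$time_local] "$request" '
#                    '$status $body_bytes_sent "$http_referer" '
#                    '"$http_user_agent" $request_time';
# Prints "seconds URL" for the N slowest requests; $1 = N, log lines on stdin.
slowest_requests() {
  awk '{ print $NF, $7 }' | sort -k1,1nr | head -n "$1"
}
```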

View the sources of $http_referer:

cat www.XXX.com-access.log | awk -F"from:" '{print $NF}' |sort|uniq -c|sort -nr|head -10

Block the offending IPs, and check the specific referer source addresses

Block IPs with the server firewall

Block an IP range with iptables:

/sbin/iptables -I INPUT -s 61.37.80.0/24 -j DROP
# In the nginx configuration, block a single IP with
deny 123.45.6.7;
# and block the whole range 123.0.0.1 to 123.255.255.254 with
deny 123.0.0.0/8;
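To turn the top-IP count into a block list, the sketch below emits nginx deny lines for IPs at or above a request threshold. deny_list is a hypothetical name, and it assumes the client IP is the first field; behind a CDN or load balancer that is the proxy's address unless the realip module is configured, so review the list before using it.

```shell
#!/usr/bin/env bash
# Emit "deny IP;" for every client IP with at least $1 requests.
# Log lines on stdin, client IP assumed to be field 1.
deny_list() {
  awk -v limit="$1" '{ hits[$1]++ }
    END { for (ip in hits) if (hits[ip] >= limit) print "deny " ip ";" }' | sort
}
```

For example, `deny_list 1000 < access.log > blocklist.conf`, include that file in the server block, then run `nginx -t && nginx -s reload`.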

Block specific User-Agents

if ($http_user_agent ~* (wget|curl|Firefox) ) {
    return 404;
}

Redirect requests for a specific attack URL

rewrite ^/accounts/\+\$str\+ http://127.0.0.1/ redirect;

Control client access by User-Agent

location / {
    if ($http_user_agent ~ 'bingbot/2.0|MJ12bot/v1.4.2|Spider/3.0|YoudaoBot|Tomato|Gecko/20100315') {
        return 403;
    }
}

Image hotlink protection

valid_referers none blocked *.XXX.com server_names ~\.google\. ~\.baidu\.;
if ($invalid_referer) {
    # return 403;
    rewrite ^/ http://www.XXX.com/daoling.png;
}

Reject requests whose Host header is localhost

if ($host = 'localhost') {
    return 403;
}

Reject requests with an empty User-Agent

if ($http_user_agent ~ ^$) {
    return 403;
}

Reject requests that bypass the CDN (e.g. by binding the origin IP in the hosts file), detected by an empty X-Forwarded-For header

if ($http_x_forwarded_for ~ ^$) {
    return 402;
}
