Prometheus monitoring system
Recently, because the company needs to build a big data platform, we replaced our monitoring system with Prometheus, as required by the consultants.
Prometheus official website: https://prometheus.io/
Personal understanding (not necessarily right): Prometheus monitoring consists of three parts: Prometheus (server), exporter (agent), and Alertmanager (alerting).
The core of Prometheus is a time series database: it scrapes and stores data, and we get the data we need back out through the query language that Prometheus defines. The core of an exporter is a simple web page: the exporter exposes metric values over HTTP by constantly refreshing that page. Alertmanager is the alerting component: it receives the alerts pushed by Prometheus and sends out the notifications. The alerts themselves are triggered by rules that you define.
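To make the exporter idea concrete, here is a minimal sketch of one in Python, using only the standard library. It is an illustration under my own assumptions, not how node_exporter is implemented: the metric names `demo_unix_time` and `demo_pid`, the port 8000, and the helper names are all made up for the example.

```python
import os
import time
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(samples):
    """Render {metric_name: value} as Prometheus text exposition lines."""
    lines = []
    for name, value in sorted(samples.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


def collect():
    """Collect fresh values on every scrape (hypothetical demo metrics)."""
    return {"demo_unix_time": time.time(), "demo_pid": os.getpid()}


class MetricsHandler(BaseHTTPRequestHandler):
    """Serve the current metric values as plain text on /metrics."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(collect()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever and answer scrapes (the port is an arbitrary choice)."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` and opening http://localhost:8000/metrics should show two gauge lines that change on every refresh, which is all a scrape needs.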
As mentioned earlier, the core of Prometheus is a database, so to visualize the data we need to pair it with Grafana and build a nice dashboard. That part will be covered in the next article.
The advantage of Prometheus is that it is a service-oriented alerting system: different exporters can be used for different services. Since this is my first time using it and I have not tried other exporters yet, readers who want to know more can visit the official website.
Here are some simple configurations, with comments on the important settings. With them we can build an initial Prometheus monitoring system that monitors some basic information.
Server side
Deployment
cd /usr/local/
wget http://1.1.17.28/software/linux/prometheus/prometheus-1.7.1.linux-amd64.tar.gz
tar -zxvf prometheus-1.7.1.linux-amd64.tar.gz
cd prometheus-1.7.1.linux-amd64
nohup ./prometheus &
echo "/usr/local/prometheus-1.7.1.linux-amd64/prometheus" >> /etc/rc.local
Main configuration file: prometheus.yml
[root@prometheus local]# cat prometheus-1.7.1.linux-amd64/prometheus.yml
global:
  scrape_interval: 15s # By default, scrape targets every 15 seconds.
  evaluation_interval: 15s # Evaluate rules every 15 seconds.
  # Attach these extra labels to all timeseries collected by this Prometheus instance.
  external_labels:
    monitor: 'codelab-monitor'
# Alert rule file
rule_files:
  - 'prometheus.rules'
scrape_configs:
  # Prometheus monitoring itself; this job is optional
  - job_name: 'prometheus'
    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9090']
  # node_exporter targets: scrape basic node information (CPU, memory, etc.); jobs and labels can be set up per service
  - job_name: 'node'
    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s
    static_configs:
      - targets: ['1.1.17.28:9100']
        labels:
          severity: 'all'
          group: 'tool'
          hostname: 'yum-server'
      - targets: ['1.1.11.27:9100']
        labels:
          severity: 'all'
          group: 'dev'
          hostname: 'app1'
      - targets: ['1.1.11.28:9100']
        labels:
          severity: 'all'
          group: 'dev'
          hostname: 'app2'
      - targets: ['1.1.11.15:9100']
        labels:
          severity: 'all'
          group: 'hadoop'
          hostname: 'hadoop1'
      - targets: ['1.1.11.16:9100']
        labels:
          severity: 'all'
          group: 'hadoop'
          hostname: 'hadoop2'
      - targets: ['1.1.11.17:9100']
        labels:
          severity: 'all'
          group: 'hadoop'
          hostname: 'hadoop2'
      - targets: ['1.1.10.12:9100']
        labels:
          severity: 'all'
          group: 'db_anl'
          hostname: 'DB_ETL'
# Alertmanager configuration
alerting:
  alertmanagers:
    - scheme: http
      static_configs:
        - targets:
            - "1.1.17.17:9093"
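The labels under each target end up on every series scraped from that target, together with the `job` and `instance` labels that Prometheus adds itself. The following is only a rough sketch of that merge, assuming no label name conflicts and no relabeling; `series_labels` and `format_series` are hypothetical helper names, not part of Prometheus.

```python
def series_labels(job_name, target, target_labels, metric_name, sample_labels=None):
    """Combine scraped labels, per-target labels, and the job/instance labels."""
    labels = {"__name__": metric_name}
    labels.update(sample_labels or {})   # labels exposed by the exporter
    labels.update(target_labels)         # labels from static_configs
    labels["job"] = job_name             # taken from job_name
    labels["instance"] = target          # the scraped host:port
    return labels


def format_series(labels):
    """Format a label set the way a series is written in a query."""
    name = labels["__name__"]
    rest = ",".join(
        f'{key}="{value}"' for key, value in sorted(labels.items()) if key != "__name__"
    )
    return f"{name}{{{rest}}}"


app1 = series_labels(
    "node",
    "1.1.11.27:9100",
    {"severity": "all", "group": "dev", "hostname": "app1"},
    "node_load1",
)
print(format_series(app1))
```

This is why a query or alert expression can later select by `group` or `hostname` instead of by IP address.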
Alert rules: prometheus.rules
# CPU load alarm (node_load1 is the 1-minute load average, not a percentage)
ALERT cpu_overload
  IF node_load1 >= 80
  FOR 3m
  LABELS { severity = "all" }
  ANNOTATIONS {
    summary = "Instance {{ $labels.instance }} load1 at or above 80 for 3 minutes",
    description = "{{ $labels.instance }} of job {{ $labels.job }} load1 at or above 80 for 3 minutes.",
  }
# Memory alarm
ALERT memory_overload
  IF (node_memory_MemTotal-node_memory_MemFree)/node_memory_MemTotal >= 0.8
  FOR 3m
  LABELS { severity = "all" }
  ANNOTATIONS {
    summary = "Instance {{ $labels.instance }} memory_load over 80% for 3 minutes",
    description = "{{ $labels.instance }} of job {{ $labels.job }} memory_load over 80% for 3 minutes.",
  }
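The FOR clause means the condition has to stay true for the whole duration before the alert fires; until then the alert is only pending. Below is a simplified simulation of that behaviour for the memory rule, as I understand it. The function names and the 60-second evaluation interval are assumptions for the example (the configuration above evaluates every 15 seconds).

```python
def memory_used_ratio(mem_total, mem_free):
    """Same arithmetic as the IF expression of memory_overload."""
    return (mem_total - mem_free) / mem_total


def alert_states(values, threshold, for_seconds, interval_seconds):
    """Return 'inactive', 'pending' or 'firing' for each rule evaluation."""
    states = []
    active_since = None
    for index, value in enumerate(values):
        now = index * interval_seconds
        if value >= threshold:
            if active_since is None:
                active_since = now  # condition just became true
            held = now - active_since
            states.append("firing" if held >= for_seconds else "pending")
        else:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
    return states


# Four high readings one minute apart, a dip, then high again.
ratios = [0.9, 0.9, 0.9, 0.9, 0.5, 0.9]
print(alert_states(ratios, threshold=0.8, for_seconds=180, interval_seconds=60))
```

Note how the single dip resets the timer, so a value that flaps around the threshold may never fire.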
node_exporter deployment
node_exporter only exposes metrics over HTTP, so it can be started right after installation without any configuration.
cd /usr/local/
wget http://1.1.17.28/software/linux/prometheus/node_exporter-0.14.0.linux-amd64.tar.gz
tar -zxvf node_exporter-0.14.0.linux-amd64.tar.gz
cd node_exporter-0.14.0.linux-amd64
nohup ./node_exporter &
# Start at boot
echo "/usr/local/node_exporter-0.14.0.linux-amd64/node_exporter" >> /etc/rc.local
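Once node_exporter is running, its metrics are served as plain text at /metrics on port 9100. Here is a small sketch for fetching and reading that page from Python. The parser is deliberately simplified: it assumes each sample line ends with the value and carries no trailing timestamp, and `fetch_metrics` and `parse_metrics` are my own helper names.

```python
import urllib.request


def fetch_metrics(target):
    """Download the metrics page of a target such as '1.1.17.28:9100'."""
    with urllib.request.urlopen(f"http://{target}/metrics", timeout=5) as response:
        return response.read().decode("utf-8")


def parse_metrics(text):
    """Map each series (name plus optional labels) to its float value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


example_page = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.21
node_memory_MemFree 1.2345e+08
"""
print(parse_metrics(example_page))
```

Against a live host, `parse_metrics(fetch_metrics("1.1.17.28:9100"))` would return the same kind of dictionary, which is a quick way to check that the exporter is reachable before adding it to prometheus.yml.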
Alertmanager
Deployment
cd /usr/local
wget http://1.1.17.28/software/linux/prometheus/alertmanager-0.8.0.linux-amd64.tar.gz
tar -zxvf alertmanager-0.8.0.linux-amd64.tar.gz
cd alertmanager-0.8.0.linux-amd64
nohup ./alertmanager &
echo "/usr/local/alertmanager-0.8.0.linux-amd64/alertmanager" >> /etc/rc.local
Alert notification configuration file
Only email alerts are configured
[root@prometheus local]# cat alertmanager-0.8.0.linux-amd64/alertmanager.yml
global:
  smtp_smarthost: 'smtp.xxx.com:25'
  resolve_timeout: 5m
  smtp_from: '123@xxx.com'
  smtp_auth_username: '123@xxx.com'
  smtp_auth_password: '123123123'
  smtp_require_tls: false
#templates:
#- '/usr/local/alertmanager-0.8.0.linux-amd64/alert_templates/123.tmpl'
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h
  receiver: 'hwj'
  routes:
  # - match_re:
  #     service: ^(foo1|foo2|baz)$
  #   receiver: hwj
  #   routes:
  - match:
      severity: 'all'
    receiver: 'hwj'
receivers:
- name: 'hwj'
  email_configs:
  - to: '123@xxx.com'
    send_resolved: true
  - to: '456@xxx.com'
    send_resolved: true
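The route section decides which receiver gets an alert and how alerts are bundled into one notification. The sketch below imitates the two ideas used in this file, exact label matching and grouping by the group_by labels. It is a simplification under my own assumptions (a missing label is treated as an empty value, timing settings are ignored), and the function names are hypothetical.

```python
from collections import defaultdict


def matches(route_match, labels):
    """True if every label in the route's match block equals the alert's label."""
    return all(labels.get(name) == value for name, value in route_match.items())


def group_alerts(alerts, group_by):
    """Bundle alerts that share the same values for the group_by labels."""
    groups = defaultdict(list)
    for labels in alerts:
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


alerts = [
    {"alertname": "cpu_overload", "severity": "all", "instance": "1.1.11.27:9100"},
    {"alertname": "cpu_overload", "severity": "all", "instance": "1.1.11.28:9100"},
    {"alertname": "memory_overload", "severity": "all", "instance": "1.1.11.15:9100"},
]

routed = [a for a in alerts if matches({"severity": "all"}, a)]
for key, members in group_alerts(routed, ["alertname", "cluster", "service"]).items():
    print(key, len(members))
```

Since none of the alerts here carry a cluster or service label, grouping effectively happens by alertname only, so both cpu_overload alerts would arrive in the same email.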