Building a monitoring and alerting platform with Prometheus + Grafana + Alertmanager

Posted by bcamp1973 on Mon, 07 Feb 2022 11:28:42 +0100

Prometheus collects the data, Grafana displays it in dashboards, and Alertmanager sends the alerts.

The monitoring and alerting platform runs under docker compose.

Install Docker and docker compose first; refer to other documents for the installation steps.

Basic environment

Docker and docker compose installed
CentOS 7.9
IP: 192.168.3.10
Firewall and SELinux turned off (set SELINUX=disabled in /etc/selinux/config)
Data disk space: estimate it from the retention time and the number of monitored targets (most of it is Prometheus data)
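
The disk estimate can be sketched with the rule of thumb from the Prometheus storage documentation: needed space = retention time in seconds x ingested samples per second x bytes per sample, with roughly 1 to 2 bytes per sample. The function name estimate_disk_bytes and the workload figures (3 nodes with about 1000 series each) are assumptions for illustration only; measure your own series count before sizing a disk.

```python
def estimate_disk_bytes(retention_days, active_series,
                        scrape_interval_s=15.0, bytes_per_sample=2.0):
    """Rough Prometheus TSDB size: retention x samples per second x bytes per sample."""
    retention_seconds = retention_days * 24 * 60 * 60
    # Each active series produces one sample per scrape interval.
    samples_per_second = active_series / scrape_interval_s
    return retention_seconds * samples_per_second * bytes_per_sample


if __name__ == "__main__":
    # Assumed workload: 3 nodes, ~1000 series per node, 15 days of retention.
    size = estimate_disk_bytes(retention_days=15, active_series=3 * 1000)
    print(f"estimated size: {size / 1024 ** 3:.2f} GiB")
```

The 15s default matches the scrape_interval used in this article; leave generous headroom on top of the result, since this is only an approximation.
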

Create monitoring folder

# Main monitoring folder
mkdir monitor
cd monitor
# Create the application folders of the monitoring platform. Everything below runs inside the main monitoring folder monitor

# Create prometheus folder
mkdir -p prometheus/data
chmod 777 prometheus/data

# Create Grafana folder
mkdir -p grafana/grafana-storage
chmod 777 grafana/grafana-storage

# Create Alertmanager folder (with the template folder mounted in the compose file), rule folder and folder for the monitored node lists
mkdir -p alert/template rules targets

Write docker compose file

Write the docker compose file that runs the monitoring and alerting platform. In the monitor folder, create a new configuration file monitor-docker-compose.yml with the content below (the ports are the defaults and can be changed as required):

version: '3.2'
services:
  prometheus:
    image: prom/prometheus
    restart: "always"
    ports:
      - 9090:9090
    container_name: "prometheus"
    volumes:
      - "./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml"
      - "./rules:/etc/prometheus/rules"
      - "./prometheus/data:/prometheus"
      - "./targets:/etc/prometheus/targets"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'

  alertmanager:
    image: prom/alertmanager:latest
    restart: "always"
    ports:
      - 9093:9093
    container_name: "alertmanager"
    volumes:
      - "./alert/alertmanager.yml:/etc/alertmanager/alertmanager.yml"
      - "./alert/template/:/etc/alertmanager/template/"

  grafana:
    image: grafana/grafana
    restart: "always"
    ports:
      - 3000:3000
    container_name: "grafana"
    volumes:
      - "./grafana/grafana-storage:/var/lib/grafana"

Build prometheus configuration file

Create a new configuration file prometheus.yml in the prometheus folder. The specific configuration is as follows:

global:
  scrape_interval:     15s    # How often targets are scraped
  evaluation_interval: 15s    # How often the rules are evaluated
  scrape_timeout:      10s    # Timeout for each scrape

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['192.168.3.10:9090']   
        labels:
          instance: prometheus

# Monitored nodes. The targets files are described later in this article
  - job_name: node
    file_sd_configs:
      - files: 
          - ./targets/*.yml
        refresh_interval: 10s

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 192.168.3.10:9093
# The following is the alarm rule file
rule_files:
  - "/etc/prometheus/rules/*.yml"

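Build Alertmanager configuration file

The test alerts here are sent to Enterprise WeChat. Alerts can also be delivered by email, to DingTalk through a webhook, and so on; for details, refer to the official website. Save the file as alertmanager.yml in the alert folder, which is where the compose file mounts it from.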

global:
  resolve_timeout: 1m
  wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
  wechat_api_corp_id: 'Enterprise wechat corp ID'
  wechat_api_secret: 'Application key'
route:
  receiver: 'monitor_node'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 5m
  group_by: [alertname]
  routes:
  - receiver: 'monitor_node'
    group_wait: 10s
templates:
  - "template/*.tmpl"

receivers:
- name: 'monitor_node'
  wechat_configs: 
  - send_resolved: true
    message: '{{ template "wechat.default.message" . }}'
    to_party: 'id_num' # Department ID of robot
    agent_id: 'application ID'     # ID of the application created in enterprise wechat
    api_secret: 'Application key'      # In enterprise wechat, the Secret of application
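
To check the WeChat route without waiting for a real rule to fire, a test alert can be pushed straight to Alertmanager. The sketch below assumes Alertmanager is reachable at http://192.168.3.10:9093 as in the compose file and uses its v2 HTTP API (POST /api/v2/alerts). The function names build_test_alert and send_alerts and the alert name test-alert are made up for illustration.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_test_alert(instance, alertname="test-alert", severity="info"):
    """Build a one-element alert list in the shape the v2 alerts endpoint accepts."""
    return [{
        "labels": {
            "alertname": alertname,  # route.group_by in this article groups on alertname
            "instance": instance,
            "severity": severity,
        },
        "annotations": {"summary": f"instance: {instance} test alert"},
        "startsAt": datetime.now(timezone.utc).isoformat(),
    }]


def send_alerts(alerts, base_url="http://192.168.3.10:9093"):
    """POST the alerts to Alertmanager and return the HTTP status code."""
    request = urllib.request.Request(
        base_url + "/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


if __name__ == "__main__":
    alerts = build_test_alert("vm01")
    print(json.dumps(alerts, indent=2))
    # Once Alertmanager is running, uncomment the next line to send the alert:
    # print(send_alerts(alerts))
```

With the route above, the notification should arrive after the group_wait delay, and the alert should also be visible on the Alertmanager management page.
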

Install node exporter on the monitored node

The steps below install node exporter from the release tarball; it can also be run with docker or docker compose:

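# Install from the release tarball
wget https://github.com/prometheus/node_exporter/releases/download/v1.2.2/node_exporter-1.2.2.linux-amd64.tar.gz
tar -xzf node_exporter-1.2.2.linux-amd64.tar.gz

# Write the service file, for example /etc/systemd/system/node_exporter.service
[Unit]
Description=node_exporter
After=network.target
[Service]
Type=simple
# Replace /xxx with the directory that contains the node_exporter binary
ExecStart=/xxx/node_exporter
PrivateTmp=true
RestartSec=5
StartLimitInterval=0
Restart=always
[Install]
WantedBy=multi-user.target

# Reload systemd, then enable and start the service
systemctl daemon-reload
systemctl enable node_exporter
systemctl start node_exporter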

Monitored node targets configuration

file_sd_configs is used here to discover the node exporters on the monitored nodes. For other discovery methods, refer to the official documents; pick whichever fits your environment.
Create a new file node.yml in the targets folder. The reference configuration is as follows:

- targets: ['192.168.3.11:9100']
  labels:
    instance: vm01
- targets: ['192.168.3.12:9100']
  labels:
    instance: vm02
- targets: ['192.168.3.13:9100']
  labels:
    instance: vm03
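
When the node list grows, writing this file by hand gets error-prone, so it can be generated instead. The following is a minimal sketch that renders the same layout from a name-to-IP mapping; the function name render_file_sd is hypothetical, and the port 9100 is the one used in the targets above.

```python
def render_file_sd(nodes, port=9100):
    """Render file_sd YAML text: one target group per node, labelled with its name."""
    lines = []
    for name, ip in nodes.items():
        lines.append(f"- targets: ['{ip}:{port}']")
        lines.append("  labels:")
        lines.append(f"    instance: {name}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    nodes = {
        "vm01": "192.168.3.11",
        "vm02": "192.168.3.12",
        "vm03": "192.168.3.13",
    }
    # Redirect the output into targets/node.yml inside the monitor folder.
    print(render_file_sd(nodes), end="")
```

Prometheus re-reads the files matched by file_sd_configs on change and at every refresh_interval, so an updated file is picked up without restarting the container.
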

Alarm rule configuration

This article is a test implementation of monitoring, dashboards and alerting, not a production setup; configure the actual alarm rules to suit your own environment.
Put the rule configuration files in the rules folder and name them *.yml, matching the rule_files entry in prometheus.yml.

For details, refer to the official configuration documentation.

Here is part of my test rules configuration for reference:

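# The prometheus:... names used in expr below look like recording-rule series whose definitions
# are not part of this excerpt; without them the expressions match nothing
# (on the raw metric, node-down would be: up == 0)
groups: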
  - name: node-alert
    rules:
    - alert: node-down
      expr: prometheus:up == 0
      for: 1m
      labels:
        severity: 'critical'
      annotations:
        summary: "instance: {{ $labels.instance }} is down"
        description: "instance: {{ $labels.instance }} \n- job: {{ $labels.job }} is down"
        value: "{{ $value }}"
        instance: "{{ $labels.instance }}"

    - alert: node-cpu-high
      expr:  prometheus:cpu:total:percent > 80
      for: 3m
      labels:
        severity: info
      annotations:
        summary: "instance: {{ $labels.instance }} CPU utilization is high, current value {{ $value }}"
        description: "instance: {{ $labels.instance }} \n- job: {{ $labels.job }} CPU utilization has stayed above 80%."
        value: "{{ $value }}"
        instance: "{{ $labels.instance }}"

    - alert: node-cpu-iowait-high
      expr:  prometheus:cpu:iowait:percent >= 12
      for: 3m
      labels:
        severity: info
      annotations:
        summary: "instance: {{ $labels.instance }} CPU iowait is high, current value {{ $value }}"
        description: "instance: {{ $labels.instance }} \n- job: {{ $labels.job }} CPU iowait has stayed at or above 12%."
        value: "{{ $value }}"
        instance: "{{ $labels.instance }}"
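
To see which targets a node-down style rule would catch, Prometheus can be asked directly over its HTTP API. The sketch below assumes Prometheus is reachable at http://192.168.3.10:9090 and queries the raw up metric rather than the prometheus:up series used in the rules. The function names and the sample response are made up for illustration; the response shape follows the /api/v1/query instant-query format.

```python
import json
import urllib.parse
import urllib.request


def down_instances(response):
    """Return the sorted instance labels whose sample value is 0."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response}")
    down = []
    for series in response["data"]["result"]:
        _timestamp, value = series["value"]  # value arrives as a string
        if float(value) == 0:
            down.append(series["metric"].get("instance", "<unknown>"))
    return sorted(down)


def query_prometheus(expr, base_url="http://192.168.3.10:9090"):
    """Run an instant query and return the decoded JSON response."""
    url = base_url + "/api/v1/query?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url, timeout=5) as response:
        return json.load(response)


# Made-up response for illustration: vm01 is up, vm02 is down.
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "instance": "vm01", "job": "node"},
             "value": [1644229722.0, "1"]},
            {"metric": {"__name__": "up", "instance": "vm02", "job": "node"},
             "value": [1644229722.0, "0"]},
        ],
    },
}

if __name__ == "__main__":
    print(down_instances(SAMPLE_RESPONSE))
    # Against a running server: print(down_instances(query_prometheus("up")))
```
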

Run test environment

# Run under the monitor folder
docker-compose -f monitor-docker-compose.yml up -d
# Run docker ps to check the status of the three containers: prometheus, alertmanager and grafana
# Log in to the Grafana management page on the web and configure the dashboard
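
Besides docker ps, each service can be probed over HTTP. The following sketch assumes the default ports from the compose file on 192.168.3.10 and uses the health endpoints the three projects expose (/-/healthy for Prometheus and Alertmanager, /api/health for Grafana); the function names are hypothetical.

```python
import urllib.error
import urllib.request

# Service name -> (port, health path), using the default ports from the compose file.
ENDPOINTS = {
    "prometheus": (9090, "/-/healthy"),
    "alertmanager": (9093, "/-/healthy"),
    "grafana": (3000, "/api/health"),
}


def health_urls(host):
    """Build the health-check URL of every service on the given host."""
    return {name: f"http://{host}:{port}{path}"
            for name, (port, path) in ENDPOINTS.items()}


def is_healthy(url, timeout=3.0):
    """Return True only if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure or an HTTP error status.
        return False


if __name__ == "__main__":
    for name, url in health_urls("192.168.3.10").items():
        print(f"{name}: {'ok' if is_healthy(url) else 'NOT healthy'} ({url})")
```
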

Once the environment is running normally, you can go on to test alert silencing and other functions on the Alertmanager management page. To import a Grafana dashboard:

Grafana dashboards can be found on the official website.

Hover over the + icon and select Import.

Fill in the dashboard ID found on the official website and click Load.

END

Check the official documents before troubleshooting; they are rich in content
Test each step to narrow the scope of troubleshooting
Exchanges and corrections are welcome~~

Topics: Docker Prometheus Grafana
