Monitoring Dell server hardware with Prometheus

Posted by Ads on Sat, 04 Sep 2021 00:08:09 +0200

Preface

Note: "monitoring Dell server hardware" in the title means monitoring the health status of the server hardware (disks, memory, power supplies, and so on), not hardware performance, disk space, or memory usage. It is similar to using Zabbix to poll the iDRAC over SNMP to obtain hardware status.

Most companies today use Prometheus to monitor containers and services and Zabbix to monitor hardware and ports, although other monitoring architectures exist too. This post does not compare the pros and cons of each approach; it is simply a how-to. It does not explain the basics in detail, so it suits readers who already have some Prometheus background and is not suitable for readers who have never used it.

Prerequisites

<1> On each server to be monitored, enable SNMP on the iDRAC and set the community name, which works like a password (public by default)
Note the community name you set; it will be used later

<2> For security reasons, network access is usually restricted. Find a server that can reach the iDRAC IP address of every monitored server and install the SNMP monitoring components on it

<3> The Prometheus server needs to be able to connect to snmp_exporter

Component installation

Install dependencies

yum -y install gcc gcc-c++ make net-snmp net-snmp-utils net-snmp-libs net-snmp-devel golang git

snmp_exporter installation

<1> Download snmp_exporter

https://github.com/prometheus/snmp_exporter/releases

cd /data
wget https://github.com/prometheus/snmp_exporter/releases/download/v0.20.0/snmp_exporter-0.20.0.linux-amd64.tar.gz
tar xf snmp_exporter-0.20.0.linux-amd64.tar.gz
mv snmp_exporter-0.20.0.linux-amd64 snmp_exporter

<2> Configure startup mode

Configure the service according to your OS version. Do not start it yet, because snmp.yml has not been generated

CentOS 7

cat /usr/lib/systemd/system/snmp-exporter.service 

[Unit]
Description=SNMP exporter
Documentation=https://github.com/prometheus/snmp_exporter


[Service]
ExecStart=/data/snmp_exporter/snmp_exporter \
--config.file=/data/snmp_exporter/snmp.yml \
--web.listen-address=:9116 \
--snmp.wrap-large-counters
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
Restart=on-failure

[Install]
WantedBy=multi-user.target


Management commands:
systemctl daemon-reload
systemctl enable snmp-exporter
systemctl restart snmp-exporter
systemctl status snmp-exporter
systemctl stop snmp-exporter

CentOS 6

cat /etc/init.d/snmp_exporter 
#!/bin/bash

# chkconfig: 2345 80 80
# description: Start and Stop snmp_exporter
# Source function library.

. /etc/init.d/functions

prog_name="snmp_exporter"
prog_path="/data/${prog_name}/${prog_name}"   # full path to the binary, not the install directory
pidfile="/var/run/${prog_name}.pid"
prog_logs="/data/${prog_name}/${prog_name}.log"
options="--config.file=/data/snmp_exporter/snmp.yml --web.listen-address=:9116 --snmp.wrap-large-counters"
DESC="snmp_exporter"

[ -x "${prog_path}" ] || exit 1

RETVAL=0

start(){
action $"Starting $DESC..." su -s /bin/sh -c "nohup $prog_path $options >> $prog_logs 2>&1 &" 2> /dev/null
RETVAL=$?
PID=$(pidof ${prog_path})
[ ! -z "${PID}" ] && echo ${PID} > ${pidfile}
echo
[ $RETVAL -eq 0 ] && touch /var/lock/subsys/$prog_name
return $RETVAL
}

stop(){
echo -n $"Shutting down $prog_name: "
killproc -p ${pidfile}
RETVAL=$?
echo
[ $RETVAL -eq 0 ] && rm -f /var/lock/subsys/$prog_name
return $RETVAL
}

restart() {
stop
start
}

case "$1" in
start)
start
;;
stop)
stop
;;
restart)
restart
;;
status)
status $prog_path
RETVAL=$?
;;
*)
echo $"Usage: $0 {start|stop|restart|status}"
RETVAL=1
;;
esac

exit $RETVAL

------------------------------------------------------------
cat  /etc/sysconfig/snmp_exporter
ARGS=""



------------------------------------------------------------
Management commands:
chmod +x /etc/init.d/snmp_exporter
chkconfig snmp_exporter on
/etc/init.d/snmp_exporter restart

Download the MIBs and generate snmp.yml

MIB and OID

An OID is the identifier an SNMP agent exposes to uniquely identify an object or a piece of information. It is a dotted string of numbers such as 1.3.6.1.4.1.4413.1.3.2.1.17
A MIB is a database that stores the information corresponding to each OID in a tree structure
Just as an organization has designated 134 as hands-on, MIB is
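
To make the tree structure concrete, here is a minimal Python sketch that treats an OID as a path of integers and checks whether one OID lies under another. The function names parse_oid and is_under are made up for this illustration, and 1.3.6.1.4.1.674 is used on the assumption that it is Dell's enterprise subtree; verify it against the MIB files you download.

```python
def parse_oid(oid: str) -> tuple:
    """Split a dotted OID string into a tuple of integers."""
    return tuple(int(part) for part in oid.strip(".").split("."))


def is_under(oid: str, root: str) -> bool:
    """Return True if `oid` is `root` itself or sits below it in the OID tree."""
    oid_path, root_path = parse_oid(oid), parse_oid(root)
    # Compare whole numeric components, so 1.3.60 is not treated as under 1.3.6.
    return oid_path[:len(root_path)] == root_path


if __name__ == "__main__":
    example = "1.3.6.1.4.1.4413.1.3.2.1.17"  # example OID from the text above
    print(is_under(example, "1.3.6.1"))
    print(is_under(example, "1.3.6.1.4.1.674"))  # assumed Dell enterprise subtree
```

Comparing numeric components rather than raw strings matters because a plain string prefix test would wrongly match siblings such as 1.3.60 against 1.3.6.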

<1> Download the MIB package that matches your server model, and check which systems it is compatible with

https://www.dell.com/support/search/zh-cn#q=mibs&sort=relevancy&f:langFacet=[zh]

wget https://dl.dell.com/FOLDER06009600M/1/Dell-OM-MIBS-940_A00.zip
unzip Dell-OM-MIBS-940_A00.zip

<2> View OID

snmptranslate -Tz -m /root/support/station/mibs/iDRAC-SMIv2.mib
cp /usr/share/snmp/mibs/SNMPv2-SMI.txt /root/support/station/mibs/

<3> Generate snmp.yml

Official documentation:
https://github.com/prometheus/snmp_exporter/tree/main/generator#file-format

# Configuration variable
export GO111MODULE=on
export GOPROXY=https://mirrors.aliyun.com/goproxy/
export MIBDIRS=/root/support/station/mibs/

# Fetch and build the generator
go get github.com/prometheus/snmp_exporter/generator
cd ${GOPATH-$HOME/go}/pkg/mod/github.com/prometheus/snmp_exporter@v0.20.0/generator
go build


# Edit generator.yml
# (set community to the SNMP community name configured on your iDRAC)

vim generator.yml

modules:
  idrac:
    walk:
      - 1.3.6.1
    version: 2
    timeout: 30s
    auth:
      community: public

# Generate snmp.yml, which defines the metrics to collect
./generator generate
cp -r snmp.yml /data/snmp_exporter/

<4> Start snmp_exporter

# CentOS 7
systemctl restart snmp-exporter
# CentOS 6
/etc/init.d/snmp_exporter restart

<5> Test whether metrics can be scraped

http://<snmp_exporter IP>:9116

Remarks:
Target: enter the IP of the remote management card (iDRAC) of the server to scrape. The IP of a NIC configured inside the server's operating system will not work
Module: enter the module name from snmp.yml, that is, the key that sits above walk in the file (idrac in this example)
If some of your servers use a different SNMP community, it is recommended to copy snmp.yml to a new file and change community: xxx at the end of the file
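
The Target and Module fields on that page end up as query parameters on the /snmp path, the same path and parameters the Prometheus job uses later. As a rough sketch, the URL for one iDRAC could be built like this; build_probe_url is a hypothetical helper and the addresses are placeholders.

```python
from urllib.parse import urlencode


def build_probe_url(exporter: str, target: str, module: str = "idrac") -> str:
    """Build the snmp_exporter URL that scrapes one iDRAC through the /snmp path."""
    query = urlencode({"target": target, "module": module})
    return f"http://{exporter}/snmp?{query}"


if __name__ == "__main__":
    # Placeholder addresses: replace with your exporter host and iDRAC IP.
    print(build_probe_url("127.0.0.1:9116", "123.123.123.123"))
```

Opening the printed URL in a browser, or fetching it with curl, should return the metrics for that iDRAC if the community name and network access are right.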

cat snmp.yml

Prometheus configuration

However Prometheus was installed, there is no need to reinstall it. The point here is to add an iDRAC job to prometheus.yml
I may write a separate post on Prometheus monitoring and alerting later

prometheus.yml configuration

<1> Configure where alert rules are loaded from

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
    - "rule/*.yml"  
  # - "second_rules.yml"

Create the directory for alert rules and write the alert rule file in it
mkdir rule
vim rule/idrac.yml

<2> Configure the job and set the metrics to be collected or excluded

Method 1: static_configs

- job_name: 'IDRAC'
  scrape_interval: 180s             #Interval for fetching data
  scrape_timeout: 180s              #Timeout for fetching data
  static_configs:
    - targets:
        - 123.123.123.123           #idrac ip to monitor, default snmp port 161
#       - 123.123.123.123:161       #If it is other ports, you can also add ports
#      labels:                      #labels can be added according to requirements, such as the internal ip corresponding to the idrac, work room, etc
#        IP: 'xxx'
#        project: 'xxx'
  metrics_path: /snmp
  params:
    module: [idrac]                 #Module name defined in snmp.yml (idrac in the generator.yml above)
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: xxxxx:9116      #Your snmp_exporter server
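
The three relabel rules above are easier to follow when traced by hand. The sketch below is not Prometheus code; it only mimics, under simplified assumptions, what the rules do to a target's labels. The exporter address 10.0.0.5:9116 is a made-up placeholder.

```python
def apply_relabeling(labels: dict, exporter_address: str) -> dict:
    """Mimic the three relabel rules on a copy of the target's labels."""
    result = dict(labels)
    # Rule 1: copy the iDRAC address into the ?target= URL parameter.
    result["__param_target"] = result["__address__"]
    # Rule 2: keep the iDRAC address as the visible instance label.
    result["instance"] = result["__param_target"]
    # Rule 3: point the actual scrape at snmp_exporter instead of the iDRAC.
    result["__address__"] = exporter_address
    return result


if __name__ == "__main__":
    before = {"__address__": "123.123.123.123"}
    after = apply_relabeling(before, "10.0.0.5:9116")  # placeholder exporter address
    for name, value in sorted(after.items()):
        print(name, "=", value)
```

The net effect is that Prometheus talks to snmp_exporter, passes the iDRAC address as a parameter, and still shows the iDRAC address as the instance on dashboards and in alerts.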


With this method every iDRAC you want to monitor must be added under targets. With hundreds of servers, prometheus.yml becomes very long

Method 2: file_sd_configs


  - job_name: "IDRAC"
    params:
      module: 
      - idrac
    scrape_interval: 180s              
    scrape_timeout: 180s
    metrics_path: /snmp
    file_sd_configs:
      - files:
        - targets/*.json               #Read the JSON files. The directory name is arbitrary, but you must create it
        refresh_interval: 5m           #How often the files are re-read
    relabel_configs:
     - source_labels: [__address__]
       target_label: __param_target
     - source_labels: [__param_target]
       target_label: instance
     - target_label: __address__
       replacement: xxxx:9116         #Your snmp_exporter server   




With this method you create JSON files and list the monitored targets in them. The JSON format is as follows:
cat targets/idrac.json

[
  {
    "targets": [
      "123.123.123.123:161"
    ],
    "labels": {
      "IP": "xxxx",
      "Project": "xxx"
    }
  },
  {
    "targets": [
      "123.123.123.124:161"
    ],
    "labels": {
      "IP": "xxx",
      "Project": "xxx"
    }
  }
]

or

[
  {
    "targets": [
      "123.123.123.123:161",
      "123.123.123.124:161"
    ],
    "labels": {
      "IP": "xxxx",
      "Project": "xxx"
    }
  }
]
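
With many servers, writing this file by hand becomes tedious, so it can be generated from an inventory instead. The following sketch shows one way to do that; the inventory field names idrac_ip, server_ip and project are assumptions for illustration, and the output follows the first JSON layout above (one group per iDRAC).

```python
import json


def build_file_sd(servers: list) -> str:
    """Render a file_sd target list: one group per iDRAC with its own labels."""
    groups = [
        {
            "targets": [f"{server['idrac_ip']}:161"],
            "labels": {"IP": server["server_ip"], "Project": server["project"]},
        }
        for server in servers
    ]
    return json.dumps(groups, indent=2)


if __name__ == "__main__":
    # Hypothetical inventory: replace with your own source of truth.
    inventory = [
        {"idrac_ip": "123.123.123.123", "server_ip": "10.0.0.1", "project": "demo"},
        {"idrac_ip": "123.123.123.124", "server_ip": "10.0.0.2", "project": "demo"},
    ]
    print(build_file_sd(inventory))
```

Redirecting the output to targets/idrac.json lets Prometheus pick up the change on its next refresh_interval without a reload.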

Method 3: consul_sd_configs
This method registers the monitored targets as services in Consul, and Prometheus discovers them automatically through Consul
Consul is not covered in detail here. If you have not used Consul or configured Prometheus alerting before, this method is not recommended because it is harder to understand


  - job_name: 'IDRAC'
    params:
      module:
      - idrac
    scrape_interval: 180s
    scrape_timeout: 180s
    metrics_path: /snmp
    consul_sd_configs:
    - server: 'monitor-consul.com:8500'           #Domain name of your Consul server; an IP address also works
      tag_separator: ','
      services: []
    relabel_configs:
      - source_labels: [__meta_consul_tags]
        regex: .*idrac.*                          #Keep only services whose Consul tags match this regex in this job
        action: keep
      - source_labels: ['__meta_consul_service_metadata_eth_ip']     #Meta key "eth-ip" set in Consul; Prometheus exposes it with "-" replaced by "_". Shown under Prometheus -> Targets -> IDRAC -> Endpoint
        target_label: __param_target            
      - source_labels: ['__meta_consul_service_address']
        target_label: instance
      - target_label: __address__
        replacement: xxx:9116



With this method the service has to be registered in Consul. There are two ways to register: statically, or from a file.
A JSON example follows; write your own according to your needs (labels are arbitrary, but they should match the keywords of the DingTalk group you send alerts to and agree with your Alertmanager configuration). The # comments are annotations only and must be removed before registering, because JSON does not allow comments



cat consul-idrac.json 
{
  "ID": "IDRAC-xxx",
  "Name": "IDRAC-xxx",
  "Tags": [
    "idrac"
  ],
  "Address": "xxx",                                  #iDRAC IP
  "Meta": {                                          #Metadata in Consul, which Prometheus can relabel into its own labels
    "eth-ip": "xxx",                                 #Server service IP
    "project": "beijing"                             #Location
  },
  "EnableTagOverride": false,
  "Check": {
    "HTTP": "http://xxxx:9116/metrics",              #IP and port of your snmp_exporter, used as the health check
    "Interval": "10s"
  },
  "Weights": {
    "Passing": 10,
    "Warning": 1
  }
}


Explanation: the health check points at snmp_exporter, so what is actually checked is snmp_exporter itself. Even if the IP and other fields above are wrong, the service still shows as healthy in Consul. This does not affect Prometheus monitoring: once the service is registered, Prometheus only reads the service's address and labels from Consul and then monitors according to its own configuration. For SNMP targets, the second JSON form below is the better fit

or
cat consul-idrac2.json

{
  "ID": "IDRAC-xxx",
  "Name": "IDRAC-xxx",
  "Tags": [
    "idrac"
  ],
  "Address": "xxx:161",
  "Meta": {                                          #Metadata in Consul, which Prometheus can relabel into its own labels
    "eth-ip": "xxx",                                 #Server service IP
    "project": "beijing"                             #Location
  }
}




Register:
curl --request PUT --data @consul-idrac.json "http://monitor-consul.com:8500/v1/agent/service/register?replace-existing-checks=1"
Deregister:
curl -X PUT http://monitor-consul.com:8500/v1/agent/service/deregister/IDRAC-xxx
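
When many iDRACs need registering, the same call can be scripted. The sketch below builds a service definition shaped like consul-idrac2.json and prepares the PUT request without sending it. Using the iDRAC IP as the ID suffix is an assumption made only for this example, and build_service and build_register_request are hypothetical helper names.

```python
import json
import urllib.request


def build_service(idrac_ip: str, server_ip: str, project: str) -> dict:
    """Build a Consul service definition shaped like consul-idrac2.json."""
    return {
        "ID": f"IDRAC-{idrac_ip}",      # assumption: the iDRAC IP makes the ID unique
        "Name": f"IDRAC-{idrac_ip}",
        "Tags": ["idrac"],
        "Address": f"{idrac_ip}:161",
        "Meta": {"eth-ip": server_ip, "project": project},
    }


def build_register_request(consul: str, service: dict) -> urllib.request.Request:
    """Prepare (but do not send) the PUT request to the agent register endpoint."""
    return urllib.request.Request(
        url=f"http://{consul}/v1/agent/service/register?replace-existing-checks=1",
        data=json.dumps(service).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    service = build_service("123.123.123.123", "10.0.0.1", "beijing")
    request = build_register_request("monitor-consul.com:8500", service)
    print(request.get_method(), request.full_url)
    # To actually register, call urllib.request.urlopen(request) against a reachable Consul agent.
```

Keeping the payload builder separate from the request makes it easy to inspect what would be registered before touching the Consul agent.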

effect:

Alert rule configuration

The rules below use metrics defined in snmp.yml, but not every metric in that file is actually available. Search for a metric in the Prometheus UI to confirm it exists before alerting on it

cat rule/idrac.yml 
groups:
    - name: IDRAC-Physical machine hardware running status
      rules:

      - alert: IDRAC state
        expr: up{job=~"IDRAC.*"} == 0
        for: 1m
        labels:
          status: error
        annotations:
          description: "{{$labels.instance}} IDRAC abnormal"

      - alert: Overall status of chassis components
        expr: chassisStatus != 3
        for: 1m
        labels:
          status: error
        annotations:
          summary: "The overall running status of chassis components is abnormal. Please check it in time!!"
          description: "{{$labels.instance}} Abnormal chassis components"

      - alert: Chassis CMOS Overall battery status
        expr: systemBatteryStatus != 3 
        for: 1m
        labels:
          status: error
        annotations:
          summary: "The overall state of the chassis CMOS battery is abnormal. Please check it in time!!"
          description: "{{$labels.instance}} Chassis CMOS battery status abnormal"


      - alert: Memory module running status
        expr: memoryDeviceStatus != 3
        for: 1m
        labels:
          status: error
        annotations:
          summary: "The status of the memory module is abnormal. Please check it in time!!"
          description: "{{$labels.instance}} Memory module {{$labels.memoryDeviceIndex}} abnormal"


      - alert: processor CPU Overall status
        expr: processorDeviceStatusStatus != 3 
        for: 1m
        labels:
          status: error
        annotations:
          summary: "The overall status of the processor (CPU) is abnormal. Please check it in time!!"
          description: "{{$labels.instance}} Processor (CPU) {{$labels.processorDeviceStatusIndex}} abnormal"

      - alert: Network card status
        expr: networkDeviceStatus != 3 
        for: 1m
        labels:
          status: error
        annotations:
          description: "{{$labels.instance}} Network card {{$labels.networkDeviceIndex}} abnormal"

      - alert: ps Overall status of power supply
        expr: powerSupplyStatus != 3 
        for: 1m
        labels:
          status: error
        annotations:
          summary: "The overall status of the power supply (PS) is abnormal. Please check it in time!!"
          description: "{{$labels.instance}} Power supply {{ $labels.powerSupplyIndex }} abnormal"

      - alert: Storage controller overall status
        expr: globalStorageStatus != 3 
        for: 1m
        labels:
          status: error
        annotations:
          summary: "The status of the storage controller is abnormal. Please check it in time!!"
          description: "{{$labels.instance}} Storage controller exception"

      - alert: Overall status of physical system components
        expr: globalSystemStatus != 3 
        for: 1m
        labels:
          status: error
        annotations:
          summary: "The overall components of the physical system are running abnormally. Please check it in time!!"
          description: "{{$labels.instance}} Physical system component exception"

      - alert: Physical disk running status
        expr: physicalDiskState != 3
        for: 1m
        labels:
          status: error
        annotations:
          summary: "The physical disk is running abnormally. Please check it in time!!"
          description: "{{$labels.instance}} Physical disk {{$labels.physicalDiskNumber}} abnormal"

      - alert: Virtual disk running status
        expr: virtualDiskState != 2
        for: 1m
        labels:
          status: error
        annotations:
          summary: "The virtual disk is running abnormally. Please check it in time!!"
          description: "{{$labels.instance}} Virtual disk {{$labels.virtualDiskNumber}} abnormal"

Reload Prometheus (the reload endpoint only works when Prometheus is started with --web.enable-lifecycle)
curl -X POST http://XXXX:9090/-/reload   #XXXX is the Prometheus IP

To send alerts, you also need to configure Alertmanager and the DingTalk plugin prometheus-webhook-dingtalk, and add a robot to the DingTalk group. The alerting process is not demonstrated here

Topics: Linux