Prometheus alarm rules

Posted by alivec on Tue, 09 Nov 2021 08:55:01 +0100

concept

Alert rules allow you to define alert conditions based on Prometheus expression language expressions and send notifications about triggering alerts to external services. Whenever an alarm expression produces one or more vector elements at a given point in time, the alarm counts as active for the tag set of these elements.

Similar to the recording rule, the alerting rule is also defined in a separate file, and then Prometheus sets it in the rule_ The files configuration section is loaded as follows

Alarm Rule Template address: https://awesome-prometheus-alerts.grep.to/rules.html#host-and-hardware

rule_files:
  - recording_rules/*.yaml
  - alerting_rules/*.yml          # Alarm rule file path

Trigger alarm

Prometheus evaluates all alarm rules in a fixed cycle, and its value is evaluated by evaluate_ The interval parameter is defined as 15s by default; In each evaluation cycle, Prometheus will calculate the Boolean expression in each alarm rule and update the alarm status;

Alarms that do not use the for clause will automatically change from Inactive to filling, which can be triggered in only one evaluation cycle; The alarm state with the for clause will be changed to Pending first, and then to filling, so it needs at least two evaluation cycles to trigger;

For the alarm in Pending status, if the value of Boolean expression returns false before its duration meets the definition of for clause, the alarm status will return to Inactive;

It can be seen that the conversion from Pending to filling can ensure that the alarm is more effective and will not float back and forth;

Prometheus will create an indicator for each alarm in Pending and filling states. The indicator name is ALERT, its value is fixed to 1, and exists during the alarm in Pending and filling states; After that, it will not receive any updates until it expires;

Alarm rules

Alarm rules are configured in Prometheus in the same way as Record rules Same.

Index meaning

Options	meaning
groups	Configure the top level, which is used to define a monitoring group
- name	Rule name
rules	rule
- alert	Name of the alarm rule.
expr	The expression is based on the PromQL expression alarm trigger condition and is used to calculate whether a time series meets the condition.
for	Evaluate the waiting time. When waiting, the status is pending, the alarm status is firing, and the recovery status is inactive
labels	Custom label
annotations	It is used to specify a set of additional information, such as the text used to describe the alarm details. The contents of annotations will be sent to the Alertmanager as parameters when the alarm is generated. Summary describes the summary information of the alarm, and description describes the details of the alarm. At the same time, the UI of Alertmanager will also display alarm information according to these two tag values.

Instance rule

groups:
- name: Allinstances     # Group name
  rules:
  - alert: instanceDown  # Rule name
    expr: up == 0        # Matching rule instance downtime promQl expression
    for: 10s             # Detection duration
    annotations:         # The alarm itself is also a time series annotation
      title: "instance down" # Alarm title
      description: 'Instance has been down for more than 1 minute.' # Alarm details
    labels:              # new label
      serverity: "critical"
      group_by: "node"

  - alert: Insufficient host memory
    expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 99
    for: 10s
    labels:
      severity: "critical"
    annotations:
      summary: Insufficient host memory (Host address {{ $labels.instance }})
      description: "The host memory is full (< 10% left)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

Alarm template

Prometheus supports the use of templates in alarms;

Alarm template refers to the method of quoting the label and sample value of time series on the label and annotation in the alarm;

It uses standard Go syntax and exposes some variables containing time tags and values;

- Label reference: {{$label. < label_name >}}
- Index sample value reference: {{$value}}

To reference the values of the instance and job tags on the time series that triggered the alarm in the description annotation, "{{$label.instance}}" and "{{$label.job}}" can be used respectively;

Example

[root@ops prometheus]# cat alerting_rules/instance-down.yml 
groups:
- name: Allinstances     # Group name
  rules:
  - alert: instanceDown  # Rule name
    expr: up == 0        # Matching rule instance downtime promQl expression
    for: 10s             # Detection duration
    annotations:         # The alarm itself is also a time series annotation
      summary: Host downtime (Host address {{ $labels.instance }})
      description: 'Instance downtime exceeds 1 m. Current value is: {{ $value }}' # Alarm details
    labels:              # new label
      serverity: "critical"

Programmer Think