Best practices for intelligent patrol alarm configuration

Posted by praeses on Fri, 21 Jan 2022 04:22:14 +0100

Introduction: the results of intelligent anomaly analysis are delivered through the SLS alerting feature to the notification channels the user has configured. In intelligent patrol scenarios, a single task often inspects a large number of entities and involves many object rules, and the new SLS alerting feature makes it easier to manage the resulting patrol events.


Patrol Event Structure

Let's briefly look at the basic logic of a patrol task:

A single patrol job contains N patrol entities, and each entity has its own patrol model. Whenever an anomaly is detected, it is sent to the user through the alarm system. We therefore need ways to distribute and manage these results differently.

Let's first look at the basic structure of patrol events. The built-in template is as follows:

```markdown
## Data source
+ Project: ${results[0].project}
+ LogStore: ${results[0].store}

## Exception object
+ Entity: ${labels}

## Abnormal degree
+ Score: ${annotations.anomaly_score}

## Abnormal sequence diagram
![image](${annotations.__plot_image__})

[[Data details](${query_url})]
[[Job details](${alert_url})]

[[Confirm](${annotations.__ensure_url__})]
[[False alarm](${annotations.__mismatch_url__})]
```

Next, let's look at a concrete alarm message. All of the scenarios that follow refer to the fields in this example:

```json
{
  "results": [
    {
      "store_type": "log",
      "region": "cn-chengdu",
      "project": "sls-ml-demo",
      "store": "machine_metric_logtail",
      "start_time": 1641361140,
      "end_time": 1641361200
    }
  ],
  "labels": {
    "ip": "192.168.1.5",
    "name": "load_avg"
  },
  "annotations": {
    "__ensure_url__": "$url_path",
    "__mismatch_url__": "$url_path",
    "__plot_image__": "$url_path",
    "alert_msg_type": "ml_anomaly_msg",
    "anomaly_score": "0.8000",
    "anomaly_type_id": "1",
    "anomaly_type_name": "STAB_TYPE",
    "job_id": "29030-2bbf5beba0110fa869339708a8217b67",
    "model_id": "9c0f0d5ad4879eb75237e2ec8494f5f1",
    "title": "metric-logtail-sql"
  },
  "severity": 8,
  "drill_down_url": "$url_path"
}
```
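To see how the `${...}` placeholders in the built-in template map onto a payload like this one, here is a minimal Python sketch that resolves dotted paths such as `results[0].project` against a trimmed copy of the sample alarm. The `resolve` and `render` helpers are hypothetical illustrations only; SLS uses its own template engine, whose syntax and rendering rules may differ (for example, in how a whole object such as `${labels}` is printed).

```python
import json
import re

# Sample alarm, trimmed to the fields this sketch resolves.
ALARM = json.loads("""
{
  "results": [{"project": "sls-ml-demo", "store": "machine_metric_logtail"}],
  "labels": {"ip": "192.168.1.5", "name": "load_avg"},
  "annotations": {"anomaly_score": "0.8000"}
}
""")

# A placeholder is ${...}; the text inside is a dotted path.
PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")
# One path segment: a key, optionally followed by a list index such as [0].
SEGMENT = re.compile(r"(\w+)(?:\[(\d+)\])?")


def resolve(alarm: dict, path: str):
    """Follow a path like 'results[0].project' through nested dicts and lists."""
    value = alarm
    for part in path.split("."):
        m = SEGMENT.fullmatch(part)
        if m is None:
            raise ValueError(f"unsupported path segment: {part!r}")
        value = value[m.group(1)]
        if m.group(2) is not None:
            value = value[int(m.group(2))]
    return value


def render(template: str, alarm: dict) -> str:
    """Replace every ${path} placeholder with the value found in the alarm."""
    return PLACEHOLDER.sub(lambda m: str(resolve(alarm, m.group(1))), template)


template = (
    "+ Project: ${results[0].project}\n"
    "+ LogStore: ${results[0].store}\n"
    "+ Score: ${annotations.anomaly_score}"
)
print(render(template, ALARM))
```

The same paths (`labels.ip`, `annotations.anomaly_score`, and so on) are the fields that the action policy conditions in the scenarios below match against.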

Typical scenario configuration

Scenario 1

Objective: filter anomalies of a specific entity

Steps

Find the [action policy ID] used by the patrol task. The ID depends on your actual configuration.

In that action policy, add a matching condition.

Based on the alarm fields shown above, suppose that only alarm messages whose [label] field [ip] has the value [192.168.1.5] should be sent to a specific [DingTalk robot].
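The condition in this scenario amounts to an exact match on one label. The Python sketch below models that check locally so you can reason about which alarms reach the robot; `matches_label`, `pick_channel`, and the channel names are hypothetical, and the real matching is performed by the SLS action policy, not by this code.

```python
def matches_label(alarm: dict, key: str, expected: str) -> bool:
    """Return True when alarm['labels'][key] exists and equals `expected` exactly."""
    return alarm.get("labels", {}).get(key) == expected


def pick_channel(alarm: dict) -> str:
    # Hypothetical channel names: only the 192.168.1.5 entity goes to the DingTalk robot.
    if matches_label(alarm, "ip", "192.168.1.5"):
        return "dingtalk-robot"
    return "default"


target = {"labels": {"ip": "192.168.1.5", "name": "load_avg"}}
other = {"labels": {"ip": "192.168.1.6", "name": "load_avg"}}
print(pick_channel(target))  # dingtalk-robot
print(pick_channel(other))   # default
```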

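Scenario 2

Objective: filter anomalies by score

Steps

Find the relevant action policy ID and add a condition.

Send alarms whose [anomaly score] is [0.9] or higher to a specific channel:

  • [name] - anomaly_score
  • [regex] - ^((1\.0+)|(0\.9[0-9]*))$

Because the score is formatted with four decimal places (for example, 0.8000 in the sample alarm), the pattern allows any number of trailing digits, and the dots are escaped so that they match only a literal period.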
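A regex for this condition has to cope with scores written with four decimal places, as in the sample alarm (0.8000). Before saving the condition, it can help to check such a pattern, `^((1\.0+)|(0\.9[0-9]*))$`, against a few score strings. This sketch uses Python's `re` module and assumes that the SLS condition applies the regex to the string value of anomaly_score; regex dialects can differ in edge cases, so treat it as a quick sanity check rather than a guarantee.

```python
import re

# 1 followed by ".0" and any number of zeros, or "0.9" followed by any digits.
SCORE_PATTERN = re.compile(r"^((1\.0+)|(0\.9[0-9]*))$")


def score_matches(score: str) -> bool:
    """True when the anomaly_score string satisfies the condition's regex."""
    return SCORE_PATTERN.match(score) is not None


for score in ["0.8000", "0.8999", "0.9000", "0.9500", "1.0000"]:
    print(score, score_matches(score))
```

Scores just below the threshold, such as 0.8999, are rejected, while 0.9000 through 1.0000 are accepted.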

Scenario 3

Objective: filter anomalies of a specific entity by score

Steps

Find the relevant action policy ID and add a condition.

Send alarms in which the [anomaly score] of a [specific entity] is [0.9] or higher to a specific channel:

  • In [annotation], set the name to anomaly_score and the [regex] to ^((1\.0+)|(0\.9[0-9]*))$
  • In [label], set the name to ip and the value to the entity, 192.168.1.5
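This scenario combines the two previous checks. The sketch below assumes the two conditions are joined with a logical AND, which is what "the score of a specific entity" intends, and reuses a score pattern that accepts four-decimal values such as 0.9500. The `matches` helper is hypothetical and only models the policy's behavior locally.

```python
import re

# Accepts 1.0, 1.0000, 0.9, 0.9000, 0.9500, ... as score strings.
SCORE_PATTERN = re.compile(r"^((1\.0+)|(0\.9[0-9]*))$")


def matches(alarm: dict, ip: str = "192.168.1.5") -> bool:
    """Both conditions must hold: the ip label and the anomaly_score regex."""
    ip_ok = alarm.get("labels", {}).get("ip") == ip
    score = alarm.get("annotations", {}).get("anomaly_score", "")
    return ip_ok and SCORE_PATTERN.match(score) is not None


alarms = [
    {"labels": {"ip": "192.168.1.5"}, "annotations": {"anomaly_score": "0.8000"}},
    {"labels": {"ip": "192.168.1.5"}, "annotations": {"anomaly_score": "0.9500"}},
    {"labels": {"ip": "192.168.1.6"}, "annotations": {"anomaly_score": "0.9500"}},
]
print([matches(a) for a in alarms])  # [False, True, False]
```

Only the second alarm passes: the first has the right entity but a low score, and the third has a high score but the wrong entity.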

Scenario 4

Objective: filter anomalies of a specific anomaly type

Steps

Find the relevant action policy ID and add a condition.
Configure the [specific anomaly type]:

  • Configure the [annotation] anomaly_type_id; its value determines the anomaly type. For details, see [description of anomaly types]
    (https://help.aliyun.com/docum...)
  • Here, only anomalies of the upward-drift type are accepted: anomaly_type_id = 7
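Note that the sample alarm carries anomaly_type_id as a string ("1"), so a local check should compare against the string "7", not the integer 7. The Python sketch below models the condition with a set of accepted IDs; `accept_types` and `UPWARD_DRIFT` are hypothetical names, and the mapping of 7 to upward drift is taken from this article.

```python
def accept_types(alarm: dict, accepted: set) -> bool:
    """True when the alarm's anomaly_type_id is one of the accepted IDs."""
    # The sample payload carries anomaly_type_id as a string, e.g. "1".
    return alarm.get("annotations", {}).get("anomaly_type_id") in accepted


UPWARD_DRIFT = {"7"}  # the ID this article uses for upward drift

sample = {"annotations": {"anomaly_type_id": "1", "anomaly_type_name": "STAB_TYPE"}}
drift = {"annotations": {"anomaly_type_id": "7"}}
print(accept_types(sample, UPWARD_DRIFT))  # False
print(accept_types(drift, UPWARD_DRIFT))   # True
```

Using a set makes it easy to accept several anomaly types with one check if you later widen the condition.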

Scenario 5

Objective: route alarms by event type (patrol events vs. root-cause events)

Steps

Find the relevant action policy ID and add a condition.

Configure the [event type of the intelligent alarm]:

  • Configure the [annotation] alert_msg_type with the value ml_anomaly_msg (this value marks alarms produced by intelligent patrol)
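Routing by event type is a lookup on annotations.alert_msg_type. In the sketch below, only ml_anomaly_msg is mapped, because that is the only value documented here; the value used for root-cause events is not given in this article, so such alarms simply fall through to a default. The `route` helper and the channel names are hypothetical.

```python
def route(alarm: dict, routes: dict, default: str) -> str:
    """Choose a channel from annotations.alert_msg_type, falling back to `default`."""
    msg_type = alarm.get("annotations", {}).get("alert_msg_type")
    return routes.get(msg_type, default)


# Only the patrol value is documented above; other event types use the fallback.
ROUTES = {"ml_anomaly_msg": "patrol-channel"}

patrol_alarm = {"annotations": {"alert_msg_type": "ml_anomaly_msg"}}
unknown_alarm = {"annotations": {}}
print(route(patrol_alarm, ROUTES, "other-channel"))   # patrol-channel
print(route(unknown_alarm, ROUTES, "other-channel"))  # other-channel
```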

This article is original content of Alibaba Cloud and may not be reproduced without permission.
