I have defined some alerts with the following expressions:
sum(rate(some_error_metric[1m])) BY (namespace,application) > 10
sum(rate(some_other_error_metric[1m])) BY (namespace,application) > 10
...
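Currently, these alerts fire when any of our applications emits these metrics at a rate of more than 10 per minute.
Instead of hard-coding the threshold of 10, I would like to be able to specify a different threshold for each application.
For example, application_1 should alert at a rate of 10 per minute, application_2 should alert at a rate of 20 per minute, and so on.
Is this possible without duplicating the alerts for every application?
This Stack Overflow question, Dynamic label values in Promethues alerting rules, suggests that recording rules can achieve what I want, but following the pattern proposed in its only answer results in recording rules that Prometheus does not seem to be able to parse: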
- record: application_1_warning_threshold
  expr: warning_threshold{application="application_1"} 10
- record: application_2_warning_threshold
  expr: warning_threshold{application="application_2"} 20
...
Answer 0 (score: 0)
Here is my configuration for the TasksMissing alert, which has different thresholds per job:
groups:
- name: availability.rules
  rules:

  # Expected number of tasks per job and environment.
  - record: job_env:up:count
    expr: count(up) without (instance)

  # Actually up and running tasks per job and environment.
  - record: job_env:up:sum
    expr: sum(up) without (instance)

  # Ratio of up and running to expected tasks per job and environment.
  - record: job_env:up:ratio
    expr: job_env:up:sum / job_env:up:count

  # Global warning and critical availability ratio thresholds.
  - record: job:up:ratio_warning_threshold
    expr: 0.7
  - record: job:up:ratio_critical_threshold
    expr: 0.5

  # Job-specific warning and critical availability ratio thresholds.

  # Always alert if one Prometheus instance is down.
  - record: job:up:ratio_critical_threshold
    labels:
      job: prometheus
    expr: 0.99

  # Never alert for some-batch-job instances down:
  - record: job:up:ratio_warning_threshold
    labels:
      job: some-batch-job
    expr: 0
  - record: job:up:ratio_critical_threshold
    labels:
      job: some-batch-job
    expr: 0

  # TasksMissing is fired when a certain percentage of tasks belonging to a job are down. Namely:
  #
  #     job_env:up:ratio < job:up:ratio_(warning|critical)_threshold
  #
  # with a job-specific warning/critical threshold when defined, or the global default otherwise.

  - alert: TasksMissing
    expr: |
      # Default warning threshold is < 70%
        job_env:up:ratio
      < on(job) group_left()
        (
            job:up:ratio_warning_threshold
          or on(job)
              count by(job) (job_env:up:ratio) * 0
            + on() group_left()
              job:up:ratio_warning_threshold{job=""}
        )
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Tasks missing for {{ $labels.job }} in {{ $labels.env }}
      description: '...'

  - alert: TasksMissing
    expr: |
      # Default critical threshold is < 50%
        job_env:up:ratio
      < on(job) group_left()
        (
            job:up:ratio_critical_threshold
          or on(job)
              count by(job) (job_env:up:ratio) * 0
            + on() group_left()
              job:up:ratio_critical_threshold{job=""}
        )
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: Tasks missing for {{ $labels.job }} in {{ $labels.env }}
      description: '...'
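To make the fallback inside the parentheses easier to follow, the warning threshold lookup can be broken into intermediate recording rules. This is only a sketch: the group name and the three record names (job:up:zero, job:up:ratio_warning_threshold_default, job:up:ratio_warning_threshold_effective) are made up for illustration, and the rules assume the job_env:up:ratio and job:up:ratio_warning_threshold series defined above.

```yaml
groups:
- name: availability.threshold_steps   # hypothetical group name
  rules:
  # Step 1: one series per job, each with value 0. Only the job label set
  # matters here; the value is discarded by "* 0".
  - record: job:up:zero
    expr: count by(job) (job_env:up:ratio) * 0

  # Step 2: add the global default (the threshold series without a job label)
  # to every per-job zero, giving one default threshold series per job.
  - record: job:up:ratio_warning_threshold_default
    expr: job:up:zero + on() group_left() job:up:ratio_warning_threshold{job=""}

  # Step 3: keep a job-specific threshold where one is recorded and fall back
  # to the per-job default from step 2 for every other job.
  - record: job:up:ratio_warning_threshold_effective
    expr: |
        job:up:ratio_warning_threshold{job!=""}
      or on(job)
        job:up:ratio_warning_threshold_default
```

With rules like these, the alert condition would reduce to job_env:up:ratio < on(job) group_left() job:up:ratio_warning_threshold_effective, and the effective threshold could also be graphed next to the ratio.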
Answer 1 (score: 0)
Same problem here:
There are application_10000, application_10001 and application_10002, but we do not know the others, so we need a default rule for the remaining applications. I tried:
- record: record_rule_name_10000
  expr: warning_threshold{application="application_10000"} > 500
  labels:
    team: record_rule_name_10000
- record: record_rule_name_10001
  expr: warning_threshold{application="application_10001"} > 501
  labels:
    team: record_rule_name_10001
- alert: record_rule_name_10000
  expr: |
    warning_threshold
    > on(team) group_left() (record_rule_name_10000
    or on(team) count by(team) (warning_threshold) * 0 + 100)
  for: 20s
  annotations:
    default_threshold: "100"
    now_value: '{{ $value }}'
    platform: vkmq
    threshold: "500"
- alert: record_rule_name_10001
  expr: |
    warning_threshold
    > on(team) group_left() (record_rule_name_10001
    or on(team) count by(team) (warning_threshold) * 0 + 102)
  for: 20s
  annotations:
    default_threshold: "100"
    now_value: '{{ $value }}'
    platform: vkmq
    threshold: "501"
application_10002 still does not get a single default threshold.
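One possible way to get a single default is to follow the TasksMissing pattern from the first answer: record every threshold under one metric name, distinguished only by the application label, and use a single alert instead of one alert per application. The sketch below is untested and rests on assumptions: the group name, the record name application:error_rate_warning_threshold, the alert name HighErrorRate and the threshold values are invented for illustration, while some_error_metric and the namespace and application labels come from the question.

```yaml
groups:
- name: error_rate.rules   # hypothetical group name
  rules:
  # Default threshold: recorded without an application label.
  - record: application:error_rate_warning_threshold
    expr: 10

  # Per-application overrides (illustrative values).
  - record: application:error_rate_warning_threshold
    labels:
      application: application_2
    expr: 20

  - alert: HighErrorRate   # hypothetical alert name
    expr: |
      # rate() yields a per-second value, so these thresholds are per second;
      # multiply the left-hand side by 60 if per-minute thresholds are intended.
        sum by (namespace, application) (rate(some_error_metric[1m]))
      > on(application) group_left()
        (
            application:error_rate_warning_threshold
          or on(application)
              count by(application) (rate(some_error_metric[1m])) * 0
            + on() group_left()
              application:error_rate_warning_threshold{application=""}
        )
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: High error rate for {{ $labels.application }} in {{ $labels.namespace }}
```

Any application without its own override, such as application_10002 here, should then be compared against the default of 10, and adding a new override only requires one more recording rule rather than another alert.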