Question

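We are about to set up Prometheus for monitoring and alerting on our cloud services, including a continuous integration and deployment pipeline for the Prometheus service and its configuration, such as alerting rules and thresholds. For that, I am thinking about 3 categories I want to write automated tests for: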

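Basic syntax checks for the configuration during deployment (we already do this with promtool and amtool)
Tests for alert rules (what leads to alerts) during deployment
Tests for alert routing (who gets alerted about what) during deployment
Recurring checks that the alerting system is working properly in production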

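The most important part for me right now is testing the alert rules (category 1), but I have found no tooling to do that. I could imagine setting up a Prometheus instance during deployment, feeding it some metric samples (though I am not sure how I would do that with Prometheus' pull architecture) and then running queries against it.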

The only thing I have found so far is a blog post about monitoring the Prometheus Alertmanager chain as a whole, which relates to the third category.

Has anyone done something like this, or is there anything I have missed?

Answer 1

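Prometheus 2.5 and later allow you to write tests for alerts; here is a link. With this you can cover points 1 and 2. You have to define the input data and the expected output (for example, in test.yml):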

rule_files:
    - alerts.yml
evaluation_interval: 1m
tests:
# Test 1.
- interval: 1m
  # Series data.
  input_series:
      - series: 'up{job="prometheus", instance="localhost:9090"}'
        values: '0 0 0 0 0 0 0 0 0 0 0 0 0 0 0'
      - series: 'up{job="node_exporter", instance="localhost:9100"}'
        values: '1+0x6 0 0 0 0 0 0 0 0' # 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0

  # Unit test for alerting rules.
  alert_rule_test:
      # Unit test 1.
      - eval_time: 10m
        alertname: InstanceDown
        exp_alerts:
            # Alert 1.
            - exp_labels:
                  severity: page
                  instance: localhost:9090
                  job: prometheus
              exp_annotations:
                  summary: "Instance localhost:9090 down"
                  description: "localhost:9090 of job prometheus has been down for more than 5 minutes."
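
The answer does not show alerts.yml itself. As a rough sketch, a rule file that the test above would be expected to pass against could look like the following. The group name `example` and the exact expression are assumptions chosen to match the expected labels and annotations in test.yml, not the asker's real rules.

```yaml
# alerts.yml (assumed contents, for illustration only)
groups:
- name: example   # hypothetical group name
  rules:
  - alert: InstanceDown
    # Fires for any target whose `up` series has been 0 for at least 5 minutes.
    expr: up == 0
    for: 5m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
```

With this rule, at `eval_time: 10m` the prometheus instance has been down for the whole test and is firing, while the node_exporter instance only went down at minute 7 and is still within its 5 minute `for` window, so it does not appear in `exp_alerts`.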

You can run the tests using Docker:

docker run \
-v $PROJECT/testing:/tmp \
--entrypoint "/bin/promtool" prom/prometheus:v2.5.0 \
test rules /tmp/test.yml

promtool will verify whether the alert InstanceDown from the file alerts.yml is active. The advantage of this approach is that you do not have to start Prometheus.
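
The question also mentions feeding Prometheus some samples and running queries against them. The same test file format has a `promql_expr_test` section for that, so no running instance or scraping is needed. The following is a minimal sketch; the file name `queries_test.yml` and the sample values are made up for illustration.

```yaml
# queries_test.yml (hypothetical file name)
rule_files:
    - alerts.yml
evaluation_interval: 1m
tests:
- interval: 1m
  # Synthetic samples, one per minute: up for 3 samples, then down.
  input_series:
      - series: 'up{job="prometheus", instance="localhost:9090"}'
        values: '1 1 1 0 0'

  # Evaluate a PromQL expression at a given time and compare the result.
  promql_expr_test:
      - expr: up{job="prometheus"}
        eval_time: 4m
        exp_samples:
            - labels: 'up{job="prometheus", instance="localhost:9090"}'
              value: 0
```

It would be run the same way as above, by passing the file to `promtool test rules`.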

How to automatically test Prometheus alerts?
