How to query an API latency error budget from Prometheus

Time: 2018-10-09 15:21:31

Tags: prometheus

I have a Prometheus histogram, api_response_duration_seconds, with an SLO defined as

histogram_quantile(0.95, sum(increase(api_response_duration_seconds_bucket[1m])) by (le)) <= 0.5

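Is there an easy way to query what fraction of the time (as a percentage) this query was failing over the past 28 days? That is, I would like to be able to answer: "Did this query fail more than 0.1% of the time in the past 28 days?"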

1 answer:

Answer 0: (score: 2)

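So the secret here is that I wanted to convert an instant vector into a range vector: the SLO expression evaluates to an instant vector, while count_over_time needs a range vector. This isn't directly possible in Prometheus, but the workaround is to use a recording rule.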

So what needs to be done is this:

groups:
  - name: SLOs
    rules:
      # Only has a sample at evaluations where the 95th percentile exceeds 0.5s
      - record: slo:api_response_duration_seconds:failing
        expr: histogram_quantile(0.95, sum(increase(api_response_duration_seconds_bucket[1m])) by (le)) > 0.5
      # Has a sample at every evaluation where the expression returns a result
      - record: slo:api_response_duration_seconds:all
        expr: histogram_quantile(0.95, sum(increase(api_response_duration_seconds_bucket[1m])) by (le))

and then query the error budget as:

count_over_time(slo:api_response_duration_seconds:failing[28d])
/
count_over_time(slo:api_response_duration_seconds:all[28d])
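
To answer the original question directly, the ratio can be compared against the 0.1% budget. This is a minimal sketch, assuming both recorded series exist and the rule group is evaluated at a regular interval, so that the ratio of sample counts approximates the fraction of time spent failing:

```promql
# Returns a sample only if more than 0.1% of the rule evaluations
# in the last 28 days were failing (0.1% = 0.001 as a ratio).
(
  count_over_time(slo:api_response_duration_seconds:failing[28d])
  /
  count_over_time(slo:api_response_duration_seconds:all[28d])
) > 0.001
```

An empty result means the budget was not exceeded. That includes the case where the failing series has no samples at all in the window, because the numerator is then empty. Multiplying the ratio by 100 gives it as a percentage.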