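Dynamic thresholds in a PromQL query (two labels used in the group filter)

Time: 2021-03-09 21:14:30

Tags: prometheus promql

The following PromQL rules use one label (instance) in the group filter and work as expected to produce a dynamic threshold.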

    - record: threshold_NodeHighCpuLoad_warning
      expr: 10
      labels:
        instance: host.example.net:9100

    - record: threshold_NodeHighCpuLoad_critical
      expr: 85
      labels:
        instance: host.example.net:9100

    - record: query_NodeHighCpuLoad
      expr: 100 - (avg by(app,job,instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
    - alert: NodeHighCpuLoadCritical
      expr:  query_NodeHighCpuLoad > on (instance) group_left() ( threshold_NodeHighCpuLoad_critical or on (instance) query_NodeHighCpuLoad * 0 + 90) or absent (query_NodeHighCpuLoad)*-1
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: Host high CPU load (instance {{ $labels.instance }})
        description: CPU load\n  VALUE = {{ $value }}

    - alert: NodeHighCpuLoadWarning
      expr:  query_NodeHighCpuLoad > on (instance) group_left() ( threshold_NodeHighCpuLoad_warning or on (instance) query_NodeHighCpuLoad * 0 + 80) or absent (query_NodeHighCpuLoad)*-1
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: Host high CPU load (instance {{ $labels.instance }})
        description: CPU load\n  VALUE = {{ $value }}

The following PromQL rules try to use two labels (container, pod) in the group filter and do not work. I suspect it is a label-matching issue.

    - record: threshold_ContainerHighCpuLoad_warning
      expr: 0
      labels:
        container: gitlab

    - record: threshold_ContainerHighCpuLoad_critical
      expr: 1
      labels:
        container: gitlab

    - record: threshold_ContainerHighCpuLoad_warning
      expr: 1
      labels:
        container: prometheus

    - record: threshold_ContainerHighCpuLoad_critical
      expr: 2
      labels:
        container: prometheus

    - record: query_ContainerHighCpuLoad
      expr: (sum by(pod, namespace, job, instance, image, name, container) (rate(container_cpu_usage_seconds_total{container!="POD",image!="",namespace!~"kube-system"}[1m])))

    - alert: ContainerHighCpuLoadWarning
      expr:  query_ContainerHighCpuLoad > on (container,pod) group_left() ( threshold_ContainerHighCpuLoad_warning or on (container,pod) query_ContainerHighCpuLoad * 0 + .5) or absent(query_ContainerHighCpuLoad)*-1
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: Host high CPU load ({{$labels.container}} {{ $labels.namespace }}/{{ $labels.pod }})
        description: CPU load\n  VALUE = {{ $value }}

    - alert: ContainerHighCpuLoadCritical
      expr:  query_ContainerHighCpuLoad > on (container,pod) group_left() ( threshold_ContainerHighCpuLoad_critical or on (container,pod) query_ContainerHighCpuLoad * 0 + 1) or absent(query_ContainerHighCpuLoad)*-1
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: Host high CPU load ({{$labels.container}} {{ $labels.namespace }}/{{ $labels.pod }})
        description: CPU load\n  VALUE = {{ $value }}

I tried adding pod as a catch-all, as shown below, but it did not work.

    - record: threshold_ContainerHighCpuLoad_critical
      expr: 1
      labels:
        container: gitlab
        pod: ".*"

I suspect it is being evaluated as "=" rather than "=~", so it does not match.
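
For reference, that suspicion is consistent with how PromQL vector matching is documented: series pair up only when the labels named in on (...) have exactly equal values, and a label that is missing on one side counts as the empty string. The sketch below is a toy Python model of that rule, not Prometheus code. The helper names `signature` and `match_on` are hypothetical, and values are left out because only the label matching is of interest here.

```python
def signature(labels, on):
    """Build the match key; a missing label counts as the empty string."""
    return tuple(labels.get(name, "") for name in on)


def match_on(left, right, on):
    """Pair each left series with the right series that has an equal match key."""
    index = {signature(r, on): r for r in right}
    pairs = []
    for series in left:
        key = signature(series, on)
        if key in index:
            pairs.append((series, index[key]))
    return pairs


# One query series, labelled like the gitlab pod in the question.
query = [{"container": "gitlab", "pod": "gitlab-67dd9b7d59-np4js"}]

# Three ways the threshold rule was labelled in the question.
threshold_without_pod = [{"container": "gitlab"}]
threshold_literal_regex = [{"container": "gitlab", "pod": ".*"}]
threshold_exact_pod = [{"container": "gitlab", "pod": "gitlab-67dd9b7d59-np4js"}]

both = ["container", "pod"]
print(len(match_on(query, threshold_without_pod, both)))    # threshold has no pod label
print(len(match_on(query, threshold_literal_regex, both)))  # ".*" is compared as plain text
print(len(match_on(query, threshold_exact_pod, both)))      # identical pod value
print(len(match_on(query, threshold_without_pod, ["container"])))  # match on container only
```

In this model only the exact pod name, or matching on container alone, produces a pair. That is why a threshold labelled only by container needs the query side to have one series per container.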

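I found that if I add the following, I get the expected results. However, because pod names are dynamic, I need some kind of regular-expression matching.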

    - record: threshold_ContainerHighCpuLoad_warning
      expr: 0
      labels:
        container: gitlab
        pod: gitlab-67dd9b7d59-np4js

    - record: threshold_ContainerHighCpuLoad_critical
      expr: 1
      labels:
        container: gitlab
        pod: gitlab-67dd9b7d59-np4js

Does anyone know how to solve this?

Thanks! - Kendall Chenoweth

1 answer:

Answer 0 (score: 0)

I figured it out. I changed the query to use sum without(label list).

    - record: query_ContainerHighCpuLoad
      expr: (sum without(id, node, service, pod, name, image, instance) (rate(container_cpu_usage_seconds_total{container!="POD",image!="",namespace!~"kube-system"}[1m])))

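When you do not want the time series split by certain label values, as in query_ContainerHighCpuLoad, list those labels in without(...) so that the aggregation ignores them.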
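
To make the difference concrete, here is a toy Python model of the two aggregation forms. It is only a sketch: `sum_by` and `sum_without` are hypothetical helpers, and the pod names and values are made up for illustration.

```python
from collections import defaultdict


def sum_by(series, keep):
    """Toy `sum by (...)`: only the listed labels survive the aggregation."""
    groups = defaultdict(float)
    for labels, value in series:
        key = tuple(sorted((k, v) for k, v in labels.items() if k in keep))
        groups[key] += value
    return dict(groups)


def sum_without(series, drop):
    """Toy `sum without (...)`: every label except the listed ones survives."""
    groups = defaultdict(float)
    for labels, value in series:
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in drop))
        groups[key] += value
    return dict(groups)


# Two pods of the same container (hypothetical names and CPU rates).
samples = [
    ({"container": "gitlab", "namespace": "default", "pod": "gitlab-aaa"}, 0.25),
    ({"container": "gitlab", "namespace": "default", "pod": "gitlab-bbb"}, 0.5),
]

# Keeping pod in the grouping leaves one series per pod.
per_pod = sum_by(samples, {"container", "namespace", "pod"})

# Dropping pod merges them into one series per container and namespace.
per_container = sum_without(samples, {"pod"})

print(len(per_pod), len(per_container))
```

With pod kept, a threshold labelled only by container has nothing unique to match against. With pod dropped, each container and namespace pair yields a single series.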

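To determine which labels are splitting your time series, first run the alert PromQL (next section) with the query expanded, and include in the without list every label that the join does not need. Then remove the labels one at a time to find which ones cause the problem and must stay in the final without list.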

Start with this query.

    (sum without(id, node, service, name, image, pod, instance, cpu, endpoint, job, metrics_path) (rate(container_cpu_usage_seconds_total{container!="POD",image!="",namespace!~"kube-system"}[1m])))
      > on (container, namespace) group_left()
      (
        threshold_ContainerHighCpuLoad_warning
        or on (container)
        (sum without(id, node, service, pod, name, image, instance, cpu, endpoint, job, metrics_path) (rate(container_cpu_usage_seconds_total{container!="POD",image!="",namespace!~"kube-system"}[1m]))) * 0 + .4
      )

This exercise yields the following query.

    (sum without(id, node, service, name, image, pod, instance) (rate(container_cpu_usage_seconds_total{container!="POD",image!="",namespace!~"kube-system"}[1m])))
      > on (container, namespace) group_left()
      (
        threshold_ContainerHighCpuLoad_warning
        or on (container)
        (sum without(id, node, service, pod, name, image, instance) (rate(container_cpu_usage_seconds_total{container!="POD",image!="",namespace!~"kube-system"}[1m]))) * 0 + .4
      )
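
As a possible shortcut for the elimination exercise above (a different technique from removing labels one at a time), you could copy the label sets of the raw series out of the query result and check which labels differ among series that share the join labels. The sketch below assumes the label sets are available as Python dicts. `splitting_labels` is a hypothetical helper, and the label values are made up.

```python
from collections import defaultdict


def splitting_labels(series, join_labels):
    """Return labels whose values differ among series that share the join labels."""
    groups = defaultdict(list)
    for labels in series:
        key = tuple(labels.get(name, "") for name in join_labels)
        groups[key].append(labels)

    culprits = set()
    for members in groups.values():
        # Every label seen in this group, except the join labels themselves.
        names = set().union(*(m.keys() for m in members)) - set(join_labels)
        for name in names:
            if len({m.get(name, "") for m in members}) > 1:
                culprits.add(name)
    return sorted(culprits)


# Hypothetical label sets for two pods of the same container.
series = [
    {"container": "gitlab", "namespace": "default", "pod": "gitlab-aaa",
     "instance": "10.0.0.1:10250", "job": "kubelet"},
    {"container": "gitlab", "namespace": "default", "pod": "gitlab-bbb",
     "instance": "10.0.0.2:10250", "job": "kubelet"},
]

print(splitting_labels(series, ["container", "namespace"]))
```

Labels reported by this check are candidates for the without list. Labels that never vary within a group, such as job in this made-up data, can stay.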

Now you can update the definition of query_ContainerHighCpuLoad and simplify the expression.

    query_ContainerHighCpuLoad > on (container, namespace) group_left() ( threshold_ContainerHighCpuLoad_warning or on (container) query_ContainerHighCpuLoad * 0 + .4)
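
The `threshold or on (container) query * 0 + .4` part of the expression above is what supplies a default threshold for containers that have no explicit rule. A toy Python model of that fallback is sketched below. `key` and `or_on` are hypothetical helpers, and the grafana container and all values are made up for illustration.

```python
def key(labels, on):
    """Build the match key; a missing label counts as the empty string."""
    return tuple(labels.get(name, "") for name in on)


def or_on(left, right, on):
    """Toy `left or on (...) right`: all of left, plus right series unmatched in left."""
    seen = {key(labels, on) for labels, _ in left}
    extra = [(labels, value) for labels, value in right if key(labels, on) not in seen]
    return left + extra


# Aggregated CPU usage, one series per container (hypothetical values).
usage = [
    ({"container": "gitlab", "namespace": "default"}, 0.3),
    ({"container": "grafana", "namespace": "monitoring"}, 0.7),
]

# Assume only gitlab has an explicit threshold rule.
explicit = [({"container": "gitlab"}, 0.0)]

# `query * 0 + .4`: same labels as the query, value replaced by the default 0.4.
defaults = [(labels, value * 0 + 0.4) for labels, value in usage]

thresholds = or_on(explicit, defaults, ["container"])
for labels, value in thresholds:
    print(labels["container"], value)
```

In this model gitlab keeps its explicit threshold and grafana falls back to the default, so every container ends up with exactly one threshold to compare against.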

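Because some labels that matter for troubleshooting (instance, for example) are now aggregated away, the corresponding information can be recovered with the following kubectl command.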

    kubectl get po -n monitoring -o jsonpath='{range .items[*]}{"\n"}{"pod: "}{.metadata.name}/{.metadata.namespace}: {range .spec.containers[*]}{.name}{","}{end}{"\n"}{end}' | grep "prometheus,"

In this command, the namespace (monitoring) and the container name (prometheus) select the output, for example:

    pod: prometheus-kubeprom-kube-prometheus-s-prometheus-0/monitoring: prometheus,config-reloader,

Because the query runs against container names, it may return more than one instance.
