Question

The following PromQL rules use a single group filter label (instance) and work as expected, producing a dynamic threshold per instance.

    - record: threshold_NodeHighCpuLoad_warning
      expr: 10
      labels:
        instance: host.example.net:9100

    - record: threshold_NodeHighCpuLoad_critical
      expr: 85
      labels:
        instance: host.example.net:9100

    - record: query_NodeHighCpuLoad
      expr: 100 - (avg by(app,job,instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
    - alert: NodeHighCpuLoadCritical
      expr:  query_NodeHighCpuLoad > on (instance) group_left() ( threshold_NodeHighCpuLoad_critical or on (instance) query_NodeHighCpuLoad * 0 + 90) or absent (query_NodeHighCpuLoad)*-1
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: Host high CPU load (instance {{ $labels.instance }})
        description: CPU load\n  VALUE = {{ $value }}

    - alert: NodeHighCpuLoadWarning
      expr:  query_NodeHighCpuLoad > on (instance) group_left() ( threshold_NodeHighCpuLoad_warning or on (instance) query_NodeHighCpuLoad * 0 + 80) or absent (query_NodeHighCpuLoad)*-1
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: Host high CPU load (instance {{ $labels.instance }})
        description: CPU load\n  VALUE = {{ $value }}

The following PromQL rules try to use two group filter labels (container, pod) and do not work. I suspect it is a label-matching issue.

    - record: threshold_ContainerHighCpuLoad_warning
      expr: 0
      labels:
        container: gitlab

    - record: threshold_ContainerHighCpuLoad_critical
      expr: 1
      labels:
        container: gitlab

    - record: threshold_ContainerHighCpuLoad_warning
      expr: 1
      labels:
        container: prometheus

    - record: threshold_ContainerHighCpuLoad_critical
      expr: 2
      labels:
        container: prometheus

    - record: query_ContainerHighCpuLoad
      expr: (sum by(pod, namespace, job, instance, image, name, container) (rate(container_cpu_usage_seconds_total{container!="POD",image!="",namespace!~"kube-system"}[1m])))

    - alert: ContainerHighCpuLoadWarning
      expr:  query_ContainerHighCpuLoad > on (container,pod) group_left() ( threshold_ContainerHighCpuLoad_warning or on (container,pod) query_ContainerHighCpuLoad * 0 + .5) or absent(query_ContainerHighCpuLoad)*-1
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: Host high CPU load ({{$labels.container}} {{ $labels.namespace }}/{{ $labels.pod }})
        description: CPU load\n  VALUE = {{ $value }}

    - alert: ContainerHighCpuLoadCritical
      expr:  query_ContainerHighCpuLoad > on (container,pod) group_left() ( threshold_ContainerHighCpuLoad_critical or on (container,pod) query_ContainerHighCpuLoad * 0 + 1) or absent(query_ContainerHighCpuLoad)*-1
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: Host high CPU load ({{$labels.container}} {{ $labels.namespace }}/{{ $labels.pod }})
        description: CPU load\n  VALUE = {{ $value }}

I tried adding a catch-all pod label to the container threshold, as shown below, but that did not work.

    - record: threshold_ContainerHighCpuLoad_critical
      expr: 1
      labels:
        container: gitlab
        pod: ".*"

I suspect the label is evaluated with "=" rather than "=~", so it does not match.

I found that adding the following gives the expected result. However, because pod names are dynamic, I need some kind of regular-expression matching.

    - record: threshold_ContainerHighCpuLoad_warning
      expr: 0
      labels:
        container: gitlab
        pod: gitlab-67dd9b7d59-np4js

    - record: threshold_ContainerHighCpuLoad_critical
      expr: 1
      labels:
        container: gitlab
        pod: gitlab-67dd9b7d59-np4js

Does anyone know how to solve this?

Thanks! - Kendall Chenoweth

Answer 1

I figured it out. I changed the query to use sum without(label list).

    - record: query_ContainerHighCpuLoad
      expr: (sum without(id, node, service, pod, name, image, instance) (rate(container_cpu_usage_seconds_total{container!="POD",image!="",namespace!~"kube-system"}[1m])))

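When you do not want the time series to be split by certain label values, as in query_ContainerHighCpuLoad, those labels need to be listed as ignored in the without clause.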
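
As a possible explanation of why this helps: the threshold series in the question carry only a container label, while on (container, pod) compares both label values for exact equality, and a label that is missing from a series counts as an empty value. A threshold without the exact pod name therefore never matched, and pod: ".*" is just the literal string ".*". Aggregating pod away on the query side removes the need to match on it. The sketch below contrasts without() with by(); the group name and the second rule name are hypothetical and only there for comparison.

```yaml
groups:
  - name: container-cpu-aggregation   # hypothetical group name
    rules:
      # without(): drops only the listed labels and keeps all others
      # (container, namespace, and whatever else the scrape attaches).
      - record: query_ContainerHighCpuLoad
        expr: sum without(id, node, service, pod, name, image, instance) (rate(container_cpu_usage_seconds_total{container!="POD",image!="",namespace!~"kube-system"}[1m]))

      # by(): keeps only the listed labels and drops all others.
      # The rule name below is hypothetical, for comparison only.
      - record: query_ContainerHighCpuLoad_by_container
        expr: sum by(container, namespace) (rate(container_cpu_usage_seconds_total{container!="POD",image!="",namespace!~"kube-system"}[1m]))
```

Which form is more convenient depends on how many labels the scrape attaches: without() needs every splitting label listed, while by() needs only the labels used for matching.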

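To determine which labels are splitting your time series, first run the alert PromQL (next section) with the query expanded inline, and put every label that the join does not need into the without list. Then remove them one at a time to identify which ones cause the problem and must stay in the final without list.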
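
One way to check the outcome of that elimination, offered as an assumption rather than part of the original answer, is to count how many query_ContainerHighCpuLoad series share the same container and namespace. A count above 1 suggests that some label outside the without list still splits the data. The group name and alert name below are hypothetical; the expr can also be pasted into the Prometheus expression browser on its own.

```yaml
groups:
  - name: container-cpu-label-check   # hypothetical group name
    rules:
      # Hypothetical guard alert: fires when more than one
      # query_ContainerHighCpuLoad series shares a (container, namespace)
      # pair, i.e. some remaining label still splits the series.
      - alert: ContainerCpuQueryStillSplit
        expr: count by (container, namespace) (query_ContainerHighCpuLoad) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: query_ContainerHighCpuLoad is split for {{ $labels.container }} in {{ $labels.namespace }}
          description: series count = {{ $value }}
```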

Start with this query.

    (sum without(id, node, service, name, image, pod, instance, cpu, endpoint, job, metrics_path) (rate(container_cpu_usage_seconds_total{container!="POD",image!="",namespace!~"kube-system"}[1m]))) > on (container, namespace) group_left() ( threshold_ContainerHighCpuLoad_warning or on (container) (sum without(id, node, service, pod, name, image, instance, cpu, endpoint, job, metrics_path) (rate(container_cpu_usage_seconds_total{container!="POD",image!="",namespace!~"kube-system"}[1m]))) * 0 + .4)

This exercise produces the following query.

    (sum without(id, node, service, name, image, pod, instance) (rate(container_cpu_usage_seconds_total{container!="POD",image!="",namespace!~"kube-system"}[1m]))) > on (container, namespace) group_left() ( threshold_ContainerHighCpuLoad_warning or on (container) (sum without(id, node, service, pod, name, image, instance) (rate(container_cpu_usage_seconds_total{container!="POD",image!="",namespace!~"kube-system"}[1m]))) * 0 + .4)

Now you can update the definition of query_ContainerHighCpuLoad and simplify the expression.

    query_ContainerHighCpuLoad > on (container, namespace) group_left() ( threshold_ContainerHighCpuLoad_warning or on (container) query_ContainerHighCpuLoad * 0 + .4)
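
Put back into a rule file, the simplified expression might look like the sketch below. Several parts are assumptions: the group name is hypothetical, the for and severity values are carried over from the question, and the threshold rule is given a namespace label (monitoring, taken from the kubectl output in this answer) because on (container, namespace) can only match a threshold series that carries both labels. The sketch also assumes query_ContainerHighCpuLoad is recorded with sum without(...) as shown earlier in this answer.

```yaml
groups:
  - name: container-cpu-thresholds   # hypothetical group name
    rules:
      # Per-container override. The namespace label is an assumption, added
      # so that on (container, namespace) can match this series.
      - record: threshold_ContainerHighCpuLoad_warning
        expr: 1
        labels:
          container: prometheus
          namespace: monitoring

      - alert: ContainerHighCpuLoadWarning
        # Containers without an override series fall back to the default 0.4.
        expr: query_ContainerHighCpuLoad > on (container, namespace) group_left() (threshold_ContainerHighCpuLoad_warning or on (container) query_ContainerHighCpuLoad * 0 + .4)
        for: 5m
        labels:
          severity: warning
        annotations:
          # pod is aggregated away, so only container and namespace remain.
          summary: Container high CPU load ({{ $labels.container }} in {{ $labels.namespace }})
          description: CPU load VALUE = {{ $value }}
```

Because pod is no longer a label on the query series, the annotations can reference only container and namespace.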

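Because some labels that are useful for troubleshooting (instance, for example) are now aggregated away, you can recover that information with the following kubectl command.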

    kubectl get po -n monitoring -o jsonpath='{range .items[*]}{"\n"}{"pod: "}{.metadata.name}/{.metadata.namespace}: {range .spec.containers[*]}{.name}{","}{end}{"\n"}{end}' | grep "prometheus,"

In this command, the namespace (monitoring) and the container name (prometheus) are used to extract output such as:

    pod: prometheus-kubeprom-kube-prometheus-s-prometheus-0/monitoring: prometheus,config-reloader,

Because the lookup is done by container name, it may return more than one instance.

Dynamic thresholds in PromQL queries (two labels used in the group filter)
