The following PromQL query uses a single group filter (instance) and works as expected to produce a dynamic filter.
- record: threshold_NodeHighCpuLoad_warning
  expr: 10
  labels:
    instance: host.example.net:9100
- record: threshold_NodeHighCpuLoad_critical
  expr: 85
  labels:
    instance: host.example.net:9100
- record: query_NodeHighCpuLoad
  expr: 100 - (avg by(app,job,instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
- alert: NodeHighCpuLoadCritical
  expr: query_NodeHighCpuLoad > on (instance) group_left() ( threshold_NodeHighCpuLoad_critical or on (instance) query_NodeHighCpuLoad * 0 + 90) or absent (query_NodeHighCpuLoad)*-1
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: Host high CPU load (instance {{ $labels.instance }})
    description: CPU load\n VALUE = {{ $value }}
- alert: NodeHighCpuLoadWarning
  expr: query_NodeHighCpuLoad > on (instance) group_left() ( threshold_NodeHighCpuLoad_warning or on (instance) query_NodeHighCpuLoad * 0 + 80) or absent (query_NodeHighCpuLoad)*-1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Host high CPU load (instance {{ $labels.instance }})
    description: CPU load\n VALUE = {{ $value }}
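The critical alert expression above can be read in pieces. The sketch below reuses only the rule names from the rules above and is meant to be run one expression at a time in the Prometheus expression browser; it illustrates how the override-plus-default pattern appears to work and is not a replacement for the rules.

```promql
# 1. Per-instance override: exists only for instances that have a
#    threshold_* recording rule (here host.example.net:9100).
threshold_NodeHighCpuLoad_critical

# 2. Default threshold: one series per query series, value forced to 90.
query_NodeHighCpuLoad * 0 + 90

# 3. "or on (instance)" keeps every override from (1) and adds the default
#    from (2) only for instances that have no override.
threshold_NodeHighCpuLoad_critical or on (instance) query_NodeHighCpuLoad * 0 + 90

# 4. The comparison keeps the query series whose value exceeds the
#    threshold selected for the same instance.
query_NodeHighCpuLoad > on (instance) group_left() (
  threshold_NodeHighCpuLoad_critical or on (instance) query_NodeHighCpuLoad * 0 + 90
)
```

The trailing `or absent (query_NodeHighCpuLoad)*-1` in the alert adds a series with value -1 when the query returns nothing at all, so the alert also fires when the metric is missing.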
The following PromQL query tries to use two group filters (container, pod) and does not work. I suspect it is a label-matching issue.
- record: threshold_ContainerHighCpuLoad_warning
  expr: 0
  labels:
    container: gitlab
- record: threshold_ContainerHighCpuLoad_critical
  expr: 1
  labels:
    container: gitlab
- record: threshold_ContainerHighCpuLoad_warning
  expr: 1
  labels:
    container: prometheus
- record: threshold_ContainerHighCpuLoad_critical
  expr: 2
  labels:
    container: prometheus
- record: query_ContainerHighCpuLoad
  expr: (sum by(pod, namespace, job, instance, image, name, container) (rate(container_cpu_usage_seconds_total{container!="POD",image!="",namespace!~"kube-system"}[1m])))
- alert: ContainerHighCpuLoadWarning
  expr: query_ContainerHighCpuLoad > on (container,pod) group_left() ( threshold_ContainerHighCpuLoad_warning or on (container,pod) query_ContainerHighCpuLoad * 0 + .5) or absent(query_ContainerHighCpuLoad)*-1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Host high CPU load ({{$labels.container}} {{ $labels.namespace }}/{{ $labels.pod }})
    description: CPU load\n VALUE = {{ $value }}
- alert: ContainerHighCpuLoadCritical
  expr: query_ContainerHighCpuLoad > on (container,pod) group_left() ( threshold_ContainerHighCpuLoad_critical or on (container,pod) query_ContainerHighCpuLoad * 0 + 1) or absent(query_ContainerHighCpuLoad)*-1
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: Host high CPU load ({{$labels.container}} {{ $labels.namespace }}/{{ $labels.pod }})
    description: CPU load\n VALUE = {{ $value }}
I tried adding pod as a catch-all, as shown below, but it did not work.
- record: threshold_ContainerHighCpuLoad_critical
  expr: 1
  labels:
    container: gitlab
    pod: ".*"
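I suspect the label is evaluated with "=" rather than "=~", so it does not match.
I found that adding the following gives the expected result. However, because pod names are dynamic, I need some kind of regex match.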
- record: threshold_ContainerHighCpuLoad_warning
  expr: 0
  labels:
    container: gitlab
    pod: gitlab-67dd9b7d59-np4js
- record: threshold_ContainerHighCpuLoad_critical
  expr: 1
  labels:
    container: gitlab
    pod: gitlab-67dd9b7d59-np4js
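Vector matching with `on (...)` compares label values for exact equality; regular expressions apply only inside selectors such as `{pod=~"..."}`. A label recorded as `pod: ".*"` is therefore the literal string `.*`. One possible way to confirm this, using only the rule names from above, is to compare the label sets on each side of the match (run each expression separately):

```promql
# Label sets on the threshold side: pod is either missing or holds a literal value.
count by (container, pod) (threshold_ContainerHighCpuLoad_critical)

# Label sets on the query side: pod holds the real, dynamic pod names.
count by (container, pod) (query_ContainerHighCpuLoad)

# Pairs that match under on (container, pod). The result is empty unless
# both sides carry identical container and pod values.
count by (container, pod) (query_ContainerHighCpuLoad)
  and on (container, pod)
count by (container, pod) (threshold_ContainerHighCpuLoad_critical)
```

If the last expression returns results only for the rules that carry the exact pod name, the mismatch is in the pod label values rather than in the threshold logic.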
Does anyone know how to solve this?
Thanks! -Kendall Chenoweth
Answer 0: (score: 0)
I figured it out. I modified the query to use sum without(label list).
- record: query_ContainerHighCpuLoad
  expr: (sum without(id, node, service, pod, name, image, instance) (rate(container_cpu_usage_seconds_total{container!="POD",image!="",namespace!~"kube-system"}[1m])))
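When you do not want the time series split by certain label values, as in query_ContainerHighCpuLoad, those labels must be listed in without() so that they are ignored.
To determine which labels are splitting your time series, first run the alert PromQL (next section) with the recording rule expanded inline, and include in the without() list every label that is not needed for the join. Then remove them one at a time to determine which ones are problematic and must stay in the final without() list.
Start with this query.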
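As a brief illustration of the difference, the two aggregations below use the same metric as the rule above. The label lists are shortened for readability and are only examples; run each expression separately.

```promql
# by: the result keeps only the listed labels; every other label is dropped.
sum by (container, namespace) (
  rate(container_cpu_usage_seconds_total{container!="POD",image!=""}[1m])
)

# without: the result drops only the listed labels and keeps all the others,
# so any label not listed here still splits the output into separate series.
sum without (pod, instance) (
  rate(container_cpu_usage_seconds_total{container!="POD",image!=""}[1m])
)
```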
(sum without(id, node, service, name, image, pod, instance, cpu, endpoint, job, metrics_path) (rate(container_cpu_usage_seconds_total{container!="POD",image!="",namespace!~"kube-system"}[1m]))) > on (container, namespace) group_left() ( threshold_ContainerHighCpuLoad_warning or on (container) (sum without(id, node, service, pod, name, image, instance, cpu, endpoint, job, metrics_path) (rate(container_cpu_usage_seconds_total{container!="POD",image!="",namespace!~"kube-system"}[1m]))) * 0 + .4)
This exercise yields the following query.
(sum without(id, node, service, name, image, pod, instance) (rate(container_cpu_usage_seconds_total{container!="POD",image!="",namespace!~"kube-system"}[1m]))) > on (container, namespace) group_left() ( threshold_ContainerHighCpuLoad_warning or on (container) (sum without(id, node, service, pod, name, image, instance) (rate(container_cpu_usage_seconds_total{container!="POD",image!="",namespace!~"kube-system"}[1m]))) * 0 + .4)
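A possible check for the resulting without() list, assuming the aggregated series should be unique per (container, namespace): count how many series remain in each group. Any result from the expression below means some remaining label still splits the series, in which case the `on (container, namespace) group_left()` match may fail because the right-hand side is not unique.

```promql
# Returns only the (container, namespace) groups that still contain
# more than one series after the aggregation.
count by (container, namespace) (
  sum without (id, node, service, pod, name, image, instance) (
    rate(container_cpu_usage_seconds_total{container!="POD",image!="",namespace!~"kube-system"}[1m])
  )
) > 1
```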
You can now update the definition of query_ContainerHighCpuLoad and simplify the expression.
query_ContainerHighCpuLoad > on (container, namespace) group_left() ( threshold_ContainerHighCpuLoad_warning or on (container) query_ContainerHighCpuLoad * 0 + .4)
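Because some labels (such as instance) are dropped by the aggregation but are relevant for troubleshooting, they can be recovered with the following kubectl command.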
kubectl get po -n monitoring -o jsonpath='{range .items[*]}{"\n"}{"pod: "}{.metadata.name}/{.metadata.namespace}: {range .spec.containers[*]}{.name}{","}{end}{"\n"}' | grep "prometheus,"
In this command, the namespace, monitoring, and the container name, prometheus, are used to extract output such as
pod: prometheus-kubeprom-kube-prometheus-s-prometheus-0/monitoring: prometheus,config-reloader,
Because the query runs against the container name, multiple instances may be returned.
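As an alternative sketch, the dropped labels can also be looked up from the raw metric in Prometheus itself. This assumes the raw series carry pod and instance labels, as the original `sum by(pod, ..., instance, ...)` rule suggests; the container and namespace values are the example values used above.

```promql
# One result per pod/instance combination currently running the container.
count by (namespace, pod, instance) (
  container_cpu_usage_seconds_total{container="prometheus", namespace="monitoring"}
)
```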