I have a fairly typical query to show CPU usage:
100 - (avg by (instance) (irate(wmi_cpu_time_total{mode="idle"}[2m])) * 100) > 80
It produces data that looks like this:
{instance="opus143.domain.com:9182"} 94.07140535559513
{instance="opus162.domain.com:9182"} 90.00755315803018
{instance="opus163.domain.com:9182"} 85.48084077380952
However, I only want the values for machines that do NOT appear in another list:
opus_local_slaves_count > 0
opus_local_slaves_count{instance="opus143.domain.com:5100",job="opus-live",runname="SimV3.1a"} 54
opus_local_slaves_count{instance="opus143.domain.com:5110",job="opus-live",runname="SimV3.1a"} 54
opus_local_slaves_count{instance="opus145.domain.com:5100",job="opus-live",runname="SimV3.1a"} 54
opus_local_slaves_count{instance="opus145.domain.com:5110",job="opus-live",runname="SimV3.1a"} 54
opus_local_slaves_count{instance="opus146.domain.com:5100",job="opus-live",runname="SimV3.1a"} 54
opus_local_slaves_count{instance="opus146.domain.com:5110",job="opus-live",runname="SimV3.1a"} 54
I think I have managed to use label_replace to give me a host label in each case:
(label_replace((100 - (avg by (instance) (irate(wmi_cpu_time_total{mode="idle"}[2m])) * 100) > 80), "host", "$1","instance","(.*?)[.].*"))
{host="opus143",instance="opus143.domain.com:9182"} 94.07140535559513
{host="opus162",instance="opus162.domain.com:9182"} 90.00755315803018
{host="opus163",instance="opus163.domain.com:9182"} 85.48084077380952
label_replace((opus_local_slaves_count > 0), "host", "$1","instance","(.*?)[.].*")
opus_local_slaves_count{host="opus143",instance="opus143.domain.com:5100",job="opus-live",runname="SimV3.1a"} 54
opus_local_slaves_count{host="opus143",instance="opus143.domain.com:5110",job="opus-live",runname="SimV3.1a"} 54
opus_local_slaves_count{host="opus145",instance="opus145.domain.com:5100",job="opus-live",runname="SimV3.1a"} 54
opus_local_slaves_count{host="opus145",instance="opus145.domain.com:5110",job="opus-live",runname="SimV3.1a"} 54
opus_local_slaves_count{host="opus146",instance="opus146.domain.com:5100",job="opus-live",runname="SimV3.1a"} 54
opus_local_slaves_count{host="opus146",instance="opus146.domain.com:5110",job="opus-live",runname="SimV3.1a"} 54
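But now I am really struggling to exclude the hosts in the second list from the first list. Is this even possible in PromQL? In SQL it would be a simple NOT IN subquery.
Update: for context, what I am trying to achieve is to alert on high CPU usage on servers, except for the servers in the second list, which are expected to have high CPU usage. Maybe there is a better way?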
Answer 0 (score: 0)
Solved!
For anyone looking to do something similar... the salient keyword is UNLESS!
I started by creating recording rules to simplify things:
groups:
- name: custom_rules
  rules:
  - record: wmi_cpu_time_total_instance
    expr: 100 - (avg by (instance) (irate(wmi_cpu_time_total{mode="idle"}[2m])) * 100)
  - record: wmi_cpu_time_total_instance_host
    expr: label_replace(wmi_cpu_time_total_instance, "host", "$1", "instance", "(.*?)[.].*")
  - record: opus_local_slaves_count_instance_host
    expr: label_replace(opus_local_slaves_count, "host", "$1", "instance", "(.*?)[.].*")
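These rules encapsulate most of the complexity of the calculation and of adding the host label. Then I found this blog post (thank you, Chris Siebenmann) https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusFindUnpairedMetrics, which pointed me to the UNLESS keyword, so I could write a simple query: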
wmi_cpu_time_total_instance_host unless on(host) (opus_local_slaves_count_instance_host > 0)
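This gives the list of hosts that either have no opus_local_slaves_count series at all or have opus_local_slaves_count = 0. Matching is done on the host label only, so the differing ports in the instance label do not matter.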
Voila!
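For the alerting goal mentioned in the question, the query still needs the 80% threshold from the original expression. The following is only a sketch of how that might look as an alerting rule built on the recording rules above; the group name, alert name, `for` duration, severity label and summary text are all assumptions for illustration and are not part of the original setup.

```yaml
groups:
  - name: custom_alerts            # hypothetical group name
    rules:
      - alert: HighCpuUsage        # hypothetical alert name
        # Hosts above 80% CPU, minus every host that currently
        # reports opus_local_slaves_count > 0 (matched on "host" only).
        expr: |
          (wmi_cpu_time_total_instance_host > 80)
            unless on(host)
          (opus_local_slaves_count_instance_host > 0)
        for: 5m                    # assumed hold time, tune as needed
        labels:
          severity: warning        # assumed label
        annotations:
          summary: "High CPU usage on {{ $labels.host }}"
```

Applying the `> 80` filter before `unless` keeps the result limited to busy hosts, so a host drops out of the alert as soon as it starts reporting a non-zero opus_local_slaves_count.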