Monitoring excessive pod preemption/rescheduling across a cluster with Prometheus

Asked: 2019-12-11 17:02:57

Tags: prometheus prometheus-alertmanager promql

I am trying to monitor excessive pod preemption/rescheduling across the whole cluster. We currently have an alerting rule in place that watches for pods in a crash loop, shown below.

 - name: KubePodCrashLooping
   message: '{{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }})
          is restarting {{ printf "%.2f" $value }} / second'
   severity: medium
   impact: This pod is crashing continuously
   action: Look at the pods logs to find why it is crashing consistently, if that doesn't show any service level issues check Kubelet's logs
   expression: >
          sum(rate(kube_pod_container_status_restarts_total{job="kube-state-metrics"}[15m])) by (namespace, pod)
          * on (namespace, pod) group_left(label_name)
          kube_pod_labels{label_monitoring="roger"} > 0
   for: 1h

Recently, however, we realized it would be more useful to monitor this in a slightly different way. I would like to specify a value, say x, such that if a pod crash-loops more than x times, I receive an alert saying that pod crash-looping has increased by 15% over its usual level within a specified time window. Is this something Prometheus can do?
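For illustration, one common PromQL pattern for "increased by 15% over its usual level" is to compare the current restart rate against the same series offset into the past, and fire only when the current value exceeds that baseline by the chosen factor. This is a sketch, not the poster's rule: the metric name and `job` label are taken from the rule above, while the rule name `KubePodRestartsIncreasing`, the `offset 1d` baseline window, and the `> 0.01` absolute floor are assumptions chosen for the example:

```yaml
# Hypothetical sketch: alert when a pod's restart rate is >15% above
# the rate the same pod had 1 day ago (baseline window is an assumption).
- name: KubePodRestartsIncreasing
  message: '{{ $labels.namespace }}/{{ $labels.pod }} restart rate is more than
          15% above its baseline from one day ago'
  severity: medium
  impact: This pod is restarting noticeably more often than usual
  action: Compare current pod logs against yesterday's behaviour to find what changed
  expression: >
          sum(rate(kube_pod_container_status_restarts_total{job="kube-state-metrics"}[15m])) by (namespace, pod)
          > 1.15 *
          sum(rate(kube_pod_container_status_restarts_total{job="kube-state-metrics"}[15m] offset 1d)) by (namespace, pod)
          and
          sum(rate(kube_pod_container_status_restarts_total{job="kube-state-metrics"}[15m])) by (namespace, pod)
          > 0.01
  for: 1h
```

Note that in PromQL the `offset` modifier must immediately follow the range selector it applies to, and the extra `and ... > 0.01` clause is a guard against alerting when the baseline is near zero, where a 15% relative increase is meaningless.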

0 Answers:

There are no answers yet.