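I am trying to monitor excessive pod preemption/rescheduling across the cluster. Right now we have an alert in place that watches for pods that are crash looping, as shown below.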
- name: KubePodCrashLooping
  message: '{{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }})
    is restarting {{ printf "%.2f" $value }} / second'
  severity: medium
  impact: This pod is crashing continuously
  action: Look at the pod's logs to find why it is crashing consistently, if that doesn't show any service level issues check Kubelet's logs
  expression: >
    sum(rate(kube_pod_container_status_restarts_total{job="kube-state-metrics"}[15m])) by (namespace, pod)
    * on (namespace, pod) group_left(label_name)
    kube_pod_labels{label_monitoring="roger"} > 0
  for: 1h
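Recently, though, we realized it would be more useful to monitor this in a slightly different way. I would like to specify a value, say x, so that if pods crash loop more than x times I get an alert saying that crash looping has increased by 15% over the usual level within a specified time period. Is this something Prometheus can do?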
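To make the question more concrete, here is a rough, untested sketch of the kind of rule I have in mind, written in the same rule format as above. Everything specific in it is an assumption for illustration: the rule name `KubePodRestartsAboveBaseline`, the threshold x = 5, the 1h window, and using the same hour one day earlier (`offset 1d`) as "usual". It aggregates by namespace rather than by pod, because a rescheduled pod usually gets a new name and would have no baseline series to compare against; it also drops the `kube_pod_labels` join for brevity.

```yaml
- name: KubePodRestartsAboveBaseline   # hypothetical name
  message: '{{ $labels.namespace }} had {{ printf "%.0f" $value }} container
    restarts in the last hour, more than 15% above the same hour yesterday'
  severity: medium
  impact: Pods in this namespace are restarting more often than usual
  action: Compare recent restarts with the previous day and check the affected pods' logs
  expression: >
    # restarts in the last hour exceed 115% of the same hour one day earlier ...
    sum(increase(kube_pod_container_status_restarts_total{job="kube-state-metrics"}[1h])) by (namespace)
      > 1.15 *
    sum(increase(kube_pod_container_status_restarts_total{job="kube-state-metrics"}[1h] offset 1d)) by (namespace)
    # ... and also exceed the absolute threshold x (assumed to be 5 here)
    and
    sum(increase(kube_pod_container_status_restarts_total{job="kube-state-metrics"}[1h])) by (namespace) > 5
  for: 15m
```

The `and` clause is there so that a namespace with a near-zero baseline does not alert on a single restart. If the baseline window has no data at all for a namespace, the comparison returns nothing and the alert will not fire for it.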