Alerts are firing in Prometheus, but the Slack notification is not being triggered. Alertmanager says there are no alerts. I am attaching the configuration files for the Alertmanager and the Prometheus rules.
I need some immediate help, as this is a production-related issue.

prometheus-rules.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  creationTimestamp: null
  name: prometheus-rules-conf
  namespace: monitoring
data:
  kubernetes_alerts.yml: |
    groups:
    - name: kubernetes_alerts
      rules:
      - alert: DeploymentGenerationOff
        expr: kube_deployment_status_observed_generation != kube_deployment_metadata_generation
        for: 5m
        labels:
          severity: warning
        annotations:
          description: Deployment generation does not match expected generation {{ $labels.namespace }}/{{ $labels.deployment }}
          summary: Deployment is outdated
      - alert: DeploymentReplicasNotUpdated
        expr: ((kube_deployment_status_replicas_updated != kube_deployment_spec_replicas)
          or (kube_deployment_status_replicas_available != kube_deployment_spec_replicas))
          unless (kube_deployment_spec_paused == 1)
        for: 5m
        labels:
          severity: warning
        annotations:
          description: Replicas are not updated and available for deployment {{ $labels.namespace }}/{{ $labels.deployment }}
          summary: Deployment replicas are outdated
      - alert: PodzFrequentlyRestarting
        expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          description: Pod {{ $labels.namespace }}/{{ $labels.pod }} was restarted {{ $value }} times within the last hour
          summary: Pod is restarting frequently
      - alert: KubeNodeNotReady
        expr: kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 1h
        labels:
          severity: warning
        annotations:
          description: The Kubelet on {{ $labels.node }} has not checked in with the API,
            or has set itself to NotReady, for more than an hour
          summary: Node status is NotReady
      - alert: KubeManyNodezNotReady
        expr: count(kube_node_status_condition{condition="Ready",status="true"} == 0)
          > 1 and (count(kube_node_status_condition{condition="Ready",status="true"} ==
          0) / count(kube_node_status_condition{condition="Ready",status="true"})) > 0.2
        for: 1m
        labels:
          severity: critical
        annotations:
          description: '{{ $value }}% of Kubernetes nodes are not ready'
      - alert: APIHighLatency
        expr: apiserver_latency_seconds:quantile{quantile="0.99",subresource!="log",verb!~"^(?:WATCH|WATCHLIST|PROXY|CONNECT)$"} > 4
        for: 10m
        labels:
          severity: critical
        annotations:
          description: the API server has a 99th percentile latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }}
      - alert: APIServerErrorsHigh
        expr: rate(apiserver_request_count{code=~"^(?:5..)$"}[5m]) / rate(apiserver_request_count[5m]) * 100 > 5
        for: 10m
        labels:
          severity: critical
        annotations:
          description: API server returns errors for {{ $value }}% of requests
      - alert: KubernetesAPIServerDown
        expr: up{job="kubernetes-apiservers"} == 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: Apiserver {{ $labels.instance }} is down!
      - alert: KubernetesAPIServersGone
        expr: absent(up{job="kubernetes-apiservers"})
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: No Kubernetes apiservers are reporting!
          description: Werner Heisenberg says - OMG Where are my apiserverz?
  prometheus_alerts.yml: |
    groups:
    - name: prometheus_alerts
      rules:
      - alert: PrometheusConfigReloadFailed
        expr: prometheus_config_last_reload_successful == 0
        for: 10m
        labels:
          severity: warning
        annotations:
          description: Reloading Prometheus configuration has failed on {{$labels.instance}}.
      - alert: PrometheusNotConnectedToAlertmanagers
        expr: prometheus_notifications_alertmanagers_discovered < 1
        for: 1m
        labels:
          severity: warning
        annotations:
          description: Prometheus {{ $labels.instance}} is not connected to any Alertmanagers
  node_alerts.yml: |
    groups:
    - name: node_alerts
      rules:
      - alert: HighNodeCPU
        expr: instance:node_cpu:avg_rate5m > 80
        for: 10s
        labels:
          severity: warning
        annotations:
          summary: High Node CPU of {{ humanize $value}}% for 1 hour
      - alert: DiskWillFillIn4Hours
        expr: predict_linear(node_filesystem_free_bytes{mountpoint="/"}[1h], 4*3600) < 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Disk on {{ $labels.instance }} will fill in approximately 4 hours.
      - alert: KubernetesServiceDown
        expr: up{job="kubernetes-service-endpoints"} == 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: Pod {{ $labels.instance }} is down!
      - alert: KubernetesServicesGone
        expr: absent(up{job="kubernetes-service-endpoints"})
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: No Kubernetes services are reporting!
          description: Werner Heisenberg says - OMG Where are my servicez?
      - alert: CriticalServiceDown
        expr: node_systemd_unit_state{state="active"} != 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: Service {{ $labels.name }} failed to start.
          description: Service {{ $labels.instance }} failed to (re)start service {{ $labels.name }}.
  proxy_alert.yml: |
    groups:
    - name: proxy_alert
      rules:
      - alert: Proxy_Down
        expr: probe_success{instance="http://ip",job="blackbox"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Proxy Server {{ $labels.instance }} is down!
  kubernetes_rules.yml: |
    groups:
    - name: kubernetes_rules
      rules:
      - record: apiserver_latency_seconds:quantile
        expr: histogram_quantile(0.99, rate(apiserver_request_latencies_bucket[5m])) / 1e+06
        labels:
          quantile: "0.99"
      - record: apiserver_latency_seconds:quantile
        expr: histogram_quantile(0.9, rate(apiserver_request_latencies_bucket[5m])) / 1e+06
        labels:
          quantile: "0.9"
      - record: apiserver_latency_seconds:quantile
        expr: histogram_quantile(0.5, rate(apiserver_request_latencies_bucket[5m])) / 1e+06
        labels:
          quantile: "0.5"
prometheus-configmap.yaml
alerting:
  alertmanagers:
  - kubernetes_sd_configs:
    - role: endpoints
    relabel_configs:
    - source_labels: [__meta_kubernetes_service_name]
      regex: alertmanager
      action: keep
    - source_labels: [__meta_kubernetes_namespace]
      regex: monitoring
      action: keep
    - source_labels: [__meta_kubernetes_pod_container_port_number]
      action: keep
      regex: 9093
rule_files:
- "/var/prometheus/rules/*_rules.yml"
- "/var/prometheus/rules/*_alerts.yml"
Even though I can see the endpoints in Prometheus, the alerts are still not being triggered.
Answer (score: 2):

The problem: you have set up alerts in Prometheus, but they never trigger a notification. I have collected a few rules of thumb to verify that your alerts are set up correctly, loaded, and working like a charm on the Prometheus dashboard:

- Make sure the rules files are listed under the rule_files key in prometheus.yml and that they exist at the correct path inside the Prometheus container (see the mount sketch below).
- If you changed an existing rule, restart the Alertmanager: docker restart <alert-manager-service-name>
- Reload the Prometheus configuration: curl -X POST localhost:9090/-/reload
- Confirm that the new alerts appear on localhost:9090/alerts and that their condition statements are valid and should actually trigger an event (event -> route -> receiver); the receiver side is sketched further below.
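To make the first check concrete: with the rule_files globs above, the rules ConfigMap from the question has to be mounted at /var/prometheus/rules inside the Prometheus container (note also that the key proxy_alert.yml matches neither *_rules.yml nor *_alerts.yml, so that group would not be loaded at all). A minimal sketch of such a mount in the Prometheus Deployment, with placeholder container and volume names, might look like this:

spec:
  template:
    spec:
      containers:
      - name: prometheus                     # placeholder container name
        volumeMounts:
        - name: prometheus-rules-volume
          mountPath: /var/prometheus/rules   # must match the rule_files paths
      volumes:
      - name: prometheus-rules-volume
        configMap:
          name: prometheus-rules-conf        # the ConfigMap from the question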
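Since the symptom is "alerts fire in Prometheus but nothing reaches Slack", the route -> receiver part of the Alertmanager configuration is also worth double-checking. A minimal alertmanager.yml sketch with a single Slack receiver, using a placeholder webhook URL and channel name, could look like this:

global:
  resolve_timeout: 5m
route:
  receiver: slack-notifications     # default receiver; all alerts fall through to it
  group_by: ['alertname', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h
receivers:
- name: slack-notifications
  slack_configs:
  - api_url: https://hooks.slack.com/services/T000/B000/XXXX   # placeholder webhook URL
    channel: '#alerts'                                          # placeholder channel
    send_resolved: true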
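Finally, the alerting block in your Prometheus config only keeps discovered endpoints whose service is named alertmanager, lives in the monitoring namespace, and whose pod container port is 9093; if any of those relabel rules does not match, Prometheus silently ends up paired with zero Alertmanagers. A hypothetical Service that would satisfy those rules (the label selector is a placeholder, and the Alertmanager pod must expose containerPort 9093):

apiVersion: v1
kind: Service
metadata:
  name: alertmanager        # matched by the __meta_kubernetes_service_name rule
  namespace: monitoring     # matched by the __meta_kubernetes_namespace rule
spec:
  selector:
    app: alertmanager       # placeholder label selector for the Alertmanager pods
  ports:
  - name: web
    port: 9093
    targetPort: 9093        # the container port the last relabel rule keeps

The /api/v1/alertmanagers endpoint on the Prometheus server reports which Alertmanagers it is actually paired with, which is a quick way to confirm the discovery side before digging into the receivers.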