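I just created a cluster with two n1-standard-2 nodes on GKE and installed prometheusOperator with the official Helm chart.
Prometheus seems to be working fine, but I am getting alerts like these: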
message: 33% throttling of CPU in namespace kube-system for container metrics-server in pod metrics-server-v0.3.1-8d4c5db46-zddql.
22 minutes ago, container: metrics-server, pod: metrics-server-v0.3.1-8d4c5db46-zddql
message: 35% throttling of CPU in namespace kube-system for container heapster-nanny in pod heapster-v1.6.1-554bfbc7d-tg6fm.
an hour ago, container: heapster-nanny, pod: heapster-v1.6.1-554bfbc7d-tg6fm
message: 77% throttling of CPU in namespace kube-system for container prometheus-to-sd in pod prometheus-to-sd-789b2.
20 hours ago, container: prometheus-to-sd, pod: prometheus-to-sd-789b2
message: 45% throttling of CPU in namespace kube-system for container heapster in pod heapster-v1.6.1-554bfbc7d-tg6fm.
20 hours ago, container: heapster, pod: heapster-v1.6.1-554bfbc7d-tg6fm
message: 38% throttling of CPU in namespace kube-system for container default-http-backend in pod l7-default-backend-8f479dd9-9n77b.
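All of these pods are part of the default GKE installation, and I have not modified any of them. I believe they belong to some Google Cloud tools that I have not really tried to use yet.
My nodes are not under heavy load: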
kubectl top node
NAME                                      CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
gke-psi-cluster-01-pool-1-d5650403-cl4g   230m         11%    2973Mi          52%
gke-psi-cluster-01-pool-1-d5650403-xn35   146m         7%     2345Mi          41%
Here is my Prometheus Helm configuration:
alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 5Gi
  config:
    global:
      resolve_timeout: 5m
    receivers:
    - name: "null"
    - name: slack_k8s
      slack_configs:
      - api_url: REDACTED
        channel: '#k8s'
        send_resolved: true
        text: |-
          {{ range .Alerts }}
          {{- if .Annotations.summary }}
          *{{ .Annotations.summary }}*
          {{- end }}
          *Severity* : {{ .Labels.severity }}
          {{- if .Labels.namespace }}
          *Namespace* : {{ .Labels.namespace }}
          {{- end }}
          {{- if .Annotations.description }}
          {{ .Annotations.description }}
          {{- end }}
          {{- if .Annotations.message }}
          {{ .Annotations.message }}
          {{- end }}
          {{ end }}
        title: '{{ (index .Alerts 0).Labels.alertname }}'
        title_link: https://karma.REDACTED?q=alertname%3D{{ (index .Alerts 0).Labels.alertname }}
    route:
      group_by:
      - alertname
      - job
      group_interval: 5m
      group_wait: 30s
      receiver: slack_k8s
      repeat_interval: 6h
      routes:
      - match:
          alertname: Watchdog
        receiver: "null"
      - match:
          alertname: KubeAPILatencyHigh
        receiver: "null"
  ingress:
    enabled: false
    hosts:
    - alertmanager.REDACTED
coreDns:
  enabled: false
grafana:
  adminPassword: REDACTED
  ingress:
    annotations:
      kubernetes.io/tls-acme: "true"
    enabled: true
    hosts:
    - grafana.REDACTED
    tls:
    - hosts:
      - grafana.REDACTED
      secretName: grafana-crt-secret
  persistence:
    enabled: true
    size: 5Gi
kubeControllerManager:
  enabled: true
kubeDns:
  enabled: true
kubeScheduler:
  enabled: true
nodeExporter:
  enabled: true
prometheus:
  ingress:
    enabled: false
    hosts:
    - prometheus.REDACTED
  prometheusSpec:
    additionalScrapeConfigs:
    - basic_auth:
        password: REDACTED
        username: prometheus
    retention: 30d
    ruleSelectorNilUsesHelmValues: false
    serviceMonitorSelectorNilUsesHelmValues: false
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: 30Gi
prometheusOperator:
  createCustomResource: false
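I found this GitHub issue, https://github.com/kubernetes-monitoring/kubernetes-mixin/issues/108, but I am not sure whether it applies to my case, since these are default GKE pods. I want to make sure everything runs smoothly and that Stackdriver collects all my logs correctly, even though I have not really learned how to use it yet.
Should I modify the limits of the default GKE deployments in kube-system? Is there any problem with deploying prometheusOperator on GKE?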
Answer 0 (score: 0)
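After going through a number of links, I think I understand the problem here.
I believe you are running into this k8s issue. [1]
CFS quotas in Linux appear to affect all containerized clouds, including Kubernetes. You can work around the problem either by setting higher CPU limits in your cluster or by removing the CPU limits from the containers. Please test this in a staging environment first rather than directly in production.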
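To make the mechanism a bit more concrete, here is a minimal sketch of why a container can be throttled while the node is almost idle. It is a simplified model, not the kernel's actual scheduler: it assumes the default CFS period of 100 ms, converts a CPU limit in millicores into a per-period quota, and only counts periods in which the container had work to do. The function names and the demand trace are hypothetical and chosen only for illustration.

```python
# Simplified model of CFS bandwidth control (an illustration, not kernel code).
CFS_PERIOD_MS = 100.0  # default CFS period (cpu.cfs_period_us = 100000)


def quota_ms(cpu_limit_millicores):
    """CPU time (ms) a container may consume in one CFS period."""
    return cpu_limit_millicores * CFS_PERIOD_MS / 1000.0


def simulate(demand_ms_per_period, cpu_limit_millicores):
    """Return (throttled_ratio, average_millicores) for a demand trace.

    demand_ms_per_period: CPU time (ms) the container wants in each period.
    Work that does not fit into the quota carries over to the next period.
    Assumption of this model: idle periods are not counted in the ratio.
    """
    quota = quota_ms(cpu_limit_millicores)
    backlog = 0.0
    used_ms = 0.0
    total_periods = 0
    active_periods = 0
    throttled_periods = 0
    for demand in demand_ms_per_period:
        total_periods += 1
        backlog += demand
        if backlog == 0:
            continue  # nothing to run in this period
        active_periods += 1
        run = min(backlog, quota)
        used_ms += run
        backlog -= run
        if backlog > 0:
            throttled_periods += 1  # quota exhausted with work still pending
    ratio = throttled_periods / active_periods if active_periods else 0.0
    avg = used_ms * 1000.0 / (total_periods * CFS_PERIOD_MS) if total_periods else 0.0
    return ratio, avg


if __name__ == "__main__":
    # Hypothetical workload: one 30 ms burst per second, idle otherwise.
    trace = ([30.0] + [0.0] * 9) * 6
    for limit in (100, 300):
        ratio, avg = simulate(trace, limit)
        print(f"limit={limit}m throttled={ratio:.0%} average={avg:.0f}m")
```

In this model, the bursty workload averages only about 30 millicores, yet with a 100m limit it is throttled in two out of every three active periods, while a 300m limit removes the throttling entirely. That is consistent with low `kubectl top node` numbers alongside high throttling percentages, and with the suggestion to raise or remove the limits.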
Good luck!