I'm running a kube-up'ed Kubernetes 1.2.3 cluster on AWS on two m4.large nodes, and I'm using the automatically installed InfluxDB/Grafana pod for cluster monitoring.
My problem is that after a week or two, the InfluxDB container dies and never comes back up. I'm a bit unsure which logs to check for relevant error messages, but the syslog on the minion running the container contains the following:
Jun 16 05:57:41 ip-172-22-29-244 kubelet[4434]: E0616 05:57:41.382751 4434 event.go:193] Server rejected event '&api.Event{TypeMeta:unversioned.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:api.ObjectMeta{Name:"monitoring-influxdb-grafana-v3-dlx9o.145604121bcf8ade", GenerateName:"", Namespace:"kube-system", SelfLink:"", UID:"", ResourceVersion:"407635", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil)}, InvolvedObject:api.ObjectReference{Kind:"Pod", Namespace:"kube-system", Name:"monitoring-influxdb-grafana-v3-dlx9o", UID:"07c2a623-2b57-11e6-b7a9-068c6a09a769", APIVersion:"v1", ResourceVersion:"850776", FieldPath:""}, Reason:"FailedSync", Message:"Error syncing pod, skipping: failed to \"StartContainer\" for \"influxdb\" with CrashLoopBackOff: \"Back-off 5m0s restarting failed container=influxdb pod=monitoring-influxdb-grafana-v3-dlx9o_kube-system(07c2a623-2b57-11e6-b7a9-068c6a09a769)\"\n", Source:api.EventSource{Component:"kubelet", Host:"ip-172-22-29-244.eu-west-1.compute.internal"}, FirstTimestamp:unversioned.Time{Time:time.Time{sec:63600960004, nsec:0, loc:(*time.Location)(0x2e38da0)}}, LastTimestamp:unversioned.Time{Time:time.Time{sec:63601653461, nsec:379098581, loc:(*time.Location)(0x2e38da0)}}, Count:11023, Type:"Warning"}': 'events "monitoring-influxdb-grafana-v3-dlx9o.145604121bcf8ade" not found' (will not retry!)
Jun 16 05:57:54 ip-172-22-29-244 kubelet[4434]: I0616 05:57:54.378491 4434 manager.go:2050] Back-off 5m0s restarting failed container=influxdb pod=monitoring-influxdb-grafana-v3-dlx9o_kube-system(07c2a623-2b57-11e6-b7a9-068c6a09a769)
Jun 16 05:57:54 ip-172-22-29-244 kubelet[4434]: E0616 05:57:54.378545 4434 pod_workers.go:138] Error syncing pod 07c2a623-2b57-11e6-b7a9-068c6a09a769, skipping: failed to "StartContainer" for "influxdb" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=influxdb pod=monitoring-influxdb-grafana-v3-dlx9o_kube-system(07c2a623-2b57-11e6-b7a9-068c6a09a769)"
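The kubelet messages only show the back-off loop; for the crashed container's own output, kubectl's --previous flag should retrieve the logs of the last terminated instance. A sketch (pod name taken from the syslog above, yours will differ):

kubectl logs --previous --namespace=kube-system \
  monitoring-influxdb-grafana-v3-dlx9o -c influxdb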
I've also seen indications that the container was originally OOM-killed. My hypothesis is that the InfluxDB data grows too large over time since there is no automatic cleanup, that the container gets killed by Kubernetes once the 500MB memory limit from the manifest is breached, and that it then fails to restart, either for the same reason or because it times out while reading its indices.
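To check the OOM part of this hypothesis, the container's last termination state can be inspected; if the limit was hit, the reason should show up as OOMKilled:

# Shows events and per-container state, including
# "Last State: Terminated, Reason: OOMKilled" if the limit was breached
kubectl describe pod monitoring-influxdb-grafana-v3-dlx9o --namespace=kube-system
# Or the raw status, under status.containerStatuses[].lastState:
kubectl get pod monitoring-influxdb-grafana-v3-dlx9o --namespace=kube-system -o yaml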
Once this happens, the only way I've found to get everything back up and running is to kill the pod entirely and let Kubernetes recreate it from scratch, which essentially means losing all existing data.
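Concretely, the recovery amounts to deleting the pod and letting the monitoring-influxdb-grafana-v3 replication controller recreate it, along with an empty data volume:

kubectl delete pod monitoring-influxdb-grafana-v3-dlx9o --namespace=kube-system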
But what should I do about it? Changing the memory limits of kube-system pods doesn't seem trivial, and it would probably only buy a few more days anyway. I could build my own watchdog that cleans up old data, but keeping only 1-2 weeks of monitoring data limits its value.
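One watchdog-ish alternative might be an InfluxDB retention policy, which would cap growth automatically instead of requiring external cleanup. This is only a sketch, assuming the InfluxDB shipped with the addon is 0.9+ (older 0.8.x versions use shard spaces instead), that Heapster writes to its default k8s database, and that the monitoring-influxdb service is reachable from where the influx CLI runs:

# Keep two weeks of data and make that the default policy for new writes
# (host, database name, and policy name are assumptions; adjust to your setup)
influx -host monitoring-influxdb.kube-system.svc.cluster.local -port 8086 \
  -execute 'CREATE RETENTION POLICY "two_weeks" ON "k8s" DURATION 14d REPLICATION 1 DEFAULT'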