How to recover Pods stuck in Error or Terminating state in Kubernetes

Date: 2018-10-23 17:06:31

Tags: docker memory-management kubernetes out-of-memory cpu-usage

I have a cluster where free memory on the nodes recently dropped to about 5%. When that happens, node CPU (load) spikes while the node tries to free memory from cache/buffers. One consequence of the high load and low memory is that I sometimes end up with Pods that go into an Error state or get stuck in Terminating. These Pods sit there until I manually intervene, which further aggravates the out-of-memory problem that caused them in the first place.

My question is: why does Kubernetes leave these Pods in this state? My gut feeling is that Kubernetes never got the right feedback from the Docker daemon and never tries again. I need to know how to clean up or fix Pods stuck in Error or Terminating. Any ideas?

I'm currently on:

~ # kubectl version
Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.2", GitCommit:"bb9ffb1654d4a729bb4cec18ff088eacc153c239", GitTreeState:"clean", BuildDate:"2018-08-07T23:17:28Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.4", GitCommit:"5ca598b4ba5abb89bb773071ce452e33fb66339d", GitTreeState:"clean", BuildDate:"2018-06-06T08:00:59Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

Update: here are some of the events listed on the pods. You can see that some of them have been sitting there for days. You will also notice that one shows a Warning while the others show Normal. (A quick way to list the affected pods is sketched after the events below.)

Events:
  Type     Reason         Age                  From                 Message
  ----     ------         ----                 ----                 -------
  Warning  FailedKillPod  25m                  kubelet, k8s-node-0  error killing pod: failed to "KillContainer" for "kubectl" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded"
  Normal   Killing        20m (x2482 over 3d)  kubelet, k8s-node-0  Killing container with id docker://docker:Need to kill Pod
  Normal   Killing        15m (x2484 over 3d)  kubelet, k8s-node-0  Killing container with id docker://maven:Need to kill Pod
  Normal   Killing        8m (x2487 over 3d)   kubelet, k8s-node-0  Killing container with id docker://node:Need to kill Pod
  Normal   Killing        4m (x2489 over 3d)   kubelet, k8s-node-0  Killing container with id docker://jnlp:Need to kill Pod

Events:
  Type    Reason   Age                 From                 Message
  ----    ------   ----                ----                 -------
  Normal  Killing  56m (x125 over 5h)  kubelet, k8s-node-2  Killing container with id docker://owasp-zap:Need to kill Pod
  Normal  Killing  47m (x129 over 5h)  kubelet, k8s-node-2  Killing container with id docker://jnlp:Need to kill Pod
  Normal  Killing  38m (x133 over 5h)  kubelet, k8s-node-2  Killing container with id docker://dind:Need to kill Pod
  Normal  Killing  13m (x144 over 5h)  kubelet, k8s-node-2  Killing container with id docker://maven:Need to kill Pod
  Normal  Killing  8m (x146 over 5h)   kubelet, k8s-node-2  Killing container with id docker://docker-cmds:Need to kill Pod
  Normal  Killing  1m (x149 over 5h)   kubelet, k8s-node-2  Killing container with id docker://pmd:Need to kill Pod

Events:
  Type    Reason   Age                  From                 Message
  ----    ------   ----                 ----                 -------
  Normal  Killing  56m (x2644 over 4d)  kubelet, k8s-node-0  Killing container with id docker://openssl:Need to kill Pod
  Normal  Killing  40m (x2651 over 4d)  kubelet, k8s-node-0  Killing container with id docker://owasp-zap:Need to kill Pod
  Normal  Killing  31m (x2655 over 4d)  kubelet, k8s-node-0  Killing container with id docker://pmd:Need to kill Pod
  Normal  Killing  26m (x2657 over 4d)  kubelet, k8s-node-0  Killing container with id docker://kubectl:Need to kill Pod
  Normal  Killing  22m (x2659 over 4d)  kubelet, k8s-node-0  Killing container with id docker://dind:Need to kill Pod
  Normal  Killing  11m (x2664 over 4d)  kubelet, k8s-node-0  Killing container with id docker://docker-cmds:Need to kill Pod
  Normal  Killing  6m (x2666 over 4d)   kubelet, k8s-node-0  Killing container with id docker://maven:Need to kill Pod
  Normal  Killing  1m (x2668 over 4d)   kubelet, k8s-node-0  Killing container with id docker://jnlp:Need to kill Pod
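For reference, a simple way to enumerate the affected pods across the whole cluster is to grep the STATUS column; this is just a sketch, not specific to my setup:

~ # kubectl get pods --all-namespaces | grep -E 'Error|Terminating'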

3 Answers:

Answer 0 (Score: 1)

This is usually related to metadata.finalizers on the object (pod, deployment, etc.).

You can also read more about Foreground Cascading Deletion and how it uses metadata.finalizers.
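For example, to check whether a stuck pod actually has finalizers set (a hedged example; the pod name and namespace are placeholders):

kubectl get pod pod-name-123abc -n your-app-namespace -o jsonpath='{.metadata.finalizers}'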

If it is not a network problem, you can check the kubelet logs, typically:

journalctl -xeu kubelet 

You can also check the Docker daemon logs, typically:

cat /var/log/syslog | grep dockerd
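Given the "context deadline exceeded" errors in the events above, grepping the same log for that message can narrow things down (just a suggestion, assuming dockerd logs to syslog on your distro):

grep -i "context deadline" /var/log/syslog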

Answer 1 (Score: 0)

I had to restart all the nodes. I noticed one minion was slow and unresponsive, most likely the culprit. After the restart all the Terminating Pods were gone.
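If it helps anyone, a gentler variant than a blind reboot is to drain the node first and uncordon it afterwards; a sketch, assuming the node name from the events above and the flags available on a 1.10/1.11 kubectl:

kubectl drain k8s-node-0 --ignore-daemonsets --delete-local-data
# ... reboot the node and wait for it to rejoin ...
kubectl uncordon k8s-node-0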

Answer 2 (Score: 0)

Removing the finalizers by running kubectl patch is a workaround. This can happen to different kinds of resources, such as persistent volumes or deployments; in my experience it is more common with PVs/PVCs.

# for pods
$ kubectl patch pod pod-name-123abc -p '{"metadata":{"finalizers":null}}' -n your-app-namespace

# for pvc
$ kubectl patch pvc pvc-name-123abc -p '{"metadata":{"finalizers":null}}' -n your-app-namespace
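If the pod still hangs around after its finalizers are cleared, a force delete with a zero grace period is a common last resort (a hedged example, reusing the placeholder names above):

# force delete, skipping the graceful termination period
$ kubectl delete pod pod-name-123abc --grace-period=0 --force -n your-app-namespace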