Jenkins gets Kubernetes nodes stuck when CPU usage is high

Asked: 2019-04-05 00:15:11

Tags: jenkins kubernetes kops

I've noticed that when certain Jenkins builds start, the node hosting the Jenkins agent sometimes gets stuck forever. The whole node becomes unreachable and all of its pods go down (shown as not ready in the dashboard).

To recover, I have to remove the node from the cluster and add it back (I'm on GCE, so I have to delete the instance from its instance group to get it replaced).
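For reference, the recovery roughly looks like this (the instance-group name and zone below are placeholders; the exact group name depends on how kops created it):

kubectl delete node nodes-s2-2g5v
# remove the underlying VM from its GCE managed instance group so a fresh one gets created
gcloud compute instance-groups managed delete-instances my-nodes-group --instances=nodes-s2-2g5v --zone=europe-west1-b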

Note: for several hours I couldn't even SSH into the node; apparently the SSH service was down as well ^^

From what I understand, running out of memory can bring a node down, but maxing out the CPU should only slow the server down and not have the kind of impact I'm seeing. At worst, the kubelet should just become unresponsive until the CPU pressure eases.
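From what I've read, that only really holds if some resources are reserved for the kubelet and system daemons, which as far as I know is done with kubelet flags along these lines (illustrative values only, not my actual config):

--kube-reserved=cpu=100m,memory=256Mi
--system-reserved=cpu=100m,memory=256Mi
--eviction-hard=memory.available<200Mi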

Can anyone help me figure out where this problem comes from? What could cause this kind of behaviour?

(Screenshots attached to the question: Node metrics 1, Node metrics 2, Jenkins slave metrics, Node metrics from GCE)

On another note, after waiting a few hours I was finally able to SSH into the node, and I ran sudo journalctl -u kubelet to see what had happened. I didn't see anything specific around 7 PM, but I did see errors like these repeating over and over:

Apr 04 19:00:58 nodes-s2-2g5v systemd[43508]: kubelet.service: Failed at step EXEC spawning /home/kubernetes/bin/kubelet: Permission denied
Apr 04 19:00:58 nodes-s2-2g5v systemd[1]: kubelet.service: Main process exited, code=exited, status=203/EXEC
Apr 04 19:00:58 nodes-s2-2g5v systemd[1]: kubelet.service: Unit entered failed state.
Apr 04 19:00:58 nodes-s2-2g5v systemd[1]: kubelet.service: Failed with result 'exit-code'.
Apr 04 19:01:00 nodes-s2-2g5v systemd[1]: kubelet.service: Service hold-off time over, scheduling restart.
Apr 04 19:01:00 nodes-s2-2g5v systemd[1]: Stopped Kubernetes Kubelet Server.
Apr 04 19:01:00 nodes-s2-2g5v systemd[1]: Started Kubernetes Kubelet Server.
Apr 04 19:01:00 nodes-s2-2g5v systemd[43511]: kubelet.service: Failed at step EXEC spawning /home/kubernetes/bin/kubelet: Permission denied
Apr 04 19:01:00 nodes-s2-2g5v systemd[1]: kubelet.service: Main process exited, code=exited, status=203/EXEC
Apr 04 19:01:00 nodes-s2-2g5v systemd[1]: kubelet.service: Unit entered failed state.
Apr 04 19:01:00 nodes-s2-2g5v systemd[1]: kubelet.service: Failed with result 'exit-code'.
Apr 04 19:01:02 nodes-s2-2g5v systemd[1]: kubelet.service: Service hold-off time over, scheduling restart.
Apr 04 19:01:02 nodes-s2-2g5v systemd[1]: Stopped Kubernetes Kubelet Server.
Apr 04 19:01:02 nodes-s2-2g5v systemd[1]: Started Kubernetes Kubelet Server.

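Side note: from what I understand, status=203/EXEC means systemd couldn't even execute /home/kubernetes/bin/kubelet, so I guess the next thing to check would be the binary itself and the filesystem it lives on, e.g.:

ls -l /home/kubernetes/bin/kubelet
findmnt -T /home/kubernetes/bin    # in case the filesystem ended up mounted noexec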
Going back through older logs, I found that these messages started around 5:30 PM:

Apr 04 17:26:50 nodes-s2-2g5v kubelet[1841]: I0404 17:25:05.168402    1841 prober.go:111] Readiness probe for "...
Apr 04 17:26:50 nodes-s2-2g5v kubelet[1841]: I0404 17:25:04.021125    1841 prober.go:111] Readiness probe for "...
-- Reboot --
Apr 04 17:31:31 nodes-s2-2g5v systemd[1]: Started Kubernetes Kubelet Server.
Apr 04 17:31:31 nodes-s2-2g5v systemd[1699]: kubelet.service: Failed at step EXEC spawning /home/kubernetes/bin/kubelet: Permission denied
Apr 04 17:31:31 nodes-s2-2g5v systemd[1]: kubelet.service: Main process exited, code=exited, status=203/EXEC
Apr 04 17:31:31 nodes-s2-2g5v systemd[1]: kubelet.service: Unit entered failed state.
Apr 04 17:31:31 nodes-s2-2g5v systemd[1]: kubelet.service: Failed with result 'exit-code'.
Apr 04 17:31:33 nodes-s2-2g5v systemd[1]: kubelet.service: Service hold-off time over, scheduling restart.
Apr 04 17:31:33 nodes-s2-2g5v systemd[1]: Stopped Kubernetes Kubelet Server.
Apr 04 17:31:33 nodes-s2-2g5v systemd[1]: Started Kubernetes Kubelet Server.

At that point the node's kubelet restarted, and it matches a Jenkins build: same pattern, very high CPU usage. What I don't get is why the kubelet restarted that early, yet the node only got stuck around 7 PM :/

Sorry for the wall of information, but I'm completely lost; this isn't the first time this has happened to me ^^

Thanks

1 answer:

Answer 0 (score: 0)

As @Brandon mentioned, this was related to the resource limits applied to my Jenkins slave pods.

In my case, those values were not actually being applied even though I had specified them in my Helm chart's YAML values file. I had to go into the Jenkins UI and set them manually on the pod template.
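For reference, this is the standard Kubernetes resources shape the agent container ends up with (the numbers here are just examples, and where exactly it lives in the chart values depends on the chart version):

resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "1"
    memory: "1Gi"

In the Jenkins UI these correspond to the request/limit CPU and memory fields in the Kubernetes pod template's container settings.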

With this change, everything is stable now! :)