我有一个Kubernetes 1.7.5群集,它已经以某种方式进入半破碎状态。在此群集上安排新部署部分失败:1/2 pod正常启动,但第二个pod未启动。事件是:
default 2017-09-28 03:57:02 -0400 EDT 2017-09-28 03:57:02 -0400 EDT 1 hello-4059723819-8s35v Pod spec.containers{hello} Normal Pulled kubelet, k8s-agentpool1-18117938-2 Successfully pulled image "myregistry.azurecr.io/mybiz/hello"
default 2017-09-28 03:57:02 -0400 EDT 2017-09-28 03:57:02 -0400 EDT 1 hello-4059723819-8s35v Pod spec.containers{hello} Normal Created kubelet, k8s-agentpool1-18117938-2 Created container
default 2017-09-28 03:57:03 -0400 EDT 2017-09-28 03:57:03 -0400 EDT 1 hello-4059723819-8s35v Pod spec.containers{hello} Normal Started kubelet, k8s-agentpool1-18117938-2 Started container
default 2017-09-28 03:57:13 -0400 EDT 2017-09-28 03:57:01 -0400 EDT 2 hello-4059723819-tj043 Pod Warning FailedSync kubelet, k8s-agentpool1-18117938-3 Error syncing pod
default 2017-09-28 03:57:13 -0400 EDT 2017-09-28 03:57:02 -0400 EDT 2 hello-4059723819-tj043 Pod Normal SandboxChanged kubelet, k8s-agentpool1-18117938-3 Pod sandbox changed, it will be killed and re-created.
default 2017-09-28 03:57:24 -0400 EDT 2017-09-28 03:57:01 -0400 EDT 3 hello-4059723819-tj043 Pod Warning FailedSync kubelet, k8s-agentpool1-18117938-3 Error syncing pod
default 2017-09-28 03:57:25 -0400 EDT 2017-09-28 03:57:02 -0400 EDT 3 hello-4059723819-tj043 Pod Normal SandboxChanged kubelet, k8s-agentpool1-18117938-3 Pod sandbox changed, it will be killed and re-created.
[...]
最后两条日志消息不断重复。
失败的广告连接的信息中心显示:
最后,仪表板显示错误:
Error: failed to start container "hello": Error response from daemon: {"message":"cannot join network of a non running container: 7e95918c6b546714ae20f12349efcc6b4b5b9c1e84b5505cf907807efd57525c"}
此群集使用CNI Azure网络插件在Azure上运行。在我启用--runtime-config=batch/v2alpha1=true
以使用CronJob
功能后,一切正常工作。现在,即使在删除该API级别并重新启动主服务器之后,问题仍然存在。
节点上的kubelet日志显示无法分配IP地址:
E0928 20:54:01.733682 1750 pod_workers.go:182] Error syncing pod 65127a94-a425-11e7-8d64-000d3af4357e ("hello-4059723819-xx16n_default(65127a94-a425-11e7-8d64-000d3af4357e)"), skipping: failed to "CreatePodSandbox" for "hello-4059723819-xx16n_default(65127a94-a425-11e7-8d64-000d3af4357e)" with CreatePodSandboxError: "CreatePodSandbox for pod \"hello-4059723819-xx16n_default(65127a94-a425-11e7-8d64-000d3af4357e)\" failed: rpc error: code = 2 desc = NetworkPlugin cni failed to set up pod \"hello-4059723819-xx16n_default\" network: Failed to allocate address: Failed to delegate: Failed to allocate address: No available addresses"
答案 0 :(得分:1)
这是Azure CNI的错误,并不总是正确地从已终止的pod中回收IP地址。请参阅此问题:https://github.com/Azure/azure-container-networking/issues/76。
启用CronJob
功能后发生这种情况的原因是cronjob容器(通常)是短暂的,每次运行时都会分配一个IP。如果这些IP不被基础网络系统回收和重复使用 - 在这种情况下是CNI--它们很快就会耗尽。