I have a preemptible node pool of size 1 on GKE. The pool has been running for almost a month; every 24 hours the node restarts and rejoins the cluster. Today it restarted but did not rejoin the cluster.
Instead, I noticed that according to gcloud compute instances list the underlying instance was running, but it was not included in the output of kubectl get nodes. I increased the node pool size to 2, which launched a second instance. That node joined my GKE cluster immediately and pods were scheduled onto it. The first node is still running according to gcloud, but it will not join the cluster.
What happened? How can I debug this?
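A quick way to confirm this state is to diff the GCE instance names against the registered Kubernetes node names. This is a minimal sketch, assuming gcloud and kubectl are already pointed at the affected project and cluster:

```shell
# Compare GCE instance names against registered Kubernetes node names.
# Assumes gcloud and kubectl are configured for the affected project/cluster.
gcloud compute instances list --format='value(name)' | sort > /tmp/instances.txt
kubectl get nodes -o name | sed 's|^node/||' | sort > /tmp/nodes.txt
# Instances that exist in GCE but never registered as cluster nodes:
comm -23 /tmp/instances.txt /tmp/nodes.txt
```

Any name printed by the last command is an instance that is running but has not joined the cluster.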
Update:
I SSH'd into the instance and was immediately greeted with this excellent error message:
Broken (or in progress) Kubernetes node setup! Check the cluster initialization status
using the following commands:
Master instance:
- sudo systemctl status kube-master-installation
- sudo systemctl status kube-master-configuration
Node instance:
- sudo systemctl status kube-node-installation
- sudo systemctl status kube-node-configuration
The result of sudo systemctl status kube-node-installation:
● kube-node-installation.service - Download and install k8s binaries and configurations
Loaded: loaded (/etc/systemd/system/kube-node-installation.service; enabled; vendor preset: disabled)
Active: active (exited) since Thu 2017-12-28 21:08:53 UTC; 6h ago
Process: 945 ExecStart=/home/kubernetes/bin/configure.sh (code=exited, status=0/SUCCESS)
Process: 941 ExecStartPre=/bin/chmod 544 /home/kubernetes/bin/configure.sh (code=exited, status=0/SUCCESS)
Process: 937 ExecStartPre=/usr/bin/curl --fail --retry 5 --retry-delay 3 --silent --show-error -H X-Google-Metadata-Request: True -o /home/kubernetes/bin/configure.sh http://metadata.google.internal/computeMetadata/v1/instance/attributes/configure-sh (code=exited, status=0/SUCCESS)
Process: 933 ExecStartPre=/bin/mount -o remount,exec /home/kubernetes/bin (code=exited, status=0/SUCCESS)
Process: 930 ExecStartPre=/bin/mount --bind /home/kubernetes/bin /home/kubernetes/bin (code=exited, status=0/SUCCESS)
Process: 925 ExecStartPre=/bin/mkdir -p /home/kubernetes/bin (code=exited, status=0/SUCCESS)
Main PID: 945 (code=exited, status=0/SUCCESS)
Tasks: 0 (limit: 4915)
Memory: 0B
CPU: 0
CGroup: /system.slice/kube-node-installation.service
Dec 28 21:08:52 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: Downloading node problem detector.
Dec 28 21:08:52 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: % Total % Received % Xferd Average Speed Time Time Time Current
Dec 28 21:08:52 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: Dload Upload Total Spent Left Speed
Dec 28 21:08:52 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: [158B blob data]
Dec 28 21:08:52 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: == Downloaded https://storage.googleapis.com/kubernetes-release/node-problem-detector/node-problem-detector-v0.4.1.tar.gz (SHA1 = a57a3fe64cab8a18ec654f5cef0aec59dae62568) ==
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: cni-0799f5732f2a11b329d9e3d51b9c8f2e3759f2ff.tar.gz is preloaded.
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: kubernetes-manifests.tar.gz is preloaded.
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: mounter is preloaded.
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: Done for installing kubernetes files
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc systemd[1]: Started Download and install k8s binaries and configurations.
The result of sudo systemctl status kube-node-configuration:
● kube-node-configuration.service - Configure kubernetes node
Loaded: loaded (/etc/systemd/system/kube-node-configuration.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Thu 2017-12-28 21:08:53 UTC; 6h ago
Process: 994 ExecStart=/home/kubernetes/bin/configure-helper.sh (code=exited, status=4)
Process: 990 ExecStartPre=/bin/chmod 544 /home/kubernetes/bin/configure-helper.sh (code=exited, status=0/SUCCESS)
Main PID: 994 (code=exited, status=4)
CPU: 33ms
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc systemd[1]: Starting Configure kubernetes node...
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[994]: Start to configure instance for kubernetes
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[994]: Configuring IP firewall rules
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[994]: Another app is currently holding the xtables lock. Perhaps you want to use the -w option?
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[994]: Another app is currently holding the xtables lock. Perhaps you want to use the -w option?
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[994]: Another app is currently holding the xtables lock. Perhaps you want to use the -w option?
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc systemd[1]: kube-node-configuration.service: Main process exited, code=exited, status=4/NOPERMISSION
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc systemd[1]: Failed to start Configure kubernetes node.
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc systemd[1]: kube-node-configuration.service: Unit entered failed state.
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc systemd[1]: kube-node-configuration.service: Failed with result 'exit-code'.
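The status=4 exit together with the repeated lock messages suggests configure-helper.sh raced some other process (another iptables caller on the node) for the xtables lock and gave up instead of waiting; iptables -w makes the caller block until the lock is free. A minimal retry sketch of the same idea follows; the iptables rule shown is purely illustrative and not taken from the GKE scripts:

```shell
# Retry wrapper for commands that can lose the xtables-lock race.
retry() {
  local attempt=0
  until "$@"; do
    attempt=$((attempt + 1))
    [ "$attempt" -ge 5 ] && return 1   # give up after 5 tries
    sleep 1
  done
}
# iptables -w waits for the xtables lock instead of failing immediately.
# Illustrative rule only; needs root, so fall back to a message otherwise.
retry iptables -w -C INPUT -p tcp --dport 10250 -j ACCEPT 2>/dev/null \
  || echo "iptables rule not applied (needs root)"
```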
So it looks like kube-node-configuration failed. I ran sudo systemctl restart kube-node-configuration, and now the status output is:
● kube-node-configuration.service - Configure kubernetes node
Loaded: loaded (/etc/systemd/system/kube-node-configuration.service; enabled; vendor preset: disabled)
Active: active (exited) since Fri 2017-12-29 03:41:36 UTC; 3s ago
Main PID: 20802 (code=exited, status=0/SUCCESS)
CPU: 1.851s
Dec 29 03:41:28 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Extend the docker.service configuration to set a higher pids limit
Dec 29 03:41:28 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Docker command line is updated. Restart docker to pick it up
Dec 29 03:41:30 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Start kubelet
Dec 29 03:41:35 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Using kubelet binary at /home/kubernetes/bin/kubelet
Dec 29 03:41:35 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Start kube-proxy static pod
Dec 29 03:41:35 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Start node problem detector
Dec 29 03:41:35 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Using node problem detector binary at /home/kubernetes/bin/node-problem-detector
Dec 29 03:41:36 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Prepare containerized mounter
Dec 29 03:41:36 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Done for the configuration for kubernetes
Dec 29 03:41:36 gke-cluster0-pool-d59e9506-g9sc systemd[1]: Started Configure kubernetes node.
...and the node joined the cluster :). But my original question stands: what happened?
Answer (score: 1):
We ran into a similar problem on GKE with preemptible nodes, and saw this error message from the node:
Extend the docker.service configuration to set a higher pids limit
Docker command line is updated. Restart docker to pick it up
level=info msg="Processing signal 'terminated'"
level=info msg="stopping event stream following graceful shutdown" error="<nil>" module=libcontainerd namespace=moby
level=info msg="Daemon shutdown complete"
docker daemon exited
Start kubelet
After going back and forth with Google support for about a month, we learned that the nodes were being preempted and replaced, and that the replacement node comes up with the same name, all without the normal pod disruption of the old node being drained.
Backstory: we hit this because Jenkins was running its workers on the nodes, and during the roughly two-minute "restart" while a node was replaced, the Jenkins master would lose its connection and fail the job.
tl;dr: don't use preemptible nodes for this kind of work.
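If you can't move the whole workload to a non-preemptible pool, one option is to steer only the sensitive pods away using the cloud.google.com/gke-preemptible label that GKE sets on preemptible nodes. The pod-spec fragment below is an illustrative sketch, not taken from the answer above:

```shell
# Write an illustrative pod-spec affinity fragment that keeps a pod off
# preemptible GKE nodes (they carry the cloud.google.com/gke-preemptible label).
cat > /tmp/no-preemptible-affinity.yaml <<'EOF'
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: cloud.google.com/gke-preemptible
          operator: DoesNotExist
EOF
```

This fragment would be merged into the pod template's spec (e.g. of the Jenkins worker pods) so the scheduler only places them on non-preemptible nodes.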