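I have a node that is currently giving me some trouble. I have not found a solution yet, but that may be down to my skill level, Google coming up empty, or my having hit something that cannot be solved. The last of these is very unlikely.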
kubectl version v1.8.5
docker version 1.12.6
While doing some routine maintenance on my nodes, I noticed the following:
NAME STATUS ROLES AGE VERSION
ip-192-168-4-14.ourdomain.pro Ready master 213d v1.8.5
ip-192-168-4-143.ourdomain.pro Ready master 213d v1.8.5
ip-192-168-4-174.ourdomain.pro Ready <none> 213d v1.8.5
ip-192-168-4-182.ourdomain.pro Ready <none> 46d v1.8.5
ip-192-168-4-221.ourdomain.pro Ready <none> 213d v1.8.5
ip-192-168-4-249.ourdomain.pro Ready master 213d v1.8.5
ip-192-168-4-251.ourdomain.pro NotReady <none> 206d v1.8.5
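To pick the unhealthy node out of a longer listing, the STATUS column can be filtered with awk. This is only a minimal sketch and not part of the original post: the helper name not_ready_nodes is hypothetical, and it assumes the default column layout shown above, where STATUS is the second field.

```shell
# Hypothetical helper: print the names of nodes whose STATUS is not exactly "Ready".
# Reads `kubectl get nodes` style output on stdin and skips the header row.
# Note: a cordoned node ("Ready,SchedulingDisabled") is also reported by this check.
not_ready_nodes() {
  awk 'NR > 1 && $2 != "Ready" { print $1 }'
}

# Usage (on a machine with cluster access):
#   kubectl get nodes | not_ready_nodes
```

Against the listing above, this would print only ip-192-168-4-251.ourdomain.pro.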
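On the NotReady node I cannot attach or exec into anything, which seems normal for a NotReady state unless I am misreading it. For the same reason, I cannot look at any pod-specific logs on that node.
At this point, I restarted kubelet and attached myself to its logs at the same time, to see whether anything out of the ordinary would appear.
Below are the messages I spent a day Googling, but I cannot confirm that any of them is actually connected to the problem.
Error 1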
unable to connect to Rkt api service
We are not using rkt, so I put this on the ignore list.
Error 2
unable to connect to CRI-O api service
We are not using CRI-O, so I put this on the ignore list.
Error 3
Image garbage collection failed once. Stats initialization may not have completed yet: failed to get imageFs info: unable to find data for container /
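I have not been able to rule this out as a potential pitfall, but what I have found so far does not seem to apply to the version I am running.
Error 4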
skipping pod synchronization - [container runtime is down PLEG is not healthy
I have no answer for this one, other than that the garbage collection error above appears a second time after this message.
Error 5
Registration of the rkt container factory failed
We are not using rkt, so unless I am mistaken this is expected to fail.
Error 6
Registration of the crio container factory failed
We are not using CRI-O, so unless I am mistaken this is expected to fail.
Error 7
28087 docker_sandbox.go:343] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "kube-dns-545bc4bfd4-rt7qp_kube-system": CNI failed to retrieve network namespace path: Cannot find network namespace for the terminated container
I found a GitHub ticket for this one, but it appears to have been fixed, so I am not sure how it relates.
Error 8
28087 kubelet_node_status.go:791] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2019-05-06 05:00:40.664331773 +0000 UTC LastTransitionTime:2019-05-06 05:00:40.664331773 +0000 UTC Reason:KubeletNotReady Message:container runtime is down}
This is where the node goes into NotReady.
Latest log messages and status
systemctl status kubelet
kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Mon 2019-05-06 05:00:39 UTC; 1h 58min ago
Docs: http://kubernetes.io/docs/
Main PID: 28087 (kubelet)
Tasks: 21
Memory: 42.3M
CGroup: /system.slice/kubelet.service
└─28087 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --pod-manifest-path=/etc/kubernetes/manife...
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310305 28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for vo...4a414b9c")
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310330 28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for vo...4a414b9c")
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310359 28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for volume "varl...
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310385 28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for volume "cali...
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310408 28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for volume "kube...
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310435 28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for vo...4a414b9c")
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310456 28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for vo...4a414b9c")
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310480 28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for volume "ca-c...
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310504 28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for volume "k8s-...
May 06 05:14:29 kube-master-1 kubelet[28087]: E0506 05:14:29.848530 28087 helpers.go:468] PercpuUsage had 0 cpus, but the actual number is 2; ignoring extra CPUs
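Most of the journal lines above are info-level (prefixed with I), which makes the interesting ones easy to miss. As a rough sketch, and assuming the kubelet keeps the log prefix seen above (severity letter, then MMDD, then a timestamp), the error and fatal lines can be pulled out with grep. The helper name kubelet_errors is hypothetical.

```shell
# Hypothetical helper: keep only error (E) and fatal (F) kubelet log lines.
# Assumes the prefix format seen in the excerpt above, e.g.
#   "E0506 05:14:29.848530 28087 helpers.go:468] ..."
# either at the start of the line or after the journald prefix.
kubelet_errors() {
  grep -E '(^|[[:space:]])[EF][0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}\.[0-9]+ '
}

# Usage (on the affected node):
#   journalctl -u kubelet --no-pager | kubelet_errors
```

Warning lines (W) are left out on purpose to keep the output short; adding W to the bracket expression would include them.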
Here is the kubectl get po -o wide output.
NAME READY STATUS RESTARTS AGE IP NODE
docker-image-prune-fhjkl 1/1 Running 4 213d 100.96.67.87 ip-192-168-4-249
docker-image-prune-ltfpf 1/1 Running 4 213d 100.96.152.74 ip-192-168-4-143
docker-image-prune-nmg29 1/1 Running 3 213d 100.96.22.236 ip-192-168-4-221
docker-image-prune-pdw5h 1/1 Running 7 213d 100.96.90.116 ip-192-168-4-174
docker-image-prune-swbhc 1/1 Running 0 46d 100.96.191.129 ip-192-168-4-182
docker-image-prune-vtsr4 1/1 NodeLost 1 206d 100.96.182.197 ip-192-168-4-251
fluentd-es-4bgdz 1/1 Running 6 213d 192.168.4.249 ip-192-168-4-249
fluentd-es-fb4gw 1/1 Running 7 213d 192.168.4.14 ip-192-168-4-14
fluentd-es-fs8gp 1/1 Running 6 213d 192.168.4.143 ip-192-168-4-143
fluentd-es-k572w 1/1 Running 0 46d 192.168.4.182 ip-192-168-4-182
fluentd-es-lpxhn 1/1 Running 5 213d 192.168.4.174 ip-192-168-4-174
fluentd-es-pjp9w 1/1 Unknown 2 206d 192.168.4.251 ip-192-168-4-251
fluentd-es-wbwkp 1/1 Running 4 213d 192.168.4.221 ip-192-168-4-221
grafana-76c7dbb678-p8hzb 1/1 Running 3 213d 100.96.90.115 ip-192-168-4-174
model-5bbe4862e4b0aa4f77d0d499-7cb4f74648-g8xmp 2/2 Running 2 101d 100.96.22.234 ip-192-168-4-221
model-5bbe4862e4b0aa4f77d0d499-7cb4f74648-tvp4m 2/2 Running 2 101d 100.96.22.235 ip-192-168-4-221
prometheus-65b4b68d97-82vr7 1/1 Running 3 213d 100.96.90.87 ip-192-168-4-174
pushgateway-79f575d754-75l6r 1/1 Running 3 213d 100.96.90.83 ip-192-168-4-174
rabbitmq-cluster-58db9b6978-g6ssb 2/2 Running 4 181d 100.96.90.117 ip-192-168-4-174
replicator-56x7v 1/1 Running 3 213d 100.96.90.84 ip-192-168-4-174
traefik-ingress-6dc9779596-6ghwv 1/1 Running 3 213d 100.96.90.85 ip-192-168-4-174
traefik-ingress-6dc9779596-ckzbk 1/1 Running 4 213d 100.96.152.73 ip-192-168-4-143
traefik-ingress-6dc9779596-sbt4n 1/1 Running 3 213d 100.96.22.232 ip-192-168-4-221
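To see at a glance which pods are stranded on the lost node, the wide listing can be filtered on its NODE column. This is a sketch under assumptions: the helper name pods_on_node is hypothetical, and it relies on the seven-column layout shown above (NODE as the seventh whitespace-separated field), which may not hold for other kubectl versions.

```shell
# Hypothetical helper: print "NAME STATUS" for every pod scheduled on a given node.
# Reads `kubectl get po -o wide` style output on stdin and skips the header row.
# Assumes NODE is the 7th field, as in the listing above.
pods_on_node() {
  awk -v node="$1" 'NR > 1 && $7 == node { print $1, $3 }'
}

# Usage (on a machine with cluster access):
#   kubectl get po -o wide | pods_on_node ip-192-168-4-251
```

For the listing above, this would report docker-image-prune-vtsr4 (NodeLost) and fluentd-es-pjp9w (Unknown).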
Output of kubectl get po -n kube-system -o wide
NAME READY STATUS RESTARTS AGE IP
calico-kube-controllers-78f554c7bb-s7tmj 1/1 Running 4 213d 192.168.4.14
calico-node-5cgc6 2/2 Running 9 213d 192.168.4.249
calico-node-bbwtm 2/2 Running 8 213d 192.168.4.14
calico-node-clwqk 2/2 NodeLost 4 206d 192.168.4.251
calico-node-d2zqz 2/2 Running 0 46d 192.168.4.182
calico-node-m4x2t 2/2 Running 6 213d 192.168.4.221
calico-node-m8xwk 2/2 Running 9 213d 192.168.4.143
calico-node-q7r7g 2/2 Running 8 213d 192.168.4.174
cluster-autoscaler-65d6d7f544-dpbfk 1/1 Running 10 207d 100.96.67.88
kube-apiserver-ip-192-168-4-14 1/1 Running 6 213d 192.168.4.14
kube-apiserver-ip-192-168-4-143 1/1 Running 6 213d 192.168.4.143
kube-apiserver-ip-192-168-4-249 1/1 Running 6 213d 192.168.4.249
kube-controller-manager-ip-192-168-4-14 1/1 Running 5 213d 192.168.4.14
kube-controller-manager-ip-192-168-4-143 1/1 Running 6 213d 192.168.4.143
kube-controller-manager-ip-192-168-4-249 1/1 Running 6 213d 192.168.4.249
kube-dns-545bc4bfd4-rt7qp 3/3 Running 13 213d 100.96.19.197
kube-proxy-2bn42 1/1 Running 0 46d 192.168.4.182
kube-proxy-95cvh 1/1 Running 4 213d 192.168.4.174
kube-proxy-bqrhw 1/1 NodeLost 2 206d 192.168.4.251
kube-proxy-cqh67 1/1 Running 6 213d 192.168.4.14
kube-proxy-fbdvx 1/1 Running 4 213d 192.168.4.221
kube-proxy-gcjxg 1/1 Running 5 213d 192.168.4.249
kube-proxy-mt62x 1/1 Running 4 213d 192.168.4.143
kube-scheduler-ip-192-168-4-14 1/1 Running 6 213d 192.168.4.14
kube-scheduler-ip-192-168-4-143 1/1 Running 6 213d 192.168.4.143
kube-scheduler-ip-192-168-4-249 1/1 Running 6 213d 192.168.4.249
kubernetes-dashboard-7c5d596d8c-q6sf2 1/1 Running 5 213d 100.96.22.230
tiller-deploy-6d9f596465-svpql 1/1 Running 3 213d 100.96.22.231
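At this point I am a bit lost as to where to go from here. Any suggestions are welcome.
Answer 0 (score: 0)
Most likely the kubelet is down.
Share the output of the command below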
journalctl -u kubelet
Share the output of the command below as well
kubectl get po -n kube-system -owide
It seems the node cannot communicate with the control plane. You can follow the steps below