Troubleshooting a NotReady node

Time: 2019-05-06 06:37:08

Tags: kubernetes

I have a node that is currently giving me some trouble. I have not found a solution yet; it could be a gap in my own knowledge, I may simply have exhausted Google, or I may have hit something that genuinely cannot be solved. The last option seems the least likely.

kubectl version v1.8.5
docker version 1.12.6

While doing some routine maintenance on my nodes, I noticed the following:

NAME                            STATUS   ROLES     AGE       VERSION
ip-192-168-4-14.ourdomain.pro   Ready    master    213d      v1.8.5
ip-192-168-4-143.ourdomain.pro  Ready    master    213d      v1.8.5
ip-192-168-4-174.ourdomain.pro  Ready    <none>    213d      v1.8.5
ip-192-168-4-182.ourdomain.pro  Ready    <none>    46d       v1.8.5
ip-192-168-4-221.ourdomain.pro  Ready    <none>    213d      v1.8.5
ip-192-168-4-249.ourdomain.pro  Ready    master    213d      v1.8.5
ip-192-168-4-251.ourdomain.pro  NotReady <none>    206d      v1.8.5

On the NotReady node I cannot attach or exec into anything, which seems normal for a node in the NotReady state.
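For reference, the conditions the node is reporting can still be inspected from a working master; a minimal check (using the node name from the listing above) would be:

    # Inspect the conditions reported for the NotReady node
    kubectl describe node ip-192-168-4-251.ourdomain.pro
    # The Conditions block shows Ready=False together with a reason
    # (KubeletNotReady / "container runtime is down" in the logs further down)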

At that point I restarted the kubelet and tailed its logs at the same time, to see whether anything abnormal would show up.
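Roughly what that looks like (a sketch, assuming the kubelet runs under the systemd unit shown further down):

    # Restart the kubelet and follow its journal in a second terminal
    systemctl restart kubelet
    journalctl -u kubelet -f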

I have listed below the messages I have been googling all day, but I cannot tell which of them is actually connected to the problem.

Error 1

unable to connect to Rkt api service

We are not using rkt, so I put this one on the ignore list.

Error 2

unable to connect to CRI-O api service

We are not using CRI-O, so I put this one on the ignore list.

Error 3

Image garbage collection failed once. Stats initialization may not have completed yet: failed to get imageFs info: unable to find data for container /

I cannot rule this one out as a potential culprit, but what I have found so far does not seem to apply to the versions I am running.
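This message concerns stats for the image filesystem, so a quick sanity check of Docker's storage might be worthwhile (assuming the default /var/lib/docker root directory, which I have not changed):

    # Storage driver and data usage as Docker sees it
    docker info

    # Free space on the filesystem backing the images
    df -h /var/lib/docker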

Error 4

skipping pod synchronization - [container runtime is down PLEG is not healthy

I have no answer for this one, other than noting that the garbage-collection error above shows up a second time right after this message.
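Since PLEG health depends on the container runtime answering, a basic runtime check on the node (assuming Docker 1.12.6 under systemd, as listed above) would be:

    # Is the Docker daemon up and responsive?
    systemctl status docker
    docker ps
    # If docker ps hangs here, that alone can trigger "PLEG is not healthy"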

Error 5

Registration of the rkt container factory failed

Not using rkt, so unless I am mistaken this one is expected to fail.

Error 6

Registration of the crio container factory failed

Not using CRI-O either, so unless I am mistaken this one should also fail.

Error 7

28087 docker_sandbox.go:343] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "kube-dns-545bc4bfd4-rt7qp_kube-system": CNI failed to retrieve network namespace path: Cannot find network namespace for the terminated container

I found a GitHub issue for this, but it appears to have been fixed already, so I am not sure whether it is related.
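One workaround I have seen suggested for this class of CNI error is clearing out exited containers whose network namespaces no longer exist. I cannot confirm it applies to this exact message, and it only touches already-exited containers:

    # List exited containers on the node
    docker ps -a --filter status=exited

    # Remove them so the CNI status hook stops trying to resolve their netns
    docker ps -aq --filter status=exited | xargs --no-run-if-empty docker rm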

Error 8

28087 kubelet_node_status.go:791] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2019-05-06 05:00:40.664331773 +0000 UTC LastTransitionTime:2019-05-06 05:00:40.664331773 +0000 UTC Reason:KubeletNotReady Message:container runtime is down}

This is where the node goes NotReady.
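Given that the stated reason is "container runtime is down", the blunt recovery attempt would be to bounce the runtime and then the kubelet, and watch whether the node comes back (a sketch, assuming Docker under systemd):

    # On the affected node
    systemctl restart docker
    systemctl restart kubelet

    # From a master, watch for the node transitioning back to Ready
    kubectl get nodes -w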

Latest log messages and status

    systemctl status kubelet
  kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: active (running) since Mon 2019-05-06 05:00:39 UTC; 1h 58min ago
     Docs: http://kubernetes.io/docs/
 Main PID: 28087 (kubelet)
    Tasks: 21
   Memory: 42.3M
   CGroup: /system.slice/kubelet.service
           └─28087 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --pod-manifest-path=/etc/kubernetes/manife...
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310305   28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for vo...4a414b9c")
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310330   28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for vo...4a414b9c")
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310359   28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for volume "varl...
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310385   28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for volume "cali...
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310408   28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for volume "kube...
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310435   28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for vo...4a414b9c")
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310456   28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for vo...4a414b9c")
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310480   28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for volume "ca-c...
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310504   28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for volume "k8s-...
May 06 05:14:29 kube-master-1 kubelet[28087]: E0506 05:14:29.848530   28087 helpers.go:468] PercpuUsage had 0 cpus, but the actual number is 2; ignoring extra CPUs

Here is the output of kubectl get po -o wide.

NAME                                              READY     STATUS     RESTARTS   AGE       IP               NODE
docker-image-prune-fhjkl                          1/1       Running    4          213d      100.96.67.87     ip-192-168-4-249
docker-image-prune-ltfpf                          1/1       Running    4          213d      100.96.152.74    ip-192-168-4-143
docker-image-prune-nmg29                          1/1       Running    3          213d      100.96.22.236    ip-192-168-4-221
docker-image-prune-pdw5h                          1/1       Running    7          213d      100.96.90.116    ip-192-168-4-174
docker-image-prune-swbhc                          1/1       Running    0          46d       100.96.191.129   ip-192-168-4-182
docker-image-prune-vtsr4                          1/1       NodeLost   1          206d      100.96.182.197   ip-192-168-4-251
fluentd-es-4bgdz                                  1/1       Running    6          213d      192.168.4.249    ip-192-168-4-249
fluentd-es-fb4gw                                  1/1       Running    7          213d      192.168.4.14     ip-192-168-4-14
fluentd-es-fs8gp                                  1/1       Running    6          213d      192.168.4.143    ip-192-168-4-143
fluentd-es-k572w                                  1/1       Running    0          46d       192.168.4.182    ip-192-168-4-182
fluentd-es-lpxhn                                  1/1       Running    5          213d      192.168.4.174    ip-192-168-4-174
fluentd-es-pjp9w                                  1/1       Unknown    2          206d      192.168.4.251    ip-192-168-4-251
fluentd-es-wbwkp                                  1/1       Running    4          213d      192.168.4.221    ip-192-168-4-221
grafana-76c7dbb678-p8hzb                          1/1       Running    3          213d      100.96.90.115    ip-192-168-4-174
model-5bbe4862e4b0aa4f77d0d499-7cb4f74648-g8xmp   2/2       Running    2          101d      100.96.22.234    ip-192-168-4-221
model-5bbe4862e4b0aa4f77d0d499-7cb4f74648-tvp4m   2/2       Running    2          101d      100.96.22.235    ip-192-168-4-221
prometheus-65b4b68d97-82vr7                       1/1       Running    3          213d      100.96.90.87     ip-192-168-4-174
pushgateway-79f575d754-75l6r                      1/1       Running    3          213d      100.96.90.83     ip-192-168-4-174
rabbitmq-cluster-58db9b6978-g6ssb                 2/2       Running    4          181d      100.96.90.117    ip-192-168-4-174
replicator-56x7v                                  1/1       Running    3          213d      100.96.90.84     ip-192-168-4-174
traefik-ingress-6dc9779596-6ghwv                  1/1       Running    3          213d      100.96.90.85     ip-192-168-4-174
traefik-ingress-6dc9779596-ckzbk                  1/1       Running    4          213d      100.96.152.73    ip-192-168-4-143
traefik-ingress-6dc9779596-sbt4n                  1/1       Running    3          213d      100.96.22.232    ip-192-168-4-221

And the output of kubectl get po -n kube-system -o wide.

NAME                                       READY     STATUS     RESTARTS   AGE       IP          
calico-kube-controllers-78f554c7bb-s7tmj   1/1       Running    4          213d      192.168.4.14
calico-node-5cgc6                          2/2       Running    9          213d      192.168.4.249
calico-node-bbwtm                          2/2       Running    8          213d      192.168.4.14
calico-node-clwqk                          2/2       NodeLost   4          206d      192.168.4.251
calico-node-d2zqz                          2/2       Running    0          46d       192.168.4.182
calico-node-m4x2t                          2/2       Running    6          213d      192.168.4.221
calico-node-m8xwk                          2/2       Running    9          213d      192.168.4.143
calico-node-q7r7g                          2/2       Running    8          213d      192.168.4.174
cluster-autoscaler-65d6d7f544-dpbfk        1/1       Running    10         207d      100.96.67.88
kube-apiserver-ip-192-168-4-14             1/1       Running    6          213d      192.168.4.14
kube-apiserver-ip-192-168-4-143            1/1       Running    6          213d      192.168.4.143
kube-apiserver-ip-192-168-4-249            1/1       Running    6          213d      192.168.4.249
kube-controller-manager-ip-192-168-4-14    1/1       Running    5          213d      192.168.4.14
kube-controller-manager-ip-192-168-4-143   1/1       Running    6          213d      192.168.4.143
kube-controller-manager-ip-192-168-4-249   1/1       Running    6          213d      192.168.4.249
kube-dns-545bc4bfd4-rt7qp                  3/3       Running    13         213d      100.96.19.197
kube-proxy-2bn42                           1/1       Running    0          46d       192.168.4.182
kube-proxy-95cvh                           1/1       Running    4          213d      192.168.4.174
kube-proxy-bqrhw                           1/1       NodeLost   2          206d      192.168.4.251
kube-proxy-cqh67                           1/1       Running    6          213d      192.168.4.14
kube-proxy-fbdvx                           1/1       Running    4          213d      192.168.4.221
kube-proxy-gcjxg                           1/1       Running    5          213d      192.168.4.249
kube-proxy-mt62x                           1/1       Running    4          213d      192.168.4.143
kube-scheduler-ip-192-168-4-14             1/1       Running    6          213d      192.168.4.14
kube-scheduler-ip-192-168-4-143            1/1       Running    6          213d      192.168.4.143
kube-scheduler-ip-192-168-4-249            1/1       Running    6          213d      192.168.4.249
kubernetes-dashboard-7c5d596d8c-q6sf2      1/1       Running    5          213d      100.96.22.230
tiller-deploy-6d9f596465-svpql             1/1       Running    3          213d      100.96.22.231

At this point I am a bit lost as to where to go from here. Any suggestions are welcome.

1 answer:

Answer 0 (score: 0)

Most likely the kubelet has gone down.

Please share the output of the command below:

journalctl -u kubelet

Also share the output of the following command:

kubectl get po -n kube-system -owide

It looks like the node cannot communicate with the control plane. You can follow these steps (see the commands sketched after the list):

  1. Remove the node from the cluster (cordon the node, drain it, and finally delete it)
  2. Reset the node
  3. Rejoin the node to the cluster
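A rough command sequence for those three steps (a sketch with a few assumptions: the cluster was built with kubeadm, which the 10-kubeadm.conf drop-in above suggests, and <token>, <hash> and the API server address are placeholders to fill in from your own cluster):

    # 1. From a master: remove the node from the cluster
    kubectl cordon ip-192-168-4-251.ourdomain.pro
    kubectl drain ip-192-168-4-251.ourdomain.pro --ignore-daemonsets --force --delete-local-data
    kubectl delete node ip-192-168-4-251.ourdomain.pro

    # 2. On the broken node: reset its kubeadm-managed state
    kubeadm reset

    # 3. On the broken node: rejoin the cluster
    kubeadm join --token <token> <api-server-address>:6443 --discovery-token-ca-cert-hash sha256:<hash>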