How to fix the "node(s) had taint" error in an EKS cluster?

Date: 2021-03-04 07:54:50

Tags: kubernetes amazon-eks

I deployed a Kubernetes cluster on AWS EKS, but when I deploy a pod to the cluster, the pod stays in Pending status. When I describe the pod, I see the message below. How can I fix this? I have tried deleting and redeploying the pod, but I still get the same error.

$ kubectl get pods
NAME                           READY   STATUS    RESTARTS   AGE
sidecar-app-59dd47fbdf-pjrfq   0/1     Pending   0          62s
Joey-Zeller:k8s joey$ kubectl get pods
NAME                           READY   STATUS    RESTARTS   AGE
sidecar-app-59dd47fbdf-pjrfq   0/1     Pending   0          2m26s
Joey-Zeller:k8s joey$ kubectl describe pod sidecar-app-59dd47fbdf-pjrfq
Name:           sidecar-app-59dd47fbdf-pjrfq
Namespace:      default
Priority:       0
Node:           <none>
Labels:         name=sidecar-app
                pod-template-hash=59dd47fbdf
Annotations:    kubernetes.io/psp: eks.privileged
Status:         Pending
IP:
IPs:            <none>
Controlled By:  ReplicaSet/sidecar-app-59dd47fbdf
Containers:
  nginx:
    Image:        nginx:latest
    Port:         8080/TCP
    Host Port:    0/TCP
    Environment:  <none>
    Mounts:
      /etc/nginx/nginx.conf from nginx-conf (ro,path="nginx.conf")
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-4dxhl (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  nginx-conf:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      nginx-conf
    Optional:  false
  default-token-4dxhl:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-4dxhl
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  75s (x3 over 2m31s)  default-scheduler  0/2 nodes are available: 2 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate.

I see the following error when describing the node:

$ kubectl describe node ip-192-168-44-226.ap-southeast-2.compute.internal
Name:               ip-192-168-44-226.ap-southeast-2.compute.internal
Roles:              <none>
Labels:             alpha.eksctl.io/cluster-name=elk
                    alpha.eksctl.io/instance-id=i-00dcf85feec486f1e
                    alpha.eksctl.io/nodegroup-name=ng-32b00a62
                    beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=t3.medium
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=ap-southeast-2
                    failure-domain.beta.kubernetes.io/zone=ap-southeast-2b
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-192-168-44-226.ap-southeast-2.compute.internal
                    kubernetes.io/os=linux
                    node-lifecycle=on-demand
                    node.kubernetes.io/instance-type=t3.medium
                    topology.kubernetes.io/region=ap-southeast-2
                    topology.kubernetes.io/zone=ap-southeast-2b
Annotations:        node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Fri, 19 Feb 2021 09:41:13 +1100
Taints:             node.kubernetes.io/unreachable:NoSchedule
Unschedulable:      false
Lease:
  HolderIdentity:  ip-192-168-44-226.ap-southeast-2.compute.internal
  AcquireTime:     <unset>
  RenewTime:       Wed, 03 Mar 2021 18:51:15 +1100
Conditions:
  Type             Status    LastHeartbeatTime                 LastTransitionTime                Reason              Message
  ----             ------    -----------------                 ------------------                ------              -------
  MemoryPressure   Unknown   Wed, 03 Mar 2021 18:50:05 +1100   Wed, 03 Mar 2021 18:52:04 +1100   NodeStatusUnknown   Kubelet stopped posting node status.
  DiskPressure     Unknown   Wed, 03 Mar 2021 18:50:05 +1100   Wed, 03 Mar 2021 18:52:04 +1100   NodeStatusUnknown   Kubelet stopped posting node status.
  PIDPressure      Unknown   Wed, 03 Mar 2021 18:50:05 +1100   Wed, 03 Mar 2021 18:52:04 +1100   NodeStatusUnknown   Kubelet stopped posting node status.
  Ready            Unknown   Wed, 03 Mar 2021 18:50:05 +1100   Wed, 03 Mar 2021 18:52:04 +1100   NodeStatusUnknown   Kubelet stopped posting node status.
Addresses:
  InternalIP:   192.168.44.226
  ExternalIP:   3.26.77.215
  Hostname:     ip-192-168-44-226.ap-southeast-2.compute.internal
  InternalDNS:  ip-192-168-44-226.ap-southeast-2.compute.internal
  ExternalDNS:  ec2-3-26-77-215.ap-southeast-2.compute.amazonaws.com
Capacity:
  attachable-volumes-aws-ebs:  25
  cpu:                         2
  ephemeral-storage:           83873772Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      3977864Ki
  pods:                        17
Allocatable:
  attachable-volumes-aws-ebs:  25
  cpu:                         1930m
  ephemeral-storage:           76224326324
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      3422856Ki
  pods:                        17
System Info:
  Machine ID:                 ec27cd0668c882d838f572a1981b762f
  System UUID:                EC27CD06-68C8-82D8-38F5-72A1981B762F
  Boot ID:                    6d26c69a-69ee-4a64-9cd4-48a289ec7d62
  Kernel Version:             4.14.214-160.339.amzn2.x86_64
  OS Image:                   Amazon Linux 2
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://19.3.6
  Kubelet Version:            v1.18.9-eks-d1db3c
  Kube-Proxy Version:         v1.18.9-eks-d1db3c
ProviderID:                   aws:///ap-southeast-2b/i-00dcf85feec486f1e
Non-terminated Pods:          (9 in total)
  Namespace                   Name                                             CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                   ----                                             ------------  ----------  ---------------  -------------  ---
  cert-manager                cert-manager-649c5f88bc-mfx67                    0 (0%)        0 (0%)      0 (0%)           0 (0%)         29h
  cert-manager                cert-manager-cainjector-9747d56-9xlvv            0 (0%)        0 (0%)      0 (0%)           0 (0%)         29h
  cert-manager                cert-manager-webhook-849c7b574f-kg6hr            0 (0%)        0 (0%)      0 (0%)           0 (0%)         29h
  kube-system                 aws-load-balancer-controller-64dbfb945b-pjg88    100m (5%)     200m (10%)  200Mi (5%)       500Mi (14%)    44h
  kube-system                 aws-load-balancer-controller-64dbfb945b-q9hsh    100m (5%)     200m (10%)  200Mi (5%)       500Mi (14%)    29h
  kube-system                 aws-node-mctgj                                   10m (0%)      0 (0%)      0 (0%)           0 (0%)         13d
  kube-system                 coredns-67997b9dbd-4vrxq                         100m (5%)     0 (0%)      70Mi (2%)        170Mi (5%)     29h
  kube-system                 coredns-67997b9dbd-7zgn9                         100m (5%)     0 (0%)      70Mi (2%)        170Mi (5%)     29h
  kube-system                 kube-proxy-rbjrx                                 100m (5%)     0 (0%)      0 (0%)           0 (0%)         13d
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests     Limits
  --------                    --------     ------
  cpu                         510m (26%)   400m (20%)
  memory                      540Mi (16%)  1340Mi (40%)
  ephemeral-storage           0 (0%)       0 (0%)
  hugepages-1Gi               0 (0%)       0 (0%)
  hugepages-2Mi               0 (0%)       0 (0%)
  attachable-volumes-aws-ebs  0            0
Events:                       <none>

It mentions an "overcommitted" error there. I can also see that many pods have been created in the kube-system and cert-manager namespaces:

$ kubectl get pods --all-namespaces
NAMESPACE      NAME                                            READY   STATUS        RESTARTS   AGE
cert-manager   cert-manager-649c5f88bc-b25gt                   1/1     Terminating   9          34h
cert-manager   cert-manager-649c5f88bc-mfx67                   1/1     Running       0          29h
cert-manager   cert-manager-cainjector-9747d56-9xlvv           1/1     Running       0          29h
cert-manager   cert-manager-cainjector-9747d56-p7pxl           1/1     Terminating   17         34h
cert-manager   cert-manager-webhook-849c7b574f-kg6hr           1/1     Running       9          29h
cert-manager   cert-manager-webhook-849c7b574f-nhjxd           1/1     Terminating   12         34h
default        sidecar-app-59dd47fbdf-pjrfq                    0/1     Pending       0          13m
kube-system    aws-load-balancer-controller-64dbfb945b-ccd5d   1/1     Terminating   13         34h
kube-system    aws-load-balancer-controller-64dbfb945b-pjg88   0/1     Terminating   5          44h
kube-system    aws-load-balancer-controller-64dbfb945b-q9hsh   1/1     Running       2          29h
kube-system    aws-load-balancer-controller-64dbfb945b-ww65p   1/1     Terminating   1          7d
kube-system    aws-node-mctgj                                  1/1     Running       0          13d
kube-system    aws-node-prcps                                  1/1     Running       0          13d
kube-system    coredns-67997b9dbd-4vrxq                        1/1     Running       1          29h
kube-system    coredns-67997b9dbd-7zgn9                        1/1     Running       1          29h
kube-system    coredns-67997b9dbd-gjfqc                        1/1     Terminating   1          34h
kube-system    coredns-67997b9dbd-q9t7l                        1/1     Terminating   1          34h
kube-system    kube-proxy-l9mrq                                1/1     Running       0          13d
kube-system    kube-proxy-rbjrx                                1/1     Running       0          13d

These were created when I deployed the EKS cluster with the eksctl create cluster command, and the cert-manager pods were created by cert-manager. I don't know which ones are needed and which are not. Should I delete all of them? And if I delete them, how can I recreate them?

1 Answer:

Answer 0 (score: 0)

Posting this as a community wiki answer; please edit this post and share your findings on this issue:

Looking at the pod description:

Warning  FailedScheduling  75s (x3 over 2m31s)  default-scheduler  0/2 nodes are available: 2 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate.

and the node description:

Taints:             node.kubernetes.io/unreachable:NoSchedule
.
.
.

Conditions:
  Type             Status    LastHeartbeatTime                 LastTransitionTime                Reason              Message
  ----             ------    -----------------                 ------------------                ------              -------
  MemoryPressure   Unknown   Wed, 03 Mar 2021 18:50:05 +1100   Wed, 03 Mar 2021 18:52:04 +1100   NodeStatusUnknown   Kubelet stopped posting node status.
  DiskPressure     Unknown   Wed, 03 Mar 2021 18:50:05 +1100   Wed, 03 Mar 2021 18:52:04 +1100   NodeStatusUnknown   Kubelet stopped posting node status.
  PIDPressure      Unknown   Wed, 03 Mar 2021 18:50:05 +1100   Wed, 03 Mar 2021 18:52:04 +1100   NodeStatusUnknown   Kubelet stopped posting node status.
  Ready            Unknown   Wed, 03 Mar 2021 18:50:05 +1100   Wed, 03 Mar 2021 18:52:04 +1100   NodeStatusUnknown   Kubelet stopped posting node status.
Addresses:

it looks like there is a problem with the node or its network connectivity. The kubelet stopped posting its status, and the node has been marked with the taint:

node.kubernetes.io/unreachable:NoSchedule
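
A quick way to confirm which nodes carry this taint (a minimal check, assuming kubectl is pointed at the affected cluster) is to list the taints on every node:

# Lists each node with the key and effect of its taints ("<none>" means untainted)
$ kubectl get nodes -o custom-columns='NODE:.metadata.name,TAINT:.spec.taints[*].key,EFFECT:.spec.taints[*].effect'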

Please check your console, and check your kubelet status and node status for more details, using:

kubectl get pods -o wide
kubectl get nodes -o wide
sudo systemctl status kubelet
sudo journalctl -u kubelet
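
Note that the systemctl and journalctl commands have to run on the worker node itself. On EKS, one way in (a sketch, assuming the node's IAM role permits AWS Systems Manager Session Manager and the SSM agent is running) is to open a session using the instance ID from the node labels above:

# Open a shell on the worker node via SSM (instance ID taken from the node labels)
$ aws ssm start-session --target i-00dcf85feec486f1e
# Then inspect the kubelet on the node
sh-4.2$ sudo systemctl status kubelet
sh-4.2$ sudo journalctl -u kubelet --since "1 hour ago"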

Note:

If the Status of the Ready condition remains Unknown or False for longer than the pod-eviction-timeout (an argument passed to the kube-controller-manager), the node controller schedules all of the Pods on the node for deletion. The default eviction timeout duration is five minutes. In some cases when the node is unreachable, the API server is unable to communicate with the kubelet on the node. The decision to delete the pods cannot be communicated to the kubelet until communication with the API server is re-established. In the meantime, the pods that are scheduled for deletion may continue to run on the partitioned node.
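
If the kubelet recovers, the node lifecycle controller removes the unreachable taint automatically and scheduling resumes, so nothing needs to be untainted by hand. If the node never comes back (for example, the EC2 instance is dead), one possible cleanup (assuming the eksctl-created nodegroup's Auto Scaling group will launch a replacement instance) is to remove the node object so pods reschedule onto healthy nodes:

# Delete the dead node; the nodegroup's Auto Scaling group should replace it,
# and pods stuck on it will be rescheduled elsewhere
$ kubectl delete node ip-192-168-44-226.ap-southeast-2.compute.internal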