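I have deployed a K8S cluster on AWS EKS, but when I deploy a Pod to the cluster, its status stays Pending.
When I describe the pod, I see the message below. How can I fix this? I tried deleting and redeploying the pod, but I still get the same error.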
$ kubectl get pods
NAME                           READY   STATUS    RESTARTS   AGE
sidecar-app-59dd47fbdf-pjrfq   0/1     Pending   0          62s
Joey-Zeller:k8s joey$ kubectl get pods
NAME                           READY   STATUS    RESTARTS   AGE
sidecar-app-59dd47fbdf-pjrfq   0/1     Pending   0          2m26s
Joey-Zeller:k8s joey$ kubectl describe pod sidecar-app-59dd47fbdf-pjrfq
Name: sidecar-app-59dd47fbdf-pjrfq
Namespace: default
Priority: 0
Node: <none>
Labels: name=sidecar-app
pod-template-hash=59dd47fbdf
Annotations: kubernetes.io/psp: eks.privileged
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/sidecar-app-59dd47fbdf
Containers:
nginx:
Image: nginx:latest
Port: 8080/TCP
Host Port: 0/TCP
Environment: <none>
Mounts:
/etc/nginx/nginx.conf from nginx-conf (ro,path="nginx.conf")
/var/run/secrets/kubernetes.io/serviceaccount from default-token-4dxhl (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
nginx-conf:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: nginx-conf
Optional: false
default-token-4dxhl:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-4dxhl
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 75s (x3 over 2m31s) default-scheduler 0/2 nodes are available: 2 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate.
When I describe the node, I see the following error:
$ kubectl describe node ip-192-168-44-226.ap-southeast-2.compute.internal
Name: ip-192-168-44-226.ap-southeast-2.compute.internal
Roles: <none>
Labels: alpha.eksctl.io/cluster-name=elk
alpha.eksctl.io/instance-id=i-00dcf85feec486f1e
alpha.eksctl.io/nodegroup-name=ng-32b00a62
beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=t3.medium
beta.kubernetes.io/os=linux
failure-domain.beta.kubernetes.io/region=ap-southeast-2
failure-domain.beta.kubernetes.io/zone=ap-southeast-2b
kubernetes.io/arch=amd64
kubernetes.io/hostname=ip-192-168-44-226.ap-southeast-2.compute.internal
kubernetes.io/os=linux
node-lifecycle=on-demand
node.kubernetes.io/instance-type=t3.medium
topology.kubernetes.io/region=ap-southeast-2
topology.kubernetes.io/zone=ap-southeast-2b
Annotations: node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Fri, 19 Feb 2021 09:41:13 +1100
Taints: node.kubernetes.io/unreachable:NoSchedule
Unschedulable: false
Lease:
HolderIdentity: ip-192-168-44-226.ap-southeast-2.compute.internal
AcquireTime: <unset>
RenewTime: Wed, 03 Mar 2021 18:51:15 +1100
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure Unknown Wed, 03 Mar 2021 18:50:05 +1100 Wed, 03 Mar 2021 18:52:04 +1100 NodeStatusUnknown Kubelet stopped posting node status.
DiskPressure Unknown Wed, 03 Mar 2021 18:50:05 +1100 Wed, 03 Mar 2021 18:52:04 +1100 NodeStatusUnknown Kubelet stopped posting node status.
PIDPressure Unknown Wed, 03 Mar 2021 18:50:05 +1100 Wed, 03 Mar 2021 18:52:04 +1100 NodeStatusUnknown Kubelet stopped posting node status.
Ready Unknown Wed, 03 Mar 2021 18:50:05 +1100 Wed, 03 Mar 2021 18:52:04 +1100 NodeStatusUnknown Kubelet stopped posting node status.
Addresses:
InternalIP: 192.168.44.226
ExternalIP: 3.26.77.215
Hostname: ip-192-168-44-226.ap-southeast-2.compute.internal
InternalDNS: ip-192-168-44-226.ap-southeast-2.compute.internal
ExternalDNS: ec2-3-26-77-215.ap-southeast-2.compute.amazonaws.com
Capacity:
attachable-volumes-aws-ebs: 25
cpu: 2
ephemeral-storage: 83873772Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 3977864Ki
pods: 17
Allocatable:
attachable-volumes-aws-ebs: 25
cpu: 1930m
ephemeral-storage: 76224326324
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 3422856Ki
pods: 17
System Info:
Machine ID: ec27cd0668c882d838f572a1981b762f
System UUID: EC27CD06-68C8-82D8-38F5-72A1981B762F
Boot ID: 6d26c69a-69ee-4a64-9cd4-48a289ec7d62
Kernel Version: 4.14.214-160.339.amzn2.x86_64
OS Image: Amazon Linux 2
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://19.3.6
Kubelet Version: v1.18.9-eks-d1db3c
Kube-Proxy Version: v1.18.9-eks-d1db3c
ProviderID: aws:///ap-southeast-2b/i-00dcf85feec486f1e
Non-terminated Pods: (9 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
cert-manager cert-manager-649c5f88bc-mfx67 0 (0%) 0 (0%) 0 (0%) 0 (0%) 29h
cert-manager cert-manager-cainjector-9747d56-9xlvv 0 (0%) 0 (0%) 0 (0%) 0 (0%) 29h
cert-manager cert-manager-webhook-849c7b574f-kg6hr 0 (0%) 0 (0%) 0 (0%) 0 (0%) 29h
kube-system aws-load-balancer-controller-64dbfb945b-pjg88 100m (5%) 200m (10%) 200Mi (5%) 500Mi (14%) 44h
kube-system aws-load-balancer-controller-64dbfb945b-q9hsh 100m (5%) 200m (10%) 200Mi (5%) 500Mi (14%) 29h
kube-system aws-node-mctgj 10m (0%) 0 (0%) 0 (0%) 0 (0%) 13d
kube-system coredns-67997b9dbd-4vrxq 100m (5%) 0 (0%) 70Mi (2%) 170Mi (5%) 29h
kube-system coredns-67997b9dbd-7zgn9 100m (5%) 0 (0%) 70Mi (2%) 170Mi (5%) 29h
kube-system kube-proxy-rbjrx 100m (5%) 0 (0%) 0 (0%) 0 (0%) 13d
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 510m (26%) 400m (20%)
memory 540Mi (16%) 1340Mi (40%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
attachable-volumes-aws-ebs 0 0
Events: <none>
It reports an overcommitted error. I can also see that many Pods have been created in the kube-system and cert-manager namespaces:
$ kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
cert-manager cert-manager-649c5f88bc-b25gt 1/1 Terminating 9 34h
cert-manager cert-manager-649c5f88bc-mfx67 1/1 Running 0 29h
cert-manager cert-manager-cainjector-9747d56-9xlvv 1/1 Running 0 29h
cert-manager cert-manager-cainjector-9747d56-p7pxl 1/1 Terminating 17 34h
cert-manager cert-manager-webhook-849c7b574f-kg6hr 1/1 Running 9 29h
cert-manager cert-manager-webhook-849c7b574f-nhjxd 1/1 Terminating 12 34h
default sidecar-app-59dd47fbdf-pjrfq 0/1 Pending 0 13m
kube-system aws-load-balancer-controller-64dbfb945b-ccd5d 1/1 Terminating 13 34h
kube-system aws-load-balancer-controller-64dbfb945b-pjg88 0/1 Terminating 5 44h
kube-system aws-load-balancer-controller-64dbfb945b-q9hsh 1/1 Running 2 29h
kube-system aws-load-balancer-controller-64dbfb945b-ww65p 1/1 Terminating 1 7d
kube-system aws-node-mctgj 1/1 Running 0 13d
kube-system aws-node-prcps 1/1 Running 0 13d
kube-system coredns-67997b9dbd-4vrxq 1/1 Running 1 29h
kube-system coredns-67997b9dbd-7zgn9 1/1 Running 1 29h
kube-system coredns-67997b9dbd-gjfqc 1/1 Terminating 1 34h
kube-system coredns-67997b9dbd-q9t7l 1/1 Terminating 1 34h
kube-system kube-proxy-l9mrq 1/1 Running 0 13d
kube-system kube-proxy-rbjrx 1/1 Running 0 13d
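They were created when I deployed the EKS cluster with the eksctl create cluster command, and the cert-manager Pods were created by it as well. I don't know which of them are useful and which are not. Should I delete all of them? And if I delete them, how do I recreate them?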
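Answer 0 (score: 0)
Posting this as a community answer. Feel free to edit this post and share your findings on this issue.
Looking at the pod description: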
Warning FailedScheduling 75s (x3 over 2m31s) default-scheduler 0/2 nodes are available: 2 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate.
and at the node description:
Taints: node.kubernetes.io/unreachable:NoSchedule
.
.
.
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure Unknown Wed, 03 Mar 2021 18:50:05 +1100 Wed, 03 Mar 2021 18:52:04 +1100 NodeStatusUnknown Kubelet stopped posting node status.
DiskPressure Unknown Wed, 03 Mar 2021 18:50:05 +1100 Wed, 03 Mar 2021 18:52:04 +1100 NodeStatusUnknown Kubelet stopped posting node status.
PIDPressure Unknown Wed, 03 Mar 2021 18:50:05 +1100 Wed, 03 Mar 2021 18:52:04 +1100 NodeStatusUnknown Kubelet stopped posting node status.
Ready Unknown Wed, 03 Mar 2021 18:50:05 +1100 Wed, 03 Mar 2021 18:52:04 +1100 NodeStatusUnknown Kubelet stopped posting node status.
Addresses:
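It looks like there is a problem with the node or with its network connectivity. The kubelet stopped posting its status, and the node has been tainted with: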
node.kubernetes.io/unreachable:NoSchedule
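Note that the pod description lists tolerations for node.kubernetes.io/unreachable only with the NoExecute effect, so a NoSchedule taint with the same key still blocks scheduling. A minimal sketch for reading the taints and the Ready condition straight from the API is below. The function names are my own (hypothetical), and it assumes kubectl is already configured for the cluster:

```shell
# Print each taint on a node as key:effect, one per line.
# $1 is the node name (hypothetical helper; pass your own node).
node_taints() {
  kubectl get node "$1" \
    -o jsonpath='{range .spec.taints[*]}{.key}{":"}{.effect}{"\n"}{end}'
}

# Print the status of the node's Ready condition (True, False or Unknown).
node_ready_status() {
  kubectl get node "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
}

# Usage (needs kubectl access to the cluster):
#   node_taints ip-192-168-44-226.ap-southeast-2.compute.internal
#   node_ready_status ip-192-168-44-226.ap-southeast-2.compute.internal
```

Once the kubelet reports in again and Ready goes back to True, the unreachable taints should be removed by the control plane without manual action.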
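Check your console, and check the kubelet status and the node status for more details. The two kubectl commands run from your workstation; the systemctl and journalctl commands must be run on the node itself: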
kubectl get pods -o wide
kubectl get nodes -o wide
sudo systemctl status kubelet
sudo journalctl -u kubelet
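To narrow the commands above down to the nodes that need attention, something like the following could help. It is only a sketch: not_ready_nodes is a hypothetical helper name, and it assumes the default column layout of kubectl get nodes (NAME first, STATUS second):

```shell
# Print the names of nodes whose STATUS column is not exactly "Ready".
# Reads `kubectl get nodes --no-headers` output on stdin.
# A cordoned node ("Ready,SchedulingDisabled") is also reported.
not_ready_nodes() {
  awk '$2 != "Ready" { print $1 }'
}

# Usage (needs kubectl access to the cluster):
#   kubectl get nodes --no-headers | not_ready_nodes
```

Each node it prints is a candidate for logging in and running the systemctl and journalctl checks.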
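Note:
"If the Status of the Ready condition remains Unknown or False for longer than the pod-eviction-timeout (an argument passed to the kube-controller-manager), all the Pods on the node are scheduled for deletion by the node controller. The default eviction timeout duration is five minutes. In some cases when the node is unreachable, the API server is unable to communicate with the kubelet on the node. The decision to delete the pods cannot be communicated to the kubelet until communication with the API server is re-established. In the meantime, the pods that are scheduled for deletion may continue to run on the partitioned node."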
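This would explain the many Pods shown as Terminating in your listing: they have been marked for deletion, but the unreachable kubelet cannot confirm it. A rough sketch for listing them is below. terminating_pods is a hypothetical helper name, and it assumes the default columns of kubectl get pods --all-namespaces (NAMESPACE, NAME, READY, STATUS):

```shell
# Print namespace/name for every Pod whose STATUS column is "Terminating".
# Reads `kubectl get pods --all-namespaces --no-headers` output on stdin.
terminating_pods() {
  awk '$4 == "Terminating" { print $1 "/" $2 }'
}

# Usage (needs kubectl access to the cluster):
#   kubectl get pods --all-namespaces --no-headers | terminating_pods
```

The replacements for these Pods are the Running copies with a younger age, so the Terminating entries are leftovers from the unreachable node rather than extra workloads.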
Additional information: