I just terminated an AWS K8S node.
K8S recreated a new one and scheduled new pods onto it. So far everything seems fine.
But when I run:
kubectl get po -A
I get:
kube-system cluster-autoscaler-648b4df947-42hxv 0/1 Evicted 0 3m53s
kube-system cluster-autoscaler-648b4df947-45pcc 0/1 Evicted 0 47m
kube-system cluster-autoscaler-648b4df947-46w6h 0/1 Evicted 0 91m
kube-system cluster-autoscaler-648b4df947-4tlbl 0/1 Evicted 0 69m
kube-system cluster-autoscaler-648b4df947-52295 0/1 Evicted 0 3m54s
kube-system cluster-autoscaler-648b4df947-55wzb 0/1 Evicted 0 83m
kube-system cluster-autoscaler-648b4df947-57kv5 0/1 Evicted 0 107m
kube-system cluster-autoscaler-648b4df947-69rsl 0/1 Evicted 0 98m
kube-system cluster-autoscaler-648b4df947-6msx2 0/1 Evicted 0 11m
kube-system cluster-autoscaler-648b4df947-6pphs 0/1 Evicted 0 18m
kube-system dns-controller-697f6d9457-zswm8 0/1 Evicted 0 54m
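As a side note, these Evicted pods are left behind by the kubelet and are safe to delete once you are done inspecting them. A sketch of a cleanup filter, demonstrated here on captured output (the Running pod name is made up for the demo; against a live cluster you would pipe `kubectl get po -A` in instead):

```shell
# Sample of the listing above (second line is a hypothetical healthy pod):
sample='kube-system cluster-autoscaler-648b4df947-42hxv 0/1 Evicted 0 3m53s
kube-system some-healthy-pod-abc12 1/1 Running 0 10m'

# Keep only Evicted pods, as "namespace name" pairs:
evicted=$(printf '%s\n' "$sample" | awk '$4 == "Evicted" {print $1, $2}')
printf '%s\n' "$evicted"

# On a live cluster you would then delete them, e.g.:
# while read ns name; do kubectl delete pod -n "$ns" "$name"; done <<< "$evicted"
```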
When I run:
kubectl describe pod -n kube-system dns-controller-697f6d9457-zswm8
I get:
➜ monitoring git:(master) ✗ kubectl describe pod -n kube-system dns-controller-697f6d9457-zswm8
Name: dns-controller-697f6d9457-zswm8
Namespace: kube-system
Priority: 0
Node: ip-172-20-57-13.eu-west-3.compute.internal/
Start Time: Mon, 07 Oct 2019 12:35:06 +0200
Labels: k8s-addon=dns-controller.addons.k8s.io
k8s-app=dns-controller
pod-template-hash=697f6d9457
version=v1.12.0
Annotations: scheduler.alpha.kubernetes.io/critical-pod:
Status: Failed
Reason: Evicted
Message: The node was low on resource: ephemeral-storage. Container dns-controller was using 48Ki, which exceeds its request of 0.
IP:
IPs: <none>
Controlled By: ReplicaSet/dns-controller-697f6d9457
Containers:
dns-controller:
Image: kope/dns-controller:1.12.0
Port: <none>
Host Port: <none>
Command:
/usr/bin/dns-controller
--watch-ingress=false
--dns=aws-route53
--zone=*/ZDOYTALGJJXCM
--zone=*/*
-v=2
Requests:
cpu: 50m
memory: 50Mi
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from dns-controller-token-gvxxd (ro)
Volumes:
dns-controller-token-gvxxd:
Type: Secret (a volume populated by a Secret)
SecretName: dns-controller-token-gvxxd
Optional: false
QoS Class: Burstable
Node-Selectors: node-role.kubernetes.io/master=
Tolerations: node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Evicted 59m kubelet, ip-172-20-57-13.eu-west-3.compute.internal The node was low on resource: ephemeral-storage. Container dns-controller was using 48Ki, which exceeds its request of 0.
Normal Killing 59m kubelet, ip-172-20-57-13.eu-west-3.compute.internal Killing container with id docker://dns-controller:Need to kill Pod
And:
➜ monitoring git:(master) ✗ kubectl describe pod -n kube-system cluster-autoscaler-648b4df947-2zcrz
Name: cluster-autoscaler-648b4df947-2zcrz
Namespace: kube-system
Priority: 0
Node: ip-172-20-57-13.eu-west-3.compute.internal/
Start Time: Mon, 07 Oct 2019 13:26:26 +0200
Labels: app=cluster-autoscaler
k8s-addon=cluster-autoscaler.addons.k8s.io
pod-template-hash=648b4df947
Annotations: prometheus.io/port: 8085
prometheus.io/scrape: true
scheduler.alpha.kubernetes.io/tolerations: [{"key":"dedicated", "value":"master"}]
Status: Failed
Reason: Evicted
Message: Pod The node was low on resource: [DiskPressure].
IP:
IPs: <none>
Controlled By: ReplicaSet/cluster-autoscaler-648b4df947
Containers:
cluster-autoscaler:
Image: gcr.io/google-containers/cluster-autoscaler:v1.15.1
Port: <none>
Host Port: <none>
Command:
./cluster-autoscaler
--v=4
--stderrthreshold=info
--cloud-provider=aws
--skip-nodes-with-local-storage=false
--nodes=0:1:pamela-nodes.k8s-prod.sunchain.fr
Limits:
cpu: 100m
memory: 300Mi
Requests:
cpu: 100m
memory: 300Mi
Liveness: http-get http://:8085/health-check delay=0s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get http://:8085/health-check delay=0s timeout=1s period=10s #success=1 #failure=3
Environment:
AWS_REGION: eu-west-3
Mounts:
/etc/ssl/certs/ca-certificates.crt from ssl-certs (ro)
/var/run/secrets/kubernetes.io/serviceaccount from cluster-autoscaler-token-hld2m (ro)
Volumes:
ssl-certs:
Type: HostPath (bare host directory volume)
Path: /etc/ssl/certs/ca-certificates.crt
HostPathType:
cluster-autoscaler-token-hld2m:
Type: Secret (a volume populated by a Secret)
SecretName: cluster-autoscaler-token-hld2m
Optional: false
QoS Class: Guaranteed
Node-Selectors: kubernetes.io/role=master
Tolerations: node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 11m default-scheduler Successfully assigned kube-system/cluster-autoscaler-648b4df947-2zcrz to ip-172-20-57-13.eu-west-3.compute.internal
Warning Evicted 11m kubelet, ip-172-20-57-13.eu-west-3.compute.internal The node was low on resource: [DiskPressure].
It looks like a resource problem. The strange thing is that I did not have this problem before I killed the EC2 instance.
Why is this happening, and what should I do? Do I have to add more resources?
➜ scripts kubectl describe node ip-172-20-57-13.eu-west-3.compute.internal
Name: ip-172-20-57-13.eu-west-3.compute.internal
Roles: master
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=t3.small
beta.kubernetes.io/os=linux
failure-domain.beta.kubernetes.io/region=eu-west-3
failure-domain.beta.kubernetes.io/zone=eu-west-3a
kops.k8s.io/instancegroup=master-eu-west-3a
kubernetes.io/hostname=ip-172-20-57-13.eu-west-3.compute.internal
kubernetes.io/role=master
node-role.kubernetes.io/master=
Annotations: node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Wed, 28 Aug 2019 09:38:09 +0200
Taints: node-role.kubernetes.io/master:NoSchedule
node.kubernetes.io/disk-pressure:NoSchedule
Unschedulable: false
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Wed, 28 Aug 2019 09:38:36 +0200 Wed, 28 Aug 2019 09:38:36 +0200 RouteCreated RouteController created a route
OutOfDisk False Mon, 07 Oct 2019 14:14:32 +0200 Wed, 28 Aug 2019 09:38:09 +0200 KubeletHasSufficientDisk kubelet has sufficient disk space available
MemoryPressure False Mon, 07 Oct 2019 14:14:32 +0200 Wed, 28 Aug 2019 09:38:09 +0200 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure True Mon, 07 Oct 2019 14:14:32 +0200 Mon, 07 Oct 2019 14:11:02 +0200 KubeletHasDiskPressure kubelet has disk pressure
PIDPressure False Mon, 07 Oct 2019 14:14:32 +0200 Wed, 28 Aug 2019 09:38:09 +0200 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Mon, 07 Oct 2019 14:14:32 +0200 Wed, 28 Aug 2019 09:38:35 +0200 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 172.20.57.13
ExternalIP: 35.180.187.101
InternalDNS: ip-172-20-57-13.eu-west-3.compute.internal
Hostname: ip-172-20-57-13.eu-west-3.compute.internal
ExternalDNS: ec2-35-180-187-101.eu-west-3.compute.amazonaws.com
Capacity:
attachable-volumes-aws-ebs: 25
cpu: 2
ephemeral-storage: 7797156Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 2013540Ki
pods: 110
Allocatable:
attachable-volumes-aws-ebs: 25
cpu: 2
ephemeral-storage: 7185858958
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 1911140Ki
pods: 110
System Info:
Machine ID: ec2b3aa5df0e3ad288d210f309565f06
System UUID: EC2B3AA5-DF0E-3AD2-88D2-10F309565F06
Boot ID: f9d5417b-eba9-4544-9710-a25d01247b46
Kernel Version: 4.9.0-9-amd64
OS Image: Debian GNU/Linux 9 (stretch)
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://18.6.3
Kubelet Version: v1.12.10
Kube-Proxy Version: v1.12.10
PodCIDR: 100.96.1.0/24
ProviderID: aws:///eu-west-3a/i-03bf1b26313679d65
Non-terminated Pods: (6 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
kube-system etcd-manager-events-ip-172-20-57-13.eu-west-3.compute.internal 100m (5%) 0 (0%) 100Mi (5%) 0 (0%) 40d
kube-system etcd-manager-main-ip-172-20-57-13.eu-west-3.compute.internal 200m (10%) 0 (0%) 100Mi (5%) 0 (0%) 40d
kube-system kube-apiserver-ip-172-20-57-13.eu-west-3.compute.internal 150m (7%) 0 (0%) 0 (0%) 0 (0%) 40d
kube-system kube-controller-manager-ip-172-20-57-13.eu-west-3.compute.internal 100m (5%) 0 (0%) 0 (0%) 0 (0%) 40d
kube-system kube-proxy-ip-172-20-57-13.eu-west-3.compute.internal 100m (5%) 0 (0%) 0 (0%) 0 (0%) 40d
kube-system kube-scheduler-ip-172-20-57-13.eu-west-3.compute.internal 100m (5%) 0 (0%) 0 (0%) 0 (0%) 40d
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 750m (37%) 0 (0%)
memory 200Mi (10%) 0 (0%)
ephemeral-storage 0 (0%) 0 (0%)
attachable-volumes-aws-ebs 0 0
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal NodeHasNoDiskPressure 55m (x324 over 40d) kubelet, ip-172-20-57-13.eu-west-3.compute.internal Node ip-172-20-57-13.eu-west-3.compute.internal status is now: NodeHasNoDiskPressure
Warning EvictionThresholdMet 10m (x1809 over 16d) kubelet, ip-172-20-57-13.eu-west-3.compute.internal Attempting to reclaim ephemeral-storage
Warning ImageGCFailed 4m30s (x6003 over 23d) kubelet, ip-172-20-57-13.eu-west-3.compute.internal (combined from similar events): wanted to free 652348620 bytes, but freed 0 bytes space with errors in image deletion: rpc error: code = Unknown desc = Error response from daemon: conflict: unable to delete dd37681076e1 (cannot be forced) - image is being used by running container b1800146af29
I think a better command to debug this is:
devops git:(master) ✗ kubectl get events --sort-by=.metadata.creationTimestamp -o wide
LAST SEEN TYPE REASON KIND SOURCE MESSAGE SUBOBJECT FIRST SEEN COUNT NAME
10m Warning ImageGCFailed Node kubelet, ip-172-20-57-13.eu-west-3.compute.internal (combined from similar events): wanted to free 653307084 bytes, but freed 0 bytes space with errors in image deletion: rpc error: code = Unknown desc = Error response from daemon: conflict: unable to delete dd37681076e1 (cannot be forced) - image is being used by running container b1800146af29 23d 6004 ip-172-20-57-13.eu-west-3.compute.internal.15c4124e15eb1d33
2m59s Warning ImageGCFailed Node kubelet, ip-172-20-36-135.eu-west-3.compute.internal (combined from similar events): failed to garbage collect required amount of images. Wanted to free 639524044 bytes, but freed 0 bytes 7d9h 2089 ip-172-20-36-135.eu-west-3.compute.internal.15c916d24afe2c25
4m59s Warning ImageGCFailed Node kubelet, ip-172-20-33-81.eu-west-3.compute.internal (combined from similar events): failed to garbage collect required amount of images. Wanted to free 458296524 bytes, but freed 0 bytes 4d14h 1183 ip-172-20-33-81.eu-west-3.compute.internal.15c9f3fe4e1525ec
6m43s Warning EvictionThresholdMet Node kubelet, ip-172-20-57-13.eu-west-3.compute.internal Attempting to reclaim ephemeral-storage 16d 1841 ip-172-20-57-13.eu-west-3.compute.internal.15c66e349b761219
41s Normal NodeHasNoDiskPressure Node kubelet, ip-172-20-57-13.eu-west-3.compute.internal Node ip-172-20-57-13.eu-west-3.compute.internal status is now: NodeHasNoDiskPressure 40d 333 ip-172-20-57-13.eu-west-3.compute.internal.15bf05cec37981b6
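The ImageGCFailed events also show why the kubelet cannot reclaim space on its own: the image it wants to delete is pinned by a running container, so image GC keeps failing. The message itself names both IDs; a small sketch that pulls them out of the event text (sample message copied from the output above), so you know what to look for on the node:

```shell
# Event message as reported by the kubelet above:
msg='unable to delete dd37681076e1 (cannot be forced) - image is being used by running container b1800146af29'

# Extract the pinned image ID and the container holding it:
img=$(printf '%s' "$msg" | sed -n 's/.*unable to delete \([0-9a-f]*\).*/\1/p')
ctr=$(printf '%s' "$msg" | sed -n 's/.*running container \([0-9a-f]*\).*/\1/p')
echo "image=$img container=$ctr"

# On the node you could then check which pod that container belongs to, e.g.:
# docker ps --no-trunc | grep "$ctr"
```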
Now df -h:
admin@ip-172-20-57-13:/var/log$ df -h
Filesystem Size Used Avail Use% Mounted on
udev 972M 0 972M 0% /dev
tmpfs 197M 2.3M 195M 2% /run
/dev/nvme0n1p2 7.5G 6.4G 707M 91% /
tmpfs 984M 0 984M 0% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 984M 0 984M 0% /sys/fs/cgroup
/dev/nvme1n1 20G 430M 20G 3% /mnt/master-vol-09618123eb79d92c8
/dev/nvme2n1 20G 229M 20G 2% /mnt/master-vol-05c9684f0edcbd876
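For context: the kubelet's default hard eviction threshold for disk is nodefs.available < 10%. Plugging in the figures from the output above (707M available from `df -h`, 7797156Ki capacity from `kubectl describe node`) shows / sits just under that line, which explains the DiskPressure condition. A back-of-the-envelope sketch:

```shell
# Figures taken from the df/describe-node output above:
avail_kb=$(( 707 * 1024 ))   # 707M available on /
size_kb=7797156              # node ephemeral-storage capacity (Ki)

# Integer percentage of space still available:
pct=$(( avail_kb * 100 / size_kb ))
echo "available: ${pct}%"

# Default kubelet hard eviction threshold: nodefs.available < 10%
if [ "$pct" -lt 10 ]; then
  echo "below default nodefs.available threshold -> DiskPressure"
fi
```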
Answer (score: 2):
Your node/master seems to be running out of storage: / is 91% used, with only about 700 MB free, which is what is triggering the DiskPressure evictions.
You should free up some space on the nodes and masters; that should get rid of your problem.
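To find out what is actually filling /, a `du` triage on the node is the usual first step. The sketch below demonstrates the pattern on a scratch directory so it is self-contained; on the node you would point it at / instead (and then reclaim space with e.g. `docker system prune` or by rotating logs — verify any destructive step in your environment first):

```shell
# Self-contained demo: build a scratch tree with one big and one small directory.
tmp=$(mktemp -d)
mkdir -p "$tmp/big" "$tmp/small"
dd if=/dev/zero of="$tmp/big/blob"   bs=1024 count=2048 2>/dev/null   # ~2 MiB
dd if=/dev/zero of="$tmp/small/blob" bs=1024 count=16   2>/dev/null   # ~16 KiB

# Largest directories first — on the node: sudo du -xk --max-depth=2 / | sort -rn | head -20
top=$(du -xk --max-depth=1 "$tmp" | sort -rn | head -3)
printf '%s\n' "$top"
rm -rf "$tmp"
```

Given the ImageGCFailed events above, also check whether old container images are the culprit; since one image is pinned by a running container, image GC alone may not be able to free enough, and growing the root volume is a reasonable fallback.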