cluster-autoscaler and dns-controller keep getting evicted

Date: 2019-10-07 11:34:58

Tags: kubernetes

I have just terminated an AWS K8s node.

K8s recreated a new one, and new pods were scheduled on it. So far, everything seems fine.

But when I run:

kubectl get po -A

I get:

kube-system            cluster-autoscaler-648b4df947-42hxv                                  0/1     Evicted     0          3m53s
kube-system            cluster-autoscaler-648b4df947-45pcc                                  0/1     Evicted     0          47m
kube-system            cluster-autoscaler-648b4df947-46w6h                                  0/1     Evicted     0          91m
kube-system            cluster-autoscaler-648b4df947-4tlbl                                  0/1     Evicted     0          69m
kube-system            cluster-autoscaler-648b4df947-52295                                  0/1     Evicted     0          3m54s
kube-system            cluster-autoscaler-648b4df947-55wzb                                  0/1     Evicted     0          83m
kube-system            cluster-autoscaler-648b4df947-57kv5                                  0/1     Evicted     0          107m
kube-system            cluster-autoscaler-648b4df947-69rsl                                  0/1     Evicted     0          98m
kube-system            cluster-autoscaler-648b4df947-6msx2                                  0/1     Evicted     0          11m
kube-system            cluster-autoscaler-648b4df947-6pphs                                       0          18m
kube-system            dns-controller-697f6d9457-zswm8                                      0/1     Evicted     0          54m

And when I run:

kubectl describe pod -n kube-system dns-controller-697f6d9457-zswm8

I get:

➜  monitoring git:(master) ✗ kubectl describe pod -n kube-system dns-controller-697f6d9457-zswm8
Name:           dns-controller-697f6d9457-zswm8
Namespace:      kube-system
Priority:       0
Node:           ip-172-20-57-13.eu-west-3.compute.internal/
Start Time:     Mon, 07 Oct 2019 12:35:06 +0200
Labels:         k8s-addon=dns-controller.addons.k8s.io
                k8s-app=dns-controller
                pod-template-hash=697f6d9457
                version=v1.12.0
Annotations:    scheduler.alpha.kubernetes.io/critical-pod: 
Status:         Failed
Reason:         Evicted
Message:        The node was low on resource: ephemeral-storage. Container dns-controller was using 48Ki, which exceeds its request of 0. 
IP:             
IPs:            <none>
Controlled By:  ReplicaSet/dns-controller-697f6d9457
Containers:
  dns-controller:
    Image:      kope/dns-controller:1.12.0
    Port:       <none>
    Host Port:  <none>
    Command:
      /usr/bin/dns-controller
      --watch-ingress=false
      --dns=aws-route53
      --zone=*/ZDOYTALGJJXCM
      --zone=*/*
      -v=2
    Requests:
      cpu:        50m
      memory:     50Mi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from dns-controller-token-gvxxd (ro)
Volumes:
  dns-controller-token-gvxxd:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  dns-controller-token-gvxxd
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  node-role.kubernetes.io/master=
Tolerations:     node-role.kubernetes.io/master:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason   Age   From                                                 Message
  ----     ------   ----  ----                                                 -------
  Warning  Evicted  59m   kubelet, ip-172-20-57-13.eu-west-3.compute.internal  The node was low on resource: ephemeral-storage. Container dns-controller was using 48Ki, which exceeds its request of 0.
  Normal   Killing  59m   kubelet, ip-172-20-57-13.eu-west-3.compute.internal  Killing container with id docker://dns-controller:Need to kill Pod

And:

➜  monitoring git:(master) ✗ kubectl describe pod -n kube-system cluster-autoscaler-648b4df947-2zcrz 
Name:           cluster-autoscaler-648b4df947-2zcrz
Namespace:      kube-system
Priority:       0
Node:           ip-172-20-57-13.eu-west-3.compute.internal/
Start Time:     Mon, 07 Oct 2019 13:26:26 +0200
Labels:         app=cluster-autoscaler
                k8s-addon=cluster-autoscaler.addons.k8s.io
                pod-template-hash=648b4df947
Annotations:    prometheus.io/port: 8085
                prometheus.io/scrape: true
                scheduler.alpha.kubernetes.io/tolerations: [{"key":"dedicated", "value":"master"}]
Status:         Failed
Reason:         Evicted
Message:        Pod The node was low on resource: [DiskPressure]. 
IP:             
IPs:            <none>
Controlled By:  ReplicaSet/cluster-autoscaler-648b4df947
Containers:
  cluster-autoscaler:
    Image:      gcr.io/google-containers/cluster-autoscaler:v1.15.1
    Port:       <none>
    Host Port:  <none>
    Command:
      ./cluster-autoscaler
      --v=4
      --stderrthreshold=info
      --cloud-provider=aws
      --skip-nodes-with-local-storage=false
      --nodes=0:1:pamela-nodes.k8s-prod.sunchain.fr
    Limits:
      cpu:     100m
      memory:  300Mi
    Requests:
      cpu:      100m
      memory:   300Mi
    Liveness:   http-get http://:8085/health-check delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:  http-get http://:8085/health-check delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      AWS_REGION:  eu-west-3
    Mounts:
      /etc/ssl/certs/ca-certificates.crt from ssl-certs (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from cluster-autoscaler-token-hld2m (ro)
Volumes:
  ssl-certs:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/ssl/certs/ca-certificates.crt
    HostPathType:  
  cluster-autoscaler-token-hld2m:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  cluster-autoscaler-token-hld2m
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  kubernetes.io/role=master
Tolerations:     node-role.kubernetes.io/master:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age   From                                                 Message
  ----     ------     ----  ----                                                 -------
  Normal   Scheduled  11m   default-scheduler                                    Successfully assigned kube-system/cluster-autoscaler-648b4df947-2zcrz to ip-172-20-57-13.eu-west-3.compute.internal
  Warning  Evicted    11m   kubelet, ip-172-20-57-13.eu-west-3.compute.internal  The node was low on resource: [DiskPressure].

This looks like a resource problem. The strange thing is that I did not have this issue before killing the EC2 instance.

Why is this happening, and what should I do about it? Do I have to add more resources?
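
One detail worth noting in the messages above: the pods are evicted over ephemeral-storage, and neither container declares an ephemeral-storage request, so under DiskPressure even 48Ki of usage "exceeds its request of 0" and makes them early eviction targets. Declaring a request does not free any disk, but it does change the eviction ordering. A minimal sketch (the 100Mi value is an assumption, and kops may revert manual edits to its managed addons):

# give dns-controller a small ephemeral-storage request so it is not among the first
# pods evicted when the node reports DiskPressure (100Mi is an example value)
kubectl -n kube-system patch deployment dns-controller -p \
  '{"spec":{"template":{"spec":{"containers":[{"name":"dns-controller","resources":{"requests":{"ephemeral-storage":"100Mi"}}}]}}}}'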

➜  scripts kubectl describe node ip-172-20-57-13.eu-west-3.compute.internal
Name:               ip-172-20-57-13.eu-west-3.compute.internal
Roles:              master
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=t3.small
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=eu-west-3
                    failure-domain.beta.kubernetes.io/zone=eu-west-3a
                    kops.k8s.io/instancegroup=master-eu-west-3a
                    kubernetes.io/hostname=ip-172-20-57-13.eu-west-3.compute.internal
                    kubernetes.io/role=master
                    node-role.kubernetes.io/master=
Annotations:        node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Wed, 28 Aug 2019 09:38:09 +0200
Taints:             node-role.kubernetes.io/master:NoSchedule
                    node.kubernetes.io/disk-pressure:NoSchedule
Unschedulable:      false
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Wed, 28 Aug 2019 09:38:36 +0200   Wed, 28 Aug 2019 09:38:36 +0200   RouteCreated                 RouteController created a route
  OutOfDisk            False   Mon, 07 Oct 2019 14:14:32 +0200   Wed, 28 Aug 2019 09:38:09 +0200   KubeletHasSufficientDisk     kubelet has sufficient disk space available
  MemoryPressure       False   Mon, 07 Oct 2019 14:14:32 +0200   Wed, 28 Aug 2019 09:38:09 +0200   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         True    Mon, 07 Oct 2019 14:14:32 +0200   Mon, 07 Oct 2019 14:11:02 +0200   KubeletHasDiskPressure       kubelet has disk pressure
  PIDPressure          False   Mon, 07 Oct 2019 14:14:32 +0200   Wed, 28 Aug 2019 09:38:09 +0200   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Mon, 07 Oct 2019 14:14:32 +0200   Wed, 28 Aug 2019 09:38:35 +0200   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:   172.20.57.13
  ExternalIP:   35.180.187.101
  InternalDNS:  ip-172-20-57-13.eu-west-3.compute.internal
  Hostname:     ip-172-20-57-13.eu-west-3.compute.internal
  ExternalDNS:  ec2-35-180-187-101.eu-west-3.compute.amazonaws.com
Capacity:
 attachable-volumes-aws-ebs:  25
 cpu:                         2
 ephemeral-storage:           7797156Ki
 hugepages-1Gi:               0
 hugepages-2Mi:               0
 memory:                      2013540Ki
 pods:                        110
Allocatable:
 attachable-volumes-aws-ebs:  25
 cpu:                         2
 ephemeral-storage:           7185858958
 hugepages-1Gi:               0
 hugepages-2Mi:               0
 memory:                      1911140Ki
 pods:                        110
System Info:
 Machine ID:                 ec2b3aa5df0e3ad288d210f309565f06
 System UUID:                EC2B3AA5-DF0E-3AD2-88D2-10F309565F06
 Boot ID:                    f9d5417b-eba9-4544-9710-a25d01247b46
 Kernel Version:             4.9.0-9-amd64
 OS Image:                   Debian GNU/Linux 9 (stretch)
 Operating System:           linux
 Architecture:               amd64
 Container Runtime Version:  docker://18.6.3
 Kubelet Version:            v1.12.10
 Kube-Proxy Version:         v1.12.10
PodCIDR:                     100.96.1.0/24
ProviderID:                  aws:///eu-west-3a/i-03bf1b26313679d65
Non-terminated Pods:         (6 in total)
  Namespace                  Name                                                                  CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                  ----                                                                  ------------  ----------  ---------------  -------------  ---
  kube-system                etcd-manager-events-ip-172-20-57-13.eu-west-3.compute.internal        100m (5%)     0 (0%)      100Mi (5%)       0 (0%)         40d
  kube-system                etcd-manager-main-ip-172-20-57-13.eu-west-3.compute.internal          200m (10%)    0 (0%)      100Mi (5%)       0 (0%)         40d
  kube-system                kube-apiserver-ip-172-20-57-13.eu-west-3.compute.internal             150m (7%)     0 (0%)      0 (0%)           0 (0%)         40d
  kube-system                kube-controller-manager-ip-172-20-57-13.eu-west-3.compute.internal    100m (5%)     0 (0%)      0 (0%)           0 (0%)         40d
  kube-system                kube-proxy-ip-172-20-57-13.eu-west-3.compute.internal                 100m (5%)     0 (0%)      0 (0%)           0 (0%)         40d
  kube-system                kube-scheduler-ip-172-20-57-13.eu-west-3.compute.internal             100m (5%)     0 (0%)      0 (0%)           0 (0%)         40d
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests     Limits
  --------                    --------     ------
  cpu                         750m (37%)   0 (0%)
  memory                      200Mi (10%)  0 (0%)
  ephemeral-storage           0 (0%)       0 (0%)
  attachable-volumes-aws-ebs  0            0
Events:
  Type     Reason                 Age                     From                                                 Message
  ----     ------                 ----                    ----                                                 -------
  Normal   NodeHasNoDiskPressure  55m (x324 over 40d)     kubelet, ip-172-20-57-13.eu-west-3.compute.internal  Node ip-172-20-57-13.eu-west-3.compute.internal status is now: NodeHasNoDiskPressure
  Warning  EvictionThresholdMet   10m (x1809 over 16d)    kubelet, ip-172-20-57-13.eu-west-3.compute.internal  Attempting to reclaim ephemeral-storage
  Warning  ImageGCFailed          4m30s (x6003 over 23d)  kubelet, ip-172-20-57-13.eu-west-3.compute.internal  (combined from similar events): wanted to free 652348620 bytes, but freed 0 bytes space with errors in image deletion: rpc error: code = Unknown desc = Error response from daemon: conflict: unable to delete dd37681076e1 (cannot be forced) - image is being used by running container b1800146af29

I think a better command for debugging this is:

devops git:(master) ✗ kubectl get events --sort-by=.metadata.creationTimestamp -o wide

LAST SEEN   TYPE      REASON                  KIND   SOURCE                                                 MESSAGE                                                                                                                                                                                                                                                                                                  SUBOBJECT   FIRST SEEN   COUNT   NAME
10m         Warning   ImageGCFailed           Node   kubelet, ip-172-20-57-13.eu-west-3.compute.internal    (combined from similar events): wanted to free 653307084 bytes, but freed 0 bytes space with errors in image deletion: rpc error: code = Unknown desc = Error response from daemon: conflict: unable to delete dd37681076e1 (cannot be forced) - image is being used by running container b1800146af29               23d          6004    ip-172-20-57-13.eu-west-3.compute.internal.15c4124e15eb1d33
2m59s       Warning   ImageGCFailed           Node   kubelet, ip-172-20-36-135.eu-west-3.compute.internal   (combined from similar events): failed to garbage collect required amount of images. Wanted to free 639524044 bytes, but freed 0 bytes                                                                                                                                                                               7d9h         2089    ip-172-20-36-135.eu-west-3.compute.internal.15c916d24afe2c25
4m59s       Warning   ImageGCFailed           Node   kubelet, ip-172-20-33-81.eu-west-3.compute.internal    (combined from similar events): failed to garbage collect required amount of images. Wanted to free 458296524 bytes, but freed 0 bytes                                                                                                                                                                               4d14h        1183    ip-172-20-33-81.eu-west-3.compute.internal.15c9f3fe4e1525ec
6m43s       Warning   EvictionThresholdMet    Node   kubelet, ip-172-20-57-13.eu-west-3.compute.internal    Attempting to reclaim ephemeral-storage                                                                                                                                                                                                                                                                              16d          1841    ip-172-20-57-13.eu-west-3.compute.internal.15c66e349b761219
41s         Normal    NodeHasNoDiskPressure   Node   kubelet, ip-172-20-57-13.eu-west-3.compute.internal    Node ip-172-20-57-13.eu-west-3.compute.internal status is now: NodeHasNoDiskPressure                                                                                                                                                                                                                                 40d          333     ip-172-20-57-13.eu-west-3.compute.internal.15bf05cec37981b6

And now df -h:

admin@ip-172-20-57-13:/var/log$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            972M     0  972M   0% /dev
tmpfs           197M  2.3M  195M   2% /run
/dev/nvme0n1p2  7.5G  6.4G  707M  91% /
tmpfs           984M     0  984M   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           984M     0  984M   0% /sys/fs/cgroup
/dev/nvme1n1     20G  430M   20G   3% /mnt/master-vol-09618123eb79d92c8
/dev/nvme2n1     20G  229M   20G   2% /mnt/master-vol-05c9684f0edcbd876
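
To see which directories are actually filling the root volume, something like this should work on the node (a sketch; adjust the depth as needed):

# run on the node: list the largest directories on the root filesystem only (-x stays on one filesystem)
sudo du -xh --max-depth=2 / 2>/dev/null | sort -h | tail -n 20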

1 Answer:

Answer (score: 2):

It looks like your node/master is running low on storage. I can see that there is less than 1 GB of ephemeral storage left.

You should free up some space on the nodes and on the master; that should get rid of your problem.
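
A sketch of what that cleanup could look like (the prune commands only remove containers and images that are not in use, so they will not touch the image the ImageGCFailed event complains about while its container is running):

# clear the accumulated Evicted pod records (they are only status objects, but they clutter kubectl get po)
kubectl -n kube-system get pods --field-selector=status.phase=Failed -o name \
  | xargs -r kubectl -n kube-system delete

# on the affected node: remove stopped containers and unused images to relieve the DiskPressure condition
sudo docker container prune -f
sudo docker image prune -a -f

# verify that the root filesystem dropped back below the eviction threshold
df -h /

If the master keeps hitting 90%+ usage again afterwards, the longer-term fix is probably to grow the root volume of the master instance group (it is a t3.small with roughly an 8 GB root disk), which kops lets you configure on the instance group.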