Kubernetes HA cluster master nodes not ready

Posted: 2018-06-11 10:26:31

Tags: kubernetes etcd project-calico

I have deployed a Kubernetes HA cluster using the following config.yaml:

etcd:
  endpoints:
  - "http://172.16.8.236:2379"
  - "http://172.16.8.237:2379"
  - "http://172.16.8.238:2379"
networking:
  podSubnet: "192.168.0.0/16"
apiServerExtraArgs:
  endpoint-reconciler-type: lease

When I check the nodes with kubectl get nodes:

NAME      STATUS     ROLES     AGE       VERSION
master1   Ready      master    22m       v1.10.4
master2   NotReady   master    17m       v1.10.4
master3   NotReady   master    16m       v1.10.4

If I check the pods, I can see many of them failing:

[ikerlan@master1 ~]$  kubectl get pods -n kube-system
NAME                                       READY     STATUS              RESTARTS   AGE
calico-etcd-5jftb                          0/1       NodeLost            0          16m
calico-etcd-kl7hb                          1/1       Running             0          16m
calico-etcd-z7sps                          0/1       NodeLost            0          16m
calico-kube-controllers-79dccdc4cc-vt5t7   1/1       Running             0          16m
calico-node-dbjl2                          2/2       Running             0          16m
calico-node-gkkth                          0/2       NodeLost            0          16m
calico-node-rqzzl                          0/2       NodeLost            0          16m
kube-apiserver-master1                     1/1       Running             0          21m
kube-controller-manager-master1            1/1       Running             0          22m
kube-dns-86f4d74b45-rwchm                  1/3       CrashLoopBackOff    17         22m
kube-proxy-226xd                           1/1       Running             0          22m
kube-proxy-jr2jq                           0/1       ContainerCreating   0          18m
kube-proxy-zmjdm                           0/1       ContainerCreating   0          17m
kube-scheduler-master1                     1/1       Running             0          21m

If I run kubectl describe node master2:

[ikerlan@master1 ~]$ kubectl describe node master2
Name:               master2
Roles:              master
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/hostname=master2
                    node-role.kubernetes.io/master=
Annotations:        node.alpha.kubernetes.io/ttl=0
                    volumes.kubernetes.io/controller-managed-attach-detach=true
CreationTimestamp:  Mon, 11 Jun 2018 12:06:03 +0200
Taints:             node-role.kubernetes.io/master:NoSchedule
Unschedulable:      false
Conditions:
  Type             Status    LastHeartbeatTime                 LastTransitionTime                Reason                    Message
  ----             ------    -----------------                 ------------------                ------                    -------
  OutOfDisk        Unknown   Mon, 11 Jun 2018 12:06:15 +0200   Mon, 11 Jun 2018 12:06:56 +0200   NodeStatusUnknown         Kubelet stopped posting node status.
  MemoryPressure   Unknown   Mon, 11 Jun 2018 12:06:15 +0200   Mon, 11 Jun 2018 12:06:56 +0200   NodeStatusUnknown         Kubelet stopped posting node status.
  DiskPressure     Unknown   Mon, 11 Jun 2018 12:06:15 +0200   Mon, 11 Jun 2018 12:06:56 +0200   NodeStatusUnknown         Kubelet stopped posting node status.
  PIDPressure      False     Mon, 11 Jun 2018 12:06:15 +0200   Mon, 11 Jun 2018 12:06:00 +0200   KubeletHasSufficientPID   kubelet has sufficient PID available
  Ready            Unknown   Mon, 11 Jun 2018 12:06:15 +0200   Mon, 11 Jun 2018 12:06:56 +0200   NodeStatusUnknown         Kubelet stopped posting node status.
Addresses:
  InternalIP:  172.16.8.237
  Hostname:    master2
Capacity:
 cpu:                2
 ephemeral-storage:  37300436Ki
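
The "Kubelet stopped posting node status" conditions suggest that the kubelet on master2 (and master3) has stopped running or can no longer reach the API server. A quick way to confirm this directly on the affected node, assuming a systemd-managed kubelet (the installation method is not shown in the question), is:

systemctl status kubelet
journalctl -u kubelet --since "1 hour ago" --no-pager | tail -n 50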

Then, if I check the failing pod with kubectl describe pod -n kube-system calico-etcd-5jftb:

[ikerlan@master1 ~]$ kubectl describe pod -n kube-system  calico-etcd-5jftb
Name:                      calico-etcd-5jftb
Namespace:                 kube-system
Node:                      master2/
Labels:                    controller-revision-hash=4283683065
                           k8s-app=calico-etcd
                           pod-template-generation=1
Annotations:               scheduler.alpha.kubernetes.io/critical-pod=
Status:                    Terminating (lasts 20h)
Termination Grace Period:  30s
Reason:                    NodeLost
Message:                   Node master2 which was running pod calico-etcd-5jftb is unresponsive
IP:                        
Controlled By:             DaemonSet/calico-etcd
Containers:
  calico-etcd:
    Image:      quay.io/coreos/etcd:v3.1.10
    Port:       <none>
    Host Port:  <none>
    Command:
      /usr/local/bin/etcd
    Args:
      --name=calico
      --data-dir=/var/etcd/calico-data
      --advertise-client-urls=http://$CALICO_ETCD_IP:6666
      --listen-client-urls=http://0.0.0.0:6666
      --listen-peer-urls=http://0.0.0.0:6667
      --auto-compaction-retention=1
    Environment:
      CALICO_ETCD_IP:   (v1:status.podIP)
    Mounts:
      /var/etcd from var-etcd (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-tj6d7 (ro)
Volumes:
  var-etcd:
    Type:          HostPath (bare host directory volume)
    Path:          /var/etcd
    HostPathType:  
  default-token-tj6d7:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-tj6d7
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  node-role.kubernetes.io/master=
Tolerations:     CriticalAddonsOnly
                 node-role.kubernetes.io/master:NoSchedule
                 node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule
                 node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/unreachable:NoExecute
Events:          <none>

I have tried updating the etcd cluster to version 3.3, and now I can see the following logs (and many more timeouts):

2018-06-12 09:17:51.305960 W | etcdserver: read-only range request "key:\"/registry/apiregistration.k8s.io/apiservices/v1beta1.authentication.k8s.io\" " took too long (190.475363ms) to execute
2018-06-12 09:18:06.788558 W | etcdserver: read-only range request "key:\"/registry/services/endpoints/kube-system/kube-scheduler\" " took too long (109.543763ms) to execute
2018-06-12 09:18:34.875823 W | etcdserver: read-only range request "key:\"/registry/services/endpoints/kube-system/kube-scheduler\" " took too long (136.649505ms) to execute
2018-06-12 09:18:41.634057 W | etcdserver: read-only range request "key:\"/registry/minions\" range_end:\"/registry/miniont\" count_only:true " took too long (106.00073ms) to execute
2018-06-12 09:18:42.345564 W | etcdserver: request "header:<ID:4449666326481959890 > lease_revoke:<ID:4449666326481959752 > " took too long (142.771179ms) to execute
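
These "took too long" warnings usually point to slow disk or network I/O on the etcd members rather than to a cluster that is down. One way to check member health and request latency, assuming etcdctl v3 is available on one of the hosts (it is not shown in the question), is:

ETCDCTL_API=3 etcdctl \
  --endpoints=http://172.16.8.236:2379,http://172.16.8.237:2379,http://172.16.8.238:2379 \
  endpoint health

ETCDCTL_API=3 etcdctl \
  --endpoints=http://172.16.8.236:2379,http://172.16.8.237:2379,http://172.16.8.238:2379 \
  endpoint status --write-out=table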

I have also checked kubectl get events:

22m         22m          1         master2.15375fdf087fc69f   Node                  Normal    Starting                  kube-proxy, master2   Starting kube-proxy.
22m         22m          1         master3.15375fe744055758   Node                  Normal    Starting                  kubelet, master3      Starting kubelet.
22m         22m          5         master3.15375fe74d47afa2   Node                  Normal    NodeHasSufficientDisk     kubelet, master3      Node master3 status is now: NodeHasSufficientDisk
22m         22m          5         master3.15375fe74d47f80f   Node                  Normal    NodeHasSufficientMemory   kubelet, master3      Node master3 status is now: NodeHasSufficientMemory
22m         22m          5         master3.15375fe74d48066e   Node                  Normal    NodeHasNoDiskPressure     kubelet, master3      Node master3 status is now: NodeHasNoDiskPressure
22m         22m          5         master3.15375fe74d481368   Node                  Normal    NodeHasSufficientPID      kubelet, master3      Node master3 status is now: NodeHasSufficientPID

2 Answers:

Answer 0 (score: 1)

I can see multiple calico-etcd pods trying to run, which suggests you used the calico.yaml that deploys etcd for you; that does not work in a multi-master environment.

That manifest is not intended for production deployments, nor for multi-master environments, because the etcd it deploys is not configured to try to form a cluster.

You could still use that manifest, but you would need to delete the etcd pods it deploys and set etcd_endpoints to the etcd cluster you have already deployed, for example as sketched below.
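
A minimal sketch of that change, assuming the commonly used calico.yaml layout in which a calico-config ConfigMap in kube-system carries the etcd_endpoints value (the exact manifest revision is not shown in the question), pointing Calico at the etcd cluster from the question's config.yaml:

kind: ConfigMap
apiVersion: v1
metadata:
  name: calico-config
  namespace: kube-system
data:
  etcd_endpoints: "http://172.16.8.236:2379,http://172.16.8.237:2379,http://172.16.8.238:2379"

The calico-etcd pods that the manifest already created can then be removed, for example:

kubectl delete daemonset -n kube-system calico-etcd
kubectl delete service -n kube-system calico-etcd    # only if the manifest also created this Service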

Answer 1 (score: 0)

I have solved it by:

  1. Adding all the master IPs and the load balancer IP to apiServerCertSANs

  2. Copying the Kubernetes certificates from the first master to the other masters (see the sketch below).
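
A sketch of those two steps, assuming the kubeadm v1alpha1 MasterConfiguration format that Kubernetes 1.10 uses and the default /etc/kubernetes/pki certificate directory; the load balancer address below is a placeholder, since it is not given in the question:

apiVersion: kubeadm.k8s.io/v1alpha1
kind: MasterConfiguration
apiServerCertSANs:
- "172.16.8.236"
- "172.16.8.237"
- "172.16.8.238"
- "<LB_IP_OR_DNS>"   # placeholder for the load balancer address
etcd:
  endpoints:
  - "http://172.16.8.236:2379"
  - "http://172.16.8.237:2379"
  - "http://172.16.8.238:2379"
networking:
  podSubnet: "192.168.0.0/16"
apiServerExtraArgs:
  endpoint-reconciler-type: lease

# on master1, copy the certificates generated by kubeadm to the other masters
# (assumes SSH access with permission to write under /etc/kubernetes)
scp -r /etc/kubernetes/pki master2:/etc/kubernetes/
scp -r /etc/kubernetes/pki master3:/etc/kubernetes/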