I've just moved my first cluster from minikube to AWS EKS. Everything has gone smoothly so far, except that I think I'm hitting some DNS issues, and only on one of the cluster's nodes.
The cluster has two nodes running v1.14, with 4 pods of one type and 4 of another. Three of each work fine, but one of each - both on the same node - starts and then errors out (CrashLoopBackOff) with a script error inside the container, because it can't resolve the hostname of the database. Deleting the failing pod, or even all of the pods, results in one pod on that same node failing every time.
The database lives in its own pod and has a service assigned to it, and none of the other pods of the same type have any trouble resolving the name or connecting. The database pod is on the same node as the pods that can't resolve the hostname. I'm not sure how to migrate a pod to a different node, but that might be worth trying to see whether the problem follows it (a rough sketch of that test is below). There are no errors in the coredns pods. I'm not sure where to start looking to track this down, so any help or suggestions would be much appreciated.
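A rough sketch of what I have in mind for that test (pod and node names are taken from the output further down; the nslookup check assumes the tool is present in the image):

# Check name resolution from inside the failing pod
kubectl exec -it pod1-85f7968f7-k9xv2 -- cat /etc/resolv.conf
kubectl exec -it pod1-85f7968f7-k9xv2 -- nslookup postgresql

# Cordon the suspect node and delete the failing pod so it reschedules on the other node,
# then uncordon once the test is done
kubectl cordon ip-192-168-87-230.us-east-2.compute.internal
kubectl delete pod pod1-85f7968f7-k9xv2
kubectl uncordon ip-192-168-87-230.us-east-2.compute.internal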
The configs are provided below. As mentioned, they all worked on Minikube, and they work on one of the nodes.
kubectl get pods - note the ages: all of the pod1 pods were deleted at the same time and recreated themselves; three work fine, the fourth does not.
NAME READY STATUS RESTARTS AGE
pod1-85f7968f7-2cjwt 1/1 Running 0 34h
pod1-85f7968f7-cbqn6 1/1 Running 0 34h
pod1-85f7968f7-k9xv2 0/1 CrashLoopBackOff 399 34h
pod1-85f7968f7-qwcrz 1/1 Running 0 34h
postgresql-865db94687-cpptb 1/1 Running 0 3d14h
rabbitmq-667cfc4cc-t92pl 1/1 Running 0 34h
pod2-94b9bc6b6-6bzf7 1/1 Running 0 34h
pod2-94b9bc6b6-6nvkr 1/1 Running 0 34h
pod2-94b9bc6b6-jcjtb 0/1 CrashLoopBackOff 140 11h
pod2-94b9bc6b6-t4gfq 1/1 Running 0 34h
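Node placement can be confirmed with the wide output (standard kubectl, nothing cluster-specific) - this is how I can see that the two failing pods share a node:

kubectl get pods -o wide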
postgresql service:
apiVersion: v1
kind: Service
metadata:
  name: postgresql
spec:
  ports:
    - port: 5432
  selector:
    app: postgresql
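As a sanity check that the selector actually matches the postgres pod, the endpoints behind the service can be listed (the fully qualified name the pods ultimately resolve is postgresql.default.svc.cluster.local, since everything is in the default namespace):

# The service should show the postgres pod's IP as an endpoint
kubectl get svc postgresql
kubectl get endpoints postgresql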
pod1 deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pod1
spec:
  replicas: 4
  selector:
    matchLabels:
      app: pod1
  template:
    metadata:
      labels:
        app: pod1
    spec:
      containers:
        - name: pod1
          image: us.gcr.io/gcp-project-8888888/pod1:latest
          env:
            - name: rabbitmquser
              valueFrom:
                secretKeyRef:
                  name: rabbitmq-secrets
                  key: rmquser
          volumeMounts:
            - mountPath: /data/files
              name: datafiles
      volumes:
        - name: datafiles
          persistentVolumeClaim:
            claimName: datafiles-pv-claim
      imagePullSecrets:
        - name: container-readonly
pod2 deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pod2
spec:
  replicas: 4
  selector:
    matchLabels:
      app: pod2
  template:
    metadata:
      labels:
        app: pod2
    spec:
      containers:
        - name: pod2
          image: us.gcr.io/gcp-project-8888888/pod2:latest
          env:
            - name: rabbitmquser
              valueFrom:
                secretKeyRef:
                  name: rabbitmq-secrets
                  key: rmquser
          volumeMounts:
            - mountPath: /data/files
              name: datafiles
      volumes:
        - name: datafiles
          persistentVolumeClaim:
            claimName: datafiles-pv-claim
      imagePullSecrets:
        - name: container-readonly
CoreDNS ConfigMap, which forwards DNS to an external resolver when a name can't be resolved internally. This is the only thing I can think of that might cause the problem - but as said, it works fine on one node.
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
          pods insecure
          upstream
          fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        proxy . 8.8.8.8
        cache 30
        loop
        reload
        loadbalance
    }
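Assuming the coredns pods carry the usual k8s-app=kube-dns label on EKS, their placement and logs can be checked with:

# Which node(s) are the coredns pods running on?
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide
# Are any resolution errors being logged?
kubectl -n kube-system logs -l k8s-app=kube-dns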
Output from an erroring pod. It is identical for both pod types, since the failure happens in library code shared by both. As mentioned, it doesn't happen on all of the pods, so the problem is probably not in the code.
Error connecting to database (psycopg2.OperationalError) could not translate host name "postgresql" to address: Try again
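One way to reproduce the lookup on the suspect node, independently of the application image, is to pin a throwaway busybox pod to it (a sketch only; the node name is taken from the describe output below, and busybox:1.28 is used because its nslookup behaves sensibly):

kubectl run dns-test --rm -it --restart=Never --image=busybox:1.28 \
  --overrides='{"apiVersion": "v1", "spec": {"nodeName": "ip-192-168-87-230.us-east-2.compute.internal"}}' \
  -- nslookup postgresql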
Erroring pod 1 description:
Name: xyz-94b9bc6b6-jcjtb
Namespace: default
Priority: 0
Node: ip-192-168-87-230.us-east-2.compute.internal/192.168.87.230
Start Time: Tue, 15 Oct 2019 19:43:11 +1030
Labels: app=pod1
pod-template-hash=94b9bc6b6
Annotations: kubernetes.io/psp: eks.privileged
Status: Running
IP: 192.168.70.63
Controlled By: ReplicaSet/xyz-94b9bc6b6
Containers:
pod1:
Container ID: docker://f7dc735111bd94b7c7b698e69ad302ca19ece6c72b654057627626620b67d6de
Image: us.gcr.io/xyz/xyz:latest
Image ID: docker-pullable://us.gcr.io/xyz/xyz@sha256:20110cf126b35773ef3a8656512c023b1e8fe5c81dd88f19a64c5bfbde89f07e
Port: <none>
Host Port: <none>
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Wed, 16 Oct 2019 07:21:40 +1030
Finished: Wed, 16 Oct 2019 07:21:46 +1030
Ready: False
Restart Count: 139
Environment:
xyz: <set to the key 'xyz' in secret 'xyz-secrets'> Optional: false
Mounts:
/data/xyz from xyz (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-m72kz (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
xyz:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: xyz-pv-claim
ReadOnly: false
default-token-m72kz:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-m72kz
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning BackOff 2m22s (x3143 over 11h) kubelet, ip-192-168-87-230.us-east-2.compute.internal Back-off restarting failed container
Erroring pod 2 description:
Name: xyz-85f7968f7-k9xv2
Namespace: default
Priority: 0
Node: ip-192-168-87-230.us-east-2.compute.internal/192.168.87.230
Start Time: Mon, 14 Oct 2019 21:19:42 +1030
Labels: app=pod2
pod-template-hash=85f7968f7
Annotations: kubernetes.io/psp: eks.privileged
Status: Running
IP: 192.168.84.69
Controlled By: ReplicaSet/pod2-85f7968f7
Containers:
pod2:
Container ID: docker://f7c7379f92f57ea7d381ae189b964527e02218dc64337177d6d7cd6b70990143
Image: us.gcr.io/xyz-217300/xyz:latest
Image ID: docker-pullable://us.gcr.io/xyz-217300/xyz@sha256:b9cecdbc90c5c5f7ff6170ee1eccac83163ac670d9df5febd573c2d84a4d628d
Port: <none>
Host Port: <none>
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Wed, 16 Oct 2019 07:23:35 +1030
Finished: Wed, 16 Oct 2019 07:23:41 +1030
Ready: False
Restart Count: 398
Environment:
xyz: <set to the key 'xyz' in secret 'xyz-secrets'> Optional: false
Mounts:
/data/xyz from xyz (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-m72kz (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
xyz:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: xyz-pv-claim
ReadOnly: false
default-token-m72kz:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-m72kz
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning BackOff 3m28s (x9208 over 34h) kubelet, ip-192-168-87-230.us-east-2.compute.internal Back-off restarting failed container
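The underlying script error is also visible in the logs of the last crashed container, e.g. for the failing pod1 from the listing above:

kubectl logs pod1-85f7968f7-k9xv2 --previous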
Answer 0 (score: 1):
At the suggestion of a member of the k8s community, I made the following change to my coredns config to bring it more in line with best practice:
The line: proxy . 8.8.8.8
was changed to: forward . /etc/resolv.conf 8.8.8.8
I then deleted the pods, and after k8s recreated them the problem did not reappear at first.
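For clarity, the relevant block of the Corefile after that change (identical to the one in the question apart from the forward line):

.:53 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
      pods insecure
      upstream
      fallthrough in-addr.arpa ip6.arpa
    }
    prometheus :9153
    forward . /etc/resolv.conf 8.8.8.8
    cache 30
    loop
    reload
    loadbalance
}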
EDIT:
It turned out this wasn't the problem at all, as the issue came back and persisted. In the end it was this: https://github.com/aws/amazon-vpc-cni-k8s/issues/641 Rolling back to 1.5.3 as recommended by Amazon and restarting the cluster resolved the problem.
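For anyone hitting the same thing: assuming the default EKS setup where the VPC CNI runs as the aws-node daemonset in kube-system, the installed CNI version can be checked from the daemonset image before deciding whether to roll back:

kubectl -n kube-system describe daemonset aws-node | grep Image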