Unknown host when looking up a pod by name, resolved by a pod restart

Date: 2018-09-07 18:55:03

Tags: kubernetes unknown-host

I have an installer that spins up two pods in a CI flow; call them web and activemq. When the web pod starts up, it tries to communicate with the activemq pod using the Kubernetes-assigned pod name amq-deployment-0.activemq.

Randomly, the web pod gets an unknown host exception when trying to access amq-deployment1.activemq. If the web pod is restarted when this happens, it has no problem communicating with the activemq pod.

When this happens, I have logged into the web pod, and /etc/resolv.conf and /etc/hosts look fine. The host's /etc/resolv.conf and /etc/hosts are sparse, with nothing suspicious in them.
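
A sketch of the kind of DNS checks involved, run from outside the pod (the pod name pa-web-deployment-0 comes from the web StatefulSet below; nslookup is assumed to be present in the image):

# Compare the pod's resolver config with what cluster DNS actually returns
kubectl exec pa-web-deployment-0 -- cat /etc/resolv.conf
kubectl exec pa-web-deployment-0 -- nslookup amq-deployment-0.activemq

# Check whether the kube-dns pods themselves are healthy
kubectl get pods -n kube-system -l k8s-app=kube-dns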

Info: there is only one worker node.

kubectl version: Kubernetes v1.8.3+icp+ee

Any ideas on how to debug this? I can't think of a good reason why it happens randomly, nor why a pod restart resolves it.

I can get other information if it would be useful. Thanks in advance.

For ActiveMQ, we do have this service file:

apiVersion: v1
kind: Service
metadata:
    name: activemq
    labels:
            app: myapp
            env: dev
spec:
    ports:
        - port: 8161
          protocol: TCP
          targetPort: 8161
          name: http
        - port: 61616
          protocol: TCP
          targetPort: 61616
          name: amq
    selector:
        component: analytics-amq
        app: myapp
        environment: dev
        type: fa-core
    clusterIP: None
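
Since clusterIP: None makes this a headless Service, each StatefulSet pod behind it gets a stable DNS record of the form <pod-name>.<service-name>.<namespace>.svc.cluster.local. A quick way to sanity-check those records from inside another pod (a sketch, assuming the default namespace and that nslookup is available):

# Short form works from pods in the same namespace
nslookup pa-amq-deployment-0.activemq

# Fully qualified form
nslookup pa-amq-deployment-0.activemq.default.svc.cluster.local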

And this ActiveMQ StatefulSet (this is a template):

kind: StatefulSet
apiVersion: apps/v1beta1
metadata:
  name: pa-amq-deployment
spec:
  replicas: {{ activemqs }}
  updateStrategy:
    type: RollingUpdate
  serviceName: "activemq"
  template:
      metadata:
          labels:
              component: analytics-amq
              app: myapp
              environment: dev
              type: fa-core
      spec:
          containers:
              - name: pa-amq
                image: default/myco/activemq:latest
                imagePullPolicy: Always
                resources:
                      limits:
                          cpu: 150m
                          memory: 1Gi
                livenessProbe:
                    exec:
                        command:
                        - /etc/init.d/activemq
                        - status
                    initialDelaySeconds: 10
                    periodSeconds: 15
                    failureThreshold: 16
                ports:
                    - containerPort: 8161
                      protocol: TCP
                      name: http
                    - containerPort: 61616
                      protocol: TCP
                      name: amq
                envFrom:
                    - configMapRef:
                        name: pa-activemq-conf-all
                    - secretRef:
                        name: pa-activemq-secret
                volumeMounts:
                    - name: timezone
                      mountPath: /etc/localtime
          volumes:
              - name: timezone
                hostPath:
                  path: /usr/share/zoneinfo/UTC
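
Note that with metadata.name: pa-amq-deployment and serviceName: "activemq", the pods this StatefulSet creates are named pa-amq-deployment-0, pa-amq-deployment-1, and so on. A quick way to confirm what was actually created (sketch):

# List the pods the StatefulSet manages and the headless Service they register under
kubectl get pods -l component=analytics-amq -o wide
kubectl get statefulset pa-amq-deployment
kubectl get svc activemq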

The web StatefulSet:

apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
    name: pa-web-deployment
spec:
    replicas: 1
    updateStrategy:
        type: RollingUpdate
    serviceName: "pa-web"
    template:
        metadata:
            labels:
                component: analytics-web
                app: myapp
                environment: dev
                type: fa-core
        spec:
            affinity:
              podAntiAffinity:
                preferredDuringSchedulingIgnoredDuringExecution:
                - weight: 100
                  podAffinityTerm:
                    labelSelector:
                      matchExpressions:
                      - key: component
                        operator: In
                        values:
                        - analytics-web
                    topologyKey: kubernetes.io/hostname
            containers:
                - name: pa-web
                  image: default/myco/web:latest
                  imagePullPolicy: Always
                  resources:
                        limits:
                            cpu: 1
                            memory: 2Gi
                  readinessProbe:
                      httpGet:
                          path: /versions
                          port: 8080
                      initialDelaySeconds: 30
                      periodSeconds: 15
                      failureThreshold: 76
                  livenessProbe:
                      httpGet:
                          path: /versions
                          port: 8080
                      initialDelaySeconds: 30
                      periodSeconds: 15
                      failureThreshold: 80
                  securityContext:
                      privileged: true
                  ports:
                      - containerPort: 8080
                        name: http
                        protocol: TCP
                  envFrom:
                      - configMapRef:
                         name: pa-web-conf-all
                      - secretRef:
                         name: pa-web-secret
                  volumeMounts:
                      - name: shared-volume
                        mountPath: /MySharedPath
                      - name: timezone
                        mountPath: /etc/localtime
            volumes:
                - nfs:
                    server: 10.100.10.23
                    path: /MySharedPath
                  name: shared-volume
                - name: timezone
                  hostPath:
                    path: /usr/share/zoneinfo/UTC

This web pod has a similar "unknown host" problem when looking up an external database we have configured, and it is likewise fixed by restarting the pod. Here is the configuration for that external service. Maybe it would be easier to attack the problem from this angle? ActiveMQ is able to use the database service name to find the database and start up.

apiVersion: v1
kind: Service
metadata:
  name: dbhost
  labels:
    app: myapp
    env: dev
spec:
  type: ExternalName
  externalName: mydb.host.com
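
An ExternalName Service is just a CNAME record in cluster DNS, so a lookup of dbhost should return mydb.host.com, and the resolver then has to resolve that external name as well, which gives it two places to fail. A check from inside a pod (sketch, assuming nslookup is available):

# Should show a CNAME pointing at mydb.host.com
nslookup dbhost

# Verify the external name itself also resolves from inside the cluster
nslookup mydb.host.com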

2 Answers:

Answer 0 (score: 1)

Could this be a question of which pod, and the application inside it, starts up first, and which starts second?

In any case, connecting using a Service rather than a pod name is recommended, since pod names assigned by Kubernetes change between pod restarts.

One way to test connectivity is to use telnet (or curl for the protocols it supports), if it can be found in the image:

telnet <host/pod/Service> <port>
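
If the images are too minimal for either tool, a throwaway pod can serve the same purpose (a sketch; it assumes a busybox image is pullable, and it uses the Service name rather than a pod name, as recommended above):

# One-off pod that resolves the headless Service, then deletes itself
kubectl run dns-test --rm -it --restart=Never --image=busybox -- nslookup activemq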

Answer 1 (score: 0)

Unable to find a solution, I created a workaround. In the image I set up entrypoint.sh to look up the domain I need to reach and, if it cannot be resolved, write to the log and exit with an error:

#!/bin/bash

#disable echo and exit on error
set +ex

#####################################
# verify that the db service can be found, or exit the container
#####################################
# we do not want to install nslookup just to determine whether db_host_name is a valid name;
# we do have ping available, though
# 0-success, 1-error pinging but lookup worked (services can not be pinged), 2-unreachable host
ping -W 2 -c 1 ${db_host_name} &> /dev/null
if [ $? -le 1 ]
then
  echo "service ${db_host_name} is known"
else
  echo "${db_host_name} service is NOT recognized. Exiting container..."
  exit 1
fi
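
The exit-code logic can be exercised locally before baking the script into the image (hypothetical hostnames; db_host_name is the variable the script reads):

# A resolvable name: even if ping itself fails, the lookup worked, so the script continues
db_host_name=localhost ./entrypoint.sh

# An unresolvable name: the "NOT recognized" line is logged and the script exits 1
db_host_name=no-such-host.invalid ./entrypoint.sh; echo "exit=$?"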

Next, since only a pod restart fixes the issue, my Ansible deployment does a first-rollout check that queries the log to see whether a pod restart is needed. For example:

rollout-check.yml

- name: "Rollout status for {{rollout_item.statefulset}}"
  shell: timeout 4m kubectl rollout status -n {{fa_namespace}} -f {{ rollout_item.statefulset }}
  ignore_errors: yes

# assuming that the first pod will be the one that would have an issue
- name: "Get {{rollout_item.pod_name}} log to check for issue with dns lookup"
  shell: kubectl logs {{rollout_item.pod_name}} --tail=1 -n {{fa_namespace}}
  register: log_line

# the entrypoint will write dbhost service is NOT recognized. Exiting container... to the log
# if there is a problem getting to the dbhost
- name: "Try removing {{rollout_item.component}} pod if unable to deploy"
  shell: kubectl delete pods -l component={{rollout_item.component}} --force --grace-period=0 --ignore-not-found=true -n {{fa_namespace}}
  when: log_line.stdout.find('service is NOT recognized') > 0

I sometimes repeat this rollout check up to six times, because occasionally the service still cannot be found even after a pod restart. The additional checks return immediately once the pod has started up successfully.

- name: "Web rollout"
  include_tasks: rollout-check.yml
  loop:
  - { c: 1, statefulset: "{{ dest_deploy }}/web.statefulset.yml", pod_name: "pa-web-deployment-0", component: "analytics-web" }
  - { c: 2, statefulset: "{{ dest_deploy }}/web.statefulset.yml", pod_name: "pa-web-deployment-0", component: "analytics-web" }
  - { c: 3, statefulset: "{{ dest_deploy }}/web.statefulset.yml", pod_name: "pa-web-deployment-0", component: "analytics-web" }
  - { c: 4, statefulset: "{{ dest_deploy }}/web.statefulset.yml", pod_name: "pa-web-deployment-0", component: "analytics-web" }
  - { c: 5, statefulset: "{{ dest_deploy }}/web.statefulset.yml", pod_name: "pa-web-deployment-0", component: "analytics-web" }
  - { c: 6, statefulset: "{{ dest_deploy }}/web.statefulset.yml", pod_name: "pa-web-deployment-0", component: "analytics-web" }
  loop_control:
    loop_var: rollout_item