我有一个安装程序,它在CI流中旋转了两个Pod,我们称它们为web和activemq。 Web Pod启动时,它将尝试使用分配给k8s的amq-deployment-0.activemq Pod名称与activemq pod进行通信。
随机地,当尝试访问amq-deployment1.activemq时,网络将获得未知的主机异常。如果在这种情况下重新启动Web Pod,则Web Pod与activemq Pod通信将没有问题。
发生这种情况时,我已经登录到Web窗格,并且/etc/resolv.conf和/ etc / hosts文件看起来不错。主机/etc/resolve.conf和/ etc / hosts稀疏,没有什么可疑的。
信息: 只有一个工作节点。
kubectl-版本 Kubernetes v1.8.3 + icp + ee
有关如何调试此问题的任何想法。我想不出它随机发生的充分理由,也无法在Pod重新启动时自行解决。
如果需要其他有用的信息,我可以得到。预先感谢
对于activeMQ,我们确实有此服务文件
apiVersion: v1 kind: Service
metadata:
name: activemq
labels:
app: myapp
env: dev
spec:
ports:
- port: 8161
protocol: TCP
targetPort: 8161
name: http
- port: 61616
protocol: TCP
targetPort: 61616
name: amq
selector:
component: analytics-amq
app: myapp
environment: dev
type: fa-core
clusterIP: None
这个ActiveMQ有状态集(这是模板)
kind: StatefulSet
apiVersion: apps/v1beta1
metadata:
name: pa-amq-deployment
spec:
replicas: {{ activemqs }}
updateStrategy:
type: RollingUpdate
serviceName: "activemq"
template:
metadata:
labels:
component: analytics-amq
app: myapp
environment: dev
type: fa-core
spec:
containers:
- name: pa-amq
image: default/myco/activemq:latest
imagePullPolicy: Always
resources:
limits:
cpu: 150m
memory: 1Gi
livenessProbe:
exec:
command:
- /etc/init.d/activemq
- status
initialDelaySeconds: 10
periodSeconds: 15
failureThreshold: 16
ports:
- containerPort: 8161
protocol: TCP
name: http
- containerPort: 61616
protocol: TCP
name: amq
envFrom:
- configMapRef:
name: pa-activemq-conf-all
- secretRef:
name: pa-activemq-secret
volumeMounts:
- name: timezone
mountPath: /etc/localtime
volumes:
- name: timezone
hostPath:
path: /usr/share/zoneinfo/UTC
网络状态集:
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
name: pa-web-deployment
spec:
replicas: 1
updateStrategy:
type: RollingUpdate
serviceName: "pa-web"
template:
metadata:
labels:
component: analytics-web
app: myapp
environment: dev
type: fa-core
spec:
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: component
operator: In
values:
- analytics-web
topologyKey: kubernetes.io/hostname
containers:
- name: pa-web
image: default/myco/web:latest
imagePullPolicy: Always
resources:
limits:
cpu: 1
memory: 2Gi
readinessProbe:
httpGet:
path: /versions
port: 8080
initialDelaySeconds: 30
periodSeconds: 15
failureThreshold: 76
livenessProbe:
httpGet:
path: /versions
port: 8080
initialDelaySeconds: 30
periodSeconds: 15
failureThreshold: 80
securityContext:
privileged: true
ports:
- containerPort: 8080
name: http
protocol: TCP
envFrom:
- configMapRef:
name: pa-web-conf-all
- secretRef:
name: pa-web-secret
volumeMounts:
- name: shared-volume
mountPath: /MySharedPath
- name: timezone
mountPath: /etc/localtime
volumes:
- nfs:
server: 10.100.10.23
path: /MySharedPath
name: shared-volume
- name: timezone
hostPath:
path: /usr/share/zoneinfo/UTC
此Web窗格在查找我们配置的外部数据库时也存在类似的“未知主机”问题。通过重新启动Pod,可以类似地解决该问题。这是该外部服务的配置。也许从这个角度解决问题更容易? ActiveMQ可以使用数据库服务名称来查找数据库并启动数据库。
apiVersion: v1
kind: Service
metadata:
name: dbhost
labels:
app: myapp
env: dev
spec:
type: ExternalName
externalName: mydb.host.com
答案 0 :(得分:1)
是否有可能首先启动哪个容器和容器中的应用,然后是第二个启动该问题?
在任何情况下,建议使用Service而不是Pod名称进行连接,因为Kubernetes分配的Pod名称在Pod重新启动之间会改变。
测试连通性的一种方法是使用telnet
(或curl
用于其支持的协议)(如果在映像中找到):
telnet <host/pod/Service> <port>
答案 1 :(得分:0)
找不到解决方案,我创建了一个解决方法。我在映像中设置了entrypoint.sh来查找我需要访问并写入日志的域,错误退出:
#!/bin/bash
#disable echo and exit on error
set +ex
#####################################
# verfiy that the db service can be found or exit container
#####################################
# we do not want to install nslookup to determine if the db_host_name is valid name
# we have ping available though
# 0-success, 1-error pinging but lookup worked (services can not be pinged), 2-unreachable host
ping -W 2 -c 1 ${db_host_name} &> /dev/null
if [ $? -le 1 ]
then
echo "service ${db_host_name} is known"
else
echo "${db_host_name} service is NOT recognized. Exiting container..."
exit 1
fi
下一步,因为只有重新启动Pod才能解决此问题。在我的ansible部署中,我进行了首次发布检查,查询日志以查看是否需要执行Pod重新启动。例如:
rollout-check.yml
- name: "Rollout status for {{rollout_item.statefulset}}"
shell: timeout 4m kubectl rollout status -n {{fa_namespace}} -f {{ rollout_item.statefulset }}
ignore_errors: yes
# assuming that the first pod will be the one that would have an issue
- name: "Get {{rollout_item.pod_name}} log to check for issue with dns lookup"
shell: kubectl logs {{rollout_item.pod_name}} --tail=1 -n {{fa_namespace}}
register: log_line
# the entrypoint will write dbhost service is NOT recognized. Exiting container... to the log
# if there is a problem getting to the dbhost
- name: "Try removing {{rollout_item.component}} pod if unable to deploy"
shell: kubectl delete pods -l component={{rollout_item.component}} --force --grace-period=0 --ignore-not-found=true -n {{fa_namespace}}
when: log_line.stdout.find('service is NOT recognized') > 0
我有时会重复进行6次此推出检查,因为有时即使在重新启动Pod之后也找不到该服务。吊舱成功启动后,立即进行附加检查。
- name: "Web rollout"
include_tasks: rollout-check.yml
loop:
- { c: 1, statefulset: "{{ dest_deploy }}/web.statefulset.yml", pod_name: "pa-web-deployment-0", component: "analytics-web" }
- { c: 2, statefulset: "{{ dest_deploy }}/web.statefulset.yml", pod_name: "pa-web-deployment-0", component: "analytics-web" }
- { c: 3, statefulset: "{{ dest_deploy }}/web.statefulset.yml", pod_name: "pa-web-deployment-0", component: "analytics-web" }
- { c: 4, statefulset: "{{ dest_deploy }}/web.statefulset.yml", pod_name: "pa-web-deployment-0", component: "analytics-web" }
- { c: 5, statefulset: "{{ dest_deploy }}/web.statefulset.yml", pod_name: "pa-web-deployment-0", component: "analytics-web" }
- { c: 6, statefulset: "{{ dest_deploy }}/web.statefulset.yml", pod_name: "pa-web-deployment-0", component: "analytics-web" }
loop_control:
loop_var: rollout_item