Ansible AWX RabbitMQ container in Kubernetes fails to get nodes from k8s with nxdomain

Date: 2018-07-09 14:58:15

Tags: kubernetes rabbitmq ansible-tower ansible-awx

I am trying to deploy Ansible AWX on my Kubernetes cluster, but the RabbitMQ container throws a "Failed to get nodes from k8s" error.

Here are the versions of the platforms I am using:

[node1 ~]# kubectl version
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.5", 
GitCommit:"f01a2bf98249a4db383560443a59bed0c13575df", GitTreeState:"clean", 
BuildDate:"2018-03-19T15:50:45Z", GoVersion:"go1.9.3", Compiler:"gc", 
Platform:"linux/amd64"}

Kubernetes was deployed with the kubespray playbook v2.5.0, and all services and pods are up and running (CoreDNS, Weave, IPtables).

I am deploying AWX using the awx_web and awx_task 1.0.6 images.

I am using an external PostgreSQL database on v10.4, and I have verified that the tables were created by awx in the database.

Troubleshooting steps I have tried:

  • I tried deploying AWX 1.0.5 with an etcd pod into the same cluster, and it worked as expected
  • I deployed a standalone RabbitMQ cluster into the same k8s cluster, mimicking the AWX rabbit deployment as closely as possible, and it works with the rabbit_peer_discovery_k8s backend.
  • I tried patching the rabbitmq.conf shipped with AWX 1.0.6, with no luck; it just kept failing with the same error.
  • I verified the /etc/resolv.conf file has a kubernetes.default.svc.cluster.local entry
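For context, a minimal rabbitmq.conf using the k8s peer discovery backend typically contains the following keys. This is an illustrative sketch of the standard plugin settings, not a copy of the AWX image's actual config:

```ini
# Standard rabbit_peer_discovery_k8s settings (RabbitMQ 3.7.x)
cluster_formation.peer_discovery_backend = rabbit_peer_discovery_k8s
# Hostname of the Kubernetes API server, resolved from inside the pod
cluster_formation.k8s.host = kubernetes.default.svc.cluster.local
cluster_formation.k8s.port = 443
# Discover peers by pod IP rather than hostname
cluster_formation.k8s.address_type = ip
```

The nxdomain error in the log below is the broker failing to resolve the `cluster_formation.k8s.host` value.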

Cluster information

[node1 ~]# kubectl get all -n awx
NAME         DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
deploy/awx   1         1         1            0           38m

NAME                DESIRED   CURRENT   READY     AGE
rs/awx-654f7fc84c   1         1         0         38m

NAME                      READY     STATUS             RESTARTS   AGE
po/awx-654f7fc84c-9ppqb   3/4       CrashLoopBackOff   11         38m

NAME               TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                          AGE
svc/awx-rmq-mgmt   ClusterIP   10.233.10.146   <none>        15672/TCP                        1d
svc/awx-web-svc    NodePort    10.233.3.75     <none>        80:31700/TCP                     1d
svc/rabbitmq       NodePort    10.233.37.33    <none>        15672:30434/TCP,5672:31962/TCP   1d

AWX RabbitMQ error log

[node1 ~]# kubectl logs -n awx awx-654f7fc84c-9ppqb awx-rabbit
2018-07-09 14:47:37.464 [info] <0.33.0> Application lager started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.767 [info] <0.33.0> Application os_mon started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.767 [info] <0.33.0> Application crypto started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.768 [info] <0.33.0> Application cowlib started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.768 [info] <0.33.0> Application xmerl started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.851 [info] <0.33.0> Application mnesia started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.851 [info] <0.33.0> Application recon started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.852 [info] <0.33.0> Application jsx started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.852 [info] <0.33.0> Application asn1 started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.852 [info] <0.33.0> Application public_key started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.897 [info] <0.33.0> Application ssl started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.901 [info] <0.33.0> Application ranch started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.901 [info] <0.33.0> Application ranch_proxy_protocol started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.901 [info] <0.33.0> Application rabbit_common started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.907 [info] <0.33.0> Application amqp_client started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.909 [info] <0.33.0> Application cowboy started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.957 [info] <0.33.0> Application inets started on node 'rabbit@10.233.120.5'
2018-07-09 14:47:37.964 [info] <0.193.0>
 Starting RabbitMQ 3.7.4 on Erlang 20.1.7
 Copyright (C) 2007-2018 Pivotal Software, Inc.
 Licensed under the MPL.  See http://www.rabbitmq.com/

  ##  ##
  ##  ##      RabbitMQ 3.7.4. Copyright (C) 2007-2018 Pivotal Software, Inc.
  ##########  Licensed under the MPL.  See http://www.rabbitmq.com/
  ######  ##
  ##########  Logs: <stdout>

              Starting broker...
2018-07-09 14:47:37.982 [info] <0.193.0>
 node           : rabbit@10.233.120.5
 home dir       : /var/lib/rabbitmq
 config file(s) : /etc/rabbitmq/rabbitmq.conf
 cookie hash    : at619UOZzsenF44tSK3ulA==
 log(s)         : <stdout>
 database dir   : /var/lib/rabbitmq/mnesia/rabbit@10.233.120.5
2018-07-09 14:47:39.649 [info] <0.201.0> Memory high watermark set to 11998 MiB (12581714329 bytes) of 29997 MiB (31454285824 bytes) total
2018-07-09 14:47:39.652 [info] <0.203.0> Enabling free disk space monitoring
2018-07-09 14:47:39.653 [info] <0.203.0> Disk free limit set to 50MB
2018-07-09 14:47:39.658 [info] <0.205.0> Limiting to approx 1048476 file handles (943626 sockets)
2018-07-09 14:47:39.658 [info] <0.206.0> FHC read buffering:  OFF
2018-07-09 14:47:39.658 [info] <0.206.0> FHC write buffering: ON
2018-07-09 14:47:39.660 [info] <0.193.0> Node database directory at /var/lib/rabbitmq/mnesia/rabbit@10.233.120.5 is empty. Assuming we need to join an existing cluster or initialise from scratch...
2018-07-09 14:47:39.660 [info] <0.193.0> Configured peer discovery backend: rabbit_peer_discovery_k8s
2018-07-09 14:47:39.660 [info] <0.193.0> Will try to lock with peer discovery backend rabbit_peer_discovery_k8s
2018-07-09 14:47:39.660 [info] <0.193.0> Peer discovery backend does not support locking, falling back to randomized delay
2018-07-09 14:47:39.660 [info] <0.193.0> Peer discovery backend rabbit_peer_discovery_k8s does not support registration, skipping randomized startup delay.
2018-07-09 14:47:39.665 [info] <0.193.0> Failed to get nodes from k8s - {failed_connect,[{to_address,{"kubernetes.default.svc.cluster.local",443}},
                 {inet,[inet],nxdomain}]}
2018-07-09 14:47:39.665 [error] <0.192.0> CRASH REPORT Process <0.192.0> with 0 neighbours exited with reason: no case clause matching {error,"{failed_connect,[{to_address,{\"kubernetes.default.svc.cluster.local\",443}},\n                 {inet,[inet],nxdomain}]}"} in rabbit_mnesia:init_from_config/0 line 164 in application_master:init/4 line 134
2018-07-09 14:47:39.666 [info] <0.33.0> Application rabbit exited with reason: no case clause matching {error,"{failed_connect,[{to_address,{\"kubernetes.default.svc.cluster.local\",443}},\n                 {inet,[inet],nxdomain}]}"} in rabbit_mnesia:init_from_config/0 line 164
{"Kernel pid terminated",application_controller,"{application_start_failure,rabbit,{bad_return,{{rabbit,start,[normal,[]]},{'EXIT',{{case_clause,{error,\"{failed_connect,[{to_address,{\\"kubernetes.default.svc.cluster.local\\",443}},\n                 {inet,[inet],nxdomain}]}\"}},[{rabbit_mnesia,init_from_config,0,[{file,\"src/rabbit_mnesia.erl\"},{line,164}]},{rabbit_mnesia,init_with_lock,3,[{file,\"src/rabbit_mnesia.erl\"},{line,144}]},{rabbit_mnesia,init,0,[{file,\"src/rabbit_mnesia.erl\"},{line,111}]},{rabbit_boot_steps,'-run_step/2-lc$^1/1-1-',1,[{file,\"src/rabbit_boot_steps.erl\"},{line,49}]},{rabbit_boot_steps,run_step,2,[{file,\"src/rabbit_boot_steps.erl\"},{line,49}]},{rabbit_boot_steps,'-run_boot_steps/1-lc$^0/1-0-',1,[{file,\"src/rabbit_boot_steps.erl\"},{line,26}]},{rabbit_boot_steps,run_boot_steps,1,[{file,\"src/rabbit_boot_steps.erl\"},{line,26}]},{rabbit,start,2,[{file,\"src/rabbit.erl\"},{line,793}]}]}}}}}"}
Kernel pid terminated (application_controller) ({application_start_failure,rabbit,{bad_return,{{rabbit,start,[normal,[]]},{'EXIT',{{case_clause,{error,"{failed_connect,[{to_address,{\"kubernetes.defau

Crash dump is being written to: /var/log/rabbitmq/erl_crash.dump...done

Kubernetes API service

[node1 ~]# kubectl describe service kubernetes
Name:              kubernetes
Namespace:         default
Labels:            component=apiserver
                provider=kubernetes
Annotations:       <none>
Selector:          <none>
Type:              ClusterIP
IP:                10.233.0.1
Port:              https  443/TCP
TargetPort:        6443/TCP
Endpoints:         10.237.34.19:6443,10.237.34.21:6443
Session Affinity:  ClientIP
Events:            <none>

nslookup from a busybox pod in the same Kubernetes cluster

[node2 ~]# kubectl exec -it busybox -- sh
/ # nslookup  kubernetes.default.svc.cluster.local
Server:    10.233.0.3
Address 1: 10.233.0.3 coredns.kube-system.svc.cluster.local

Name:      kubernetes.default.svc.cluster.local
Address 1: 10.233.0.1 kubernetes.default.svc.cluster.local

Please let me know if I am missing any information that would help with troubleshooting.

1 Answer:

Answer 0: (score: 0)

I believe the solution is to omit the explicit kubernetes host. I can't think of any good reason one would need to specify the kubernetes api host from inside the cluster.

If, for some terrible reason, the RMQ plugin really does need it, try swapping in the Service IP instead (assuming your master's SSL certificate has the Service IP in its SANs list).
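A sketch of that substitution in rabbitmq.conf, using the Service ClusterIP from the `kubectl describe service kubernetes` output above. The key names are the standard rabbit_peer_discovery_k8s settings and have not been verified against the AWX image's config:

```ini
cluster_formation.peer_discovery_backend = rabbit_peer_discovery_k8s
# Use the kubernetes Service ClusterIP instead of the DNS name,
# bypassing in-cluster DNS resolution entirely
cluster_formation.k8s.host = 10.233.0.1
cluster_formation.k8s.port = 443
```

This only sidesteps the nxdomain failure; it does not explain why DNS resolution fails for the RMQ container while it succeeds from busybox.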


The only good reason I can think of for why it would do something that silly is if the RMQ PodSpec has somehow ended up with a dnsPolicy other than ClusterFirst. If you really want to troubleshoot the RMQ pod, you can provide an explicit command: that runs some debugging bash commands first, to interrogate the state of the container at startup, and then exec /launch.sh to resume booting RMQ (as they do
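A minimal sketch of that command: override in the RMQ container spec. The container name matches the `kubectl logs` invocation above, but the image tag and the /launch.sh entrypoint path are assumptions based on the answer's description and may differ in your deployment:

```yaml
containers:
  - name: awx-rabbit
    image: ansible/awx_rabbitmq:3.7.4   # assumed image; use your deployment's value
    command: ["/bin/bash", "-c"]
    args:
      - |
        # Inspect DNS state inside the container before RMQ boots
        cat /etc/resolv.conf
        getent hosts kubernetes.default.svc.cluster.local || true
        # Hand off to the normal entrypoint so RMQ starts as usual
        exec /launch.sh
```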